lis508 lecture 1: bits, bytes and characters Thomas Krichel 2003-09-30

lis508 lecture 1: bits, bytes and characters

Thomas Krichel

2003-09-30

Structure

• Numbers
  – Bits
  – Bytes

• Character sets
  – Coded character set
  – Character encoding

Literature, no need to read…

• Norton “new inside the PC” chapter 4

• http://www.danbbs.dk/~erikoest/bb_terms.htm

• http://wwwinfo.cern.ch/asdoc/WWW/publications/ictp99/ictp99N2705.html

• http://www.cl.cam.ac.uk/~mgk25/unicode.html

Information

• Information is best understood as “what it takes to answer a question”.

• The simplest question has a “yes” or “no” answer. Therefore a bit is the natural measure of information.

• Term attributed to John Tukey, around 1946-47.

• Contraction of “binary digit”.

Usage of bits

• Computers are sometimes classified by the number of bits they can process at one time. "32 bit processor"

• Graphics are also often described by the number of bits used to represent each dot.

bits and bytes

• a bit can take the values 0 or 1, thus it can describe 2 possibilities

• two bits can take the values 00, 01, 10, 11, thus they can describe 2×2 = 4 possibilities

• n bits can encode 2 power n possibilities.

• The first chips used to process 8 bits at a time. It became customary to refer to 8 bits as a byte. A byte can encode 2 power 8 = 256 possibilities.

• We can use binary numbers just as decimal numbers.
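As a small sketch of the counting rule above (the function names are illustrative, not from the lecture), the following lists every pattern that a given number of bits can take and counts them:

```python
def possibilities(n_bits):
    """Number of distinct values that n_bits bits can encode: 2 power n."""
    return 2 ** n_bits


def bit_patterns(n_bits):
    """All bit patterns of the given width, in counting order."""
    return [format(value, "0{}b".format(n_bits))
            for value in range(possibilities(n_bits))]


print(bit_patterns(1))    # ['0', '1']
print(bit_patterns(2))    # ['00', '01', '10', '11']
print(possibilities(8))   # 256, the number of values one byte can take
```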

application of bytes

• IP (Internet Protocol) numbers are used as the addresses of computers on the Internet.

• In IP version 4 (the one that is most commonly used), each IP number has 4 bytes.

• It is represented as x.x.x.x where x is a number between 0 and 255 (why?)

• how many computers can there be on the Internet at any one time?
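One way to explore the two questions above is with Python's standard `ipaddress` module. This is only a sketch: the address 192.0.2.1 is an arbitrary example chosen here, and the count is an upper bound that ignores reserved address ranges.

```python
import ipaddress

# An arbitrary example address (assumption: any IPv4 address would do).
address = ipaddress.IPv4Address("192.0.2.1")

# The address is stored as 4 bytes; each byte is one of the x in x.x.x.x.
parts = list(address.packed)
print(parts)                # [192, 0, 2, 1]

# One byte has 8 bits, so each x ranges over 2 power 8 values: 0 to 255.
largest_part = 2 ** 8 - 1
print(largest_part)         # 255

# Four bytes are 32 bits, so there are at most 2 power 32 distinct addresses.
max_addresses = 2 ** (4 * 8)
print(max_addresses)        # 4294967296
```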

decimal/binary numbers

decimal  binary
0        0
1        1
2        10
3        11
4        100
5        101
6        110
7        111
8        1000
9        1001
10       1010
11       1011
12       1100
13       1101
14       1110
15       1111
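The table can be reproduced with a short sketch. The helper `to_binary` is an illustrative name; it uses repeated division by 2, which is one common way to do the conversion by hand.

```python
def to_binary(n):
    """Binary digits of a non-negative integer, by repeated division by 2."""
    if n == 0:
        return "0"
    digits = ""
    while n > 0:
        digits = str(n % 2) + digits   # the remainder is the next digit from the right
        n //= 2
    return digits


for number in range(16):
    print(number, to_binary(number))

# Going the other way: read a string of binary digits as a number.
print(int("1101", 2))   # 13
```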

Many bytes

• Larger units are
  – Kilobyte is 2 power 10 bytes (= 1024 bytes)
  – Megabyte is 2 power 20 bytes
  – Gigabyte is 2 power 30 bytes
  – Terabyte is 2 power 40 bytes

• From ancient Greek words for "thousand", "large", "giant", and "monster", respectively. "Kilo" dates back to the metric system of the French revolution; the other prefixes were adopted later.
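A minimal sketch of these units as powers of two. The dictionary `UNITS` and the function `in_bytes` are illustrative names, and the definitions follow the binary convention used on this slide.

```python
# Size of each unit in bytes, using the powers of two given on the slide.
UNITS = {
    "kilobyte": 2 ** 10,
    "megabyte": 2 ** 20,
    "gigabyte": 2 ** 30,
    "terabyte": 2 ** 40,
}


def in_bytes(amount, unit):
    """Convert an amount of the named unit into bytes."""
    return amount * UNITS[unit]


print(in_bytes(1, "kilobyte"))   # 1024
print(in_bytes(1, "megabyte"))   # 1048576
print(in_bytes(1, "gigabyte"))   # 1073741824
print(in_bytes(1, "terabyte"))   # 1099511627776
```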

Hex numbers

• A byte is often represented by two hex digits.

• Each hex digit can encode 16 values.

• Written 0 to 9, then A B C D E F. A is 10, F is 15.

• Conventionally prefixed with 0x

• Use the Microsoft calculator in scientific mode to convert.
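As an alternative to a calculator, the conversion can be sketched in a few lines. The helper name `byte_to_hex` is illustrative; the value 0xA9 is just an example byte.

```python
def byte_to_hex(value):
    """Write a byte (0 to 255) as exactly two hex digits."""
    return format(value, "02X")


example = 0xA9                        # the 0x prefix marks a hex number
print(example)                        # 169

# A byte splits into two hex digits: the high one counts 16s, the low one counts 1s.
high, low = divmod(example, 16)
print(high, low)                      # 10 9, i.e. the digits A and 9

print(byte_to_hex(169))               # A9
print(byte_to_hex(255))               # FF
print(int("F", 16))                   # 15
```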

application of hex numbers

• Media Access Control (MAC) addresses identify the hardware that allows access to computer networks. They are 6-byte numbers, each byte written as 2 hex digits, e.g. 00:60:08:F5:20:A9

• character numbers that you see when you are inserting a special symbol in Microsoft software, e.g. PowerPoint.
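To make the MAC address example concrete, here is a sketch that splits the address from the slide into its 6 bytes. The function name `parse_mac` is illustrative.

```python
def parse_mac(text):
    """Turn a colon-separated MAC address into its 6 bytes."""
    return bytes(int(part, 16) for part in text.split(":"))


mac = parse_mac("00:60:08:F5:20:A9")
print(len(mac))     # 6
print(list(mac))    # [0, 96, 8, 245, 32, 169]
```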

Characters

• Much of the information processed by computers is in the form of characters.

• A character only makes sense to a human reader who is familiar with the writing system it belongs to.

• A character is not a glyph.
  – Example: ligatures, where several characters are drawn as one glyph.

Information in a computer file

• A file is a piece of data stored on a computer.

• Any file contains a sequence of 0s and 1s, like 1010100101010011110101010101…

• For a computer to make sense of a file, it has to know what type of file it is.

executable files

• Files that are executable are files that make the computer do something. For example, the file starts a program, say PowerPoint. An executable on one computer may not run on another.

• Non-executable files hold data that is used by an executable file. We will call them data files. Example: a PowerPoint slides file.

text files

• Many data files contain textual data.

• Textual data is a sequence of characters.

• A character is an elementary symbol that has some meaning
  – alphabet letter
  – hieroglyph

• Example: email file

• Text files can be read by many computer programs.

non-text files

• Examples of non-text files are
  – graphics files
  – movie files
  – sound files

• non-text files are not very important in library settings
  – there is no way to organize information retrieval for non-text files. They have to be retrieved using a textual surrogate.
  – traditional library materials are textual

• will talk about this later.

Representing characters

• Computers don't understand text; they only understand numbers. For computers to be able to process text, there must be a correspondence between numbers and text characters. Such a correspondence is called a character set.

• Examples of characters are
  – a
  – c
  – ë
  – €
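The correspondence between characters and numbers can be tried out directly. This sketch relies on Python's built-in `ord` and `chr`, which use Unicode numbers (an assumption about which character set to show; Unicode is introduced later in the lecture).

```python
# The four example characters from the slide; the last two are written as
# escapes so the snippet does not depend on the encoding of this file.
examples = ["a", "c", "\u00eb", "\u20ac"]   # a, c, e with diaeresis, euro sign

# ord() goes from character to number.
numbers = [ord(character) for character in examples]
print(numbers)                  # [97, 99, 235, 8364]

# chr() goes from number back to character.
round_trip = [chr(number) for number in numbers]
print(round_trip == examples)   # True
```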

Legacy character sets

• In the early days, computers were a lot less powerful than they are today.

• They could only deal with the characters that are most commonly used.

• Such sets are
  – ASCII
  – ISO-8859-1
  – cp1252

ASCII

• American Standard Code for Information Interchange

• 7-bit character set. There is no such thing as 8-bit ASCII

• 95 printable symbols

• 33 control characters (0-31, 127)

• http://www.ccmr.cornell.edu/helpful_data/ascii2.html has a list up to 127
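The counts above can be checked with a small sketch. The variable names are illustrative; the split follows the slide, with control characters at positions 0 to 31 and 127.

```python
# 7 bits give 2 power 7 = 128 positions, numbered 0 to 127.
all_codes = range(2 ** 7)

# Control characters sit at 0-31 and 127; everything else is printable.
control = [code for code in all_codes if code < 32 or code == 127]
printable = [code for code in all_codes if 32 <= code <= 126]

print(len(control))     # 33
print(len(printable))   # 95

# The printable range starts with the space and ends with the tilde.
print(repr(chr(printable[0])), repr(chr(printable[-1])))   # ' ' '~'
```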

some ASCII control characters

• CR (13, ^M) is the carriage return

• LF (10, ^J) is the linefeed

• FF (12, ^L) is the form feed (new page)

• BS (8, ^H) is the backspace

• DEL (127, ALT-127) is delete

• ESC (27, ^[) is the escape
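The ^ notation in the list above follows a simple pattern: the letter after the caret is the character whose number is 64 higher than the control character. A sketch, with `caret_notation` as an illustrative name:

```python
def caret_notation(code):
    """Caret form of an ASCII control character in the range 0 to 31."""
    return "^" + chr(code + 64)


# Control characters from the slide: name -> number.
controls = {"BS": 8, "LF": 10, "FF": 12, "CR": 13, "ESC": 27}

for name, code in controls.items():
    print(name, code, caret_notation(code))
```

For example, CR is number 13 and M is number 77, so CR is written ^M.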

ISO-8859-1

• ISO-8859-1, aka ISO-latin-1, extends ASCII with characters that are commonly used by the western European languages.

• It is the default character set of html.

• Positions 128 to 159 are not used for printable characters.

• Cp1252 fills most of these with graphic chars. It is a Microsoft character set.
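The difference between the two sets can be seen by decoding the same bytes both ways. This sketch assumes Python's built-in `latin-1` and `cp1252` codecs as stand-ins for the two character sets; the byte values are examples chosen here.

```python
# Byte 233 (0xE9) is above the ASCII range and means the same in both sets.
shared = bytes([0xE9])
print(shared.decode("latin-1") == shared.decode("cp1252"))   # True

# Byte 128 (0x80) lies in the 128-159 range.
special = bytes([0x80])

# In cp1252 it is a graphic character, the euro sign (U+20AC).
print(special.decode("cp1252") == "\u20ac")                  # True

# Python's latin-1 codec maps it to a non-printing control position instead.
print(special.decode("latin-1").isprintable())               # False
```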

This is not enough

• There are around 6800 different languages in the world.

• Some of these languages use character sets that are not finite, i.e. folks can make up new characters out of existing ones!

• Setting up a character set for all languages is almost impossible.

ISO 10646-1

• Defines the Universal Character Set (UCS)

• UCS contains the characters required to represent text in many known languages, even the likes of Oriya, Telugu, Bopomofo, Runic.

• ISO 10646 formally defines a 31-bit character set. Characters are represented as 32 bits, i.e. 4 bytes, or 8 hex digits.

• Not finished.
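The 4-byte representation can be illustrated with a sketch. It assumes Python's `utf-32-be` codec as a stand-in for the 32-bit form described above; the helper name `four_byte_hex` is illustrative.

```python
def four_byte_hex(character):
    """The 4-byte form of one character, written as 8 hex digits."""
    return character.encode("utf-32-be").hex()


print(four_byte_hex("a"))           # 00000061
print(four_byte_hex("\u20ac"))      # 000020ac (the euro sign)
print(len("a".encode("utf-32-be")))  # 4
```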


Unicode

• ISO is an international federation of national standards bodies. Slow and bureaucratic.

• Industry has come together to work on Unicode, originally designed as a 2-byte character set.

• With some minor exceptions, the Unicode characters are the same as the first 65536 characters in UCS.

• Much better documented standard.
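A rough sketch of what "2-byte" means in practice, assuming Python's `utf-16-be` codec as the 2-byte representation. The second character (U+1D11E, a musical clef) is an example chosen here of a character numbered above 65535, which later versions of Unicode assign; it does not fit in 2 bytes.

```python
def utf16_bytes(character):
    """Number of bytes the character takes in the utf-16-be codec."""
    return len(character.encode("utf-16-be"))


euro = "\u20ac"          # number 8364, within the first 65536 characters
clef = chr(0x1D11E)      # number 119070, beyond the first 65536 characters

print(utf16_bytes(euro))   # 2
print(utf16_bytes(clef))   # 4
```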

Unicode and legacy sets

• The first 128 characters are identical to those in ASCII

• The next 128 characters are identical to ISO 8859-1 (Latin-1).

• Unicode is well documented and the Unicode book can be downloaded from the Internet. A must-have for the serious digital librarian.
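The overlap with the legacy sets can be checked mechanically. This sketch assumes Python's `ascii` and `latin-1` codecs as the legacy sets and uses `chr`, which returns the Unicode character for a number.

```python
# The first 128 Unicode characters, and the first 256.
unicode_128 = "".join(chr(number) for number in range(128))
unicode_256 = "".join(chr(number) for number in range(256))

# Decode every possible byte value with the legacy character sets.
ascii_text = bytes(range(128)).decode("ascii")
latin1_text = bytes(range(256)).decode("latin-1")

print(ascii_text == unicode_128)    # True
print(latin1_text == unicode_256)   # True
```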

Politics…

• Does it make sense to use Unicode rather than, say, ISO-latin-1?

• Many commercial pieces of software have data files that contain character data interspersed with non-character data. Is that good?

http://openlib.org/home/krichel

Thank you for your attention!