LING 408/508: Programming for Linguists
Lecture 2 August 26th
Today’s Topics
• continuing on from last time …
• Homework 1
Administrivia
• No class on
– Monday September 7th (Labor Day)
– Wednesday November 11th (Veterans Day)
– Week after September 11th (out of town), plus Monday 21st
– Monday October 12th
Introduction: data types
• what if you want to store even larger numbers than 32 bits?
– Binary Coded Decimal (BCD)
– 1 byte can code two digits (0-9 requires 4 bits)
– 1 nibble (4 bits) codes the sign (+/-), e.g. hex C/D
• each nibble has bit weights 2^3 2^2 2^1 2^0:
– 0 0 0 0 = 0
– 0 0 0 1 = 1
– 1 0 0 1 = 9
• 2 0 1 4: 2 bytes (= 4 nibbles)
• + 2 0 1 4: 2.5 bytes (= 5 nibbles)
• sign nibbles:
– 1 1 0 0 = C: credit (+)
– 1 1 0 1 = D: debit (-)
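The nibble packing above can be sketched in a few lines of Python. This is a minimal illustration, not a standard library routine: the function name `to_packed_bcd` is hypothetical, and placing the sign nibble last and padding with a leading zero nibble are assumptions borrowed from common packed-decimal practice.

```python
def to_packed_bcd(n: int) -> bytes:
    """Pack a signed integer as BCD: one nibble per digit, then a sign nibble."""
    sign = 0xC if n >= 0 else 0xD          # hex C = credit (+), hex D = debit (-)
    nibbles = [int(d) for d in str(abs(n))] + [sign]
    if len(nibbles) % 2:                   # assumption: pad with a leading 0 nibble
        nibbles.insert(0, 0)               # so the result fills whole bytes
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[::2], nibbles[1::2]))

print(to_packed_bcd(2014).hex())    # 02014c
print(to_packed_bcd(-2014).hex())   # 02014d
```

Five nibbles (four digits plus sign) do not fill a whole number of bytes, so the sketch rounds up to 3 bytes.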
Introduction: data types
• Typically, 64 bits (8 bytes) are used to represent floating point numbers (double precision)
– c = 2.99792458 x 10^8 (m/s)
– coefficient: 52 bits (implied 1, therefore treat as 53)
– exponent: 11 bits (usually not 2's complement, unsigned with bias 2^(11-1)-1 = 1023)
– sign: 1 bit (+/-)
• C: float, double
• x86 CPUs have a built-in floating point coprocessor (x87) with 80 bit long registers
• e.g. probabilities
(Wikipedia)
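To look at the three fields of a double directly, the bits can be pulled apart with Python's `struct` module. A minimal sketch; the helper name `double_fields` is hypothetical.

```python
import struct

def double_fields(x: float):
    """Return (sign, biased exponent, 52-bit fraction) of a 64-bit double."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                      # 1 bit
    exponent = (bits >> 52) & 0x7FF        # 11 bits, stored with a bias of 1023
    fraction = bits & ((1 << 52) - 1)      # 52 stored bits; the leading 1 is implied
    return sign, exponent, fraction

sign, exponent, fraction = double_fields(2.99792458e8)
print(sign, exponent - 1023, f"{fraction:052b}")
```

For the speed of light the unbiased exponent comes out as 28, since 2^28 <= 299,792,458 < 2^29.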
Introduction: data types
• Next time, we'll talk about the representation of characters (letters, symbols, etc.)
Example 1
• Recall the speed of light:
• c = 2.99792458 x 10^8 (m/s)
1. Can a 4 byte integer be used to represent c exactly?
– 4 bytes = 32 bits
– 32 bits in 2's complement format
– Largest positive number is 2^31-1 = 2,147,483,647
– c = 299,792,458
– Yes: c is smaller than the largest positive number
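The comparison can be checked directly. A short sketch using `struct` with the `"<i"` format (a 4-byte signed integer); the variable names are just for illustration.

```python
import struct

c = 299_792_458
INT32_MAX = 2**31 - 1               # largest positive 32-bit 2's complement value

print(INT32_MAX)                    # 2147483647
print(c <= INT32_MAX)               # True

packed = struct.pack("<i", c)       # raises struct.error if c did not fit
print(len(packed), struct.unpack("<i", packed)[0] == c)
```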
Example 2
• Recall the speed of light:
• c = 2.99792458 x 10^8 (m/s)
2. How much memory would you need to encode c using BCD notation?
– 9 digits
– each digit requires 4 bits (a nibble)
– BCD notation includes a sign nibble
– total is 5 bytes (9 digit nibbles + 1 sign nibble = 10 nibbles)
Example 3
• Recall the speed of light:
• c = 2.99792458 x 10^8 (m/s)
3. Can the 64 bit floating point representation (double) encode c without loss of precision?
– Recall significand precision: 53 bits (52 explicitly stored)
– 2^53-1 = 9,007,199,254,740,991
– almost 16 digits
– Yes: c = 299,792,458 has only 9 digits
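Python's `float` is a 64 bit double, so the claim is easy to test. A minimal sketch; the second check shows where exactness stops for integers.

```python
c = 299_792_458
print(2**53 - 1)                  # 9007199254740991
print(int(float(c)) == c)         # True: c fits in the 53-bit significand

big = 2**53 + 1                   # needs 54 significant bits
print(int(float(big)) == big)     # False: rounded to a neighbouring value
```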
Example 4
• Recall the speed of light:
• c = 2.99792458 x 10^8 (m/s)
• The 32 bit floating point representation (float), sometimes called single precision, is composed of 1 bit sign, 8 bits exponent (unsigned with bias 2^(8-1)-1 = 127), and 23 bits coefficient (24 bits effective).
• Can it represent c without loss of precision?
– 2^24-1 = 16,777,215
– Nope
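Python has no built-in 32 bit float type, but packing with the `struct` format `"f"` and unpacking again rounds a value to single precision. A sketch under that approach; `to_float32` is a hypothetical helper name.

```python
import struct

def to_float32(x: float) -> float:
    """Round x to the nearest 32-bit float and return it as a Python float."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

c = 299_792_458
print(to_float32(float(c)))           # 299792448.0, off by 10 m/s
print(to_float32(16_777_215.0))       # 16777215.0: still exact (2^24 - 1)
```

Between 2^28 and 2^29 single precision values are 32 apart, which is why c lands on the nearest multiple of 32.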
Homework 1
• For both solutions, show your work, i.e. how you derived your answer
• Pi (𝛑) is an irrational number
– can't be represented precisely!
(Wikipedia)
1. Encode Pi as accurately as possible using both the 64 and 32 bit floating point representations
Instruction: draw the diagram and fill in the 1's and 0's
2. How many decimal places of precision are provided by each of the 64 and 32 bit floating point representations?
Homework 1 Hints
• How to encode 1: (exp: 01111 = bias 01111 + 0 = 2^0, frac: 1000…) = 1.000… in binary
– remember: there is an implicit leading 1
• How to encode 2: (exp: 10000 = bias 01111 + 1 = 2^1, frac: 1000…) = 10.00… in binary
• How to encode 3: (exp: 10000 = bias 01111 + 1 = 2^1, frac: 1100…) = 11.000… in binary
• How to encode 4: (exp: 10001 = bias 01111 + 10 = 2^2, frac: 1000…) = 100.0… in binary
• How to encode 5: (exp: 10001 = bias 01111 + 10 = 2^2, frac: 1010…) = 101.0… in binary
• How to encode 6: (exp: 10001 = bias 01111 + 10 = 2^2, frac: 1100…) = 110.0… in binary
• How to encode 7: (exp: 10001 = bias 01111 + 10 = 2^2, frac: 1110…) = 111.0… in binary
• How to encode 8: (exp: 10010 = bias 01111 + 11 = 2^3, frac: 1000…) = 1000.0… in binary
• Decimal 3.5 is 1.11 x 2^1 = 11.1 in binary
• Decimal 3.25 is 1.101 x 2^1 = 11.01 in binary
• Decimal 3.125 is 1.1001 x 2^1 = 11.001 in binary
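The same small values can be inspected in the actual 32 bit layout. This sketch prints the sign, the 8 exponent bits (bias 01111111 = 127) and the 23 stored fraction bits; note that the stored fraction omits the implicit leading 1. The helper name `float32_fields` is hypothetical.

```python
import struct

def float32_fields(x: float) -> str:
    """Format a 32-bit float as 'sign exponent fraction' bit strings."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = f"{bits:032b}"
    return f"{s[0]} {s[1:9]} {s[9:]}"

for x in (1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 3.5, 3.25, 3.125):
    print(f"{x:>6}: {float32_fields(x)}")
```

For example, 3.0 prints as `0 10000000 10000000000000000000000`: exponent 127 + 1, and a stored fraction of .1 behind the implied leading 1.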
Homework 1
• Due Friday night
– (by midnight in my emailbox)
• Required format (for all homeworks unless otherwise specified):
– Plain text or PDF formats only
• (no .doc, .docx etc.)
– Single file only: cut and paste into one document
• (no multiple attachments)
– Subject line: 408/508 Homework 1
– First line: your full name
Introduction: data types
• How about letters, punctuation, etc.?
• ASCII
– American Standard Code for Information Interchange
– Based on English alphabet (upper and lower case) + space + digits + punctuation + control (Teletype Model 33)
– Question: how many bits do we need?
– 7 bits + 1 bit parity
– Remember everything is in binary …
• C: char
(Teletype Model 33 ASR Teleprinter, Wikipedia)
Introduction: data types
• order is important in sorting!
• 0-9: there's a connection with BCD. Notice: code 30 (hex) through 39 (hex)
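Both points can be seen from Python's `ord`, which returns a character's code. A short sketch; the sample strings are made up for illustration.

```python
# Code values in hex and as 7 bits.
for ch in "09AZaz":
    print(ch, hex(ord(ch)), f"{ord(ch):07b}")

# The low nibble of a digit's code is its BCD value.
print([ord(d) & 0x0F for d in "2014"])    # [2, 0, 1, 4]

# Sorting compares code values, so digits < upper case < lower case.
print(sorted(["banana", "Cherry", "apple", "42"]))
# ['42', 'Cherry', 'apple', 'banana']
```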
Introduction: data types
• Parity bit:
– transmission can be noisy
– parity bit can be added to ASCII code
– can spot single bit transmission errors
– even/odd parity:
• receiver understands each byte should have an even/odd number of 1 bits
– Example:
• 0 (zero) is ASCII 30 (hex) = 0110000 (7 bits)
• with the parity bit appended: even parity 01100000, odd parity 01100001
– Checking parity:
• Exclusive or (XOR): basic machine instruction
– A xor B true if either A or B true but not both
– Example:
• (even parity 0) 01100000: xor bit by bit
• 0 xor 1 = 1, 1 xor 1 = 0, 0 xor 0 = 0, 0 xor 0 = 0, 0 xor 0 = 0, 0 xor 0 = 0, 0 xor 0 = 0
• final result 0: the number of 1 bits is even
• x86 assembly language:
1. PF: even parity flag set by arithmetic ops.
2. TEST: AND (don't store result), sets PF
3. JP: jump if PF set
– Example:
MOV al,<char>
TEST al, al
JP <location if even>
<go here if odd>
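The bit-by-bit XOR check can be sketched in Python. The function names `xor_bits` and `add_parity` are hypothetical, and appending the parity bit as the lowest bit of the byte is an assumption; real links differ in where they put it.

```python
from functools import reduce
from operator import xor

def xor_bits(value: int, width: int) -> int:
    """XOR all bits together: 0 if the number of 1 bits is even, else 1."""
    return reduce(xor, ((value >> i) & 1 for i in range(width)), 0)

def add_parity(code: int, even: bool = True) -> int:
    """Append a parity bit to a 7-bit ASCII code, giving an 8-bit value."""
    bit = xor_bits(code, 7)          # 1 if the 7-bit code has an odd number of 1s
    if not even:
        bit ^= 1                     # odd parity uses the opposite bit
    return (code << 1) | bit         # assumption: parity bit goes last

sent = add_parity(ord("0"))          # ASCII 30 (hex) = 0110000
print(f"{sent:08b}")                 # 01100000
print(xor_bits(sent, 8))             # 0: even parity holds
corrupted = sent ^ 0b00000100        # flip one bit "in transit"
print(xor_bits(corrupted, 8))        # 1: single bit error detected
```

Flipping two bits would leave the XOR at 0, which is why parity only spots single bit errors.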
Introduction: data types
• UTF-8
– standard in the post-ASCII world
– backwards compatible with ASCII
– (previously, different languages had multi-byte character sets that clashed)
– Universal Character Set (UCS) Transformation Format 8-bits
(Wikipedia)
Introduction: data types
• Example:
– あ Hiragana letter A: UTF-8: E38182
– Byte 1: E = 1110, 3 = 0011
– Byte 2: 8 = 1000, 1 = 0001
– Byte 3: 8 = 1000, 2 = 0010
– い Hiragana letter I: UTF-8: E38184
• Shift-JIS (Hex): あ: 82A0, い: 82A2
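These byte values can be reproduced with `str.encode`, using Python's built-in `utf-8` and `shift_jis` codecs. A minimal sketch:

```python
for ch in "あい":
    utf8 = ch.encode("utf-8")
    sjis = ch.encode("shift_jis")
    bits = " ".join(f"{b:08b}" for b in utf8)
    print(ch, utf8.hex().upper(), bits, sjis.hex().upper())
# あ E38182 11100011 10000001 10000010 82A0
# い E38184 11100011 10000001 10000100 82A2
```

The same character takes 3 bytes in UTF-8 and 2 bytes in Shift-JIS, and the byte values have nothing in common.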
Introduction: data types
• How can you tell what encoding your file is using?
• Detecting UTF-8
– Microsoft:
• 1st three bytes in the file are EF BB BF
• (not all software understands this; not everybody uses it)
– HTML:
• <meta http-equiv="Content-Type" content="text/html;charset=UTF-8" >
• (not always present)
– Analyze the file:
• Find non-valid UTF-8 sequences: if found, not UTF-8…
• Interesting paper:
– http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html
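Two of these checks, the EF BB BF marker and the search for invalid sequences, are easy to sketch in Python. The function name `looks_like_utf8` and its return strings are hypothetical; this is a rough heuristic, not a full charset detector.

```python
import codecs

def looks_like_utf8(data: bytes) -> str:
    """Classify raw bytes as 'utf-8 (BOM)', 'utf-8 (valid)' or 'not utf-8'."""
    if data.startswith(codecs.BOM_UTF8):      # the three bytes EF BB BF
        return "utf-8 (BOM)"
    try:
        data.decode("utf-8")                  # raises on any invalid sequence
    except UnicodeDecodeError:
        return "not utf-8"
    return "utf-8 (valid)"

print(looks_like_utf8(b"\xef\xbb\xbfhello"))          # utf-8 (BOM)
print(looks_like_utf8("あ".encode("utf-8")))           # utf-8 (valid)
print(looks_like_utf8("あ".encode("shift_jis")))       # not utf-8
```

A "valid" answer is only evidence, not proof: plain ASCII text, for instance, is valid in UTF-8 and in many older encodings at once.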
Introduction: data types
• Filesystem:
– different on different computers: sometimes a problem if you mount filesystems across different systems
• Examples:
– FAT32 (File Allocation Table) DOS, Windows, memory cards (limited to 4GB max file size)
– ExFAT (Extended FAT) SD cards (> 4GB files)
– NTFS (New Technology File System) Windows
– ext4 (Fourth Extended Filesystem) Linux
– HFS+ (Hierarchical File System Plus) Macs
Introduction: data types
• Files:
– Name (Path from / root)
– Type (e.g. .docx, .pptx, .pdf, .html, .txt)
– Owner (usually the Creator)
– Permissions (for the Owner, Group, or Everyone)
– need to be opened (to read from or write to)
– Mode: read/write/append
– Binary/Text
• in all programming languages: open command
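In Python the open command is the built-in `open`, and the mode string selects read/write/append and text/binary. A small sketch; the file name `demo.txt` and the temporary directory are made up for the example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")   # hypothetical file

# "w" = write (create or truncate); newline="\n" keeps LF on every platform
with open(path, "w", encoding="utf-8", newline="\n") as f:
    f.write("first line\n")
# "a" = append to the end of the existing file
with open(path, "a", encoding="utf-8", newline="\n") as f:
    f.write("あ\n")
# "r" = read in text mode: bytes are decoded into a str
with open(path, "r", encoding="utf-8") as f:
    text = f.read()
# "rb" = read in binary mode: raw bytes, no decoding
with open(path, "rb") as f:
    raw = f.read()

print(repr(text))   # 'first line\nあ\n'
print(raw)          # b'first line\n\xe3\x81\x82\n'
```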
Introduction: data types
• Text files:
– text files have lines: how do we mark the end of a line?
– End of line (EOL) control character(s):
• LF 0x0A (Mac/Linux),
• CR 0x0D (Old Macs),
• CR+LF 0x0D0A (Windows)
– End of file (EOF) control character:
• (EOT) 0x04 (aka Control-D)
• programming languages: NUL used to mark the end of a string
(binaryvision.nl)
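Reading a file in binary mode makes the EOL bytes visible. The sketch below guesses the convention from raw bytes; `detect_eol` is a hypothetical helper and it assumes the file uses a single convention throughout.

```python
def detect_eol(data: bytes) -> str:
    """Name the line-ending convention used in raw file bytes."""
    if b"\r\n" in data:              # check CR+LF first: it contains both bytes
        return "CR+LF (Windows)"
    if b"\r" in data:
        return "CR (old Macs)"
    if b"\n" in data:
        return "LF (Mac/Linux)"
    return "no line ending found"

print(detect_eol(b"one\r\ntwo\r\n"))             # CR+LF (Windows)
print(detect_eol(b"one\ntwo\n"))                 # LF (Mac/Linux)
print(b"one\r\ntwo\rthree\n".splitlines())       # [b'one', b'two', b'three']
```

`bytes.splitlines` accepts all three conventions, which is handy when the origin of a file is unknown.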