Unicode Day 12 - 9/22/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University


Unicode

Day 12 - 9/22/14

LING 3820 & 6820

Natural Language Processing

Harry Howard

Tulane University


Course organization

22-Sept-2014, NLP, Prof. Howard, Tulane University


http://www.tulane.edu/~howard/LING3820/

The syllabus is under construction.

http://www.tulane.edu/~howard/CompCultEN/


Review of Lists

The quiz was the review.


Open Spyder


6. Non-English characters: one code to rule them all


Did you know …

>>> unsorted = 'a*@A6'
>>> sorted(unsorted)
['*', '6', '@', 'A', 'a']


Introduction

So your program is humming along, and it hits the string 'cañón' and chokes. For instance, it may try to find out the length of cañón:

>>> S = 'cañón'
>>> len(S)
>>> from re import findall
>>> findall(r'\w{5}',S)
>>> T = findall(r'.{5}',S)
>>> T
['ca\xc3\xb1\xc3']
>>> U = ''.join(T)
>>> print U
>>> findall(r'.{7}',S)
['ca\xc3\xb1\xc3\xb3n']
>>> T = findall(r'.{7}',S)
>>> U = ''.join(T)
>>> print U
cañón
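The session above is Python 2, where 'cañón' is stored as a string of bytes, which is why seven "characters" are needed to match the whole word. As a rough sketch of the same distinction, assuming Python 3 (not what the slides use), the text and its UTF-8 bytes can be kept apart explicitly. The variable names word and raw are only illustrative.

```python
import re

# Written with escapes so the accented letters are single code points.
word = "ca\u00f1\u00f3n"        # 'cañón' as text: a sequence of code points
raw = word.encode("utf-8")      # the same word as UTF-8 bytes

print(len(word))                # 5 characters
print(len(raw))                 # 7 bytes: ñ and ó take two bytes each
print(raw)                      # b'ca\xc3\xb1\xc3\xb3n'

# On text, \w matches the accented letters, so five characters suffice.
print(re.findall(r"\w{5}", word))   # ['cañón']

# On bytes, a five-byte match cuts the second accented letter in half.
print(re.findall(rb".{5}", raw))    # [b'ca\xc3\xb1\xc3']
```

The last line reproduces the truncated result from the session: the match ends on the first byte of ó, so the result is no longer valid UTF-8.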


6.1. English characters and ASCII

Computers were originally designed to use the English alphabet, and in particular an encoding of it called the American Standard Code for Information Interchange, abbreviated ASCII and pronounced /ˈæski/ or “ass-kee” (see ASCII in Wikipedia).

ASCII is ultimately based on telegraph codes. It is a 7-bit code whose 128 values represent the digits 0-9, the English letters a-z and A-Z, the English punctuation symbols plus a blank space, and control codes that originated with Teletype machines, some of which are now obsolete.


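ASCII characters

  0 1 2 3 4 5 6 7 8 9 A B C D E F
0 – – – – – – – – – – – – – – – –
1 – – – – – – – – – – – – – – – –
2   ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~ –

Rows give the first hexadecimal digit of a character's code and columns the second; – marks a control code, and the blank cell in row 2, column 0 is the space character.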
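One way to read the table: the row label is the high hexadecimal digit of a character's code and the column label is the low digit, so 'A' in row 4, column 1 has code 0x41, or 65. The sketch below assumes Python 3, and ascii_cell is a hypothetical helper name introduced only for illustration.

```python
def ascii_cell(row, col):
    """Return the character in the given row and column of the 16-column table."""
    return chr(row * 16 + col)

print(ascii_cell(4, 1))     # A
print(hex(ord("A")))        # 0x41

# Rebuild the printable rows (2 through 7) of the table.
for row in range(2, 8):
    cells = [ascii_cell(row, col) for col in range(16)]
    if row == 7:
        cells[-1] = "-"     # code 127 (DEL) is a control code, not a printable character
    print(format(row, "X"), " ".join(cells))
```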


So now you know …

>>> unsorted = 'a*@A6'
>>> sorted(unsorted)
['*', '6', '@', 'A', 'a']
>>> ord(' ')
>>> ord('!')
>>> ord('~')
>>> chr(32)
>>> chr(33)
>>> chr(126)
>>> chr(127)
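The sort order above follows the numeric ASCII codes rather than any alphabetical convention. A small sketch, assuming Python 3 (the slides use Python 2, though ord() and chr() behave the same over the ASCII range), pairs each sorted character with its code to make that visible. The name pairs is illustrative.

```python
unsorted = "a*@A6"

# Pair each character with its code; the codes come out in increasing order.
pairs = [(ch, ord(ch)) for ch in sorted(unsorted)]
print(pairs)
# [('*', 42), ('6', 54), ('@', 64), ('A', 65), ('a', 97)]

# ord() and chr() undo each other.
print(all(chr(ord(ch)) == ch for ch in unsorted))   # True

# Code 127 is the DEL control code, so Python shows it as an escape.
print(repr(chr(127)))                               # '\x7f'
```

Punctuation and digits sort before uppercase letters, and uppercase before lowercase, because that is the order of their rows in the ASCII table.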


Background

6.2. Unicode and UTF-8


6.2.1. Character encoding in Python


Next time

7. NLTK and Internet corpora, but I am going to fold this chapter into §1 & §2, so the chapter numbering will change.
