Python tutorial

Transcript
Page 1: Python tutorial

Introduction to Python

Chen Lin
[email protected]

COSI 134a
Volen 110

Office Hour: Thurs. 3-5

Page 2: Python tutorial

For More Information?

http://python.org/ - documentation, tutorials, beginners guide, core distribution, ...

Books include:
- Learning Python by Mark Lutz
- Python Essential Reference by David Beazley
- Python Cookbook, ed. by Martelli, Ravenscroft and Ascher (online at http://code.activestate.com/recipes/langs/python/)
- http://wiki.python.org/moin/PythonBooks

Page 3: Python tutorial

Python Videos

http://showmedo.com/videotutorials/python
- “5 Minute Overview (What Does Python Look Like?)”
- “Introducing the PyDev IDE for Eclipse”
- “Linear Algebra with Numpy”
- And many more

Page 4: Python tutorial

4 Major Versions of Python

“Python” or “CPython” is written in C
- Version 2.7 came out in mid-2010
- Version 3.1.2 came out in early 2010

“Jython” is written in Java for the JVM

“IronPython” is written in C# for the .Net environment

Go To Website

Page 5: Python tutorial

Development Environments

What IDE to use? http://stackoverflow.com/questions/81584

1. PyDev with Eclipse
2. Komodo
3. Emacs
4. Vim
5. TextMate
6. Gedit
7. Idle
8. PIDA (Linux) (VIM based)
9. NotePad++ (Windows)
10. BlueFish (Linux)

Page 6: Python tutorial

PyDev with Eclipse

Page 7: Python tutorial

Python Interactive Shell

% python
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>

You can type things directly into a running Python session

>>> 2+3*4
14
>>> name = "Andrew"
>>> name
'Andrew'
>>> print "Hello", name
Hello Andrew
>>>

Page 8: Python tutorial

Background
Data Types/Structure
Control flow
File I/O
Modules
Class
NLTK

Page 9: Python tutorial

List

A compound data type:
[0]
[2.3, 4.5]
[5, "Hello", "there", 9.8]
[]

Use len() to get the length of a list
>>> names = ["Ben", "Chen", "Yaqin"]
>>> len(names)
3

Page 10: Python tutorial

Use [ ] to index items in the list

>>> names[0]
'Ben'
>>> names[1]
'Chen'
>>> names[2]
'Yaqin'
>>> names[3]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: list index out of range
>>> names[-1]
'Yaqin'
>>> names[-2]
'Chen'
>>> names[-3]
'Ben'

[0] is the first item.
[1] is the second item.
...

Out of range values raise an exception.

Negative values go backwards from the last element.
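
The indexing rules above can be sketched as follows. This is a minimal illustration written for Python 3 (the slides use Python 2 syntax); the helper name `safe_get` is hypothetical and not part of the tutorial.

```python
names = ["Ben", "Chen", "Yaqin"]

def safe_get(items, index, default=None):
    """Return items[index], or default when the index is out of range."""
    try:
        return items[index]
    except IndexError:
        return default

first = names[0]               # index 0 is the first item
last = names[-1]               # negative indices count back from the end
missing = safe_get(names, 3)   # index 3 is out of range for three items
print(first, last, missing)
```

Catching IndexError is one way to handle an out-of-range index; checking `index < len(items)` first would work as well.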

Page 11: Python tutorial

Strings share many features with lists

>>> smiles = "C(=N)(N)N.C(=O)(O)O"
>>> smiles[0]
'C'
>>> smiles[1]
'('
>>> smiles[-1]
'O'
>>> smiles[1:5]
'(=N)'
>>> smiles[10:-4]
'C(=O)'

Use “slice” notation to get a substring
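
A few more slice forms, as a small sketch in Python 3. The open-ended slices and the step value are standard Python; the variable names are only for illustration.

```python
smiles = "C(=N)(N)N.C(=O)(O)O"

head = smiles[:9]     # from the start up to, but not including, index 9
tail = smiles[10:]    # from index 10 to the end
numbers = [0, 1, 2, 3, 4, 5]
middle = numbers[1:4]  # the same notation works on lists
print(head, tail, middle)
```

Because the end index is excluded, `head + "." + tail` rebuilds the original string here.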

Page 12: Python tutorial

String Methods: find, split

>>> smiles = "C(=N)(N)N.C(=O)(O)O"
>>> smiles.find("(O)")
15
>>> smiles.find(".")
9
>>> smiles.find(".", 10)
-1
>>> smiles.split(".")
['C(=N)(N)N', 'C(=O)(O)O']
>>>

Use “find” to find the start of a substring.

find(".", 10) starts looking at position 10.

Find returns -1 if it couldn’t find a match.

Split the string into parts with “.” as the delimiter
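
The start-position argument of find() makes it possible to walk through every match. A minimal Python 3 sketch, assuming a hypothetical helper named `find_all` (it is not a built-in string method):

```python
def find_all(text, sub):
    """Return the start index of every occurrence of sub in text."""
    positions = []
    start = text.find(sub)
    while start != -1:                      # find() returns -1 when there is no match
        positions.append(start)
        start = text.find(sub, start + 1)   # resume just past the previous match
    return positions

smiles = "C(=N)(N)N.C(=O)(O)O"
open_parens = find_all(smiles, "(")
print(open_parens)
```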

Page 13: Python tutorial

String operators: in, not in

if "Br" in "Brother":
    print "contains brother"

email_address = "clin"
if "@" not in email_address:
    email_address += "@brandeis.edu"

Page 14: Python tutorial

String Method: “strip”, “rstrip”, “lstrip” are ways to remove whitespace or selected characters

>>> line = " # This is a comment line \n"
>>> line.strip()
'# This is a comment line'
>>> line.rstrip()
' # This is a comment line'
>>> line.rstrip("\n")
' # This is a comment line '
>>>
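
A typical use of strip() is cleaning lines before processing them. The sketch below (Python 3) skips blank lines and comment lines; the sample lines are made up for illustration.

```python
lines = ["  # header comment \n", "C(=N)(N)N \n", "\n", "  CCO\n"]

records = []
for line in lines:
    text = line.strip()                      # drop surrounding whitespace and the newline
    if not text or text.startswith("#"):     # skip blank lines and comments
        continue
    records.append(text)

title = "--results--".strip("-")             # strip selected characters instead of whitespace
print(records, title)
```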

Page 15: Python tutorial

More String methods

email.startswith("c") and email.endswith("u") return True/False

>>> "%[email protected]" % "clin"
'[email protected]'

>>> names = ["Ben", "Chen", "Yaqin"]
>>> ", ".join(names)
'Ben, Chen, Yaqin'

>>> "chen".upper()
'CHEN'

Page 16: Python tutorial

Unexpected things about strings

>>> s = "andrew"
>>> s[0] = "A"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
>>> s = "A" + s[1:]
>>> s
'Andrew'

Strings are read only

Page 17: Python tutorial

“\” is for special characters

\n -> newline
\t -> tab
\\ -> backslash
...

But Windows uses backslash for directories!

filename = "M:\nickel_project\reactive.smi" # DANGER!
filename = "M:\\nickel_project\\reactive.smi" # Better!
filename = "M:/nickel_project/reactive.smi" # Usually works
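
Raw string literals are another option that the slide does not mention: with an `r` prefix, backslashes are kept as written. The Python 3 sketch below compares the three spellings; the variable names are illustrative.

```python
raw = r"M:\nickel_project\reactive.smi"          # raw string: backslashes kept literally
escaped = "M:\\nickel_project\\reactive.smi"     # each backslash escaped by hand
broken = "M:\nickel_project\reactive.smi"        # \n and \r are read as control characters

print(raw == escaped)
print(repr(broken))
```

The broken version silently contains a newline and a carriage return in place of the two directory separators.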

Page 18: Python tutorial

Lists are mutable - some useful methods

>>> ids = ["9pti", "2plv", "1crn"]
>>> ids.append("1alm")
>>> ids
['9pti', '2plv', '1crn', '1alm']
>>> del ids[0]
>>> ids
['2plv', '1crn', '1alm']
>>> ids.sort()
>>> ids
['1alm', '1crn', '2plv']
>>> ids.reverse()
>>> ids
['2plv', '1crn', '1alm']
>>> ids.insert(0, "9pti")
>>> ids
['9pti', '2plv', '1crn', '1alm']

ids.extend(L): extend the list by appending all the items in the given list; equivalent to a[len(a):] = L.

append(): append an element

del: remove an element

sort(): sort by default order

reverse(): reverse the elements in a list

insert(): insert an element at some specified position. (Slower than .append())
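
Since the slide leaves the list L undefined, here is a small Python 3 sketch of extend(), alongside the built-in sorted(), which returns a new list instead of sorting in place. The extra IDs passed to extend() are made up for illustration.

```python
ids = ["9pti", "2plv", "1crn"]

ordered = sorted(ids)            # new sorted list; ids itself is unchanged
ids.extend(["1alm", "4hhb"])     # appends every item of the given list, in place
print(ordered)
print(ids)
```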

Page 19: Python tutorial

Tuples: sort of an immutable list

>>> yellow = (255, 255, 0) # r, g, b
>>> one = (1,)
>>> yellow[0]
255
>>> yellow[1:]
(255, 0)
>>> yellow[0] = 0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment

Very common in string interpolation:
>>> "%s lives in %s at latitude %.1f" % ("Andrew", "Sweden", 57.7056)
'Andrew lives in Sweden at latitude 57.7'

Page 20: Python tutorial

Zipping lists together

>>> names
['ben', 'chen', 'yaqin']
>>> gender = [0, 0, 1]
>>> zip(names, gender)
[('ben', 0), ('chen', 0), ('yaqin', 1)]
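
One caveat if you try this in Python 3: zip() returns a lazy iterator there rather than a list, so it has to be wrapped in list() to see the pairs. A minimal sketch, which also builds a dictionary from the pairs:

```python
names = ["ben", "chen", "yaqin"]
gender = [0, 0, 1]

pairs = list(zip(names, gender))     # materialize the iterator as a list of tuples
lookup = dict(zip(names, gender))    # pairs can also feed dict() directly
print(pairs)
print(lookup["yaqin"])
```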

Page 21: Python tutorial

Dictionaries

Dictionaries are lookup tables. They map from a “key” to a “value”.

symbol_to_name = {
    "H": "hydrogen",
    "He": "helium",
    "Li": "lithium",
    "C": "carbon",
    "O": "oxygen",
    "N": "nitrogen"
}

Duplicate keys are not allowed

Duplicate values are just fine

Page 22: Python tutorial

Keys can be any immutable value

numbers, strings, tuples, frozenset, not list, dictionary, set, ...

atomic_number_to_name = {
    1: "hydrogen",
    6: "carbon",
    7: "nitrogen",
    8: "oxygen",
}
nobel_prize_winners = {
    (1979, "physics"): ["Glashow", "Salam", "Weinberg"],
    (1962, "chemistry"): ["Hodgkin"],
    (1984, "biology"): ["McClintock"],
}

A set is an unordered collection with no duplicate elements.

Page 23: Python tutorial

Dictionary

>>> symbol_to_name["C"]
'carbon'
>>> "O" in symbol_to_name, "U" in symbol_to_name
(True, False)
>>> "oxygen" in symbol_to_name
False
>>> symbol_to_name["P"]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'P'
>>> symbol_to_name.get("P", "unknown")
'unknown'
>>> symbol_to_name.get("C", "unknown")
'carbon'

Get the value for a given key

Test if the key exists (“in” only checks the keys, not the values.)

[] lookup failures raise an exception. Use “.get()” if you want to return a default value.
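
Because “in” only checks keys, looking something up by value needs either a search over values() or a reversed dictionary. A Python 3 sketch, assuming the values are unique; the name `name_to_symbol` is illustrative:

```python
symbol_to_name = {
    "H": "hydrogen", "He": "helium", "Li": "lithium",
    "C": "carbon", "O": "oxygen", "N": "nitrogen",
}

has_oxygen = "oxygen" in symbol_to_name.values()   # searches the values, not the keys

# Reverse the mapping so that names become the keys.
name_to_symbol = {name: symbol for symbol, name in symbol_to_name.items()}
print(has_oxygen, name_to_symbol["carbon"])
print(name_to_symbol.get("phosphorus", "unknown"))
```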

Page 24: Python tutorial

Some useful dictionary methods

>>> symbol_to_name.keys()
['C', 'H', 'O', 'N', 'Li', 'He']

>>> symbol_to_name.values()
['carbon', 'hydrogen', 'oxygen', 'nitrogen', 'lithium', 'helium']

>>> symbol_to_name.update( {"P": "phosphorus", "S": "sulfur"} )
>>> symbol_to_name.items()
[('C', 'carbon'), ('H', 'hydrogen'), ('O', 'oxygen'), ('N', 'nitrogen'), ('P', 'phosphorus'), ('S', 'sulfur'), ('Li', 'lithium'), ('He', 'helium')]

>>> del symbol_to_name['C']
>>> symbol_to_name
{'H': 'hydrogen', 'O': 'oxygen', 'N': 'nitrogen', 'P': 'phosphorus', 'S': 'sulfur', 'Li': 'lithium', 'He': 'helium'}

Page 25: Python tutorial

Background
Data Types/Structure
- list, string, tuple, dictionary
Control flow
File I/O
Modules
Class
NLTK

Page 26: Python tutorial

Control Flow

Things that are False
- The boolean value False
- The numbers 0 (integer), 0.0 (float) and 0j (complex).
- The empty string "".
- The empty list [], empty dictionary {} and empty set set().

Things that are True
- The boolean value True
- All non-zero numbers.
- Any string containing at least one character.
- A non-empty data structure.
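
These rules can be checked directly with bool(). A small Python 3 sketch using the values listed above, plus a few truthy ones chosen for illustration:

```python
falsy_candidates = [False, 0, 0.0, 0j, "", [], {}, set()]
truthy_candidates = [True, 1, -0.5, " ", [0], {"k": None}]

all_false = all(not value for value in falsy_candidates)
all_true = all(bool(value) for value in truthy_candidates)
print(all_false, all_true)
```

Note that a string holding only a space and a list holding only 0 both count as true, since they are not empty.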

Page 27: Python tutorial

If

>>> smiles = "BrC1=CC=C(C=C1)NN.Cl"
>>> bool(smiles)
True
>>> not bool(smiles)
False
>>> if not smiles:
...     print "The SMILES string is empty"
...

The “else” case is always optional

Page 28: Python tutorial

Use “elif” to chain subsequent tests

>>> mode = "absolute"
>>> if mode == "canonical":
...     smiles = "canonical"
... elif mode == "isomeric":
...     smiles = "isomeric"
... elif mode == "absolute":
...     smiles = "absolute"
... else:
...     raise TypeError("unknown mode")
...
>>> smiles
'absolute'
>>>

“raise” is the Python way to raise exceptions

Page 29: Python tutorial

Boolean logic

Python expressions can have “and”s and “or”s:

if (ben <= 5 and chen >= 10 or
    chen == 500 and ben != 5):
    print "Ben and Chen"

Page 30: Python tutorial

Range Test

if (3 <= Time <= 5):
    print "Office Hour"
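
Two details of the last two slides can be sketched in Python 3: a chained comparison such as `3 <= x <= 5` behaves like `3 <= x and x <= 5`, and “and” binds more tightly than “or”. The function name `in_office_hour` is hypothetical.

```python
def in_office_hour(hour):
    """True when hour falls in the 3-5 window, inclusive."""
    return 3 <= hour <= 5

hours = [h for h in range(1, 8) if in_office_hour(h)]

# "and" is evaluated before "or", so these two expressions are equivalent.
ungrouped = False and False or True
grouped = (False and False) or True
regrouped = False and (False or True)   # different grouping, different result
print(hours, ungrouped, grouped, regrouped)
```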

Page 31: Python tutorial

For

>>> names = ["Ben", "Chen", "Yaqin"]
>>> for name in names:
...     print name
...
Ben
Chen
Yaqin

Page 32: Python tutorial

Tuple assignment in for loops

data = [ ("C20H20O3", 308.371),
         ("C22H20O2", 316.393),
         ("C24H40N4O2", 416.6),
         ("C14H25N5O3", 311.38),
         ("C15H20O2", 232.3181)]

for (formula, mw) in data:
    print "The molecular weight of %s is %s" % (formula, mw)

The molecular weight of C20H20O3 is 308.371
The molecular weight of C22H20O2 is 316.393
The molecular weight of C24H40N4O2 is 416.6
The molecular weight of C14H25N5O3 is 311.38
The molecular weight of C15H20O2 is 232.3181

Page 33: Python tutorial

Break, continue

>>> for value in [3, 1, 4, 1, 5, 9, 2]:
...     print "Checking", value
...     if value > 8:
...         print "Exiting for loop"
...         break
...     elif value < 3:
...         print "Ignoring"
...         continue
...     print "The square is", value**2
...
Checking 3
The square is 9
Checking 1
Ignoring
Checking 4
The square is 16
Checking 1
Ignoring
Checking 5
The square is 25
Checking 9
Exiting for loop
>>>

Use “break” to stop the for loop

Use “continue” to stop processing the current item

Page 34: Python tutorial

Range()

“range” creates a list of numbers in a specified range
range([start,] stop[, step]) -> list of integers
When step is given, it specifies the increment (or decrement).

>>> range(5)
[0, 1, 2, 3, 4]
>>> range(5, 10)
[5, 6, 7, 8, 9]
>>> range(0, 10, 2)
[0, 2, 4, 6, 8]

How to get every second element in a list?

for i in range(0, len(data), 2):
    print data[i]
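
For comparison, the same selection can be written with a slice step. The sketch below is Python 3, where range() returns a lazy sequence rather than a list, so list() is needed to display it; the sample data is made up.

```python
data = ["a", "b", "c", "d", "e"]

by_index = [data[i] for i in range(0, len(data), 2)]   # index-based, as on the slide
by_slice = data[::2]                                   # slice with a step of 2
evens = list(range(0, 10, 2))                          # materialize the range as a list
print(by_index, by_slice, evens)
```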

Page 35: Python tutorial

Background
Data Types/Structure
Control flow
File I/O
Modules
Class
NLTK

Page 36: Python tutorial

Reading files

>>> f = open("names.txt")
>>> f.readline()
'Yaqin\n'

Page 37: Python tutorial

Quick Way

>>> lst = [ x for x in open("text.txt","r").readlines() ]
>>> lst
['Chen Lin\n', '[email protected]\n', 'Volen 110\n', 'Office Hour: Thurs. 3-5\n', '\n', 'Yaqin Yang\n', '[email protected]\n', 'Volen 110\n', 'Office Hour: Tues. 3-5\n']

Ignore the header?

for (i,line) in enumerate(open('text.txt',"r").readlines()):
    if i == 0: continue
    print line

Page 38: Python tutorial

Using dictionaries to count occurrences

>>> name_count = {}
>>> for line in open('names.txt'):
...     name = line.strip()
...     name_count[name] = name_count.get(name, 0) + 1
...
>>> for (name, count) in name_count.items():
...     print name, count
...
Chen 3
Ben 3
Yaqin 3
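
The standard library's collections.Counter does the same bookkeeping as the get(name, 0) + 1 pattern. A Python 3 sketch, using an in-memory list of made-up lines in place of names.txt:

```python
from collections import Counter

# Stand-in for the lines of names.txt; the contents are illustrative.
lines = ["Chen\n", "Ben\n", "Yaqin\n", "Chen\n", " Ben \n"]

name_count = Counter(line.strip() for line in lines)
for name, count in sorted(name_count.items()):
    print(name, count)
```

A Counter returns 0 for names it has never seen, so no default value is needed when reading counts.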

Page 39: Python tutorial

File Output

input_file = open("in.txt")
output_file = open("out.txt", "w")
for line in input_file:
    output_file.write(line)

“w” = “write mode”
“a” = “append mode”
“wb” = “write in binary”
“r” = “read mode” (default)
“rb” = “read in binary”
“U” = “read files with Unix or Windows line endings”
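
The slide's copy loop never closes its files. One common way to handle that is a “with” block, which closes the files when the block ends. The Python 3 sketch below uses a temporary directory so it can run anywhere; the function name `copy_lines` is hypothetical.

```python
import os
import tempfile

def copy_lines(src_path, dst_path):
    """Copy src to dst line by line; both files are closed when the block exits."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(line)

workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "in.txt")
dst_path = os.path.join(workdir, "out.txt")

with open(src_path, "w") as f:      # create a small input file to copy
    f.write("Ben\nChen\nYaqin\n")

copy_lines(src_path, dst_path)
with open(dst_path) as f:
    copied = f.read()
print(copied)
```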

Page 40: Python tutorial

Background
Data Types/Structure
Control flow
File I/O
Modules
Class
NLTK

Page 41: Python tutorial

Modules

When a Python program starts it only has access to a basic set of functions and classes (“int”, “dict”, “len”, “sum”, “range”, ...).

“Modules” contain additional functionality.

Use “import” to tell Python to load a module.

>>> import math
>>> import nltk

Page 42: Python tutorial

Import the math module

>>> import math
>>> math.pi
3.1415926535897931
>>> math.cos(0)
1.0
>>> math.cos(math.pi)
-1.0
>>> dir(math)
['__doc__', '__file__', '__name__', '__package__', 'acos', 'acosh', 'asin', 'asinh', 'atan', 'atan2', 'atanh', 'ceil', 'copysign', 'cos', 'cosh', 'degrees', 'e', 'exp', 'fabs', 'factorial', 'floor', 'fmod', 'frexp', 'fsum', 'hypot', 'isinf', 'isnan', 'ldexp', 'log', 'log10', 'log1p', 'modf', 'pi', 'pow', 'radians', 'sin', 'sinh', 'sqrt', 'tan', 'tanh', 'trunc']
>>> help(math)
>>> help(math.cos)

Page 43: Python tutorial

“import” and “from ... import ...”

>>> import math
math.cos

>>> from math import cos, pi
cos

>>> from math import *
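
The two import forms give access to the same objects under different names, which a short Python 3 sketch can confirm:

```python
import math                  # names are reached through the module: math.cos
from math import cos, pi     # names are bound directly: cos, pi

same_function = math.cos is cos    # both names refer to one function object
value = cos(pi)
print(same_function, value)
```

The `from math import *` form binds every public name at once, which is convenient at the interactive prompt but makes it harder to see where a name came from.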

Page 44: Python tutorial

Background
Data Types/Structure
Control flow
File I/O
Modules
Class
NLTK

Page 45: Python tutorial

Classes

class ClassName(object):
    <statement-1>
    .
    .
    .
    <statement-N>

class MyClass(object):
    """A simple example class"""
    i = 12345
    def f(self):
        return self.i

class DerivedClassName(BaseClassName):
    <statement-1>
    .
    .
    .
    <statement-N>
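
To make the templates above concrete, here is a small sketch of a base class and a derived class. The class names `Course` and `Seminar` and their methods are hypothetical examples, not part of the tutorial; the code runs under Python 3.

```python
class Course(object):
    """A course with a code and a room."""

    def __init__(self, code, room):
        self.code = code
        self.room = room

    def describe(self):
        return "%s meets in %s" % (self.code, self.room)


class Seminar(Course):
    """Derived class: inherits __init__ and overrides describe()."""

    def describe(self):
        # Reuse the base class behaviour, then extend it.
        return Course.describe(self) + " (seminar)"


course = Course("COSI 134a", "Volen 110")
seminar = Seminar("COSI 134a", "Volen 110")
print(course.describe())
print(seminar.describe())
```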

Page 46: Python tutorial

Background
Data Types/Structure
Control flow
File I/O
Modules
Class
NLTK

Page 47: Python tutorial

http://www.nltk.org/book
NLTK is on berry patch machines!

>>> from nltk.book import *
>>> text1
<Text: Moby Dick by Herman Melville 1851>
>>> text1.name
'Moby Dick by Herman Melville 1851'
>>> text1.concordance("monstrous")
>>> dir(text1)
>>> text1.tokens
>>> text1.index("my")
4647
>>> sent2
['The', 'family', 'of', 'Dashwood', 'had', 'long', 'been', 'settled', 'in', 'Sussex', '.']

Page 48: Python tutorial

Classify Text

>>> def gender_features(word):
...     return {'last_letter': word[-1]}
>>> gender_features('Shrek')
{'last_letter': 'k'}
>>> from nltk.corpus import names
>>> import random
>>> names = ([(name, 'male') for name in names.words('male.txt')] +
...          [(name, 'female') for name in names.words('female.txt')])
>>> random.shuffle(names)

Page 49: Python tutorial

Featurize, train, test, predict

>>> import nltk
>>> featuresets = [(gender_features(n), g) for (n,g) in names]
>>> train_set, test_set = featuresets[500:], featuresets[:500]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> print nltk.classify.accuracy(classifier, test_set)
0.726
>>> classifier.classify(gender_features('Neo'))
'male'

Page 50: Python tutorial

from nltk.corpus import reuters

Reuters Corpus:
- 10,788 news documents, 1.3 million words.
- Classified into 90 topics
- Grouped into 2 sets, "training" and "test"
- Categories overlap with each other

http://nltk.googlecode.com/svn/trunk/doc/book/ch02.html

Page 51: Python tutorial

Reuters

>>> from nltk.corpus import reuters
>>> reuters.fileids()
['test/14826', 'test/14828', 'test/14829', 'test/14832', ...]
>>> reuters.categories()
['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', 'coconut', 'coconut-oil', 'coffee', 'copper', 'copra-cake', 'corn', 'cotton', 'cotton-oil', 'cpi', 'cpu', 'crude', 'dfl', 'dlr', ...]