Python 3 March 15, 2011


NLTK

import nltk
nltk.download()

1. Look at the list of available texts

import nltk
from nltk.book import *

texts()

2. Check out what the text1 (Moby Dick) object looks like

import nltk
from nltk.book import *

print text1[0:50]

Looks like a list of word tokens

3. Get a list of the most frequent word TOKENS

import nltk
from nltk.book import *

fd=FreqDist(text1)
print fd.keys()[0:10]

FreqDist is an object defined by NLTK: http://www.opendocs.net/nltk/0.9.5/api/nltk.probability.FreqDist-class.html

Give it a list of word tokens

It will be automatically sorted. Print the first 10 keys
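The slide code appears to be Python 2 (`print` as a statement) running against the NLTK of that time, where `fd.keys()` came back sorted by frequency. Below is a minimal Python 3 sketch of the same counting idea. It uses `collections.Counter` from the standard library as a stand-in for `FreqDist`, and `tokens` is a tiny made-up list rather than `text1`; both are assumptions for illustration. Recent NLTK releases, as far as I know, build `FreqDist` on top of `Counter`, so `most_common` should be available there too, but check that against your installed version.

```python
from collections import Counter

# Hypothetical stand-in for text1: a short hand-made token list.
tokens = ["the", "whale", ",", "the", "sea", ",", "and", "the", "ship", ","]

# Count how often each token occurs (the role FreqDist plays on the slide).
fd = Counter(tokens)

# most_common(n) returns the n most frequent (token, count) pairs,
# ordered from most to least frequent.
top_tokens = [token for token, count in fd.most_common(2)]
print(top_tokens)
```

Punctuation counts as a token here just as it does on the slide, which is what motivates the clean-up steps later in the deck.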

4. Now get a concordance of the third most common word

import nltk
from nltk.book import *

text1.concordance("and")

concordance is a method defined for an NLTK text: http://nltk.googlecode.com/svn/trunk/doc/api/nltk.text.Text-class.html#concordance

concordance(self, word, width=79, lines=25)
Print a concordance for word with the specified context window.
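To make the idea of a concordance concrete, here is a rough plain-Python sketch that lists every occurrence of a word together with a few tokens of context on each side. `simple_concordance`, its `window` parameter, and the sample sentence are all hypothetical; this is not how NLTK implements `concordance`, only an illustration of what such a method produces.

```python
def simple_concordance(tokens, word, window=3):
    """Return one context line per occurrence of `word` in `tokens`.

    Hypothetical helper for illustration; it is not NLTK's implementation.
    `window` is the number of tokens shown on each side of the match.
    """
    lines = []
    for i, token in enumerate(tokens):
        # Compare case-insensitively (a design choice of this sketch).
        if token.lower() == word.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


# Made-up sentence, split on whitespace into tokens.
tokens = "Call me Ishmael and sit down and listen".split()
concordance_lines = simple_concordance(tokens, "and", window=2)
for line in concordance_lines:
    print(line)
```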

String Operations

5. What if you don't want punctuation in your list? First, a simple way to fix it:

import nltk
from nltk.book import *

mobyDick=[x.replace(",","") for x in text1]
mobyDick=[x.replace(";","") for x in mobyDick]
mobyDick=[x.replace(".","") for x in mobyDick]
mobyDick=[x.replace("'","") for x in mobyDick]
mobyDick=[x.replace("-","") for x in mobyDick]
mobyDick=[x for x in mobyDick if len(x)>1]

fd=FreqDist(mobyDick)
print fd.keys()[0:10]

Make a new list of tokens. Call it mobyDick

For each token x in the original list…

Copy the token into the new list, except replace each , with nothing. The next four lines do the same for ; . ' and -

Then, finally, keep only the tokens longer than one character. This drops what was originally “.” and is now empty, and it also drops one-letter words such as “a” and “I”

Make a new FreqDist with the new list of tokens, call it fd

Print it like before
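The same clean-up can be tried without downloading any corpus. The sketch below is Python 3 and runs the slide's replace-then-filter steps on a short made-up token list (an assumption standing in for `text1`), using a loop over the punctuation marks instead of five near-identical lines.

```python
# Hypothetical stand-in for text1: a few tokens, some of them pure punctuation.
tokens = ["Call", "me", "Ishmael", ".", "I", "thought", ",",
          "old", "-", "fashioned", ";"]

# Strip each punctuation character in turn, as the slide does line by line.
cleaned = tokens
for mark in [",", ";", ".", "'", "-"]:
    cleaned = [x.replace(mark, "") for x in cleaned]

# Keep only tokens longer than one character.
cleaned = [x for x in cleaned if len(x) > 1]
print(cleaned)
```

Note that the length filter removes the word "I" along with the emptied punctuation tokens, which may or may not be what you want for a frequency count.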

Regular Expressions

6. Now the more complicated, but less typing way:

import nltk
from nltk.book import *
import re

punctuation = re.compile("[,.; '-]")
punctuationRemoved=[punctuation.sub("",x) for x in text1]

fd=FreqDist([x for x in punctuationRemoved if len(x)>1])
print fd.keys()[0:10]

Import regular expression module

Compile a regular expression

The RegEx will match any of the characters inside the brackets

Call the “sub” function associated with the RegEx named punctuation

Replace anything that matches the RegEx with nothing

As before, do this to each token in the text1 list

Call this new list punctuationRemoved

Get a FreqDist of all tokens with length >1

Print the top 10 word tokens as usual

Regular Expressions are Really Powerful and Useful!
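Here is a small Python 3 sketch of the regular-expression version, again on a made-up token list rather than `text1` (an assumption so the example runs without NLTK data). The variable names are illustrative only.

```python
import re

# Hypothetical stand-in for text1.
tokens = ["Call", "me", "Ishmael", ".", "whale", "-", "ship", ";", "don't", "a"]

# A character class: matches any single character listed between the brackets.
# The hyphen is placed last so it is read literally, not as a range.
punctuation = re.compile(r"[,.; '-]")

# Replace every match with the empty string, token by token.
punctuation_removed = [punctuation.sub("", x) for x in tokens]

# Keep only tokens longer than one character.
kept = [x for x in punctuation_removed if len(x) > 1]
print(kept)
```

One compiled pattern replaces the five separate `replace` calls, and adding another character to strip only means adding it inside the brackets.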

Quick Diversion

7. What if you wanted to see the least common word tokens?

import nltk
from nltk.book import *
import re

print fd.keys()[-10:]

Print the tokens from position -10 to the end
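Negative slicing works the same way on any frequency-ordered list. The Python 3 sketch below uses `collections.Counter` as a stand-in for `FreqDist` and a made-up token list; both are assumptions for illustration.

```python
from collections import Counter

# Hypothetical token list standing in for the cleaned Moby Dick tokens.
tokens = ["the", "the", "the", "whale", "whale", "sea"]
fd = Counter(tokens)

# most_common() with no argument returns every (token, count) pair,
# sorted from most to least frequent, so the tail holds the rarest tokens.
ranked = [token for token, count in fd.most_common()]

# A negative start index counts from the end of the list.
least_common = ranked[-2:]
print(least_common)
```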

8. And what if you wanted to see the frequencies with the words?

import nltk
from nltk.book import *
import re

print [(k, fd[k]) for k in fd.keys()[0:10]]

For each key “k” among the first 10 keys of the FreqDist, print it and look up its value (fd[k])
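The key-and-value pairing can be sketched in Python 3 as follows. `Counter` again stands in for `FreqDist`, and the token list is made up; the explicit `sorted` call is there because a plain `Counter` does not promise frequency order when you iterate over its keys.

```python
from collections import Counter

# Hypothetical token list standing in for the Moby Dick tokens.
tokens = ["the", "the", "the", "whale", "whale", "sea"]
fd = Counter(tokens)

# Order the keys by their counts, highest first.
ranked = sorted(fd, key=fd.get, reverse=True)

# Build (token, count) pairs by hand, mirroring the slide's list comprehension.
pairs = [(k, fd[k]) for k in ranked[0:2]]
print(pairs)

# Counter offers the same result directly.
print(fd.most_common(2))
```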

Back to Regular Expressions

9. Another simple example

import re

myString="I have red shoes and blue pants and a green shirt. My phone number is 8005551234 and my friend's phone number is (800)-565-7568 and my cell number is 1-800-123-4567. You could also call me at 18005551234 if you'd like."

colorsRegEx=re.compile("blue|red|green")
print colorsRegEx.sub("color",myString)

Looks similar to the RegEx that matched punctuation before

This RegEx matches the substring “blue” or the substring “red” or the substring “green”

Here, substitute anything that matches the RegEx with the string “color”
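A runnable Python 3 version of the color substitution, on a shortened copy of the sentence, might look like this. The snake_case names and the extra "bored" check are additions for illustration, not part of the slides.

```python
import re

my_string = "I have red shoes and blue pants and a green shirt."

# "|" separates alternatives: match "blue", or "red", or "green".
colors_regex = re.compile("blue|red|green")

# sub() replaces every match with the given replacement string.
result = colors_regex.sub("color", my_string)
print(result)

# The pattern matches substrings, so it also fires inside longer words.
inside_word = colors_regex.sub("color", "bored")
print(inside_word)
```

If matching inside words is a problem, wrapping the alternatives in word boundaries, for example `r"\b(?:blue|red|green)\b"`, is one common way to restrict the match to whole words.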

10. A more interesting example

What if we wanted to identify all of the phone numbers in the string?

import re

myString="I have red shoes and blue pants and a green shirt. My phone number is 8005551234 and my friend's phone number is (800)-565-7568 and my cell number is 1-800-123-4567. You could also call me at 18005551234 if you'd like."

phoneNumbersRegEx=re.compile('\d{11}')
print phoneNumbersRegEx.findall(myString)

Note that \d matches a digit, and {11} matches 11 digits in a row

findall will return a list of all substrings of myString that match the RegEx

This is a start. Output: ['18005551234']
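The first attempt can be reproduced in Python 3 as below, using only the phone-number part of the sentence. The second pattern, `\d{10}`, is not on the slides; it is added here as an assumption-free experiment to show why a fixed digit count alone is not enough.

```python
import re

my_string = ("My phone number is 8005551234 and my friend's phone number is "
             "(800)-565-7568 and my cell number is 1-800-123-4567. "
             "You could also call me at 18005551234 if you'd like.")

# Exactly 11 digits in a row: only the unpunctuated number with a leading 1.
eleven_digits = re.compile(r"\d{11}")
eleven = eleven_digits.findall(my_string)
print(eleven)

# Exactly 10 digits in a row: finds the first number, but also chops
# the 11-digit number down to its first 10 digits.
ten_digits = re.compile(r"\d{10}")
ten = ten_digits.findall(my_string)
print(ten)
```

Neither pattern sees the numbers written with hyphens or parentheses, which is what the optional "?" pieces in the final pattern address.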

Also will need to know:

“?” will match 0 or 1 repetitions of the previous element

Note: find lots more information on regular expressions here: http://docs.python.org/library/re.html

phoneNumbersRegEx=re.compile('1?-?\(?\d{3}\)?-?\d{3}-?\d{4}')
print phoneNumbersRegEx.findall(myString)

The answer is here, but let's derive it together
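One possible way to derive the pattern is to grow it one optional piece at a time and watch which numbers each version picks up. The intermediate steps below are a sketch of such a derivation, not necessarily the route taken in the lecture; only the last pattern comes from the slide.

```python
import re

my_string = ("My phone number is 8005551234 and my friend's phone number is "
             "(800)-565-7568 and my cell number is 1-800-123-4567. "
             "You could also call me at 18005551234 if you'd like.")

steps = [
    r"\d{3}\d{3}\d{4}",                # area code, exchange, line: ten digits
    r"\d{3}-?\d{3}-?\d{4}",            # allow an optional hyphen between groups
    r"\(?\d{3}\)?-?\d{3}-?\d{4}",      # allow optional parentheses around the area code
    r"1?-?\(?\d{3}\)?-?\d{3}-?\d{4}",  # allow an optional leading 1 and hyphen
]

results = []
for pattern in steps:
    found = re.findall(pattern, my_string)
    results.append(found)
    print(pattern, found)
```

Each "?" makes the element before it optional, so every step keeps the earlier matches and adds one more written form of a phone number. The parentheses are escaped as `\(` and `\)` because unescaped parentheses create a group in a regular expression.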