Python 3
March 15, 2011
NLTK
import nltk
nltk.download()
NLTK
1. Look at the lists of available texts
import nltk
from nltk.book import *
texts()
NLTK
2. Check out what the text1 (Moby Dick) object looks like
import nltk
from nltk.book import *
print text1[0:50]
Looks like a list of word tokens
NLTK
3. Get list of top most frequent word TOKENS
import nltk
from nltk.book import *
fd=FreqDist(text1)
print fd.keys()[0:10]
FreqDist is an object defined by NLTK: http://www.opendocs.net/nltk/0.9.5/api/nltk.probability.FreqDist-class.html
Give it a list of word tokens
Its keys will be automatically sorted from most to least frequent. Print the first 10 keys
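The slide relies on the 2011-era NLTK behaviour in which fd.keys() comes back sorted by frequency, and on Python 2 print statements. As a rough sketch of the same idea in Python 3, the snippet below uses collections.Counter from the standard library as a stand-in for FreqDist. The short token list is made up for illustration and is not taken from text1. In NLTK 3, FreqDist is built on Counter, so most_common should be available there as well.

```python
from collections import Counter

# Hypothetical toy token list standing in for text1 (not real Moby Dick data)
tokens = ["the", "whale", ",", "the", "sea", ",", "and", "the", "ship", "the"]

# Counter plays the role of FreqDist here: it maps each token to its count
fd = Counter(tokens)

# most_common(n) returns (token, count) pairs ordered from most to least frequent
top_tokens = [token for token, count in fd.most_common(2)]
print(top_tokens)
```

Asking for the ranking explicitly with most_common avoids depending on the order in which keys() happens to return its items.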
NLTK
4. Now get a concordance of the third most common word
import nltk
from nltk.book import *
text1.concordance("and")
concordance is a method defined for an NLTK Text: http://nltk.googlecode.com/svn/trunk/doc/api/nltk.text.Text-class.html#concordance
concordance(self, word, width=79, lines=25)
Print a concordance for word with the specified context window.
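To make the idea of a concordance concrete, here is a minimal Python 3 sketch that lists each occurrence of a word with a few tokens of context on either side. The helper name simple_concordance and its window parameter are hypothetical and are not part of NLTK; NLTK's own method measures its context in characters (width) rather than tokens. Matching is case-insensitive here, which is an assumption of this sketch.

```python
def simple_concordance(tokens, word, window=3):
    """Return one context line per occurrence of `word` (hypothetical helper, not NLTK)."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == word.lower():
            # Up to `window` tokens on each side of the match
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines

# Toy token list for illustration
tokens = "Call me Ishmael . Some years ago never mind how long".split()
for line in simple_concordance(tokens, "years", window=2):
    print(line)
```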
String Operations
5. What if you don't want punctuation in your list? First, simple way to fix it:
import nltk
from nltk.book import *
mobyDick=[x.replace(",","") for x in text1]
mobyDick=[x.replace(";","") for x in mobyDick]
mobyDick=[x.replace(".","") for x in mobyDick]
mobyDick=[x.replace("'","") for x in mobyDick]
mobyDick=[x.replace("-","") for x in mobyDick]
mobyDick=[x for x in mobyDick if len(x)>1]
fd=FreqDist(mobyDick)
print fd.keys()[0:10]
Make a new list of tokens. Call it mobyDick
For each token x in the original list…
Copy the token into the new list, except replace each , with nothing
Then, finally, just look at the tokens longer than one character (this drops what was originally “.” and is now empty, and also drops one-letter words such as “a”)
Make a new FreqDist with the new list of tokens, call it fd
Print it like before
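The steps above can be tried on a small list without loading NLTK. This Python 3 sketch uses a made-up token list (an assumption for illustration) and loops over the punctuation marks instead of repeating the comprehension five times. It also shows that filtering on len(x) > 1 removes one-letter words as well as the emptied punctuation tokens.

```python
# Hypothetical toy tokens standing in for text1
tokens = ["Call", "me", "Ishmael", ".", "whale-ship", ",", "a", "don't", ";"]

# Strip each punctuation mark from every token, one mark at a time
cleaned = tokens
for mark in [",", ";", ".", "'", "-"]:
    cleaned = [x.replace(mark, "") for x in cleaned]

# Keeping only non-empty tokens retains one-letter words such as "a"
nonempty = [x for x in cleaned if len(x) > 0]

# The slide's filter, len(x) > 1, drops them
longer_than_one = [x for x in cleaned if len(x) > 1]

print(nonempty)
print(longer_than_one)
```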
Regular Expressions
6. Now the more complicated, but less typing way:
import nltk
from nltk.book import *
import re
punctuation = re.compile("[,.; '-]")
punctuationRemoved=[punctuation.sub("",x) for x in text1]
fd=FreqDist([x for x in punctuationRemoved if len(x)>1])
print fd.keys()[0:10]
Import regular expression module
Compile a regular expression
The RegEx will match any of the characters inside the brackets
Call the “sub” method of the compiled RegEx named punctuation
Replace anything that matches the RegEx with nothing
As before, do this to each token in the text1 list
Call this new list punctuationRemoved
Get a FreqDist of all tokens with length >1
Print the top 10 word tokens as usual
Regular Expressions are Really Powerful and Useful!
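The regular-expression version can also be tried on a small list. This Python 3 sketch uses the same made-up tokens as a stand-in for text1 (an assumption for illustration) and a character class covering the five punctuation marks handled in the string-operations version.

```python
import re

# Character class: matches any single one of , . ; ' or -
# (the hyphen is placed last so it is read literally, not as a range)
punctuation = re.compile(r"[,.;'-]")

# Hypothetical toy tokens standing in for text1
tokens = ["Call", "me", "Ishmael", ".", "whale-ship", ",", "a", "don't", ";"]

# sub("", x) replaces every match in x with the empty string
punctuation_removed = [punctuation.sub("", x) for x in tokens]

# Same length filter as on the slide
kept = [x for x in punctuation_removed if len(x) > 1]
print(kept)
```

One compiled pattern replaces the five chained replace calls, and adding another punctuation mark only means adding one character inside the brackets.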
Quick Diversion
7. What if you wanted to see the least common word tokens?
import nltk
from nltk.book import *
import re
print fd.keys()[-10:]
Print the tokens from position -10 to the end
Quick Diversion
8. And what if you wanted to see the frequencies with the words?
import nltk
from nltk.book import *
import re
print [(k, fd[k]) for k in fd.keys()[0:10]]
For each of the first 10 keys “k” in the FreqDist, print it and look up its value (fd[k])
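Both diversions, the least common tokens and the tokens paired with their counts, can be sketched in Python 3 with collections.Counter standing in for FreqDist. The token list is invented for illustration.

```python
from collections import Counter

# Hypothetical toy tokens
tokens = ["the", "whale", "the", "sea", "the", "whale"]
fd = Counter(tokens)

# Full ranking as (token, count) pairs, most frequent first
ranked = fd.most_common()

# Slicing from the front gives the most common tokens with their counts
most = ranked[:2]

# Slicing from a negative position gives the least common ones
least = ranked[-2:]

print(most)
print(least)
```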
Back to Regular Expressions
9. Another simple example
import re
myString="I have red shoes and blue pants and a green shirt. My phone number is 8005551234 and my friend's phone number is (800)-565-7568 and my cell number is 1-800-123-4567. You could also call me at 18005551234 if you'd like."
colorsRegEx=re.compile("blue|red|green")
print colorsRegEx.sub("color",myString)
Looks similar to the RegEx that matched punctuation before
This RegEx matches the substring “blue” or the substring “red” or the substring “green”
Here, substitute anything that matches the RegEx with the string “color”
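A short Python 3 version of the color example, using only the first sentence of the string. Because the pattern matches substrings, it will also match “red” inside a longer word. The second pattern, with word boundaries, is an addition of this sketch and does not appear on the slides.

```python
import re

my_string = "I have red shoes and blue pants and a green shirt."

# Alternation: match "blue" or "red" or "green" anywhere in the string
colors = re.compile("blue|red|green")
replaced = colors.sub("color", my_string)
print(replaced)

# Substring matching also fires inside longer words
inside_word = colors.sub("color", "a hundred")
print(inside_word)

# Assumption of this sketch: \b word boundaries restrict matches to whole words
whole_words = re.compile(r"\b(?:blue|red|green)\b")
whole_only = whole_words.sub("color", "a hundred red shoes")
print(whole_only)
```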
Back to Regular Expressions
10. A more interesting example
import re
myString="I have red shoes and blue pants and a green shirt. My phone number is 8005551234 and my friend's phone number is (800)-565-7568 and my cell number is 1-800-123-4567. You could also call me at 18005551234 if you'd like."
What if we wanted to identify all of the phone numbers in the string?
phoneNumbersRegEx=re.compile(r'\d{11}')
print phoneNumbersRegEx.findall(myString)
Note that \d is a digit, and {11} matches 11 digits in a row
This is a start. Output: ['18005551234']
findall will return a list of all substrings of myString that match the RegEx
Also will need to know:
“?” will match 0 or 1 repetitions of the previous element
Note: find lots more information on regular expressions here: http://docs.python.org/library/re.html
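As a small illustration of the “?” quantifier together with findall, this Python 3 sketch compares the fixed 11-digit pattern with one whose leading 1 is optional. The sample string is shortened from the slide's myString, and the variable names are hypothetical.

```python
import re

sample = "Call 8005551234 or 18005551234."

# Exactly 11 digits in a row: only the number with the leading 1 matches
eleven_digits = re.compile(r"\d{11}")
strict = eleven_digits.findall(sample)
print(strict)

# "1?" means zero or one "1", so both the 10- and 11-digit forms match
optional_one = re.compile(r"1?\d{10}")
relaxed = optional_one.findall(sample)
print(relaxed)
```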
phoneNumbersRegEx=re.compile(r'1?-?\(?\d{3}\)?-?\d{3}-?\d{4}')
print phoneNumbersRegEx.findall(myString)
Answer is here, but let’s derive it together
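One way to read the final pattern piece by piece is to write it with re.VERBOSE, which lets each element carry its own comment. This is a Python 3 sketch of the same pattern as on the slide; the verbose layout and the snake_case names are choices made here, not part of the original slides.

```python
import re

my_string = ("I have red shoes and blue pants and a green shirt. "
             "My phone number is 8005551234 and my friend's phone number is "
             "(800)-565-7568 and my cell number is 1-800-123-4567. "
             "You could also call me at 18005551234 if you'd like.")

# Same pattern as the slide, spread out so each piece can be annotated
phone_numbers = re.compile(r"""
    1?        # optional leading 1
    -?        # optional dash
    \(?       # optional opening parenthesis
    \d{3}     # three digits (area code)
    \)?       # optional closing parenthesis
    -?        # optional dash
    \d{3}     # three digits
    -?        # optional dash
    \d{4}     # four digits
""", re.VERBOSE)

found = phone_numbers.findall(my_string)
print(found)
```

Every separator is optional, so the one pattern covers all four ways the numbers are written in the string.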