Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
DataCamp NaturalLanguageProcessingFundamentalsinPython
Wordcountswithbag-of-words
NATURALLANGUAGEPROCESSINGFUNDAMENTALSINPYTHON
KatharineJarmulFounder,kjamistan
DataCamp NaturalLanguageProcessingFundamentalsinPython
Bag-of-wordsBasicmethodforfindingtopicsinatextNeedtofirstcreatetokensusingtokenization...andthencountupallthetokensThemorefrequentaword,themoreimportantitmightbeCanbeagreatwaytodeterminethesignificantwordsinatext
DataCamp NaturalLanguageProcessingFundamentalsinPython
Bag-of-wordsexample
Text:"Thecatisinthebox.Thecatlikesthebox.Theboxisoverthecat."
Bagofwords(strippedpunctuation):
"The":3,"box":3"cat":3,"the":3"is":2"in":1,"likes":1,"over":1
DataCamp NaturalLanguageProcessingFundamentalsinPython
Bag-of-wordsinPythonIn[1]:fromnltk.tokenizeimportword_tokenize
In[2]:fromcollectionsimportCounter
In[3]:Counter(word_tokenize("""Thecatisinthebox.Thecatlikesthebox.Theboxisoverthecat."""))Out[3]:Counter({'.':3,'The':3,'box':3,'cat':3,'in':1,...'the':3})
In[4]:counter.most_common(2)Out[4]:[('The',3),('box',3)]
DataCamp NaturalLanguageProcessingFundamentalsinPython
Let'spractice!
NATURALLANGUAGEPROCESSINGFUNDAMENTALSINPYTHON
DataCamp NaturalLanguageProcessingFundamentalsinPython
Simpletextpreprocessing
NATURALLANGUAGEPROCESSINGFUNDAMENTALSINPYTHON
KatharineJarmulFounder,kjamistan
DataCamp NaturalLanguageProcessingFundamentalsinPython
Whypreprocess?Helpsmakeforbetterinputdata
WhenperformingmachinelearningorotherstatisticalmethodsExamples:
TokenizationtocreateabagofwordsLowercasingwords
Lemmatization/StemmingShortenwordstotheirrootstems
Removingstopwords,punctuation,orunwantedtokensGoodtoexperimentwithdifferentapproaches
DataCamp NaturalLanguageProcessingFundamentalsinPython
Preprocessingexample
Inputtext:Cats,dogsandbirdsarecommonpets.Soarefish.
Outputtokens:cat,dog,bird,common,pet,fish
DataCamp NaturalLanguageProcessingFundamentalsinPython
TextpreprocessingwithPythonIn[1]:fromntlk.corpusimportstopwords
In[2]:text="""Thecatisinthebox.Thecatlikesthebox.Theboxisoverthecat."""
In[3]:tokens=[wforwinword_tokenize(text.lower())ifw.isalpha()]
In[4]:no_stops=[tfortintokensiftnotinstopwords.words('english')]
In[5]:Counter(no_stops).most_common(2)Out[5]:[('cat',3),('box',3)]
DataCamp NaturalLanguageProcessingFundamentalsinPython
Let'spractice!
NATURALLANGUAGEPROCESSINGFUNDAMENTALSINPYTHON
DataCamp NaturalLanguageProcessingFundamentalsinPython
Introductiontogensim
NATURALLANGUAGEPROCESSINGFUNDAMENTALSINPYTHON
KatharineJarmulFounder,kjamistan
DataCamp NaturalLanguageProcessingFundamentalsinPython
Whatisgensim?Popularopen-sourceNLPlibraryUsestopacademicmodelstoperformcomplextasks
BuildingdocumentorwordvectorsPerformingtopicidentificationanddocumentcomparison
DataCamp NaturalLanguageProcessingFundamentalsinPython
Whatisawordvector?
DataCamp NaturalLanguageProcessingFundamentalsinPython
GensimExample
(Source:)
http://tlfvincent.github.io/2015/10/23/presidential-speech-topics
DataCamp NaturalLanguageProcessingFundamentalsinPython
CreatingagensimdictionaryIn[1]:fromgensim.corpora.dictionaryimportDictionary
In[2]:fromnltk.tokenizeimportword_tokenize
In[3]:my_documents=['Themoviewasaboutaspaceshipandaliens.',...:'Ireallylikedthemovie!',...:'Awesomeactionscenes,butboringcharacters.',...:'Themoviewasawful!Ihatealienfilms.',...:'Spaceiscool!Ilikedthemovie.',...:'Morespacefilms,please!',]
In[4]:tokenized_docs=[word_tokenize(doc.lower())...:fordocinmy_documents]
In[5]:dictionary=Dictionary(tokenized_docs)
In[6]:dictionary.token2idOut[6]:{'!':11,',':17,'.':7,'a':2,'about':4,...}
DataCamp NaturalLanguageProcessingFundamentalsinPython
Creatingagensimcorpus
gensimmodelscanbeeasilysaved,updated,andreused
OurdictionarycanalsobeupdatedThismoreadvancedandfeaturerichbag-of-wordscanbeusedinfutureexercises
In[7]:corpus=[dictionary.doc2bow(doc)fordocintokenized_docs]
In[8]:corpusOut[8]:[[(0,1),(1,1),(2,1),(3,1),(4,1),(5,1),(6,1),(7,1),(8,1)],[(0,1),(1,1),(9,1),(10,1),(11,1),(12,1)],...]
DataCamp NaturalLanguageProcessingFundamentalsinPython
Let'spractice!
NATURALLANGUAGEPROCESSINGFUNDAMENTALSINPYTHON
DataCamp NaturalLanguageProcessingFundamentalsinPython
Tf-idfwithgensim
NATURALLANGUAGEPROCESSINGFUNDAMENTALSINPYTHON
KatharineJarmulFounder,kjamistan
DataCamp NaturalLanguageProcessingFundamentalsinPython
Whatistf-idf?Termfrequency-inversedocumentfrequencyAllowsyoutodeterminethemostimportantwordsineachdocumentEachcorpusmayhavesharedwordsbeyondjuststopwordsThesewordsshouldbedown-weightedinimportanceExamplefromastronomy:"Sky"Ensuresmostcommonwordsdon'tshowupaskeywordsKeepsdocumentspecificfrequentwordsweightedhigh
DataCamp NaturalLanguageProcessingFundamentalsinPython
Tf-idfformula
w = tf ∗ log( )
w = tf-idfweightfortokeniindocumentj
tf = numberofoccurencesoftokeniindocumentj
df = numberofdocumentsthatcontaintokeni
N = totalnumberofdocuments
i,j i,jdfi
N
i,j
i,
i
DataCamp NaturalLanguageProcessingFundamentalsinPython
Tf-idfwithgensimIn[10]:fromgensim.models.tfidfmodelimportTfidfModel
In[11]:tfidf=TfidfModel(corpus)
In[12]:tfidf[corpus[1]]Out[12]:[(0,0.1746298276735174),(1,0.1746298276735174),(9,0.29853166221463673),(10,0.7716931521027908),...]
DataCamp NaturalLanguageProcessingFundamentalsinPython
Let'spractice!
NATURALLANGUAGEPROCESSINGFUNDAMENTALSINPYTHON