22
DataCamp Natural Language Processing Fundamentals in Python Word counts with bag- of-words NATURAL LANGUAGE PROCESSING FUNDAMENTALS IN PYTHON Katharine Jarmul Founder, kjamistan

Word counts with bag- of-words - Amazon S3 · 2017-08-14 · DataCamp Natural Language Processing Fundamentals in Python Bag-of-words Basic method for finding topics in a text Need

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Word counts with bag- of-words - Amazon S3 · 2017-08-14 · DataCamp Natural Language Processing Fundamentals in Python Bag-of-words Basic method for finding topics in a text Need

DataCamp NaturalLanguageProcessingFundamentalsinPython

Wordcountswithbag-of-words

NATURALLANGUAGEPROCESSINGFUNDAMENTALSINPYTHON

KatharineJarmulFounder,kjamistan

Page 2: Word counts with bag- of-words - Amazon S3 · 2017-08-14 · DataCamp Natural Language Processing Fundamentals in Python Bag-of-words Basic method for finding topics in a text Need

DataCamp NaturalLanguageProcessingFundamentalsinPython

Bag-of-wordsBasicmethodforfindingtopicsinatextNeedtofirstcreatetokensusingtokenization...andthencountupallthetokensThemorefrequentaword,themoreimportantitmightbeCanbeagreatwaytodeterminethesignificantwordsinatext

Page 3: Word counts with bag- of-words - Amazon S3 · 2017-08-14 · DataCamp Natural Language Processing Fundamentals in Python Bag-of-words Basic method for finding topics in a text Need

DataCamp NaturalLanguageProcessingFundamentalsinPython

Bag-of-wordsexample

Text:"Thecatisinthebox.Thecatlikesthebox.Theboxisoverthecat."

Bagofwords(strippedpunctuation):

"The":3,"box":3"cat":3,"the":3"is":2"in":1,"likes":1,"over":1

Page 4: Word counts with bag- of-words - Amazon S3 · 2017-08-14 · DataCamp Natural Language Processing Fundamentals in Python Bag-of-words Basic method for finding topics in a text Need

DataCamp NaturalLanguageProcessingFundamentalsinPython

Bag-of-wordsinPythonIn[1]:fromnltk.tokenizeimportword_tokenize

In[2]:fromcollectionsimportCounter

In[3]:Counter(word_tokenize("""Thecatisinthebox.Thecatlikesthebox.Theboxisoverthecat."""))Out[3]:Counter({'.':3,'The':3,'box':3,'cat':3,'in':1,...'the':3})

In[4]:counter.most_common(2)Out[4]:[('The',3),('box',3)]

Page 5: Word counts with bag- of-words - Amazon S3 · 2017-08-14 · DataCamp Natural Language Processing Fundamentals in Python Bag-of-words Basic method for finding topics in a text Need

DataCamp NaturalLanguageProcessingFundamentalsinPython

Let'spractice!

NATURALLANGUAGEPROCESSINGFUNDAMENTALSINPYTHON

Page 6: Word counts with bag- of-words - Amazon S3 · 2017-08-14 · DataCamp Natural Language Processing Fundamentals in Python Bag-of-words Basic method for finding topics in a text Need

DataCamp NaturalLanguageProcessingFundamentalsinPython

Simpletextpreprocessing

NATURALLANGUAGEPROCESSINGFUNDAMENTALSINPYTHON

KatharineJarmulFounder,kjamistan

Page 7: Word counts with bag- of-words - Amazon S3 · 2017-08-14 · DataCamp Natural Language Processing Fundamentals in Python Bag-of-words Basic method for finding topics in a text Need

DataCamp NaturalLanguageProcessingFundamentalsinPython

Whypreprocess?Helpsmakeforbetterinputdata

WhenperformingmachinelearningorotherstatisticalmethodsExamples:

TokenizationtocreateabagofwordsLowercasingwords

Lemmatization/StemmingShortenwordstotheirrootstems

Removingstopwords,punctuation,orunwantedtokensGoodtoexperimentwithdifferentapproaches

Page 8: Word counts with bag- of-words - Amazon S3 · 2017-08-14 · DataCamp Natural Language Processing Fundamentals in Python Bag-of-words Basic method for finding topics in a text Need

DataCamp NaturalLanguageProcessingFundamentalsinPython

Preprocessingexample

Inputtext:Cats,dogsandbirdsarecommonpets.Soarefish.

Outputtokens:cat,dog,bird,common,pet,fish

Page 9: Word counts with bag- of-words - Amazon S3 · 2017-08-14 · DataCamp Natural Language Processing Fundamentals in Python Bag-of-words Basic method for finding topics in a text Need

DataCamp NaturalLanguageProcessingFundamentalsinPython

TextpreprocessingwithPythonIn[1]:fromntlk.corpusimportstopwords

In[2]:text="""Thecatisinthebox.Thecatlikesthebox.Theboxisoverthecat."""

In[3]:tokens=[wforwinword_tokenize(text.lower())ifw.isalpha()]

In[4]:no_stops=[tfortintokensiftnotinstopwords.words('english')]

In[5]:Counter(no_stops).most_common(2)Out[5]:[('cat',3),('box',3)]

Page 10: Word counts with bag- of-words - Amazon S3 · 2017-08-14 · DataCamp Natural Language Processing Fundamentals in Python Bag-of-words Basic method for finding topics in a text Need

DataCamp NaturalLanguageProcessingFundamentalsinPython

Let'spractice!

NATURALLANGUAGEPROCESSINGFUNDAMENTALSINPYTHON

Page 11: Word counts with bag- of-words - Amazon S3 · 2017-08-14 · DataCamp Natural Language Processing Fundamentals in Python Bag-of-words Basic method for finding topics in a text Need

DataCamp NaturalLanguageProcessingFundamentalsinPython

Introductiontogensim

NATURALLANGUAGEPROCESSINGFUNDAMENTALSINPYTHON

KatharineJarmulFounder,kjamistan

Page 12: Word counts with bag- of-words - Amazon S3 · 2017-08-14 · DataCamp Natural Language Processing Fundamentals in Python Bag-of-words Basic method for finding topics in a text Need

DataCamp NaturalLanguageProcessingFundamentalsinPython

Whatisgensim?Popularopen-sourceNLPlibraryUsestopacademicmodelstoperformcomplextasks

BuildingdocumentorwordvectorsPerformingtopicidentificationanddocumentcomparison

Page 13: Word counts with bag- of-words - Amazon S3 · 2017-08-14 · DataCamp Natural Language Processing Fundamentals in Python Bag-of-words Basic method for finding topics in a text Need

DataCamp NaturalLanguageProcessingFundamentalsinPython

Whatisawordvector?

Page 14: Word counts with bag- of-words - Amazon S3 · 2017-08-14 · DataCamp Natural Language Processing Fundamentals in Python Bag-of-words Basic method for finding topics in a text Need

DataCamp NaturalLanguageProcessingFundamentalsinPython

GensimExample

(Source:)

http://tlfvincent.github.io/2015/10/23/presidential-speech-topics

Page 15: Word counts with bag- of-words - Amazon S3 · 2017-08-14 · DataCamp Natural Language Processing Fundamentals in Python Bag-of-words Basic method for finding topics in a text Need

DataCamp NaturalLanguageProcessingFundamentalsinPython

CreatingagensimdictionaryIn[1]:fromgensim.corpora.dictionaryimportDictionary

In[2]:fromnltk.tokenizeimportword_tokenize

In[3]:my_documents=['Themoviewasaboutaspaceshipandaliens.',...:'Ireallylikedthemovie!',...:'Awesomeactionscenes,butboringcharacters.',...:'Themoviewasawful!Ihatealienfilms.',...:'Spaceiscool!Ilikedthemovie.',...:'Morespacefilms,please!',]

In[4]:tokenized_docs=[word_tokenize(doc.lower())...:fordocinmy_documents]

In[5]:dictionary=Dictionary(tokenized_docs)

In[6]:dictionary.token2idOut[6]:{'!':11,',':17,'.':7,'a':2,'about':4,...}

Page 16: Word counts with bag- of-words - Amazon S3 · 2017-08-14 · DataCamp Natural Language Processing Fundamentals in Python Bag-of-words Basic method for finding topics in a text Need

DataCamp NaturalLanguageProcessingFundamentalsinPython

Creatingagensimcorpus

gensimmodelscanbeeasilysaved,updated,andreused

OurdictionarycanalsobeupdatedThismoreadvancedandfeaturerichbag-of-wordscanbeusedinfutureexercises

In[7]:corpus=[dictionary.doc2bow(doc)fordocintokenized_docs]

In[8]:corpusOut[8]:[[(0,1),(1,1),(2,1),(3,1),(4,1),(5,1),(6,1),(7,1),(8,1)],[(0,1),(1,1),(9,1),(10,1),(11,1),(12,1)],...]

Page 17: Word counts with bag- of-words - Amazon S3 · 2017-08-14 · DataCamp Natural Language Processing Fundamentals in Python Bag-of-words Basic method for finding topics in a text Need

DataCamp NaturalLanguageProcessingFundamentalsinPython

Let'spractice!

NATURALLANGUAGEPROCESSINGFUNDAMENTALSINPYTHON

Page 18: Word counts with bag- of-words - Amazon S3 · 2017-08-14 · DataCamp Natural Language Processing Fundamentals in Python Bag-of-words Basic method for finding topics in a text Need

DataCamp NaturalLanguageProcessingFundamentalsinPython

Tf-idfwithgensim

NATURALLANGUAGEPROCESSINGFUNDAMENTALSINPYTHON

KatharineJarmulFounder,kjamistan

Page 19: Word counts with bag- of-words - Amazon S3 · 2017-08-14 · DataCamp Natural Language Processing Fundamentals in Python Bag-of-words Basic method for finding topics in a text Need

DataCamp NaturalLanguageProcessingFundamentalsinPython

Whatistf-idf?Termfrequency-inversedocumentfrequencyAllowsyoutodeterminethemostimportantwordsineachdocumentEachcorpusmayhavesharedwordsbeyondjuststopwordsThesewordsshouldbedown-weightedinimportanceExamplefromastronomy:"Sky"Ensuresmostcommonwordsdon'tshowupaskeywordsKeepsdocumentspecificfrequentwordsweightedhigh

Page 20: Word counts with bag- of-words - Amazon S3 · 2017-08-14 · DataCamp Natural Language Processing Fundamentals in Python Bag-of-words Basic method for finding topics in a text Need

DataCamp NaturalLanguageProcessingFundamentalsinPython

Tf-idfformula

w = tf ∗ log( )

w = tf-idfweightfortokeniindocumentj

tf = numberofoccurencesoftokeniindocumentj

df = numberofdocumentsthatcontaintokeni

N = totalnumberofdocuments

i,j i,jdfi

N

i,j

i,

i

Page 21: Word counts with bag- of-words - Amazon S3 · 2017-08-14 · DataCamp Natural Language Processing Fundamentals in Python Bag-of-words Basic method for finding topics in a text Need

DataCamp NaturalLanguageProcessingFundamentalsinPython

Tf-idfwithgensimIn[10]:fromgensim.models.tfidfmodelimportTfidfModel

In[11]:tfidf=TfidfModel(corpus)

In[12]:tfidf[corpus[1]]Out[12]:[(0,0.1746298276735174),(1,0.1746298276735174),(9,0.29853166221463673),(10,0.7716931521027908),...]

Page 22: Word counts with bag- of-words - Amazon S3 · 2017-08-14 · DataCamp Natural Language Processing Fundamentals in Python Bag-of-words Basic method for finding topics in a text Need

DataCamp NaturalLanguageProcessingFundamentalsinPython

Let'spractice!

NATURALLANGUAGEPROCESSINGFUNDAMENTALSINPYTHON