Introduction to Natural Language Processing Source: Natural Language Processing with Python --- Analyzing Text with the Natural Language Toolkit


Status
We have progressed with Object-Oriented Programming in Python:
Simple I/O, File I/O
Lists, Strings, Tuples, and their methods
Numeric types and operations
Control structures: if, for, while
Function definition and use
Parameters for defining the function, arguments for calling the function

Applying what we have
We have looked at some of the NLTK book. Chapter 1 of the NLTK book repeats much of what we see in the other text, now in the context of an application domain: Natural Language Processing. Note: there are similar packages for other domains. Book examples in chapter 1 are all done with the interactive Python shell.

Reasons
What can we achieve by combining simple programming techniques with large quantities of text?
How can we automatically extract key words and phrases that sum up the style and content of a text?
What tools and techniques does the Python programming language provide for such work?
What are some of the interesting challenges of natural language processing?
Quote from the NLTK book: since text can cover any subject area, it is a general-interest area to explore in some depth.

The NLTK
The Natural Language Toolkit provides modules, datasets, and tutorials.
Contains: align, app (package), book, ccg (package), chat (package), chunk (package), classify (package), cluster (package), collocations, compat, containers, corpus (package), data, decorators, downloader, draw (package), etree (package), evaluate, examples (package), featstruct, grammar, help, inference (package), internals, lazyimport, metrics (package), misc (package), model (package), olac, parse (package), probability, sem (package), sourcedstring, stem (package), tag (package), text, tokenize (package), toolbox (package), tree, treetransforms, util, yamltags

We will not have time to explore all of them, but this gives a full list for further exploration.

Recall - the NLTK

>>> import nltk
>>> nltk.download()
This opens the NLTK download manager window. Do it now, if you have not done so.

Getting data from the downloaded files
Previously, we used
from math import pi
to get something specific from a module. Now, from nltk.book, we will get the text files we will use:
from nltk.book import *

Import the data files
>>> import nltk
>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

Do it now. Then type sent1 at a Python prompt to see the first sentence of Moby Dick.

Repeat for sent2 .. sent9 to see the first sentence of each text.

Take note of the collection of texts. Great variety. Different ones will be useful for different types of exploration.
What type of data is each first sentence? The sentence is represented as a list, with list elements being tokens - usually words, but some punctuation.

Searching the texts
>>> text9.concordance("sunset")
Building index...
Displaying 14 of 14 matches:
E suburb of Saffron Park lay on the sunset side of London , as red and ragged
n , as red and ragged as a cloud of sunset . It was built of a bright brick th
bered in that place for its strange sunset . It looked like the end of the wor
ival ; it was upon the night of the sunset that his solitude suddenly ended .
he Embankment once under a dark red sunset . The red river reflected the red s
st seemed of fiercer flame than the sunset it mirrored . It looked like a stre
he passionate plumage of the cloudy sunset had been swept away , and a naked m
der the sea . The sealed and sullen sunset behind the dark dome of St . Paul '
ming with the colour and quality of sunset . The Colonel suggested that , befo
gold . Up this side street the last sunset light shone as sharp and narrow as
of gas , which in the full flush of sunset seemed coloured like a sunset cloud
sh of sunset seemed coloured like a sunset cloud . " After all ," he said , "
y and quietly , like a long , low , sunset cloud , a long , low house , mellow
house , mellow in the mild light of sunset . All the six friends compared note
A concordance shows a word in context.

Same word in different texts
>>> text1.concordance("monstrous")
Building index...
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us ,
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
>>> text2.concordance("monstrous")
Building index...
Displaying 11 of 11 matches:
. " Now , Palmer , you shall see a monstrous pretty girl ." He immediately went
your sister is to marry him . I am monstrous glad of it , for then I shall have
ou may tell your sister . She is a monstrous lucky girl to get him , upon my ho
k how you will like them . Lucy is monstrous pretty , and so good humoured and
Jennings , " I am sure I shall be monstrous glad of Miss Marianne ' s company
usual noisy cheerfulness , " I am monstrous glad to see you -- sorry I could n
t however , as it turns out , I am monstrous glad there was never any thing in
so scornfully ! for they say he is monstrous fond of her , as well he may . I s
possible that she should ." " I am monstrous glad of it . Good gracious ! I hav
thing of the kind . So then he was monstrous happy , and talked on some time ab
e very genteel people . He makes a monstrous deal of money , and they keep thei
(The first listing is from Moby Dick, the second from Sense and Sensibility.)

>>> text1.similar("monstrous")
abundant candid careful christian contemptible curious delightfully
determined doleful domineering exasperate fearless few gamesome
horrible impalpable imperial lamentable lazy loving
>>> text2.similar("monstrous")
Building word-context index...
very exceedingly heartily so a amazingly as extremely good great
remarkably sweet vast
Note the different sense of the word in the two texts.

Looking at vocabulary
>>> len(set(text3))
2789
>>> len(set(text2))
6833
>>> len(text3)
44764
len(text3) is the total number of tokens; it includes non-words and repeated words.
What do these numbers mean? set() removes repeats, but the result still contains non-words, and words are counted twice if they are capitalized in some places.

>>> float(len(text2))/float(len(set(text2)))
20.719449729255086
What does this tell us? On average, a word is used more than 20 times - a rough measure of lexical richness.
>>> from __future__ import division
>>> 100*text2.count("money")/len(text2)
0.018364694581002431
Note the two ways to get floating point results when dividing integers. What does this tell us?

Making life easier
>>> lexical_diversity(text2)
20.719449729255086
>>> percentage(text2.count('money'),len(text2))
0.018364694581002431

>>> def lexical_diversity(text):
...     return len(text) / len(set(text))
...
>>> def percentage(count,total):
...     return 100*count/total
...

Spot check
Modify the function percentage so that you only have to pass it the name of the text and the word to count. The new call will look like this: percentage(text2, 'money')
In which of the texts is money most dominant? Where is it least dominant? What are the percentages for each text? (A sketch of the modified function follows.)
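A possible sketch of that modification. It only reuses pieces already on these slides (Text.count, len, and the __future__ division import), computing the count and total inside the function:

>>> from __future__ import division
>>> def percentage(text, word):
...     # count occurrences of word in text, as a percentage of all tokens
...     return 100 * text.count(word) / len(text)
...
>>> percentage(text2, 'money')
0.018364694581002431

Calling it on text1 through text9 with 'money' answers the most/least dominant questions.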

Indexing the texts
Each of the texts is a list, and so all our list methods work, including slicing:

>>> text2[0:100]
['[', 'Sense', 'and', 'Sensibility', 'by', 'Jane', 'Austen', '1811', ']', 'CHAPTER', '1', 'The', 'family', 'of', 'Dashwood', 'had', 'long', 'been', 'settled', 'in', 'Sussex', '.', 'Their', 'estate', 'was', 'large', ',', 'and', 'their', 'residence', 'was', 'at', 'Norland', 'Park', ',', 'in', 'the', 'centre', 'of', 'their', 'property', ',', 'where', ',', 'for', 'many', 'generations', ',', 'they', 'had', 'lived', 'in', 'so', 'respectable', 'a', 'manner', 'as', 'to', 'engage', 'the', 'general', 'good', 'opinion', 'of', 'their', 'surrounding', 'acquaintance', '.', 'The', 'late', 'owner', 'of', 'this', 'estate', 'was', 'a', 'single', 'man', ',', 'who', 'lived', 'to', 'a', 'very', 'advanced', 'age', ',', 'and', 'who', 'for', 'many', 'years', 'of', 'his', 'life', ',', 'had', 'a', 'constant', 'companion']
These are the first 100 elements in the list for text2 (Sense and Sensibility). Note that the result of the slice is itself a list.

Text index
We can see what is at a position:
>>> text2[302]
'devolved'

And where a word appears:
>>> text2.index('marriage')
255

Remember that indexing begins at 0 and the index tells how far removed you are from the initial element.

Strings
Each of the elements in each of the text lists is a string, and all the string methods apply.

Frequency distributions
>>> fdist1=FreqDist(text1)
>>> fdist1

>>> vocabulary1=fdist1.keys()

>>> vocabulary1[:50]
[',', 'the', '.', 'of', 'and', 'a', 'to', ';', 'in', 'that', "'", '-', 'his', 'it', 'I', 's', 'is', 'he', 'with', 'was', 'as', '"', 'all', 'for', 'this', '!', 'at', 'by', 'but', 'not', '--', 'him', 'from', 'be', 'on', 'so', 'whale', 'one', 'you', 'had', 'have', 'there', 'But', 'or', 'were', 'now', 'which', '?', 'me', 'like']
These are the 50 most common tokens in the text of Moby Dick. Many of these are not useful in characterizing the text. We call them stop words and will see how to eliminate them from consideration later.

More precise specification
Consider the mathematical expression {w | w ∈ V & P(w)}: the set of all w such that w is in the vocabulary V and w has the property P.

The Python implementation is
[w for w in V if p(w)]

>>> AustenVoc=set(text2)
>>> long_words_2=[w for w in AustenVoc if len(w) > 15]
>>> long_words_2
['incomprehensible', 'disqualifications', 'disinterestedness', 'companionableness']

List comprehension - we saw it first last week.

Add to the condition
>>> fdist2=FreqDist(text2)
>>> long_words_2=sorted([w for w in AustenVoc if len(w) > 12 and fdist2[w] > 5])
>>> long_words_2
['Somersetshire', 'accommodation', 'circumstances', 'communication', 'consciousness', 'consideration', 'disappointment', 'distinguished', 'embarrassment', 'encouragement', 'establishment', 'extraordinary', 'inconvenience', 'indisposition', 'neighbourhood', 'unaccountable', 'uncomfortable', 'understanding', 'unfortunately']
So, our "if p(w)" can be as complex as we need.

Spot check
Find all the words longer than 12 characters, which occur at least 5 times, in each of the texts. How well do they give you a sense of the texts? (A sketch appears at the end of this slide, after the collocations example.)

Collocations and Bigrams
Sometimes a word by itself is not representative of its role in a text. It is only with a companion word that we get the intended sense:
red wine
high horse
sign of hope
Bigrams are two-word combinations. Not all bigrams are useful, of course - len(bigrams(text2)) == 141575, including pairs like "and among" and "they could". Collocations provides bigrams that include uncommon words - words that might be significant in the text. text2.collocations() prints 20 pairs:
>>> colloc2=text2.collocations()
Colonel Brandon; Sir John; Lady Middleton; Miss Dashwood; every thing;
thousand pounds; dare say; Miss Steeles; said Elinor; Miss Steele;
every body; John Dashwood; great deal; Harley Street; Berkeley Street;
Miss Dashwoods; young man; Combe Magna; every day; next morning
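Here is one possible sketch of the spot check above. It just loops over the nine loaded texts and applies the same comprehension; the loop structure and print format are my own, not from the book.

>>> from nltk import FreqDist   # already available if you did from nltk.book import *
>>> for i, text in enumerate([text1, text2, text3, text4, text5, text6, text7, text8, text9]):
...     fdist = FreqDist(text)
...     # same condition as the Austen example above, applied to each text
...     print 'text%d:' % (i + 1), sorted([w for w in set(text) if len(w) > 12 and fdist[w] > 5])
...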

>>> [len(w) for w in text2]
[1, 5, 3, 11, 2, 4, 6, 4, 1, 7, 1, 3, 6, 2, 8, 3, 4, 4, 7, 2, 6, 1, 5, 6, 3, 5, 1, 3, 5, 9, 3, 2, 7, 4, 1, 2, 3, 6, 2, 5, 8, 1, 5, 1, 3, 4, 11, 1, 4, 3, 5, 2, 2, 11, 1, 6, 2, 2, 6, 3, 7, 4, 7, 2, 5, 11, 12, 1, 3, 4, 5, 2, 4, 6, 3, 1, 6, 3, 1, 3, 5, 2, 1, 4, 8, 3, 1, 3, 3, 3, 4, 5, 2, 3, 4, 1, 3, 1, 8, 9, 3, 11, 2, 3, 6, 1, 3, 3, 5, 1, 5, 8, 3, 5, 6, 3, 3, 1, 8, ...]
For each word in text2, return its length.
>>> fdist2=FreqDist([len(w) for w in text2])
>>> fdist2

>>> fdist2.keys()
[3, 2, 1, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 17, 16]
There are 141,576 words, each with a length. But there are only 17 different word lengths.
>>> fdist2.items()
[(3, 28839), (2, 24826), (1, 23009), (4, 21352), (5, 11438), (6, 9507), (7, 8158), (8, 5676), (9, 3736), (10, 2596), (11, 1278), (12, 711), (13, 334), (14, 87), (15, 24), (17, 3), (16, 2)]
>>> fdist2.max()
3
>>> fdist2[3]
28839
>>> fdist2[13]
334
There are 28,839 3-letter words (not necessarily unique words) and 334 13-letter words in Sense and Sensibility.

Table 1.2: FreqDist functions
Example                        Description
fdist = FreqDist(samples)      create a frequency distribution containing the given samples
fdist.inc(sample)              increment the count for this sample
fdist['monstrous']             count of the number of times a given sample occurred
fdist.freq('monstrous')        frequency of a given sample
fdist.N()                      total number of samples
fdist.keys()                   the samples sorted in order of decreasing frequency
for sample in fdist:           iterate over the samples, in order of decreasing frequency
fdist.max()                    sample with the greatest count
fdist.tabulate()               tabulate the frequency distribution
fdist.plot()                   graphical plot of the frequency distribution
fdist.plot(cumulative=True)    cumulative plot of the frequency distribution
fdist1 < fdist2                test if samples in fdist1 occur less frequently than in fdist2

From the NLTK book: run the following examples and explain what is happening. Then make up some tests of your own.
>>> sorted([w for w in set(text7) if '-' in w and 'index' in w])
>>> sorted([wd for wd in set(text3) if wd.istitle() and len(wd) > 10])
>>> sorted([w for w in set(sent7) if not w.islower()])
>>> sorted([t for t in set(text2) if 'cie' in t or 'cei' in t])

Ending the double count of words
The count of words from the various texts was flawed. How? We had:

What's the problem? How do we fix it?

>>> len(text1)
260819
>>> len(set(text1))
19317
>>> len(set([word.lower() for word in text1]))
17231
>>> len(set([word.lower() for word in text1 if word.isalpha()]))
16948

Nested loops and loops with conditions
Follow what happens.
>>> for token in sent1:
...     if token.islower():
...         print token, 'is a lowercase word'
...     elif token.istitle():
...         print token, 'is a titlecase word'
...     else:
...         print token, 'is punctuation'
...
Call is a titlecase word
me is a lowercase word
Ishmael is a titlecase word
. is punctuation

Another example
>>> tricky = sorted([w for w in set(text2) if 'cie' in w or 'cei' in w])
>>> for word in tricky:
...     print word,
ancient ceiling conceit conceited conceive conscience
conscientious conscientiously deceitful deceive ...

Automatic Text Understanding
See section 1.5: some realistic, interesting problems associated with Natural Language Processing.

Word sense disambiguation

a. The lost children were found by the searchers (agentive)
b. The lost children were found by the mountain (locative)
c. The lost children were found by the afternoon (temporal)

Pronoun resolution

a. The thieves stole the paintings. They were subsequently sold.
b. The thieves stole the paintings. They were subsequently caught.
c. The thieves stole the paintings. They were subsequently found.

Generating text!
>>> text4.generate()
Building ngram index...
Fellow - Citizens : Under Providence I have given freedom new reach ,
and maintain lasting peace -- based on righteousness and justice .
There was this reason only why the cotton - producing States should be
promoted by just and abundant society , on just principles . These
later years have elapsed , and civil war . More than this , we affirm
a new beginning is a destiny . May Congress prohibit slavery in the
workshop , in translating humanity ' s strongest , but we have adopted
, and fear of God . And , in each
An inaugural address?? -- compare the MIT hoax conference submission.

Translation
Babel> How long before the next flight to Alice Springs?
Babel> german
Babel> run
0> How long before the next flight to Alice Springs?
1> Wie lang vor dem folgenden Flug zu Alice Springs?
2> How long before the following flight to Alice jump?
3> Wie lang vor dem folgenden Flug zu Alice springen Sie?
4> How long before the following flight to Alice do you jump?
5> Wie lang, bevor der folgende Flug zu Alice tun, Sie springen?
6> How long, before the following flight to Alice does, do you jump?
7> Wie lang bevor der folgende Flug zu Alice tut, tun Sie springen?
8> How long before the following flight to Alice does, do you jump?
9> Wie lang, bevor der folgende Flug zu Alice tut, tun Sie springen?
10> How long, before the following flight does to Alice, do do you jump?
11> Wie lang bevor der folgende Flug zu Alice tut, Sie tun Sprung?
12> How long before the following flight does leap to Alice, does you?
Babel>

Jeopardy and Watson
http://www.youtube.com/watch?v=xm8iUjzgPTg&feature=related
http://www.youtube.com/watch?v=7h4baBEi0iA&feature=related -- the strange response
http://www.youtube.com/watch?src_vid=7h4baBEi0iA&feature=iv&v=lI-M7O_bRNg&annotation_id=annotation_383798#t=3m11s
Explanation of the strange response. The ultimate example of a machine and language.

Text corpora
A collection of text entities. Usually there is some unifying characteristic, but not always.
Typical examples:
All issues of a newspaper for a period of time
A collection of reports from a particular industry or standards body
More recent:
The whole collection of posts to twitter
All the entries in a blog or set of blogs

Check it out
Go to http://www.gutenberg.org/
Take a few minutes to explore the site. Look at the top 100 downloads of yesterday. Can you characterize them? What do you think of this list?

Corpora in nltk
The nltk includes part of the Gutenberg collection. Find out which ones by
>>> nltk.corpus.gutenberg.fileids()
These are the texts of the Gutenberg collection that are downloaded with the nltk package.

Accessing other texts
We will explore the files loaded with nltk. You may want to explore other texts also. From help(nltk.corpus):

If C{item} is one of the unique identifiers listed in the corpus module's C{items} variable, then the corresponding document will be loaded from the NLTK corpus package.
If C{item} is a filename, then that file will be read.

For now, just a note that we can use these tools on other texts that we download or acquire from any source.

Using the tools we saw before
The particular texts we saw in chapter 1 were accessed through aliases that simplified the interaction. Now, in the more general case, we have to do more. To get the list of words in a text:
>>> emma = nltk.corpus.gutenberg.words('austen-emma.txt')
Now we have the form we had for the texts of Chapter 1 and can use the tools found there. Try:
>>> len(emma)
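One caveat worth noting: convenience methods like concordance() from Chapter 1 belong to the nltk.Text class, so to use them on a corpus word list you can wrap the list first. A short sketch (the query word 'surprize' is just an example):

>>> emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
>>> emma.concordance('surprize')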

Note the frequency of use of Jane Austen books ???

Shortened reference - global context
Instead of citing the gutenberg corpus for each resource:
>>> from nltk.corpus import gutenberg
>>> gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', ...]
>>> emma = gutenberg.words('austen-emma.txt')
So, nltk.corpus.gutenberg.words('austen-emma.txt') becomes just gutenberg.words('austen-emma.txt').

Other access options
gutenberg.words('austen-emma.txt')   the words of the text
gutenberg.raw('austen-emma.txt')     the original text, no separation into tokens (words); one long string
gutenberg.sents('austen-emma.txt')   the text divided into sentences

Some code to run
Enter and run the code for counting characters, words, and sentences, and finding the lexical diversity score of each text in the corpus.

import nltk
from nltk.corpus import gutenberg
for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))
    num_words = len(gutenberg.words(fileid))
    num_sents = len(gutenberg.sents(fileid))
    num_vocab = len(set([w.lower() for w in gutenberg.words(fileid)]))
    print int(num_chars/num_words), int(num_words/num_sents), \
        int(num_words/num_vocab), fileid

Short, simple code. We are already seeing some noticeable time to execute.

Modify the code
Simple change: print out the total number of characters, words, and sentences for each text. (One possible sketch follows.)
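A minimal sketch of that modification, keeping the same loop and simply printing the raw totals instead of the ratios; nothing here goes beyond the corpus calls already used above.

from nltk.corpus import gutenberg

for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))    # total characters in the raw text
    num_words = len(gutenberg.words(fileid))  # total tokens
    num_sents = len(gutenberg.sents(fileid))  # total sentences
    print num_chars, num_words, num_sents, fileid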

The text corpus
Take a look at your directory of nltk_data to see the variety of text materials accessible to you. Some are not plain text and we cannot use them yet, but will. Of the plain text, note the diversity:
Classic published materials
News feeds, movie reviews
Overheard conversations, internet chat
All categories of language are needed to understand the language as it is defined and as it is used.

The Brown Corpus
The first 1-million-word corpus.
Explore - what are the categories?
Access words or sentences from one or more categories or fileids.

>>> from nltk.corpus import brown
>>> brown.categories()
>>> brown.fileids(categories='news')

Stylistics
Enter that code and run it. What does it give you? What does it mean?
>>> from nltk.corpus import brown
>>> news_text = brown.words(categories='news')
>>> fdist = nltk.FreqDist([w.lower() for w in news_text])
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> for m in modals:
...     print m + ':', fdist[m],

Spot check
Repeat the previous code, but look for the use of those same words in the categories for religion and government. Now analyze the use of the wh- words in the news category and one other of your choice. (Who, What, Where, When, Why) (A sketch appears after this group of slides, just before Table 2.3.)

One step comparison
Consider the following code:
import nltk
from nltk.corpus import brown
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=genres, samples=modals)
Enter and run it. What does it do?

Other corpora
There is some information about the Reuters and Inaugural address corpora also. Take a look at them with the online site. (5 minutes or so)

Spot Check
Take a look at Table 2-2 for a list of some of the material available from the nltk project. (I cannot fit it on a slide in any meaningful way.)
Confirm that you have downloaded all of these (when you did the nltk.download, if you selected all).
Find them in your directory and explore. How many languages are represented? How would you describe the variety of content? What do you find most interesting/unusual/strange/fun?

Languages
The Universal Declaration of Human Rights is available in 300 languages.
>>> udhr.fileids()

Organization of Corpora
The organization will vary according to the type of corpus. Knowing the organization may be important for using the corpus.
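For the spot check above, here is one possible sketch. It reuses the FreqDist pattern from the Stylistics slide; the second category for the wh- words ('romance') is my own choice for illustration.

import nltk
from nltk.corpus import brown

modals = ['can', 'could', 'may', 'might', 'must', 'will']
wh_words = ['who', 'what', 'where', 'when', 'why']

# Modal verbs in two more Brown categories.
for category in ['religion', 'government']:
    fdist = nltk.FreqDist([w.lower() for w in brown.words(categories=category)])
    print category + ':',
    for m in modals:
        print m + ':', fdist[m],
    print

# wh- words in the news category and one other.
for category in ['news', 'romance']:
    fdist = nltk.FreqDist([w.lower() for w in brown.words(categories=category)])
    print category + ':',
    for w in wh_words:
        print w + ':', fdist[w],
    print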

Table 2.3: Basic Corpus Functionality in NLTK
Example                     Description
fileids()                   the files of the corpus
fileids([categories])       the files of the corpus corresponding to these categories
categories()                the categories of the corpus
categories([fileids])       the categories of the corpus corresponding to these files
raw()                       the raw content of the corpus
raw(fileids=[f1,f2,f3])     the raw content of the specified files
raw(categories=[c1,c2])     the raw content of the specified categories
words()                     the words of the whole corpus
words(fileids=[f1,f2,f3])   the words of the specified fileids
words(categories=[c1,c2])   the words of the specified categories
sents()                     the sentences of the whole corpus
sents(fileids=[f1,f2,f3])   the sentences of the specified fileids
sents(categories=[c1,c2])   the sentences of the specified categories
abspath(fileid)             the location of the given file on disk
encoding(fileid)            the encoding of the file (if known)
open(fileid)                open a stream for reading the given corpus file
root()                      the path to the root of the locally installed corpus
readme()                    the contents of the README file of the corpus

From help(nltk.corpus.reader):
Corpus reader functions are named based on the type of information they return. Some common examples, and their return types, are:
- I{corpus}.words(): list of str
- I{corpus}.sents(): list of (list of str)
- I{corpus}.paras(): list of (list of (list of str))
- I{corpus}.tagged_words(): list of (str,str) tuple
- I{corpus}.tagged_sents(): list of (list of (str,str))
- I{corpus}.tagged_paras(): list of (list of (list of (str,str)))
- I{corpus}.chunked_sents(): list of (Tree w/ (str,str) leaves)
- I{corpus}.parsed_sents(): list of (Tree with str leaves)
- I{corpus}.parsed_paras(): list of (list of (Tree with str leaves))
- I{corpus}.xml(): A single xml ElementTree
- I{corpus}.raw(): unprocessed corpus contents
For example, to read a list of the words in the Brown Corpus, use C{nltk.corpus.brown.words()}:
>>> from nltk.corpus import brown
>>> print brown.words()
These are the types of information returned from typical functions.

Spot check
Choose a corpus and exercise some of the functions. Look at raw, words, sents, categories, fileids, encoding. Repeat for a source in a different language. Work in pairs and talk about what you find, what you might want to look for. Report out briefly.

Working with your own sources
NLTK provides a great bunch of resources, but you will certainly want to access your own collections: other books you download, or files you create, etc.
>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = '/usr/share/dict'
>>> wordlists = PlaintextCorpusReader(corpus_root, '.*')
>>> wordlists.fileids()
['README', 'connectives', 'propernames', 'web2', 'web2a', 'words']
>>> wordlists.words('connectives')
['the', 'of', 'and', 'to', 'a', 'in', 'that', 'is', ...]
You could get the list of files in any directory.

Other Corpus readers
There are a number of different readers for different types of corpora. Many files in corpora are marked up in various ways, and the reader needs to understand the markings to return meaningful results. We will stick to the PlaintextCorpusReader for now.

Conditional Frequency Distribution
When texts in a corpus are divided into categories, we may want to look at the characteristics by category - word use by author or over time, for example.

Figure 2.4: Counting Words Appearing in a Text Collection (a conditional frequency distribution)

Frequency Distributions
A frequency distribution counts some occurrence, such as the use of a word or phrase. A conditional frequency distribution counts some occurrence separately for each of some number of conditions (author, date, genre, etc.). For example:
>>> genre_word = [(genre, word)
...     for genre in ['news', 'romance']
...     for word in brown.words(categories=genre)]
>>> len(genre_word)
170576
Think about this. What exactly is happening? What are those 170,576 things? Run the code, then enter just
>>> genre_word
For each genre (news, romance), loop over every word in that genre and produce the pairs showing the genre and the word. What type of data is genre_word?

Spot check - refining the result
When you displayed genre_word, you may have noticed that some of the words are not words at all. They are punctuation marks. Refine this code to eliminate the entries in genre_word in which the word is not all alphabetic. Remove duplicate words that differ only in capitalization. Work together. Talk about what you are doing. Share your ideas and insights. (One possible sketch follows.)
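A possible sketch of that refinement, assuming the same two genres: isalpha() drops punctuation and other non-alphabetic tokens, and lowercasing each word merges entries that differ only in capitalization before they are counted.

>>> from nltk.corpus import brown
>>> genre_word = [(genre, word.lower())
...     for genre in ['news', 'romance']
...     for word in brown.words(categories=genre)
...     if word.isalpha()]
>>> len(genre_word)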

Conditional Frequency Distribution
From the list of pairs we created, we can generate a conditional frequency distribution of words by genre:
>>> cfd = nltk.ConditionalFreqDist(genre_word)
>>> cfd
>>> cfd.conditions()
Run these. Look at the results. Then look at the conditional distributions:
>>> cfd['news']

>>> cfd['romance']

>>> list(cfd['romance'])
[',', '.', 'the', 'and', 'to', 'a', 'of', '``', "''", 'was', 'I', 'in', 'he', 'had', '?', 'her', 'that', 'it', 'his', 'she', 'with', 'you', 'for', 'at', 'He', 'on', 'him', 'said', '!', '--', 'be', 'as', ';', 'have', 'but', 'not', 'would', 'She', 'The', ...]
>>> cfd['romance']['could']
193

Presenting the results
Plotting and tabulating give concise representations of the frequency distributions.

Tabulate
With no parameters, tabulate() simply tabulates all the conditions against all the values:

cfd.tabulate()

Look closely
>>> from nltk.corpus import inaugural
>>> cfd = nltk.ConditionalFreqDist(
...     (target, fileid[:4])
...     for fileid in inaugural.fileids()
...     for w in inaugural.words(fileid)
...     for target in ['america', 'citizen']
...     if w.lower().startswith(target))
The annotations on the original slide label the pieces: get the text, the two axes (the target word and the year), all the words in each file, and the condition that narrows the word choice. Remember list comprehension?

Three elements
For a conditional frequency distribution:
Two axes:
the condition or event - something of interest
some connected characteristic - a year, a place, an author, anything that is related in some way to the event
Something to count:
for the condition and the characteristic, what are we counting? Words? Actions? What?

From the previous example
inaugural addresses
specific words
count the number of times that a form of either of those words occurred in that address

Spot check
Run the code on the previous example.
How many times was some version of citizen used in the 1909 inaugural address?
How many times was america mentioned in 2009?
Play with the code. What can you leave off and still get some meaningful output? (A sketch follows after the plot example below.)

Another case
Somewhat simpler specification: the distribution of word lengths in several languages, with a restriction on the languages.
>>> from nltk.corpus import udhr
>>> languages = ['Chickasaw', 'English', 'German_Deutsch',
...     'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
>>> cfd = nltk.ConditionalFreqDist(
...     (lang, len(word))
...     for lang in languages
...     for word in udhr.words(lang + '-Latin1'))

Now tabulate
Only choose to tabulate some of the results.
>>> cfd.tabulate(conditions=['English', 'German_Deutsch'],
...     samples=range(10), cumulative=True)
                  0    1    2    3    4    5    6    7    8    9
       English    0  185  525  883  997 1166 1283 1440 1558 1638
German_Deutsch    0  171  263  614  717  894 1013 1110 1213 1275

Plot
import matplotlib
cfd.plot()
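For the inaugural spot check above, the counts can be read straight out of the conditional frequency distribution. A sketch, assuming the cfd built from the inaugural example is still in scope (the years are just the ones named in the questions):

>>> cfd['citizen']['1909']    # forms of citizen in the 1909 inaugural address
>>> cfd['america']['2009']    # forms of america in the 2009 inaugural address
>>> cfd.tabulate(conditions=['america', 'citizen'], samples=['1909', '2009'])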

Common methods for Conditional Frequency Distributions
cfdist = ConditionalFreqDist(pairs)     create a conditional frequency distribution from a list of pairs
cfdist.conditions()                     alphabetically sorted list of conditions
cfdist[condition]                       the frequency distribution for this condition
cfdist[condition][sample]               frequency for the given sample for this condition
cfdist.tabulate()                       tabulate the conditional frequency distribution
cfdist.tabulate(samples, conditions)    tabulation limited to the specified samples and conditions
cfdist.plot()                           graphical plot of the conditional frequency distribution
cfdist.plot(samples, conditions)        graphical plot limited to the specified samples and conditions
cfdist1 < cfdist2                       test if samples in cfdist1 occur less frequently than in cfdist2

References
This set of slides comes very directly from the book, Natural Language Processing with Python. www.nltk.org