Lab 11: Processing Corpora, Online Data Resources
Ling 1330/2330: Intro to Computational Linguistics
Na-Rae Han, 2/9/2017
Objectives
How to process an ARCHIVE of text, i.e., a corpus?
From NLTK Corpora page, download:
C-Span Inaugural Address Corpus
How to process data resources downloaded from the internet?
From Norvig's data page, download:
Xkcd's simple words: words.js
Word 1-grams: count_1w.txt
Big file! Make sure the entire thing is downloaded.
Using the set() function
Processing xkcd's words.js
Download from: http://norvig.com/ngrams/
In JavaScript format, used for the xkcd Simple Writer: https://xkcd.com/simplewriter/
Let's process this file into a word list.
How to do this?
Step 1: stare at the file.
Extra stuff at the beginning
Extra stuff at the end
Words are separated by |
Contracted words
Step 2: read in, shed extras.
>>> f = open('words.js')
>>> txt = f.read()
>>> f.close()
>>> txt[:100]
'/**\n *\n * XKCD Simple Writer Word List 0.2.1\n */\nwindow.__WORDS = "understandings|understanding|conv'
>>> txt[:67]
'/**\n *\n * XKCD Simple Writer Word List 0.2.1\n */\nwindow.__WORDS = "'
>>> txt[-10:]
'e|an|i|a";'
>>> txt[-2:]
'";'
>>> chopped = txt[67:-2]
>>> print(chopped[:200])
understandings|understanding|conversations|disappearing|informations|grandmothers|grandfathers|questionings|conversation|information|approaching|understands|immediately|positioning|quest
Middle slice, without the extra stuff at either end
May also need: encoding='utf-8'
Step 3: split away.
>>> chopped[-200:]
"t|mad|low|lot|hot|lip|how|lit|lie|kid|i'm|let|i’m|leg|i'd|i’d|ice|led|act|lay|law|ins|yes|yet|you|its|job|no|at|by|my|on|ha|do|ok|he|oh|is|tv|me|us|as|hi|go|if|of|am|up|to|we|so|in|or|it|be|an|i|a"
>>>
We have to split on ' and | …
Solution: change every ' into |, and then split on |.
>>> xkcd_words = chopped.replace("'", '|').split('|')
>>> xkcd_words[-50:]
['i', 'm', 'let', 'i’m', 'leg', 'i', 'd', 'i’d', 'ice', 'led', 'act', 'lay', 'law', 'ins', 'yes', 'yet', 'you', 'its', 'job', 'no', 'at', 'by', 'my', 'on', 'ha', 'do', 'ok', 'he', 'oh', 'is', 'tv', 'me', 'us', 'as', 'hi', 'go', 'if', 'of', 'am', 'up', 'to', 'we', 'so', 'in', 'or', 'it', 'be', 'an', 'i', 'a']
SUCCESS!
But! Because we uncoupled contracted words, 'i' is now listed twice (or more…). Would be nice to remove these duplicates. (Note that 'i’m' and 'i’d', spelled with a curly apostrophe, survive intact: we only replaced the straight apostrophe.)
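An aside: Python's re module can split on both delimiters in a single step. A minimal sketch, not the method used above; like the replace-then-split trick, it leaves the curly-apostrophe forms alone:

import re

# Split on either the straight apostrophe or the pipe, in one pass.
xkcd_words = re.split(r"['|]", chopped)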
Step 4: remove duplicates.
>>> xkcd_words.count('i')
4
>>> xkcd_words.count('he')
2
>>> len(xkcd_words)
3652
>>> xkcd_words = list(set(xkcd_words))
>>> xkcd_words.count('i')
1
>>> len(xkcd_words)
3630
>>>
Turns the list into a set (duplicates removed!) and then back into a list
Last step: pickle.
>>> import pickle
>>> f = open('xkcd_simple_words.p', 'wb')
>>> pickle.dump(xkcd_words, f, -1)
>>> f.close()
>>>
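Putting it all together: a minimal end-to-end sketch of Steps 2–4 plus pickling (one possible consolidation; the slice indexes 67 and -2 are the ones we found by staring at this particular file):

import pickle

# Read the raw JavaScript file (encoding='utf-8' may be needed).
f = open('words.js', encoding='utf-8')
txt = f.read()
f.close()

# Shed the JS wrapper: the header comment plus variable assignment
# at the start, and the trailing '";' at the end.
chopped = txt[67:-2]

# Uncouple contractions, split on |, and remove duplicates.
xkcd_words = list(set(chopped.replace("'", '|').split('|')))

# Pickle the finished word list.
f = open('xkcd_simple_words.p', 'wb')
pickle.dump(xkcd_words, f, -1)
f.close()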
The set data type
set is a built-in data type in Python. Just like dictionaries, it is built with { } and is orderless.
But unlike dictionaries, it does not have key:value pairs as entries. It has single elements as entries.
Just like dictionaries do not allow duplicate keys, sets do not allow duplicate entries.
>>> cities = {'Boston', 'New York', 'Akron', 'Pittsburgh'}
>>> 'Chicago' in cities
False
>>> medals = {'gold', 'bronze', 'silver', 'silver'}
>>> medals
{'silver', 'gold', 'bronze'}
>>>
Duplicates are quietly ignored.
Using set() to remove duplicates
set is also useful as a type-conversion function: set().
It can be used to remove duplicates!
It returns a set, which can then be converted to some other type: use list() or sorted() to get a list back.
>>> li = [1, 2, 3, 3, 3, 4, 4, 5]
>>> set(li)
{1, 2, 3, 4, 5}
>>> list(set(li))
[1, 2, 3, 4, 5]
>>> li2 = 'rose is a rose is a rose'.split()
>>> li2
['rose', 'is', 'a', 'rose', 'is', 'a', 'rose']
>>> set(li2)
{'rose', 'is', 'a'}
>>> list(set(li2))
['rose', 'is', 'a']
>>> sorted(set(li2))
['a', 'is', 'rose']
output now a list
same, but in sorted order!
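One everyday corpus-linguistics use of this conversion is measuring vocabulary size; a tiny sketch:

toks = 'rose is a rose is a rose'.split()
vocab = sorted(set(toks))        # unique words, in alphabetical order
print(len(toks), 'tokens')       # 7
print(len(vocab), 'unique words')   # 3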
Processing count_1w.txt
Download from http://norvig.com/ngrams/
Data derived from the Google Web Trillion Word Corpus
Essentially unigram frequency data
Top 333K entries, taken from Google's original data (which is much bigger)
Let's process this file into a Python data object.
How to do this?
Huge file. Wait until your browser fully loads the page before hitting "save as"!
Step 1: stare at the file.
One word per line, followed by count
Separated by white space: most likely a TAB
Already sorted by frequency
Step 2: read in as list of lines
>>> f = open('count_1w.txt')
>>> lines = f.readlines()
>>> f.close()
>>> lines[0]
'the\t23135851162\n'
>>> lines[1]
'of\t13151942776\n'
>>> len(lines)
333333
>>>
Because of the "one entry per line" format of the original file, .readlines() is better suited.
May also need: encoding='utf-8'
Step 3: decide on data structure.
>>> lines[1].split()
['of', '13151942776']
>>>
Let's build: a list where each item is a (word, count) tuple.
Note that the count '13151942776' is a string. It must be turned into an integer.
Step 4: experiment with a small copy.
>>> short = lines[:5]
>>> short
['the\t23135851162\n', 'of\t13151942776\n', 'and\t12997637966\n', 'to\t12136980858\n', 'a\t9081174698\n']
>>> short[0].split()
['the', '23135851162']
>>> for s in short:
        li = s.split()
        tu = (li[0], int(li[1]))
        print(tu)

('the', 23135851162)
('of', 13151942776)
('and', 12997637966)
('to', 12136980858)
('a', 9081174698)
>>>
short is a mini version of lines.
Build a (word, count) tuple from each line.
>>> foo = []
>>> for s in short:
        li = s.split()
        tu = (li[0], int(li[1]))
        foo.append(tu)

>>> foo
[('the', 23135851162), ('of', 13151942776), ('and', 12997637966), ('to', 12136980858), ('a', 9081174698)]
>>> foo[0]
('the', 23135851162)
>>> foo[1]
('of', 13151942776)
>>>
foo looks good.
Mini version of the big list we're building
Step 5: build the real thing.
>>> goog_list = []
>>> for s in lines:
        li = s.split()
        tu = (li[0], int(li[1]))
        goog_list.append(tu)

>>> goog_list[:10]
[('the', 23135851162), ('of', 13151942776), ('and', 12997637966), ('to', 12136980858), ('a', 9081174698), ('in', 8469404971), ('for', 5933321709), ('is', 4705743816), ('on', 3750423199), ('that', 3400031103)]
>>> goog_list[100]
('price', 501651226)
>>> goog_list[1000]
('stay', 80694073)
>>> len(goog_list)
333333
>>>
DONE!
The real deal
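For the record, the same list can be built in one line with a list comprehension; a sketch of an equivalent (the loop above is just as good):

# Each 'word\tcount\n' line becomes a (word, int(count)) tuple.
goog_list = [(w, int(c)) for w, c in (s.split() for s in lines)]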
Alternate data format: dictionary
But suppose we want to know where 'platypus' ranks… You cannot look up a word in this list!
A dictionary is a better data format for this purpose.
Let's build goog_dict: word as key, (rank, count) as value.
Step 4': experiment with a small copy.
>>> short = goog_list[:5]
>>> short
[('the', 23135851162), ('of', 13151942776), ('and', 12997637966), ('to', 12136980858), ('a', 9081174698)]
>>> for i in range(len(short)):
        print(i, short[i])

0 ('the', 23135851162)
1 ('of', 13151942776)
2 ('and', 12997637966)
3 ('to', 12136980858)
4 ('a', 9081174698)
>>>
Build from goog_list this time.
We need the indexes 0, 1, 2, …: use range(len(li)) to produce them.
>>> for i in range(len(short)):
        print(i, short[i][0], short[i][1])

0 the 23135851162
1 of 13151942776
2 and 12997637966
3 to 12136980858
4 a 9081174698
>>> foo = {}
>>> for i in range(len(short)):
        word = short[i][0]
        count = short[i][1]
        rank = i + 1
        foo[word] = (rank, count)

>>> foo
{'a': (5, 9081174698), 'the': (1, 23135851162), 'to': (4, 12136980858), 'and': (3, 12997637966), 'of': (2, 13151942776)}
>>> foo['and']
(3, 12997637966)
>>>
Add 1 to index to get rank
foo looks good.
Mini version of the big dictionary we're building
Step 5': build the real thing.
>>> goog_dict = {}
>>> for i in range(len(goog_list)):
        word = goog_list[i][0]
        count = goog_list[i][1]
        rank = i + 1
        goog_dict[word] = (rank, count)

>>> goog_dict['important']
(573, 136103455)
>>> goog_dict['platypus']
(36770, 565585)
>>> goog_dict['pittsburgh']
(3733, 19654781)
>>> goog_dict['philadelphia']
(2631, 30179898)
>>> goog_dict['cleveland']
(3813, 19041185)
>>>
DONE!
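As an alternative to indexing with range(len(...)), enumerate() hands you the index and the item together; a sketch, with start=1 so the index doubles as the rank:

goog_dict = {}
# enumerate() yields (rank, item) pairs; each item is a (word, count) tuple.
for rank, (word, count) in enumerate(goog_list, start=1):
    goog_dict[word] = (rank, count)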
Last step: pickle both data.
>>> import pickle
>>> f = open('google_unigram_list.p', 'wb')
>>> pickle.dump(goog_list, f, -1)
>>> f.close()
>>>
>>> f2 = open('google_unigram_dict.p', 'wb')
>>> pickle.dump(goog_dict, f2, -1)
>>> f2.close()
>>>
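To load the pickled data back in a later session, open the file in 'rb' mode and call pickle.load(); a minimal sketch:

import pickle

# Read the pickled list back into memory.
f = open('google_unigram_list.p', 'rb')
goog_list = pickle.load(f)
f.close()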
Beyond a single, short text
So far, we have been handling relatively short texts, one at a time.
Going multiple: find out what's involved in processing a text archive of multiple text files (a.k.a. a corpus). Let's try this today.
Going big: find out what's involved in processing HUMONGOUS text files.
Processing multiple texts
From the NLTK Corpora page, download:
C-Span Inaugural Address Corpus
http://www.nltk.org/nltk_data/
The C-Span Inaugural Address Corpus
Includes 56 past presidential inaugural addresses, from 1789 (Washington) to 2009 (Obama).
The directory has 56 .txt files and one README file.
QUESTION: How do we effectively process this many files?
Corpus vs. sub-corpora
[Diagram: the Entire Corpus divided into Sub-corpus 1 and Sub-corpus 2]
Big token lists for sub-corpora
[Diagram: the texts in each sub-corpus are pooled into one big token list: "sub-corpus 1 TOKENS" and "sub-corpus 2 TOKENS"]
Good when individual texts don't need separate attention.
Pools & individual token lists
[Diagram: each text is tokenized into its own token list, and the individual token lists also feed the pools "sub-corpus 1 TOKENS" and "sub-corpus 2 TOKENS"]
Individual token lists as well as sub-corpus pools.
Using glob
glob: a file-name globbing utility
Returns a list of file names that match the specified pattern
>>> import glob
>>> files = glob.glob(r'D:\Lab\inaugural\*.txt')
>>> len(files)
56
>>> files[:5]
['D:\\Lab\\inaugural\\1789-Washington.txt', 'D:\\Lab\\inaugural\\1793-Washington.txt', 'D:\\Lab\\inaugural\\1797-Adams.txt', 'D:\\Lab\\inaugural\\1801-Jefferson.txt', 'D:\\Lab\\inaugural\\1805-Jefferson.txt']
>>> files[-1]
'D:\\Lab\\inaugural\\2009-Obama.txt'
>>>
All files ending in .txt; excludes the README file.
Using glob
Addresses from the 1800s only
>>> files2 = glob.glob(r'D:\Lab\inaugural\18*.txt')
>>> len(files2)
25
>>> files2[:5]
['D:\\Lab\\inaugural\\1801-Jefferson.txt', 'D:\\Lab\\inaugural\\1805-Jefferson.txt', 'D:\\Lab\\inaugural\\1809-Madison.txt', 'D:\\Lab\\inaugural\\1813-Madison.txt', 'D:\\Lab\\inaugural\\1817-Monroe.txt']
>>> files2[-1]
'D:\\Lab\\inaugural\\1897-McKinley.txt'
>>>
All files starting with '18' and ending with '.txt'
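If you would rather not hard-code a Windows-style pattern string, pathlib offers an equivalent; a sketch, assuming the same D:\Lab\inaugural directory:

from pathlib import Path

# Equivalent of glob.glob(r'D:\Lab\inaugural\*.txt'); the items are
# Path objects, which open() accepts directly.
files = sorted(Path(r'D:\Lab\inaugural').glob('*.txt'))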
Build dictionary of texts
For-loop through the file names and build a dictionary with each file name as key and its text content as value.
>>> files[0]
'D:\\Lab\\inaugural\\1789-Washington.txt'
>>> files[0][12:-4]
'ural\\1789-Washington'
>>> files[0][17:-4]
'1789-Washington'
>>> files[2][17:-4]
'1797-Adams'
>>> fn2txt = {}
>>> for longname in files:
        f = open(longname)
        txt = f.read()
        f.close()
        fname = longname[17:-4]
        fn2txt[fname] = txt

>>> fn2txt['1809-Madison'][:40]
'Unwilling to depart from examples of the'
>>> fn2txt['1789-Washington'][:40]
'Fellow-Citizens of the Senate and of the'
fn2txt: file name as key, text string as value
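The slice [17:-4] is tied to this particular directory path. A more portable way to peel off the directory and the .txt extension, sketched with os.path (this works with the Windows paths used here):

import os

longname = 'D:\\Lab\\inaugural\\1789-Washington.txt'
# basename drops the directory; splitext drops the extension.
fname = os.path.splitext(os.path.basename(longname))[0]   # '1789-Washington'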
Treating files as a single corpus
Task: Compile word frequency of the Inaugural Speeches.
>>> import textstats
>>> alltoks = []
>>> for txt in fn2txt.values():
        toks = textstats.getTokens(txt)
        alltoks.extend(toks)

>>> len(alltoks)
145774
>>> alltoks[:15]
['fellow', '-', 'citizens', 'of', 'the', 'senate', 'and', 'of', 'the', 'house', 'of', 'representatives', ':', 'among', 'the']
>>> alltoks[-15:]
['you', '.', 'god', 'bless', 'you', '.', 'and', 'god', 'bless', 'the', 'united', 'states', 'of', 'america', '.']
For this, we only need to build a single pool of tokenized words.
For each text, tokenize it, and then add the result to the pool of tokenized words.
Word frequency of entire corpus
>>> allfreq = textstats.getFreq(alltoks)
>>> allfreq['citizens']
237
>>> allfreq['battle']
12
>>> for k in sorted(allfreq, key=allfreq.get, reverse=True)[:10]:
        print(k, allfreq[k])

the 9906
of 6986
, 6862
and 5139
. 4749
to 4432
in 2749
a 2193
our 2058
that 1726
>>>
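textstats is this course's own helper module; with the standard library alone, collections.Counter does the same job. A sketch, assuming alltoks is the pooled token list:

from collections import Counter

allfreq = Counter(alltoks)                     # token -> frequency
print(allfreq['citizens'])
for word, count in allfreq.most_common(10):    # top 10, highest first
    print(word, count)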
Treating files as a single corpus, take 2
Task: Compile word frequency of the Inaugural Speeches.
>>> alltxt = '\n'.join(fn2txt.values())
>>> alltoks = textstats.getTokens(alltxt)
>>> len(alltoks)
145774
>>> alltoks[:15]
['fellow', '-', 'citizens', 'of', 'the', 'senate', 'and', 'of', 'the', 'house', 'of', 'representatives', ':', 'among', 'the']
Alternative approach: join all text strings into a single gigantic text string, and then tokenize it all at once.
alltxt holds all speech texts, concatenated with a line break in between.
Processing each text
Task: Compute the average sentence length for each presidential address.
We have to build a separate token list for each speech.
>>> fn2toks = {}
>>> for (fn, txt) in fn2txt.items():
        toks = textstats.getTokens(txt)
        fn2toks[fn] = toks

>>> len(fn2toks)
56
>>> fn2toks['1789-Washington']
['fellow', '-', 'citizens', 'of', 'the', 'senate', 'and', 'of', 'the', 'house', 'of', 'representatives', ':', ...
>>> fn2toks['2001-Bush'][:10]
['president', 'clinton', ',', 'distinguished', 'guests', 'and', 'my', 'fellow', 'citizens', ',']
fn2toks: file name as key, token list as value
Average sentence length, per address
>>> for fn in sorted(fn2toks):
        toks = fn2toks[fn]
        sentcount = toks.count('.') + toks.count('!') + toks.count('?')
        avgsentlen = len(toks)/sentcount
        print(avgsentlen, '\t', fn)

66.9130434783 	 1789-Washington
36.75 	 1793-Washington
69.8648648649 	 1797-Adams
47.1951219512 	 1801-Jefferson
52.9777777778 	 1805-Jefferson
60.2380952381 	 1809-Madison
...
18.824742268 	 2001-Bush
23.3939393939 	 2005-Bush
24.7909090909 	 2009-Obama
>>>
Assumes every sentence ends with '.', '!', or '?'
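The three .count() calls can also be folded into one expression with sum(); a small sketch of the same computation:

# Count all sentence-final punctuation marks by summing over the enders.
sentcount = sum(toks.count(p) for p in '.!?')
avgsentlen = len(toks) / sentcount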
HW 5A: Two Presidents
George W. Bush:
Thank you very much. Mr. Speaker, Vice President Cheney, Members of Congress, distinguished guests, fellow citizens: As we gather tonight, our Nation is at war; our economy is in recession; and the civilized world faces unprecedented dangers. Yet, the state of our Union has never been stronger.
We last met in an hour of shock and suffering. In 4 short months, our Nation has comforted the victims, begun to rebuild New York and the Pentagon, rallied a great coalition, captured, arrested, and rid the world of thousands of terrorists, destroyed Afghanistan's terrorist training camps, saved a people from starvation, and freed a country from brutal oppression.
The American flag flies again over our …

Barack Obama:
Madam Speaker, Vice President Biden, Members of Congress, distinguished guests, and fellow Americans: Our Constitution declares that from time to time, the President shall give to Congress information about the state of our Union. For 220 years, our leaders have fulfilled this duty. They've done so during periods of prosperity and tranquility, and they've done so in the midst of war and depression, at moments of great strife and great struggle.
It's tempting to look back on these moments and assume that our progress was inevitable, that America was always destined to succeed. But when the Union was turned back at Bull Run and the Allies first landed at Omaha Beach, victory was very much in doubt. …
HW 5B: Two EFL Corpora
Bulgarian Students:
It is time, that our society is dominated by industrialization. The prosperity of a country is based on its enormous industrial corporations that are gradually replacing men with machines. Science is highly developed and controls the economy. From the beginning of school life students are expected to master a huge amount of scientific data. Technology is part of our everyday life.
Children nowadays prefer to play with computers rather than with our parents' wooden toys. But I think that in our modern world which worships science and technology there is still a place for dreams and imagination.
There has always been a place for them in man's life. Even in the darkness of the …

Japanese Students:
I agree greatly this topic mainly because I think that English becomes an official language in the not too distant. Now, many people can speak English or study it all over the world, and so more people will be able to speak English. Before the Japanese fall behind other people, we should be able to speak English, therefore, we must study English not only junior high school students or over but also pupils. Japanese education system is changing such a program. In this way, Japan tries to internationalize rapidly. However, I think this way won't suffice for becoming international humans. To becoming international humans, we should study English not only school but also daily life. If we can do it, we are able to master English conversation. It is important for us to master English honorific words. …
Wrapping up
Next class:
Introduction to corpora
How to do corpus analysis
Homework 5: Corpus analysis
One week long. You will have a choice between:
- Bush vs. Obama SOU speech corpus
- Bulgarian vs. Japanese EFL learner corpus
Recitation students: work on PART 1 before tomorrow
START EARLY!!!
Midterm exam
2/21 (Tuesday)
At LMC's PC lab (CL G17)
More room!
ALL pencil-and-paper exam questions!