Natural Language Processing with Python

CS372: Spring, 2021

Lecture 3: Accessing Text Corpora and Lexical Resources

Jong C. Park

School of Computing

Korea Advanced Institute of Science and Technology

ACCESSING TEXT CORPORA AND LEXICAL RESOURCES

• Accessing Text Corpora
• Conditional Frequency Distributions
• More Python: Reusing Code
• Lexical Resources
• WordNet

2021-03-09 CS372: NLP with Python

Introduction

Questions
• What are some useful text corpora and lexical resources, and how can we access them with Python?
• Which Python constructs are most helpful for this work?
• How do we avoid repeating ourselves when writing Python code?

Accessing Text Corpora

• Gutenberg Corpus
• Web and Chat Text
• Brown Corpus
• Reuters Corpus
• Inaugural Address Corpus
• Annotated Text Corpora
• Corpora in Other Languages
• Text Corpus Structure
• Loading Your Own Corpus

Gutenberg Corpus

The Project Gutenberg electronic text archive
• contains some 25,000 electronic books
• http://www.gutenberg.org/

Average sentence length and lexical diversity appear to be characteristics of particular authors.

The sents() function divides the text into its sentences, where each sentence is a list of words.

Most NLTK corpus readers include a variety of access methods in addition to words(), raw(), and sents().
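The three access methods can be combined to compute the per-text statistics mentioned above. The following is a minimal sketch, not the code from the original slide: text_stats is a hypothetical helper (not part of NLTK) that takes the kinds of values raw(), words(), and sents() return, and the toy text stands in for a real corpus file. The character count includes whitespace, so the average word length is slightly overstated.

```python
def text_stats(raw, words, sents):
    """Return (avg word length, avg sentence length, lexical diversity).

    raw   -- the text as one string (what raw() returns)
    words -- a list of word tokens (what words() returns)
    sents -- a list of sentences, each a list of words (what sents() returns)
    """
    num_chars = len(raw)
    num_words = len(words)
    num_sents = len(sents)
    num_vocab = len(set(w.lower() for w in words))
    return (num_chars / num_words, num_words / num_sents, num_words / num_vocab)


def gutenberg_stats():
    """Print the three statistics for every Gutenberg text (needs NLTK and its data)."""
    from nltk.corpus import gutenberg  # lazy import so the toy demo runs without NLTK
    for fileid in gutenberg.fileids():
        stats = text_stats(gutenberg.raw(fileid),
                           gutenberg.words(fileid),
                           gutenberg.sents(fileid))
        print(fileid, [round(s) for s in stats])


# Toy stand-in for one corpus file (invented for illustration).
toy_sents = [["The", "cat", "sat", "."], ["The", "cat", "ran", "."]]
toy_words = [w for s in toy_sents for w in s]
toy_raw = "The cat sat. The cat ran."
avg_word_len, avg_sent_len, diversity = text_stats(toy_raw, toy_words, toy_sents)
print(avg_word_len, avg_sent_len, diversity)
```

With the NLTK data installed, calling gutenberg_stats() would apply the same helper to every file in the Gutenberg Corpus.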

Web and Chat Text

NLTK’s collection of web text includes
• content from a Firefox discussion forum;
• conversations overheard in New York;
• the movie script of Pirates of the Caribbean;
• personal advertisements; and
• wine reviews.

A corpus of instant messaging chat sessions:
• originally collected by the Naval Postgraduate School (nps) for research on automatic detection of Internet predators;
• contains over 10,000 posts, anonymized by replacing usernames with generic names of the form “UserNNN”, and manually edited to remove any other identifying information;
• organized into 15 files, where each file contains several hundred posts collected on a given date, for an age-specific chatroom (teens, 20s, 30s, 40s, plus a generic adults chatroom).
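The file organization described above is visible in the fileids themselves. As an illustration, the sketch below splits a fileid such as 10-19-20s_706posts.xml (706 posts from the 20s chatroom, gathered on 10/19) into its parts. parse_chat_fileid is a hypothetical helper, and the fileid pattern is an assumption inferred from that single example.

```python
import re

# Assumed pattern: MM-DD-<agegroup>_<N>posts.xml, inferred from "10-19-20s_706posts.xml".
FILEID_PATTERN = re.compile(r"(\d\d)-(\d\d)-([0-9a-z]+)_(\d+)posts\.xml")


def parse_chat_fileid(fileid):
    """Split a chat-corpus fileid into (month, day, age_group, number_of_posts)."""
    match = FILEID_PATTERN.fullmatch(fileid)
    if match is None:
        raise ValueError(f"unexpected fileid: {fileid!r}")
    month, day, age_group, count = match.groups()
    return int(month), int(day), age_group, int(count)


def load_posts(fileid):
    """Return the posts of one chat session as lists of tokens (needs NLTK and its data)."""
    from nltk.corpus import nps_chat  # lazy import: the parser above works without NLTK
    return nps_chat.posts(fileid)


info = parse_chat_fileid("10-19-20s_706posts.xml")
print(info)
```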

Brown Corpus

The Brown Corpus
• the first million-word electronic corpus of English;
• created in 1961 at Brown University;
• contains text from 500 sources; and
• the sources have been categorized by genre, such as news, editorial, and so on.

See http://icame.uib.no/brown/bcm-los.html for a complete list of genres.

We can access the corpus as a list of words or a list of sentences.
• We may optionally specify particular categories or files to read.

It is a convenient resource for studying systematic differences between genres, a kind of linguistic inquiry known as stylistics.
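One way to make such genre comparisons concrete is to count a fixed set of words, for example modal verbs, within a single genre. The sketch below is a minimal illustration under stated assumptions: count_targets is a hypothetical helper that uses collections.Counter in place of nltk.FreqDist, and the toy word list is invented and stands in for brown.words(categories='news').

```python
from collections import Counter

MODALS = ["can", "could", "may", "might", "must", "will"]


def count_targets(words, targets):
    """Count how often each target word occurs, ignoring case."""
    counts = Counter(w.lower() for w in words)
    return {t: counts[t] for t in targets}


def brown_modal_counts(genre="news"):
    """Modal counts for one Brown genre (needs NLTK and its data)."""
    from nltk.corpus import brown  # lazy import so the toy demo runs without NLTK
    return count_targets(brown.words(categories=genre), MODALS)


# Toy word list, invented for illustration only.
toy_words = "We can go . You must stay , but you Can call and I will wait".split()
toy_counts = count_targets(toy_words, MODALS)
print(toy_counts)
```

Swapping MODALS for another word list (for instance wh-words) gives a different stylistic probe with the same helper.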

Is there any other selection of words that one can try for similar stylistics?

Computing counts for each genre of interest.
• Use NLTK’s support for conditional frequency distributions.
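A per-genre table of counts can be sketched as follows. This is an illustration rather than the slide's original code: conditional_counts is a hypothetical stdlib stand-in for what nltk.ConditionalFreqDist builds, tabulate_brown shows the corresponding NLTK calls, and the toy corpus is invented.

```python
from collections import Counter, defaultdict


def conditional_counts(pairs):
    """Build {condition: Counter(events)} from (condition, event) pairs."""
    table = defaultdict(Counter)
    for condition, event in pairs:
        table[condition][event] += 1
    return table


def tabulate_brown(genres, samples):
    """Print a genre-by-word table for the Brown Corpus (needs NLTK and its data)."""
    import nltk
    from nltk.corpus import brown
    cfd = nltk.ConditionalFreqDist(
        (genre, word)
        for genre in genres
        for word in brown.words(categories=genre)
    )
    cfd.tabulate(conditions=genres, samples=samples)


# Toy data: genre -> words, invented for illustration only.
toy_corpus = {
    "news": "the council will meet and will vote".split(),
    "romance": "she could not say what she could do".split(),
}
toy_table = conditional_counts(
    (genre, word) for genre, words in toy_corpus.items() for word in words
)
for genre in toy_corpus:
    print(genre, [toy_table[genre][m] for m in ("could", "will")])
```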

What kind of observations can we make?

Reuters Corpus

The Reuters Corpus
• It contains 10,788 news documents totaling 1.3 million words.
• The documents are classified into 90 topics, and grouped into two sets, “training” and “test”.
• For example, the text with fileid ‘test/14826’ is a document drawn from the test set.
• The split is for training and testing algorithms that automatically detect the topic of a document.
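Because the set name is the part of each fileid before the slash, the training/test split can be recovered from the fileids alone. A minimal sketch, assuming fileids of the form 'test/14826'; split_by_set is a hypothetical helper, and the four fileids are only a small sample.

```python
from collections import defaultdict


def split_by_set(fileids):
    """Group fileids such as 'test/14826' by the set named before the slash."""
    groups = defaultdict(list)
    for fileid in fileids:
        set_name, _, _doc_id = fileid.partition("/")
        groups[set_name].append(fileid)
    return dict(groups)


def reuters_split():
    """Split the real Reuters fileids (needs NLTK and its data)."""
    from nltk.corpus import reuters  # lazy import so the demo runs without NLTK
    return split_by_set(reuters.fileids())


# A small sample of fileids, used here only to exercise the helper.
sample_ids = ["test/14826", "test/14828", "training/9865", "training/9880"]
groups = split_by_set(sample_ids)
print({name: len(ids) for name, ids in groups.items()})
```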

Unlike the Brown Corpus, categories in the Reuters Corpus overlap with each other. Why?

We can specify the words or sentences we want in terms of files or categories.
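The overlap between categories can be seen by looking up categories per document and documents per category. The sketch below is illustrative only: invert is a hypothetical helper, the document-to-topic mapping is a toy sample whose labels are assumed, and reuters_lookup shows the NLTK calls (reuters.categories and reuters.fileids) that would supply the real data.

```python
from collections import defaultdict


def invert(doc_to_categories):
    """Turn {document: [categories]} into {category: [documents]}."""
    cat_to_docs = defaultdict(list)
    for doc, categories in doc_to_categories.items():
        for category in categories:
            cat_to_docs[category].append(doc)
    return dict(cat_to_docs)


def reuters_lookup(fileid, category):
    """Categories of one document and documents of one category (needs NLTK data)."""
    from nltk.corpus import reuters  # lazy import so the toy demo runs without NLTK
    return reuters.categories(fileid), reuters.fileids(category)


# Toy sample: topic labels here are assumed for illustration.
toy_topics = {
    "training/9865": ["barley", "corn", "grain", "wheat"],
    "training/9880": ["money-fx"],
}
by_category = invert(toy_topics)
multi_topic = [doc for doc, cats in toy_topics.items() if len(cats) > 1]
print(by_category["grain"], multi_topic)
```

A document that appears under several categories is what makes the categories overlap; the words() and sents() methods accept the same fileids or categories arguments.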

Inaugural Address Corpus

The Inaugural Address Corpus
• a collection of 55 texts, one for each presidential address;
• its time dimension is an interesting property.

‘2021-Biden.txt’ is not yet available.

Looking at how the words America and citizen are used over time.
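Since each fileid starts with the year of the address (for example 1789-Washington.txt), the first four characters can serve as the time axis. The following is a sketch under stated assumptions: counts_by_year is a hypothetical helper that counts words starting with each target, and the toy word lists are invented rather than taken from the actual addresses.

```python
from collections import Counter, defaultdict

TARGETS = ["america", "citizen"]


def counts_by_year(texts, targets=TARGETS):
    """Count words starting with each target, keyed by target and then by year.

    texts maps a fileid such as '1789-Washington.txt' to its list of words.
    """
    table = defaultdict(Counter)
    for fileid, words in texts.items():
        year = fileid[:4]  # the fileid begins with the four-digit year
        for word in words:
            for target in targets:
                if word.lower().startswith(target):
                    table[target][year] += 1
    return table


def inaugural_counts():
    """Apply counts_by_year to the real corpus (needs NLTK and its data)."""
    from nltk.corpus import inaugural  # lazy import so the toy demo runs without NLTK
    return counts_by_year({f: inaugural.words(f) for f in inaugural.fileids()})


# Toy word lists, invented for illustration only.
toy_texts = {
    "1789-Washington.txt": "Fellow Citizens of the Senate".split(),
    "2009-Obama.txt": "My fellow citizens , America is ready ; Americans know this".split(),
}
toy = counts_by_year(toy_texts)
print(dict(toy["america"]), dict(toy["citizen"]))
```

Matching on the prefix means that forms such as Americans and Citizens are counted together with America and citizen.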

Annotated Text Corpora

Many text corpora contain linguistic annotations, representing part-of-speech tags, named entities, syntactic structures, semantic roles, and so forth.
• Consult http://www.nltk.org/data for information about downloading them.

Corpora in Other Languages

NLTK comes with corpora for many languages, though in some cases we need to learn how to manipulate character encodings in Python.

• floresta: the Floresta Sinta(c)tica Corpus, http://www.linguateca.pt/Floresta/ (cf. http://nltk.googlecode.com/svn/trunk/doc/howto/portuguese_en.html)
• cess_esp: the CESS-ESP Treebank, with 6030 parsed sentences
• indian: bangla, hindi, marathi, telugu

The corpus, udhr, contains the Universal Declaration of Human Rights in over 300 languages.
• The fileids include information about the character encoding used in the file, such as UTF8 or Latin1.

Words having five or fewer letters account for about 80% of Ibibio text, 60% of German text, and 25% of Inuktitut text.
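Percentages like these come from a cumulative word-length distribution. A minimal sketch of the calculation: cumulative_length_share is a hypothetical helper, the toy sentence is only a stand-in for a udhr file, and udhr_share shows how the real words would be supplied (the default fileid 'English-Latin1' follows the language-plus-encoding naming noted above).

```python
from collections import Counter


def cumulative_length_share(words, max_len):
    """Fraction of tokens whose length is max_len letters or fewer."""
    lengths = Counter(len(w) for w in words)
    short = sum(n for length, n in lengths.items() if length <= max_len)
    return short / sum(lengths.values())


def udhr_share(fileid="English-Latin1", max_len=5):
    """Share of short words in one udhr file (needs NLTK and its data)."""
    from nltk.corpus import udhr  # lazy import so the toy demo runs without NLTK
    return cumulative_length_share(udhr.words(fileid), max_len)


# Toy sentence standing in for a corpus file.
toy_words = "all human beings are born free and equal in dignity and rights".split()
share = cumulative_length_share(toy_words, 5)
print(share)
```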

Text Corpus Structure

Common structures

There is a difference between some of the corpus access methods:

Loading Your Own Corpus

Load your own collection of text files.

Use your own path to replace /usr/share/dict.
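Loading a directory of text files might look like the sketch below. It assumes NLTK's PlaintextCorpusReader and builds a throwaway two-file corpus in a temporary directory so that it is self-contained; open_corpus, the file names, and the file contents are hypothetical. The fallback branch exists only so the sketch still runs where NLTK is not installed.

```python
import os
import tempfile


def open_corpus(corpus_root, pattern=r".*\.txt"):
    """Wrap a directory of text files in an NLTK corpus reader (needs NLTK)."""
    from nltk.corpus import PlaintextCorpusReader
    return PlaintextCorpusReader(corpus_root, pattern)


# Build a throwaway two-file corpus; in practice corpus_root is your own directory.
corpus_root = tempfile.mkdtemp()
samples = [("first.txt", "Hello corpus world."), ("second.txt", "Another small file.")]
for name, text in samples:
    with open(os.path.join(corpus_root, name), "w", encoding="utf-8") as handle:
        handle.write(text)

try:
    reader = open_corpus(corpus_root)
    fileids = reader.fileids()
    words = list(reader.words("first.txt"))
except (ImportError, LookupError):
    # NLTK (or one of its resources) is unavailable: just list the files.
    fileids = sorted(os.listdir(corpus_root))
    words = None
print(fileids, words)
```

The second argument is a regular expression over file names, so '.*' would select every file under the root.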

Another example

BracketParseCorpusReader: a corpus reader for corpora that consist of parenthesis-delineated parse trees.

Use your own path to replace /corpora/penntreebank/parsed/mrg/wsj.

Conditional Frequency Distributions

• Conditions and Events
• Counting Words by Genre
• Plotting and Tabulating Distributions
• Generating Random Text with Bigrams

Conditions and Events

While a frequency distribution counts observable events, a conditional frequency distribution needs to pair each event with a condition.
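The difference between the two kinds of distribution can be sketched with the standard library alone. Here the condition is a genre and the event is a word; the pairs are a toy sample, and collections.Counter stands in for NLTK's classes (nltk.ConditionalFreqDist accepts a sequence of pairs of this shape).

```python
from collections import Counter, defaultdict

# (condition, event) pairs: the condition is a genre, the event is a word.
# Toy sample for illustration only.
pairs = [
    ("news", "The"), ("news", "Fulton"), ("news", "County"),
    ("romance", "They"), ("romance", "neither"), ("romance", "The"),
]

# A plain frequency distribution sees only the events.
plain = Counter(event for _condition, event in pairs)

# A conditional one keeps a separate distribution for each condition.
conditional = defaultdict(Counter)
for condition, event in pairs:
    conditional[condition][event] += 1

conditions = sorted(conditional)

# Summing over all conditions recovers the plain distribution.
merged = sum(conditional.values(), Counter())
print(conditions, plain["The"], conditional["news"]["The"], merged == plain)
```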

Counting Words by Genre

Plotting and Tabulating Distributions

1,638 words of the English text have nine or fewer letters.

Generating Random Text with Bigrams

Create a table of bigrams using a conditional frequency distribution.

Example 2-1. Generating random text
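The slide's code for this example is not reproduced here, so the following is a sketch of the idea rather than the original listing. It treats each word as a condition and the following word as the event, then repeatedly follows the most frequent successor. bigram_table and generate are hypothetical stdlib helpers, the toy text is invented, and genesis_table shows how an equivalent table could be built with nltk.bigrams and nltk.ConditionalFreqDist.

```python
from collections import Counter, defaultdict


def bigram_table(words):
    """Map each word to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for first, second in zip(words, words[1:]):
        table[first][second] += 1
    return table


def generate(table, word, num=15):
    """Follow the most frequent successor up to num times, starting from word."""
    output = []
    for _ in range(num):
        output.append(word)
        if not table[word]:
            break  # dead end: this word was never followed by anything
        word = table[word].most_common(1)[0][0]
    return output


def genesis_table():
    """Bigram table for the Genesis text (needs NLTK and its data)."""
    import nltk
    words = nltk.corpus.genesis.words("english-kjv.txt")
    return nltk.ConditionalFreqDist(nltk.bigrams(words))


# Toy text, invented for illustration only.
toy_text = ("in the beginning god created the heaven and the earth "
            "and the earth was without form").split()
toy_table = bigram_table(toy_text)
print(" ".join(generate(toy_table, "god", 4)))
```

Always choosing the single most likely successor tends to get stuck in loops on real text, which is a known limitation of this simple approach.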

Summary (1/2)

Accessing Text Corpora
• Gutenberg Corpus
• Web and Chat Text
• Brown Corpus
• Reuters Corpus
• Inaugural Address Corpus
• Annotated Text Corpora
• Corpora in Other Languages
• Text Corpus Structure
• Loading Your Own Corpus

Summary (2/2)

Conditional Frequency Distributions
• Conditions and Events
• Counting Words by Genre
• Plotting and Tabulating Distributions
• Generating Random Text with Bigrams
