NLTK & basic text stats Day 19 - 10/08/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University


NLTK & basic text stats

Day 19 - 10/08/14

LING 3820 & 6820

Natural Language Processing

Harry Howard

Tulane University


Course organization

08-Oct-2014, NLP, Prof. Howard, Tulane University


http://www.tulane.edu/~howard/LING3820/

The syllabus is under construction.

http://www.tulane.edu/~howard/CompCultEN/

Chapter numbering
3.7. How to deal with non-English characters
4.5. How to create a pattern with Unicode characters
6. Control


Review of scripts & functions

The quiz as a function in a script


Open Spyder


NLTK

Could you download the archive?


Loading the book's texts

>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
>>>


Searching text

Show every token of a word in context, called concordance view:

>>> text1.concordance('monstrous')

Show the words that appear in a similar range of contexts:
>>> text1.similar('monstrous')

Show the contexts that two words share:
>>> text1.common_contexts(['whale','man'])
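
To make the idea of a concordance view concrete, here is a minimal sketch in plain Python. It is not NLTK's own implementation: the function name `simple_concordance`, its `width` parameter, and the toy token list are all assumptions made for illustration.

```python
def simple_concordance(tokens, word, width=3):
    """Return each occurrence of `word` with up to `width` tokens of context per side."""
    word = word.lower()
    lines = []
    for i, token in enumerate(tokens):
        # Compare case-insensitively, so 'Monstrous' would also match
        if token.lower() == word:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(' '.join(left + [token] + right))
    return lines

# Toy token list (hypothetical, not taken from text1)
tokens = ['the', 'monstrous', 'whale', 'swam', 'past',
          'a', 'most', 'monstrous', 'size']

for line in simple_concordance(tokens, 'monstrous'):
    print(line)
```

Each printed line is one token of the word together with its neighbours, which is the information a concordance view lines up for comparison.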


Searching text, cont.

Plot how far each token of a word is from the beginning of a text:
>>> text1.dispersion_plot(['monstrous'])

Generate random text:
>>> text1.generate()
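
A dispersion plot marks the position of each token of a word, counted from the start of the text. The sketch below computes only those positions and does no plotting; the helper name `token_offsets` and the toy token list are assumptions for illustration, not part of NLTK.

```python
def token_offsets(tokens, word):
    """Return the 0-based positions at which `word` occurs in `tokens`."""
    return [i for i, token in enumerate(tokens) if token == word]

# Toy token list (hypothetical, not taken from text1)
tokens = ['call', 'me', 'ishmael', '.', 'some', 'years', 'ago', 'call', 'me']

offsets = token_offsets(tokens, 'call')
print(offsets)
```

Each number in the resulting list corresponds to one mark along the horizontal axis of a dispersion plot.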


Counting vocabulary

Count the word and punctuation tokens in a text:
>>> len(text1)

List the unique words, i.e. the word types, in a text:
>>> set(text1)

Count how many types there are in a text:
>>> len(set(text1))

Count the tokens of a word type:
>>> text1.count('smote')
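
The same four counts can be tried on a plain list of tokens, which is useful when the book's texts are not loaded. This is a small sketch with a made-up token list; the variable names are assumptions for illustration. A Python list offers `len()`, `set()` and `.count()` just as the calls above use them on `text1`.

```python
# Toy token list (hypothetical); punctuation marks count as tokens too
tokens = ['the', 'whale', 'smote', 'the', 'boat', ',',
          'and', 'the', 'boat', 'sank', '.']

n_tokens = len(tokens)        # all word and punctuation tokens
types = set(tokens)           # the unique tokens, i.e. the types
n_types = len(types)          # how many types there are
n_the = tokens.count('the')   # tokens of the type 'the'

print('tokens: %d' % n_tokens)
print('types: %d' % n_types)
print("tokens of 'the': %d" % n_the)
```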


Lexical richness or diversity

The lexical richness or diversity of a text can be estimated as tokens per type. In Python 2, '/' between two integers does integer division, so import true division first:
>>> from __future__ import division
>>> len(text1) / len(set(text1))

The frequency of a type can be estimated as its tokens per all tokens, here as a percentage:
>>> 100 * text1.count('a') / len(text1)
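
Both calculations can be wrapped in small functions so they are easy to reuse on any text. This is a sketch under stated assumptions: it assumes Python 3, where '/' already performs true division (under Python 2 the `from __future__ import division` line would be needed first), and the names `lexical_diversity` and `percentage` as well as the toy token list are chosen here for illustration.

```python
def lexical_diversity(tokens):
    """Average number of tokens per type."""
    return len(tokens) / len(set(tokens))

def percentage(count, total):
    """Share of `count` in `total`, expressed as a percentage."""
    return 100 * count / total

# Toy token list (hypothetical, not taken from text1)
tokens = ['a', 'whale', 'is', 'a', 'whale', 'and', 'a', 'fish']

diversity = lexical_diversity(tokens)                     # 8 tokens over 5 types
share_of_a = percentage(tokens.count('a'), len(tokens))   # 3 of 8 tokens

print(diversity)
print(share_of_a)
```

With the book's texts loaded, the same functions could be called as `lexical_diversity(text1)` and `percentage(text1.count('a'), len(text1))`.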


Next time

There is no quiz for Monday.
We will learn how to get our own text into Python & NLTK.
