25
Corpora in NLTK Methods in Computational Linguistics I Lesson 3

Corpora in NLTK - eniac.cs.qc.cuny.edueniac.cs.qc.cuny.edu/andrew/ling78100-10/Lecture3.pdf · NLTK resources •Corpora often also include data annotations •NLTK has a variety

Embed Size (px)

Citation preview

Page 1: Corpora in NLTK - eniac.cs.qc.cuny.edueniac.cs.qc.cuny.edu/andrew/ling78100-10/Lecture3.pdf · NLTK resources •Corpora often also include data annotations •NLTK has a variety

Corpora in NLTKMethods in Computational Linguistics I

Lesson 3

Page 2: Corpora in NLTK - eniac.cs.qc.cuny.edueniac.cs.qc.cuny.edu/andrew/ling78100-10/Lecture3.pdf · NLTK resources •Corpora often also include data annotations •NLTK has a variety

Last Time

• Text Processing

Page 3: Corpora in NLTK - eniac.cs.qc.cuny.edueniac.cs.qc.cuny.edu/andrew/ling78100-10/Lecture3.pdf · NLTK resources •Corpora often also include data annotations •NLTK has a variety

Today

• Text Corpora in NLTK

• for loops

• stored python scripts

• WordNet

Page 4: Corpora in NLTK - eniac.cs.qc.cuny.edueniac.cs.qc.cuny.edu/andrew/ling78100-10/Lecture3.pdf · NLTK resources •Corpora often also include data annotations •NLTK has a variety

What is a corpus?

• A (relatively) unprocessed set of data

• Penn Treebank

• WSJ

• WordNet?

• Switchboard

Page 5: Corpora in NLTK - eniac.cs.qc.cuny.edueniac.cs.qc.cuny.edu/andrew/ling78100-10/Lecture3.pdf · NLTK resources •Corpora often also include data annotations •NLTK has a variety

How are corpora used in CL?

• Corpora are the lifeblood of computational linguistics research.

• Collection and Dissemination of Corpora

• Descriptive Statistics

• Statistical Modeling

• Consistent Evaluation

Page 6: Corpora in NLTK - eniac.cs.qc.cuny.edueniac.cs.qc.cuny.edu/andrew/ling78100-10/Lecture3.pdf · NLTK resources •Corpora often also include data annotations •NLTK has a variety

Collection and Dissemination

• Collecting language material is a valuable research outcome.

• Endangered languages

• Hard-to-find data

• Authentic emotion

• Proprietary information

• fMRI

• Articulatory data

Page 7: Corpora in NLTK - eniac.cs.qc.cuny.edueniac.cs.qc.cuny.edu/andrew/ling78100-10/Lecture3.pdf · NLTK resources •Corpora often also include data annotations •NLTK has a variety

Descriptive Statistics

• Describe some linguistic phenomenon using statistics.

• How likely is the word “tweet” to be a verb?

• Is a sentence more likely to start with the word “the” or “baboon”?

• What language use phenomena are associated with positive product reviews?

Page 8: Corpora in NLTK - eniac.cs.qc.cuny.edueniac.cs.qc.cuny.edu/andrew/ling78100-10/Lecture3.pdf · NLTK resources •Corpora often also include data annotations •NLTK has a variety

Statistical Modeling

• Use statistics to make predictions about unseen language data.

• Train a statistical model based on part of a corpus.

• Evaluate on unseen material.

• Models are task specific.

Page 9: Corpora in NLTK - eniac.cs.qc.cuny.edueniac.cs.qc.cuny.edu/andrew/ling78100-10/Lecture3.pdf · NLTK resources •Corpora often also include data annotations •NLTK has a variety

Consistent Evaluation

• Different approaches to a common task can be evaluated on common material.

• “Shared Tasks”

• This allows for “state-of-the-art” results to be established and verified.

Page 10: Corpora in NLTK - eniac.cs.qc.cuny.edueniac.cs.qc.cuny.edu/andrew/ling78100-10/Lecture3.pdf · NLTK resources •Corpora often also include data annotations •NLTK has a variety

Using Corpora with NLTK and Python

• NLTK includes many corpora

• Books of the bible

• Public domain books (Project Gutenberg)

• News

• Web chat

• Blogs

• Some non-English material as well.

Page 11: Corpora in NLTK - eniac.cs.qc.cuny.edueniac.cs.qc.cuny.edu/andrew/ling78100-10/Lecture3.pdf · NLTK resources •Corpora often also include data annotations •NLTK has a variety

NLTK Corpus Demos

• What corpora are included

• conditional frequency distributions

• for loops

• python scripts

• calling python from outside the interpreter

Page 12: Corpora in NLTK - eniac.cs.qc.cuny.edueniac.cs.qc.cuny.edu/andrew/ling78100-10/Lecture3.pdf · NLTK resources •Corpora often also include data annotations •NLTK has a variety

NLTK resources

• Corpora often also include data annotations

• NLTK has a variety of methods to access corpus annotations.

• corpus.sents()

• Author

• Part of speech (POS) tags

• Parse tree information

• Word net annotations

• Discourse information

• etc.

• We’ll come back to these as needed.

Page 13: Corpora in NLTK - eniac.cs.qc.cuny.edueniac.cs.qc.cuny.edu/andrew/ling78100-10/Lecture3.pdf · NLTK resources •Corpora often also include data annotations •NLTK has a variety

Major CL tasks

• Part of Speech Tagging

• Parsing

• Word Net

• Named Entity Recognition

• Information Retrieval

• Sentiment Analysis

• Document Clustering

• Topic Segmentation

• Authoring

• Machine Translation

• Summarization

• Information Extraction

• Spoken Dialog Systems

• Natural Language Generation

• Word Sense Disambiguation

Page 14: Corpora in NLTK - eniac.cs.qc.cuny.edueniac.cs.qc.cuny.edu/andrew/ling78100-10/Lecture3.pdf · NLTK resources •Corpora often also include data annotations •NLTK has a variety

Part of Speech Tagging

• Task: Given a string of words, identify the parts of speech for each word.

A man walks into a bar.Det. Noun Verb Prep. Det. Noun

Page 15: Corpora in NLTK - eniac.cs.qc.cuny.edueniac.cs.qc.cuny.edu/andrew/ling78100-10/Lecture3.pdf · NLTK resources •Corpora often also include data annotations •NLTK has a variety

Part of Speech Tagging

• Surface level syntax.

• Primary operation

• Parsing

• Word Sense Disambiguation

• Semantic Role labeling

• Segmentation

• Discourse, Topic, Sentence

Page 16: Corpora in NLTK - eniac.cs.qc.cuny.edueniac.cs.qc.cuny.edu/andrew/ling78100-10/Lecture3.pdf · NLTK resources •Corpora often also include data annotations •NLTK has a variety

How do we do it?

• Learn from Data.

• Annotated Data:

• Unlabeled Data:

A man walks into a bar.Det. Noun Verb Prep. Det. Noun

A man walks home.The pitcher issued four walks.

Page 17: Corpora in NLTK - eniac.cs.qc.cuny.edueniac.cs.qc.cuny.edu/andrew/ling78100-10/Lecture3.pdf · NLTK resources •Corpora often also include data annotations •NLTK has a variety

Part of speech tagging

Det Noun Verb Prep Adj

A 0.9 0.1 0.0 0.0 0.0

man 0.0 0.6 0.2 0.0 0.2

walks 0.0 0.2 0.8 0.0 0.0

into 0.0 0.0 0.0 1.0 0.0

bar 0.0 0.7 0.3 0.0 0.0

Page 18: Corpora in NLTK - eniac.cs.qc.cuny.edueniac.cs.qc.cuny.edu/andrew/ling78100-10/Lecture3.pdf · NLTK resources •Corpora often also include data annotations •NLTK has a variety

Limitations

• Unseen tokens

• Uncommon interpretations

• Long term dependencies

Page 19: Corpora in NLTK - eniac.cs.qc.cuny.edueniac.cs.qc.cuny.edu/andrew/ling78100-10/Lecture3.pdf · NLTK resources •Corpora often also include data annotations •NLTK has a variety

Parsing

• Generate a Parse Tree from:

• The surface form (words) of the text

• Part of Speech Tokens

Page 20: Corpora in NLTK - eniac.cs.qc.cuny.edueniac.cs.qc.cuny.edu/andrew/ling78100-10/Lecture3.pdf · NLTK resources •Corpora often also include data annotations •NLTK has a variety

Parsing Styles• Parse Trees

• Dependency Parsing<ROOT>

I

Gave

John

Address

My

main

subjdat

obj

attr

.punct

S

NP VP

Verb NP NP

Noun Pos. Noun

I

Gave

John My Address

Page 21: Corpora in NLTK - eniac.cs.qc.cuny.edueniac.cs.qc.cuny.edu/andrew/ling78100-10/Lecture3.pdf · NLTK resources •Corpora often also include data annotations •NLTK has a variety

Context Free Grammars for Parsing

• S → VP

• S →NP VP

• NP → Det Nom

• Nom → Noun

• Nom → Adj Nom

• VP → Verb Nom

• Det → “A”, “The”

• Noun → “I”, “John”, “Address”

• Verb → “Gave”

• Adj → “My”, “Blue”

• Adv → “Quickly”

Page 22: Corpora in NLTK - eniac.cs.qc.cuny.edueniac.cs.qc.cuny.edu/andrew/ling78100-10/Lecture3.pdf · NLTK resources •Corpora often also include data annotations •NLTK has a variety

Using these rules

• Construct a parse that fits the desired text.

Page 23: Corpora in NLTK - eniac.cs.qc.cuny.edueniac.cs.qc.cuny.edu/andrew/ling78100-10/Lecture3.pdf · NLTK resources •Corpora often also include data annotations •NLTK has a variety

Limitations

• The grammar must be built by hand.

• Can’t handle ungrammatical sentences.

• Can’t resolve ambiguity.

Page 24: Corpora in NLTK - eniac.cs.qc.cuny.edueniac.cs.qc.cuny.edu/andrew/ling78100-10/Lecture3.pdf · NLTK resources •Corpora often also include data annotations •NLTK has a variety

Probabilistic Parsing

• Assign each transition a probability

• Find the parse with the greatest “likelihood”

Page 25: Corpora in NLTK - eniac.cs.qc.cuny.edueniac.cs.qc.cuny.edu/andrew/ling78100-10/Lecture3.pdf · NLTK resources •Corpora often also include data annotations •NLTK has a variety

Next Time

• Functions, Lists and Tuples