NLTK: Natural Language Processing made easy

DESCRIPTION

This session introduces the Natural Language Toolkit (NLTK), an open-source library that simplifies the implementation of Natural Language Processing (NLP) in Python. It is useful for getting started with NLP and also for research and teaching.

Page 1: NLTK: Natural Language Processing made easy

http://barcampbangalore.org

NLTK

Natural Language Processing made easy

Elvis Joel D’Souza

Gopikrishnan Nambiar

Ashutosh Pandey

Page 2: NLTK: Natural Language Processing made easy

WHAT: Session Objective

To introduce the Natural Language Toolkit (NLTK), an open-source library which simplifies the implementation of Natural Language Processing (NLP) in Python.

Page 3: NLTK: Natural Language Processing made easy

HOW: Session Layout

This session is divided into 3 parts:

• Python – The programming language

• Natural Language Processing (NLP) – The concept

• Natural Language Toolkit (NLTK) – The tool for NLP implementation in Python

Page 4: NLTK: Natural Language Processing made easy

Page 5: NLTK: Natural Language Processing made easy

Why Python?

Page 6: NLTK: Natural Language Processing made easy

Data Structures

Python has 4 built-in data structures:

1. List

2. Tuple

3. Dictionary

4. Set

Page 7: NLTK: Natural Language Processing made easy

List

• A list in Python is an ordered group of items (or elements).

• It is a very general structure, and list elements don't have to be of the same type.

listOfWords = ['this', 'is', 'a', 'list', 'of', 'words']

listOfRandomStuff = [1, 'pen', 'costs', 'Rs.', 6.50]
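As a minimal sketch of how such lists behave (written for Python 3; the variable names are illustrative and not from the original slides), elements are accessed by position and a list can be changed after it is created:

```python
# Build a list of words and inspect it by position.
list_of_words = ['this', 'is', 'a', 'list', 'of', 'words']
first_word = list_of_words[0]      # indexing starts at 0
last_word = list_of_words[-1]      # negative indices count from the end
word_count = len(list_of_words)

# Lists are mutable: items can be appended or replaced in place.
list_of_words.append('here')
list_of_words[2] = 'one'

# Elements may be of mixed types.
mixed = [1, 'pen', 'costs', 'Rs.', 6.50]
types_seen = [type(item).__name__ for item in mixed]
print(first_word, last_word, word_count, types_seen)
```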

Page 8: NLTK: Natural Language Processing made easy

Tuple

• A tuple in Python is much like a list except that it is immutable (unchangeable) once created.

• They are generally used for data which should not be edited.

Example: (100, 10, 0.01, 'hundred')

The elements are, in order: the number, its square root, its reciprocal, and the number in words.

Page 9: NLTK: Natural Language Processing made easy

Return a tuple

def func(x, y):
    # code to compute a and b
    return (a, b)

One very useful application of tuples is returning multiple values from a function. In many other languages, returning multiple values requires creating an object or container of some type.
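A small runnable sketch of the same idea follows. The function name and the choice of quotient and remainder are assumptions made for illustration; the original slide leaves the computation unspecified.

```python
def quotient_and_remainder(x, y):
    """Return two results at once by packing them into a tuple."""
    quotient = x // y
    remainder = x % y
    return (quotient, remainder)

# The caller can unpack the tuple straight into separate names.
q, r = quotient_and_remainder(17, 5)
print(q, r)

# Tuples are immutable: assigning to an element raises TypeError.
result = quotient_and_remainder(17, 5)
try:
    result[0] = 99
    mutated = True
except TypeError:
    mutated = False
```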

Page 10: NLTK: Natural Language Processing made easy

Dictionary

• A dictionary in Python is a collection of values which are accessed by key rather than by position.

• Example:

{1: 'one', 2: 'two', 3: 'three'}

• Here, the key is a number and the value is its name in words.
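The dictionary above can be exercised with a few common operations. This is only a sketch; the variable names are illustrative.

```python
number_names = {1: 'one', 2: 'two', 3: 'three'}

# Look up a value by its key.
name_of_two = number_names[2]

# Add a new key-value pair, then test for membership by key.
number_names[4] = 'four'
has_four = 4 in number_names
has_five = 5 in number_names

# get() returns a default instead of raising KeyError for a missing key.
name_of_five = number_names.get(5, 'unknown')

for number, name in sorted(number_names.items()):
    print(number, name)
```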

Page 11: NLTK: Natural Language Processing made easy

Sets

• Python also has an implementation of the mathematical set.

• Unlike sequence objects such as lists and tuples, in which each element is indexed, a set is an unordered collection of objects.

• Sets also cannot have duplicate members - a given object appears in a set 0 or 1 times.

SetOfBrowsers = set(['IE', 'Firefox', 'Opera', 'Chrome'])
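The following sketch shows the no-duplicates rule and the usual set operations. The second set, `installed`, is a made-up example and not part of the original slides.

```python
browsers = set(['IE', 'Firefox', 'Opera', 'Chrome'])

# Adding an existing member leaves the set unchanged.
browsers.add('Firefox')
size = len(browsers)

# Membership tests and the usual set operations are built in.
installed = set(['Firefox', 'Chrome', 'Safari'])
in_both = browsers & installed        # intersection
in_either = browsers | installed      # union
only_in_first = browsers - installed  # difference
print(sorted(in_both), sorted(only_in_first))
```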

Page 12: NLTK: Natural Language Processing made easy

Control Statements

Page 13: NLTK: Natural Language Processing made easy

Decision Control - If

num = 3

Page 14: NLTK: Natural Language Processing made easy

Loop Control - While

number = 10

Page 15: NLTK: Natural Language Processing made easy

Loop Control - For

Page 16: NLTK: Natural Language Processing made easy

Functions - Syntax

def functionname(arg1, arg2, ...):
    statement1
    statement2
    return variable
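One way the template above might be filled in is sketched below. The function and its inputs are hypothetical and are not the example from the original slides; the body also happens to use a for loop and an if statement.

```python
def count_long_words(words, min_length):
    """Count the words that have at least min_length characters."""
    count = 0
    for word in words:
        if len(word) >= min_length:
            count += 1
    return count

sentence = 'the quick brown fox jumps over the lazy dog'.split()
long_words = count_long_words(sentence, 5)
print(long_words)
```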

Page 17: NLTK: Natural Language Processing made easy

Functions - Example

Page 18: NLTK: Natural Language Processing made easy

Modules

• A module is a file containing Python definitions and statements.

• The file name is the module name with the suffix .py appended.

• A module can be imported by another program to make use of its functionality.

Page 19: NLTK: Natural Language Processing made easy

Import

import math

The import keyword tells Python that we need the 'math' module.

This statement makes all the functions in this module accessible in the program through the module name, as in math.sqrt.

Page 20: NLTK: Natural Language Processing made easy

Using Modules – An Example

print(math.sqrt(100))

sqrt is a function

math is a module

math.sqrt(100) returns 10.0

This value is printed to the standard output

Page 21: NLTK: Natural Language Processing made easy

Natural Language Processing

(NLP)

Page 22: NLTK: Natural Language Processing made easy

Natural Language Processing

The term natural language processing encompasses a broad set of techniques for automated generation, manipulation, and analysis of natural or human languages.

Page 23: NLTK: Natural Language Processing made easy

Why NLP

• Applications for processing large amounts of text require NLP expertise

• Index and search large texts

• Speech understanding

• Information extraction

• Automatic summarization

Page 24: NLTK: Natural Language Processing made easy

Stemming

• Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form, generally a written word form.

• The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root.

• When you apply stemming to 'cats', the result is 'cat'.
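To make the idea concrete, here is a deliberately naive suffix-stripping sketch in plain Python. It is not the Porter algorithm or any stemmer shipped with NLTK; the suffix list and the minimum stem length are assumptions chosen only for illustration.

```python
# Suffixes to strip, ordered so that 'es' is tried before 's'.
SUFFIXES = ('ing', 'es', 'ed', 's')

def naive_stem(word):
    """Strip one known suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for word in ['cats', 'walked', 'walking', 'walks', 'cat', 'horses']:
    print(word, '->', naive_stem(word))
```

Related forms such as 'walked', 'walking' and 'walks' all map to 'walk', while 'horses' maps to 'hors', a stem that is not itself a valid word.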

Page 25: NLTK: Natural Language Processing made easy

Part-of-speech tagging (POS Tagging)

• Part-of-speech (POS) tag: A word can be classified into one or more lexical or part-of-speech categories, such as nouns, verbs, adjectives, and articles, to name a few.

• A POS tag is a symbol representing such a lexical category, e.g., NN (noun), VB (verb), JJ (adjective), AT (article).

Page 26: NLTK: Natural Language Processing made easy

POS tagging - continued

• Given a sentence and a set of POS tags, a common language processing task is to automatically assign POS tags to each word in the sentence.

• State-of-the-art POS taggers can achieve accuracy as high as 96%.

Page 27: NLTK: Natural Language Processing made easy

POS Tagging – An Example

The ball is red

The: ARTICLE

ball: NOUN

is: VERB

red: ADJECTIVE
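The tagging of this sentence can be mimicked with a toy lookup tagger in plain Python. This is only a sketch: the lookup table is hand-written for this one sentence, and the default tag for unknown words is an assumption. A real tagger, such as those in NLTK, is trained on a tagged corpus.

```python
# Hand-written lookup table; a real tagger would learn this from a tagged corpus.
TAG_TABLE = {
    'the': 'ARTICLE',
    'ball': 'NOUN',
    'is': 'VERB',
    'red': 'ADJECTIVE',
}

def lookup_tag(sentence, default='NOUN'):
    """Tag each word by table lookup, falling back to a default tag."""
    tagged = []
    for word in sentence.split():
        tagged.append((word, TAG_TABLE.get(word.lower(), default)))
    return tagged

print(lookup_tag('The ball is red'))
```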

Page 28: NLTK: Natural Language Processing made easy

Parsing

Parsing a sentence involves the use of linguistic knowledge of a language to discover the way in which a sentence is structured.

Page 29: NLTK: Natural Language Processing made easy

Parsing– An Example

The boy went home

The: ARTICLE

boy: NOUN

went: VERB

home: NOUN

NP: The boy

VP: went home

Page 30: NLTK: Natural Language Processing made easy

Challenges

• We will often imply additional information in spoken language by the way we place stress on words.

• The sentence "I never said she stole my money" demonstrates the importance stress can play in a sentence, and thus the inherent difficulty a natural language processor can have in parsing it.

Page 31: NLTK: Natural Language Processing made easy

Depending on which word the speaker stresses, a sentence can have several distinct meanings.

Here is an example:

Page 32: NLTK: Natural Language Processing made easy

• "I never said she stole my money" (stress on "I"): Someone else said it, but I didn't.

• "I never said she stole my money" (stress on "never"): I simply didn't ever say it.

• "I never said she stole my money" (stress on "said"): I might have implied it in some way, but I never explicitly said it.

• "I never said she stole my money" (stress on "she"): I said someone took it; I didn't say it was she.

Page 33: NLTK: Natural Language Processing made easy

• "I never said she stole my money" (stress on "stole"): I just said she probably borrowed it.

• "I never said she stole my money" (stress on "my"): I said she stole someone else's money.

• "I never said she stole my money" (stress on "money"): I said she stole something, but not my money.

Page 34: NLTK: Natural Language Processing made easy

NLTK

Natural Language Toolkit

Page 35: NLTK: Natural Language Processing made easy

Design Goals

Page 36: NLTK: Natural Language Processing made easy

Exploring Corpora

A corpus is a large collection of text that is used either to train an NLP program or as input to one.

In NLTK, a corpus can be loaded using the PlaintextCorpusReader class.

Page 37: NLTK: Natural Language Processing made easy

Page 38: NLTK: Natural Language Processing made easy

Loading your own corpus

>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = r'C:\text'
>>> wordlists = PlaintextCorpusReader(corpus_root, '.*')
>>> wordlists.fileids()
['README', 'connectives', 'propernames', 'web2', 'web2a', 'words']
>>> wordlists.words('connectives')
['the', 'of', 'and', 'to', 'a', 'in', 'that', 'is', ...]

Page 39: NLTK: Natural Language Processing made easy

NLTK Corpora

• Gutenberg corpus

• Brown corpus

• WordNet

• Stopwords

• Shakespeare corpus

• Treebank

• And many more…

Page 40: NLTK: Natural Language Processing made easy

Computing with Language: Simple Statistics

Frequency Distributions

>>> fdist1 = FreqDist(text1)
>>> fdist1
<FreqDist with 260819 outcomes>
>>> vocabulary1 = fdist1.keys()
>>> vocabulary1[:50]
[',', 'the', '.', 'of', 'and', 'a', 'to', ';', 'in', 'that', "'", '-',
'his', 'it', 'I', 's', 'is', 'he', 'with', 'was', 'as', '"', 'all', 'for',
'this', '!', 'at', 'by', 'but', 'not', '--', 'him', 'from', 'be', 'on',
'so', 'whale', 'one', 'you', 'had', 'have', 'there', 'But', 'or', 'were',
'now', 'which', '?', 'me', 'like']
>>> fdist1['whale']
906
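The counting idea behind a frequency distribution can be sketched without NLTK, using the standard library's collections.Counter on a tiny made-up token list instead of Moby Dick. This is a stand-in for illustration, not NLTK's FreqDist itself. In recent NLTK releases FreqDist is built on Counter, so the calls below should carry over, though that is worth checking against the installed version.

```python
from collections import Counter

# A tiny made-up "text"; the session itself uses Moby Dick (text1).
text = 'the whale swam and the ship sailed the whale dived'.split()

fdist = Counter(text)
total_outcomes = sum(fdist.values())   # number of tokens counted
top_two = fdist.most_common(2)         # (word, count) pairs, most frequent first
whale_count = fdist['whale']
missing_count = fdist['harpoon']       # unseen words count as 0
print(total_outcomes, top_two, whale_count, missing_count)
```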

Page 41: NLTK: Natural Language Processing made easy

Cumulative Frequency Plot for the 50 Most Frequent Words in Moby Dick

Page 42: NLTK: Natural Language Processing made easy

POS tagging

Page 43: NLTK: Natural Language Processing made easy

WordNet Lemmatizer

Page 44: NLTK: Natural Language Processing made easy

Parsing

>>> from nltk.parse import ShiftReduceParser
>>> sr = ShiftReduceParser(grammar)
>>> sentence1 = 'the cat chased the dog'.split()
>>> sentence2 = 'the cat chased the dog on the rug'.split()
>>> for t in sr.nbest_parse(sentence1):
...     print(t)
(S (NP (DT the) (N cat)) (VP (V chased) (NP (DT the) (N dog))))
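The session does not show the grammar passed to the parser, so the following is a self-contained, greedy shift-reduce sketch in plain Python rather than NLTK's ShiftReduceParser. The rules and lexicon are assumptions, chosen so that the first sentence yields the tree shown above.

```python
# Assumed toy grammar: right-hand side (tuple of labels) -> left-hand side.
RULES = {
    ('NP', 'VP'): 'S',
    ('DT', 'N'): 'NP',
    ('V', 'NP'): 'VP',
}
# Assumed lexicon mapping each word to its category.
LEXICON = {'the': 'DT', 'cat': 'N', 'dog': 'N', 'chased': 'V'}

def shift_reduce(tokens):
    """Greedily reduce whenever the top of the stack matches a rule, else shift."""
    stack = []
    remaining = list(tokens)
    while True:
        for rhs, lhs in RULES.items():
            n = len(rhs)
            if tuple(node[0] for node in stack[-n:]) == rhs:
                children = stack[-n:]
                del stack[-n:]
                stack.append((lhs, children))   # reduce
                break
        else:
            if not remaining:
                break
            word = remaining.pop(0)
            stack.append((LEXICON[word], word))  # shift
    # Success only if everything was reduced to a single S node.
    return stack[0] if len(stack) == 1 and stack[0][0] == 'S' else None

def bracket(node):
    """Render a tree as a bracketed string."""
    label, body = node
    if isinstance(body, str):
        return '(%s %s)' % (label, body)
    return '(%s %s)' % (label, ' '.join(bracket(child) for child in body))

tree = shift_reduce('the cat chased the dog'.split())
print(bracket(tree))
```

Because the sketch always reduces as soon as it can and never backtracks, it may fail on sentences that a more careful parser would accept.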

Page 45: NLTK: Natural Language Processing made easy

Authorship Attribution

An Example

Page 46: NLTK: Natural Language Processing made easy

Find nltk @ <python-installation>\Lib\site-packages\nltk

Page 47: NLTK: Natural Language Processing made easy

The Road Ahead

Python:

• http://www.python.org

• A Byte of Python, Swaroop CH: http://www.swaroopch.com/notes/python

Natural Language Processing:

• Speech and Language Processing, Jurafsky and Martin

• Foundations of Statistical Natural Language Processing, Manning and Schütze

Natural Language Toolkit:

• http://www.nltk.org (for NLTK Book, Documentation)

• Upcoming book by O'Reilly Publishers