Introduction to Natural Language Processing / Computational Linguistics

Preface

Most human knowledge — and most human communication — is represented and expressed using language. Language technologies permit computers to process human language automatically; hand-held computers support predictive text and handwriting recognition; web search engines give access to information locked up in unstructured text. By providing more natural human-machine interfaces, and more sophisticated access to stored information, language processing has come to play a central role in the multilingual information society.

This textbook provides a comprehensive introduction to the field of natural language processing (NLP), covering the major techniques and theories. The book provides numerous worked examples and exercises, and can serve as the main text for undergraduate and introductory graduate courses on natural language processing or computational linguistics.

1 Audience

This book is intended for people in the language sciences and the information sciences who want to learn how to write programs that analyze written language. You won't need any prior knowledge of linguistics or computer science; those with a background in either area can simply skip over some of the discussions. Depending on which background you come from, and your motivation for being interested in NLP, you will gain different kinds of skills and knowledge from this book, as set out below:

Goals, by background:

Linguistic Analysis
  Linguistics background: programming to manage linguistic data, explore formal theories, and test empirical claims.
  Computer Science background: linguistics as a source of interesting problems in data modelling, data mining, and formal language theory.

Language Technology
  Linguistics background: learning to program, with applications to familiar problems, in order to work in language technology or another technical field.
  Computer Science background: knowledge of linguistics as fundamental to developing high-quality, maintainable language processing software.

The structure of the book is biased towards linguists, in that the introduction to programming appears in the main chapter sequence, and early chapters contain many elementary examples. We hope that computer science readers can quickly skip over such materials until they reach content that is more linguistically challenging.


2 What You Will Learn

By the time you have dug into the material presented here, you will have acquired substantial skills and knowledge in the following areas:

- how simple programs can help linguists manipulate and analyze language data, and how to write these programs;

- key concepts from linguistic description and analysis;

- how linguistic knowledge is used in important language technology components;

- knowledge of the principal data structures and algorithms used in NLP, and skills in algorithmic problem solving, data modelling, and data management;

- understanding of the standard corpora and their use in formal evaluation;

- the organization of the field of NLP;

- skills in Python programming for NLP.

3 Download the Toolkit...

This textbook is a companion to the Natural Language Toolkit. All software, corpora, and documentation are freely downloadable from http://nltk.sourceforge.net/. Distributions are provided for Windows, Macintosh and Unix platforms. All NLTK distributions plus Python and WordNet distributions are also available in the form of an ISO image which can be downloaded and burnt to CD-ROM for easy local redistribution. We strongly encourage you to download the toolkit before you go beyond the first chapter of the book.

4 Emphasis

This book is a practical introduction to NLP. You will learn by example, write real programs, and grasp the value of being able to test an idea through implementation. If you haven't learnt already, this book will teach you programming. Unlike other programming books, we provide extensive illustrations and exercises from NLP. The approach we have taken is also principled, in that we cover the theoretical underpinnings and don't shy away from careful linguistic and computational analysis. We have tried to be pragmatic in striking a balance between theory and application, and we alternate between the two several times in each chapter, identifying the connections but also the tensions. Finally, we recognize that you won't get through this unless it is also pleasurable, so we have tried to include many applications and examples that are interesting and entertaining, sometimes whimsical.

5 Structure

The book is structured into three parts, as follows:

Part 1: Basics In this part, we focus on recognising simple structure in text. We start with individual words, then explore parts of speech and simple syntactic constituents.


Part 2: Parsing Here, we deal with syntactic structure, trees, grammars, and parsing.

Part 3: Advanced Topics This final part of the book contains chapters which address selected topics in NLP in more depth and at a more advanced level. By design, the chapters in this part can be read independently of each other.

The three parts have a common structure: they start off with a chapter on programming, followed by four chapters on various topics in NLP. The programming chapters are foundational, and you must master this material before progressing further.

Each chapter consists of an introduction; a sequence of major sections together with graded exercises; and finally a summary and suggestions for further reading. The exercises are important for consolidating the material in each section, and we strongly encourage you to try a few before continuing with the rest of the chapter.

6 For Instructors

Natural Language Processing (NLP) is often taught within the confines of a single-semester course at advanced undergraduate or postgraduate level. Many instructors have found that it is difficult to cover both the theoretical and practical sides of the subject in such a short span of time. Some courses focus on theory to the exclusion of practical exercises, and deprive students of the challenge and excitement of writing programs to automatically process language. Other courses are simply designed to teach programming for linguists, and do not manage to cover any significant NLP content. The Natural Language Toolkit (NLTK) was developed to address this problem, making it feasible to cover a substantial amount of theory and practice within a single-semester course, even if students have no prior programming experience.

A significant fraction of any NLP syllabus covers fundamental data structures and algorithms. These are usually taught with the help of formal notations and complex diagrams. Large trees and charts are copied onto the board and edited in tedious slow motion, or laboriously prepared for presentation slides. It is more effective to use live demonstrations in which those diagrams are generated and updated automatically. NLTK provides interactive graphical user interfaces, making it possible to view program state and to study program execution step-by-step. Most NLTK components have a demonstration mode, and will perform an interesting task without requiring any special input from the user. It is even possible to make minor modifications to programs in response to "what if" questions. In this way, students learn the mechanics of NLP quickly, gain deeper insights into the data structures and algorithms, and acquire new problem-solving skills.

NLTK supports assignments of varying difficulty and scope. In the simplest assignments, students experiment with existing components to perform a wide variety of NLP tasks. This may involve no programming at all, in the case of the existing demonstrations, or simply changing a line or two of program code. As students become more familiar with the toolkit they can be asked to modify existing components or to create complete systems out of existing components. NLTK also provides students with a flexible framework for advanced projects, such as developing a multi-component system, by integrating and extending NLTK components, and adding entirely new components. Here NLTK helps by providing standard implementations of all the basic data structures and algorithms, interfaces to standard corpora, substantial corpus samples, and a flexible and extensible architecture. Thus, as we have seen, NLTK offers a fresh approach to NLP pedagogy, in which theoretical content is tightly integrated with application.

Relationship to Other NLP Textbooks:

Bird, Klein & Loper -3 December 6, 2006

Page 5: Introduction to Natural Language Processing ...ce.aut.ac.ir/islab/courses/NLP/archive/1388/s1/nltk-book.pdf · This textbook provides a comprehensive introduction to the eld of natural

Introduction to Natural Language Processing (DRAFT) .

We believe our book is unique in providing a comprehensive pedagogical framework for students to learn about NLP in the context of learning to program. What sets our materials apart is the tight coupling of the chapters and exercises with NLTK, giving students — even those with no prior programming experience — a practical introduction to NLP. After completing these materials, students will be ready to attempt one of the more advanced textbooks, such as Foundations of Statistical Natural Language Processing, by Manning and Schütze (MIT Press, 2000).

Course Plans: Lectures/Lab Sessions per Chapter

Chapter                          Linguists   Computer Scientists
1     Introduction               1           1
2     Programming                4           1
3     Words                      2           2
4     Tagging                    2-3         2
5     Chunking                   0-2         2
6     Programming                2-4         1
7     Grammars and Parsing       2-4         2-4
8     Chart Parsing              1-2         1
9     Feature Based Grammar      2-4         2-4
10    Probabilistic Grammars     0-2         2
11-15 Advanced Topics            2-8         2-16
Total                            18-36       18-36

Further Reading:

The Association for Computational Linguistics (ACL) The ACL is the foremost professional body in NLP. Its journal and conference proceedings, approximately 10,000 articles, are available online with a full-text search interface, via http://www.aclweb.org/anthology/.

Linguistic Terminology A comprehensive glossary of linguistic terminology is available at http://www.sil.org/linguistics/GlossaryOfLinguisticTerms/.

Language Files Materials for an Introduction to Language and Linguistics (Ninth Edition), The Ohio State University Department of Linguistics. For more information, see http://www.ling.ohio-state.edu/publications/files/.

7 Acknowledgements

NLTK was originally created as part of a computational linguistics course in the Department of Computer and Information Science at the University of Pennsylvania in 2001. Since then it has been developed and expanded with the help of dozens of contributors. It has now been adopted in courses in dozens of universities, and serves as the basis of many research projects.

In particular, we're grateful to the following people for their feedback, advice, and contributions: Greg Aumann, Trevor Cohn, James Curran, Jean Mark Gawron, Baden Hughes, Christopher Maloof, Stuart Robinson, Rob Speer. Many others have contributed to the toolkit, and they are listed at http://nltk.sourceforge.net/contrib.html.


8 About the Authors

Steven Bird Ewan Klein Edward Loper

Steven Bird is an Associate Professor in the Department of Computer Science and Software Engineering at the University of Melbourne, and a Senior Research Associate in the Linguistic Data Consortium at the University of Pennsylvania. After completing a PhD at the University of Edinburgh on computational phonology (1990), Steven moved to Cameroon to conduct fieldwork on tone and orthography. Later he spent four years as Associate Director of the Linguistic Data Consortium, where he developed models and tools for linguistic annotation. His current research interests are in linguistic databases and query languages.

Ewan Klein is Professor of Language Technology in the School of Informatics at the University of Edinburgh. He completed a PhD on formal semantics at the University of Cambridge in 1978. After some years working at the Universities of Sussex and Newcastle upon Tyne, he took up a teaching position at Edinburgh. His current research interests are in computational semantics.

Edward Loper is a doctoral student in the Department of Computer and Information Sciences at the University of Pennsylvania. ...

About this document... This chapter is a draft from Introduction to Natural Language Processing, by Steven Bird, Ewan Klein and Edward Loper, Copyright 2006 the authors. It is distributed with the Natural Language Toolkit [http://nltk.sourceforge.net], Version 0.7b1, under the terms of the Creative Commons Attribution-ShareAlike License [http://creativecommons.org/licenses/by-sa/2.5/].


Python and the Natural Language Toolkit

1 Why Python?

Python is a simple yet powerful programming language with excellent functionality for processing linguistic data. Python can be downloaded for free from http://www.python.org/.

Here is a five-line Python program which takes text input and prints all the words ending in ing:

>>> import sys                            # load the system library
>>> for line in sys.stdin.readlines():    # for each line of input
...     for word in line.split():         # for each word in the line
...         if word.endswith('ing'):      # does the word end in 'ing'?
...             print word                # if so, print the word

This program illustrates some of the main features of Python. First, whitespace is used to nest lines of code, thus the line starting with if falls inside the scope of the previous line starting with for, so the ing test is performed for each word. Second, Python is object-oriented; each variable is an entity which has certain defined attributes and methods. For example, line is more than a sequence of characters. It is a string object that has a method (or operation) called split that we can use to break a line into its words. To apply a method to an object, we give the object name, followed by a period, followed by the method name. Third, methods have arguments expressed inside parentheses. For instance, split had no argument because we were splitting the string wherever there was white space. To split a string into sentences delimited by a period, we could write split('.'). Finally, and most importantly, Python is highly readable, so much so that it is fairly easy to guess what the above program does even if you have never written a program before.
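To make the behaviour of split concrete, here is a short interactive session in the same style; the input string is our own invented example:

>>> line = "The cat sat. The dog barked."
>>> line.split()               # split wherever there is white space
['The', 'cat', 'sat.', 'The', 'dog', 'barked.']
>>> line.split('.')            # split on periods instead
['The cat sat', ' The dog barked', '']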

We chose Python as the implementation language for NLTK because it has a shallow learning curve, its syntax and semantics are transparent, and it has good string-handling functionality. As a scripting language, Python facilitates interactive exploration. As an object-oriented language, Python permits data and methods to be encapsulated and re-used easily. As a dynamic language, Python permits attributes to be added to objects on the fly, and permits variables to be typed dynamically, facilitating rapid development. Python comes with an extensive standard library, including components for graphical programming, numerical processing, and web data processing.

Python is heavily used in industry, scientific research, and education around the world. Python is often praised for the way it facilitates productivity, quality, and maintainability of software. A collection of Python success stories is posted at http://www.python.org/about/success/.

NLTK defines a basic infrastructure that can be used to build NLP programs in Python. It provides: basic classes for representing data relevant to natural language processing; standard interfaces for performing tasks such as tokenization, tagging, and parsing; standard implementations for each task which can be combined to solve complex problems; and extensive documentation including tutorials and reference documentation.


2 The Design of NLTK

NLTK was designed with six requirements in mind:

Ease of use: The primary purpose of the toolkit is to allow students to concentrate on building natural language processing systems. The more time students must spend learning to use the toolkit, the less useful it is. We have provided software distributions for several platforms, along with platform-specific instructions, to make the toolkit easy to install.

Consistency: We have made a significant effort to ensure that all the data structures and interfaces are consistent, making it easy to carry out a variety of tasks using a uniform framework.

Extensibility: The toolkit easily accommodates new components, whether those components replicate or extend existing functionality. Moreover, the toolkit is organized so that it is usually obvious where extensions would fit into the toolkit's infrastructure.

Simplicity: We have tried to provide an intuitive and appealing framework along with substantial building blocks, for students to gain a practical knowledge of NLP without getting bogged down in the tedious house-keeping usually associated with processing annotated language data.

Modularity: The interaction between different components of the toolkit uses simple, well-defined interfaces. It is possible to complete individual projects using small parts of the toolkit, without needing to understand how they interact with the rest of the toolkit. This allows students to learn how to use the toolkit incrementally throughout a course. Modularity also makes it easier to change and extend the toolkit.

Well-Documented: The toolkit comes with substantial documentation, including nomenclature, data structures, and implementations.

Contrasting with these requirements are three non-requirements — potentially useful features that we have deliberately avoided. First, while the toolkit provides a wide range of functions, it is not intended to be encyclopedic. There should be a wide variety of ways in which students can extend the toolkit. Second, while the toolkit should be efficient enough that students can use their NLP systems to perform meaningful tasks, it does not need to be highly optimized for runtime performance. Such optimizations often involve more complex algorithms, and sometimes require the use of C or C++. This would make the toolkit less accessible and more difficult to install. Third, we have avoided clever programming tricks, since clear implementations are far preferable to ingenious yet indecipherable ones.

NLTK Organization: NLTK is organized into a collection of task-specific packages. Each package is a combination of data structures for representing a particular kind of information such as trees, and implementations of standard algorithms involving those structures such as parsers. This approach is a standard feature of object-oriented design, in which components encapsulate both the resources and methods needed to accomplish a particular task.

The most fundamental NLTK components are for identifying and manipulating individual words of text. These include: tokenize, for breaking up strings of characters into word tokens; tag, for adding part-of-speech tags, including regular-expression taggers, n-gram taggers and Brill taggers; and the Porter stemmer.
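To make the two most basic tasks concrete, here is a minimal sketch in plain Python of tokenization and regular-expression tagging. It deliberately avoids the toolkit's own API, whose exact signatures we do not reproduce here; the text, patterns and tags are invented for illustration:

import re

text = "The cat is sleeping on the mat"

# tokenize: break a string of characters into word tokens
tokens = re.findall(r"\w+", text)

# tag: assign a part-of-speech tag to each token using simple
# regular-expression rules (a toy version of a regexp tagger)
patterns = [
    (r"^(the|a|an)$", "DT"),   # determiners
    (r".*ing$", "VBG"),        # gerunds
    (r".*s$", "NNS"),          # plural nouns
    (r".*", "NN"),             # default: singular noun
]

def tag(word):
    # return the tag of the first pattern that matches the word
    for pattern, pos in patterns:
        if re.match(pattern, word.lower()):
            return pos

print([(w, tag(w)) for w in tokens])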

The second kind of module is for creating and manipulating structured linguistic information. These components include: tree, for representing and processing parse trees; featurestructure, for building and unifying nested feature structures (or attribute-value matrices); cfg, for specifying context-free grammars; and parse, for creating parse trees over input text, including chart parsers, chunk parsers and probabilistic parsers.

Several utility components are provided to facilitate processing and visualization. These include: draw, to visualize NLP structures and processes; probability, to count and collate events, and perform statistical estimation; and corpora, to access tagged linguistic corpora.

A further group of components is not part of NLTK proper. These are a wide selection of third-party contributions, often developed as student projects at various institutions where NLTK is used, and distributed in a separate package called NLTK Contrib. Several of these student contributions, such as the Brill tagger and the HMM module, have now been incorporated into NLTK. Although these contributed components are not maintained, they may serve as a useful starting point for future student projects.

In addition to software and documentation, NLTK provides substantial corpus samples. Many of these can be accessed using the corpora module, avoiding the need to write specialized file-parsing code before you can do NLP tasks. These corpora include: the Brown Corpus — 1.15 million words of tagged text in 15 genres; a 10% sample of the Penn Treebank corpus, consisting of 40,000 words of syntactically parsed text; a selection of books from Project Gutenberg totalling 1.7 million words; and other corpora for chunking, prepositional phrase attachment, word-sense disambiguation, and information extraction.

Note on NLTK-Lite: Since mid-2005, the NLTK developers have been creating a lightweight version of NLTK, called NLTK-Lite. NLTK-Lite is simpler and faster than NLTK. Once it is complete, NLTK-Lite will provide the same functionality as NLTK. However, unlike NLTK, NLTK-Lite does not impose such a heavy burden on the programmer. Wherever possible, standard Python objects are used instead of custom NLP versions, so that students learning to program for the first time will be learning to program in Python with some useful libraries, rather than learning to program in NLTK.

NLTK Papers: NLTK has been presented at several international conferences with published proceedings, as listed below:

Edward Loper and Steven Bird (2002). NLTK: The Natural Language Toolkit. Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, Somerset, NJ: Association for Computational Linguistics, pp. 62-69. http://arXiv.org/abs/cs/0205028

Steven Bird and Edward Loper (2004). NLTK: The Natural Language Toolkit. Proceedings of the ACL Demonstration Session, pp. 214-217. http://eprints.unimelb.edu.au/archive/00001448/

Steven Bird (2005). NLTK-Lite: Efficient Scripting for Natural Language Processing. 4th International Conference on Natural Language Processing, pp. 1-8. http://eprints.unimelb.edu.au/archive/00001453/

Steven Bird (2006). NLTK: The Natural Language Toolkit. Proceedings of the ACL Demonstration Session. http://www.ldc.upenn.edu/sb/home/papers/nltk-demo-06.pdf

Edward Loper (2004). NLTK: Building a Pedagogical Toolkit in Python. PyCon DC 2004, Python Software Foundation. http://www.python.org/pycon/dc2004/papers/

Ewan Klein (2006). Computational Semantics in the Natural Language Toolkit. Australian Language Technology Workshop. http://www.alta.asn.au/events/altw2006/proceedings/Klein.pdf


1 Introduction to Natural Language Processing

1 Why Language Processing is Useful

How do we write programs to manipulate natural language? What questions about language could we answer? How would the programs work, and what data would they need? These are just some of the topics we will cover in this book. Before we tackle the subject systematically, we will take a quick look at some simple tasks in which computational tools manipulate language data in a variety of interesting and non-trivial ways.

Our first example involves word stress. The CMU Pronunciation Dictionary is a machine-readable dictionary that gives the pronunciation of over 125,000 words in North American English. Each entry consists of a word in standard orthography followed by a phonological transcription. For example, the entry for language is the following:

(1) language / L AE1 NG G W AH0 JH

Each character or group of characters following the slash represents an English phoneme (e.g., 'AE' corresponds to the IPA symbol æ), and the final numbers indicate word stress. That is, 'AE1' is the nucleus of a stressed syllable, while 'AH0' is the nucleus of an unstressed one. Let's suppose that we want to find every word in the dictionary which exhibits a particular stress pattern; say, words whose primary stress is on their fifth-last syllable (this is called pre-preantepenultimate stress). Searching through the dictionary by hand would be tedious, and we would probably miss some of the cases. We can write a simple program that will extract the numerals 0, 1 and 2 from each transcription, and then create a new field stress_pattern for each word which is just the sequence of these stress numbers. After this has been done, it is easy to scan the extracted stress patterns for any sequence which ends with 1 0 0 0 0. Here are some of the words that we can find using this method:

(2) ACCUMULATIVELY / AH0 K Y UW1 M Y AH0 L AH0 T IH0 V L IY0
    AGONIZINGLY / AE1 G AH0 N AY0 Z IH0 NG L IY0
    CARICATURIST / K EH1 R AH0 K AH0 CH ER0 AH0 S T
    CUMULATIVELY / K Y UW1 M Y AH0 L AH0 T IH0 V L IY0
    FORMALIZATION / F AO1 R M AH0 L AH0 Z EY0 SH AH0 N
    HYPERSENSITIVITY / HH AY2 P ER0 S EH1 N S AH0 T IH0 V AH0 T IY0
    IMAGINATIVELY / IH2 M AE1 JH AH0 N AH0 T IH0 V L IY0
    INSTITUTIONALIZES / IH2 N S T AH0 T UW1 SH AH0 N AH0 L AY0 Z AH0 Z
    SPIRITUALIST / S P IH1 R IH0 CH AH0 W AH0 L AH0 S T
    UNALIENABLE / AH0 N EY1 L IY0 EH0 N AH0 B AH0 L
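Here is a minimal sketch of how such a search might be written. It assumes the dictionary has already been loaded as a list of (word, phonemes) pairs; the details of reading the dictionary file are glossed over:

# A sketch of the pre-preantepenultimate stress search described
# above, assuming `entries` is a list of (word, phonemes) pairs,
# e.g. ("LANGUAGE", ["L", "AE1", "NG", "G", "W", "AH0", "JH"]).

def stress_pattern(phonemes):
    # keep just the stress digits 0, 1, 2 carried by the vowels
    return [ph[-1] for ph in phonemes if ph[-1] in "012"]

def pre_preantepenultimate(entries):
    for word, phonemes in entries:
        # primary stress on the fifth-last syllable: ... 1 0 0 0 0
        if stress_pattern(phonemes)[-5:] == ["1", "0", "0", "0", "0"]:
            print(word)

pre_preantepenultimate([("AGONIZINGLY",
    ["AE1", "G", "AH0", "N", "AY0", "Z", "IH0", "NG", "L", "IY0"])])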

Our second example also involves phonology. When we construct an inventory of sounds for a language, we are usually interested in just those sounds which can make a difference in word meaning.


To do this, we look for minimal pairs; that is, distinct words which differ in only one sound. For example, we might argue that the sounds [p] and [b] in English are distinctive because if we replace one with the other, we often end up with a different word:

(3) pat vs. bat
    nip vs. nib

Suppose we want to do this more systematically for a language where we have a list of words, but are still trying to determine the sound inventory. As a case in point, NLTK includes a lexicon for Rotokas, an East Papuan language spoken on Bougainville Island, Papua New Guinea. Let's suppose we are interested in how many vowels there are in Rotokas. We can write a program to find all four-letter words which differ only by their first vowel, and tabulate the results to illustrate vowel contrasts:

(4) kasi - kesi kusi kosi
    kava - - kuva kova
    karu kiru keru kuru koru
    kapu kipu - - kopu
    karo kiro - - koro
    kari kiri keri kuri kori
    kapa - kepa - kopa
    kara kira kera - kora
    kaku - - kuku koku
    kaki kiki - - koki
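The following sketch shows one way to produce such a table, assuming the Rotokas word list is available as a plain Python list; loading the actual NLTK lexicon is glossed over here:

from collections import defaultdict

def vowel_contrasts(lexicon, vowels="aiueo"):
    # group four-letter words by what surrounds their first vowel
    groups = defaultdict(dict)
    for word in lexicon:
        if len(word) == 4 and word[1] in vowels:
            skeleton = word[0] + "_" + word[2:]
            groups[skeleton][word[1]] = word
    # tabulate one row per group, with a dash where no word exists
    for skeleton, found in sorted(groups.items()):
        print(" ".join(found.get(v, "-") for v in vowels))

vowel_contrasts(["kasi", "kesi", "kusi", "kosi", "kava", "kuva", "kova"])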

The two preceding examples have used lexical resources. We can also write programs to analyze texts in various ways. In this example, we try to build a model of the patterns of adjacent words in the book of Genesis. Each pair of adjacent words is known as a bigram, and we can build a very simple model of biblical language just by counting bigrams. There are many useful things we can do with such information, such as identifying genres of literature or even identifying the author of a piece of text. Here we use it for a more whimsical purpose: to generate random text in the style of Genesis. As you will see, we have managed to capture something about the flow of text from one word to the next, but beyond this it is simply nonsense:

(5) lo, it came to the land of his father and he said, i will not be a
    wife unto him, saying, if thou shalt take our money in their kind,
    cattle, in thy seed after these are my son from off any more than all
    that is this day with him into egypt, he, hath taken away unawares to
    pass, when she bare jacob said one night, because they were born two
    hundred years old, as for an altar there, he had made me out at her
    pitcher upon every living creature after thee shall come near her:
    yea,
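A minimal sketch of this bigram-based generation, assuming the text of Genesis is available as a list of word tokens; the short word list below is just a stand-in:

import random
from collections import defaultdict

def train_bigrams(words):
    # for each word, record every word observed to follow it
    successors = defaultdict(list)
    for w1, w2 in zip(words[:-1], words[1:]):
        successors[w1].append(w2)
    return successors

def generate(successors, word, length=50):
    output = [word]
    for _ in range(length):
        if not successors[word]:
            break                 # dead end: no observed successor
        word = random.choice(successors[word])
        output.append(word)
    return " ".join(output)

words = "lo it came to pass that it came to the land".split()
print(generate(train_bigrams(words), "lo", length=8))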

For our last example, let's suppose we are engaged in research to study semantic contrast in English verbs. We hypothesize that one useful source of data for exploring such contrasts might be a list of verb phrases which are conjoined with the word but. So we need to carry out some grammatical analysis to find conjoined verb phrases, and also need to be able to specify but as the conjunction. Rather than trying to do the grammatical analysis ourselves, we can make use of a resource in which the syntactic trees have already been added to lots of sentences. The best known of such resources is the University of Pennsylvania Treebank corpus (or Penn Treebank for short), and we can write a program to read trees from this corpus, find instances of verb phrase conjunctions involving the word but, and display parsed text corresponding to the two verb phrases.

(6) (VBZ has) (VP opened its market to foreign cigarettes)
    BUT (VBZ restricts) (NP advertising) (PP-CLR to designated places)

    (VBZ admits) (SBAR 0 she made a big mistake)
    BUT (VBD did) (RB n't) (VP elaborate)

    (VBD confirmed) (SBAR 0 he had consented to the sanctions)
    BUT (VBD declined) (S *-1 to comment further)

    (VBP are) (NP-PRD a guide to general levels)
    BUT (VBP do) (RB n't) (ADVP-TMP always) (VP represent actual transactions)

    (VBN flirted) (PP with a conversion to tabloid format) (PP-TMP for years)
    BUT (ADVP-TMP never) (VBN executed) (NP the plan)

    (VBD ended) (ADVP-CLR slightly higher)
    BUT (VBD trailed) (NP gains in the Treasury market)

    (VBD confirmed) (NP the filing)
    BUT (MD would) (RB n't) (VP elaborate)
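Here is a sketch of how such a search might be written. Note that it uses the tree-searching idiom of the modern NLTK rather than the 2006 module layout described earlier, and assumes the Penn Treebank sample that ships with the toolkit has been installed:

from nltk.corpus import treebank
from nltk.tree import Tree

def vp_but_conjunctions():
    # scan the Treebank sample for verb phrases that immediately
    # dominate the coordinating conjunction "but"
    for tree in treebank.parsed_sents():
        for vp in tree.subtrees(lambda t: t.label() == "VP"):
            for child in vp:
                if (isinstance(child, Tree) and child.label() == "CC"
                        and child.leaves() == ["but"]):
                    print(" ".join(vp.leaves()))

vp_but_conjunctions()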

In presenting these examples, we have tried to give you a flavour of the range of things that can be done with natural language using computational tools. All the above examples were generated using simple programming techniques and a small amount of Python code. After working through the first few chapters of this book, you will be able to write such programs yourself. In the process, you will come to understand the basics of natural language processing (henceforth abbreviated as NLP) as a subject. In the remainder of this chapter, we will give you more reasons to think that NLP is both important and fun.

2 The Language Challenge

2.1 Language is rich and complex

Language is the chief manifestation of human intelligence. Through language we express basic needs and lofty aspirations, technical know-how and flights of fantasy. Ideas are shared over great separations of distance and time. The following samples from English illustrate the richness of language:

1. a. Overhead the day drives level and grey, hiding the sun by a flight of grey spears. (William Faulkner, As I Lay Dying, 1935)

b. When using the toaster please ensure that the exhaust fan is turned on. (sign in dormitory kitchen)

c. Amiodarone weakly inhibited CYP2C9, CYP2D6, and CYP3A4-mediated activities with Ki values of 45.1-271.6 µM (Medline, PMID: 10718780)

d. Iraqi Head Seeks Arms (spoof news headline)

e. The earnest prayer of a righteous man has great power and wonderful results.(James 5:16b)

f. 'Twas brillig, and the slithy toves did gyre and gimble in the wabe (Lewis Carroll, Jabberwocky, 1872)


g. There are two ways to do this, AFAIK :smile: (internet discussion archive)

Thanks to this richness, the study of language is part of many disciplines outside of linguistics, including translation, literary criticism, philosophy, anthropology and psychology. Many less obvious disciplines investigate language use, such as law, hermeneutics, forensics, telephony, pedagogy, archaeology, cryptanalysis and speech pathology. Each applies distinct methodologies to gather observations, develop theories and test hypotheses. Yet all serve to deepen our understanding of language and of the intellect which is manifested in language.

The importance of language to science and the arts is matched in significance by the cultural treasure embodied in language. Each of the world's ~7,000 human languages is rich in unique respects, in its oral histories and creation legends, down to its grammatical constructions and its very words and their nuances of meaning. Threatened remnant cultures have words to distinguish plant subspecies according to therapeutic uses which are unknown to science. Languages evolve over time as they come into contact with each other, and they provide a unique window onto human pre-history. Technological change gives rise to new words like blog and new morphemes like e- and cyber-. In many parts of the world, small linguistic variations from one town to the next add up to a completely different language in the space of a half-hour drive. For its breathtaking complexity and diversity, human language is like a colourful tapestry stretching through time and space.

2.2 Language and the Internet

Today, both professionals and ordinary people are confronted by unprecedented volumes of information, the vast bulk of which is stored as unstructured text. In 2003, it was estimated that the annual production of books amounted to 8 Terabytes. (A Terabyte is 1,000 Gigabytes, i.e., equivalent to 1,000 pickup trucks filled with books.) It would take a human being about five years to read the new scientific material that is produced every 24 hours. Although these estimates are based on printed materials, increasingly the information is also available electronically. Indeed, there has been an explosion of text and multimedia content on the World Wide Web. For many people, a large and growing fraction of work and leisure time is spent navigating and accessing this universe of information.

The presence of so much text in electronic form is a huge challenge to NLP. Arguably, the only way for humans to cope with the information explosion is to exploit computational techniques which can sift through huge bodies of text.

Although existing search engines have been crucial to the growth and popularity of the Web, humans require skill, knowledge, and some luck, to extract answers to such questions as What tourist sites can I visit between Philadelphia and Pittsburgh on a limited budget? What do expert critics say about Canon digital cameras? What predictions about the steel market were made by credible commentators in the past week? Getting a computer to answer them automatically is a realistic long-term goal, but would involve a range of language processing tasks, including information extraction, inference, and summarization, and would need to be carried out on a scale and with a level of robustness that is still beyond our current capabilities.

2.3 The Promise of NLP

As we have seen, NLP is important for scientific, economic, social, and cultural reasons. NLP is experiencing rapid growth as its theories and methods are deployed in a variety of new language technologies. For this reason it is important for a wide range of people to have a working knowledge of NLP. Within academia, this includes people in areas from humanities computing and corpus linguistics through to computer science and artificial intelligence. Within industry, it includes people in human-computer interaction, business information analysis, and Web software development. We hope that you, a member of this diverse audience reading these materials, will come to appreciate the workings of this rapidly growing field of NLP and will apply its techniques in the solution of real-world problems.

The following chapters present a carefully-balanced selection of theoretical foundations and practical application, and equip readers to work with large datasets, to create robust models of linguistic phenomena, and to deploy them in working language technologies. By integrating all of this into the Natural Language Toolkit (NLTK), we hope this book opens up the exciting endeavour of practical natural language processing to a broader audience than ever before.

3 Language and Computation

3.1 NLP and Intelligence

A long-standing challenge within computer science has been to build intelligent machines. The chief measure of machine intelligence has been a linguistic one, namely the Turing Test: can a dialogue system, responding to a user's typed input with its own textual output, perform so naturally that users cannot distinguish it from a human interlocutor using the same interface? Today, there is substantial ongoing research and development in such areas as machine translation and spoken dialogue, and significant commercial systems are in widespread use. The following dialogue illustrates a typical application:

S: How may I help you?
U: When is Saving Private Ryan playing?
S: For what theater?
U: The Paramount theater.
S: Saving Private Ryan is not playing at the Paramount theater, but
   it's playing at the Madison theater at 3:00, 5:30, 8:00, and 10:30.

Today's commercial dialogue systems are strictly limited to narrowly-defined domains. We could not ask the above system to provide driving instructions or details of nearby restaurants unless the requisite information had already been stored and suitable question and answer sentences had been incorporated into the language processing system. Observe that the above system appears to understand the user's goals: the user asks when a movie is showing and the system correctly determines from this that the user wants to see the movie. This inference seems so obvious to humans that we usually do not even notice it has been made, yet a natural language system needs to be endowed with this capability in order to interact naturally. Without it, when asked Do you know when Saving Private Ryan is playing, a system might simply — and unhelpfully — respond with a cold Yes. While it appears that this dialogue system can perform simple inferences, such sophistication is only found in cutting-edge research prototypes. Instead, the developers of commercial dialogue systems use contextual assumptions and simple business logic to ensure that the different ways in which a user might express requests or provide information are handled in a way that makes sense for the particular application. Thus, whether the user says When is ..., or I want to know when ..., or Can you tell me when ..., simple rules will always yield screening times. This is sufficient for the system to provide a useful service.
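To give a feel for the kind of simple rule just described, here is a toy sketch in which one regular expression maps all three phrasings onto the same underlying request; the pattern and the request format are invented for illustration only:

import re

# several surface phrasings, one underlying request
REQUEST = re.compile(
    r"when\s+(?:is\s+)?(.+?)\s+(?:is\s+)?(?:playing|showing)", re.I)

def interpret(utterance):
    match = REQUEST.search(utterance)
    if match:
        return ("SHOWTIMES", match.group(1))  # look up screening times

print(interpret("When is Saving Private Ryan playing?"))
print(interpret("Can you tell me when Saving Private Ryan is playing?"))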

Despite some recent advances, it is generally true that those natural language systems which have been fully deployed still cannot perform common-sense reasoning or draw on world knowledge. We can wait for these difficult artificial intelligence problems to be solved, but in the meantime it is necessary to live with some severe limitations on the reasoning and knowledge capabilities of natural language systems. Accordingly, right from the beginning, an important goal of NLP research has been to make progress on the holy grail of natural linguistic interaction without recourse to this unrestricted knowledge and reasoning capability. This is an old challenge, and so it is instructive to review the history of the field.

3.2 Language and Symbol Processing

The very notion that natural language could be treated in a computational manner grew out of a research program, dating back to the early 1900s, to reconstruct mathematical reasoning using logic, most clearly manifested in work by Frege, Russell, Wittgenstein, Tarski, Lambek and Carnap. This work led to the notion of language as a formal system amenable to automatic processing. Three later developments laid the foundation for natural language processing. The first was formal language theory. This defined a language as a set of strings accepted by a class of automata, such as context-free languages and pushdown automata, and provided the underpinnings for computational syntax.

The second development was symbolic logic. This provided a formal method for capturing selected aspects of natural language that are relevant for expressing logical proofs. A formal calculus in symbolic logic provides the syntax of a language, together with rules of inference and, possibly, rules of interpretation in a set-theoretic model; examples are propositional logic and First Order Logic. Given such a calculus, with a well-defined syntax and semantics, it becomes possible to associate meanings with expressions of natural language by translating them into expressions of the formal calculus. For example, if we translate John saw Mary into a formula saw(j,m), we (implicitly or explicitly) interpret the English verb saw as a binary relation, and John and Mary as denoting individuals. More general statements like All birds fly require quantifiers, in this case ∀, meaning for all: ∀x(bird(x) → fly(x)). This use of logic provided the technical machinery to perform inferences that are an important part of language understanding.
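To make the inference machinery concrete, here is the classic textbook syllogism written in the same notation (the individual tweety is our own illustrative constant):

∀x(bird(x) → fly(x))      All birds fly
bird(tweety)              Tweety is a bird
------------------------------------------------------
fly(tweety)               by universal instantiation and modus ponens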

A closely related development was the principle of compositionality, namely that the meaning of a complex expression is composed of the meaning of its parts and their mode of combination. This principle provided a useful correspondence between syntax and semantics, namely that the meaning of a complex expression could be computed recursively. Consider the sentence It is not true that p, where p is a proposition. We can represent the meaning of this sentence as not(p). Similarly, we can represent the meaning of John saw Mary as saw(j,m). Now we can compute the interpretation of It is not true that John saw Mary recursively, using the above information, to get not(saw(j,m)).

The approaches just outlined share the premise that computing with natural language crucially relies on rules for manipulating symbolic representations. For a certain period in the development of NLP, particularly during the 1980s, this premise provided a common starting point for both linguists and practitioners of NLP, leading to a family of grammar formalisms known as unification-based (or feature-based) grammar, and to NLP applications implemented in the Prolog programming language. Although grammar-based NLP is still a significant area of research, it has become somewhat eclipsed in the last 15-20 years due to a variety of factors. One significant influence came from automatic speech recognition. Although early work in speech processing adopted a model which emulated the kind of rule-based phonological processing typified by Chomsky and Halle's SPE, this turned out to be hopelessly inadequate in dealing with the hard problem of recognizing actual speech in anything like real time. By contrast, systems which involved learning patterns from large bodies of speech data were significantly more accurate, efficient and robust. In addition, the speech community found that progress in building better systems was hugely assisted by the construction of shared resources for quantitatively measuring performance against common test data. Eventually, much of the NLP community embraced a data-intensive orientation to language processing, coupled with a growing use of machine-learning techniques and evaluation-led methodology.

3.3 Philosophical Divides

The contrasting approaches to NLP described in the preceding section relate back to early metaphysical debates about rationalism versus empiricism and realism versus idealism that occurred in the Enlightenment period of Western philosophy. These debates took place against a backdrop of orthodox thinking in which the source of all knowledge was believed to be divine revelation. During this period of the seventeenth and eighteenth centuries, philosophers argued that human reason or sensory experience has priority over revelation. Descartes and Leibniz, amongst others, took the rationalist position, asserting that all truth has its origins in human thought, and in the existence of 'innate ideas' implanted in our minds from birth. For example, they argued that the principles of Euclidean geometry were developed using human reason, and were not the result of supernatural revelation or sensory experience. In contrast, Locke and others took the empiricist view, that our primary source of knowledge is the experience of our faculties, and that human reason plays a secondary role in reflecting on that experience. Prototypical evidence for this position was Galileo's discovery — based on careful observation of the motion of the planets — that the solar system is heliocentric and not geocentric. In the context of linguistics, this debate leads to the following question: to what extent does human linguistic experience, versus our innate 'language faculty', provide the basis for our knowledge of language? In NLP this matter surfaces as differences in the priority of corpus data versus linguistic introspection in the construction of computational models. We will return to this issue later in the book.

A further concern, enshrined in the debate between realism and idealism, was the metaphysical status of the constructs of a theory. Kant argued for a distinction between phenomena, the manifestations we can experience, and "things in themselves" which can never be known directly. A linguistic realist would take a theoretical construct like noun phrase to be a real-world entity that exists independently of human perception and reason, and which actually causes the observed linguistic phenomena. A linguistic idealist, on the other hand, would argue that noun phrases, along with more abstract constructs like semantic representations, are intrinsically unobservable, and simply play the role of useful fictions. The way linguists write about theories often betrays a realist position, while NLP practitioners occupy neutral territory or else lean towards the idealist position. Thus, in NLP, it is often enough if a theoretical abstraction leads to a useful result; it does not matter whether this result sheds any light on human linguistic processing.

These issues are still alive today, and show up in the distinctions between symbolic vs statistical methods, deep vs shallow processing, binary vs gradient classifications, and scientific vs engineering goals. However, such contrasts are now highly nuanced, and the debate is no longer as polarized as it once was. In fact, most of the discussions — and most of the advances even — involve a 'balancing act'. For example, one intermediate position is to assume that humans are innately endowed with analogical and memory-based learning methods (weak rationalism), and use these methods to identify meaningful patterns in their sensory language experience (empiricism). For a more concrete illustration, consider the way in which statistics from large corpora may serve as evidence for binary choices in a symbolic grammar. For instance, dictionaries describe the words absolutely and definitely as nearly synonymous, yet their patterns of usage are quite distinct when combined with a following verb, as shown below:

Absolutely vs Definitely (Liberman 2005, LanguageLog.org)

Bird, Klein & Loper -7 December 6, 2006

Page 18: Introduction to Natural Language Processing ...ce.aut.ac.ir/islab/courses/NLP/archive/1388/s1/nltk-book.pdf · This textbook provides a comprehensive introduction to the eld of natural

Introduction to Natural Language Processing (DRAFT) .

Google hits    adore     love      like      prefer
absolutely     289,000   905,000   16,200    644
definitely     1,460     51,000    158,000   62,600
ratio          198:1     18:1      1:10      1:97

As you will see, absolutely adore is about 200 times as popular as definitely adore, while absolutely prefer is about 100 times rarer than definitely prefer. This information is used by statistical language models, but it also counts as evidence for a symbolic account of word combination in which absolutely can only modify extreme actions or attributes, a property that could be represented as a binary-valued feature of certain lexical items. Thus, we see statistical data informing symbolic models. Once this information has been codified symbolically, it is available to be exploited as a contextual feature for statistical language modelling, alongside many other rich sources of symbolic information, like hand-constructed parse trees and semantic representations. Now the circle is closed, and we see symbolic information informing statistical models.

This new rapprochement is giving rise to many exciting new developments. We will touch on some of these in the ensuing pages. We too will perform this balancing act, employing approaches to NLP that integrate these historically-opposed philosophies and methodologies.

4 The Architecture of Linguistic and NLP Systems

4.1 Generative Grammar and Modularity

One of the intellectual descendants of formal language theory was the linguistic framework known as generative grammar. Such a grammar contains a set of rules that recursively specify (or generate) the set of well-formed strings in a language. While there is a wide spectrum of models which owe some allegiance to this core, Chomsky's transformational grammar, in its various incarnations, is probably the best known. In the Chomskyan tradition, it is claimed that humans have distinct kinds of linguistic knowledge, organized into different modules: for example, knowledge of a language's sound structure (phonology), knowledge of word structure (morphology), knowledge of phrase structure (syntax), and knowledge of meaning (semantics). In a formal linguistic theory, each kind of linguistic knowledge is made explicit as a different module of the theory, consisting of a collection of basic elements together with a way of combining them into complex structures. For example, a phonological module might provide a set of phonemes together with an operation for concatenating phonemes into phonological strings. Similarly, a syntactic module might provide labelled nodes as primitives together with a mechanism for assembling them into trees. A set of linguistic primitives, together with some operators for defining complex elements, is often called a level of representation.

As well as defining modules, a generative grammar will prescribe how the modules interact. For example, well-formed phonological strings will provide the phonological content of words, and words will provide the terminal elements of syntax trees. Well-formed syntactic trees will be mapped to semantic representations, and contextual or pragmatic information will ground these semantic representations in some real-world situation.

As we indicated above, an important aspect of theories of generative grammar is that they are intended to model the linguistic knowledge of speakers and hearers; they are not intended to explain how humans actually process linguistic information. This is, in part, reflected in the claim that a generative grammar encodes the competence of an idealized native speaker, rather than the speaker's performance. A closely related distinction is to say that a generative grammar encodes declarative rather than procedural knowledge. Declarative knowledge can be glossed as 'knowing what', whereas procedural knowledge is 'knowing how'. As you might expect, computational linguistics has the crucial role of proposing procedural models of language. A central example is parsing, where we have to develop computational mechanisms which convert strings of words into structural representations such as syntax trees. Nevertheless, it is widely accepted that well-engineered computational models of language contain both declarative and procedural aspects. Thus, a full account of parsing will say how declarative knowledge in the form of a grammar and lexicon combines with procedural knowledge which determines how a syntactic analysis should be assigned to a given string of words. This procedural knowledge will be expressed as an algorithm: that is, an explicit recipe for mapping some input into an appropriate output in a finite number of steps.

A simple parsing algorithm for context-free grammars, for instance, looks first for a rule of the form S → X1 ... Xn, and builds a partial tree structure. It then steps through the grammar rules one by one, looking for a rule of the form X1 → Y1 ... Yj which will expand the leftmost daughter introduced by the S rule, and further extends the partial tree. This process continues, for example by looking for a rule of the form Y1 → Z1 ... Zk and expanding the partial tree appropriately, until the leftmost node label in the partial tree is a lexical category; the parser then checks to see if the first word of the input can belong to the category. To illustrate, let's suppose that the first grammar rule chosen by the parser is S → NP VP and the second rule chosen is NP → Det N; then the partial tree will be as follows:

        S
       / \
     NP   VP
    /  \
  Det   N

If we assume that the input string we are trying to parse is the cat slept, we will succeed in identifying the as a word which can belong to the category Det. In this case, the parser goes on to the next node of the tree, N, and the next input word, cat. However, if we had built the same partial tree with an input string did the cat sleep, the parse would fail at this point, since did is not of category Det. The parser would throw away the structure built so far and look for an alternative way of going from the S node down to a leftmost lexical category (e.g., using a rule S → V NP VP). The important point for now is not the details of this or other parsing algorithms; we discuss this topic much more fully in the chapter on parsing. Rather, we just want to illustrate the idea that an algorithm can be broken down into a fixed number of steps which produce a definite result at the end.
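To make this concrete, here is a minimal sketch of the top-down strategy just described, written as a recognizer for a toy grammar. The grammar, the lexicon and the function name below are our own illustrative choices, not part of any toolkit:

grammar = {
    'S':  [['NP', 'VP']],
    'NP': [['Det', 'N']],
    'VP': [['V']],
}
lexicon = {'Det': ['the'], 'N': ['cat'], 'V': ['slept']}

def recognize(categories, words):
    # No categories left to expand: succeed only if all input is consumed.
    if not categories:
        return len(words) == 0
    cat, rest = categories[0], categories[1:]
    if cat in lexicon:
        # Leftmost category is lexical: check the first input word.
        if words and words[0] in lexicon[cat]:
            return recognize(rest, words[1:])
        return False            # e.g. did is not of category Det
    for expansion in grammar.get(cat, []):
        # Expand the leftmost category, extending the partial analysis.
        if recognize(expansion + rest, words):
            return True
    return False

print recognize(['S'], 'the cat slept'.split())      # True
print recognize(['S'], 'did the cat sleep'.split())  # False

The for loop plays the role of backtracking: when a lexical category fails to match the input, the partial analysis is abandoned and an alternative expansion is tried.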

In Figure 1 we further illustrate some of these points in the context of a spoken dialogue system, such as our earlier example of an application that offers the user information about movies currently on show.

Down the lefthand side of the diagram is a 'pipeline' of some representative speech understanding components. These map from speech input via syntactic parsing to some kind of meaning representation. Up the righthand side is an inverse pipeline of components for concept-to-speech generation. These components constitute the dynamic or procedural aspect of the system's natural language processing. In the central column of the diagram are some representative bodies of static information: the repositories of language-related data which are called upon by the processing components.

The diagram illustrates that linguistically-motivated ways of modularizing linguistic knowledge are often reflected in computational systems. That is, the various components are organized so that the data which they exchange corresponds roughly to different levels of representation. For example, the output of the speech analysis component will contain sequences of phonological representations of words, and the output of the parser will be a semantic representation.


Figure 1: Architecture of Spoken Dialogue System

Of course the parallel is not precise, in part because it is often a matter of practical expedience where to place the boundaries between different processing components. For example, we can assume that within the parsing component there is a level of syntactic representation, although we have chosen not to expose this at the level of the system diagram. Despite such idiosyncrasies, most NLP systems break down their work into a series of discrete steps. In the process of natural language understanding, these steps go from more concrete levels to more abstract ones, while in natural language production, the direction is reversed.

5 Before Proceeding Further...

An important aspect of learning NLP using these materials is to experience both the challenge and — we hope — the satisfaction of creating software to process natural language. The accompanying software, NLTK, is available for free and runs on most operating systems including Linux/Unix, Mac OS X and Microsoft Windows. You can download NLTK from <http://nltk.sourceforge.net/>, along with extensive documentation. We encourage you to install NLTK on your machine before reading beyond the end of this chapter.

6 Further Reading

Several NLP systems have online interfaces that you might like to experiment with, e.g.:

� WordNet: <http://wordnet.princeton.edu/>

� Translation: <http://world.altavista.com/>

Bird, Klein & Loper -10 December 6, 2006

Page 21: Introduction to Natural Language Processing ...ce.aut.ac.ir/islab/courses/NLP/archive/1388/s1/nltk-book.pdf · This textbook provides a comprehensive introduction to the eld of natural

Introduction to Natural Language Processing (DRAFT) .

� ChatterBots: <http://www.loebner.net/Prizef/loebner-prize.html>

� Question Answering: <http://www.answerbus.com/>

� Summarisation: <http://newsblaster.cs.columbia.edu/>

Useful websites with substantial information about NLP:

� <http://www.lt-world.org/>

� <http://www.aclweb.org/>

� <http://www.elsnet.org/>

The ACL website contains an overview of computational linguistics, including copies of introductory chapters from recent textbooks, at <http://www.aclweb.org/archive/what.html>.

Recent field-wide surveys: Mitkov, Dale et al, HLT Survey.

Acknowledgements: The dialogue example is taken from Bob Carpenter and Jennifer Chu-Carroll's ACL-99 Tutorial on Spoken Dialogue Systems.

About this document... This chapter is a draft from Introduction to Natural Language Processing, by Steven Bird, Ewan Klein and Edward Loper, Copyright 2006 the authors. It is distributed with the Natural Language Toolkit [http://nltk.sourceforge.net], Version 0.7b1, under the terms of the Creative Commons Attribution-ShareAlike License [http://creativecommons.org/licenses/by-sa/2.5/].


PART I: BASICS


2. Programming Fundamentals and Python

This chapter provides a non-technical overview of Python and will cover the basic programming knowledge needed for the rest of the chapters in Part I. It contains many examples and exercises; there is no better way to learn to program than to dive in and try these yourself. You should then feel confident in adapting the examples for your own purposes. Before you know it you will be programming!

2.1 Python the Calculator

One of the friendly things about Python is that it allows you to type directly into the interactive interpreter — the program that will be running your Python programs. We want you to be completely comfortable with this before we begin, so let's start it up:

Python 2.4.3 (#1, Mar 30 2006, 11:02:16)
[GCC 4.0.1 (Apple Computer, Inc. build 5250)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>

This blurb depends on your installation; the main thing to check is that you are running Python 2.4 or greater (here it is 2.4.3). The >>> prompt indicates that the Python interpreter is now waiting for input. If you are using the Python interpreter through the Interactive DeveLopment Environment (IDLE) then you should see a colorized version. We have colorized our examples in the same way, so that you can tell if you have typed the code correctly. Let's begin by using the Python prompt as a calculator:

>>> 3 + 2 * 5 - 1
12
>>>

There are several things to notice here. First, once the interpreter has finished calculating the answer and displaying it, the prompt reappears. This means the Python interpreter is waiting for another instruction. Second, notice that Python deals with the order of operations correctly (unlike some older calculators), so the multiplication 2 * 5 is calculated before it is added to 3.

Try a few more expressions of your own. You can use asterisk (*) for multiplication and slash (/) for division, and parentheses for bracketing expressions. One strange thing you might come across is that division doesn't always behave how you expect:

>>> 3/3
1
>>> 1/3
0
>>>


The second case is surprising because we would expect the answer to be 0.333333. We will come back to why that is the case later on in this chapter. For now, let's simply observe that these examples demonstrate how you can work interactively with the interpreter, allowing you to experiment and explore. Also, as you will see later, your intuitions about numerical expressions will be useful for manipulating other kinds of data in Python.
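As a quick preview (an aside; we return to this properly later), if either operand is written as a decimal number, Python performs the division you probably expected:

>>> 1.0/3
0.33333333333333331
>>>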

You should also try nonsensical expressions to see how the interpreter handles them:

>>> 1 +
Traceback (most recent call last):
  File "<stdin>", line 1
    1 +
      ^
SyntaxError: invalid syntax
>>>

Here we have produced a syntax error. It doesn't make sense to end an instruction with a plus sign. The Python interpreter indicates the line where the problem occurred.

2.2 Understanding the Basics: Strings and Variables

2.2.1 Representing text

We can't simply type text directly into the interpreter because it would try to interpret the text as part of the Python language:

>>> Hello World
Traceback (most recent call last):
  File "<stdin>", line 1
    Hello World
              ^
SyntaxError: invalid syntax
>>>

Here we see an error message. Note that the interpreter is confused about the position of the error, and points to the end of the string rather than the start.

Python represents a piece of text using a string. Strings are delimited — or separated from the rest of the program — by quotation marks:

>>> 'Hello World'
'Hello World'
>>> "Hello World"
'Hello World'
>>>

We can use either single or double quotation marks, as long as we use the same ones on either end of the string.

Now we can perform calculator-like operations on strings. For example, adding two strings together seems intuitive enough that you could guess the result:

>>> 'Hello' + 'World'
'HelloWorld'
>>>


When applied to strings, the + operation is called concatenation. It produces a new string which is a copy of the two original strings pasted together end-to-end. Notice that concatenation doesn't do anything clever like insert a space between the words. The Python interpreter has no way of knowing that you want a space; it does exactly what it is told. Given the example of +, you might be able to guess what multiplication will do:

>>> 'Hi' + 'Hi' + 'Hi'
'HiHiHi'
>>> 'Hi' * 3
'HiHiHi'
>>>

The point to take from this (apart from learning about strings) is that in Python, intuition about what should work gets you a long way, so it is worth just trying things to see what happens. You are very unlikely to break something, so just give it a go.

2.2.2 Storing and reusing values

After a while, it can get quite tiresome to keep retyping Python statements over and over again. It would be nice to be able to store the value of an expression like 'Hi' + 'Hi' + 'Hi' so that we can use it again. We do this by saving results to a location in the computer's memory, and giving the location a name. Such a named place is called a variable. In Python we create variables by assignment, which involves putting a value into the variable:

>>> msg = 'Hello World'
>>> msg
'Hello World'
>>>

Here we have created a variable called msg (short for 'message') and set it to have the string value 'Hello World'. We used the = operation, which assigns the value of the expression on the right to the variable on the left. Notice the Python interpreter does not print any output; it only prints output when the statement returns a value, and an assignment statement returns no value. On the second line we inspect the contents of the variable by naming it on the command line: that is, we use the name msg. The interpreter prints out the contents of the variable on the next line.

We can use variables in any place where we used values previously:

>>> msg + msg
'Hello WorldHello World'
>>> three = 3
>>> msg * three
'Hello WorldHello WorldHello World'
>>>

We can also assign a new value to a variable just by using assignment again:

>>> msg = msg * 2
>>> msg
'Hello WorldHello World'
>>>

Here we have taken the value of msg, multiplied it by 2 and then stored that new string ('Hello WorldHello World') back into the variable msg.


2.2.3 Printing and inspecting strings

So far, when we have wanted to look at the contents of a variable or see the result of a calculation, we have just typed the variable name into the interpreter. For example, we can look at the contents of msg using:

>>> msg
'Hello World'
>>>

However, there are some situations where this isn't going to do what we want. To see this, open a text editor, and create a file called test.py, containing the single line:

msg = 'Hello World'

Now, open this file in IDLE, then go to the Run menu, and select the command Run Module. The result in the main IDLE window should look like this:

>>> ================================ RESTART ================================
>>>
>>>

But where is the output showing the value of msg? The answer is that the program in test.py will only show a value if you explicitly tell it to, using the print command. So add another line to test.py so that it looks as follows:

msg = 'Hello World'
print msg

Select Run Module again, and this time you should get output which looks like this:

>>> ================================ RESTART ================================
>>>
Hello World
>>>

On close inspection, you will see that the quotation marks which indicate that Hello World is a string are missing in this case. That is because inspecting a variable (only possible within the interactive interpreter) prints out the Python representation of a value, whereas the print statement only prints out the value itself, which in this case is just the text in the string.

You will see that you get the same results if you use the print command in the interactive interpreter:

>>> print msg
Hello World
>>>

In fact, you can use a sequence of comma-separated expressions in a print statement.

>>> msg2 = 'Goodbye'
>>> print msg, msg2
Hello World Goodbye
>>>

So, if you want the users of your program to be able to see something then you need to use print. If you just want to check the contents of a variable while you are developing your program in the interactive interpreter, then you can just type the variable name directly into the interpreter.


2.2.4 Exercises

1. Start up the Python interpreter (e.g. by running IDLE). Try the examples in the last section [xref sect:1], then experiment with using Python as a calculator.

2. Try the examples in this section [xref sect:2], then try the following.

a) Create a variable called msg and put a message of your own in this variable. Remember that strings need to be quoted, so you will need to type something like:

>>> msg = "I like NLP!"

b) Now print the contents of this variable in two ways, first by simply typing the variable name and pressing enter, then by using the print command.

c) Try various arithmetic expressions using this string, e.g. msg + msg, and 5 * msg.

d) Define a new string hello, and then try hello + msg. Change the hello string so that it ends with a space character, and then try hello + msg again.

2.3 Slicing and Dicing

Strings are so important (especially for NLP!) that we will spend some more time on them. Here we will learn how to access the individual characters that make up a string, how to pull out arbitrary substrings, and how to reverse strings.

2.3.1 Accessing individual characters

The positions within a string are numbered, starting from zero. To access a position within a string, we specify the position inside square brackets:

>>> msg = 'Hello World'
>>> msg[0]
'H'
>>> msg[3]
'l'
>>> msg[5]
' '
>>>

This is called indexing or subscripting the string. The position we specify inside the square brackets is called the index. We can retrieve not only letters but any character, such as the space at index 5.

Note

Be careful to distinguish between the string ' ', which is a single whitespace character, and '', which is the empty string.


The fact that strings are numbered from zero may seem counter-intuitive. However, it goes back to the way variables are stored in a computer's memory. As mentioned earlier, a variable is actually the name of a location, or address, in memory. Strings are arbitrarily long, and their address is taken to be the position of their first character. Thus, if values are stored in variables as follows,

>>> three = 3
>>> msg = 'Hello World'

then the location of those values will be along the lines shown in Figure 1.

Figure 1: Variables and Computer Memory

When we index into a string, the computer adds the index to the string's address. Thus msg[3] is found at memory location 3136 + 3. Accordingly, the first position in the string is found at 3136 + 0, or msg[0].

If you don't find Figure 1 helpful, you might just want to think of indexes as giving you the position in a string immediately before a character, as indicated in Figure 2.

Now, what happens when we try to access an index that is outside of the string?

>>> msg[11]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
IndexError: string index out of range
>>>

The index of 11 is outside of the range of valid indices (i.e., 0 to 10) for the string 'Hello World'. This results in an error message. This time it is not a syntax error; the program fragment is syntactically correct. Instead, the error occurred while the program was running. The Traceback message indicates which line the error occurred on (line 1 of 'standard input'). It is followed by the name of the error, IndexError, and a brief explanation.

In general, how do we know what we can index up to? If we know the length of the string is n, the highest valid index will be n − 1. We can get access to the length of the string using the len() function.

>>> len(msg)
11
>>>
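Putting the two together (nothing new here, just combining len() with indexing), we can confirm that len(msg) - 1 is indeed the highest valid index:

>>> msg[len(msg) - 1]
'd'
>>>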


Figure 2: String Indexing

Informally, a function is a named snippet of code that provides a service to our program when we call or execute it by name. We call the len() function by putting parentheses after the name and giving it the string msg we want to know the length of. Because len() is built into the Python interpreter, IDLE colors it purple.

We have seen what happens when the index is too large. What about when it is too small? Let's see what happens when we use values less than zero:

>>> msg[-1]
'd'
>>>

This does not generate an error. Instead, negative indices work from the end of the string, so -1 indexes the last character, which is 'd'.

>>> msg[-3]
'r'
>>> msg[-6]
' '
>>>

Now the computer works out the location in memory relative to the string's address plus its length, e.g. 3136 + 11 - 1 = 3146. We can also visualize negative indices as shown in Figure 3.

Figure 3: Negative Indices

Thus we have two ways to access the characters in a string, from the start or the end. For example, we can access the space in the middle of Hello and World with either msg[5] or msg[-6]; these refer to the same location, because 5 = len(msg) - 6.

2.3.2 Accessing substrings

Next, we might want to access more than one character at a time. This is also pretty simple; we just need to specify a range of characters for indexing rather than just one. This process is called slicing and we indicate a slice using a colon in the square brackets to separate the beginning and end of the range:


>>> msg[1:4]
'ell'
>>>

Here we see the characters are 'e', 'l' and 'l', which correspond to msg[1], msg[2] and msg[3], but not msg[4]. This is because a slice starts at the first index but finishes one before the end index. This is consistent with indexing: indexing also starts from zero and goes up to one before the length of the string. We can see that by indexing with the value of len():

>>> len(msg)
11
>>> msg[0:11]
'Hello World'
>>>

We can also slice with negative indices — the same basic rules of starting from the start index and stopping one before the end index apply; here we stop before the space character:

>>> msg[0:-6]
'Hello'
>>>

Python provides two shortcuts for commonly used slice values. If the start index is 0 then you can leave it out entirely, and if the end index is the length of the string then you can leave it out entirely:

>>> msg[:3]
'Hel'
>>> msg[6:]
'World'
>>>

The first example above selects the first three characters from the string, and the second example selects from the character with index 6, namely 'W', to the end of the string. These shortcuts lead to a couple of common Python idioms:

>>> msg[:-1]
'Hello Worl'
>>> msg[:]
'Hello World'
>>>

The first chomps off just the last character of the string, and the second makes a complete copy of the string (which is more important when we come to lists below).

2.3.3 Exercises

1. Define a string s = 'colorless'. Write a Python statement that changes this to 'colourless' using only the slice and concatenation operations.

2. Try the slice examples from this section using the interactive interpreter. Then try some more of your own. Guess what the result will be before executing the command.


3. We can use the slice notation to remove morphological endings on words. For example, 'dogs'[:-1] removes the last character of dogs, leaving dog. Use slice notation to remove the affixes from these words (we've inserted a hyphen to indicate the affix boundary, but omit this from your strings): dish-es, run-ning, nation-ality, un-do, pre-heat.

4. We saw how we can generate an IndexError by indexing beyond the end of a string. Is it possible to construct an index that goes too far to the left, before the start of the string?

5. We can also specify a step size for the slice. The following returns every second character within the slice, in a forwards or reverse direction:

>>> msg[6:11:2]
'Wrd'
>>> msg[10:5:-2]
'drW'
>>>

Experiment with different step values.

6. What happens if you ask the interpreter to evaluate msg[::-1]? Explain why this is a reasonable result.

2.4 Strings, Sequences, and Sentences

We have seen how words like Hello can be stored as a string 'Hello'. Whole sentences can also be stored in strings, and manipulated as before, as we can see here for Chomsky's famous nonsense sentence:

>>> sent = 'colorless green ideas sleep furiously'
>>> sent[16:21]
'ideas'
>>> len(sent)
37
>>>

However, it turns out to be a bad idea to treat a sentence as a sequence of its characters, because this makes it too inconvenient to access the words or work out the length. Instead, we would prefer to represent a sentence as a sequence of its words; as a result, indexing a sentence accesses the words, rather than characters. We will see how to do this now.

2.4.1 Lists

A list is designed to store a sequence of values. A list is similar to a string in many ways except that individual items don't have to be just characters; they can be arbitrary strings, integers or even other lists.

A Python list is represented as a sequence of comma-separated items, delimited by square brackets. Let's create part of Chomsky's sentence as a list and put it in a variable phrase1:


>>> phrase1 = ['colorless', 'green', 'ideas']
>>> phrase1
['colorless', 'green', 'ideas']
>>>

Because lists and strings are both kinds of sequence, they can be processed in similar ways; just as strings support len(), indexing and slicing, so do lists. The following example applies these familiar operations to the list phrase1:

>>> len(phrase1)
3
>>> phrase1[0]
'colorless'
>>> phrase1[-1]
'ideas'
>>> phrase1[-5]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
IndexError: list index out of range
>>>

Here, phrase1[-5] generates an error, because the fifth-last item in a three-item list would occur before the list started, i.e., it is undefined. We can also slice lists in exactly the same way as strings:

>>> phrase1[1:3]
['green', 'ideas']
>>> phrase1[-2:]
['green', 'ideas']
>>>

Lists can be concatenated just like strings. Here we will put the resulting list into a new variable phrase2. The original variable phrase1 is not changed in the process:

>>> phrase2 = phrase1 + ['sleep', 'furiously']
>>> phrase2
['colorless', 'green', 'ideas', 'sleep', 'furiously']
>>> phrase1
['colorless', 'green', 'ideas']
>>>

Now, lists and strings do not have exactly the same functionality. Lists have the added power that you can change their elements. Let's imagine that we want to change the 0th element of phrase1 to 'colorful'; we can do that by assigning to the index phrase1[0]:

>>> phrase1[0] = 'colorful'
>>> phrase1
['colorful', 'green', 'ideas']
>>>

On the other hand, if we try to do that with a string (for example, changing the 0th character in msg to 'J') we get:

>>> msg[0] = 'J'
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: object does not support item assignment
>>>


This is because strings are immutable — you can't change a string once you have created it. However, lists are mutable, and their contents can be modified at any time. As a result, lists support a number of operations, or methods, which modify the original value rather than returning a new value.

Note

Methods are functions, so they can be called in a similar manner. However, as we will see later on in this book, methods are tightly associated with objects that belong to specific classes (for example, strings and lists). A method is called on a particular object using the object's name, then a period, then the name of the method, and finally the parentheses containing any arguments.

Two of these methods are for sorting and reversing:

>>> phrase2.sort()
>>> phrase2
['colorless', 'furiously', 'green', 'ideas', 'sleep']
>>> phrase2.reverse()
>>> phrase2
['sleep', 'ideas', 'green', 'furiously', 'colorless']
>>>

As you will see, the prompt reappears immediately on the line after phrase2.sort() and phrase2.reverse(). That is because these methods do not return a new list, but instead modify the original list stored in the variable phrase2.

Lists also support an append() method for adding items to the end of the list, and an index() method for finding the index of particular items in the list:

>>> phrase2.append('said')
>>> phrase2.append('Chomsky')
>>> phrase2
['sleep', 'ideas', 'green', 'furiously', 'colorless', 'said', 'Chomsky']
>>> phrase2.index('green')
2
>>>

Finally, just as a reminder, you can create lists of any values you like. They don't even have to be the same type, although this is rarely a good idea:

>>> bat = ['bat', [[1, 'n', 'flying mammal'], [2, 'n', 'striking instrument']]]
>>>

2.4.2 Working on sequences one item at a time

We have shown you how to create lists, and how to index and manipulate them in various ways. Often it is useful to step through a list and process each item in some way. We do this using a for loop. This is our first example of a control structure in Python, a statement that controls how other statements are run:

>>> for word in phrase2:
...     print len(word), word
...
5 sleep
5 ideas
5 green
9 furiously
9 colorless
4 said
7 Chomsky

This program runs the statement print len(word), word for every item in the list of words. This process is called iteration. Each iteration of the for loop starts by assigning the next item of the list phrase2 to the loop variable word. Then the indented body of the loop is run. Here the body consists of a single command, but in general the body can contain as many lines of code as you want, so long as they are all indented by the same amount.

Note

The interactive interpreter changes the prompt from >>> to the ... prompt after encountering a colon (:). This indicates that the interpreter is expecting an indented block of code to appear next. However, it is up to you to do the indentation. To finish the indented block just enter a blank line.

We can run another for loop over the Chomsky nonsense sentence, and calculate the average word length. As you will see, this program uses the len() function in two ways: to count the number of characters in a word, and to count the number of words in a phrase. Note that x += y is shorthand for x = x + y; this idiom allows us to increment the total variable each time the loop is run.

>>> total = 0
>>> for word in phrase2:
...     total += len(word)
...
>>> total / len(phrase2)
6
>>>
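The answer is a whole number because dividing two integers performs the integer division we met at the start of this chapter. As an aside, if we want the exact average we can convert one operand to a floating-point number first:

>>> total / float(len(phrase2))
6.2857142857142856
>>>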

2.4.3 Tuples

Python tuples are just like lists, except that there is one important difference: tuples cannot be changed in place, for example by sort() or reverse(). In other words, like strings they are immutable. Tuples are formed with enclosing parentheses rather than square brackets, and items are separated by commas. Like lists, tuples can be indexed and sliced.

>>> t = ('walk', 'fem', 3)
>>> t[0]
'walk'
>>> t[1:]
('fem', 3)
>>> t[0] = 'run'
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: object does not support item assignment
>>>


2.4.4 String Formatting

The output of a program is usually structured to make the information easily digestible by a reader. Instead of running some code and then manually inspecting the contents of a variable, we would like the code to tabulate some output. We already saw this above in the first for loop example, where each line of output was similar to 5 sleep, consisting of a word length, followed by the word in question.

There are many ways we might want to format such output. For instance, we might want to place the length value in parentheses after the word, and print all the output on a single line:

>>> for word in phrase2:
...     print word, '(', len(word), '),',
...
sleep ( 5 ), ideas ( 5 ), green ( 5 ), furiously ( 9 ), colorless ( 9 ), said ( 4 ), Chomsky ( 7 ),

Notice that this print statement ends with a trailing comma, which is how we tell Python not to print a newline at the end.

However, this approach has a couple of problems. First, the print statement intermingles variables and punctuation, making it a little difficult to read. Second, the output has spaces around every item that was printed. A cleaner way to produce structured output uses Python's string-formatting expressions. Here's an example:

>>> for word in phrase2:
...     print "%s (%d)," % (word, len(word)),
...
sleep (5), ideas (5), green (5), furiously (9), colorless (9), said (4), Chomsky (7),

Here, the print command is followed by a three-part object having the syntax: format % values. The format section is a string containing format specifiers such as %s and %d which Python will replace with the supplied values. The %s specifier tells Python that the corresponding variable is a string (or should be converted into a string), while the %d specifier indicates that the corresponding variable should be converted into a decimal representation. Finally, the values section of a formatting string is a tuple containing exactly as many items as there are format specifiers in the format section.

In the above example, we used a trailing comma to suppress the printing of a newline. Suppose, on the other hand, that we want to introduce some additional newlines in our output. We can accomplish this by inserting the 'special' character \n into the print string:

>>> for word in phrase2:
...     print "Word = %s\nIndex = %s\n*****" % (word, phrase2.index(word))
...
Word = colorless
Index = 0
*****
Word = green
Index = 1
*****
...
>>>


2.4.5 Converting between strings and lists

Often we want to convert between a string containing a space-separated list of words and a list of strings. Let's first consider turning a list into a string. One way of doing this is as follows:

>>> str = ''
>>> for word in phrase2:
...     str += ' ' + word
...
>>> str
' colorless green ideas sleep furiously'
>>>

One drawback of this approach is that we have an unwanted space at the start of str. But more importantly, Python strings have built-in support which allows us to carry out this task much more economically. First we'll consider the join() method:

>>> phrase3 = ' '.join(phrase2)
>>> phrase3
'sleep ideas green furiously colorless said Chomsky'
>>>

This notation for join() may seem very odd at first. However, it follows exactly the same convention as sort() and append() above. As we mentioned earlier, a method is called on an object using the object's name, then a period, then the name of the method, and finally the parentheses containing any arguments. Here, the object is a string that consists of a single whitespace ' '. The name of the method is join, and the single argument to the join method is the list of words phrase2. As you can see from the example above, ' '.join(phrase2) takes the whitespace and creates a new string by inserting it between all of the items in the list phrase2. We have stored that string in the variable phrase3. We can use the join() method with other strings, such as ' -> ':

>>> ' -> '.join(phrase2)
'sleep -> ideas -> green -> furiously -> colorless -> said -> Chomsky'
>>>

Now let's try to reverse the process: that is, we want to convert a string into a list. Again, we could start off with an empty list [] and append() to it within a for loop. But as before, there is a more succinct way of achieving the same goal. This time, we will split the new string phrase3 on the whitespace character:

>>> phrase3.split(' ')
['sleep', 'ideas', 'green', 'furiously', 'colorless', 'said', 'Chomsky']
>>> phrase3.split('s')
['', 'leep idea', ' green furiou', 'ly colorle', '', ' ', 'aid Chom', 'ky']
>>>

We can also split on any character, so we tried splitting on 's' as well.

2.4.6 Exercises

1. Using the Python interactive interpreter, experiment with the examples in this section. Think of a sentence and represent it as a list of strings, e.g. ['Hello', 'world']. Try the various operations for indexing, slicing and sorting the elements of your list. Extract individual items (strings), and perform some of the string operations on them.


2. We pointed out that when phrase is a list, phrase.reverse() modifies phrase itself rather than returning a new list. On the other hand, we can use the slice trick mentioned above [xref previous exercises], [::-1], to create a new reversed list without changing phrase. Show how you can confirm this difference in behaviour.

3. We have seen how to represent a sentence as a list of words, where each word is a sequence of characters. What does phrase1[2][2] do? Why? Experiment with other index values.

4. Write a for loop to print out the characters of a string, one per line.

5. Process the list phrase2 using a for loop, and store the result in a new list lengths. Hint: begin by assigning the empty list to lengths, using lengths = []. Then each time through the loop, use append() to add another length value to the list.

6. Define a variable silly to contain the string: 'newly formed bland ideas are unexpressible in an infuriating way'. (This happens to be the legitimate interpretation that bilingual English-Spanish speakers can assign to Chomsky's famous phrase, according to Wikipedia). Now write code to perform the following tasks:

a) Split silly into a list of strings, one per word, using Python's split() operation.

b) Extract the second letter of each word in silly and join them into a string, to get 'eoldrnnnna'.

c) Build a list phrase4 consisting of all the words up to (but not including) in in silly. Hint: use the index() function in combination with list slicing.

d) Combine the words in phrase4 back into a single string, using join(). Make sure the words in the resulting string are separated with whitespace.

e) Print the words of silly in alphabetical order, one per line.

7. What happens if you call split on a string, with no argument, e.g. phrase3.split()? What happens when the string being split contains tab characters, consecutive space characters, or a sequence of tabs and spaces?

8. Create a variable words containing a list of words. Experiment with words.sort() and sorted(words). What is the difference?

2.5 Making Decisions

So far, our simple programs have been able to manipulate sequences of words, and perform some operation on each one. We applied this to lists consisting of a few words, but the approach works the same for lists of arbitrary size, containing thousands of items. Thus, such programs have some interesting qualities: (i) the ability to work with language, and (ii) the potential to save human effort through automation. Another useful feature of programs is their ability to make decisions on our behalf; this is our focus in this section.


2.5.1 Making simple decisions

Most programming languages permit us to execute a block of code when a conditional expression, or if statement, is satisfied. In the following program, we have created a variable called word containing the string value 'cat'. The if statement then checks whether the condition len(word) < 5 is true. Because the conditional expression is true, the body of the if statement is invoked and the print statement is executed.

>>> word = "cat"
>>> if len(word) < 5:
...     print 'word length is less than 5'
...
word length is less than 5
>>>

If we change the conditional expression to len(word) >= 5 — the length of word is greater than or equal to 5 — then the conditional expression will no longer be true, and the body of the if statement will not be run:

>>> if len(word) >= 5:
...     print 'word length is greater than or equal to 5'
...
>>>

The if statement, just like the for statement above, is a control structure: it controls whether the code in its body will be run. You will notice that both if and for have a colon at the end of the line, before the indentation begins. That's because all Python control structures end with a colon.

What if we want to do something when the conditional expression is not true? The answer is to add an else clause to the if statement:

>>> if len(word) >= 5:
...     print 'word length is greater than or equal to 5'
... else:
...     print 'word length is less than 5'
...
word length is less than 5
>>>

Finally, if we want to test multiple conditions in one go, we can use an elif clause which acts like an else and an if combined:

>>> if len(word) < 3:
...     print 'word length is less than three'
... elif len(word) == 3:
...     print 'word length is equal to three'
... else:
...     print 'word length is greater than three'
...
word length is equal to three
>>>


2.5.2 Conditional expressions

Python supports a wide range of operators, like < and >=, for testing the relationship between values. The full set of these relational operators is:

Operator    Relationship
<           less than
<=          less than or equal to
==          equal to (note this is two = signs, not one)
!=          not equal to
>           greater than
>=          greater than or equal to

Normally we use conditional expressions as part of an if statement. However, we can test these relational operators directly at the prompt:

>>> 3 < 5
True
>>> 5 < 3
False
>>> not 5 < 3
True
>>>

Here we see that these expressions have Boolean values, namely True or False. not is a Boolean operator, and flips the truth value of a Boolean statement.

Strings and lists also support conditional operators:

>>> word = 'sovereignty'
>>> 'sovereign' in word
True
>>> 'gnt' in word
True
>>> 'pre' not in word
True
>>> 'Hello' in ['Hello', 'World']
True
>>>

Strings also have methods for testing what appears at the beginning and the end of a string (as opposed to just anywhere in the string):

>>> word.startswith('sovereign')
True
>>> word.endswith('ty')
True
>>>


Note

Integers, strings and lists are all kinds of data types in Python. In fact, every value in Python has a type. The type determines what operations you can perform on the data value. So, for example, we have seen that we can index strings and lists, but we can't index integers:

>>> one = 'cat'
>>> one[0]
'c'
>>> two = [1, 2, 3]
>>> two[1]
2
>>> three = 3
>>> three[2]
Traceback (most recent call last):
  File "<pyshell#95>", line 1, in -toplevel-
    three[2]
TypeError: unsubscriptable object
>>>

You can use Python’s type() function to check what the type of an object is:

>>> data = [one, two, three]
>>> for item in data:
...     print "item '%s' belongs to %s" % (item, type(item))
...
item 'cat' belongs to <type 'str'>
item '[1, 2, 3]' belongs to <type 'list'>
item '3' belongs to <type 'int'>
>>>

Because strings and lists (and tuples) have so much in common, they are grouped together in a higher-level type called sequences.

2.5.3 Iteration, items, and if

Now it is time to put some of the pieces together. We are going to take the string 'how now brown cow' and print out all of the words ending in 'ow'. Let's build the program up in stages. The first step is to split the string into a list of words:

>>> sentence = 'how now brown cow'
>>> words = sentence.split()
>>> words
['how', 'now', 'brown', 'cow']
>>>

Next, we need to iterate over the words in the list. Just so we don't get ahead of ourselves, let's print each word, one per line:

>>> for word in words:
...     print word
...
how
now
brown
cow

The next stage is to only print out the words if they end in the string 'ow'. Let's check that we know how to do this first:

>>> 'how'.endswith('ow')
True
>>> 'brown'.endswith('ow')
False
>>>


Now we are ready to put an if statement inside the for loop. Here is the complete program:

>>> sentence = 'how now brown cow'
>>> words = sentence.split()
>>> for word in words:
...     if word.endswith('ow'):
...         print word
...
how
now
cow
>>>

As you can see, even with this small amount of Python knowledge it is possible to develop useful programs. The key idea is to develop the program in pieces, testing that each one does what you expect, and then combining them to produce whole programs. This is why the Python interactive interpreter is so invaluable, and why you should get comfortable using it.

2.5.4 Exercises

1. Assign a new value to sentence, namely the string 'she sells sea shells by the sea shore', then write code to perform the following tasks:

a) Print all words beginning with 'sh'.

b) Print all words longer than 4 characters.

c) Generate a new sentence that adds the popular hedge word 'like' before every word beginning with 'se'. Your result should be a single string.

2. Write conditional expressions, such as 'H' in msg, but applied to lists instead of strings. Check whether particular words are included in the Chomsky nonsense sentence.

3. Write code to abbreviate text by removing all the vowels. Define sentence to hold any string you like, then initialize a new string result to hold the empty string ''. Now write a for loop to process the string, one character at a time, and append any non-vowel characters to the result string.

4. Write code to convert text into hAck3r, where e → 3, i → 1, o → 0, l → |, s → 5, . → 5w33t!, ate → 8.

2.6 Getting organized

Strings and lists are a simple way to organize data. In particular, they map from integers to values. We can 'look up' a string using an integer to get one of its letters, and we can also look up a list of words using an integer to get one of its strings. These cases are shown in Figure 4.

However, we need a more flexible way to organize and access our data. Consider the examples in Figure 5.

In the case of a phone book, we look up an entry using a name, and get back a number. When we type a domain name in a web browser, the computer looks this up to get back an IP address. A word frequency table allows us to look up a word and find its frequency in a text collection.


Figure 4: Sequence Look-up

Figure 5: Dictionary Look-up

In all these cases, we are mapping from names to numbers, rather than the other way round as with indexing into sequences. In general, we would like to be able to map between arbitrary types of information. The following table lists a variety of linguistic objects, along with what they map from and to.

Linguistic Object       Maps from       Maps to
Document Index          Word            List of pages (where word is found)
Thesaurus               Word sense      List of synonyms
Dictionary              Headword        Entry (part of speech, sense definitions, etymology)
Comparative Wordlist    Gloss term      Cognates (list of words, one per language)
Morph Analyzer          Surface form    Morphological analysis (list of component morphemes)

Most often, we are mapping from a string to some structured object. For example, a document index maps from a word (which we can represent as a string) to a list of pages (represented as a list of integers). In this section, we will see how to represent such mappings in Python.

2.6.1 Accessing data with data

Python provides a dictionary data type, which can be used for mapping between arbitrary types.

Note

A Python dictionary is somewhat like a linguistic dictionary — they both give you a systematic means of looking things up, and so there is some potential for confusion. However, we hope that it will usually be clear from the context which kind of dictionary we are talking about.

Here we define pos to be an empty dictionary and then add three entries to it, specifying the part-of-speech of some words. We add entries to a dictionary using the familiar square bracket notation:

>>> pos = {}
>>> pos['colorless'] = 'adj'
>>> pos['furiously'] = 'adv'
>>> pos['ideas'] = 'n'
>>>

So, for example, pos['colorless'] = 'adj' says that the look-up value of 'colorless' in pos is the string 'adj'.

To look up a value in pos, we again use indexing notation, except now the thing inside the square brackets is the item whose value we want to recover:

>>> pos['ideas']
'n'
>>> pos['colorless']
'adj'
>>>


The item used for look-up is called the key, and the data that is returned is known as the value. As with indexing a list or string, we get an exception when we try to access the value of a key that does not exist:

>>> pos['missing']
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
KeyError: 'missing'
>>>

This raises an important question. Unlike lists and strings, where we can use len() to work out which integers will be legal indices, how do we work out the legal keys for a dictionary? Fortunately, we can check whether a key exists in a dictionary using the in operator:

>>> 'colorless' in pos
True
>>> 'missing' in pos
False
>>> 'missing' not in pos
True
>>>

Notice that we can use not in to check if a key is missing. Be careful with the in operator for dictionaries: it only applies to the keys and not their values. If we check for a value, e.g. 'adj' in pos, the result is False, since 'adj' is not a key.
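If we really do want to search the values, one option (a brief aside, using the values() method that we will meet in a moment) is to ask for the list of values explicitly:

>>> 'adj' in pos.values()
True
>>>

We can loop over all the entries in a dictionary using a for loop.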

>>> for word in pos:
...     print "%s (%s)" % (word, pos[word])
...
colorless (adj)
furiously (adv)
ideas (n)
>>>

We can see what the contents of the dictionary look like by inspecting the variable pos:

>>> pos
{'furiously': 'adv', 'ideas': 'n', 'colorless': 'adj'}
>>>

Here, the contents of the dictionary are shown as key-value pairs. As you can see, the order of the key-value pairs is different from the order in which they were originally entered. This is because dictionaries are not sequences but mappings. The keys in a mapping are not inherently ordered, and any ordering that we might want to impose on the keys exists independently of the mapping. As we shall see later, this gives us a lot of flexibility.
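For instance, if we want to visit the keys in alphabetical order, we can impose that ordering ourselves (a small aside, using the sorted() function discussed further below):

>>> for word in sorted(pos):
...     print word, pos[word]
...
colorless adj
furiously adv
ideas n
>>>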

We can use the same key-value pair format to create a dictionary:

>>> pos = {'furiously': 'adv', 'ideas': 'n', 'colorless': 'adj'}
>>>

Using the dictionary methods keys(), values() and items(), we can access the keys and values as separate lists, and also the key-value pairs:


>>> pos.keys()
['colorless', 'furiously', 'ideas']
>>> pos.values()
['adj', 'adv', 'n']
>>> pos.items()
[('colorless', 'adj'), ('furiously', 'adv'), ('ideas', 'n')]
>>>

2.6.2 Counting with dictionaries

The values stored in a dictionary can be any kind of object, not just a string — the values can even be dictionaries. The most common kind is actually an integer. It turns out that we can use a dictionary to store counters for many kinds of data. For instance, we can have a counter for all the letters of the alphabet; each time we get a certain letter we increment its corresponding counter:

>>> phrase = 'colorless green ideas sleep furiously'
>>> count = {}
>>> for letter in phrase:
...     if letter not in count:
...         count[letter] = 0
...     count[letter] += 1
...
>>> count
{'a': 1, ' ': 4, 'c': 1, 'e': 6, 'd': 1, 'g': 1, 'f': 1, 'i': 2, 'l': 4, 'o': 3, 'n': 1, 'p': 1, 's': 5, 'r': 3, 'u': 2, 'y': 1}

Observe that in is used here in two different ways: for letter in phrase iterates over every letter, running the body of the for loop. Inside this loop, the conditional expression if letter not in count checks whether the letter is missing from the dictionary. If it is missing, we create a new entry and set its value to zero: count[letter] = 0. Now we are sure that the entry exists, and it may have a zero or non-zero value. We finish the body of the for loop by incrementing this particular counter using the += assignment operator. Finally, we print the dictionary, to see the letters and their counts. This method of maintaining many counters will find many uses, and you will become very familiar with it.
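Incidentally, the same test-and-initialize pattern can be written more compactly with the dictionary's get() method, which returns a supplied default value when the key is absent (just an equivalent idiom; nothing below depends on it):

>>> count = {}
>>> for letter in phrase:
...     count[letter] = count.get(letter, 0) + 1
...
>>>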

There are other useful ways to display the result, such as sorting alphabetically by the letter:

>>> sorted(count.items())
[(' ', 4), ('a', 1), ('c', 1), ('d', 1), ('e', 6), ('f', 1), ..., ('y', 1)]

Note

The function sorted() is similar to the sort() method on sequences, but rather than sorting in-place, it produces a new sorted copy of its argument. Moreover, as we will see very soon, sorted() will work on a wider variety of data types, including dictionaries.

2.6.3 Getting unique entries

Sometimes, we don't want to count at all, but just want to make a record of the items that we have seen, regardless of repeats. For example, we might want to compile a vocabulary from a document. This is a sorted list of the words that appeared, regardless of frequency.


At this stage we have two ways to do this. The first uses lists.

>>> sentence = "she sells sea shells by the sea shore".split()
>>> words = []
>>> for word in sentence:
...     if word not in words:
...         words.append(word)
...
>>> sorted(words)
['by', 'sea', 'sells', 'she', 'shells', 'shore', 'the']

We can write this using a dictionary as well. Each word we find is entered into the dictionary as a key. We use a value of 1, but it could be anything we like. We extract the keys from the dictionary simply by converting the dictionary to a list:

>>> found = {}
>>> for word in sentence:
...     found[word] = 1
...
>>> sorted(found)
['by', 'sea', 'sells', 'she', 'shells', 'shore', 'the']

There is a third way to do this, which is best of all: using Python's set data type. We can convert sentence into a set, using set(sentence):

>>> set(sentence)
set(['shells', 'sells', 'shore', 'she', 'sea', 'the', 'by'])

The order of items in a set is not significant, and they will usually appear in a different order to the one they were entered in. The main point here is that converting a list to a set removes any duplicates. We convert it back into a list, sort it, and print. Here is the complete program:

>>> sentence = "she sells sea shells by the sea shore".split()
>>> sorted(set(sentence))
['by', 'sea', 'sells', 'she', 'shells', 'shore', 'the']

Here we have seen that there is sometimes more than one way to solve a problem with a program. In this case, we used three different built-in data types: a list, a dictionary, and a set. The set data type most closely modelled our task, so it required the least amount of work.

2.6.4 Scaling it up

We can use dictionaries to count word occurrences. For example, the following code reads Macbeth and counts the frequency of each word:

>>> from nltk_lite.corpora import gutenberg
>>> count = {}                                        # initialize a dictionary
>>> for word in gutenberg.raw('shakespeare-macbeth'): # tokenize Macbeth
...     word = word.lower()                           # normalize to lowercase
...     if word not in count:                         # seen this word before?
...         count[word] = 0                           # if not, set count to zero
...     count[word] += 1                              # increment the counter
...
>>>


This example demonstrates some of the convenience of NLTK in accessing corpora. We will see much more of this later. For now, all you need to know is that gutenberg.raw() returns a list of words, in this case from Shakespeare's play Macbeth, which we are iterating over using a for loop. We convert each word to lowercase using the string method word.lower(), and use a dictionary to maintain a set of counters, one per word. Now we can inspect the contents of the dictionary to get counts for particular words:

>>> count['scotland']
12
>>> count['the']
692
>>>

2.6.5 Exercises

1. Using the Python interpreter in interactive mode, experiment with the examples in this section. Create a dictionary d, and add some entries. What happens if you try to access a non-existent entry, e.g. d['xyz']?

2. Try deleting an element from a dictionary, using the syntax del d['abc']. Check that the item was deleted.

3. Create a dictionary e, to represent a single lexical entry for some word of your choice. Define keys like headword, part-of-speech, sense, and example, and assign them suitable values.

4. Create two dictionaries, d1 and d2, and add some entries to each. Now issue the command d1.update(d2). What did this do? What might it be useful for?

5. Write a program that takes a sentence expressed as a single string, splits it and counts up the words. Get it to print out each word and the word's frequency, one per line, in alphabetical order.

2.7 Defining Functions

It often happens that part of a program needs to be used several times over. For example, suppose we were writing a program that needed to be able to form the plural of a singular noun, and that this needed to be done at various places during the program. Rather than repeating the same code several times over, it is more efficient (and reliable) to localize this work inside a function. A function is a programming construct which can be called with one or more inputs, and which returns an output. We define a function using the keyword def followed by the function name and any input parameters, followed by a colon; this in turn is followed by the body of the function. We use the keyword return to indicate the value that is produced as output by the function. The best way to convey this is with an example. Our function plural() takes a singular noun as input, and generates a plural form as output:

>>> def plural(word):
...     if word[-1] == 'y':
...         return word[:-1] + 'ies'
...     elif word[-1] in 'sx':
...         return word + 'es'
...     elif word[-2:] in ['sh', 'ch']:
...         return word + 'es'
...     elif word[-2:] == 'an':
...         return word[:-2] + 'en'
...     return word + 's'
>>> plural('fairy')
'fairies'
>>> plural('woman')
'women'

There is much more to be said about ways of defining functions, but we will defer this until Chapter 6.

2.8 Regular Expressions

For a moment, imagine that you are editing a large text, and you have a strong dislike of repeated occurrences of the word very. How could you find all such cases in the text? To be concrete, let's suppose that the variable str is bound to the text shown below:

>>> str = """
... Google Analytics is very very very nice (now)
... By Jason Hoffman 18 August 06
...
... Google Analytics, the result of Google's acquisition of the San
... Diego-based Urchin Software Corporation, really really opened it's
... doors to the world a couple of days ago."""
>>>

The triple quotes """ are useful here, since they allow us to break a string across lines. One approach to our task would be to convert the string into a list, and look for adjacent items which are both equal to the string 'very'. We use the range(n) function in this example to create a list of consecutive integers from 0 up to, but not including, n:

>>> text = str.split(' ')
>>> for n in range(len(text) - 1):   # stop one short, since we look ahead to text[n+1]
...     if text[n] == 'very' and text[n+1] == 'very':
...         print n, n+1
...
3 4
4 5
>>>

However, such an approach is not very flexible or convenient. In this section, we will present Python's regular expression module re, which supports powerful search and substitution inside strings. As a gentle introduction, we will start out using a utility function re_show() to illustrate how regular expressions match against substrings. re_show() takes two arguments: a pattern that it is looking for, and a string in which the pattern might occur.

>>> import re
>>> from nltk_lite.utilities import re_show


>>> re_show('very very', str)
Google Analytics is {very very} very nice (now)
...
>>>

(We have only displayed the first part of str that is returned, since the rest is irrelevant for the moment.) As you can see, re_show places curly braces around the first occurrence it has found of the string 'very very'. So an important part of what re_show is doing is searching for any substring of str which matches the pattern in its first argument.

Now we might want to modify the example so that re_show highlights cases where there are two or more adjacent sequences of 'very'. To do this, we need to use a regular expression operator, namely '+'. If s is a string, then s+ means: 'match one or more occurrences of s'. Let's first look at the case where s is a single character, namely the letter 'o':

>>> re_show('o+', str)
G{oo}gle Analytics is very very very nice (n{o}w)
...
>>>

'o+' is our first proper regular expression. You can think of it as matching an infinite set of strings, namely the set {'o', 'oo', 'ooo', ...}. But we would really like to match against the set which contains strings of at least two 'o's; for this, we need the regular expression 'oo+', which matches any string consisting of 'o' followed by one or more occurrences of o.

>>> re_show('oo+', str)
G{oo}gle Analytics is very very very nice (now)
>>>

Let's return to the task of identifying multiple occurrences of 'very'. Some initially plausible candidates won't do what we want. For example, 'very+' would match 'veryyy' (but not 'very very'), since the + scopes over the immediately preceding expression, in this case 'y'. To widen the scope of +, we need to use parentheses, as in '(very)+'. Will this match 'very very'? No, because we've forgotten about the whitespace between the two words; instead, it will match strings like 'veryvery'. However, the following does work:

>>> re_show('(very\s)+', str)
Google Analytics is {very very very }nice (now)
>>>

Characters which are preceded by a \, such as '\s', have a special interpretation inside regular expressions; thus, '\s' matches a whitespace character. We could have used ' ' in our pattern, but '\s' is better practice in general. One reason is that the sense of 'whitespace' we are using is more general than you might have imagined; it includes not just inter-word spaces, but also tabs and newlines. If you try to inspect the variable str, you might initially get a shock:

>>> str
"Google Analytics is very very very nice (now)\nBy Jason Hoffman
18 August 06\n\nGoogle...
>>>

You might recall that '\n' is a special character that corresponds to a newline in a string. The following example shows how newline is matched by '\s'.


>>> str2 = "I'm very very\nvery happy"
>>> re_show('very\s', str2)
I'm {very }{very
}{very }happy
>>>

Python's re.findall(patt, str) is a useful function which returns a list of all the substrings in str that are matched by patt. Before illustrating, let's introduce two further special characters, '\d' and '\w': the first will match any digit, and the second will match any alphanumeric character.

>>> re.findall('\d\d', str)
['18', '06', '10']
>>> re.findall('\s\w\w\w\s', str)
[' the ', ' the ', ' the ', ' and ', ' you ']
>>>

As you will see, the second example matches three-letter words. However, this regular expression is not quite what we want. First, the leading and trailing spaces are extraneous. Second, it will fail to match against strings such as 'the San', where two three-letter words are adjacent. To solve this problem, we can use another special character, namely '\b'. This is sometimes called a 'zero-width' character; it matches against the empty string, but only at the beginnings and ends of words:

>>> re.findall(r'\b\w\w\w\b', str)
['now', 'the', 'the', 'San', 'the', 'ago', 'and', 'you', 'you']

Note

This example uses a Python raw string: r'\b\w\w\w\b'. The specific justification here is that in an ordinary string, \b is interpreted as a backspace character. Python will convert it to a backspace in a regular expression unless you use the r prefix to create a raw string as shown above. Another use for raw strings is to match strings which include backslashes. Suppose we want to match 'either\or'. In order to create a regular expression, the backslash needs to be escaped, since it is a special character; so we want to pass the pattern \\ to the regular expression interpreter. But to express this as a Python string literal, each backslash must be escaped again, yielding the string '\\\\'. However, with a raw string, this reduces down to r'\\'.
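To make the backslash point concrete, here is a small illustration of our own, matching the single literal backslash in either\or with and without a raw string:

>>> re.findall('\\\\', 'either\\or')   # ordinary string: the pattern \\ written as '\\\\'
['\\']
>>> re.findall(r'\\', 'either\\or')    # raw string: the same pattern written as r'\\'
['\\']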

Returning to the case of repeated words, we might want to look for cases involving 'very' or 'really', and for this we use the disjunction operator |.

>>> re_show('((very|really)\s)+', str)
Google Analytics is {very very very }nice (now)
By Jason Hoffman 18 August 06
...
Google Analytics, the result of Google's acquisition of the San
Diego-based Urchin Software Corporation, {really really }opened it's
...
>>>

In addition to the matches just illustrated, the regular expression '((very|really)\s)+' will also match cases where the two disjuncts occur interleaved with each other, such as the string 'really very really '.


Let's now look at how to perform substitutions, using the re.sub() function. In the first instance we replace all instances of l with s. Note that this generates a string as output, and doesn't modify the original string. Then we replace any instances of green with red.

>>> sent = "colorless green ideas sleep furiously"
>>> re.sub('l', 's', sent)
'cosorsess green ideas sseep furioussy'
>>> re.sub('green', 'red', sent)
'colorless red ideas sleep furiously'
>>>
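We can verify that sent itself has been left untouched by these substitutions:

>>> sent
'colorless green ideas sleep furiously'
>>>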

We can also disjoin individual characters using a square bracket notation. For example, [aeiou] matches any of a, e, i, o, or u, that is, any vowel. The expression [^aeiou] matches anything that is not a vowel. In the following example, we match sequences consisting of non-vowels followed by vowels.

>>> re_show('[^aeiou][aeiou]', sent)
{co}{lo}r{le}ss g{re}en{ i}{de}as s{le}ep {fu}{ri}ously
>>>

Using the same regular expression, the function re.findall() returns a list of all the substrings in sent that are matched:

>>> re.findall('[^aeiou][aeiou]', sent)
['co', 'lo', 'le', 're', ' i', 'de', 'le', 'fu', 'ri']
>>>

2.8.1 Groupings

Returning briefly to our earlier problem with unwanted whitespace around three-letter words, we note that re.findall() behaves slightly differently if we create groups in the regular expression using parentheses; it only returns strings which occur within the groups:

>>> re.findall('\s(\w\w\w)\s', str)
['the', 'the', 'the', 'and', 'you']
>>>

The same device allows us to select only the non-vowel characters which appear before a vowel:

>>> re.findall('([^aeiou])[aeiou]', sent)
['c', 'l', 'l', 'r', ' ', 'd', 'l', 'f', 'r']
>>>

By delimiting a second group in the regular expression, we can even generate pairs (or tuples), which we may then go on and tabulate:

>>> re.findall('([^aeiou])([aeiou])', sent)
[('c', 'o'), ('l', 'o'), ('l', 'e'), ('r', 'e'), (' ', 'i'),
('d', 'e'), ('l', 'e'), ('f', 'u'), ('r', 'i')]
>>>
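For instance, we can tabulate these pairs with the dictionary counting idiom from Section 2.6.2; a small sketch of our own, using the variable name pair_counts:

>>> pair_counts = {}
>>> for pair in re.findall('([^aeiou])([aeiou])', sent):
...     if pair not in pair_counts:
...         pair_counts[pair] = 0
...     pair_counts[pair] += 1
>>> sorted(pair_counts.items())
[((' ', 'i'), 1), (('c', 'o'), 1), (('d', 'e'), 1), (('f', 'u'), 1),
(('l', 'e'), 2), (('l', 'o'), 1), (('r', 'e'), 1), (('r', 'i'), 1)]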

Our next example also makes use of groups. One further special character is the so-called wildcard element, '.'; this has the distinction of matching any single character (except '\n'). Given the string str3, our task is to pick out login names and email domains:


>>> str3 = """
... <hart@vmd.cso.uiuc.edu>
... Final editing was done by Martin Ward <Martin.Ward@uk.ac.durham>
... Michael S. Hart <hart@pobox.com>
... Prepared by David Price, email <ccx074@coventry.ac.uk>"""

The task is made much easier by the fact that all the email addresses in the example are delimited by angle brackets, and we can exploit this feature in our regular expression:

>>> re.findall(r'<(.+)@(.+)>', str3)
[('hart', 'vmd.cso.uiuc.edu'), ('Martin.Ward', 'uk.ac.durham'),
('hart', 'pobox.com'), ('ccx074', 'coventry.ac.uk')]
>>>

Since '.' matches any single character, '.+' will match any non-empty string of characters, including punctuation symbols such as the period.

One question which might occur to you is how we specify a match against a period. The answer is that we have to place a '\' immediately before the '.' in order to escape its special interpretation.

>>> re.findall(r'(\w+\.)', str3)
['vmd.', 'cso.', 'uiuc.', 'Martin.', 'uk.', 'ac.', 'S.',
'pobox.', 'coventry.', 'ac.']
>>>

Now, let's suppose that we wanted to match occurrences of both 'Google' and 'google' in our sample text. If you have been following up till now, you would reasonably expect that this regular expression with a disjunction would do the trick: '(G|g)oogle'. But look what happens when we try this with re.findall():

>>> re.findall('(G|g)oogle', str)
['G', 'G', 'G', 'g']
>>>

What is going wrong? We innocently used the parentheses to indicate the scope of the operator '|', but re.findall() has interpreted them as marking a group. In order to tell re.findall() "don't try to do anything special with these parentheses", we need an extra piece of notation:

>>> re.findall('(?:G|g)oogle', str)
['Google', 'Google', 'Google', 'google']
>>>

Placing '?:' immediately after the opening parenthesis makes it explicit that the parentheses are just being used for scoping.

2.8.2 Practice Makes Perfect

Regular expressions are very flexible and very powerful. However, they often don't do what you expect. For this reason, you are strongly encouraged to try out a variety of tasks using re_show() and re.findall() in order to develop your intuitions further; the exercises below should help get you started. One tip is to build up a regular expression in small pieces, rather than trying to get it completely right the first time.

As you will see, we will be using regular expressions quite frequently in the following chapters, and we will describe further features as we go along.


2.8.3 Exercises

1. Describe the class of strings matched by the following regular expressions. Note that '*' means: match zero or more occurrences of the preceding regular expression.

a) [a-zA-Z]+

b) [A-Z][a-z]*

c) \d+(\.\d+)?

d) ([bcdfghjklmnpqrstvwxyz][aeiou][bcdfghjklmnpqrstvwxyz])*

e) \w+|[^\w\s]+

Test your answers using re_show().

2. Write regular expressions to match the following classes of strings:

a) A single determiner (assume that a, an, and the are the only determiners).

b) An arithmetic expression using integers, addition, and multiplication, such as 2*3+8.

3. Using re.findall(), write a regular expression which will extract pairs of values of the form login name, email domain from the following string:

>>> str = """
... austen-emma.txt:hart@vmd.cso.uiuc.edu (internet) hart@uiucvmd (bitnet)
... austen-emma.txt:Internet (72600.2026@compuserve.com); TEL: (212-254-5093)
... austen-persuasion.txt:Editing by Martin Ward (Martin.Ward@uk.ac.durham)
... blake-songs.txt:Prepared by David Price, email ccx074@coventry.ac.uk"""

4. Write code to convert text into hAck3r, using regular expressions and substitution, where e → 3, i → 1, o → 0, l → |, s → 5, . → 5w33t!, ate → 8. Normalise the text to lowercase before converting it. Add more substitutions of your own. Now try to map s to two different values: $ for word-initial s, and 5 for word-internal s.

5. Write code to read a file and print it in reverse, so that the last line is listed first.

6. Write code to access a favorite webpage and extract some text from it. For example, access a weather site and extract the forecast top temperature for your town or city today.

7. Read the Wikipedia entry on the Soundex Algorithm. Implement this algorithm in Python.

2.9 Summary

- Text is represented in Python using strings, and we type these with single or double quotes: 'Hello', "World".

- The characters of a string are accessed using indexes, counting from zero: 'Hello World'[1] gives the value e. The length of a string is found using len().


- Substrings are accessed using slice notation: 'Hello World'[1:5] gives the value ello. If the start index is omitted, the substring begins at the start of the string; similarly for the end index.

- Sequences of words are represented in Python using lists of strings: ['colorless', 'green', 'ideas']. We can use indexing, slicing and the len() function on lists.

- Strings can be split into lists: 'Hello World'.split() gives ['Hello', 'World']. Lists can be joined into strings: '/'.join(['Hello', 'World']) gives 'Hello/World'.

- Lists can be sorted in-place: words.sort(). To produce a separate, sorted copy, use: sorted(words).

- We process each item in a string or list using a for statement: for word in phrase. This must be followed by the colon character and an indented block of code, to be executed each time through the loop.

- We test a condition using an if statement: if len(word) < 5. This must be followed by the colon character and an indented block of code, to be executed only if the condition is true.

- A dictionary is used to map between arbitrary types of information, such as a string and a number: freq['cat'] = 12. We create dictionaries using the brace notation: pos = {}, pos = {'furiously': 'adv', 'ideas': 'n', 'colorless': 'adj'}.

- [More: regular expressions]

2.10 Further Reading

Guido van Rossum (2003). An Introduction to Python. Network Theory Ltd.
Guido van Rossum (2003). The Python Language Reference. Network Theory Ltd.
Guido van Rossum (2005). Python Tutorial. http://docs.python.org/tut/tut.html
A. M. Kuchling. Regular Expression HOWTO. http://www.amk.ca/python/howto/regex/
Python Documentation. http://docs.python.org/
Allen B. Downey, Jeffrey Elkner and Chris Meyers. How to Think Like a Computer Scientist: Learning with Python. http://www.ibiblio.org/obp/thinkCSpy/

About this document...
This chapter is a draft from Introduction to Natural Language Processing, by Steven Bird, Ewan Klein and Edward Loper, Copyright 2006 the authors. It is distributed with the Natural Language Toolkit [http://nltk.sourceforge.net], Version 0.7b1, under the terms of the Creative Commons Attribution-ShareAlike License [http://creativecommons.org/licenses/by-sa/2.5/].


3. Words: The Building Blocks of Language

3.1 Introduction

Language can be divided up into pieces of varying sizes, ranging from morphemes to paragraphs. In this chapter we will focus on words, a very important level for much work in NLP. Just what are words, and how should we represent them in a machine? These may seem like trivial questions, but it turns out that there are some important issues involved in defining and representing words.

In the following sections, we will explore the division of text into words; the distinction between types and tokens; sources of text data including files, the web, and linguistic corpora; accessing these sources using Python and NLTK; stemming and normalisation; WordNet; and a variety of useful programming tasks involving words.

3.2 Tokens, Types and Texts

In Chapter 1, we showed how a string could be split into a list of words. Once we have derived a list, the len() function will count the number of words for us:

>>> sentence = "This is the time -- and this is the record of the time."
>>> words = sentence.split()
>>> len(words)
13

This process of segmenting a string of characters into words is known as tokenization. Tokenization is a prelude to pretty much everything else we might want to do in NLP, since it tells our processing software what our basic units are. We will discuss tokenization in more detail shortly.

We also pointed out that we could compile a list of the unique vocabulary items in a string by using set() to eliminate duplicates:

>>> len(set(words))
10

So if we ask how many words there are in sentence, we get two different answers, depending on whether we count duplicates or not. Clearly we are using different senses of 'word' here. To help distinguish between them, let's introduce two terms: token and type. A word token is an individual occurrence of a word in a concrete context; it exists in time and space. A word type is more abstract; it's what we're talking about when we say that the three occurrences of the in sentence are 'the same word'.

Something similar to a type/token distinction is reflected in the following snippet of Python:


>>> words[2]
'the'
>>> words[2] == words[8]
True
>>> words[2] is words[8]
False
>>> words[2] is words[2]
True

The operator == tests whether two expressions are equal, and in this case, it is testing for string-identity. This is the notion of identity that was assumed by our use of set() above. By contrast, the is operator tests whether two objects are stored in the same location of memory, and is therefore analogous to token-identity.
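The contrast is easier to see with an example of our own, using lists:

>>> a = ['the']
>>> b = ['the']
>>> a == b     # same contents: the same type
True
>>> a is b     # two distinct objects in memory: different tokens
False
>>> c = a
>>> c is a     # two names for one and the same object
True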

In effect, when we used split() above to turn a string into a list of words, our tokenization method was to say that any strings which are delimited by whitespace count as a word token. But this simple approach doesn't always lead to the results we want. Moreover, string-identity doesn't always give us a useful criterion for assigning tokens to types. We therefore need to address two questions in more detail:

Tokenization: Which substrings of the original text should be treated as word tokens?

Type definition: How do we decide whether two tokens have the same type?

To see the problems with our first stab at defining tokens and types in sentence, let's look more closely at what is contained in set(words):

>>> set(words)
set(['and', 'this', 'record', 'This', 'of', 'is', '--', 'time.',
'time', 'the'])

One point to note is that 'time' and 'time.' come out as distinct tokens, and of necessity, distinct types, since the trailing period has been bundled up with the rest of the word into a single token. We might also argue that although '--' is some kind of token, it isn't really a word token. Third, we would probably want to say that 'This' and 'this' are not distinct types, since capitalization should be ignored.

The terms token and type can also be applied to other linguistic entities. For example, a sentence token is an individual occurrence of a sentence; but a sentence type is an abstract sentence, without context. If I say the same sentence twice, I have uttered two sentence tokens but only used one sentence type. When the kind of token or type is obvious from context, we will simply use the terms token and type.

To summarize, although the type/token distinction is a useful one, we cannot just say that two word tokens have the same type if they are the same string of characters; we need to take into consideration a number of other factors in determining what counts as the same word. Moreover, we also need to be more careful in how we identify tokens in the first place.

Up till now, we have relied on getting our source texts by defining a string in a fragment of Python code. However, this is an impractical approach for all but the simplest of texts, and makes it hard to present realistic examples. So how do we get larger chunks of text into our programs? In the rest of this section, we will see how to extract text from files, from the web, and from the corpora distributed with NLTK.


3.2.1 Extracting text from files

It is easy to access local files in Python. As an exercise, create a file called corpus.txt using a text editor, and enter the following text:

Hello World!

This is a test file.

Be sure to save the file as plain text. You also need to make sure that you have saved the file in the same directory or folder in which you are running the Python interactive interpreter.

Note

If you are using IDLE, you can easily create this file by selecting the New Window command in the File menu, typing the required text into this window, and then saving the file as corpus.txt in the first directory that IDLE offers in the pop-up dialogue box.

The next step is to open a file using the built-in function open(), which takes two arguments: the name of the file, here corpus.txt, and the mode to open the file with ('r' means to open the file for reading, and 'U' stands for "Universal", which lets us ignore the different conventions used for marking newlines).

>>> f = open('corpus.txt', 'rU')

Note

If the interpreter cannot find your file, it will give an error like this:

>>> f = open('corpus.txt', 'rU')
Traceback (most recent call last):
  File "<pyshell#7>", line 1, in -toplevel-
    f = open('corpus.txt', 'rU')
IOError: [Errno 2] No such file or directory: 'corpus.txt'

To check that the file that you are trying to open is really in the right directory, use IDLE's Open command in the File menu; this will display a list of all the files in the directory where IDLE is running. An alternative is to examine the current directory from within Python:

>>> import os
>>> os.listdir('.')

There are several different methods for reading the contents of a file. The following uses the read() method on the file object f; this reads the entire contents of a file into a string.

>>> f.read()
'Hello World!\nThis is a test file.\n'

You will recall that the strange '\n' character on the end of the string is a newline character; this is equivalent to pressing Enter on a keyboard and starting a new line. There is also a '\t' character for representing tab. Note that we can open and read a file in one step:

>>> text = open('corpus.txt', 'rU').read()

We can also read a file one line at a time using the for loop construct:


>>> f = open('corpus.txt', 'rU')
>>> for line in f:
...     print line[:-1]
Hello World!
This is a test file.

Here we use the slice [:-1] to remove the newline character at the end of the input line.
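The slice [:-1] assumes that every line ends with a newline character. A more forgiving alternative, worth knowing about, is the string method rstrip(), which removes any trailing whitespace, including a newline if one is present:

>>> for line in open('corpus.txt', 'rU'):
...     print line.rstrip()
Hello World!
This is a test file.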

3.2.2 Extracting text from the Web

To read in a web page, we use urlopen():

>>> from urllib import urlopen
>>> page = urlopen("http://news.bbc.co.uk/").read()
>>> print page[:60]
<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN"

Web pages are usually in HTML format. To extract the plain text, we can strip out the HTML markup, that is, remove all material enclosed in angle brackets. Let's digress briefly to consider how to carry out this task using regular expressions. Our first attempt might look as follows:

>>> line = '<title>BBC NEWS | News Front Page</title>'
>>> import re
>>> new = re.sub(r'<.*>', '', line)

So the regular expression '<.*>' is intended to match a pair of left and right angle brackets, with a string of any characters intervening. However, look at what the result is:

>>> new
''

What has happened here? The problem is two-fold. First, as already noted, the wildcard '.' matches any character other than '\n', so in particular it will match '>' and '<'. Second, the '*' operator is 'greedy', in the sense that it matches as many characters as it can. In the example we just looked at, therefore, '.*' will return not the shortest match, namely 'title', but the longest match, 'title>BBC NEWS | News Front Page</title'.

In order to get the results we want, we need to think about the task in a slightly different way. Our assumption is that after we have encountered a '<', any character can occur within the tag except a '>'; once we find the latter, we know the tag is closed. Now, we have already seen how to match everything but α, for some character α; we use a negated range expression. In this case, the expression we need is '[^>]': match everything except '>'. This range expression is then quantified with the '*' operator. In our revised example below, we use the improved regular expression, and we also normalise whitespace, replacing any sequence of one or more spaces, tabs or newlines (these are all matched by '\s+') with a single space character.

>>> import re
>>> page = re.sub('<[^>]*>', '', page)
>>> page = re.sub('\s+', ' ', page)
>>> print page[:60]
BBC NEWS | News Front Page News Sport Weather World Service
>>>


You will probably find it useful to borrow the structure of this code snippet for future tasks involving regular expressions: each time through a series of substitutions, the result of operating on page gets assigned as the new value of page. This approach allows us to decompose the transformations we need into a series of simple regular expression substitutions, each of which can be tested and debugged on its own.
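If this cleanup will be repeated, the substitutions can be packaged into a small function of our own (the name clean is ours; the patterns are exactly the ones used above):

>>> def clean(page):
...     page = re.sub('<[^>]*>', '', page)   # strip material inside angle brackets
...     page = re.sub('\s+', ' ', page)      # normalise whitespace
...     return page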

3.2.3 Extracting text from NLTK Corpora

NLTK is distributed with several corpora and corpus samples, and many are supported by the corpora package. Here we import gutenberg, a selection of texts from the Project Gutenberg electronic text archive, and list the items it contains:

>>> from nltk_lite.corpora import gutenberg
>>> gutenberg.items
['austen-emma', 'austen-persuasion', 'austen-sense', 'bible-kjv',
'blake-poems', 'blake-songs', 'chesterton-ball', 'chesterton-brown',
'chesterton-thursday', 'milton-paradise', 'shakespeare-caesar',
'shakespeare-hamlet', 'shakespeare-macbeth', 'whitman-leaves']

Next we iterate over the text content to find the number of word tokens:

>>> count = 0
>>> for word in gutenberg.raw('whitman-leaves'):
...     count += 1
>>> print count
154873

NLTK also includes the Brown Corpus, the first million-word, part-of-speech tagged electronic corpus of English, created in 1961 at Brown University. Each of the sections a through r represents a different genre.

>>> from nltk_lite.corpora import brown
>>> brown.items
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'p', 'r']

We can extract individual sentences (as lists of words) from the corpus using the extract() function. This is called below with 0 as an argument, indicating that we want the first sentence of the corpus to be returned; 1 will return the second sentence, and so on. brown.raw() is an iterator which gives us the words without their part-of-speech tags.

>>> from nltk_lite.corpora import extract
>>> print extract(0, brown.raw())
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an',
'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election',
'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities',
'took', 'place', '.']

3.2.4 Exercises

1. Create a small text file, and write a program to read it and print it with a line number at the start of each line.


2. Use the corpus module to read austen-persuasion.txt. How many word tokens does this book have? How many word types?

3. Use the Brown corpus reader brown.raw() to access some sample text in two different genres.

4. Write a program to generate a table of token/type ratios, as we saw above. Include the full set of Brown Corpus genres. Which genre has the lowest diversity? Is this what you would have expected?

5. Read in some text from a corpus, tokenize it, and print the list of all wh-word types that occur. (wh-words in English are words used in questions, relative clauses and exclamations: who, which, what, and so on.) Print them in order. Are any words duplicated in this list, because of the presence of case distinctions or punctuation?

6. Read in the texts of the State of the Union addresses, using the state_union corpus reader. Count occurrences of men, women, and people in each document. What has happened to the usage of these words over time?

7. Examine the results of processing the URL http://news.bbc.co.uk/ using the regular expressions suggested above. You will see that there is still a fair amount of non-textual data there, particularly Javascript commands. You may also find that sentence breaks have not been properly preserved. Define further regular expressions which improve the extraction of text from this web page.

8. Take a copy of http://news.bbc.co.uk/ over three different days, say at two-day intervals. This should give you three different files, bbc1.txt, bbc2.txt and bbc3.txt, each corresponding to a different snapshot of world events. Collect the 100 most frequent word tokens for each file. What can you tell from the changes in frequency?

9. Define a function ghits(), which takes a word as its argument, and builds a Google query string of the form http://www.google.com?q=word. Strip the HTML markup and normalize whitespace. Search for a substring of the form Results 1 - 10 of about, followed by some number n, and extract n. Convert this to an integer and return it.

10. Just for fun: Try running the various chatbots. How intelligent are these programs? Take a look at the program code and see if you can discover how it works. You can find the code online at: http://nltk.sourceforge.net/lite/nltk_lite/chat/.

3.3 Tokenization and Normalization

Tokenization, as we saw, is the task of extracting a sequence of elementary tokens that constitute a piece of language data. In our first attempt to carry out this task, we started off with a string of characters, and used the split() method to break the string at whitespace characters. (Recall that 'whitespace' covers not only interword space, but also tabs and newlines.) We pointed out that tokenization based solely on whitespace is too simplistic for most applications. In this section we will take a more sophisticated approach, using regular expressions to specify which character sequences should be treated as words. We will also consider important ways to normalize tokens.


3.3.1 Tokenization with Regular Expressions

The function tokenize.regexp() takes a text string and a regular expression, and returns the list of substrings that match the regular expression. To define a tokenizer that includes punctuation as separate tokens, we could do the following:

>>> from nltk_lite import tokenize
>>> text = '''Hello.  Isn't this fun?'''
>>> pattern = r'\w+|[^\w\s]+'
>>> list(tokenize.regexp(text, pattern))
['Hello', '.', 'Isn', "'", 't', 'this', 'fun', '?']

The regular expression in this example will match a sequence consisting of one or more word characters \w+. It will also match a sequence consisting of one or more punctuation characters (or non-word, non-space characters [^\w\s]+). This is another negated range expression; it matches one or more characters which are not word characters (i.e., not a match for \w) and not a whitespace character (i.e., not a match for \s). We use the disjunction operator | to combine these into a single complex expression \w+|[^\w\s]+.

There are a number of ways we might want to improve this regular expression. For example, it currently breaks $22.50 into four tokens; but we might want it to treat this as a single token. Similarly, we would want to treat U.S.A. as a single token. We can deal with these by adding further clauses to the tokenizer's regular expression. For readability we break it up and insert comments, and use the re.VERBOSE flag, so that Python knows to strip out the embedded whitespace and comments.

>>> import re
>>> text = 'That poster costs $22.40.'
>>> pattern = re.compile(r'''
...     \w+               # sequences of 'word' characters
...     | \$?\d+(\.\d+)?  # currency amounts, e.g. $12.50
...     | ([A-Z]\.)+      # abbreviations, e.g. U.S.A.
...     | [^\w\s]+        # sequences of punctuation
...     ''', re.VERBOSE)
>>> list(tokenize.regexp(text, pattern))
['That', 'poster', 'costs', '$22.40', '.']

It is sometimes more convenient to write a regular expression matching the material that appears between tokens, such as whitespace and punctuation. The tokenize.regexp() function permits an optional boolean parameter gaps; when set to True, the pattern is matched against the gaps. For example, here is how tokenize.whitespace() is defined:

>>> list(tokenize.regexp(text, pattern=r'\s+', gaps=True))
['That', 'poster', 'costs', '$22.40.']

3.3.2 Lemmatization and Normalization

Earlier we talked about counting word tokens, and completely ignored the rest of the sentence in which these tokens appeared. Thus, for an example like I saw the saw, we would have treated both saw tokens as instances of the same type. However, one is a form of the verb see, and the other is the name of a cutting instrument. How do we know that these two forms of saw are unrelated? One answer is that as speakers of English, we know that these would appear as different entries in a dictionary. Another, more empiricist, answer is that if we looked at a large enough number of texts, it would become clear that the two forms have very different distributions. For example, only the noun saw will occur immediately after determiners such as the. Distinct words which have the same written form are called homographs. We can distinguish homographs with the help of context; often the previous word suffices. We will explore this idea of context briefly, before addressing the main topic of this section.

A bigram is simply a pair of words. For example, in the sentence She sells sea shells by the sea shore, the bigrams are She sells, sells sea, sea shells, shells by, by the, the sea, sea shore.
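Bigrams are easy to compute for ourselves, by pairing each word with its successor using zip() and a slice (a small sketch of our own):

>>> words = 'She sells sea shells by the sea shore'.split()
>>> zip(words, words[1:])
[('She', 'sells'), ('sells', 'sea'), ('sea', 'shells'), ('shells', 'by'),
('by', 'the'), ('the', 'sea'), ('sea', 'shore')]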

As a first approximation to discovering the distribution of a word, we can look at all the bigrams it occurs in. Let's consider all bigrams from the Brown Corpus which have the word often as first element. Here is a small selection, ordered by their counts:

often ,         16
often a         10
often in         8
often than       7
often the        7
often been       6
often do         5
often called     4
often appear     3
often were       3
often appeared   2
often are        2
often did        2
often is         2
often appears    1
often call       1

In the topmost entry, we see that often is frequently followed by a comma. This suggests that often is common at the end of phrases. We also see that often precedes verbs, presumably as an adverbial modifier. We might infer from this that if we come across saw in the context often __, then saw is being used as a verb.

You will also see that this list includes different grammatical forms of the same verb. We can form separate groups consisting of appear ~ appears ~ appeared; call ~ called; do ~ did; and been ~ were ~ are ~ is. It is common in linguistics to say that two forms such as appear and appeared belong to a more abstract notion of a word called a lexeme; by contrast, appeared and called belong to different lexemes. You can think of a lexeme as corresponding to an entry in a dictionary. By convention, small capitals are used to indicate a lexeme: APPEAR.

Although appeared and called belong to different lexemes, they do have something in common: they are both past tense forms. This is signalled by the segment -ed, which we call a morphological suffix. We also say that such morphologically complex forms are inflected. If we strip off the suffix, we get something called the stem, namely appear and call respectively. While appeared, appears and appearing are all morphologically inflected, appear lacks any morphological inflection and is therefore termed the base form. In English, the base form is conventionally used as the lemma for a word.

Our notion of context would be more compact if we could group different forms of the various verbs into their lemmas; then we could study which verb lexemes are typically modified by a particular adverb. Lemmatization, the process of mapping grammatical forms into their lemmas, would yield the following picture of the distribution of often.

often ,        16
often be       13
often a        10
often in        8
often than      7
often the       7
often do        7
often appear    6
often call      5

Lemmatization is a rather sophisticated process which requires a mixture of rules for regular inflections and table look-up for irregular morphological patterns. Within NLTK, a simpler approach is offered by the Porter Stemmer, which strips inflectional suffixes from words, collapsing the different forms of APPEAR and CALL. Given the simple nature of the algorithm, you may not be surprised to learn that this stemmer does not attempt to identify were as a form of the lexeme BE.

>>> from nltk_lite.stem.porter import *
>>> stemmer = Porter()
>>> verbs = ['appears', 'appear', 'appeared', 'calling', 'called']
>>> lemmas = [stemmer.stem(verb) for verb in verbs]
>>> set(lemmas)
set(['call', 'appear'])

Lemmatization and stemming can be regarded as special cases of normalization. They identify a canonical representative for a group of related word forms. By its nature, normalization collapses distinctions. An example is case normalization, where all variants are mapped into a single format. What counts as the normalized form will vary according to context. Often, we convert everything into lower case, so that words which were capitalized by virtue of being sentence-initial are treated the same as those which occur elsewhere in the sentence. The Python string method lower() will accomplish this for us:

>>> str = 'This is THE time'
>>> str.lower()
'this is the time'

We need to be careful, however; case normalization will also collapse the New of New York with the new of my new car.

A final issue for normalization is the presence of contractions, such as didn't. If we are analyzing the meaning of a sentence, it would probably be more useful to normalize this form to two separate forms: did and not.
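A first, naive attempt of our own could use re.sub(); note that this simple rule mishandles irregular contractions such as can't and won't, which call for table look-up instead:

>>> import re
>>> re.sub(r"n't\b", " not", "I didn't see it")   # naive: fails on can't, won't
'I did not see it'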

3.3.3 Exercises

1. Regular expression tokenizers: Save the Wall Street Journal example text from earlier in this chapter into a file corpus.txt. Write a function load(f) to read the file into a string.

a) Use tokenize.regexp() to create a tokenizer which tokenizes the various kinds of punctuation in this text. Use a single regular expression, with inline comments using the re.VERBOSE flag.

b) Use tokenize.regexp() to create a tokenizer which tokenizes the following kinds of expression: monetary amounts; dates; names of people and companies.


2. Sentence tokenizers: (Advanced) Develop a sentence tokenizer. Test it on the Brown Corpus, which has been grouped into sentences.

3. Use the Porter Stemmer to normalize some tokenized text, calling the stemmer on each word.

4. Readability measures are used to score the reading difficulty of a text, for the purposes of selecting texts of appropriate difficulty for language learners. For example, the Automated Readability Index (ARI) of a text is defined to be: 4.71 * µw + 0.5 * µs - 21.43, where µw is the mean word length (in letters), and µs is the mean sentence length (in words). With the help of your word and sentence tokenizers, compute the ARI scores for a collection of texts.
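The ARI formula translates directly into Python. A sketch of our own, assuming words is a list of words and sents is a list of sentences (each itself a list of words):

>>> def ari(words, sents):
...     mu_w = sum(len(w) for w in words) / float(len(words))   # mean word length
...     mu_s = sum(len(s) for s in sents) / float(len(sents))   # mean sentence length
...     return 4.71 * mu_w + 0.5 * mu_s - 21.43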

3.4 Lexical Resources (INCOMPLETE)

[This section will contain a discussion of lexical resources, focusing on WordNet, but also including the cmudict and timit corpus readers.]

3.4.1 Pronunciation Dictionary

Here we access the pronunciation of words...

>>> from nltk_lite.corpora import cmudict
>>> from string import join
>>> for word, num, pron in cmudict.raw():
...     if pron[-4:] == ('N', 'IH0', 'K', 'S'):
...         print word.lower(),
atlantic's audiotronics avionics beatniks calisthenics centronics
chetniks clinic's clinics conics cynics diasonics dominic's
ebonics electronics electronics' endotronics endotronics' enix
environics ethnics eugenics fibronics flextronics harmonics
hispanics histrionics identics ionics kibbutzniks lasersonics
lumonics mannix mechanics mechanics' microelectronics minix minnix
mnemonics mnemonics molonicks mullenix mullenix mullinix mulnix
munich's nucleonics onyx panic's panics penix pennix personics
phenix philharmonic's phoenix phonics photronics pinnix
plantronics pyrotechnics refuseniks resnick's respironics sconnix
siliconix skolniks sonics sputniks technics tectonics tektronix
telectronics telephonics tonics unix vinick's vinnick's vitronics

3.4.2 WordNet Semantic Network

Note

Before using WordNet it must be installed on your machine. Please see the instructions on the NLTK website.

Access WordNet as follows:

>>> from nltk_lite import wordnet

Bird, Klein & Loper 3-10 December 6, 2006

Page 64: Introduction to Natural Language Processing ...ce.aut.ac.ir/islab/courses/NLP/archive/1388/s1/nltk-book.pdf · This textbook provides a comprehensive introduction to the eld of natural

Introduction to Natural Language Processing (DRAFT) 3. Words: The Building Blocks of Language

Help on the wordnet interface is available using help(wordnet). WordNet contains four dictionaries: N (nouns), V (verbs), ADJ (adjectives), and ADV (adverbs). Here we will focus on just the nouns.

Access the senses of a word (synsets) using getSenses():

>>> dog = wordnet.N['dog']
>>> for sense in dog.getSenses():
...     print sense
'dog' in {noun: dog, domestic dog, Canis familiaris}
'dog' in {noun: frump, dog}
'dog' in {noun: dog}
'dog' in {noun: cad, bounder, blackguard, dog, hound, heel}
'dog' in {noun: frank, frankfurter, hotdog, hot dog, dog, wiener, wienerwurst, weenie}
'dog' in {noun: pawl, detent, click, dog}
'dog' in {noun: andiron, firedog, dog, dog-iron}

A synset is a set of synonymous words; in fact, this is how a word sense is specified in WordNet. Each synset is linked to several other synsets, and the wordnet interface permits us to navigate these links. For example, we can navigate from a synset to its hyponyms, namely synsets having narrower meaning. We use getPointerTargets() to do this navigation:

>>> dog_canine = dog.getSenses()[0]
>>> for sense in dog_canine.getPointerTargets(wordnet.HYPONYM):
...     print sense
{noun: pooch, doggie, doggy, barker, bow-wow}
{noun: cur, mongrel, mutt}
{noun: lapdog}
{noun: toy dog, toy}
{noun: hunting dog}
{noun: working dog}
{noun: dalmatian, coach dog, carriage dog}
{noun: basenji}
{noun: pug, pug-dog}
{noun: Leonberg}
{noun: Newfoundland}
{noun: Great Pyrenees}
{noun: spitz}
{noun: griffon, Brussels griffon, Belgian griffon}
{noun: corgi, Welsh corgi}
{noun: poodle, poodle dog}
{noun: Mexican hairless}

Each synset has a unique hypernym, a more general synset that contains it. Thus, from any synset we can trace paths back to the most general synset. First we define a recursive function to return the hypernym of a synset. (We will study recursive functions systematically in Part II.)
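The definition of hypernym() is missing from this draft; a minimal sketch consistent with the interface used above (we assume a wordnet.HYPERNYM pointer type parallel to the wordnet.HYPONYM used earlier, and that getPointerTargets() returns a possibly empty list):

>>> def hypernym(sense):
...     targets = sense.getPointerTargets(wordnet.HYPERNYM)
...     if targets:                # take the (unique) hypernym, if there is one
...         return targets[0]
...     return None                # the most general synset has no hypernym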

Now we can write a simple program to display these hypernym paths:

>>> def hypernym_path(sense, depth=0):
...     if sense != None:
...         print " " * depth, sense
...         hypernym_path(hypernym(sense), depth+1)
>>> for sense in dog.getSenses():
...     hypernym_path(sense)


 'dog' in {noun: dog, domestic dog, Canis familiaris}
  {noun: canine, canid}
   {noun: carnivore}
    {noun: placental, placental mammal, eutherian, eutherian mammal}
     {noun: mammal}
      {noun: vertebrate, craniate}
       {noun: chordate}
        {noun: animal, animate being, beast, brute, creature, fauna}
         {noun: organism, being}
          {noun: living thing, animate thing}
           {noun: object, physical object}
            {noun: entity}

See dir(wordnet) for a full list of lexical relations supported by WordNet.

3.4.3 WordNet Similarity

The wordnet package includes a variety of functions to measure the similarity of two word senses. For example, path_distance_similarity assigns a score in the range 0-1, based on the shortest path that connects the senses in the hypernym hierarchy (-1 is returned in those cases where a path cannot be found). A score of 1 represents identity, i.e., comparing a sense with itself will return 1.

>>> from nltk_lite.wordnet import *
>>> N['poodle'][0].path_distance_similarity(N['dalmatian'][1])
0.33333333333333331
>>> N['dog'][0].path_distance_similarity(N['cat'][0])
0.20000000000000001
>>> V['run'][0].path_distance_similarity(V['walk'][0])
0.25
>>> V['run'][0].path_distance_similarity(V['think'][0])
-1

3.4.4 Exercises

1. Familiarize yourself with the WordNet interface, by reading the documentation available via help(wordnet).

2. Investigate the holonym / meronym pointers for some nouns. Note that there are three kinds (member, part, substance), so access is more specific, e.g. MEMBER_MERONYM, SUBSTANCE_HOLONYM.

3. Write a program to score the similarity of two nouns as the depth of their first common hypernym. Evaluate your findings against the Miller-Charles set of word pairs, listed here in order of decreasing similarity:

car-automobile, gem-jewel, journey-voyage, boy-lad, coast-shore, asylum-madhouse, magician-wizard, midday-noon, furnace-stove, food-fruit, bird-cock, bird-crane, tool-implement, brother-monk, lad-brother, crane-implement, journey-car, monk-oracle, cemetery-woodland, food-rooster, coast-hill, forest-graveyard, shore-woodland, monk-slave, coast-forest, lad-wizard, chord-smile, glass-magician, rooster-voyage, noon-string.


3.5 Simple Statistics with Tokens

3.5.1 Example: Stylistics

So far, we've seen how to count the number of tokens or types in a document. But it's much more interesting to look at which tokens or types appear in a document. We can use a Python dictionary to count the number of occurrences of each word type in a document:

>>> counts = {}
>>> for word in text.split():
...     if word not in counts:
...         counts[word] = 0
...     counts[word] += 1

The first statement, counts = {}, initializes the dictionary, while the next four lines successively add entries to it and increment the count each time we encounter a new token of a given type. To view the contents of the dictionary, we can iterate over its keys and print each entry (here just for the first 10 entries):

>>> for word in sorted(counts)[:10]:
...     print counts[word], word
1 $1.1
2 $130
1 $36
1 $45
1 $490
1 $5
1 $62.625,
1 $620
1 $63
2 $7

We can also print the number of times that a specific word we’re interested in appeared:

>>> print counts['might']
3

Applying this same approach to document collections that are categorized by genre, we can learn something about the patterns of word usage in those genres. For example, the following table was constructed by counting the number of times various modal words appear in different genres in the Brown Corpus:

Use of Modals in Brown Corpus, by Genre

Genre              can  could  may  might  must  will
skill and hobbies  273     59  130     22    83   259
humor               17     33    8      8     9    13
fiction: science    16     49    4     12     8    16
press: reportage    94     86   66     36    50   387
fiction: romance    79    195   11     51    46    43
religion            84     59   79     12    54    64


Observe that the most frequent modal in the reportage genre is will, suggesting a focus on the future, while the most frequent modal in the romance genre is could, suggesting a focus on possibilities.

We can also measure the lexical diversity of a genre, by calculating the ratio of word tokens to word types, as shown in the following table. (Genres with lower diversity have a higher number of tokens per type, i.e. a higher ratio.)

Word Types and Tokens in Brown Corpus, by Genre

Genre              Token Count  Type Count  Ratio
skill and hobbies        82345       11935    6.9
humor                    21695        5017    4.3
fiction: science         14470        3233    4.5
press: reportage        100554       14394    7.0
fiction: romance         70022        8452    8.3
religion                 39399        6373    6.2
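To see where a row of this table comes from, here is a minimal sketch that computes the token count, type count, and ratio for the press: reportage section. It assumes the brown corpus reader that we will meet in the next chapter, whose raw() method yields each sentence as a list of words:

>>> from nltk_lite.corpora import brown
>>> counts = {}
>>> for sent in brown.raw('a'):          # section a: press-reportage
...     for token in sent:
...         counts[token] = counts.get(token, 0) + 1
>>> tokens = sum(counts.values())        # total number of word tokens
>>> types = len(counts)                  # number of distinct word types
>>> print tokens, types, float(tokens) / types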

We can carry out a variety of interesting explorations simply by counting words. In fact, the field of Corpus Linguistics focuses almost exclusively on creating and interpreting such tables of word counts. So far, our method for identifying word tokens has been a little primitive, and we have not been able to separate punctuation from the words. We will take up this issue in the next section.

3.5.2 Lining things up in columns

[TODO: discuss formatted print statements in more detail]
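In the meantime, here is a minimal sketch of how a formatted print statement can line up the word counts from the previous example in columns (the field widths here are arbitrary choices for illustration):

>>> for word in sorted(counts)[:5]:
...     print "%-10s %4d" % (word, counts[word])   # word left-aligned in 10 columns, count right-aligned in 4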

3.5.3 Example: Lexical Dispersion

Word tokens vary in their distribution throughout a text. We can visualize word distributions, to get an overall sense of topics and topic shifts. For example, consider the pattern of mention of the main characters in Jane Austen's Sense and Sensibility: Elinor, Marianne, Edward and Willoughby. The following plot contains four rows, one for each name, in the order just given. Each row contains a series of lines, drawn to indicate the position of each token.

Figure 1: Lexical Dispersion

As you can see, Elinor and Marianne appear rather uniformly throughout the text, while Edward and Willoughby tend to appear separately. Here is the program that generated the above plot. [NB. Requires NLTK-Lite 0.6.7].

>>> from nltk_lite.corpora import gutenberg
>>> from nltk_lite.draw import dispersion
>>> words = ['Elinor', 'Marianne', 'Edward', 'Willoughby']
>>> dispersion.plot(gutenberg.raw('austen-sense'), words)


3.5.4 Frequency Distributions

We can do more sophisticated counting using frequency distributions. Abstractly, a frequency distribution is a record of the number of times each outcome of an experiment has occurred. For instance, a frequency distribution could be used to record the frequency of each word in a document (where the "experiment" is examining a word, and the "outcome" is the word's type). Frequency distributions are generally created by repeatedly running an experiment, and incrementing the count for a sample every time it is an outcome of the experiment. The following program produces a frequency distribution that records how often each word type occurs in a text. It increments a separate counter for each word, and prints the most frequently occurring word:

>>> from nltk_lite.probability import FreqDist
>>> from nltk_lite.corpora import genesis
>>> fd = FreqDist()
>>> for token in genesis.raw():
...     fd.inc(token)
>>> fd.max()
'the'

Once we construct a frequency distribution that records the outcomes of an experiment, we can use it to examine a number of interesting properties of the experiment. Some of these properties are summarized below:

Frequency Distribution Module

Name       Sample           Description
Count      fd.count('the')  number of times a given sample occurred
Frequency  fd.freq('the')   frequency of a given sample
N          fd.N()           number of samples
Samples    fd.samples()     list of distinct samples recorded
Max        fd.max()         sample with the greatest number of outcomes
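For instance, continuing with the distribution fd built from the genesis text above, these properties can be inspected directly (the numeric outputs are omitted here):

>>> fd.count('the')      # how many times 'the' occurred
>>> fd.freq('the')       # fd.count('the') divided by fd.N()
>>> fd.N()               # total number of samples recorded
>>> fd.samples()[:10]    # the first ten distinct samples
>>> fd.max()             # the most frequent sample; 'the', as we saw above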

We can also use a FreqDist to examine the distribution of word lengths in a corpus. For each word, we find its length, and increment the count for words of this length.

>>> def length_dist(text):
...     fd = FreqDist()                         # initialize frequency distribution
...     for token in genesis.raw(text):         # for each token
...         fd.inc(len(token))                  # found a word with this length
...     for i in range(1,15):                   # for each length from 1 to 14
...         print "%2d" % int(100*fd.freq(i)),  # print the percentage of words with this length
...     print

Now we can call length_dist on a text to print the distribution of word lengths. We see that the most frequent word length for the English sample is 3 characters, while the most frequent length for the Finnish sample is 5-6 characters.

>>> length_dist('english-kjv')
 2 14 28 21 13  7  5  2  2  0  0  0  0  0
>>> length_dist('finnish')
 0  9  6 10 16 16 12  9  6  3  2  2  1  0


3.5.5 Conditional Frequency Distributions

A condition specifies the context in which an experiment is performed. Often, we are interested in the effect that conditions have on the outcome for an experiment. A conditional frequency distribution is a collection of frequency distributions for the same experiment, run under different conditions. For example, we might want to examine how the distribution of a word's length (the outcome) is affected by the word's initial letter (the condition).

>>> from nltk_lite.corpora import genesis
>>> from nltk_lite.probability import ConditionalFreqDist
>>> cfdist = ConditionalFreqDist()
>>> for text in genesis.items:
...     for word in genesis.raw(text):
...         cfdist[word[0]].inc(len(word))

To plot the results, we construct a list of points, where the x coordinate is the word length, and the y coordinate is the frequency with which that word length is used:

>>> for cond in cfdist.conditions():
...     wordlens = cfdist[cond].samples()
...     wordlens.sort()
...     points = [(i, cfdist[cond].freq(i)) for i in wordlens]

We can plot these points using the Plot function defined in nltk_lite.draw.plot, as follows:

>>> Plot(points).mainloop()

3.5.6 Predicting the Next Word

Conditional frequency distributions are often used for prediction. Prediction is the problem of deciding a likely outcome for a given run of an experiment. The decision of which outcome to predict is usually based on the context in which the experiment is performed. For example, we might try to predict a word's text (outcome), based on the text of the word that it follows (context).

To predict the outcomes of an experiment, we first examine a representative training corpus, where the context and outcome for each run of the experiment are known. When presented with a new run of the experiment, we simply choose the outcome that occurred most frequently for the experiment's context.

We can use a ConditionalFreqDist to find the most frequent occurrence for each context. First, we record each outcome in the training corpus, using the context that the experiment was run under as the condition. Then, we can access the frequency distribution for a given context with the indexing operator, and use the max() method to find the most likely outcome.

We will now use a ConditionalFreqDist to predict the most likely next word in a text. To begin, we load a corpus from a text file, and create an empty ConditionalFreqDist:

>>> from nltk_lite.corpora import genesis
>>> from nltk_lite.probability import ConditionalFreqDist

>>> cfdist = ConditionalFreqDist()

We then examine each token in the corpus, and increment the appropriate sample's count. We use the variable prev to record the previous word.


>>> prev = None
>>> for word in genesis.raw():
...     cfdist[prev].inc(word)
...     prev = word

Note

Sometimes the context for an experiment is unavailable, or does not exist. For example, the first token in a text does not follow any word. In these cases, we must decide what context to use. For this example, we use None as the context for the first token. Another option would be to discard the first token.

Once we have constructed a conditional frequency distribution for the training corpus, we can use it to find the most likely word for any given context. For example, taking the word living as our context, we can inspect all the words that occurred in that context.

>>> word = 'living'
>>> cfdist[word].samples()
['creature,', 'substance', 'soul.', 'thing', 'thing,', 'creature']

We can set up a simple loop to generate text: we set an initial context, picking the most likely token in that context as our next word, and then using that word as our new context:

>>> word = 'living'
>>> for i in range(20):
...     print word,
...     word = cfdist[word].max()
living creature that he said, I will not be a wife of the land
of the land of the land

This simple approach to text generation tends to get stuck in loops, as demonstrated by the text generated above. A more advanced approach would be to randomly choose each word, with more frequent words chosen more often.
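Here is a minimal sketch of that idea, built only from the FreqDist methods summarized earlier; weighted_choice is a hypothetical helper for this illustration, not part of NLTK:

>>> import random
>>> def weighted_choice(fdist):
...     "Pick a sample at random, weighted by its count."
...     n = random.randint(0, fdist.N() - 1)   # a position among all recorded outcomes
...     for sample in fdist.samples():
...         n -= fdist.count(sample)           # walk through the samples until we pass position n
...         if n < 0:
...             return sample
>>> word = 'living'
>>> for i in range(20):
...     print word,
...     word = weighted_choice(cfdist[word])

Because each step is random, the output will differ from run to run, but it is much less likely to fall into the repetitive loop seen above.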

3.5.7 Exercises

1. Write a program to create a table of word frequencies by genre, like the one given above for modals. Choose your own words and try to find words whose presence (or absence) is typical of a genre. Discuss your findings.

2. Pick a text, and explore the dispersion of particular words. What does this tell you about the words, or the text?

3. Use the Plot function defined in nltk_lite.draw.plot to plot word-initial character against word length, as discussed in this section.

4. Zipf's Law: Let f(w) be the frequency of a word w in free text. Suppose that all the words of a text are ranked according to their frequency, with the most frequent word first. Zipf's law states that the frequency of a word type is inversely proportional to its rank (i.e. f × r = k, for some constant k). For example, the 50th most common word type should occur three times as frequently as the 150th most common word type.


a) Write a function to process a large text and plot word frequency against word rank using the nltk_lite.draw.plot module. Do you confirm Zipf's law? (Hint: it helps to set the axes to log-log.) What is going on at the extreme ends of the plotted line?

b) Generate random text, e.g. using random.choice("abcdefg "), taking care to include the space character. You will need to import random first. Use the string concatenation operator to accumulate characters into a (very) long string. Then tokenize this string, and generate the Zipf plot as before, and compare the two plots. What do you make of Zipf's Law in the light of this?

5. Predicting the next word: The word prediction program we saw in this chapter quickly gets stuck in a cycle. Modify the program to choose the next word randomly, from a list of the n most likely words in the given context. (Hint: store the n most likely words in a list lwords then randomly choose a word from the list using random.choice().)

a) Select a particular genre, such as a section of the Brown Corpus, or a genesis translation, or one of the Gutenberg texts. Train your system on this corpus and get it to generate random text. You may have to experiment with different start words. How intelligible is the text? Discuss the strengths and weaknesses of this method of generating random text.

b) Try the same approach with different genres, and with different amounts of training data. What do you observe?

c) Now train your system using two distinct genres and experiment with generating text in the hybrid genre. As before, discuss your observations.

6. Write a program to implement one or more text readability scores (see http://en.wikipedia.org/wiki/Readability).

7. (Advanced) Statistically Improbable Phrases: Design an algorithm to find the statistically improbable phrases of a document collection. http://www.amazon.com/gp/search-inside/sipshelp.html/

3.6 Conclusion

In this chapter we saw that we can do a variety of interesting language processing tasks that focus solely on words. Tokenization turns out to be far more difficult than expected. Other kinds of tokenization, such as sentence tokenization, are left for the exercises. No single solution works well across-the-board, and we must decide what counts as a token depending on the application domain. We also looked at normalization (including lemmatization) and saw how it collapses distinctions between tokens. In the next chapter we will look at word classes and automatic tagging.


3.7 Further Reading

About this document...
This chapter is a draft from Introduction to Natural Language Processing, by Steven Bird, Ewan Klein and Edward Loper, Copyright 2006 the authors. It is distributed with the Natural Language Toolkit [http://nltk.sourceforge.net], Version 0.7b1, under the terms of the Creative Commons Attribution-ShareAlike License [http://creativecommons.org/licenses/by-sa/2.5/].


4. Categorizing and Tagging Words

4.1 Introduction

In the last chapter we dealt with words in their own right. We saw that some distinctions can be collapsed using normalization, but we did not make any further generalizations. We looked at the distribution of often, identifying the words that follow it; we noticed that often frequently modifies verbs. We also assumed that you knew that words such as was, called and appears are all verbs, and that you knew that often is an adverb. In fact, we take it for granted that most people have a rough idea about how to group words into different categories.

There is a long tradition of classifying words into categories called parts of speech. These are sometimes also called word classes or lexical categories. Apart from verb and adverb, other familiar examples are noun, preposition, and adjective. One of the notable features of the Brown corpus is that all the words have been tagged for their part-of-speech. Now, instead of just looking at the words that immediately follow often, we can look at the part-of-speech tags (or POS tags). Here's a list of the top eight, ordered by frequency, along with explanations of each tag. As we can see, the majority of words following often are verbs.

Table 1: Part of Speech Tags Following often in the Brown Corpus

Tag  Freq  Example                Comment
vbn    61  burnt, gone            verb: past participle
vb     51  make, achieve          verb: base form
vbd    36  saw, looked            verb: simple past tense
jj     30  ambiguous, acceptable  adjective
vbz    24  sees, goes             verb: third-person singular present
in     18  by, in                 preposition
at     18  a, this                article
,      16  ,                      comma
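As a rough sketch of how such a table can be derived (assuming the tagged Brown Corpus reader that is introduced in the next section, which yields each sentence as a list of (word, tag) pairs):

>>> from nltk_lite.corpora import brown
>>> from nltk_lite.probability import FreqDist
>>> fd = FreqDist()
>>> for sent in brown.tagged():
...     for i in range(len(sent) - 1):
...         if sent[i][0] == 'often':
...             fd.inc(sent[i+1][1])      # the tag of the word following often
>>> fd.sorted_samples()[:8]               # the eight most frequent tags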

The process of classifying words into their parts-of-speech and labeling them accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging. The collection of tags used for a particular task is known as a tag set. Our emphasis in this chapter is on exploiting tags, and tagging text automatically.

Automatic tagging can bring a number of benefits. We have already seen an example of how to exploit tags in corpus analysis: we get a clear understanding of the distribution of often by looking at the tags of adjacent words. Automatic tagging also helps predict the behavior of previously unseen words. For example, if we encounter the word blogging we can probably infer that it is a verb, with the root blog, and likely to occur after forms of the auxiliary to be (e.g. he was blogging). Parts of speech are also used in speech synthesis and recognition. For example, wind/nn, as in the wind blew, is pronounced with a short vowel, whereas wind/vb, as in wind the clock, is pronounced with a long vowel. Other examples can be found where the stress pattern differs depending on whether the word is a noun or a verb, e.g. contest, insult, present, protest, rebel, suspect. Without knowing the part of speech we cannot be sure of pronouncing the word correctly.

In the next section we will see how to access and explore the Brown Corpus. Following this we will take a more in-depth look at the linguistics of word classes. The rest of the chapter will deal with automatic tagging: simple taggers, evaluation, n-gram taggers, and the Brill tagger.

4.2 Getting Started with Tagging

4.2.1 Representing Tags and Reading Tagged Corpora

By convention in NLTK, a tagged token is represented using a Python tuple as follows. (A tuple is just like a list, only it cannot be modified.)

>>> tok = ('fly', 'nn')
>>> tok
('fly', 'nn')

We can access the properties of this token in the usual way, as shown below:

>>> tok[0]
'fly'
>>> tok[1]
'nn'

We can create one of these special tuples from the standard string representation of a tagged token:

>>> from nltk_lite.tag import tag2tuple
>>> ttoken = 'fly/nn'
>>> tag2tuple(ttoken)
('fly', 'nn')

Several large corpora, such as the Brown Corpus and portions of the Wall Street Journal, have already been tagged, and we will be able to process this tagged data. Tagged corpus files typically contain text of the following form (this example is from the Brown Corpus):

The/at grand/jj jury/nn commented/vbd on/in a/at number/nn of/in
other/ap topics/nns ,/, among/in them/ppo the/at Atlanta/np and/cc
Fulton/np-tl County/nn-tl purchasing/vbg departments/nns which/wdt it/pps
said/vbd ``/`` are/ber well/ql operated/vbn and/cc follow/vb generally/rb
accepted/vbn practices/nns which/wdt inure/vb to/in the/at best/jjt
interest/nn of/in both/abx governments/nns ''/'' ./.

We can construct tagged tokens directly from a string, with the help of two NLTK functions, tokenize.whitespace() and tag2tuple:

>>> from nltk_lite import tokenize
>>> sent = '''
... The/at grand/jj jury/nn commented/vbd on/in a/at number/nn of/in
... other/ap topics/nns ,/, among/in them/ppo the/at Atlanta/np and/cc
... Fulton/np-tl County/nn-tl purchasing/vbg departments/nns which/wdt it/pps
... said/vbd ``/`` are/ber well/ql operated/vbn and/cc follow/vb generally/rb
... accepted/vbn practices/nns which/wdt inure/vb to/in the/at best/jjt
... interest/nn of/in both/abx governments/nns ''/'' ./.
... '''
>>> [tag2tuple(t) for t in tokenize.whitespace(sent)]
[('The', 'at'), ('grand', 'jj'), ('jury', 'nn'), ('commented', 'vbd'),
('on', 'in'), ('a', 'at'), ('number', 'nn'), ..., ('.', '.')]

We can also conveniently access tagged corpora directly from Python. The first step is to load the Brown Corpus reader, brown. We then use one of its functions, brown.tagged(), to produce a sequence of sentences, where each sentence is a list of tagged words.

>>> from nltk_lite.corpora import brown, extract
>>> extract(6, brown.tagged('a'))
[('The', 'at'), ('grand', 'jj'), ('jury', 'nn'), ('commented', 'vbd'),
('on', 'in'), ('a', 'at'), ('number', 'nn'), ('of', 'in'), ('other', 'ap'),
('topics', 'nns'), (',', ','), ..., ('.', '.')]

4.2.2 Nouns and Verbs

Linguists recognize several major categories of words in English, such as nouns, verbs, adjectives and determiners. In this section we will discuss the most important categories, namely nouns and verbs.

Nouns generally refer to people, places, things, or concepts, e.g.: woman, Scotland, book, intelligence. Nouns can appear after determiners and adjectives, and can be the subject or object of the verb:

Table 2: Syntactic Patterns involving some Nouns

Word          After a determiner                           Subject of the verb
woman         the woman who I saw yesterday ...            the woman sat down
Scotland      the Scotland I remember as a child ...       Scotland has five million people
book          the book I bought yesterday ...              this book recounts the colonization of Australia
intelligence  the intelligence displayed by the child ...  Mary's intelligence impressed her teachers

Nouns can be classified as common nouns and proper nouns. Proper nouns identify particular individuals or entities, e.g. Moses and Scotland. Common nouns are all the rest. Another distinction exists between count nouns and mass nouns. Count nouns are thought of as distinct entities which can be counted, such as pig (e.g. one pig, two pigs, many pigs). They cannot occur with the word much (i.e. *much pigs). Mass nouns, on the other hand, are not thought of as distinct entities (e.g. sand). They cannot be pluralized, and do not occur with numbers (e.g. *two sands, *many sands). However, they can occur with much (i.e. much sand).

Verbs are words which describe events and actions, e.g. fall, eat. In the context of a sentence, verbs express a relation involving the referents of one or more noun phrases.


Table 3: Syntactic Patterns involving some Verbs

Word  Simple           With modifiers and adjuncts (italicized)
fall  Rome fell        Dot com stocks suddenly fell like a stone
eat   Mice eat cheese  John ate the pizza with gusto

Verbs can be classified according to the number of arguments (usually noun phrases) that they require. The word fall is intransitive, requiring exactly one argument (the entity which falls). The word eat is transitive, requiring two arguments (the eater and the eaten). Other verbs are more complex; for instance put requires three arguments: the agent doing the putting, the entity being put somewhere, and a location. We will return to this topic when we come to look at grammars and parsing (see Chapter 7).

In the Brown Corpus, verbs have a range of possible tags, e.g.: give/vb (present), gives/vbz (present, 3ps), giving/vbg (present continuous; gerund), gave/vbd (simple past), and given/vbn (past participle). We will discuss these tags in more detail in a later section.

4.2.3 Nouns and verbs in tagged corpora

Now that we are able to access tagged corpora, we can write simple programs to garner statistics about the tags. In this section we will focus on the nouns and verbs.

What are the 20 most common verbs? We can write a program to find all words tagged with vb, vbz, vbg, vbd or vbn.

>>> from nltk_lite.probability import FreqDist
>>> fd = FreqDist()
>>> for sent in brown.tagged():
...     for word, tag in sent:
...         if tag[:2] == 'vb':
...             fd.inc(word+"/"+tag)
>>> fd.sorted_samples()[:20]
['said/vbd', 'make/vb', 'see/vb', 'get/vb', 'know/vb', 'made/vbn',
'came/vbd', 'go/vb', 'take/vb', 'went/vbd', 'say/vb', 'used/vbn',
'made/vbd', 'United/vbn-tl', 'think/vb', 'took/vbd', 'come/vb',
'knew/vbd', 'find/vb', 'going/vbg']

Let's study nouns, and find the most frequent nouns of each noun part-of-speech type. There are many noun pos tags; the most important of these are marked with $ for possessive nouns, s for plural nouns (since plural nouns typically end in s), and p for proper nouns.

>>> from nltk_lite.probability import ConditionalFreqDist
>>> cfd = ConditionalFreqDist()
>>> for sent in brown.tagged():
...     for word, tag in sent:
...         if tag[:2] == 'nn':
...             cfd[tag].inc(word)
>>> for tag in sorted(cfd.conditions()):
...     print tag, cfd[tag].sorted_samples()[:5]
nn ['time', 'man', 'Af', 'way', 'world']
nn$ ["man's", "father's", "year's", "mother's", "child's"]
nn$-hl ["Drug's", "Golf's", "Navy's", "amendment's", "drug's"]
nn$-tl ["President's", "State's", "Department's", "Foundation's", "Government's"]
nn+bez ["name's", "kid's", "company's", "fire's", "sky's"]
nn+bez-tl ["Knife's", "Pa's"]
nn+hvd-tl ["Pa'd"]
nn+hvz ["Summer's", "boat's", "company's", "guy's", "rain's"]
nn+hvz-tl ["Knife's"]
nn+in ['buncha']
nn+md ["cowhand'd", "sun'll"]
nn+nn-nc ['stomach-belly']
nn-hl ['editor', 'birth', 'chemical', 'name', 'no.']
nn-nc ['water', 'thing', 'home', 'linguist', 'luggage']
nn-tl ['President', 'State', 'Dr.', 'House', 'Department']
nn-tl-hl ['Sec.', 'Governor', 'B', 'Day', 'Island']
nn-tl-nc ['State']
nns ['years', 'people', 'men', 'eyes', 'days']
nns$ ["children's", "men's", "women's", "people's", "years'"]
nns$-hl ["Beginners'", "Dealers'", "Idols'", "Sixties'"]
nns$-nc ["sisters'"]
nns$-tl ["Women's", "People's", "States'", "Motors'", "Nations'"]
nns$-tl-hl ["Women's"]
nns+md ["duds'd", "oystchers'll"]
nns-hl ['costs', 'Problems', 'Sources', 'inches', 'methods']
nns-nc ['people', 'instructions', 'friends', 'things', 'emeralds']
nns-tl ['States', 'Nations', 'Motors', 'Communists', 'Times']
nns-tl-hl ['Times', 'Forces', 'Nations', 'States', 'Juniors']
nns-tl-nc ['States']

Some tags contain a plus sign; these are compound tags, and are assigned to words that contain two parts normally treated separately. Some tags contain a minus sign; this indicates disjunction [MORE].

4.2.4 The Default Tagger

The simplest possible tagger assigns the same tag to each token. This may seem to be a rather banal step, but it establishes an important baseline for tagger performance. In order to get the best result, we tag each word with the most likely tag. (This kind of tagger is known as a majority class classifier.) What then, is the most frequent tag? We can find out using a simple program:

>>> fd = FreqDist()
>>> for sent in brown.tagged('a'):
...     for word, tag in sent:
...         fd.inc(tag)
>>> fd.max()
'nn'

Now we can create a tagger, called default_tagger, which tags everything as nn.

>>> from nltk_lite import tag
>>> tokens = tokenize.whitespace('John saw 3 polar bears .')
>>> default_tagger = tag.Default('nn')
>>> list(default_tagger.tag(tokens))
[('John', 'nn'), ('saw', 'nn'), ('3', 'nn'), ('polar', 'nn'),
('bears', 'nn'), ('.', 'nn')]


Note

The tokenizer is a generator over tokens. We cannot print it directly, but we can convert it to a list for printing, as shown in the above program. Note that we can only use a generator once, but if we save it as a list, the list can be used many times over.

This is a simple algorithm, and it performs poorly when used on its own. On a typical corpus, it will tag only about an eighth of the tokens correctly:

>>> tag.accuracy(default_tagger, brown.tagged('a'))
0.13089484257215028

Default taggers assign their tag to every single word, even words that have never been encountered before. As it happens, most new words are nouns. Thus, default taggers help to improve the robustness of a language processing system. We will return to them later, in the context of our discussion of backoff.

4.2.5 Exercises

1. Write programs to process the Brown Corpus and find answers to the following questions:

1) Which nouns are more common in their plural form, rather than their singular form? (Only consider regular plurals, formed with the -s suffix.)

2) Which word has the greatest number of distinct tags? What are they?

3) List tags in order of decreasing frequency.

4) Which tags are nouns most commonly found after?

2. Generating statistics for tagged data:

a) What proportion of word types are always assigned the same part-of-speech tag?

b) How many words are ambiguous, in the sense that they appear with at least two tags?

c) What percentage of word occurrences in the Brown Corpus involve these ambiguous words?

3. Competition: Working with someone else, take turns to pick a word which can be either a noun or a verb (e.g. contest); the opponent has to predict which one is likely to be the most frequent in the Brown Corpus; check the opponent's prediction, and tally the score over several turns.

4.3 Looking for Patterns in Words

4.3.1 Some morphology

English nouns can be morphologically complex. For example, words like books and women are plural. Words with the -ness suffix are nouns that have been derived from adjectives, e.g. happiness and illness. The -ment suffix appears on certain nouns derived from verbs, e.g. government and establishment.


English verbs can also be morphologically complex. For instance, the present participle of a verb ends in -ing, and expresses the idea of ongoing, incomplete action (e.g. falling, eating). The -ing suffix also appears on nouns derived from verbs, e.g. the falling of the leaves (this is known as the gerund). In the Brown corpus, these are tagged vbg.

The simple past of a verb expresses the idea of a completed action, and often ends in -ed (e.g. walked), though many verbs have irregular simple past forms (e.g. fell, ate). These are tagged vbd.

[MORE: Modal verbs, e.g. would ...]

Common tag sets often capture some morpho-syntactic information; that is, information about the kind of morphological markings which words receive by virtue of their syntactic role. Consider, for example, the selection of distinct grammatical forms of the word go illustrated in the following sentences:

(1a) Go away!

(1b) He sometimes goes to the cafe.

(1c) All the cakes have gone.

(1d) We went on the excursion.

Each of these forms (go, goes, gone, and went) is morphologically distinct from the others. Consider the form, goes. This cannot occur in all grammatical contexts, but requires, for instance, a third person singular subject. Thus, the following sentences are ungrammatical.

(2a) *They sometimes goes to the cafe.

(2b) *I sometimes goes to the cafe.

By contrast, gone is the past participle form; it is required after have (and cannot be replaced in this context by goes), and cannot occur as the main verb of a clause.

(3a) *All the cakes have goes.

(3b) *He sometimes gone to the cafe.

We can easily imagine a tag set in which the four distinct grammatical forms just discussed were all tagged as VB. Although this would be adequate for some purposes, a more fine-grained tag set will provide useful information about these forms that can be of value to other processors which try to detect syntactic patterns from tag sequences. As we noted at the beginning of this chapter, the Brown tag set does in fact capture these distinctions, as summarized here:

Table 4: Some morpho-syntactic distinctions in the Brown tag set

Form  Category              Tag
go    base                  vb
goes  3rd singular present  vbz
gone  past participle       vbn
went  simple past           vbd


The verb be has an even wider range of distinct forms, and the differences between these forms are likewise encoded in their Brown Corpus tags: be/be, being/beg, am/bem, been/ben and was/bedz. This means that an automatic tagger which uses this tag set is in effect carrying out a limited amount of morphological analysis.

Most part-of-speech tag sets make use of the same basic categories, such as noun, verb, adjective, and preposition. However, tag sets differ both in how finely they divide words into categories, and in how they define their categories. For example, is might be tagged simply as a verb in one tag set, but as a distinct form of the lexeme BE in another tag set (as in the Brown Corpus). This variation in tag sets is unavoidable, since part-of-speech tags are used in different ways for different tasks. In other words, there is no one 'right way' to assign tags, only more or less useful ways depending on one's goals. More details about the Brown corpus tag set can be found in the Appendix.

4.3.2 The Regular Expression Tagger

The regular expression tagger assigns tags to tokens on the basis of matching patterns. For instance, we might guess that any word ending in ed is the past participle of a verb, and any word ending with 's is a possessive noun. We can express these as a list of regular expressions:

>>> patterns = [
...     (r'.*ing$', 'vbg'),               # gerunds
...     (r'.*ed$', 'vbd'),                # simple past
...     (r'.*es$', 'vbz'),                # 3rd singular present
...     (r'.*ould$', 'md'),               # modals
...     (r'.*\'s$', 'nn$'),               # possessive nouns
...     (r'.*s$', 'nns'),                 # plural nouns
...     (r'^-?[0-9]+(.[0-9]+)?$', 'cd'),  # cardinal numbers
...     (r'.*', 'nn')                     # nouns (default)
... ]

Note that these are processed in order, and the first one that matches is applied. Now we can set up a tagger and use it to tag some text.

>>> regexp_tagger = tag.Regexp(patterns)
>>> list(regexp_tagger.tag(brown.raw('a')))[3]
[('``', 'nn'), ('Only', 'nn'), ('a', 'nn'), ('relative', 'nn'),
('handful', 'nn'), ('of', 'nn'), ('such', 'nn'), ('reports', 'nns'),
('was', 'nns'), ('received', 'vbd'), ("''", 'nn'), (',', 'nn'),
('the', 'nn'), ('jury', 'nn'), ('said', 'nn'), (',', 'nn'), ('``', 'nn'),
('considering', 'vbg'), ('the', 'nn'), ('widespread', 'nn'), ..., ('.', 'nn')]

How well does this do?

>>> tag.accuracy(regexp_tagger, brown.tagged('a'))
0.20326391789486245

The final regular expression, .*, is a catch-all that tags everything as a noun. This is equivalent to the default tagger (only much less efficient). Instead of re-specifying this as part of the regular expression tagger, is there a way to combine this tagger with the default tagger? We will see how to do this later, under the heading of backoff taggers.


4.3.3 Exercises

1. Ambiguity resolved by part-of-speech tags: Search the web for "spoof newspaper headlines", to find such gems as: British Left Waffles on Falkland Islands, and Juvenile Court to Try Shooting Defendant. Manually tag these headlines to see if knowledge of the part-of-speech tags removes the ambiguity.

2. Satisfy yourself that there are restrictions on the distribution of go and went, in the sensethat they cannot be freely interchanged in the kinds of contexts illustrated in (1).

3. Identifying particular words and phrases according to tags:

a) Produce an alphabetically sorted list of the distinct words tagged as md.

b) Identify words which can be plural nouns or third person singular verbs (e.g. deals, flies).

c) Identify three-word prepositional phrases of the form IN + DET + NN (e.g. in the lab).

d) What is the ratio of masculine to feminine pronouns?

4. Advanced tasks with tags: There are 264 distinct words having exactly three possible tags.

a) Print a table with the integers 1..10 in one column, and the number of distinct words in the corpus having 1..10 distinct tags.

b) For the word with the greatest number of distinct tags, print out sentences from the corpus containing the word, one for each possible tag.

5. Write a program to classify contexts involving the word must according to the tag of the following word. Can this be used to discriminate between the epistemic and deontic uses of must?

6. In the introduction we saw a table involving frequency counts for the verbs adore, love, like, prefer and preceding qualifiers such as really. Investigate the full range of qualifiers (Brown tag ql) which appear before these four verbs.

7. Regular Expression Tagging: We defined the regexp_tagger, which can be used as a fall-back tagger for unknown words. This tagger only checks for cardinal numbers. By testing for particular prefix or suffix strings, it should be possible to guess other tags. For example, we could tag any word that ends with -s as a plural noun. Define a regular expression tagger (using tag.Regexp) which tests for at least five other patterns in the spelling of words. (Use inline documentation to explain the rules.)

8. Evaluating a Regular Expression Tagger: Consider the regular expression tagger developed in the exercises in the previous section. Evaluate the tagger using tag.accuracy(), and try to come up with ways to improve its performance. Discuss your findings. How does objective evaluation help in the development process?


4.4 Baselines and Backoff

So far the performance of our simple taggers has been disappointing. Before we embark on a process to get 90+% performance, we need to do two more things. First, we need to establish a more principled baseline performance than the default tagger, which was too simplistic, and the regular expression tagger, which was too arbitrary. Second, we need a way to connect multiple taggers together, so that if a more specialized tagger is unable to assign a tag, we can "back off" to a more generalized tagger.

4.4.1 The Lookup Tagger

A lot of high-frequency words do not have the nn tag. Let's find some of these words and their tags. The following function takes a list of sentences and counts up the words, and returns the n most frequent words.

>>> def wordcounts(sents, n):
...     "find the n most frequent words"
...     fd = FreqDist()
...     for sent in sents:
...         for word in sent:
...             fd.inc(word)    # count the word
...     return fd.sorted_samples()[:n]

Now let’s look at the 100 most frequent words:

>>> frequent_words = wordcounts(brown.raw('a'), 100)
>>> frequent_words
['the', ',', '.', 'of', 'and', 'to', 'a', 'in', 'for', 'The',
'that', '``', 'is', 'was', "''", 'on', 'at', 'with', 'be', 'by',
'as', 'he', 'said', 'his', 'will', 'it', 'from', 'are', ';', '--',
'an', 'has', 'had', 'who', 'have', 'not', 'Mrs.', 'were', 'this',
'which', 'would', 'their', 'been', 'they', 'He', 'one', ..., 'now']

Next, let's inspect the tags that these words have. First we will do this in the most obvious (but highly inefficient) way:

>>> [(w,t) for sent in brown.tagged('a')
...     for (w,t) in sent if w in frequent_words]
[('The', 'at'), ('of', 'in'), ('``', '``'), ("''", "''"),
('that', 'cs'), ('.', '.'), ('The', 'at'), ('in', 'in'),
('that', 'cs'), ('the', 'at'), ..., ("''", "''")]

A much better approach is to set up a dictionary which maps each of the 100 most frequent words to its most likely tag. We can do this by setting up a frequency distribution cfd over the tags, conditioned on each of the frequent words. This gives us, for each word, a count of the frequency of different tags that occur with the word.

>>> from nltk_lite.probability import ConditionalFreqDist
>>> def wordtags(tagged_sents, words):
...     "Find the most likely tag for these words in the tagged sentences"
...     cfd = ConditionalFreqDist()
...     for sent in tagged_sents:
...         for (w,t) in sent:
...             if w in words:
...                 cfd[w].inc(t)    # count the word's tag
...     return dict((word, cfd[word].max()) for word in words)


Now for any word that appears in this section of the corpus, we can look up its most likely tag. For example, to find the tag for the word The we can access the corresponding frequency distribution, and ask for its most frequent event:

>>> table = wordtags(brown.tagged('a'), frequent_words)
>>> table['The']
'at'

Now we can create and evaluate a simple tagger that assigns tags to words based on this table:

>>> baseline_tagger = tag.Lookup(table)
>>> tag.accuracy(baseline_tagger, brown.tagged('a'))
0.45578495136941344

This is surprisingly good; just knowing the tags for the 100 most frequent words enables us to tag nearly half the words correctly! Let's see how it does on some untagged input text:

>>> list(baseline_tagger.tag(brown.raw('a')))[3]
[('``', '``'), ('Only', None), ('a', 'at'), ('relative', None),
('handful', None), ('of', 'in'), ('such', None), ('reports', None),
('was', 'bedz'), ('received', None), ("''", "''"), (',', ','),
('the', 'at'), ('jury', None), ('said', 'vbd'), (',', ','),
('``', '``'), ('considering', None), ('the', 'at'), ('widespread', None),
('interest', None), ('in', 'in'), ('the', 'at'), ('election', None),
(',', ','), ('the', 'at'), ('number', None), ('of', 'in'),
('voters', None), ('and', 'cc'), ('the', 'at'), ('size', None),
('of', 'in'), ('this', None), ('city', None), ("''", "''"), ('.', '.')]

Notice that a lot of these words have been assigned a tag of None. That is because they were not among the 100 most frequent words. In these cases we would like to assign the default tag of nn, a process known as backoff.

4.4.2 Backoff

How do we combine these taggers? We want to use the lookup table first, and if it is unable to assign a tag, then use the default tagger. We do this by specifying the default tagger as an argument to the lookup tagger. The lookup tagger will call the default tagger just in case it can't assign a tag itself.

>>> baseline_tagger = tag.Lookup(table, backoff=tag.Default('nn'))
>>> tag.accuracy(baseline_tagger, brown.tagged('a'))
0.58177695566561249

4.4.3 Choosing a good baseline

We can write a simple (but somewhat inefficient) program to create and evaluate lookup taggers having a range of different sizes. Each lookup tagger backs off to a default tagger that tags everything as a noun.

>>> def performance(size):
...     frequent_words = wordcounts(brown.raw('a'), size)
...     table = wordtags(brown.tagged('a'), frequent_words)
...     baseline_tagger = tag.Lookup(table, backoff=tag.Default('nn'))
...     return baseline_tagger, tag.accuracy(baseline_tagger, brown.tagged('a'))
>>> for size in (100,200,500,1000,2000,5000,10000,20000,50000):
...     print performance(size)
(<Lookup Tagger: size=100>, 0.58177695566561249)
(<Lookup Tagger: size=200>, 0.62016428983431793)
(<Lookup Tagger: size=500>, 0.6791276329136583)
(<Lookup Tagger: size=1000>, 0.7248443622332279)
(<Lookup Tagger: size=2000>, 0.77832806253356401)
(<Lookup Tagger: size=5000>, 0.85188058157805757)
(<Lookup Tagger: size=10000>, 0.90473775284921532)
(<Lookup Tagger: size=14394>, 0.9349006503968017)
(<Lookup Tagger: size=14394>, 0.9349006503968017)

4.4.4 Exercises

1. Create a lookup tagger that uses the 1,000 most likely words. What is its performance? What happens to the performance when you include backoff to the default tagger?

2. What is the upper limit of performance for a lookup tagger, assuming no limit to the size of its table? (Hint: write a program to work out what percentage of tokens of a word are assigned the most likely tag for that word, on average.)

4.5 Getting Better Coverage

4.5.1 More English Word Classes

Two other important word classes are adjectives and adverbs. Adjectives describe nouns, and can be used as modifiers (e.g. large in the large pizza), or in predicates (e.g. the pizza is large). English adjectives can be morphologically complex (e.g. fall+ing in the falling stocks). Adverbs modify verbs to specify the time, manner, place or direction of the event described by the verb (e.g. quickly in the stocks fell quickly). Adverbs may also modify adjectives (e.g. really in Mary's teacher was really nice).

English has several categories of closed class words in addition to prepositions, such as articles (also often called determiners) (e.g., the, a), modals (e.g., should, may), and personal pronouns (e.g., she, they). Each dictionary and grammar classifies these words differently.

Part-of-speech tags are closely related to the notion of word class used in syntax. The assumption in linguistics is that every distinct word type will be listed in a lexicon (or dictionary), with information about its pronunciation, syntactic properties and meaning. A key component of the word's properties will be its class. When we carry out a syntactic analysis of an example like fruit flies like a banana, we will look up each word in the lexicon, determine its word class, and then group it into a hierarchy of phrases, as illustrated in the following parse tree.


Syntactic analysis will be dealt with in more detail in Part II. For now, we simply want to make the connection between the labels used in syntactic parse trees and part-of-speech tags. The following table shows the correspondence:

Table 5: Word Class Labels and Brown Corpus Tags

Word Class Label  Brown Tag  Word Class
Det               AT         article
N                 NN         noun
V                 VB         verb
Adj               JJ         adjective
P                 IN         preposition
Card              CD         cardinal number
--                .          sentence-ending punctuation

4.5.2 Some diagnostics

Now that we have examined word classes in detail, we turn to a more basic question: how do we decide what category a word belongs to in the first place? In general, linguists use three criteria: morphological (or formal); syntactic (or distributional); semantic (or notional). A morphological criterion is one which looks at the internal structure of a word. For example, -ness is a suffix which combines with an adjective to produce a noun. Examples are happy → happiness, ill → illness. So if we encounter a word which ends in -ness, this is very likely to be a noun.
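We can put this criterion to a rough empirical test with the tagged Brown Corpus. Here is a minimal sketch, using the brown reader and FreqDist from earlier in this chapter; if the criterion holds, the most frequent tag for -ness words should be nn:

>>> from nltk_lite.corpora import brown
>>> from nltk_lite.probability import FreqDist
>>> fd = FreqDist()
>>> for sent in brown.tagged('a'):
...     for word, tag in sent:
...         if word.endswith('ness'):
...             fd.inc(tag)           # record the tag of each -ness word
>>> fd.max()                          # expected to be 'nn' if the criterion holds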

A syntactic criterion refers to the contexts in which a word can occur. For example, assume that we have already determined the category of nouns. Then we might say that a syntactic criterion for an adjective in English is that it can occur immediately before a noun, or immediately following the words be or very. According to these tests, near should be categorized as an adjective:

1. the near window

2. The end is (very) near.

A familiar example of a semantic criterion is that a noun is "the name of a person, place or thing". Within modern linguistics, semantic criteria for word classes are treated with suspicion, mainly because they are hard to formalize. Nevertheless, semantic criteria underpin many of our intuitions about word classes, and enable us to make a good guess about the categorization of words in languages that we are unfamiliar with. For example, if all we know about the Dutch verjaardag is that it means the same as the English word birthday, then we can guess that verjaardag is a noun in Dutch. However, some care is needed: although we might translate zij is vandaag jarig as it's her birthday today, the word jarig is in fact an adjective in Dutch, and has no exact equivalent in English!

All languages acquire new lexical items. A list of words recently added to the Oxford Dictionary of English includes cyberslacker, fatoush, blamestorm, SARS, cantopop, bupkis, noughties, muggle, and robata. Notice that all these new words are nouns, and this is reflected in calling nouns an open class. By contrast, prepositions are regarded as a closed class. That is, there is a limited set of words belonging to the class (e.g., above, along, at, below, beside, between, during, for, from, in, near, on, outside, over, past, through, towards, under, up, with), and membership of the set only changes very gradually over time.

With this background we are now ready to embark on our main task for this chapter, automatically assigning part-of-speech tags to words.

4.5.3 Unigram Tagging

The UnigramTagger class implements a simple statistical tagging algorithm: for each token, it assigns the tag that is most likely for that particular token. For example, it will assign the tag jj to any occurrence of the word frequent, since frequent is used as an adjective (e.g. a frequent word) more often than it is used as a verb (e.g. I frequent this cafe).

Before a UnigramTagger can be used to tag data, it must be trained on a tagged corpus. It uses this corpus to determine which tags are most common for each word. UnigramTaggers are trained using the train() method, which takes a tagged corpus:

>>> from nltk_lite.corpora import brown
>>> from itertools import islice
>>> train_sents = list(islice(brown.tagged(), 500))    # sents 0..499
>>> unigram_tagger = tag.Unigram()
>>> unigram_tagger.train(train_sents)

Once a UnigramTagger has been trained, the tag() method can be used to tag new text:

>>> text = "John saw the book on the table"
>>> tokens = list(tokenize.whitespace(text))
>>> list(unigram_tagger.tag(tokens))
[('John', 'np'), ('saw', 'vbd'), ('the', 'at'), ('book', None),
('on', 'in'), ('the', 'at'), ('table', None)]

Unigram will assign the special tag None to any token that was not encountered in the training data.

4.5.4 Affix Taggers

Affix taggers are like unigram taggers, except they are trained on word prefixes or suffixes of a specified length. (NB. Here we use prefix and suffix in the string sense, not the morphological sense.) For example, the following tagger will consider suffixes of length 2 (e.g. -ed, -es), for words having at least 3 characters.

>>> affix_tagger = tag.Affix(-2, 3)
>>> affix_tagger.train(train_sents)
>>> list(affix_tagger.tag(tokens))
[('John', 'np'), ('saw', 'nn'), ('the', 'at'), ('book', 'vbd'),
('on', None), ('the', 'at'), ('table', 'jj')]

4.5.5 Exercises

1. Unigram Tagging: Train a unigram tagger and run it on some new text. Observe that some words are not assigned a tag. Why not?


2. Affix Tagging: Train an affix tagger tag.Affix() and run it on some new text. Experiment with different settings for the affix length and the minimum word length. Can you find a setting which seems to perform better than the one described above? Discuss your findings.

3. Affix Tagging: Write a program which calls tag.Affix() repeatedly, using different settings for the affix length and the minimum word length. What parameter values give the best overall performance? Why do you think this is the case?

4.6 N-Gram Taggers

Earlier we encountered the UnigramTagger, which assigns a tag to a word based on the identity of that word. In this section we will look at taggers that exploit a larger amount of context when assigning a tag.

4.6.1 Bigram Taggers

Bigram taggers use two pieces of contextual information for each tagging decision, typically the current word together with the tag of the previous word. Given the context, the tagger assigns the most likely tag. In order to do this, the tagger uses a bigram table, a fragment of which is shown below. Given the tag of the previous word (down the left), and the current word (across the top), it can look up the preferred tag.

Table 6: Fragment of Bigram Table

tag   ask   Congress   to   increase   grants   to   states
at                           nn
tl                      to                      to
bd                      to              nns     to
md    vb                     vb
vb          np          to              nns     to   nns
np                      to                      to
to    vb                     vb
nn          np          to   nn         nns     to
nns                     to                      to
in          np          in                      in   nns
jj                      to              nns     to   nns

The best way to understand the table is to work through an example. Suppose we are processing the sentence The President will ask Congress to increase grants to states for vocational rehabilitation and that we have got as far as will/md. We can use the table to simply read off the tags that should be assigned to the remainder of the sentence. When preceded by md, the tagger guesses that ask has the tag vb (italicized in the table). Moving to the next word, we know it is preceded by vb, and looking across this row we see that Congress is assigned the tag np. The process continues through the rest of the sentence. When we encounter the word increase, we correctly assign it the tag vb (unlike the unigram tagger which assigned it nn). However, the bigram tagger mistakenly assigns the infinitival tag to the word to immediately preceding states, rather than the preposition tag. This suggests that we may need to consider even more context in order to get the correct tag.
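The lookup the tagger performs can be pictured as a Python dictionary keyed on (previous tag, current word) pairs. Here is a minimal sketch that replays the worked example; bigram_table is a hypothetical fragment built by hand from the entries discussed above, not an NLTK object:

>>> bigram_table = {('md', 'ask'): 'vb', ('vb', 'Congress'): 'np',
...                 ('np', 'to'): 'to', ('to', 'increase'): 'vb',
...                 ('vb', 'grants'): 'nns', ('nns', 'to'): 'to'}
>>> prev = 'md'                              # we have got as far as will/md
>>> for word in ['ask', 'Congress', 'to', 'increase', 'grants', 'to']:
...     prev = bigram_table[(prev, word)]    # the chosen tag becomes the next context
...     print word + '/' + prev,
ask/vb Congress/np to/to increase/vb grants/nns to/to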

4.6.2 N-Gram Taggers

As we have just seen, it may be desirable to look at more than just the preceding word's tag when making a tagging decision. An n-gram tagger is a generalization of a bigram tagger whose context is the current word together with the part-of-speech tags of the n-1 preceding tokens, as shown in the following diagram. It then picks the tag which is most likely for that context. The tag to be chosen, tn, is circled, and the context is shaded in grey. In this example of an n-gram tagger, we have n=3; that is, we consider the tags of the two preceding words in addition to the current word.

Figure 1: Tagger Context

Note

A 1-gram tagger is another term for a unigram tagger: i.e., the context used to tag a token is just the text of the token itself. 2-gram taggers are also called bigram taggers, and 3-gram taggers are called trigram taggers.

The tag.Ngram class uses a tagged training corpus to determine which part-of-speech tag is most likely for each context. Here we see a special case of an n-gram tagger, namely a bigram tagger:

>>> bigram_tagger = tag.Bigram()
>>> bigram_tagger.train(brown.tagged(['a','b']))

Once a bigram tagger has been trained, it can be used to tag untagged corpora:

>>> text = "John saw the book on the table"
>>> tokens = list(tokenize.whitespace(text))
>>> list(bigram_tagger.tag(tokens))
[('John', 'np'), ('saw', 'vbd'), ('the', 'at'), ('book', 'nn'),
('on', 'in'), ('the', 'at'), ('table', None)]

As with the other taggers, n-gram taggers assign the tag None to any token whose context was not seen during training.


As n gets larger, the specificity of the contexts increases, as does the chance that the data we wish to tag contains contexts that were not present in the training data. This is known as the sparse data problem, and is quite pervasive in NLP. Thus, there is a trade-off between the accuracy and the coverage of our results (and this is related to the precision/recall trade-off in information retrieval).

Note

n-gram taggers should not consider context that crosses a sentence boundary. Accordingly, NLTK taggers are designed to work with lists of sentences, where each sentence is a list of words. At the start of a sentence, tn-1 and the preceding tags are set to None.

4.6.3 Combining Taggers

One way to address the trade-off between accuracy and coverage is to use the more accurate algorithms when we can, but to fall back on algorithms with wider coverage when necessary. For example, we could combine the results of a bigram tagger, a unigram tagger, and a default tagger, as follows:

1. Try tagging the token with the bigram tagger.

2. If the bigram tagger is unable to find a tag for the token, try the unigram tagger.

3. If the unigram tagger is also unable to find a tag, use a default tagger.

Each NLTK tagger other than tag.Default permits a backoff tagger to be specified. The backoff tagger may itself have a backoff tagger:

>>> t0 = tag.Default('nn')
>>> t1 = tag.Unigram(backoff=t0)
>>> t2 = tag.Bigram(backoff=t1)
>>> t1.train(brown.tagged('a'))  # section a: press-reportage
>>> t2.train(brown.tagged('a'))

Note

We specify the backoff tagger when the tagger is initialized, so that training can take advantage of the backing off. Thus, if the bigram tagger would assign the same tag as its unigram backoff tagger in a certain context, the bigram tagger discards the training instance. This keeps the bigram tagger model as small as possible. We can further specify that a tagger needs to see more than one instance of a context in order to retain it, e.g. Bigram(cutoff=2, backoff=t1) will discard contexts which have only been seen once or twice.

As before we test the taggers against unseen data. Here we will use a different segment of the corpus.

>>> accuracy0 = tag.accuracy(t0, brown.tagged('b'))  # section b: press-editorial
>>> accuracy1 = tag.accuracy(t1, brown.tagged('b'))
>>> accuracy2 = tag.accuracy(t2, brown.tagged('b'))


>>> print 'Default Accuracy = %4.1f%%' % (100 * accuracy0)
Default Accuracy = 12.5%
>>> print 'Unigram Accuracy = %4.1f%%' % (100 * accuracy1)
Unigram Accuracy = 81.0%
>>> print 'Bigram Accuracy = %4.1f%%' % (100 * accuracy2)
Bigram Accuracy = 81.9%

4.6.4 Limitations of Tagger Performance

Unfortunately perfect tagging is impossible. Consider the case of a trigram tagger. How many cases of part-of-speech ambiguity does it encounter? We can determine the answer to this question empirically:

>>> from nltk_lite.corpora import brown
>>> from nltk_lite.probability import ConditionalFreqDist
>>> cfdist = ConditionalFreqDist()
>>> for sent in brown.tagged('a'):
...     p = [(None, None)]  # empty token/tag pair
...     trigrams = zip(p+p+sent, p+sent+p, sent+p+p)
...     for (pair1, pair2, pair3) in trigrams:
...         context = (pair1[1], pair2[1], pair3[0])  # last two tags, this word
...         cfdist[context].inc(pair3[1])             # current tag
>>> total = ambiguous = 0
>>> for cond in cfdist.conditions():
...     if cfdist[cond].B() > 1:
...         ambiguous += cfdist[cond].N()
...     total += cfdist[cond].N()
>>> print float(ambiguous) / total
0.0509036201939

Thus, one out of twenty trigrams is ambiguous. Given the current word and the previous two tags, there is more than one tag that could be legitimately assigned to the current word according to the training data. Assuming we always pick the most likely tag in such ambiguous contexts, we can derive an empirical upper bound on the performance of a trigram tagger.
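Continuing the example, we can compute that upper bound directly. The following sketch reuses cfdist and total from above; it assumes the FreqDist methods max() (the most frequent sample) and count() (the frequency of a sample) of nltk_lite.probability:

>>> correct = 0
>>> for cond in cfdist.conditions():
...     correct += cfdist[cond].count(cfdist[cond].max())  # credit the most likely tag
>>> print float(correct) / total

The resulting fraction is the proportion of training tokens that a trigram tagger could tag correctly if it always chose the most likely tag for each context.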

4.6.5 Storing Taggers

[Discussion of saving a trained tagger to a file, so that it can be re-used without being retrained.]
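For instance, a trained tagger such as t2 above can be saved with Python's standard pickle module (a minimal sketch of the idea; the file name t2.pkl is arbitrary):

>>> import pickle
>>> output = open('t2.pkl', 'wb')
>>> pickle.dump(t2, output)
>>> output.close()

Later, possibly in a separate Python session, the tagger can be reloaded and used without retraining:

>>> input = open('t2.pkl', 'rb')
>>> tagger = pickle.load(input)
>>> input.close()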

4.6.6 Smoothing

[Brief discussion of NLTK's smoothing classes, for another approach to handling unknown words: Lidstone, Laplace, Expected Likelihood, Heldout, Witten-Bell, Good-Turing.]
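As a brief taste of the idea, here is an illustrative sketch (assuming the FreqDist and LaplaceProbDist classes of nltk_lite.probability): a Laplace distribution adds one to each count, so unseen events receive a small non-zero probability instead of the zero that maximum likelihood estimation would assign:

>>> from nltk_lite.probability import FreqDist, LaplaceProbDist
>>> fdist = FreqDist()
>>> for word in "the cat sat on the mat".split():
...     fdist.inc(word)
>>> laplace = LaplaceProbDist(fdist, bins=10)  # suppose 10 possible word types
>>> print laplace.prob('the')   # a seen event: discounted relative frequency
>>> print laplace.prob('dog')   # an unseen event: small but non-zero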

4.6.7 Exercises

1. Bigram Tagging: Train a bigram tagger with no backoff tagger, and run it on some of the training data. Next, run it on some new data. What happens to the performance of the tagger? Why?

2. Combining taggers: Create a default tagger and various unigram and n-gram taggers, incorporating backoff, and train them on part of the Brown corpus.


a) Create three different combinations of the taggers. Test the accuracy of each combined tagger. Which combination works best?

b) Try varying the size of the training corpus. How does it affect your results?

3. Sparse Data Problem: How serious is the sparse data problem? Investigate the performance of n-gram taggers as n increases from 1 to 6. Tabulate the accuracy scores. Estimate the training data required for these taggers, assuming a vocabulary size of 10^5 and a tagset size of 10^2.

4.7 Conclusion

This chapter has introduced the language processing task known as tagging, with an emphasis on part-of-speech tagging. English word classes and their corresponding tags were introduced. We showed how tagged tokens and tagged corpora can be represented, then discussed a variety of taggers: default tagger, regular expression tagger, unigram tagger and n-gram taggers. We also described some objective evaluation methods. In the process, the reader has been introduced to an important paradigm in language processing, namely language modeling. This paradigm is extremely general, and we will encounter it again later.

Observe that the tagging process simultaneously collapses distinctions (i.e., lexical identity is usually lost when all personal pronouns are tagged PRP), while introducing distinctions and removing ambiguities (e.g. deal tagged as VB or NN). This move facilitates classification and prediction. When we introduce finer distinctions in a tag set, we get better information about linguistic context, but we have to do more work to classify the current token (there are more tags to choose from). Conversely, with fewer distinctions, we have less work to do for classifying the current token, but less information about the context to draw on.

There are several other important approaches to tagging involving Transformation-Based Learning, Markov Modeling, and Finite State Methods. We will discuss these in a later chapter. Later we will see a generalization of tagging called chunking in which a contiguous sequence of words is assigned a single tag.

Part-of-speech tagging is just one kind of tagging, one that does not depend on deep linguistic analysis. There are many other kinds of tagging. Words can be tagged with directives to a speech synthesizer, indicating which words should be emphasized. Words can be tagged with sense numbers, indicating which sense of the word was used. Words can also be tagged with morphological features. Examples of each of these kinds of tags are shown below. For space reasons, we only show the tag for a single word. Note also that the first two examples use XML-style tags, where elements in angle brackets enclose the word that is tagged.

1. Speech Synthesis Markup Language (W3C SSML): That is a <emphasis>big</emphasis> car!

2. SemCor: Brown Corpus tagged with WordNet senses: Space in any <wf pos="NN" lemma="form" wnsn="4">form</wf> is completely measured by the three dimensions. (WordNet form/nn sense 4: "shape, form, configuration, contour, conformation")

3. Morphological tagging, from the Turin University Italian Treebank: E' italiano , come progetto e realizzazione , il primo (PRIMO ADJ ORDIN M SING) porto turistico dell' Albania .


Tagging exhibits several properties that are characteristic of natural language processing. First, tagging involves classification: words have properties; many words share the same property (e.g. cat and dog are both nouns), while some words can have multiple such properties (e.g. wind is a noun and a verb). Second, in tagging, disambiguation occurs via representation: we augment the representation of tokens with part-of-speech tags. Third, training a tagger involves sequence learning from annotated corpora. Finally, tagging uses simple, general methods such as conditional frequency distributions and transformation-based learning.

We have seen that ambiguity in the training data leads to an upper limit on tagger performance. Sometimes more context will resolve the ambiguity. In other cases however, as noted by Abney (1996), the ambiguity can only be resolved with reference to syntax, or to world knowledge. Despite these imperfections, part-of-speech tagging has played a central role in the rise of statistical approaches to natural language processing. In the early 1990s, the surprising accuracy of statistical taggers was a striking demonstration that it was possible to solve one small part of the language understanding problem, namely part-of-speech disambiguation, without reference to deeper sources of linguistic knowledge. Can this idea be pushed further? In the next chapter, on chunk parsing, we shall see that it can.

4.8 Further Reading

Tagging: Jurafsky and Martin, Chapter 8.

Brill tagging: Manning and Schutze 361ff; Jurafsky and Martin 307ff.

HMM tagging: Manning and Schutze 345ff.

Abney, Steven (1996). Tagging and Partial Parsing. In: Ken Church, Steve Young, and Gerrit Bloothooft (eds.), Corpus-Based Methods in Language and Speech. Kluwer Academic Publishers, Dordrecht. http://www.vinartus.net/spa/95a.pdf

Wikipedia: http://en.wikipedia.org/wiki/Part-of-speech_tagging

List of available taggers: http://www-nlp.stanford.edu/links/statnlp.html

4.9 Further Exercises

1. Impossibility of exact tagging: Write a program to determine the upper bound for accuracy of an n-gram tagger. Hint: how often is the context seen during training inadequate for uniquely determining the tag to assign to a word?

2. Impossibility of exact tagging: Consult the Abney reading and review his discussion of the impossibility of exact tagging. Explain why correct tagging of these examples requires access to other kinds of information than just words and tags. How might you estimate the scale of this problem?

3. Application to other languages: Obtain some tagged data for another language, and train and evaluate a variety of taggers on it. If the language is morphologically complex, or if there are any orthographic clues (e.g. capitalization) to word classes, consider developing a regular expression tagger for it (ordered after the unigram tagger, and before the default tagger). How does the accuracy of your tagger(s) compare with the same taggers run on English data? Discuss any issues you encounter in applying these methods to the language.


4. Comparing n-gram taggers and Brill taggers (advanced): Investigate the relative performance of n-gram taggers with backoff and Brill taggers as the size of the training data is increased. Consider the training time, running time, memory usage, and accuracy, for a range of different parameterizations of each technique.

5. HMM taggers: Explore the Hidden Markov Model tagger nltk_lite.tag.hmm.

6. (Advanced) Estimation: Use some of the estimation techniques in nltk_lite.probability, such as Lidstone or Laplace estimation, to develop a statistical tagger that does a better job than n-gram backoff taggers in cases where contexts encountered during testing were not seen during training. Read up on the TnT tagger, since this provides useful technical background: http://www.aclweb.org/anthology/A00-1031

4.10 Appendix: Brown Tag Set

The following table gives a sample of closed class words, following the classification of the Brown Corpus. (Note that part-of-speech tags may be presented as either upper-case or lower-case strings -- the case difference is not significant.)

Some English Closed Class Words, with Brown Tag

ap    determiner/pronoun, post-determiner
      many other next more last former little several enough most least only very few fewer past same
at    article
      the an no a every th' ever' ye
cc    conjunction, coordinating
      and or but plus & either neither nor yet 'n' and/or minus an'
cs    conjunction, subordinating
      that as after whether before while like because if since for than until so unless though providing once lest till whereas whereupon supposing albeit then
in    preposition
      of in for by considering to on among at through with under into regarding than since despite ...
md    modal auxiliary
      should may might will would must can could shall ought need wilt
pn    pronoun, nominal
      none something everything one anyone nothing nobody everybody everyone anybody anything someone no-one nothin'
ppl   pronoun, singular, reflexive
      itself himself myself yourself herself oneself ownself
pp$   determiner, possessive
      our its his their my your her out thy mine thine
pp$$  pronoun, possessive
      ours mine his hers theirs yours
pps   pronoun, personal, nom, 3rd pers sng
      it he she thee
ppss  pronoun, personal, nom, not 3rd pers sng
      they we I you ye thou you'uns
wdt   WH-determiner
      which what whatever whichever


wps   WH-pronoun, nominative
      that who whoever whosoever what whatsoever

4.10.1 Acknowledgments

About this document...
This chapter is a draft from Introduction to Natural Language Processing, by Steven Bird, Ewan Klein and Edward Loper, Copyright 2006 the authors. It is distributed with the Natural Language Toolkit [http://nltk.sourceforge.net], Version 0.7b1, under the terms of the Creative Commons Attribution-ShareAlike License [http://creativecommons.org/licenses/by-sa/2.5/].


5. Chunking

5.1 Introduction

Chunking is an efficient and robust method for identifying short phrases in text, or "chunks". Chunks are non-overlapping spans of text, usually consisting of a head word (such as a noun) and the adjacent modifiers and function words (such as adjectives and determiners). For example, here is some Wall Street Journal text with noun phrase chunks marked using brackets (this data is distributed with NLTK):

[ The/DT market/NN ] for/IN [ system-management/NN software/NN ] for/IN [ Digital/NNP ] [ 's/POS hardware/NN ] is/VBZ fragmented/JJ enough/RB that/IN [ a/DT giant/NN ] such/JJ as/IN [ Computer/NNP Associates/NNPS ] should/MD do/VB well/RB there/RB ./.

There are two motivations for chunking: to locate information, and to ignore information. In the former case, we may want to extract all noun phrases so they can be indexed. A text retrieval system could use such an index to support efficient retrieval for queries involving terminological expressions.

The reverse side of the coin is to ignore information. Suppose that we want to study syntactic patterns, finding particular verbs in a corpus and displaying their arguments. For instance, here are some uses of the verb gave in the Wall Street Journal (in the Penn Treebank corpus sample). After doing NP-chunking, the internal details of each noun phrase have been suppressed, allowing us to see some higher-level patterns:

gave NP
gave up NP in NP
gave NP up
gave NP NP

gave NP to NP

In this way we can acquire information about the complementation patterns of a verb like gave, for use in the development of a grammar (see Chapter 7).

Chunking in NLTK begins with tagged text, represented as a flat tree:

>>> from nltk_lite import chunk
>>> tagged_text = "the/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NN"
>>> input = chunk.tagstr2tree(tagged_text)
>>> input.draw()


Next, we write regular expressions over tag sequences. The following example identifies noun phrases that consist of an optional determiner, followed by any number of adjectives, then a noun.

>>> cp = chunk.Regexp("NP: {<DT>?<JJ>*<NN>}")

We create a chunker cp which can then be used repeatedly to parse tagged input. The result of chunking is also a tree, but with some extra structure:

>>> cp.parse(input).draw()

In this chapter we explore chunking in depth, beginning with the definition and representation of chunks. We will see regular expression and n-gram approaches to chunking, and will develop and evaluate chunkers using the CoNLL-2000 chunking corpus.

5.2 Defining and Representing Chunks

5.2.1 An Analogy

Two of the most common operations in language processing are segmentation and labeling. Recall that in tokenization, we segment a sequence of characters into tokens, while in tagging we label each of these tokens. Moreover, these two operations of segmentation and labeling go hand in hand. We break up a stream of characters into linguistically meaningful segments (e.g. words) so that we can classify those segments with their part-of-speech categories. The result of such classification is represented by adding a label to the segment in question.

In this chapter we do this segmentation and labeling at a higher level, as illustrated in Figure 1. The solid boxes show word-level segmentation and labeling, while the dashed boxes show a higher-level segmentation and labeling. These larger pieces are called chunks, and the process of identifying them is called chunking.

Figure 1: Segmentation and Labeling at both the Token and Chunk Levels

Like tokenization, chunking can skip over material in the input. Tokenization omits white space and punctuation characters. Chunking uses only a subset of the tokens and leaves others out.


5.2.2 Chunking vs Parsing

Chunking is akin to parsing in the sense that it can be used to build hierarchical structure over text. There are several important differences, however. First, as noted above, chunking is not exhaustive, and typically omits items in the surface string. Second, where parsing constructs deeply nested structures, chunking creates structures of fixed depth (typically depth 2). These chunks often correspond to the lowest level of grouping identified in the full parse tree, as illustrated in the parsing and chunking examples in (1) below:

(1a)

(1b)

A significant motivation for chunking is its robustness and efficiency relative to parsing. Parsing uses recursive phrase structure grammars and arbitrary-depth trees. Parsing has problems with robustness, given the difficulty in getting broad coverage and in resolving ambiguity. Parsing is also relatively inefficient: the time taken to parse a sentence grows with the cube of the length of the sentence, while the time taken to chunk a sentence only grows linearly.

5.2.3 Representing Chunks: Tags vs Trees

As befits its intermediate status between tagging and parsing, chunk structures can be represented using either tags or trees. The most widespread file representation uses so-called IOB tags. In this scheme, each token is tagged with one of three special chunk tags, I (inside), O (outside), or B (begin). A token is tagged as B if it marks the beginning of a chunk. Subsequent tokens within the chunk are tagged I. All other tokens are tagged O. The B and I tags are suffixed with the chunk type, e.g. B-NP, I-NP. Of course, it is not necessary to specify a chunk type for tokens that appear outside a chunk, so these are just labeled O. An example of this scheme is shown in Figure 2.

Figure 2: Tag Representation of Chunk Structures


IOB tags have become the standard way to represent chunk structures in files, and we will also be using this format. Here is an example of the file representation of the information in Figure 2:

He PRP B-NP
saw VBD O
the DT B-NP
big JJ I-NP
dog NN I-NP

In this representation, there is one token per line, each with its part-of-speech tag and its chunk tag. We will see later that this format permits us to represent more than one chunk type, so long as the chunks do not overlap. This file format was developed as part of the chunking evaluation task run by the Conference on Natural Language Learning in 2000, and has come to be called the IOB Format. A section of Wall Street Journal text has been annotated in this format.

As we saw earlier, chunk structures can also be represented using trees. These have the benefit that each chunk is a constituent that can be manipulated directly. An example is shown in Figure 3:

Figure 3: Tree Representation of Chunk Structures

NLTK uses trees for its internal representation of chunks, and provides methods for reading and writing such trees to the IOB format, as the small example below illustrates. By now you should understand what chunks are, and how they are represented. In the next section you will see how to build a simple chunker.
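Here is a sketch of the round trip, assuming that chunk.tagstr2tree() also accepts the square-bracket chunk notation seen earlier, and using the chunk.tree2conlltags() utility that we will meet again later in this chapter:

>>> sent = chunk.tagstr2tree("[ He/PRP ] saw/VBD [ the/DT big/JJ dog/NN ]")
>>> for word, tag, chtag in chunk.tree2conlltags(sent):
...     print word, tag, chtag
He PRP B-NP
saw VBD O
the DT B-NP
big JJ I-NP
dog NN I-NP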

5.3 Chunking

A chunker finds contiguous, non-overlapping spans of related tokens and groups them together into chunks. Chunkers often operate on tagged texts, and use the tags to make chunking decisions. In this section we will see how to write a special type of regular expression over part-of-speech tags, and then how to combine these into a chunk grammar. Then we will set up a chunker to chunk some tagged text according to the grammar.

5.3.1 Tag Patterns

A tag pattern is a sequence of part-of-speech tags delimited using angle brackets, e.g. <DT><JJ><NN>. Tag patterns are the same as the regular expression patterns we have already seen, except for two differences which make them easier to use for chunking. First, angle brackets group their contents into atomic units, so "<NN>+" matches one or more repetitions of the tag NN, and "<NN|JJ>" matches the tag NN or JJ. Second, the period wildcard operator is constrained not to cross tag delimiters, so that "<N.*>" matches any single tag starting with N.

Now, consider the following noun phrases from the Wall Street Journal:


another/DT sharp/JJ dive/NN
trade/NN figures/NNS
any/DT new/JJ policy/NN measures/NNS
earlier/JJR stages/NNS

Panamanian/JJ dictator/NN Manuel/NNP Noriega/NNP

We can match these using a slight refinement of the first tag pattern above: <DT>?<JJ.*>*<NN.*>+. This can be used to chunk any sequence of tokens beginning with an optional determiner DT, followed by zero or more adjectives of any type JJ.* (including relative adjectives like earlier/JJR), followed by one or more nouns of any type NN.*. It is easy to find many more difficult examples:

his/PRP$ Mansion/NNP House/NNP speech/NN
the/DT price/NN cutting/VBG
3/CD %/NN to/TO 4/CD %/NN
more/JJR than/IN 10/CD %/NN
the/DT fastest/JJS developing/VBG trends/NNS
's/POS skill/NN

Your challenge will be to come up with tag patterns to cover these and other examples.
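One way to attack this challenge is to test each candidate pattern against the examples directly, along the following lines (a sketch using the chunk.Regexp chunker that was introduced at the start of this chapter):

>>> cp = chunk.Regexp("NP: {<DT>?<JJ.*>*<NN.*>+}")
>>> print cp.parse(chunk.tagstr2tree("earlier/JJR stages/NNS"))
(S: (NP: ('earlier', 'JJR') ('stages', 'NNS')))
>>> print cp.parse(chunk.tagstr2tree("the/DT price/NN cutting/VBG"))
(S: (NP: ('the', 'DT') ('price', 'NN')) ('cutting', 'VBG'))

The refined pattern covers the first group of examples, but as the second call shows, it leaves cutting/VBG outside the chunk; covering the harder cases requires further refinement.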

5.3.2 Chunking with Regular Expressions

The chunker begins with a flat structure in which no tokens are chunked. Patterns are applied in turn, successively updating the chunk structure. Once all of the patterns have been applied, the resulting chunk structure is returned. Here is a simple chunk grammar consisting of two patterns. The first pattern matches an optional determiner, zero or more adjectives, then a noun. We also define some input to be chunked.

>>> grammar = r"""
... NP:
...     {<DT>?<JJ>*<NN>}  # chunk determiners, adjectives and nouns
...     {<NNP>+}          # chunk sequences of proper nouns
... """
>>> tagged_text = "the/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NN"
>>> input = chunk.tagstr2tree(tagged_text)

Next we can set up a chunker and run it on the input:

>>> cp = chunk.Regexp(grammar)
>>> print cp.parse(input)
(S:
  (NP: ('the', 'DT') ('little', 'JJ') ('cat', 'NN'))
  ('sat', 'VBD')
  ('on', 'IN')
  (NP: ('the', 'DT') ('mat', 'NN')))

If a tag pattern matches at multiple overlapping locations, the first match takes precedence. For example, if we apply a rule that matches two consecutive nouns to a text containing three consecutive nouns, then the first two nouns will be chunked:

>>> nouns = chunk.tagstr2tree("money/NN market/NN fund/NN")
>>> grammar = "NP: {<NN><NN>}  # Chunk two consecutive nouns"
>>> cp = chunk.Regexp(grammar)
>>> print cp.parse(nouns)
(S: (NP: ('money', 'NN') ('market', 'NN')) ('fund', 'NN'))


5.3.3 Developing Chunkers

Creating a good chunker usually requires several rounds of development and testing, during which existing rules are refined and new rules are added. In order to diagnose any problems, it often helps to trace the execution of a chunker, using its trace argument. The tracing output shows the rules that are applied, and uses braces to show the chunks that are created at each stage of processing. In the following example, two chunk patterns are applied to the input sentence. The first rule finds all sequences of three tokens whose tags are DT, JJ, and NN, and the second rule finds any sequence of tokens whose tags are either DT or NN.

>>> grammar = r"""
... NP:
...     {<DT><JJ><NN>}  # Chunk det+adj+noun
...     {<DT|NN>+}      # Chunk sequences of NN and DT
... """
>>> cp = chunk.Regexp(grammar)
>>> print cp.parse(input, trace=1)
# Input:
 <DT>  <JJ>  <NN>  <VBD>  <IN>  <DT>  <NN>
# Chunk det+adj+noun:
{<DT>  <JJ>  <NN>} <VBD>  <IN>  <DT>  <NN>
# Chunk sequences of NN and DT:
{<DT>  <JJ>  <NN>} <VBD>  <IN> {<DT>  <NN>}
(S:
  (NP: ('the', 'DT') ('little', 'JJ') ('cat', 'NN'))
  ('sat', 'VBD')
  ('on', 'IN')
  (NP: ('the', 'DT') ('mat', 'NN')))

Observe that when we chunk material that is already partially chunked, the chunker will only create chunks that do not partially overlap existing chunks. Thus, if we apply these two rules in reverse order, we will get a different result:

>>> grammar = r"""
... NP:
...     {<DT|NN>+}      # Chunk sequences of NN and DT
...     {<DT><JJ><NN>}  # Chunk det+adj+noun
... """
>>> cp = chunk.Regexp(grammar)
>>> print cp.parse(input, trace=1)
# Input:
 <DT>  <JJ>  <NN>  <VBD>  <IN>  <DT>  <NN>
# Chunk sequences of NN and DT:
{<DT>} <JJ> {<NN>} <VBD>  <IN> {<DT>  <NN>}
# Chunk det+adj+noun:
{<DT>} <JJ> {<NN>} <VBD>  <IN> {<DT>  <NN>}
(S:
  (NP: ('the', 'DT'))
  ('little', 'JJ')
  (NP: ('cat', 'NN'))
  ('sat', 'VBD')
  ('on', 'IN')
  (NP: ('the', 'DT') ('mat', 'NN')))


Here, rule 2 did not find any chunks, since all chunks that matched its tag pattern overlapped with existing chunks.

5.3.4 Exercises

1. Chunking Demonstration: Run the chunking demonstration:

from nltk_lite.parse import chunk

chunk.demo() # the chunker

2. IOB Tags: The IOB format categorizes tagged tokens as I, O and B. Why are three tags necessary? What problem would be caused if we used I and O tags exclusively?

3. Write a tag pattern to match noun phrases containing plural head nouns, e.g. "many/JJ researchers/NNS", "two/CD weeks/NNS", "both/DT new/JJ positions/NNS". Try to do this by generalizing the tag pattern that handled singular noun phrases.

4. Write a tag pattern to cover noun phrases that contain gerunds, e.g. "the/DT receiving/VBG end/NN", "assistant/NN managing/VBG editor/NN". Add these patterns to the grammar, one per line. Test your work using some tagged sentences of your own devising.

5. Write one or more tag patterns to handle coordinated noun phrases, e.g. "July/NNP and/CC August/NNP", "all/DT your/PRP$ managers/NNS and/CC supervisors/NNS", "company/NN courts/NNS and/CC adjudicators/NNS".

6. Sometimes a word is incorrectly tagged, e.g. the head noun in "12/CD or/CC so/RB cases/VBZ". Instead of requiring manual correction of tagger output, good chunkers are able to work with the erroneous output of taggers. Look for other examples of correctly chunked noun phrases with incorrect tags.

5.4 Scaling Up

Now that you have a taste of what chunking can do, you are ready to look at a chunked corpus, and use it in developing and testing more complex chunkers. We will begin by looking at the mechanics of converting IOB format into an NLTK tree, then at how this is done on a larger scale using the corpus directly. We will see how to use the corpus to score the accuracy of a chunker, then look at some more flexible ways to manipulate chunks. Throughout, our focus will be on scaling up the coverage of a chunker.

5.4.1 Reading IOB Format and the CoNLL 2000 Corpus

Using the nltk_lite.corpora module we can load Wall Street Journal text that has been tagged, then chunked using the IOB notation. The chunk categories provided in this corpus are NP, VP and PP. As we have seen, each sentence is represented using multiple lines, as shown below:

he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP

...


A conversion function chunk.conllstr2tree() builds a tree representation from one of these multi-line strings. Moreover, it permits us to choose any subset of the three chunk types to use. The example below produces only NP chunks:

>>> text = '''
... he PRP B-NP
... accepted VBD B-VP
... the DT B-NP
... position NN I-NP
... of IN B-PP
... vice NN B-NP
... chairman NN I-NP
... of IN B-PP
... Carlyle NNP B-NP
... Group NNP I-NP
... , , O
... a DT B-NP
... merchant NN I-NP
... banking NN I-NP
... concern NN I-NP
... . . O
... '''
>>> chunk.conllstr2tree(text, chunk_types=('NP',)).draw()

We can use the NLTK corpus module to access a larger amount of chunked text. The CoNLL 2000 corpus contains 270k words of Wall Street Journal text, with part-of-speech tags and chunk tags in the IOB format. We can access this data using an NLTK corpus reader called conll2000. Here is an example:

>>> from nltk_lite.corpora import conll2000, extract
>>> print extract(2000, conll2000.chunked())
(S:
  (NP: ('Health-care', 'JJ') ('companies', 'NNS'))
  (VP: ('should', 'MD') ('get', 'VB'))
  ('healthier', 'JJR')
  (PP: ('in', 'IN'))
  (NP: ('the', 'DT') ('third', 'JJ') ('quarter', 'NN'))
  ('.', '.'))

This just showed three chunk types, for NP, VP and PP. We can also select which chunk types to read:

>>> from nltk_lite.corpora import conll2000, extract
>>> print extract(2000, conll2000.chunked(chunk_types=('NP',)))
(S:
  (NP: ('Health-care', 'JJ') ('companies', 'NNS'))
  ('should', 'MD')
  ('get', 'VB')
  ('healthier', 'JJR')
  ('in', 'IN')
  (NP: ('the', 'DT') ('third', 'JJ') ('quarter', 'NN'))
  ('.', '.'))

5.4.2 Simple Evaluation and Baselines

Armed with a corpus, it is now possible to do some simple evaluation. The first evaluation is to establish a baseline for the case where nothing is chunked:

>>> cp = chunk.Regexp("")
>>> print chunk.accuracy(cp, conll2000.chunked(chunk_types=('NP',)))
0.440845995079

Now let's try a naive regular expression chunker that looks for tags beginning with letters that are typical of noun phrase tags:

>>> grammar = r"""NP: {<[CDJNP].*>+}"""
>>> cp = chunk.Regexp(grammar)
>>> print chunk.accuracy(cp, conll2000.chunked(chunk_types=('NP',)))
0.874479872666

We can extend this approach, and create a function chunked_tags() that takes some chunked data, and sets up a conditional frequency distribution. For each tag, it counts up the number of times the tag occurs inside a chunk (the True case), or outside a chunk (the False case). It returns a list of those tags that occur inside chunks more often than outside chunks.

>>> def chunked_tags(train):
...     """Generate a list of tags that tend to appear inside chunks"""
...     from nltk_lite.probability import ConditionalFreqDist
...     cfdist = ConditionalFreqDist()
...     for t in train:
...         for word, tag, chtag in chunk.tree2conlltags(t):
...             if chtag == "O":
...                 cfdist[tag].inc(False)
...             else:
...                 cfdist[tag].inc(True)
...     return [tag for tag in cfdist.conditions() if cfdist[tag].max() == True]

The next step is to convert this list of tags into a tag pattern. To do this we need to "escape" all non-word characters, by preceding them with a backslash. Then we need to join them into a disjunction. This process would convert a tag list ['NN', 'NN$'] into the tag pattern <NN|NN\$>. The following function does this work, and returns a regular expression chunker:

>>> def baseline_chunker(train):
...     import re
...     chunk_tags = [re.sub(r'(\W)', r'\\\1', tag)
...                   for tag in chunked_tags(train)]
...     grammar = 'NP: {<' + '|'.join(chunk_tags) + '>+}'
...     return chunk.Regexp(grammar)

The final step is to train this chunker and test its accuracy (this time on data not seen during training):

>>> cp = baseline_chunker(conll2000.chunked(files='train', chunk_types=('NP',)))
>>> print chunk.accuracy(cp, conll2000.chunked(files='test', chunk_types=('NP',)))
0.914262194736


5.4.3 Splitting and Merging (incomplete)

[Notes: the above approach creates chunks that are too large, e.g. the cat the dog chased would be given a single NP chunk because it does not detect that determiners introduce new chunks. For this we would need a rule to split an NP chunk prior to any determiner, using a pattern like: "NP: <.*>}{<DT>". We can also merge chunks, e.g. "NP: <NN>{}<NN>".]
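To make the note concrete, here is a sketch of how such a split rule might be written, assuming the string grammar notation accepts the }{ split pattern suggested above:

>>> grammar = r"""
... NP:
...     {<DT|JJ|NN.*>+}   # chunk determiner/adjective/noun sequences
...     <.*>}{<DT>        # split a chunk before any determiner
... """
>>> cp = chunk.Regexp(grammar)
>>> print cp.parse(chunk.tagstr2tree("the/DT cat/NN the/DT dog/NN chased/VBD"))

If split rules are supported in this form, the first rule will chunk the cat the dog as a single NP, and the split rule will then divide it in two, giving [ the/DT cat/NN ] [ the/DT dog/NN ] chased/VBD.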

5.4.4 Chinking

Sometimes it is easier to define what we don't want to include in a chunk than it is to define what we do want to include. In these cases, it may be easier to build a chunker using a method called chinking.

The word chink initially meant a sequence of stopwords, according to a 1975 paper by Ross and Tukey (cited by Abney in the recommended reading for this chapter). Following Abney, we define a chink as a sequence of tokens that is not included in a chunk. In the following example, sat/VBD on/IN is a chink:

[ the/DT little/JJ cat/NN ] sat/VBD on/IN [ the/DT mat/NN ]

Chinking is the process of removing a sequence of tokens from a chunk. If the sequence of tokens spans an entire chunk, then the whole chunk is removed; if the sequence of tokens appears in the middle of the chunk, these tokens are removed, leaving two chunks where there was only one before. If the sequence is at the beginning or end of the chunk, these tokens are removed, and a smaller chunk remains. These three possibilities are illustrated in the following table:

Chinking

             Entire chunk            Middle of a chunk           End of a chunk
Input        [a/DT big/JJ cat/NN]    [a/DT big/JJ cat/NN]        [a/DT big/JJ cat/NN]
Operation    Chink "DT JJ NN"        Chink "JJ"                  Chink "NN"
Pattern      }DT JJ NN{              }JJ{                        }NN{
Output       a/DT big/JJ cat/NN      [a/DT] big/JJ [cat/NN]      [a/DT big/JJ] cat/NN

In the following grammar, we put the entire sentence into a single chunk, then excise the chink:

>>> grammar = r"""
... NP:
...     {<.*>+}        # Chunk everything
...     }<VBD|IN>+{    # Chink sequences of VBD and IN
... """
>>> cp = chunk.Regexp(grammar)
>>> print cp.parse(input)
(S:
  (NP: ('the', 'DT') ('little', 'JJ') ('cat', 'NN'))
  ('sat', 'VBD')
  ('on', 'IN')
  (NP: ('the', 'DT') ('mat', 'NN')))
>>> print chunk.accuracy(cp, conll2000.chunked(files='test', chunk_types=('NP',)))
0.581041433607

A chunk grammar can use any number of chunking and chinking patterns in any order.


5.4.5 Multiple Chunk Types (incomplete)

So far we have only developed NP chunkers. However, as we saw earlier in the chapter, the CoNLL chunking data is also annotated for PP and VP chunks. Here is an example, to show the structure we get from the corpus and the flattened version that will be used as input to the parser.

>>> example = extract(2000, conll2000.chunked())
>>> print example
(S:
  (NP: ('Health-care', 'JJ') ('companies', 'NNS'))
  (VP: ('should', 'MD') ('get', 'VB'))
  ('healthier', 'JJR')
  (PP: ('in', 'IN'))
  (NP: ('the', 'DT') ('third', 'JJ') ('quarter', 'NN'))
  ('.', '.'))
>>> print example.flatten()
(S:
  ('Health-care', 'JJ')
  ('companies', 'NNS')
  ('should', 'MD')
  ('get', 'VB')
  ('healthier', 'JJR')
  ('in', 'IN')
  ('the', 'DT')
  ('third', 'JJ')
  ('quarter', 'NN')
  ('.', '.'))

Now we can set up a multi-stage chunk grammar. It will have one stage for each of the chunk types.

>>> grammar = r"""
... NP: {<DT>?<JJ>*<NN.*>+}  # noun phrase chunks
... VP: {<TO>?<VB.*>}        # verb phrase chunks
... PP: {<IN>}               # prepositional phrase chunks
... """
>>> cp = chunk.Regexp(grammar)
>>> print cp.parse(example.flatten(), trace=1)
# Input:
 <JJ>  <NNS>  <MD>  <VB>  <JJR>  <IN>  <DT>  <JJ>  <NN>  <.>
# noun phrase chunks:
{<JJ>  <NNS>} <MD>  <VB>  <JJR>  <IN> {<DT>  <JJ>  <NN>} <.>
# Input:
 <NP>  <MD>  <VB>  <JJR>  <IN>  <NP>  <.>
# verb phrase chunks:
 <NP>  <MD> {<VB>} <JJR>  <IN>  <NP>  <.>
# Input:
 <NP>  <MD>  <VP>  <JJR>  <IN>  <NP>  <.>
# prepositional phrase chunks:
 <NP>  <MD>  <VP>  <JJR> {<IN>} <NP>  <.>
(S:
  (NP: ('Health-care', 'JJ') ('companies', 'NNS'))
  ('should', 'MD')
  (VP: ('get', 'VB'))
  ('healthier', 'JJR')
  (PP: ('in', 'IN'))
  (NP: ('the', 'DT') ('third', 'JJ') ('quarter', 'NN'))
  ('.', '.'))

5.4.6 Exercises

1. Simple Chunker: Pick one of the three chunk types in the CoNLL corpus. Inspect the CoNLL corpus and try to observe any patterns in the POS tag sequences that make up this kind of chunk. Develop a simple chunker using the regular expression chunker chunk.Regexp. Discuss any tag sequences that are difficult to chunk reliably.

2. Automatic Analysis: Pick one of the three chunk types in the CoNLL corpus. Write functions to do the following tasks for your chosen type:

a) List all the tag sequences that occur with each instance of this chunk type.

b) Count the frequency of each tag sequence, and produce a ranked list in order of decreasing frequency; each line should consist of an integer (the frequency) and the tag sequence.

c) Inspect the high-frequency tag sequences. Use these as the basis for developing a better chunker.

3. Chinking: An early definition of chunk was the material that occurs between chinks. Develop a chunker which starts by putting the whole sentence in a single chunk, and then does the rest of its work solely by chinking. Determine which tags (or tag sequences) are most likely to make up chinks with the help of your own utility program. Compare the performance and simplicity of this approach relative to a chunker based entirely on chunk rules.

4. Complex Chunker: Develop a chunker for one of the chunk types in the CoNLL corpus using a regular-expression based chunk grammar RegexpChunk. Use any combination of rules for chunking, chinking, merging or splitting.

5. Inherent ambiguity: We saw in the tagging chapter that it is possible to establish an upper limit to tagging performance by looking for ambiguous n-grams, n-grams that are tagged in more than one possible way in the training data. Apply the same method to determine an upper limit on the performance of an n-gram chunker.

6. Baseline NP Chunker: The baseline chunker presented in the evaluation section tends to create larger chunks than it should. For example, the phrase [every/DT time/NN] [she/PRP] sees/VBZ [a/DT newspaper/NN] contains two consecutive chunks, and our baseline chunker will incorrectly combine the first two: [every/DT time/NN she/PRP]. Write a program that finds which of these chunk-internal tags typically occur at the start of a chunk, then devise a SplitRule that will split up these chunks. Combine this rule with the existing baseline chunker and re-evaluate it, to see if you have discovered an improved baseline.

7. Predicate structure: Develop an NP chunker which converts POS-tagged text into a list of tuples, where each tuple consists of a verb followed by a sequence of noun phrases and prepositions, e.g. the little cat sat on the mat becomes ('sat', 'on', 'NP')...


8. Penn Treebank NP Chunking Format: The Penn Treebank contains a section of tagged Wall Street Journal text which has been chunked into noun phrases. The format uses square brackets, and we have encountered it several times during this chapter. It can be accessed by importing the Treebank corpus reader (from nltk_lite.corpora import treebank), then iterating over its chunked items (for sent in treebank.chunked():). These items are flat trees, just as we got using conll2000.chunked().

a) Consult the documentation for the NLTK chunk package to find out how to generate Treebank and IOB strings from a tree. Write functions chunk2brackets() and chunk2iob() which take a single chunk tree as their sole argument, and return the required multi-line string representation.

b) Write command-line conversion utilities bracket2iob.py and iob2bracket.py that take a file in Treebank or CoNLL format (resp) and convert it to the other format. (Obtain some raw Treebank or CoNLL data from the NLTK Corpora, save it to a file, and then use open(filename).readlines() to access it from Python.)

5.5 N-Gram Chunking

Our approach to chunking has been to try to detect structure based on the part-of-speech tags. We have seen that the IOB format represents this extra structure using another kind of tag. The question arises, then, as to whether we could use the same n-gram tagging methods we saw in the last chapter, applied to a different vocabulary.

The first step is to get the word, tag, chunk triples from the CoNLL corpus and map these to tag, chunk pairs:

>>> from nltk_lite import tag
>>> chunk_data = [[(t, c) for w, t, c in chunk.tree2conlltags(chtree)]
...               for chtree in conll2000.chunked()]

5.5.1 A Unigram Chunker

Now we can train and score a unigram chunker on this data, just as if it was a tagger:

>>> unigram_chunker = tag.Unigram()
>>> unigram_chunker.train(chunk_data)
>>> print tag.accuracy(unigram_chunker, chunk_data)
0.781378851068

This chunker does reasonably well. Let's look at the errors it makes. Consider the opening phrase of the first sentence of the chunking data, here shown with part-of-speech tags:

Confidence/NN in/IN the/DT pound/NN is/VBZ widely/RB expected/VBN to/TO take/VB another/DT sharp/JJ dive/NN

We can try the unigram chunker out on this first sentence by creating some "tokens" using [t for t,c in chunk_data[0]], then running our chunker over them using list(unigram_chunker.tag(tokens)). The unigram chunker only looks at the tags, and tries to add chunk tags. Here is what it comes up with:


NN/I-NP IN/B-PP DT/B-NP NN/I-NP VBZ/B-VP RB/O VBN/I-VP TO/B-PP VB/I-VP DT/B-NP JJ/I-NP NN/I-NP

Notice that it tags the first noun Confidence/NN incorrectly as I-NP and not B-NP, because nouns usually do not occur at the start of noun phrases in the training data. It correctly tags the second pound/NN as I-NP (this noun occurs after a determiner). It incorrectly tags widely/RB as outside O, and it incorrectly tags the infinitival to/TO as B-PP, as if it were a preposition starting a prepositional phrase.

5.5.2 A Bigram Chunker (incomplete)

[Why might these problems go away if we look at the previous chunk tag?]

Let's run a bigram chunker:

>>> bigram_chunker = tag.Bigram(backoff=unigram_chunker)
>>> bigram_chunker.train(chunk_data)
>>> print tag.accuracy(bigram_chunker, chunk_data)
0.89312652614

We can run the bigram chunker over the same sentence as before using list(bigram_chunker.tag(tokens)). Here is what it comes up with:

NN/B-NP IN/B-PP DT/B-NP NN/I-NP VBZ/B-VP RB/I-VP VBN/I-VP TO/I-VP VB/I-VP DT/B-NP JJ/I-NP NN/I-NP

This is 100% correct.

5.5.3 Exercises

1. Bigram chunker: The bigram chunker scores about 90% accuracy. Study its errors and try to work out why it doesn't get 100% accuracy.

2. Trigram chunker: Experiment with trigram chunking. Are you able to improve the performance any more?

3. (Advanced) N-Gram Chunking Context: An n-gram chunker can use information other than the current part-of-speech tag and the n-1 previous chunk tags. Investigate other models of the context, such as the n-1 previous part-of-speech tags, or some combination of previous chunk tags along with previous and following part-of-speech tags.

4. (Advanced) Modularity: Consider the way an n-gram tagger uses recent tags to inform its tagging choice. Now observe how a chunker may re-use this sequence information. For example, both tasks will make use of the information that nouns tend to follow adjectives (in English). It would appear that the same information is being maintained in two places. Is this likely to become a problem as the size of the rule sets grows? If so, speculate about any ways that this problem might be addressed.


5.6 Cascaded Chunkers

So far, our chunk structures have been relatively flat. Trees consist of tagged tokens, optionally grouped under a chunk node such as NP. However, it is possible to build chunk structures of arbitrary depth, simply by creating a multi-stage chunk grammar.

So far, our chunk grammars have consisted of a single stage: a chunk type followed by one or more patterns. However, chunk grammars can have two or more such stages. These stages are processed in the order that they appear. The patterns in later stages can refer to a mixture of part-of-speech tags and chunk types. Here is an example, which has patterns for noun phrases, prepositional phrases, verb phrases, and sentences.

>>> grammar = """
... NP: {<DT|JJ|NN.*>+}      # Chunk sequences of DT, JJ, NN
... PP: {<IN><NP>}           # Chunk prepositions followed by NP
... VP: {<VB.*><NP|PP|S>+$}  # Chunk rightmost verbs and arguments/adjuncts
... S:  {<NP><VP>}           # Chunk NP, VP
... """

This is a four-stage chunk grammar, and can be used to create structures having a depth of at most four. The next step is to create the corresponding chunker in the usual way.

>>> cp = chunk.Regexp(grammar)
>>> input = chunk.tagstr2tree("""Mary/NN saw/VBD the/DT cat/NN
... sit/VB on/IN the/DT mat/NN""")
>>> print cp.parse(input)
(S:
  (NP: ('Mary', 'NN'))
  ('saw', 'VBD')
  (S:
    (NP: ('the', 'DT') ('cat', 'NN'))
    (VP:
      ('sit', 'VB')
      (PP: ('on', 'IN') (NP: ('the', 'DT') ('mat', 'NN'))))))

Unfortunately this result misses the VP headed by saw. It has other shortcomings too. Let's see what happens when we apply this chunker to a sentence having deeper nesting.

>>> input = chunk.tagstr2tree("""John/NNP thinks/VBZ Mary/NN saw/VBD
... the/DT cat/NN sit/VB on/IN the/DT mat/NN""")
>>> print cp.parse(input)
(S:
  (NP: ('John', 'NNP'))
  ('thinks', 'VBZ')
  (NP: ('Mary', 'NN'))
  ('saw', 'VBD')
  (S:
    (NP: ('the', 'DT') ('cat', 'NN'))
    (VP:
      ('sit', 'VB')
      (PP: ('on', 'IN') (NP: ('the', 'DT') ('mat', 'NN'))))))

The solution to these problems is to get the chunker to loop over its patterns: after trying all of them, it repeats the process. We add an optional second argument loop to specify the number of times the set of patterns should be run:


>>> cp = chunk.Regexp(grammar, loop=2)
>>> print cp.parse(input)
(S:
  (NP: ('John', 'NNP'))
  ('thinks', 'VBZ')
  (S:
    (NP: ('Mary', 'NN'))
    (VP:
      ('saw', 'VBD')
      (S:
        (NP: ('the', 'DT') ('cat', 'NN'))
        (VP:
          ('sit', 'VB')
          (PP: ('on', 'IN') (NP: ('the', 'DT') ('mat', 'NN'))))))))

This cascading process enables us to create deep structures. However, creating and debugging a cascade is quite difficult, and there comes a point where it is more effective to do full parsing (see Chapter 7).

5.7 Conclusion

In this chapter we have explored efficient and robust methods that can identify linguistic structures in text. Using only part-of-speech information for words in the local context, a "chunker" can successfully identify simple structures such as noun phrases and verb groups. We have seen how chunking methods extend the same lightweight methods that were successful in tagging. The resulting structured information is useful in information extraction tasks and in the description of the syntactic environments of words. The latter will be invaluable as we move to full parsing.

There are a surprising number of ways to chunk a sentence using regular expressions. The patterns can add, shift and remove chunks in many ways, and the patterns can be sequentially ordered in many ways. One can use a small number of very complex rules, or a long sequence of much simpler rules. One can hand-craft a collection of rules, and one can write programs to analyze a chunked corpus to help in the development of such rules. The process is painstaking, but generates very compact chunkers that perform well and that transparently encode linguistic knowledge.

It is also possible to chunk a sentence using the techniques of n-gram tagging. Instead of assigning part-of-speech tags to words, we assign IOB tags to the part-of-speech tags. Bigram tagging turned out to be particularly effective, as it could be sensitive to the chunk tag on the previous word. This statistical approach requires far less effort than rule-based chunking, but creates large models and delivers few linguistic insights.

Like tagging, chunking cannot be done perfectly. For example, as pointed out by Abney (1996), we cannot correctly analyze the structure of the sentence I turned off the spectroroute without knowing the meaning of spectroroute; is it a kind of road or a type of device? Without knowing this, we cannot tell whether off is part of a prepositional phrase indicating direction (tagged B-PP), or whether off is part of the verb-particle construction turn off (tagged I-VP).

A recurring theme of this chapter has been diagnosis. The simplest kind is manual, when we inspect the tracing output of a chunker and observe some undesirable behavior that we would like to fix. Sometimes we discover cases where we cannot hope to get the correct answer because the part-of-speech tags are too impoverished and do not give us sufficient information about the lexical item. A second approach is to write utility programs to analyze the training data, such as counting the


number of times a given part-of-speech tag occurs inside and outside an NP chunk. A third approach is to evaluate the system against some gold standard data to obtain an overall performance score. We can even use this to parameterize the system, specifying which chunk rules are used on a given run, and tabulating performance for different parameter combinations. Careful use of these diagnostic methods permits us to optimize the performance of our system. We will see this theme emerge again later in chapters dealing with other topics in natural language processing.

5.8 Further Reading

Abney, Steven (1996). Tagging and Partial Parsing. In: Ken Church, Steve Young, and Gerrit Bloothooft (eds.), Corpus-Based Methods in Language and Speech. Kluwer Academic Publishers, Dordrecht. http://www.vinartus.net/spa/95a.pdf

Abney's Cass system: http://www.vinartus.net/spa/97a.pdf

About this document...
This chapter is a draft from Introduction to Natural Language Processing, by Steven Bird, Ewan Klein and Edward Loper, Copyright 2006 the authors. It is distributed with the Natural Language Toolkit [http://nltk.sourceforge.net], Version 0.7b1, under the terms of the Creative Commons Attribution-ShareAlike License [http://creativecommons.org/licenses/by-sa/2.5/].
