
Page 1

COMP 791A: Statistical Language Processing

Introduction Chap. 1

Page 2

Course information

Prof: Leila Kosseim
Office: LB 903-7
Email: [email protected]
Office hours: TBA

Page 3

Goal of NLP

Develop techniques and tools to build practical and robust systems that can communicate with users in one or more natural languages.

              Natural Lang.                 Artificial Lang.
Lexicon       >100,000 words                ~100 words
Syntax        complex                       simple
Semantics     1 word --> several meanings   1 word --> 1 meaning

Page 4

References

Foundations of Statistical Natural Language Processing, Chris Manning and Hinrich Schütze, MIT Press, 1999.

Speech and Language Processing, Daniel Jurafsky and James H. Martin, Prentice Hall, 2000.

Current literature available on the Web. See the course Web page:

www.cs.concordia.ca/~kosseim/Teaching/COMP791-W04/

Page 5

Other References

Proceedings of major conferences:
  ACL: Association for Computational Linguistics
  EACL: European Chapter of the ACL
  ANLP: Applied Natural Language Processing
  COLING: International Conference on Computational Linguistics
  TREC: Text REtrieval Conference

Page 6

Who studies languages?

Linguist
  What constrains the possible meanings of a sentence?
  Uses: mathematical models (e.g., formal grammars)

Psycholinguist
  How do people produce a discourse from an idea?
  Uses: experimental observations with human subjects

Philosopher
  What is meaning, anyway? How do words identify objects in the world?
  Uses: argumentation, examples and counter-examples

Computational Linguist (NLP)
  How can we identify the structure of sentences automatically?
  Uses: data structures, algorithms, AI techniques (search, knowledge representation, machine learning, …)

Page 7

Why study NLP? It is necessary for many useful applications:

  information retrieval, information extraction, filtering, spelling and grammar checking, automatic text summarization, understanding and generation of natural language, machine translation, …

Page 8

Who needs NLP?

Too many texts to manipulate:
  on the Internet, e-mails, various corporate documentation
Too many languages:
  ~39,000 languages and dialects

Page 9

Languages on the Internet

(chart) Source: Global Reach (www.glreach.com)

Page 10

(chart, continued) Source: Global Reach (www.glreach.com)

Page 11

Applications of NLP

Text-based: processing of written texts (e.g., newspaper articles, e-mails, Web pages, …)
  Text understanding/analysis (NLU): IR, IE, MT, …
  Text generation (NLG)

Dialog-based systems (human-machine communication)
  e.g., QA, tutoring systems, …

Page 12

Brief history of NLP

1940s - 1950s: Foundational Insights
  Automata, finite-state machines & formal languages (Turing, Chomsky, Backus & Naur)
  Probability and information theory (Shannon)
  Noisy channel and decoding (Shannon)

1960s - 1970s: Two Camps
  Symbolic: linguists & computer scientists
    Transformational grammars (Chomsky, Harris)
    Artificial Intelligence (Minsky, McCarthy)
    Theorem proving, heuristics, General Problem Solver (Newell & Simon)
  Stochastic: statisticians & electrical engineers
    Bayesian reasoning for character recognition
    Authorship attribution
    Corpus work

Page 13

Brief history of NLP (cont'd)

1970s - 1980s: Four Paradigms
  Stochastic approaches
  Logic-based / rule-based approaches
  Scripts and plans for NL understanding of "toy worlds"
  Discourse modeling (discourse structures & coreference resolution)

Late 1980s - 1990s: Rise of probabilistic models
  Data-driven probabilistic approaches (more robust)
  Engineering practical solutions using automatic learning
  Strict evaluation of work

Page 14

Why study NLP Statistically?

Until about 10 years ago, NLP was mainly investigated using a rule-based approach.

But:
  rules are often too strict to characterize people's use of language (people tend to stretch and bend rules in order to meet their communicative needs)
  rules need (expert) people to develop them (the knowledge acquisition bottleneck)

Statistical methods are more flexible & more robust.

Page 15

Tools and Resources Needed

Probability/statistical theory: statistical distributions, Bayesian decision theory.

Linguistic knowledge: morphology, syntax, semantics, pragmatics, …

Corpora: bodies of marked or unmarked text to which statistical methods and current linguistic knowledge can be applied in order to discover novel linguistic theories or interesting and useful knowledge to build applications.

Page 16

The Alphabet Soup

NLP: Natural Language Processing
CL: Computational Linguistics
NLE: Natural Language Engineering
HLT: Human Language Technology
IE: Information Extraction
IR: Information Retrieval
MT: Machine Translation
QA: Question Answering
POS: Part of Speech
NLG: Natural Language Generation
NLU: Natural Language Understanding

Page 17

Why is NLP difficult? Because natural language is highly ambiguous.

Syntactic ambiguity:
  "I made her duck." has 2 parses (i.e., syntactic analyses):
    (S (NP I) (VP (V made) (NP (PRO her) (N duck))))
    (S (NP I) (VP (V made) (NP (PRO her)) (VP (V duck))))
  "The president spoke to the nation about the problem of drug use in the schools from one coast to the other." has 720 parses. Ex:
    "to the other" can attach to any of the previous NPs (e.g., "the problem") or to the head verb: 6 places
    "from one coast" has 5 places to attach
    …

Page 18

Why is NLP difficult? (cont'd)

Word category ambiguity:
  book --> verb? or noun?
Word sense ambiguity:
  bank --> financial institution? building? or river side?
Words can mean more than the sum of their parts:
  make up a story
Fictitious worlds:
  People on Mars can fly.
Defining scope:
  People like ice-cream. Does this mean that all (or only some?) people like ice cream?
Language is changing and evolving:
  I'll email you my answer.
  This new S.U.V. has a compartment for your mobile phone.

Page 19

Methods that do not work well

Hand-coded rules:
  produce a knowledge acquisition bottleneck
  perform poorly on naturally occurring text

Ex: hand-coded syntactic constraints and preference rules
Ex: selectional restrictions:
  animate being --> swallow --> physical object
  but: "I swallowed his story / line." "The supernova swallowed the planet."

Page 20

What Statistical NLP can do

It seeks to solve the knowledge acquisition bottleneck:
  by automatically learning preferences from corpora (e.g., lexical or syntactic preferences).

It offers a solution to the problem of ambiguity and "real" data, because statistical models:
  are robust
  generalize well
  behave gracefully in the presence of errors and new data.

Page 21

Some standard corpora

Brown corpus
  ~1 million words
  tagged corpus (POS)
  balanced (a representative sample of American English of the 1960s-70s, across different genres)
Lancaster-Oslo-Bergen (LOB) corpus
  British replication of the Brown corpus
Susanne corpus
  free subset of the Brown corpus (130,000 words)
  syntactic structure
Penn Treebank
  syntactic structure
  articles from the Wall Street Journal
Canadian Hansard
  bilingual corpus of parallel texts

Page 22

What to do with text corpora? Count words

Count words to find:
  What are the most common words in the text?
  How many words are in the text? (word tokens vs. word types)
  What is the average frequency of each word in the text?
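
These counts are easy to compute. A minimal sketch in Python, assuming a crude regex tokenizer (the slides do not prescribe one):

```python
from collections import Counter
import re

def word_counts(text):
    # crude tokenizer: lowercased words, keeping internal apostrophes
    tokens = re.findall(r"[a-z']+", text.lower())
    freq = Counter(tokens)              # word type -> frequency
    return len(tokens), len(freq), freq

text = "I have a can opener; but I can't open these cans."
n_tokens, n_types, freq = word_counts(text)
print(n_tokens, n_types)                # nb of tokens and of types
print(freq.most_common(3))              # most common words
print(n_tokens / n_types)               # average frequency per type
```

With this tokenizer the example sentence yields 11 tokens and 10 types, matching the counts on the next slide.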

Page 23

What's a word anyway?

"I have a can opener; but I can't open these cans." -- how many words?

Word form: an inflected form as it appears in the text
  can and cans ... different word forms
Lemma: a set of lexical forms having the same stem, the same POS and the same meaning
  can and cans ... same lemma
Word token: an occurrence of a word
  "I have a can opener; but I can't open these cans." -- 11 word tokens (not counting punctuation)
Word type: a distinct realization of a word
  "I have a can opener; but I can't open these cans." -- 10 word types (not counting punctuation)
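
To illustrate word form vs. lemma, one could use an off-the-shelf lemmatizer; a sketch with NLTK's WordNetLemmatizer (the slides do not prescribe any particular tool, and the WordNet data must be downloaded first):

```python
import nltk
nltk.download("wordnet", quiet=True)   # one-time download of the WordNet data
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# different word forms, same lemma:
print(lemmatizer.lemmatize("cans", pos="n"))   # -> can
print(lemmatizer.lemmatize("can", pos="n"))    # -> can
```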

Page 24

An example

Mark Twain's Tom Sawyer:
  71,370 word tokens
  8,018 word types
  token/type ratio = 8.9 (an indication of text complexity)

Complete works of Shakespeare:
  884,647 word tokens
  29,066 word types
  token/type ratio = 30.4

Page 25

Common words in Tom Sawyer

(table: the most common words and their frequencies)

But words in NL have an uneven distribution…

Page 26

Frequency of frequencies

Most words are rare:
  3,993 (50%) of the word types appear only once
  they are called hapax legomena ("read only once")

But common words are very common:
  the 100 most frequent words account for 51% of all tokens (of all text)
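
A sketch of how such a frequency-of-frequencies table can be computed, reusing a token list as in the earlier counting sketch:

```python
from collections import Counter

def freq_of_freqs(tokens):
    # word type -> frequency, then frequency -> nb of word types
    word_freq = Counter(tokens)
    return Counter(word_freq.values())

tokens = "the cat sat on the mat with the other cat".split()
fof = freq_of_freqs(tokens)
print(fof[1], "hapax legomena")   # nb of word types occurring exactly once
```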

Page 27

Word counts are interesting...

As an indication of a text's style
As an indication of a text's author

But, because most words appear very infrequently, it is hard to predict much about the behavior of words (if they do not occur often in a corpus)

--> Zipf's Law

Page 28

Zipf's Law

1. Count the frequency of each word type in a large corpus.
2. List the word types in order of their frequency.

Let:
  f = frequency of a word type
  r = its rank in the list

Zipf's Law says: f ∝ 1/r
In other words, there exists a constant k such that: f × r = k
Ex: the 50th most common word should occur with 3 times the frequency of the 150th most common word, since f(50)/f(150) = 150/50 = 3.
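
A quick way to check the law on a corpus is to tabulate f × r at a few ranks; a Python sketch (the file name is only an example):

```python
import re
from collections import Counter

def zipf_check(tokens, ranks=(10, 50, 100, 500, 1000)):
    # frequencies of the word types, from most to least frequent
    freqs = sorted(Counter(tokens).values(), reverse=True)
    for r in ranks:
        if r <= len(freqs):
            f = freqs[r - 1]
            print(f"rank {r:5d}   freq {f:6d}   f*r = {f * r}")

# e.g., on any large text:
# zipf_check(re.findall(r"[a-z']+", open("tom_sawyer.txt").read().lower()))
```

Under Zipf's Law the f × r column should stay roughly constant (k ≈ 8,000-9,000 for Tom Sawyer, as the next slide shows).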

Page 29

Zipf's Law on Tom Sawyer

k ≈ 8,000-9,000, except for:
  the 3 most frequent words
  words of frequency ≈ 100

Page 30

Plot of Zipf's Law
On chap. 1-3 of Tom Sawyer (numbers differ from p. 25 & 26)

(figure: freq vs. rank, with the Zipf prediction f × r = k overlaid; x-axis: rank, 0-2000; y-axis: freq, 0-350)

Page 31

Plot of Zipf's Law (cont'd)
On chap. 1-3 of Tom Sawyer: f × r = k ==> log(f × r) = log(k) ==> log(f) + log(r) = log(k)

(figure: log(freq), 0-6, vs. log(rank), 0-8; the points fall near a straight line of slope -1, as the law predicts)
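
A sketch of how such a log-log plot can be produced, assuming matplotlib is available and `tokens` is the token list of the text under study:

```python
import math
from collections import Counter
import matplotlib.pyplot as plt

def plot_zipf(tokens):
    freqs = sorted(Counter(tokens).values(), reverse=True)
    ranks = range(1, len(freqs) + 1)
    # under Zipf's law these points fall near a line of slope -1
    plt.plot([math.log(r) for r in ranks],
             [math.log(f) for f in freqs], ".")
    plt.xlabel("log(rank)")
    plt.ylabel("log(freq)")
    plt.show()
```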

Page 32

Zipf's Law, so what?

There are:
  a few very common words
  a medium number of medium-frequency words
  a large number of infrequent words

Principle of Least Effort: a tradeoff between the speaker's and the hearer's effort
  the speaker communicates with a small vocabulary of common words (less effort)
  the hearer disambiguates messages through a large vocabulary of rare words (less effort)

Significance of Zipf's Law for us:
  for most words, our data about their use will be very sparse
  only for a few words will we have lots of examples

Page 33

Another Zipf law on language

The number of meanings of a word is correlated with its frequency: the more frequent a word, the more senses it can have.

Ex:
  words at rank 2,000 have 4.6 meanings
  words at rank 5,000 have 3 meanings
  words at rank 10,000 have 2.1 meanings

Ex: verb senses in WordNet:
  serve has 13 senses, but most verbs have only 1 sense

  m ∝ √f, or equivalently m ∝ 1/√r
  where f = frequency of the word, m = its number of senses, r = its rank
  (check: 2.1 × √(10,000/5,000) ≈ 3, matching the figures above)

Page 34

Yet another Zipf law on language

Content words tend to "clump" together:
  if we take a text and count the distance between identical words (tokens),
  then the frequency F of intervals of size s between identical tokens is inversely proportional to s:

  F ∝ 1/s^p
  where F = frequency of intervals of size s, s = size of the interval, and p varies between 1 and 1.3

  i.e., we have a large number of small intervals
  and a small number of large intervals
  --> most content words occur near each other
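
A sketch of how these intervals can be counted (the gap is the distance between successive occurrences of the same token):

```python
from collections import Counter

def interval_counts(tokens):
    last_pos = {}        # token -> position of its last occurrence
    gaps = Counter()     # interval size s -> nb of intervals of that size
    for i, tok in enumerate(tokens):
        if tok in last_pos:
            gaps[i - last_pos[tok]] += 1
        last_pos[tok] = i
    return gaps

tokens = "the cat sat on the mat near the cat".split()
print(interval_counts(tokens))   # gaps of 4 and 3 for 'the', 7 for 'cat'
```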

Page 35

What to do with text corpora? Find collocations

Collocation: a phrase where the whole expression is perceived as having an existence beyond the sum of its parts
  disk drive, make up, bacon and eggs, …

Important for machine translation:
  strong tea --> thé fort
  strong argument --> ? argument fort (convaincant)

Collocations can be extracted from a text (see the sketch below):
  find the most common bigrams
  however, since these bigrams are often insignificant (e.g., "at the", "of a"), they can be filtered
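
A sketch of raw vs. filtered bigram extraction; the stop-word list is an illustrative assumption, not taken from the slides:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "at", "in", "to", "and", "but"}

def bigrams(tokens, filtered=False):
    pairs = zip(tokens, tokens[1:])
    if filtered:
        # keep only bigrams made of two content words
        pairs = (p for p in pairs if not set(p) & STOP_WORDS)
    return Counter(pairs)

tokens = "the disk drive of the new disk drive at the lab".split()
print(bigrams(tokens).most_common(3))                 # raw bigrams
print(bigrams(tokens, filtered=True).most_common(3))  # filtered bigrams
```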

Page 36

Collocations

(table: most frequent raw bigrams vs. filtered bigrams)

Page 37

What to do with text corpora? Concordances

Find the different contexts in which a word occurs, with a Key Word In Context (KWIC) concordancing program.
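
A minimal KWIC concordancer sketch (the window size is an assumption):

```python
def kwic(tokens, keyword, window=4):
    # print every occurrence of `keyword` with `window` words of context
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            print(f"{left:>30} | {keyword} | {right}")

tokens = "I have a can opener but I can not open these cans".split()
kwic(tokens, "can")
```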

Page 38

Concordances

Useful for:
  finding the syntactic frames of verbs (transitive? intransitive?)
  building dictionaries for learners of foreign languages
  guiding statistical parsers