LING001 Language and Computers 4-13-2009

I’m sorry Dave, I’m afraid I can’t do that

• Computational linguistics

• Use computational tools to understand how humans (and human languages) work

• Use computational tools to make computers “understand” humans (and human languages)

The Turing Test

• Language has always been viewed as the essence of being human

• Alan Turing (1912-1954), a pioneer in computer science, proposed that a machine would be considered “intelligent” if it could fool a human into believing it to be another human via teletyping (or IRC, IM, ...)

• This test has led to many philosophical controversies, but one thing is clear: no machine has ever passed the Test

• ELIZA: A computer program created by Joseph Weizenbaum in the 1960s that played a psychotherapist and was disturbingly effective at getting people to confide in it

• See demo

Part I: Language as Computation

• Modern linguistics was/is computer science

• representations and rules were conceived as components in a mechanical device that generates linguistic expressions--an insight that goes back many centuries

• rules such as S → NP VP became part of the foundation of computer science, where the same kind of rewrite rule is used to define programming languages

• logic representations of meanings (the SEMANTICS lectures) were used to represent what programming languages express

• Exp → Exp OP Exp

• OP → +, -, *, ÷

• Exp → 0, 1, 2, ...
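The three rewrite rules above can be run as a small "mechanical device" that generates expressions. This is only a sketch: restricting numbers to single digits, bounding the recursion depth, and the names `GRAMMAR`, `generate`, and `is_expression` are assumptions made for illustration, not part of the lecture.

```python
import random

# Toy grammar from the slide. Limiting numbers to single digits and bounding
# the recursion depth are assumptions made so that generation terminates.
GRAMMAR = {
    "Exp": [["Exp", "OP", "Exp"], ["NUM"]],
    "OP": [["+"], ["-"], ["*"], ["÷"]],
    "NUM": [[str(d)] for d in range(10)],
}

def generate(symbol="Exp", depth=0, max_depth=3, rng=random):
    """Expand `symbol` top-down into a list of terminal tokens."""
    if symbol not in GRAMMAR:              # terminal: emit as-is
        return [symbol]
    options = GRAMMAR[symbol]
    if symbol == "Exp" and depth >= max_depth:
        options = [["NUM"]]                # force the non-recursive rule
    rule = rng.choice(options)
    tokens = []
    for part in rule:
        tokens.extend(generate(part, depth + 1, max_depth, rng))
    return tokens

def is_expression(tokens):
    """Check that tokens alternate number, operator, number, ..."""
    if len(tokens) % 2 == 0:
        return False
    operators = {"+", "-", "*", "÷"}
    for i, tok in enumerate(tokens):
        if i % 2 == 0 and not tok.isdigit():
            return False
        if i % 2 == 1 and tok not in operators:
            return False
    return True
```

Calling `" ".join(generate())` prints a random expression; every string the grammar generates has the alternating shape that `is_expression` checks for.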

Case Study: Language processing

• Linguistic Problems

• How do we analyze sentences on the fly?

• Why are some sentences more difficult to process than others?

• Computational Problems

• computers have a processor and a memory

• the processor carries out an algorithm (a precise series of steps) that, in effect, draws a syntax tree much like what you do in your homework

• the “tree” is shipped to semantics where the logic-based type of calculation takes over to derive meanings

Simple Grammar

How do we parse in real time?

What comes down must go up

Make the leftmost expansion, which is going to be the leftmost word, to see if the predicted category is actually found in the input sentence
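The predict-then-confirm procedure can be sketched as a top-down recognizer with backtracking. The grammar shown on the "Simple Grammar" slide is not reproduced in this text, so the rules and word lists below are a hypothetical stand-in, and the function names are illustrative only.

```python
# A hypothetical toy grammar (the slide's own grammar is not available here);
# every rule and word list below is an assumption.
GRAMMAR = {
    "S": [["NP", "VP"], ["Aux", "NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"], ["V"]],
}
LEXICON = {
    "Aux": {"does"},
    "Det": {"this", "a", "the"},
    "N": {"flight", "meal"},
    "V": {"include"},
}

def parse(symbol, words, pos):
    """Yield (tree, next_pos) for every way `symbol` can start at words[pos]."""
    if symbol in LEXICON:
        # The prediction has reached a word: confirm it against the input.
        if pos < len(words) and words[pos] in LEXICON[symbol]:
            yield (symbol, words[pos]), pos + 1
        return
    for rule in GRAMMAR[symbol]:           # try each expansion in order
        for children, nxt in expand(rule, words, pos, []):
            yield (symbol, *children), nxt

def expand(rule, words, pos, children):
    """Match the symbols of `rule` left to right, backtracking on failure."""
    if not rule:
        yield children, pos
        return
    for subtree, nxt in parse(rule[0], words, pos):
        yield from expand(rule[1:], words, nxt, children + [subtree])

def recognize(sentence):
    """Return the first complete parse tree, or None if there is none."""
    words = sentence.lower().rstrip("?").split()
    for tree, end in parse("S", words, 0):
        if end == len(words):
            return tree
    return None
```

For "Does this flight include a meal?" the first prediction (S → NP VP) fails at "does", so the recognizer backs up and tries S → Aux NP VP. Note that this naive strategy would loop forever on a left-recursive rule such as NP → NP S, which is why the toy grammar avoids one.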

Does this flight include a meal?

Does this flight include a meal? (Cont)

Does this flight include a meal? (Cont)

Breakdown

• Recall the structure of “center embedding”: NP -> NP S (relative clause)

• The cheese the rat the cat the dog worried chased ate lay in the house

• Red indicates a predicted rule, and blue indicates a confirmed rule

• S-> NP VP -> Det N VP -> the N VP -> the cheese VP -> NP VP

• we are now at “The cheese”

• VP would next predict V, which is contradicted by “the” in “the rat”

• we need to trace back the NP expansion (the green box) to try out other rules in the grammar for NP (namely, “NP->NP S”)

• Note that the red VP is still held in memory because it has been predicted but not yet confirmed: more and more such predictions stack up, causing a memory-load problem and hence parsing difficulty
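The memory-load argument can be made concrete with a rough count of how many predicted VPs are waiting for a verb at any one time. This is a deliberately crude heuristic rather than a parser: the hand-built `CATEGORY` table and the rule for deciding which nouns count as subjects are assumptions made for this sketch.

```python
# Hypothetical mini-lexicon; categories are hand-assigned for this example.
CATEGORY = {
    "the": "Det", "that": "Rel", "in": "P",
    "cheese": "N", "rat": "N", "cat": "N", "dog": "N", "house": "N",
    "worried": "V", "chased": "V", "ate": "V", "lay": "V",
}

def max_pending_predictions(sentence):
    """Return the largest number of subjects still waiting for their verb."""
    pending = 0        # predicted VPs not yet confirmed by a verb
    deepest = 0
    previous = None    # category of the last non-determiner word
    for word in sentence.lower().split():
        cat = CATEGORY[word]
        if cat == "Det":
            continue
        # A noun that does not follow a verb or preposition is read as a
        # subject, so a VP is predicted for it; "that" stands in for one too.
        if cat == "Rel" or (cat == "N" and previous not in ("V", "P")):
            pending += 1
        elif cat == "V":
            pending -= 1               # a verb confirms one prediction
        deepest = max(deepest, pending)
        previous = cat
    return deepest

CENTER_EMBEDDED = (
    "The cheese the rat the cat the dog worried chased ate lay in the house"
)
RIGHT_BRANCHING = (
    "The dog worried the cat that chased the rat that ate the cheese"
)
```

Under these assumptions the center-embedded sentence holds four unconfirmed predictions at "the dog", while the right-branching paraphrase never holds more than one.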

Case II: Language learning

• Recall the strategies in word segmentation (lecture on 3-12)

• stress information: English words predominantly have initial stress

• statistical information: the transitional probability between syllables tends to be lower at word boundaries

• infants have been shown to be sensitive to both, but how (and whether) do the cues work together when they conflict?

• in experimental psychology, most researchers focus on one cue at a time, because a sound experiment must eliminate confounds from other cues in order to establish the empirical effectiveness of the cue of interest
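The statistical cue is easy to model computationally. The sketch below estimates transitional probabilities between adjacent syllables and posits a word boundary wherever the probability dips. The three toy words, their ordering, and the 0.75 threshold are invented for illustration, and the stress cue is not modeled at all.

```python
from collections import Counter

def transitional_probabilities(syllables):
    """Estimate P(next syllable | current syllable) from adjacent pairs."""
    pairs = list(zip(syllables, syllables[1:]))
    pair_counts = Counter(pairs)
    first_counts = Counter(first for first, _ in pairs)
    return {(a, b): n / first_counts[a] for (a, b), n in pair_counts.items()}

def segment(syllables, tp, threshold=0.75):
    """Insert a word boundary wherever the transitional probability dips."""
    words, current = [], [syllables[0]]
    for a, b in zip(syllables, syllables[1:]):
        if tp[(a, b)] < threshold:
            words.append("".join(current))
            current = []
        current.append(b)
    words.append("".join(current))
    return words

# Made-up "speech stream": three toy words repeated in a varied order.
LEXICON = [["pre", "tty"], ["ba", "by"], ["do", "ggy"]]
ORDER = [0, 1, 2, 0, 2, 1, 0, 0, 1, 1, 2, 2, 0]
STREAM = [syl for i in ORDER for syl in LEXICON[i]]
```

In this stream "pre" is always followed by "tty", so that transition has probability 1.0, whereas "tty" is followed by several different syllables, so transitions across word boundaries score much lower.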

Language Learning

• In many cases where it is difficult to carry out experiments, or the resulting experiment would be too complicated (for young children), computational modeling is often the best--perhaps the only--way of testing hypotheses about language learning

• computer models must be faithful to the findings in child language acquisition: e.g., they cannot presuppose unrealistic computing power on the part of the learner

• computer models must produce behavior consistent with what children actually do in language learning

• they also allow the integration of quantitative factors (e.g., frequency) with the more abstract representations and rules we have been discussing in this class

From Learning to Change

• Language is transmitted through language learning: this process is strikingly similar to genetic transmission and evolution

• Language learning can be modeled as an adaptive process to the variant forms of grammar in the environment: some competing variants are from Universal Grammar, while others are contingent on social and cultural factors (e.g., “soda” vs. “pop”)

• There is a healthy amount of work now that tries to use the mathematical models of biological evolution to develop mathematical models of language change

Part II: Computing Language

• Another branch of computational linguistics develops applications for engineering purposes

• In most cases, the engineering techniques are inspired by how language works as revealed by linguistics

• Manual construction of the system is usually labor-intensive, so people have been looking for automatic ways of “learning” the linguistic system

• Speech synthesis

• Ambiguity in language

• Machine translation

• Spam filtering

Speech Synthesis

• Virtually all systems were modeled after the human vocal tract: mechanical before, electronic now

Wolfgang von Kempelen (late 1700s)

Modern Speech Synthesis Systems

• Many follow the pipeline of linguistic representations that we have been talking about

• Note that running speech involves more than word synthesis: prosody is very important as well

• this often requires the system to parse the sentence into tree-like structures

• see demo

http://www.research.att.com/~ttsweb/tts/demo.php

Ambiguity

• Ambiguity is pervasive in language

• word level: “two” vs. “too” vs. “to”

• part of speech: “bark” is a verb as well as a noun

• word senses: “bank” as an institution vs. “bank” as a physical object (e.g., a river bank)

• syntax: “I shot an elephant in my pajamas”

• semantics: “everyone likes someone”

• Humans can make rapid decisions on ambiguity by tapping into both linguistic and non-linguistic knowledge and thus ignoring the majority of ambiguities

• How does a computer do that?

Part of Speech Disambiguation

• Observation: some sequences of categories appear more often than others

• “the” is very likely to be followed by a noun or adjective, and almost never by a verb

• Idea: look at a lot of pairs of words in the preanalyzed data (e.g., “book that flight” = “V Det N”), and try to discover the regularities

• we can then generalize these regularities to novel texts

• here the problem is solved by having a “teacher”

• The US military has paid a lot of money to produce tons of preanalyzed linguistic data, hoping that useful regularities can be extracted out of it.

The dog saw the icecream

• (The, Dog): Det-N, Det-V

• (Dog, Saw): N-N, N-V, V-N, V-V

• (Saw, The): N-Det, V-Det

• (The, Icecream): Det-N

• Proceed from left to right, and pick out more likely POS pairings (with some technical tricks that we omit)

• In practice, this tags English parts of speech correctly more than 95% of the time

• but simply assigning each word its most likely tag is already correct about 91% of the time

• reason: English has fairly rigid word order such that this kind of technique works reasonably well
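The left-to-right search over tag pairs can be sketched as follows. The candidate tags come from the slide, but the tag-pair probabilities are invented numbers standing in for counts from preanalyzed text, and the search is a Viterbi-style simplification that ignores how likely each word is under each tag; the lecture's omitted "technical tricks" are not reconstructed here.

```python
# Possible tags per word, as listed on the slide.
CANDIDATES = {
    "the": ["Det"],
    "dog": ["N", "V"],
    "saw": ["N", "V"],
    "icecream": ["N"],
}

# Hypothetical tag-pair probabilities P(next tag | tag); invented numbers
# standing in for counts from preanalyzed (hand-tagged) text.
TRANSITION = {
    ("<s>", "Det"): 0.6,
    ("Det", "N"): 0.9, ("Det", "V"): 0.01,
    ("N", "N"): 0.2, ("N", "V"): 0.5, ("N", "Det"): 0.1,
    ("V", "N"): 0.2, ("V", "V"): 0.05, ("V", "Det"): 0.5,
}

def tag(words):
    """Left-to-right search for the most probable tag sequence."""
    # best[t] = (probability, tag sequence) of the best path ending in tag t
    best = {"<s>": (1.0, [])}
    for word in words:
        new_best = {}
        for t in CANDIDATES[word.lower()]:
            new_best[t] = max(
                (p * TRANSITION.get((prev, t), 0.0), path + [t])
                for prev, (p, path) in best.items()
            )
        best = new_best
    return max(best.values())[1]
```

With these made-up numbers, `tag("The dog saw the icecream".split())` picks Det-N-V-Det-N, because pairings such as Det-V and V-V are heavily penalized.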

Machine Translation

• Machine translation systems in practice differ with respect to how “deep” their linguistic analysis goes

• Not surprisingly, the deeper systems work better but are more expensive to construct

• One of the many challenges is idioms: in a well-known (and possibly apocryphal) story, “the spirit is willing but the flesh is weak”, translated into Russian and back, came out as “the vodka is strong but the meat is rotten”

Babelfish: Needs oxygen

• Try out websites such as babelfish.altavista.com

• Translate a passage from English to German to French and then back to English

• See demo

Spam Filtering

• 1978: apparently the first email spam (advertising for a product demo within a computer company)

• 2007: 90 billion spam messages per day (>85% of all emails), at huge cost to servers as well as users

• this is a multibillion dollar industry

• One of the more successful filtering systems was developed out of computational linguistics

• we will briefly review how it works, and why it fails.

Conditional Probability

• P(A|B) is the probability of A happening given that B has happened

• A=a person living in Philadelphia

• B=a person going to UPenn

• Both P(A) and P(B) are very small

• but P(A|B) is much larger
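A small numerical sketch of the Philadelphia/UPenn example, using the definition P(A|B) = P(A and B) / P(B). All of the counts below are invented round numbers chosen only to make the contrast visible; they are not real statistics.

```python
def conditional(p_a_and_b, p_b):
    """P(A|B) = P(A and B) / P(B)."""
    return p_a_and_b / p_b

# Invented round numbers for an imaginary population; not real statistics.
population = 1_000_000
in_philadelphia = 5_000        # A: lives in Philadelphia
at_upenn = 100                 # B: goes to UPenn
both = 80                      # A and B

p_a = in_philadelphia / population
p_b = at_upenn / population
p_a_given_b = conditional(both / population, p_b)
```

Here `p_a` and `p_b` are both well under 1%, while `p_a_given_b` comes out at about 0.8: knowing B changes the probability of A dramatically.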

Conditional Probability in Spam

• P(spam|document containing the word “Nigeria”): not very high

• P(spam|document containing the word “investment”): not very high

• P(spam|document containing the word “Nigeria” AND “investment”): very high

• Note that there is nothing inherently spammy about “Nigeria” or “investment”: it is a fact of the world, and the spam filter must be tuned (or “trained”) to it

• Spam filter must adapt to the world

How Does a Computer Do It?

• Collect email messages and use human judges to classify them into spam and non-spam files

• every time you “report spam” in Gmail, you are contributing to Google’s business

• The system then generates profiles of spam and non-spam messages

• in practice, this is just based on occurrences of words

• e.g., suppose “Nigeria” has a probability of 1 in 200,000 in non-spam but 1 in 1,000 in spam; then “Nigeria” will be treated as a spam flag
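The "spam flag" idea can be sketched by comparing the two probabilities from the slide and, via Bayes' rule, turning them into P(spam | word). Reading the slide's figures as P(word | spam) and P(word | non-spam) is an interpretation, and the 50/50 prior is an assumption made for this sketch; neither function is taken from a real filter.

```python
def likelihood_ratio(p_word_given_spam, p_word_given_ham):
    """How many times more likely the word is in spam than in non-spam."""
    return p_word_given_spam / p_word_given_ham

def posterior_spam(p_word_given_spam, p_word_given_ham, prior_spam=0.5):
    """P(spam | word) by Bayes' rule; the default prior is an assumption."""
    spam = p_word_given_spam * prior_spam
    ham = p_word_given_ham * (1.0 - prior_spam)
    return spam / (spam + ham)

# The figures from the slide for the word "Nigeria".
P_NIGERIA_SPAM = 1 / 1_000
P_NIGERIA_HAM = 1 / 200_000
```

With the slide's figures, "Nigeria" is about 200 times more likely in spam than in non-spam, which is what makes it a useful flag under this simple model.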

Difficulties

• Note that spam is often signified by co-occurrences of words (Nigeria investment)

• The technique from the previous slide requires comparisons of two probabilities: the normal and the spam

• for single words (say, nouns, of which there are about 20,000), it is possible to gather enough data to get a sense of their probabilities

• but for word pairs, triples, etc., very quickly we run out of data

• there are 20,000² = 400 million combinations: many word pairs will have zero occurrences in the data

• and this is not even talking about how words are structured in the message: spam filters treat messages as if they were bags of words
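The data-sparsity arithmetic and the bag-of-words assumption can both be checked in a few lines. The vocabulary size is the slide's figure; the helper names are illustrative, and any corpus size passed to `max_coverage` is hypothetical.

```python
from collections import Counter

VOCABULARY_SIZE = 20_000                 # figure used on the slide

def possible_ngrams(n, vocabulary=VOCABULARY_SIZE):
    """Number of distinct word sequences of length n."""
    return vocabulary ** n

def max_coverage(corpus_tokens, n, vocabulary=VOCABULARY_SIZE):
    """Upper bound on the fraction of n-grams a corpus could ever contain."""
    observed_at_most = max(corpus_tokens - n + 1, 0)
    return min(1.0, observed_at_most / possible_ngrams(n, vocabulary))

def bag_of_words(message):
    """Word counts only: all information about word order is discarded."""
    return Counter(message.lower().split())
```

Even a hypothetical corpus of ten million words could contain at most about 2.5% of the 400 million possible pairs, and `bag_of_words` returns the same counts for "Nigeria investment offer" and "offer investment Nigeria".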

Summary

• Computers can be effectively used to model and study human linguistic behavior

• The infinity and ambiguity inherent in human language pose a significant challenge to engineers