
*Textbooks you need


Page 1: *Textbooks you need

• Manning, C. D., Schütze, H.: Foundations of Statistical Natural Language Processing. The MIT Press, 1999. ISBN 0-262-13360-1. [required]
• Allen, J.: Natural Language Understanding. Benjamin/Cummings Publishing Co., 1995. 2nd edition.
• Jurafsky, D. and Martin, J. H.: Speech and Language Processing. Prentice Hall, 2009. 2nd edition.

Page 2: Other reading

• Charniak, E.: Statistical Language Learning. The MIT Press, 1996. ISBN 0-262-53141-0.
• Cover, T. M., Thomas, J. A.: Elements of Information Theory. Wiley, 1991. ISBN 0-471-06259-6.
• Jelinek, F.: Statistical Methods for Speech Recognition. The MIT Press, 1998. ISBN 0-262-10066-5.
• Proceedings of major conferences:
  – ACL (Assoc. of Computational Linguistics)
  – NAACL HLT (North American Chapter of the ACL)
  – COLING (Intl. Committee on Computational Linguistics)
  – ACM SIGIR
  – Interspeech/ASRU/SLT

Page 3: Course segments

• Intro & Probability & Information Theory
  – the very basics: definitions, formulas, examples
• Language Modeling (a toy sketch follows below)
  – n-gram models, parameter estimation
  – smoothing (EM algorithm)
• A Bit of Linguistics
  – phonology, morphology, syntax, semantics, discourse
• Words and the Lexicon
  – word classes, mutual information, a bit of lexicography
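
As a preview of the Language Modeling segment, here is a minimal bigram model sketch with maximum-likelihood counts and add-one smoothing; the two-sentence "corpus" is invented for illustration, and the course covers better smoothing methods than add-one (e.g. via EM).

    from collections import Counter

    def train_bigram(corpus):
        # corpus: list of token lists; count unigrams and bigrams with sentence markers
        unigrams, bigrams = Counter(), Counter()
        for sent in corpus:
            toks = ["<s>"] + sent + ["</s>"]
            unigrams.update(toks)
            bigrams.update(zip(toks, toks[1:]))
        vocab_size = len(unigrams)
        def prob(prev, word):
            # add-one (Laplace) smoothing: (c(prev, word) + 1) / (c(prev) + V)
            return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
        return prob

    p = train_bigram([["she", "books", "flights"], ["many", "books"]])
    print(p("she", "books"))   # seen bigram: relatively high
    print(p("books", "she"))   # unseen bigram: small but nonzero thanks to smoothing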

Page 4: Course segments (cont.)

• Hidden Markov Models
  – background, algorithms, parameter estimation
• Tagging: Methods, Algorithms, Evaluation
  – tagsets, morphology, lemmatization
  – HMM tagging, transformation-based, feature-based
• NL Grammars and Parsing: Data, Algorithms
  – grammars and automata, deterministic parsing
  – statistical parsing: algorithms, parameterization, evaluation
• Applications (MT, ASR, IR, Q&A, ...)

Page 5: Goals of HLT

Computers would be a lot more useful if they could handle our email, do our library research, talk to us ...

But they are fazed by natural human language.

How can we give computers the ability to handle human language? (Or help them learn it as kids do?)

Page 6: A few applications of HLT

• Spelling correction, grammar checking ... (language learning and evaluation, e.g. the TOEFL essay score)
• Better search engines
• Information extraction, gisting
• Psychotherapy; Harlequin romances; etc.
• New interfaces:
  – speech recognition (and text-to-speech)
  – dialogue systems (USS Enterprise onboard computer)
  – machine translation; speech translation (the Babel tower??)
• Trans-lingual summarization, detection, extraction ...

Page 7: Question Answering: IBM's Watson

Dan Jurafsky

• Won Jeopardy! on February 16, 2011

Clue: WILLIAM WILKINSON'S "AN ACCOUNT OF THE PRINCIPALITIES OF WALLACHIA AND MOLDAVIA" INSPIRED THIS AUTHOR'S MOST FAMOUS NOVEL

Answer: Bram Stoker

Page 8: Information Extraction

Dan Jurafsky

Subject: curriculum meeting
Date: January 15, 2012
To: Dan Jurafsky

Hi Dan, we've now scheduled the curriculum meeting. It will be in Gates 159 tomorrow from 10:00-11:30.
-Chris

Create new Calendar entry:

Event: Curriculum mtg
Date: Jan-16-2012
Start: 10:00am
End: 11:30am
Where: Gates 159
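
As an illustration of the extraction step (a toy sketch, not how any particular system works), the room and time span can be pulled from the message body with patterns; the regexes and field names below are assumptions for this example only.

    import re

    msg = ("Hi Dan, we've now scheduled the curriculum meeting. "
           "It will be in Gates 159 tomorrow from 10:00-11:30.")

    # toy patterns: a room like "Gates 159" and a time span like "10:00-11:30"
    room = re.search(r"\b([A-Z][a-z]+ \d+)\b", msg)
    span = re.search(r"(\d{1,2}:\d{2})-(\d{1,2}:\d{2})", msg)

    entry = {
        "Event": "Curriculum mtg",          # would come from the subject line
        "Where": room.group(1) if room else None,
        "Start": span.group(1) if span else None,
        "End":   span.group(2) if span else None,
    }
    print(entry)   # {'Event': 'Curriculum mtg', 'Where': 'Gates 159', ...}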

Page 9: Information Extraction & Sentiment Analysis

Dan Jurafsky

• "nice and compact to carry!"
• "since the camera is small and light, I won't need to carry around those heavy, bulky professional cameras either!"
• "the camera feels flimsy, is plastic and very light in weight; you have to be very delicate in the handling of this camera"

Attributes: zoom, affordability, size and weight, flash, ease of use. All three reviews above address one attribute: size and weight.

Page 10: Machine Translation

Dan Jurafsky

• Fully automatic
• Helping human translators

Enter Source Text:
这 不过 是 一 个 时间 的 问题 .

Translation from Stanford's Phrasal:
This is only a matter of time.

Page 11: Language Technology

Dan Jurafsky

mostly solved:
• Spam detection: "Let's go to Agra!" vs. "Buy V1AGRA ..."
• Part-of-speech (POS) tagging: "Colorless green ideas sleep furiously." -> ADJ ADJ NOUN VERB ADV
• Named entity recognition (NER): "Einstein [PERSON] met with UN [ORG] officials in Princeton [LOC]"

making good progress:
• Sentiment analysis: "Best roast chicken in San Francisco!" / "The waiter ignored us for 20 minutes."
• Coreference resolution: "Carter told Mubarak he shouldn't run again."
• Word sense disambiguation (WSD): "I need new batteries for my mouse."
• Parsing: "I can see Alcatraz from the window!"
• Machine translation (MT): "第 13届上海国际电影节开幕…" -> "The 13th Shanghai International Film Festival…"
• Information extraction (IE): "You're invited to our dinner party, Friday May 27 at 8:30" -> calendar entry: Party, May 27, add

still really hard:
• Question answering (QA): "Q. How effective is ibuprofen in reducing fever in patients with acute febrile illness?"
• Paraphrase: "XYZ acquired ABC yesterday" = "ABC has been taken over by XYZ"
• Summarization: "The Dow Jones is up", "Housing prices rose", "The S&P500 jumped" -> "Economy is good"
• Dialog: "Where is Citizen Kane playing in SF?" -> "Castro Theatre at 7:30. Do you want a ticket?"

Page 12: Ambiguity makes NLP hard: "Crash blossoms"

Dan Jurafsky

Violinist Linked to JAL Crash Blossoms
Teacher Strikes Idle Kids
Red Tape Holds Up New Bridges
Hospitals Are Sued by 7 Foot Doctors
Juvenile Court to Try Shooting Defendant
Local High School Dropouts Cut in Half

100% REAL

Page 13: Ambiguity is pervasive

Dan Jurafsky

Fed raises interest rates
(New York Times headline, 17 May 2000)

The intended parse is [Fed] [raises] [interest rates], but the same string supports others, e.g. with "raises" as a noun: [Fed raises] [interest] [rates]. Extending the headline to "Fed raises interest rates 0.5%" only multiplies the possible readings.

Page 14: Why else is natural language understanding difficult?

Dan Jurafsky

• non-standard English: "Great job @justinbieber! Were SOO PROUD of what youve accomplished! U taught us 2 #neversaynever & you yourself should never give up either♥"
• segmentation issues: the New York-New Haven Railroad
• idioms: dark horse, get cold feet, lose face, throw in the towel
• neologisms: unfriend, retweet, bromance
• tricky entity names: "Where is A Bug's Life playing ...", "Let It Be was recorded ...", "... a mutation on the for gene ..."
• world knowledge: "Mary and Sue are sisters." vs. "Mary and Sue are mothers."

But that's what makes it fun!

Page 15: Making progress on this problem ...

Dan Jurafsky

• The task is difficult! What tools do we need?
  – knowledge about language
  – knowledge about the world
  – a way to combine knowledge sources
• How we generally do this: probabilistic models built from language data
  – P("maison" -> "house") is high
  – P("L'avocat général" -> "the general avocado") is low
• Luckily, rough text features can often do half the job.
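
A minimal sketch of where such numbers can come from: relative-frequency estimates over word-aligned pairs. The five "aligned" pairs below are invented for illustration; real systems estimate these counts from large parallel corpora.

    from collections import Counter

    # toy word-aligned (source, target) pairs; invented data
    aligned = [("maison", "house"), ("maison", "house"), ("maison", "home"),
               ("avocat", "lawyer"), ("avocat", "avocado")]

    pair_counts = Counter(aligned)
    src_counts = Counter(src for src, _ in aligned)

    def p(target, source):
        # relative frequency: count(source, target) / count(source)
        return pair_counts[(source, target)] / src_counts[source]

    print(p("house", "maison"))    # 2/3 on this toy data
    print(p("avocado", "avocat"))  # 1/2 on this toy data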

Page 16: This class

Dan Jurafsky

• Teaches key theory and methods for statistical NLP:
  – Viterbi
  – Naïve Bayes, maxent classifiers
  – n-gram language modeling
  – statistical parsing
  – inverted index, tf-idf, vector models of meaning (a tf-idf sketch follows below)
• For practical, robust, real-world applications:
  – information extraction
  – spelling correction
  – information retrieval
  – sentiment analysis
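
As a taste of the retrieval topics, a minimal tf-idf sketch under one common weighting (raw term frequency times log inverse document frequency); the three tiny "documents" are invented for illustration, and real systems add normalization and smoothing.

    import math
    from collections import Counter

    docs = [["roast", "chicken", "recipe"],
            ["chicken", "soup"],
            ["parsing", "algorithms"]]

    df = Counter()                   # document frequency of each term
    for d in docs:
        df.update(set(d))

    def tfidf(term, doc):
        # raw term frequency * log inverse document frequency
        tf = doc.count(term)
        idf = math.log(len(docs) / df[term])
        return tf * idf

    print(tfidf("chicken", docs[0]))  # common term -> lower idf, lower weight
    print(tfidf("parsing", docs[2]))  # rare term -> higher idf, higher weight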

Page 17: Levels of Language

• Phonetics/phonology/morphology: what words (or subwords) are we dealing with?
• Syntax: what phrases are we dealing with? Which words modify one another?
• Semantics: what's the literal meaning?
• Pragmatics: what should you conclude from the fact that I said something? How should you react?

Page 18: What's hard – ambiguities, ambiguities, all different levels of ambiguities

John stopped at the donut store on his way home from work. He thought a coffee was good every few hours. But it turned out to be too expensive there. [from J. Eisner]

• donut: to get a donut (doughnut; spare tire) for his car?
• donut store: a store where donuts shop? or one run by donuts? or that looks like a big donut? or made of donuts?
• from work: well, actually, he stopped there from hunger and exhaustion, not just from work
• every few hours: that's how often he thought it? or that's how often the coffee was good?
• it: the particular coffee that was good every few hours? the donut store? the situation?
• too expensive: too expensive for what? what are we supposed to conclude about what John did?

Page 19: NLP: The Main Issues

• Why is NLP difficult?
  – many "words", many "phenomena" --> many "rules"
    • OED: 400k words; Finnish lexicon (of forms): ~2·10^7
    • sentences, clauses, phrases, constituents, coordination, negation, imperatives/questions, inflections, parts of speech, pronunciation, topic/focus, and much more!
  – irregularity (exceptions, exceptions to the exceptions, ...)
    • potato -> potatoes (tomato, hero, ...); photo -> photos; and even both: mango -> mangos or mangoes (a toy sketch of the exceptions problem follows below)
    • adjective/noun order: new book, electrical engineering, general regulations, flower garden, garden flower, ...; but: Governor General
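
To make the "exceptions to the exceptions" point concrete, here is a purely illustrative pluralizer sketch: a default rule plus exception lists that immediately start growing. The word lists are toy assumptions, not a real morphology module.

    # default rule: add -s; some -o nouns take -es; some accept both --
    # the exception lists only grow as more of the lexicon is covered
    TAKES_ES = {"potato", "tomato", "hero"}
    BOTH_OK = {"mango"}                     # mangos / mangoes

    def pluralize(noun):
        if noun in BOTH_OK:
            return [noun + "s", noun + "es"]
        if noun in TAKES_ES:
            return [noun + "es"]
        return [noun + "s"]                 # photo -> photos, book -> books

    for w in ["potato", "photo", "mango", "book"]:
        print(w, "->", pluralize(w))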

Page 20: Difficulties in NLP (cont.)

• ambiguity
  – books: NOUN or VERB?
    • "you need many books" vs. "she books her flights online"
  – "No left turn weekdays 4-6 pm / except transit vehicles" (Charles Street at Cold Spring)
    • when may transit vehicles turn: always? never?
  – "Thank you for not smoking, drinking, eating or playing radios without earphones." (MTA bus)
    • Thank you for not eating without earphones??
    • or even: Thank you for not drinking without earphones!?
  – "My neighbor's hat was taken by wind. He tried to catch it."
    • ... catch the wind or ... catch the hat?

Page 21: (Categorical) Rules or Statistics?

• Preferences:
  – clear cases: context clues: "she books" -> "books" is a verb
    • rule: if an ambiguous (verb/non-verb) word is preceded by a matching personal pronoun -> the word is a verb (a toy implementation follows below)
  – less clear cases: pronoun reference
    • she/he/it refers to the most recent noun or pronoun (?) (but maybe we can specify exceptions)
  – selectional: catching hat >> catching wind (but why not?)
  – semantic: never thank for drinking in a bus! (but what about the earphones?)
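
A minimal sketch of that first categorical rule, mostly to show how quickly such rules hit their limits; the pronoun and ambiguous-word lists are toy assumptions.

    PRONOUNS = {"i", "you", "he", "she", "we", "they", "it"}
    AMBIGUOUS = {"books", "flies", "duck", "runs"}   # verb-or-noun words (toy list)

    def tag_verbs(tokens):
        # categorical rule: an ambiguous word right after a personal pronoun is a verb
        tags = []
        for i, tok in enumerate(tokens):
            if tok.lower() in AMBIGUOUS and i > 0 and tokens[i-1].lower() in PRONOUNS:
                tags.append((tok, "VERB"))
            else:
                tags.append((tok, "?"))
        return tags

    print(tag_verbs("she books her flights online".split()))
    # [('she', '?'), ('books', 'VERB'), ...] -- but "you need many books" stays unresolved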

Page 22: Solutions

• Don't guess if you know:
  – morphology (inflections)
  – lexicons (lists of words)
  – unambiguous names
  – perhaps some (really) fixed phrases
  – syntactic rules?
• Use statistics (based on real-world data) for preferences (only?)
  – no doubt; but this is the big question!

Page 23: Statistical NLP

• Imagine:
  – Each sentence W = (w1, w2, ..., wn) gets a probability P(W|X) in a context X (think of it in the intuitive sense for now).
  – For every possible context X, sort all the imaginable sentences W according to P(W|X).
  – Ideal situation: the best sentence is the most probable one in context X. (NB: the same holds for interpretation.)

[Figure: sentences sorted by P(W); the best sentence at the high end, "ungrammatical" sentences trailing off at the low end.]
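
In code, the "ideal situation" is just an argmax over candidate sentences. A minimal sketch, with an invented lookup table standing in for a real model of P(W|X):

    def score(sentence, context=None):
        # stand-in for P(W|X); a real model would be estimated from data
        toy_model = {"this is only a matter of time.": 0.7,
                     "this is only a question of time.": 0.2,
                     "time matter of a only this is.": 1e-9}
        return toy_model.get(sentence, 1e-12)

    candidates = ["this is only a matter of time.",
                  "this is only a question of time.",
                  "time matter of a only this is."]

    best = max(candidates, key=score)
    print(best)   # "this is only a matter of time."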

Page 24: Real World Situation

• Unable to specify the set of grammatical sentences today using fixed "categorical" rules (maybe never).
• Use a statistical "model" based on REAL WORLD DATA and care about the best sentence only (disregarding the "grammaticality" issue).

[Figure: P(W) plotted over sentences ordered from Wbest to Wworst; the best sentence sits at the peak.]