1 SIMS 290-2: Applied Natural Language Processing Marti Hearst August 30, 2004

1

SIMS 290-2: Applied Natural Language Processing

Marti Hearst
August 30, 2004

 

2

Today

Motivation: SIMS student projects
Course Goals
Why NLP is difficult
How to solve it? Corpus-based statistical approaches
What we’ll do in this course

3

ANLP Motivation: SIMS Masters Projects

Breaking Story (2002)
  Summarize trends in news feeds
  Needs categories and entities assigned to all news articles

http://dream.sims.berkeley.edu/newshound/

BriefBank (2002)
  System for entering legal briefs
  Needs a topic category system for browsing

http://briefbank.samuelsonclinic.org/

Chronkite (2003)
  Personalized RSS feeds
  Needs categories and entities assigned to all web pages

Paparrazi (2004)
  Analysis of blog activity
  Needs categories assigned to blog content

4

5

6

7

8

9

10

Goals of this Course

Learn about the problems and possibilities of natural language analysis:
  What are the major issues?
  What are the major solutions?
    – How well do they work?
    – How do they work? (But to a lesser extent than CS 295-4)

At the end you should:
  Agree that language is subtle and interesting!
  Feel some ownership over the algorithms
  Be able to assess NLP problems
    – Know which solutions to apply when, and how
  Be able to read papers in the field

11

Today

Motivation: SIMS student projects
Course Goals
Why NLP is difficult
How to solve it? Corpus-based statistical approaches
What we’ll do in this course

12

We’ve passed the year 2001, but we are not close to realizing the dream (or nightmare …)

Dave Bowman: “Open the pod bay doors, HAL”

HAL 9000: “I’m sorry Dave. I’m afraid I can’t do that.”

14

Why is NLP difficult?

Computers are not brains
  There is evidence that much of language understanding is built into the human brain

Computers do not socialize
  Much of language is about communicating with people

Key problems:
  Representation of meaning
  Language presupposes knowledge about the world
  Language only reflects the surface of meaning
  Language presupposes communication between people

15

Adapted from Robert Berwick's 6.863J

Hidden Structure

English plural pronunciation
  Toy + s → toyz ; add z
  Book + s → books ; add s
  Church + s → churchiz ; add iz
  Box + s → boxiz ; add iz
  Sheep + s → sheep ; add nothing

What about new words?
  Bach + 's → boxs ; why not boxiz?
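
The plural rule above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: it keys on spelling, whereas the real rule depends on the final sound of the word, and the ending lists and the `plural_pronunciation` name are made up for this sketch.

```python
# Spelling-based approximation of the English plural pronunciation rule.
# Assumption: the ending lists below are illustrative, not exhaustive,
# and spelling is only a rough stand-in for the word's final sound.
SIBILANT_ENDINGS = ("ch", "sh", "s", "x", "z")   # take "iz"
VOICELESS_ENDINGS = ("p", "t", "k", "f")         # take "s"
MEMORIZED_PLURALS = {"sheep": "sheep"}           # exceptions: add nothing


def plural_pronunciation(word):
    """Return a rough spelling of how the plural of `word` is pronounced."""
    word = word.lower()
    if word in MEMORIZED_PLURALS:
        return MEMORIZED_PLURALS[word]
    if word.endswith(SIBILANT_ENDINGS):
        return word + "iz"
    if word.endswith(VOICELESS_ENDINGS):
        return word + "s"
    return word + "z"


for w in ["toy", "book", "church", "box", "sheep", "Bach"]:
    print(w, "->", plural_pronunciation(w))
```

For "Bach" the sketch answers "bachiz", because it sees the letters "ch". Speakers add a plain "s" instead, which is the point of the slide: the rule is stated over hidden sound structure, not over letters.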

16

Language subtleties

Adjective order and placement
  A big black dog
  A big black scary dog
  A big scary dog
  A scary big dog
  A black big dog

Antonyms
  Which sizes go together?
    – Big and little
    – Big and small
    – Large and small
    – Large and little

17

Adapted from Robert Berwick's 6.863J

World Knowledge is subtle

He arrived at the lecture.
He chuckled at the lecture.

He arrived drunk.
He chuckled drunk.

He chuckled his way through the lecture.
He arrived his way through the lecture.

18

Adapted from Robert Berwick's 6.863J

Words are ambiguous (have multiple meanings)

I know that.

I know that block.

I know that blocks the sun.

I know that block blocks the sun.

19

Adapted from Robert Berwick's 6.863J

Headline Ambiguity

Iraqi Head Seeks Arms
Juvenile Court to Try Shooting Defendant
Teacher Strikes Idle Kids
Kids Make Nutritious Snacks
British Left Waffles on Falkland Islands
Red Tape Holds Up New Bridges
Bush Wins on Budget, but More Lies Ahead
Hospitals are Sued by 7 Foot Doctors

20

The Role of Memorization

Children learn words quickly
  As many as 9 words/day
  Often only need one exposure to associate meaning with word
    – Can make mistakes, e.g., overgeneralization: “I goed to the store.”
  Exactly how they do this is still under study

21

The Role of Memorization

Dogs can do word association too!
Rico, a border collie in Germany
  Knows the names of each of 100 toys
  Can retrieve items called out to him with over 90% accuracy
  Can also learn and remember the names of unfamiliar toys after just one encounter, putting him on a par with a three-year-old child

http://www.nature.com/news/2004/040607/pf/040607-8_pf.html

22

Adapted from Robert Berwick's 6.863J

But there is too much to memorize!

establish
establishment
  the church of England as the official state church.
disestablishment
antidisestablishment
antidisestablishmentarian
antidisestablishmentarianism
  is a political philosophy that is opposed to the separation of church and state.

23

Rules and Memorization

Current thinking in psycholinguistics is that we use a combination of rules and memorization
  However, this is very controversial

Mechanism:
  If there is an applicable rule, apply it
  However, if there is a memorized version, that takes precedence. (Important for irregular words.)
    – Artists paint “still lifes”, not “still lives”
    – Past tense of think → thought, blink → blinked

This is a simplification; for more on this, see Pinker’s “Words and Rules” and “The Language Instinct”.
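
As a rough sketch of the mechanism on this slide (memorized forms take precedence, otherwise the rule applies), here is a toy past-tense function in Python. The `MEMORIZED_PAST` table and the `past_tense` name are assumptions made for this illustration, and the "rule" covers only the simplest spelling cases.

```python
# Toy "rules plus memorization" model for the English past tense.
# Assumption: the memorized table is a tiny hand-picked sample.
MEMORIZED_PAST = {"think": "thought", "go": "went"}


def past_tense(verb, memorized=MEMORIZED_PAST):
    """Return the past tense: a memorized form wins over the general rule."""
    if verb in memorized:
        return memorized[verb]
    # General rule (simplified): add "d" after a final "e", else add "ed".
    if verb.endswith("e"):
        return verb + "d"
    return verb + "ed"


print(past_tense("think"))             # memorized form is used
print(past_tense("blink"))             # no memorized form, so the rule applies
print(past_tense("go", memorized={}))  # nothing memorized yet
```

The last call, with an empty memorized table, produces "goed": the rule applied without the exception, which is the kind of overgeneralization seen in “I goed to the store.”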

24

Representation of Meaning

I know that block blocks the sun.
  How do we represent the meanings of “block”?
  How do we represent “I know”?
  How does that differ from “I know that.”?
  Who is “I”?
  How do we indicate that we are talking about earth’s sun vs. some other planet’s sun?
  When did this take place? What if I move the block? What if I move my viewpoint? How do we represent this?

25

How to tackle these problems?

The field was stuck for quite some time.
A new approach started around 1990
  Well, not really new, but the first time around, in the ’50s, they didn’t have the text, disk space, or GHz

Main idea: combine memorizing and rules
How to do it:
  Get large text collections (corpora)
  Compute statistics over the words in those collections

Surprisingly effective
  Even better now with the Web
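
A minimal sketch of "compute statistics over the words in those collections", using only the Python standard library. The tokenizer is a deliberately crude assumption (lowercase runs of letters and apostrophes), and the `tokenize` and `ngram_counts` names are made up for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Crude tokenizer (an assumption): lowercase letter/apostrophe runs."""
    return re.findall(r"[a-z']+", text.lower())


def ngram_counts(texts, n=1):
    """Count n-grams (as tuples of words) within each text of a collection."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        # zip over n shifted copies of the token list yields the n-grams
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


corpus = [
    "I know that block blocks the sun.",
    "I know that.",
]
print(ngram_counts(corpus, n=1).most_common(3))
print(ngram_counts(corpus, n=2)[("know", "that")])
```

Counts like these are the raw material for the corpus-based methods: with a large enough collection, relative frequencies start to stand in for hand-written rules.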

26

Corpus-based Example: Pre-Nominal Adjective Ordering

Important for translation and generation
Examples:
  big fat Greek wedding
  fat Greek big wedding

Some approaches try to characterize this as semantic rules, e.g.:
  Age < color, value < dimension

Data-intensive approaches
  Assume adjective ordering is independent of the noun they modify
  Compare how often you see <a, b> vs <b, a>

Keller & Lapata, “The Web as Baseline”, HLT-NAACL’04
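
The counting idea on this slide can be sketched as follows. This is a toy illustration on a hand-made list of phrases, not the setup of the cited paper: the phrase list, the adjective set, and both function names are assumptions for this example.

```python
from collections import Counter


def count_adjective_pairs(phrases, adjectives):
    """Count ordered pairs of adjacent adjectives, ignoring the noun."""
    pair_counts = Counter()
    for phrase in phrases:
        tokens = phrase.lower().split()
        for first, second in zip(tokens, tokens[1:]):
            if first in adjectives and second in adjectives:
                pair_counts[(first, second)] += 1
    return pair_counts


def preferred_order(a, b, pair_counts):
    """Return the more frequent ordering of a and b, or None if undecided."""
    ab = pair_counts[(a, b)]
    ba = pair_counts[(b, a)]
    if ab == ba:
        return None  # unseen pair, or a tie: the counts cannot decide
    return (a, b) if ab > ba else (b, a)


phrases = [
    "a big black dog",
    "the big black cat",
    "a black big box",
    "my big fat greek wedding",
]
adjectives = {"big", "black", "fat", "greek"}
counts = count_adjective_pairs(phrases, adjectives)
print(preferred_order("black", "big", counts))
print(preferred_order("greek", "big", counts))
```

The second query returns None because "big" and "greek" never occur next to each other in the toy data, which is the unseen-pair problem that more data, or a smarter model, has to address.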

27

Corpus-based Example: Pre-Nominal Adjective Ordering

Data-intensive approaches
  Compare how often you see <a, b> vs <b, a>
  What happens when you encounter an unseen pair?
    – Shaw and Hatzivassiloglou ’99 use transitive closures
    – Malouf ’00 uses a back-off bigram model
        P(<a,b>|{a,b}) vs. P(<b,a>|{a,b})
        He also uses morphological analysis, semantic similarity calculations and positional probabilities

Keller and Lapata ’04 use just the very simple algorithm
  – But they use the web as their training set
  – Gets 90% accuracy on 1000 sequences
  – As good as or better than the complex algorithms

Keller & Lapata, “The Web as Baseline”, HLT-NAACL’04
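
One way to picture the transitive-closure idea for unseen pairs: if "big" has been seen before "fat", and "fat" before "Greek", infer that "big" goes before "Greek". The sketch below is only loosely inspired by that idea and is not a reimplementation of any of the cited systems; the function names and the observed pairs are assumptions for illustration.

```python
def precedes(a, b, observed):
    """True if a chain of observed orderings leads from a to b."""
    graph = {}
    for first, second in observed:
        graph.setdefault(first, set()).add(second)
    stack, seen = [a], {a}
    while stack:  # depth-first search over the precedence graph
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt == b:
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def infer_order(a, b, observed):
    """Order a and b by transitivity; None if there is no evidence or a conflict."""
    forward = precedes(a, b, observed)
    backward = precedes(b, a, observed)
    if forward == backward:
        return None
    return (a, b) if forward else (b, a)


observed = {("big", "fat"), ("fat", "greek")}
print(infer_order("greek", "big", observed))
print(infer_order("big", "scary", observed))
```

Here the first query is resolved through the chain big, fat, greek even though the pair itself was never observed, while the second stays undecided because "scary" has no observed orderings at all.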

28

Adapted from Robert Berwick's 6.863J

Real-World Applications of NLP

Spelling Suggestions/Corrections
Grammar Checking
Synonym Generation
Information Extraction
Text Categorization
Automated Customer Service
Speech Recognition (limited)
Machine Translation

In the (near?) future:
  Question Answering
  Improving Web Search Engine results
  Automated Metadata Assignment
  Online Dialogs

29

NLP in the Real World

Synonym generation for
  Suggesting advertising keywords
  Suggesting search result refinement and expansion

30

Synonym Generation

31

Synonym Generation

32

Synonym Generation

33

Synonym Generation

34

What We’ll Do in this Course

Read research papers and tutorials
Use NLTK (Natural Language ToolKit) to try out various algorithms
  Some homeworks will be to do some NLTK exercises

Three mini-projects
  Two involve a selected collection
  The third is your choice, can also be on the selected collection

35

What We’ll Do in this Course

Adopt a large text collection
Use a wide range of NLP techniques to process it
Release the results for others to use

36

Which Text Collection?

37

How to analyze a big collection?

Your ideas go here

38

Python

A terrific language
  Interpreted
  Object-oriented
  Easy to interface to other things (web, DBMS, Tk)
  Good stuff from: Java, Lisp, Tcl, Perl
  Easy to learn
    – I learned it this summer by reading Learning Python
  FUN!

39

Questions?