Introduction & Tokenization Ling570 Shallow Processing Techniques for NLP September 28, 2011


Introduction & Tokenization

Ling570

Shallow Processing Techniques for NLP

September 28, 2011

Roadmap

Course Overview

Tokenization

Homework #1


Course Overview

Course Information

Course web page: http://courses.washington.edu/ling570
  Syllabus: schedule and readings
  Links to other readings, slides, and class recordings
  Slides posted before class, but may be revised

Catalyst tools:
  GoPost discussion board for class issues
  CollectIt Dropbox for homework submission and TA comments
  Gradebook for viewing all grades

GoPost Discussion Board

Main venue for course-related questions, discussion
What not to post:
  Personal, confidential questions; homework solutions
What to post:
  Almost anything else course-related
  Can someone explain…? Is this really supposed to take this long to run?
Key location for class participation
  Post questions or answers
  Your discussion space: Sanghoun & I will not jump in often

GoPost

Emily’s 5-minute rule:
  If you’ve been stuck on a problem for more than 5 minutes, post to the GoPost!
Mechanics:
  Please use your UW NetID as your user id
  Please post early and often!
  Don’t wait until the last minute
Notifications:
  Decide how you want to receive GoPost postings

Email

Should be used only for personal or confidential issues
  Grading issues, extended absences, other problems
General questions/comments go on GoPost
Please send email from your UW account
Include Ling570 in the subject
If you don’t receive a reply in 24 hours, please follow up

Homework Submission

All homework should be submitted through CollectIt
  tar cvf hw1.tar hw1_dir
Homework due 11:45 Wednesdays
Late homework receives 10%/day penalty (incremental)
Most major programming languages accepted
  C/C++/C#, Java, Python, Perl, Ruby
  If you want to use something else, please check first
Please follow naming, organization guidelines in HW
Expect to spend 10-20 hours/week, including HW docs

Grading

Assignments: 90%

Class participation: 10%

No midterm or final exams

Grades in Catalyst Gradebook

TA feedback returned through CollectIt

Incomplete: only if all work is completed up to the last two weeks (UW policy)

Recordings

All classes will be recorded
  Links to recordings appear in syllabus
  Available to all students, DL and in class
Please remind me to:
  Record the meeting (look for the red dot)
  Repeat in-class questions
Note: Instructor’s screen is projected in class
  Assume that chat window is always public

Contact Info

Gina: Email: [email protected]
  Office hour:
    Fridays: 10-11 (before Treehouse meeting)
    Location: Padelford B-201
    Or by arrangement
  Available by Skype or Adobe Connect
  All DL students should arrange a short online meeting
TA: Sanghoun Song: Email: [email protected]
  Office hour:
    Time: TBD, see GoPost
    Location:

Online Option

Please check you are registered for correct section
  CLMS online: Section A
  State-funded: Section B
  CLMS in-class: Section C
  NLT/SCE online (or in-class): Section D
Online attendance for in-class students
  Not more than 3 times per term (e.g. missed bus, ice)
Please enter meeting room 5-10 minutes before start of class
Try to stay online throughout class

Online Tip

If you see:
  You are not logged into Connect. The problem is one of the following:
    the permissions on the resource you are trying to access are incorrectly set. Please contact your instructor/Meeting Host/etc.
    you do not have a Connect account but need to have one. For UWEO students: If you have just created your UW NetID or just enrolled in a course…..
Clear your cache, close and restart your browser


Course Description

Course Prerequisites

Programming Languages:
  Java/C++/Python/Perl/..
Operating Systems: Basic Unix/Linux
CS 326 (Data structures) or equivalent
  Lists, trees, queues, stacks, hash tables, …
  Sorting, searching, dynamic programming, …
Automata, regular expressions, …
Stat 391 (Probability and statistics): random variables, conditional probability, Bayes’ rule, …

Textbook

Jurafsky and Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2nd edition, 2008
  Available from UW Bookstore, Amazon, etc.

Reference: Manning and Schütze, Foundations of Statistical Natural Language Processing

Topics in Ling570

Unit #1: Formal Languages and Automata (2-3 weeks)
  Formal languages
  Finite-state Automata
  Finite-state Transducers
  Morphological analysis
Unit #2: Ngram Language Models and HMMs
  Ngram Language Models and Smoothing
  Part-of-speech (POS) tagging:
    HMM
    Ngram

Topics in Ling570

Unit #3: Classification (2-3 weeks)
  Intro to classification
  POS tagging with classifiers
  Chunking
  Named Entity (NE) recognition
Other topics (2 weeks)
  Intro, tokenization
  Clustering
  Information Extraction
  Summary

Roadmap

Motivation:
  Applications
  Language and Thought
Knowledge of Language
Cross-cutting themes
  Ambiguity, Evaluation, & Multi-linguality

Course Overview

Motivation: Applications

Applications of Speech and Language Processing
  Call routing
  Information retrieval
  Question-answering
  Machine translation
  Dialog systems
  Spam tagging
  Spell-, grammar-checking
  Sentiment analysis
  Information extraction
  ….


Shallow vs Deep Processing

Shallow processing (Ling 570)
  Usually relies on surface forms (e.g., words)
  Less elaborate linguistic representations
  E.g. Part-of-speech tagging; Morphology; Chunking

Deep processing (Ling 571)
  Relies on more elaborate linguistic representations
  Deep syntactic analysis (Parsing)
  Rich spoken language understanding (NLU)

Shallow or Deep?

Applications of Speech and Language Processing
  Call routing
  Information retrieval
  Question-answering
  Machine translation
  Dialog systems
  Spam tagging
  Spell-, grammar-checking
  Sentiment analysis
  Information extraction
  ….

Language & Intelligence

Turing Test (1950): operationalize intelligence
  Two contestants: human, computer
  Judge: human
  Test: Interact via text questions
  Question: Can you tell which contestant is human?

Crucially requires language use and understanding


Limitations of Turing Test

ELIZA (Weizenbaum 1966)
  Simulates Rogerian therapist
    User: You are like my father in some ways
    ELIZA: WHAT RESEMBLANCE DO YOU SEE
    User: You are not very aggressive
    ELIZA: WHAT MAKES YOU THINK I AM NOT AGGRESSIVE...
  Passes the Turing Test!! (sort of)
    “You can fool some of the people....”
  Simple pattern matching technique
  Very shallow processing
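To make "simple pattern matching" concrete, here is a minimal sketch of an ELIZA-style responder. The two rules and the fallback reply are hypothetical illustrations, not Weizenbaum's original script, and the second response only approximates the dialogue above.

```python
import re

# Hypothetical rule table: (pattern, response template).
# Rules are tried in order, so the more specific pattern comes first.
RULES = [
    (re.compile(r"you are like (.*)", re.IGNORECASE),
     "WHAT RESEMBLANCE DO YOU SEE"),
    (re.compile(r"you are (.*)", re.IGNORECASE),
     "WHAT MAKES YOU THINK I AM {0}"),
]

# Hypothetical default reply when no rule matches.
FALLBACK = "PLEASE GO ON"


def respond(utterance):
    """Return the response of the first rule whose pattern matches."""
    text = utterance.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            # Echo the captured text back in upper case.
            return template.format(match.group(1).upper())
    return FALLBACK


print(respond("You are like my father in some ways"))
print(respond("You are not very aggressive"))
```

Nothing here models meaning: the program only copies a matched substring into a canned template, which is why it counts as very shallow processing.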


Turing Test Revived

“On the web, no one knows you’re a….”
Problem: ‘bots’
  Automated agents swamp services
  Challenge: Prove you’re human
Test: Something human can do, ‘bot can’t
Solution: CAPTCHAs
  Distorted images: trivial for human; hard for ‘bot
Key: Perception, not reasoning

Knowledge of Language

What does HAL (of 2001, A Space Odyssey) need to know to converse?

Dave: Open the pod bay doors, HAL.

HAL: I'm sorry, Dave. I'm afraid I can't do that.

Phonetics & Phonology (Ling 450/550)
  Sounds of a language, acoustics
  Legal sound sequences in words

Morphology (Ling 570)
  Recognize, produce variation in word forms
  Singular vs. plural: Door + sg -> door; Door + plural -> doors
  Verb inflection: Be + 1st person, sg, present -> am

Part-of-speech tagging (Ling 570)
  Identify word use in sentence
  Bay (Noun) --- Not verb, adjective

Syntax (Ling 566: analysis; Ling 570: chunking; Ling 571: parsing)
  Order and group words in sentence
  I’m I do , sorry that afraid Dave I can’t.

Semantics (Ling 571)
  Word meaning: individual (lexical), combined (compositional)
  ‘Open’ : AGENT cause THEME to become open;
  ‘pod bay doors’ : (pod bay) doors

Pragmatics/Discourse/Dialogue (Ling 571, maybe)
  Interpret utterances in context
  Speech act (request, statement): Dave's utterance is a request; HAL's reply is a statement
  Reference resolution: I = HAL; that = ‘open doors’
  Politeness: I’m sorry, I’m afraid I can’t

Cross-cutting Themes

Ambiguity
  How can we select among alternative analyses?
Evaluation
  How well does this approach perform:
    On a standard data set?
    When incorporated into a full system?
Multi-linguality
  Can we apply this approach to other languages?
  How much do we have to modify it to do so?


Ambiguity

“I made her duck”

Means....
  I caused her to duck down
  I made the (carved) duck she has
  I cooked duck for her
  I cooked the duck she owned
  I magically turned her into a duck

Ambiguity: POS

“I made her duck”

duck: V (verb) or N (noun)
her: Pron (pronoun) or Poss (possessive)


Ambiguity: Syntax

“I made her duck”

Means....
  I made the (carved) duck she has
    (VP (V made) (NP (POSS her) (N duck)))
  I cooked duck for her
    (VP (V made) (NP (PRON her)) (NP (N duck)))

Ambiguity

Pervasive

Pernicious

Particularly challenging for computational systems

Problem we will return to again and again in class


Tokenization

Given input text, split into words or sentences

Tokens: words, numbers, punctuation

Example:
  Sherwood said reaction has been "very positive."
  Sherwood said reaction has been " very positive . "

Why tokenize?
  Identify basic units for downstream processing


Tokenization

Proposal 1: Split on white space
Good enough? No
Why not?
  Multi-linguality:
    Languages without white space delimiters: Chinese, Japanese
    Agglutinative languages (Finnish)
    Compounding languages (German)
      E.g. Lebensversicherungsgesellschaftsangestellter
      “Life insurance company employee”
  Even with English, misses punctuation
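A quick way to see the English problem with Proposal 1 is to run a plain whitespace split on the example sentence. This small sketch uses only Python's built-in `str.split`; the variable names are illustrative.

```python
# Proposal 1: split on white space only.
sentence = 'Sherwood said reaction has been "very positive."'
tokens = sentence.split()

print(tokens)
# ['Sherwood', 'said', 'reaction', 'has', 'been', '"very', 'positive."']
```

The quotation marks and the period stay glued to the neighboring words, so `"very` and `positive."` would be counted as different vocabulary items from `very` and `positive`.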


Tokenization - again

Proposal 2: Split on white space and punctuation
  For English
Good enough? No
Problems: Non-splitting punctuation
  1.23 -> 1 . 23 (X); 1,23 -> 1 , 23 (X)
  don’t -> don t (X); E-mail -> E mail (X), etc.
Problems: Non-splitting whitespace
  Names: New York; Collocations: pick up
What’s a word?
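The failure cases above can be reproduced with a short regular-expression sketch. `naive_tokenize` implements Proposal 2 literally; `rule_tokenize` adds two hand-written exceptions for numbers and word-internal hyphens or apostrophes. Both function names and both patterns are assumptions made for illustration, not a reference solution, and neither addresses the whitespace problem (New York, pick up).

```python
import re


def naive_tokenize(text):
    """Proposal 2: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


# Hypothetical rule-based pattern, tried left to right:
#   1. numbers with internal '.' or ',' (1.23, 1,23)
#   2. words with internal '-' or "'" (E-mail, don't)
#   3. any other single punctuation mark
RULE_PATTERN = re.compile(r"\d+(?:[.,]\d+)*|\w+(?:[-']\w+)*|[^\w\s]")


def rule_tokenize(text):
    """Split on punctuation except where a rule says it is word-internal."""
    return RULE_PATTERN.findall(text)


for example in ["1.23", "1,23", "don't", "E-mail"]:
    print(example, naive_tokenize(example), rule_tokenize(example))
```

For `1.23` the naive version yields `['1', '.', '23']` while the rule-based version keeps `['1.23']`. Each new exception handles one phenomenon, which is why rule lists of this kind tend to keep growing.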


Sentence Segmentation

Similar issues
Proposal: Split on ‘.’
Problems?
  Other punctuation: ?, !, etc.
  Non-boundary periods:
    1.23
    Mr. Sherwood
    M.p.g.
    ….
Solutions?
  Rules, dictionaries (esp. abbreviations)
  Marked-up text + machine learning
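The "rules plus abbreviation dictionary" idea can be sketched in a few lines. The abbreviation set and the function name below are hypothetical, and the approach has a known gap: an abbreviation that really does end a sentence will not trigger a split.

```python
# Hypothetical abbreviation dictionary; a real one would be far larger.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "m.p.g."}


def split_sentences(text):
    """End a sentence at a token ending in . ? or ! unless it is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        is_boundary = token[-1] in ".?!" and token.lower() not in ABBREVIATIONS
        if is_boundary:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences


text = "Mr. Sherwood paid 1.23 dollars. Was it worth it? He gets 30 m.p.g. now."
for sentence in split_sentences(text):
    print(sentence)
```

Because the check looks only at the last character of each whitespace-delimited token, the period inside `1.23` never counts as a boundary, and `Mr.` and `m.p.g.` are protected by the dictionary lookup.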


Homework #1

General Notes and Naming

All code must run on the CL cluster under Condor

Each programming question needs a corresponding condor command file
  E.g. hw1_q1.cmd
  And should run with: $ condor_submit hw1_q1.cmd

General comments should appear in hw1.(pdf|txt)

Your code may be tested on new data

Q1-Q2: Building and Testing a Rule-Based English Tokenizer

Q1: eng_tokenize.*
  Condor file must contain:
    Input = <input_file>
    Output = <output_file>
  You may use additional arguments
  The ith input line should correspond to the ith output line
  Don’t waste too much time making it perfect!!

Q2: Explain how your tokenizer handles different phenomena (numbers, hyphenated words, etc.) and identify remaining problems.

Q3: make_vocab.*

Given the input text:

The teacher bought the bike. The bike is expensive.

The output should be:

The 2

the 1

teacher 1

bought 1

bike. 1

bike 1

is 1

expensive. 1
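The counting behind this output can be sketched with `collections.Counter` over whitespace-separated tokens. The function name is illustrative, and the ordering is an assumption: this sketch sorts by descending count and leaves ties in first-occurrence order, which may not match the tie order listed above.

```python
from collections import Counter


def make_vocab(text):
    """Count whitespace-separated tokens; return (token, count) pairs, most frequent first."""
    counts = Counter(text.split())
    # sorted() is stable, so tied tokens keep their first-occurrence order (an assumption).
    return sorted(counts.items(), key=lambda item: -item[1])


text = "The teacher bought the bike. The bike is expensive."
for token, count in make_vocab(text):
    print(token, count)
```

Without tokenization, `The` and `the` are separate entries, and so are `bike.` and `bike`, which is the effect Q4 asks you to examine.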


Q4: Investigating Tokenization

Run programs from Q1, Q3

Compare vocabularies with and without tokenization

Next Time

Probability

Formal languages

Finite-state automata