Ling 570 Introduction and Overview



Roadmap

• Course Overview

• Tokenization

• Homework #1


COURSE OVERVIEW

Course Information

• Course web page:
  – http://courses.washington.edu/ling570
  – Syllabus:
    • Schedule and readings
    • Links to other readings, slides, and class recordings
    • Slides posted before class, but may be revised
  – Catalyst tools:
    • GoPost discussion board for class issues
    • CollectIt Dropbox for homework submission and TA comments
    • Gradebook for viewing all grades

GoPost Discussion Board

• Main venue for course-related questions, discussion
  – What not to post:
    • Personal, confidential questions; homework solutions
  – What to post:
    • Almost anything else course-related
      – Can someone explain…?
      – Is this really supposed to take this long to run?
  – Key location for class participation
    • Post questions or answers
    • Your discussion space: Sanghoun & instructors will not jump in often

GoPost

• Emily’s 5-minute rule:
  – If you’ve been stuck on a problem for more than 5 minutes, post to the GoPost!
• Mechanics:
  – Please use your UW NetID as your user id
  – Please post early and often!
    • Don’t wait until the last minute
  – Notifications:
    • Decide how you want to receive GoPost postings

Email

• Should be used only for personal or confidential issues
  – Grading issues, extended absences, other problems
• General questions/comments go on GoPost
• Please send email from your UW account
  – Include Ling570 in the subject
  – If you don’t receive a reply in 24 hours, please follow up

Homework Submission

• All homework should be submitted through CollectIt
  – tar cvf hw1.tar hw1_dir
• Homework due 11:45 Tuesday (unless otherwise noted)
• Late homework receives 10%/day penalty (incremental)
• Most major programming languages accepted
  – C/C++/C#, Java, Python, Perl, Ruby
    • If you want to use something else, please check first
• Please follow naming, organization guidelines in HW
• Expect to spend 10-20 hours/week, including HW docs

Grading

• Homeworks: 50%
• Projects: 40%
• Class participation: 10%
• No midterm or final exams
• Grades in Catalyst Gradebook
• TA feedback returned through CollectIt
• Incomplete: only if all work completed up to the last two weeks
  – UW policy

Broadcast

• Class will (mostly) be taught in 99/1915
• Broadcast live to CSE 305
• Streamed to remote students (also on syllabus):
  – http://www.cs.washington.edu/events/webcast/ling570.asx
  – http://www.cs.washington.edu/events/webcast/ling570.wbv (with ink)
  – Requires the ConferenceXP WebViewer software to view
  – Other formats have no ink (but students can follow slides by downloading the Powerpoints)

Interacting

• Live interaction between CSE 305 and 99/1915
• Remote students: chat channel over Skype
• Please send friend request to:
  – ling570uw
• Skype chat session will be started 5-10 mins. prior to class
• Will be monitored the whole class

Recordings

• All classes will be recorded
  – Links to recordings appear in syllabus
  – They’ll be recorded in multiple formats
  – Available to all students, DL and in class
  – Multiple formats supported for recordings
  – Ink may be only available in Conference XP WebViewer

Contact Info

• Will: Email: [email protected]
• Chris: Email: [email protected]
• Office hours:
  – Thursday: 9-11
  – Location: Padelford
  – Or by arrangement (e.g., on MS campus)
• TA: Sanghoun Song: Email: [email protected]
  – Office hour: Tuesday 9-11
  – Location: Lewis Annex 110

Online Option

• Please check you are registered for correct section
  – CLMS online: Section A
  – State-funded: Section B
  – CLMS in-class: Section C
  – NLT/SCE online (or in-class): Section D
• Online attendance for in-class students
  – Not more than 3 times per term (e.g. missed bus, ice)
• Please enter meeting room 5-10 minutes before start of class
  – Try to stay online throughout class
• Note: if attending on Microsoft Campus
  – If not a Microsofty, please arrive at 99 before 5:00 for admission to building


COURSE DESCRIPTION

Course Prerequisites

• Programming Languages:
  – Java/C++/Python/Perl/…
• Operating Systems: Basic Unix/Linux
• CS 326 (Data structures) or equivalent
  – Lists, trees, queues, stacks, hash tables, …
  – Sorting, searching, dynamic programming, …
• Automata, regular expressions, …
• Stat 391 (Probability and statistics): random variables, conditional probability, Bayes’ rule, …

Textbooks

• Jurafsky and Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2nd edition, 2008
  – Available from UW Bookstore, Amazon, etc.
• Manning and Schuetze, Foundations of Statistical Natural Language Processing

Roadmap

• Motivation:
  – Applications
• Language and Thought
• Knowledge of Language
  – Cross-cutting themes
    • Ambiguity, Evaluation, & Multi-linguality
• Course Overview

Motivation: Applications

• Applications of Speech and Language Processing
  – Call routing
  – Information retrieval
  – Question-answering
  – Machine translation
  – Dialog systems
  – Spam tagging
  – Spell-, grammar-checking
  – Sentiment analysis
  – Information extraction
  – …

Shallow vs Deep Processing

• Shallow processing (Ling 570)
  – Usually relies on surface forms (e.g., words)
    • Less elaborate linguistic representations
  – E.g. Part-of-speech tagging; Morphology; Chunking
• Deep processing (Ling 571)
  – Relies on more elaborate linguistic representations
    • Deep syntactic analysis (Parsing)
    • Rich spoken language understanding (NLU)

Language & Intelligence

• Turing Test: (1950)
  – Operationalize intelligence
  – Two contestants: human, computer
  – Judge: human
  – Test: Interact via text questions
  – Question: Can you tell which contestant is human?
• Crucially requires language use and understanding

Knowledge of Language

• What does HAL (of 2001, A Space Odyssey) need to know to converse?
• Dave: Open the pod bay doors, HAL.
• HAL: I'm sorry, Dave. I'm afraid I can't do that.

• Phonetics & Phonology (Ling 450/550)
  – Sounds of a language, acoustics
  – Legal sound sequences in words

• Morphology (Ling 570)
  – Recognize, produce variation in word forms
  – Singular vs. plural: Door + sg -> door; Door + plural -> doors
  – Verb inflection: Be + 1st person, sg, present -> am
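The mappings on this slide (lemma plus features to surface form) can be sketched as a small generator. This is only a toy illustration: the function names and the one-entry irregular table are hypothetical, and the regular plural rule ignores spelling changes and irregular nouns.

```python
# Toy generator for the slide's examples. Names and tables are hypothetical;
# real morphology needs many more rules and exceptions.
IRREGULAR_VERBS = {
    ("be", "1st", "sg", "present"): "am",
}


def inflect_noun(lemma, number):
    """Map a noun lemma plus a number feature to a surface form."""
    if number == "sg":
        return lemma
    if number == "plural":
        return lemma + "s"  # naive rule: no -es, -ies, or irregular plurals
    raise ValueError(f"unknown number feature: {number}")


def inflect_verb(lemma, person, number, tense):
    """Look up an irregular verb form; only listed forms are supported."""
    key = (lemma, person, number, tense)
    if key not in IRREGULAR_VERBS:
        raise KeyError(f"no entry for {key}")
    return IRREGULAR_VERBS[key]


print(inflect_noun("door", "sg"))
print(inflect_noun("door", "plural"))
print(inflect_verb("be", "1st", "sg", "present"))
```

Recognition runs the same mapping in the other direction, from a surface form such as "doors" back to lemma and features.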

• Part-of-speech tagging (Ling 570)
  – Identify word use in sentence
  – Bay (Noun), not verb or adjective

• Syntax
  – (Ling 566: analysis; Ling 570: chunking; Ling 571: parsing)
  – Order and group words in sentence
    • I’m I do , sorry that afraid Dave I can’t.

• Semantics (Ling 571)
  – Word meaning:
    • individual (lexical), combined (compositional)
    • ‘Open’: AGENT cause THEME to become open
    • ‘pod bay doors’: (pod bay) doors

• Dave: Open the pod bay doors, HAL. (request)
• HAL: I'm sorry, Dave. I'm afraid I can't do that. (statement)

• Pragmatics/Discourse/Dialogue (Ling 571, maybe)
  – Interpret utterances in context
  – Speech act (request, statement)
  – Reference resolution: I = HAL; that = ‘open doors’
  – Politeness: I’m sorry, I’m afraid I can’t

Cross-cutting Themes

• Ambiguity
  – How can we select among alternative analyses?
• Evaluation
  – How well does this approach perform:
    • On a standard data set?
    • When incorporated into a full system (e.g., E2E)?
• Multi-linguality
  – Can we apply this approach to other languages?
  – How much do we have to modify it to do so?

Ambiguity

• “Flying planes made her duck”
• Means....
  – Planes flying overhead caused her to duck down
  – While flying planes, she ducked
  – Planes flying overhead made her a (carved) duck
  – Planes flying overhead cooked a duck for her
  – Planes flying overhead cooked her duck
• (Slide annotations: V, N, Poss, Pron)

Ambiguity: Syntax

• “I made her duck”
• Means....
  – …made the (carved) duck she has
    • (VP (V made) (NP (POSS her) (N duck)))
  – …cooked duck for her
    • (VP (V made) (NP (PRON her)) (NP (N duck)))
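To make the contrast between the two analyses concrete, here is a minimal sketch that reads balanced bracketed trees into nested Python lists and lists the tag assigned to each word. The function names `parse_brackets` and `pos_tags` are illustrative, not part of the course materials, and the input strings are balanced versions of the two bracketings.

```python
def parse_brackets(text):
    """Parse a bracketed tree such as "(VP (V made))" into nested lists."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            node = []
            while tokens[pos] != ")":
                node.append(read())
            pos += 1  # consume the closing ")"
            return node
        if tok == ")":
            raise ValueError("unexpected ')'")
        return tok

    tree = read()
    if pos != len(tokens):
        raise ValueError("trailing material after the tree")
    return tree


def pos_tags(tree):
    """Collect (tag, word) pairs from the preterminal nodes, left to right."""
    if len(tree) == 2 and isinstance(tree[1], str):
        return [(tree[0], tree[1])]
    pairs = []
    for child in tree[1:]:
        pairs.extend(pos_tags(child))
    return pairs


possessive = parse_brackets("(VP (V made) (NP (POSS her) (N duck)))")
ditransitive = parse_brackets("(VP (V made) (NP (PRON her)) (NP (N duck)))")

print(pos_tags(possessive))
print(pos_tags(ditransitive))
```

The word sequence is identical in both trees. Only the tag on "her" and the number of NP children under VP differ, and that difference is exactly what a parser has to choose between.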

Ambiguity

• Pervasive
• Pernicious
• Particularly challenging for computational systems
• Problem we will return to again and again in class

Tokenization

• Given input text, split into words or sentences
• Tokens: words, numbers, punctuation
• Example:
  – Sherwood said reaction has been “very positive.”
  – Sherwood said reaction has been “ very positive . ”
• Why tokenize?
  – Identify basic units for downstream processing

Tokenization

• Proposal 1: Split on white space
• Good enough? No
• Why not?
  – Multi-linguality:
    • Languages without white space delimiters: Chinese, Japanese
    • Agglutinative languages (Hungarian, Korean)
      – meggazdagiithatnok
    • Compounding languages (German)
      – Lebensversicherungsgesellschaftsangestellter: “Life insurance company employee”
  – Even with English, misses punctuation
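A quick sketch of Proposal 1 shows both failure modes. The English sentence is the earlier example with straight quotes substituted for the curly ones, and the Chinese string is an arbitrary illustrative input; neither comes from the homework data.

```python
# Proposal 1: split on white space only.
english = 'Sherwood said reaction has been "very positive."'
english_tokens = english.split()
print(english_tokens)
# Punctuation stays glued to the neighboring words: '"very' and 'positive."'

# A language written without white space delimiters comes back as one token.
chinese = "我爱自然语言处理"
chinese_tokens = chinese.split()
print(len(chinese_tokens))
```

Calling `str.split()` with no arguments splits on any run of white space, so it is the simplest possible baseline to compare later proposals against.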

Tokenization - again

• Proposal 2: Split on white space and punctuation
  – For English
• Good enough? No
• Problems: Non-splitting punctuation
  – 1.23 -> 1 . 23 X; 1,23 -> 1 , 23 X
  – don’t -> don t X; E-mail -> E mail X, etc.
• Problems: Non-splitting whitespace
  – Names: New York; Collocations: pick up
• What’s a word?
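The over-splitting described above is easy to reproduce with a regular expression, and a slightly richer pattern repairs the numeric, apostrophe, and hyphen cases. Both patterns below are illustrative assumptions, not the homework's required tokenizer, and the second one still does nothing for multi-word units such as "New York".

```python
import re


def split_on_punct(text):
    """Proposal 2: every punctuation mark becomes its own token."""
    return re.findall(r"\w+|[^\w\s]", text)


# A hedged refinement: keep numbers and word-internal marks together.
TOKEN = re.compile(
    r"""
    \d+(?:[.,]\d+)*(?!\w)    # numbers such as 1.23 or 1,23
  | \w+(?:['’-]\w+)*         # words with internal apostrophes or hyphens
  | [^\w\s]                  # any other single punctuation mark
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return the tokens matched by the refined pattern, left to right."""
    return TOKEN.findall(text)


print(split_on_punct("1.23"))
print(split_on_punct("don't"))
print(tokenize("E-mail costs 1.23"))
print(tokenize('been "very positive."'))
```

The alternatives are ordered so that the number pattern is tried before the general word pattern; otherwise "1.23" would be split at the period again.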

Sentence Segmentation

• Similar issues
  – Proposal: Split on ‘.’
• Problems?
  – Other punctuation: ?, !, etc.
  – Non-boundary periods:
    • 1.23
    • Mr. Sherwood
    • M.p.g.
    • ….
• Solutions?
  – Rules, dictionaries esp. abbreviations
  – Marked up text + machine learning
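The "rules plus abbreviation dictionary" solution can be sketched in a few lines. The abbreviation list and the function name here are hypothetical and deliberately tiny; a usable segmenter would need a much larger dictionary and would still miss cases such as a sentence that ends in an abbreviation.

```python
# Hypothetical, deliberately small abbreviation dictionary (lowercased).
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "m.p.g."}


def split_sentences(text):
    """Close a sentence after a token ending in . ? or ! unless the token
    is a known abbreviation. Tokens like 1.23 do not end in a period, so
    they never trigger a boundary."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        looks_final = token.endswith((".", "?", "!"))
        if looks_final and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing material without final punctuation
        sentences.append(" ".join(current))
    return sentences


text = "Mr. Sherwood paid 1.23 dollars. Was it worth it? Yes!"
naive = text.split(".")
print(len(naive))
print(split_sentences(text))
```

Splitting on every period breaks the text inside "Mr." and "1.23", while the rule-based version returns the three intended sentences. The machine-learning alternative replaces the hand-written rule with a classifier trained on text where boundaries are already marked.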


HOMEWORK #1

General Notes and Naming

• All code must run on the CL cluster under Condor
• Each programming question needs a corresponding condor command file named:
  – E.g. hw1_q1.cmd
  – And should run with: $ condor_submit hw1_q1.cmd
• General comments should appear in hw1.(pdf|txt)
• Your code may be tested on new (unseen) data

HW #1

• Link on Syllabus.


FSA/T Conventions

• Next class