Page 1: Survey of NLP

Survey of NLP
JILLIAN K. CHAVES

CUBRC, Inc.

Page 2: Survey of NLP

Survey of NLP Module 1

Introduction Tokenization Sentence Breaking

Module 2 Part-of-Speech (POS) Tagging N-gram Analysis

Module 3 Phrase Structure Parsing Syntactic Parsing

Module 4 Semantic Analysis NLP & Ontologies

Page 3: Survey of NLP

Introduction

What is natural language?
  A set of subconscious rules about the pronunciation (phonology), order (syntax), and meaning (semantics) of linguistic expressions.

What is linguistics?
  The scientific study of language use, acquisition, and evolution.

What is computation?
  The manipulation of information according to a specific method (e.g., an algorithm) for determining an output value from a set of input values.

What is computational linguistics?
  The study of the computational processes that are necessary for the generation and understanding of natural language.

Page 4: Survey of NLP

Introduction

Processing natural language is far from trivial. Language is:
  • based on very large vocabularies (± 20,000 words)
  • rich in meaning (sometimes vague and context-dependent)
  • regulated by complicated patterns and subconscious rules
  • massively ambiguous (resolved only by world knowledge)
  • noisy (speakers routinely produce and are tolerant to errors)
  • produced and comprehended very quickly (and usually effortlessly)

Humans are specially equipped to handle these difficulties, but machines are not (yet). Is it possible to make a machine understand and use natural language as a human does, or even approximate the same utility?

Page 5: Survey of NLP

A Typical NLP Pipeline

A more-or-less standardized approach (a sketch of the first stages follows below):
  Tokenization: Isolate all words and word parts
  Sentence Segmentation: Isolate each individual sentence
  POS Tagging: Assign part(s) of speech for each word
  Phrase Structure Parsing: Isolate constituent boundaries
  Syntactic Parsing: Identify argument structures
  Semantic Analysis: Divine the meaning of a sentence
  Ontology Translation: Map meaning to a concept model
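A minimal sketch of the first three pipeline stages. NLTK is used here as one possible toolkit; the slides do not prescribe an implementation, and the relevant NLTK data packages (tokenizer and tagger models) are assumed to be installed.

import nltk

text = "Mr. Smith bought a red car. He drives it to work every day."

# Sentence segmentation, then tokenization, then POS tagging (Penn Treebank tags)
for sentence in nltk.sent_tokenize(text):
    tokens = nltk.word_tokenize(sentence)
    print(nltk.pos_tag(tokens))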

Page 6: Survey of NLP

Problems for NLP: Ambiguity

Speech segmentation
  Misheard song lyrics, for example
  Discourse phenomena such as casual speech

Lexical categorization
  I saw her duck.
  She fed her baby carrots.

Lexical/phrasal structure
  British Prime Minister
    The Prime Minister of Britain?
    A Prime Minister (of some unknown country) who is of British descent?
  Unlockable
    Something that can be unlocked? Something that can not be locked?
    • Analogous to mathematical order of operations: 12 ÷ 2 + 1 = 7 or 4?

Page 7: Survey of NLP

Problems for NLP: Ambiguity

Sentence structure
  People with kids who use drugs should be locked up.
  I forgot how good beer tastes.

Semantic structure
  Someone always wins the game. [reference ambiguity]
  Every arrow hit a target. [scope ambiguity]

Implicitness
  Can you open the door?
    A) Are you able to open the door?
    B) Open the door!
  What is the dog doing in the garage?
    A) What activity is the dog carrying out?
    B) The dog doesn’t belong there.
  Yeah, right.
    A) Yes, that is correct. (= agreement)
    B) No, that is incorrect. (= sarcasm)

Page 8: Survey of NLP

Survey of NLP Module 1

Introduction Tokenization Sentence Segmentation

Module 2 Part-of-Speech (POS) Tagging N-gram Analysis

Module 3 Phrase Structure Parsing Syntactic Parsing

Module 4 Semantic Analysis NLP & Ontologies

Page 9: Survey of NLP

Tokenization

Type
  The set of “word form” types in a language is the lexicon
Token
  A single instance of a linguistic type (word or contracted word)

  I am hungry. { I | am | hungry | . } (types = 4; tokens = 4)
  He’s Mary’s friend? { He | ’s | Mary | ’s | friend | ? } (types = 5; tokens = 6)
  The blue car chased the red car. (types = 6; tokens = 8)
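The counts above can be reproduced with a toy tokenizer. The regular expression below is a simplification chosen to split clitics and punctuation for these three examples; it is not a general-purpose tokenizer.

import re

def tokenize(sentence):
    # split off clitics ('s) and sentence-final punctuation
    return re.findall(r"'\w+|\w+|[.?!]", sentence)

for s in ["I am hungry.", "He's Mary's friend?", "The blue car chased the red car."]:
    tokens = tokenize(s)
    types = {t.lower() for t in tokens}
    print(s, "-> tokens:", len(tokens), "types:", len(types))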

Types vs. Tokens in Comparative Corpora

  Corpus                         Types        Tokens
  Switchboard Corpus             20,000       2,400,000
  Shakespeare                    31,000       884,000
  Google Books (Ngram Viewer)    13,000,000   1,000,000,000

Page 10: Survey of NLP

Tokenization

Tokenization: the process of individuating/indexing all tokens in a text
  Very difficult in writing systems with lax compounding rules or flexible word boundaries
    German: der Donaudampfschifffahrtsgesellschaftskapitän
      THE DANUBE · STEAMBOAT · VOYAGE · COMPANY · CAPTAIN
      (“The Danube Steamship Company captain”)
    English: gonna, wanna, shoulda, hafta, …

Every token has a unique (within context) part-of-speech category and semantics
  Cross-POS homography
    • Verb/Noun: record, progress, attribute, …
  Syncretism
    • Simple past and past participle: bought, cost, led, meant, …

Page 11: Survey of NLP

Tokenization

The problem is token delineation
  Spaces: United States of America
  Hyphens: well-rounded; father-in-law
  Multiple “spellings”:
    US, USA, U.S., U.S.A., United States, …
    1/11/11, 01/11/11, January 11, 2011, 11 January 2011, 2011-01-11, …
    (716) 555-5555, 716-555-5555, 716.555.55.55, …

The solution is normalization
  Lemmatization: identifying the root (lemma) of each token
    Lemma: open
      • Inflectional paradigm: open, opens, opening, opened, …
    Lemma: be
      • Inflectional paradigm: am, is, are, was, were, being, been, isn’t, aren’t, …
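A small lemmatization sketch using NLTK’s WordNet lemmatizer (one possible tool; the slides do not name one). The WordNet data package is assumed to be installed.

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# The inflectional paradigm of "open" collapses to the lemma "open"
print([lemmatizer.lemmatize(w, pos="v") for w in ["opens", "opening", "opened"]])
# Irregular forms of "be" collapse to the lemma "be"
print([lemmatizer.lemmatize(w, pos="v") for w in ["am", "is", "are", "was", "were", "being", "been"]])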

Page 12: Survey of NLP

Lemmatization

A lemma is a linguistic type. The set of possible word forms is much bigger than the set of lemmas, thanks to derivation and inflection
  Nouns/verbs: bike, skate, shelf, fax, email, Facebook, Google, …
  Plural (-s) combines with most singular common nouns
    cat(s), table(s), day(s), idea(s), …
  Genitive (-’s) combines with most nominals (simple or complex)
    John’s cat, the black cat’s food, the Queen of England’s hat, the girl I met yesterday’s car
  Progressive (-ing) attaches to almost any verb
    biking, skating, shelving, faxing, emailing, Facebooking, Googling, …
    …which again can be ambiguous with another POS, e.g., shelving

Page 13: Survey of NLP

Inflection and Derivation

Inflection
  The paradigm (aka conjugation) of a single verb to account for person, number, and tense agreement
  Regular
    I act, he acts, you acted, we are acting, they have acted, he will act
  Irregular
    I go, he goes, you went, we are going, they have gone, she will go
    I catch, he catches, you caught, we are catching, they have caught, she will catch
  New/introduced verbs (e.g., tweet, Google) have regular inflection

Derivation
  The process of deriving new words from a single root word
  Nation (n.) → national (adj.) → nationalize (v.) → nationalization (n.)

Page 14: Survey of NLP

The Importance of Accurate Tokenization

Better downstream syntactic parsing
  Stochastic (statistical) parsing thrives on high-quality input
Better downstream semantic assessment
  Stable but rare lexical composition patterns
    Anti-tank-missile (= a missile that targets tanks)
      • Anti-missile-missile (= a missile that targets missiles)
      • Anti-anti-missile-missile-missile (= a missile that targets anti-missile-missiles)
    Great-grandfather (= a grandparent’s father)
      • Great-great-grandfather (= a grandparent’s parent’s father)
      • Great-great-great-grandfather …
  Reliable lexical decomposition, especially with new/nonce words
    I Yandexed it.           {v|Yandex} + simple past
    I’m a Yandexer.          {v|Yandex} + agentive nominalization
    I can’t stop Yandexing.  {v|Yandex} + progressive aspect

Page 15: Survey of NLP

Survey of NLP Module 1

Introduction Tokenization Sentence Breaking

Module 2 Part-of-Speech (POS) Tagging N-gram Analysis

Module 3 Phrase Structure Parsing Syntactic Parsing

Module 4 Semantic Analysis NLP & Ontologies

Page 16: Survey of NLP

Sentence Segmentation

Naïve approach to identifying a sentence boundary (a code sketch follows below):
  1. If the current token is a period, it’s the end of a sentence.
  2. If the preceding token is on a list of known abbreviations, then the period might not end the sentence.
  3. If the following token is capitalized, then the period ends the sentence.

Shockingly: 95% accuracy!
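A rough sketch of the three rules above. The abbreviation list is illustrative only; note that a title such as “Mrs.” followed by a capitalized surname is still (wrongly) treated as a boundary, which is part of the residual error.

ABBREVIATIONS = {"mr.", "mrs.", "dr.", "co.", "inc.", "st."}   # illustrative, not exhaustive

def naive_split(tokens):
    """Naive sentence breaker over a list of tokens."""
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        if not tok.endswith("."):
            continue                                    # Rule 1: only a period can end a sentence
        next_tok = tokens[i + 1] if i + 1 < len(tokens) else ""
        is_abbreviation = tok.lower() in ABBREVIATIONS  # Rule 2: maybe not a boundary
        next_capitalized = next_tok[:1].isupper()       # Rule 3: boundary if next token is capitalized
        if is_abbreviation and not next_capitalized:
            continue
        sentences.append(" ".join(current))
        current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

tokens = "He is a retired inspector for the Ford Motor Co. in Buffalo . She works as a tax preparer .".split()
print(naive_split(tokens))   # two sentences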

Demo: An Online Sentence Breaker

1. Mr. and Mrs. Jack Giancarlo of Lancaster celebrated their 50th wedding anniversary with a family cruise to the Bahamas. Mr. Giancarlo and Patricia Keenan were married September 28, 1963, in Holy Angels Catholic Church, Buffalo. He is a retired inspector for the Ford Motor Co. Buffalo Stamping Plant; she is working as a tax preparer for H&R Block. They have five children and 13 grandchildren. [1]

2. The bookkeeper/office manager at an Amherst jewelry store has admitted stealing more than $51,000.00 in cash from daily sales at the business. Rena Carrow, 44, of Lancaster, pleaded guilty to third-degree grand larceny in the theft at Andrews Jewelers on Transit Road, according to Erie County District Attorney Frank A. Sedita III. Carrow admitted that between Aug. 31, 2011 and Dec. 5, 2012 she stole $51,069.14. She faces up to seven years in prison when she is sentenced Jan. 16 by Erie County Judge Kenneth F. Case. [2]

[1] Adapted from http://www.buffalonews.com/life-arts/golden-weddings/patricia-and-jack-giancarlo-20131010, accessed 10 October 2013.
[2] Adapted from http://www.buffalonews.com/city-region/amherst/jewelry-store-bookkeeper-admits-to-stealing-more-than-51000-20131010, accessed 10 October 2013.

Page 17: Survey of NLP

End of Module 1

Questions?

Page 18: Survey of NLP

Survey of NLP Module 1

Introduction Tokenization Sentence Breaking

Module 2 Part-of-Speech (POS) Tagging N-gram Analysis

Module 3 Phrase Structure Parsing Syntactic Parsing

Module 4 Semantic Analysis NLP & Ontologies

Page 19: Survey of NLP

Parts of Speech

Closed class (function words)
  Pronouns: I, me, you, he, his, she, her, it, …
    Possessive: my, mine, your, his, her, their, its, …
    Wh-pronouns: who, what, which, when, whom, whomever, …
  Prepositions: in, under, to, by, for, about, …
  Determiners: a, an, the, each, every, some, …
  Conjunctions
    Coordinating: and, or, but, as, …
    Subordinating: that, then, who, because, …
  Particles: up, down, off, on, …
  Numerals: one, two, three, first, second, …
  Auxiliary verbs: can, may, should, could, …

Open class (content words)
  Nouns
    Proper nouns: Jackie, Microsoft, France, Jupiter, …
    Common nouns
      • Count nouns: cat, table, dream, height, …
      • Mass (non-count) nouns: milk, oil, mail, music, furniture, fun, …
  Verbs: read, eat, paint, think, tell, sleep, …
  Adjectives: purple, bad, false, original, …
  Adverbs: quietly, always, very, often, never, …

Page 20: Survey of NLP

POS Annotation Tagsets

Penn Treebank
  A syntactically annotated corpus of approximately 4.5M words, using a set of 45 POS tags devised by UPenn (sampling of the tagset below)

  CC    Coordinating conjunction
  CD    Cardinal number
  DT    Determiner
  EX    Existential there
  IN    Preposition/subordinating conjunction
  JJ    Adjective, bare
  JJR   Adjective, comparative
  JJS   Adjective, superlative
  MD    Modal verb
  NN    Noun, singular
  NNS   Noun, plural
  NNP   Proper noun, singular
  NNPS  Proper noun, plural
  POS   Possessive marker
  PRP   Personal pronoun
  PRP$  Possessive pronoun
  RB    Adverb, bare
  RBR   Adverb, comparative
  RBS   Adverb, superlative
  TO    to
  UH    Discourse interjection
  VB    Verb, infinitive (base)
  VBD   Verb, past tense
  VBG   Verb, gerund
  VBN   Verb, past participle
  VBP   Verb, non-3rd sing. pres.
  VBZ   Verb, 3rd sing. pres.
  .     Sentence-final punctuation (. ? !)
  LRB   Left round parenthesis
  RRB   Right round parenthesis

Page 21: Survey of NLP

POS Annotation Tagsets

Comparison (corpus : word count : tagset size)

  Penn Treebank                                     4.5M    n = 45
  British National Corpus (BNC)                     100M    n = 61
  Brown Corpus (Brown University)                   1M      n = 82
  Corpus of Contemporary American English (COCA)    450M    n = 137
  Global Web-Based English (GloWbE)                 1.9B    n = 137

Why such a range across tagsets?
  Occurrence of “complex” tags
    • Penn:  [isn’t] → is/VBZ n’t/RB
    • Brown: [isn’t] → VBZ* (‘*’ indicates negation)
  Most category distinctions are recoverable by context

A more exhaustive list of available corpora is available online.

Page 22: Survey of NLP

POS Annotation Tagsets

Each token is assigned its possible POS tags
  Ambiguity resolved with statistical likelihood measures
    • e.g., nouns are more likely than verbs to begin sentences, etc.

  Bill:       NNP, NN, VB        (3)
  saw:        VBD, VB, VBP, NN   (4)
  her:        PRP$, PRP          (2)
  father:     NN, VB, VBP        (3)
  ’s:         POS, VBZ           (2)
  bike:       NN, VB, VBP        (3)
  yesterday:  RB, NN             (2)
  .:          .                  (1)

  3 × 4 × 2 × 3 × 2 × 3 × 2 × 1 = 864 possible tag combinations

Given the syntactic patterns of English, only one is statistically likely:
  Bill/NNP saw/VBD her/PRP$ father/NN ’s/POS bike/NN yesterday/RB ./.

Page 23: Survey of NLP

POS Annotation Tagsets

Lexical ambiguity metrics: Brown Corpus
  11.5% of word types are ambiguous
  However, those 11.5% tend to be the most frequent types:
    • I know that/IN she is honest.
    • Yes, that/DT concert was fun.
    • I’m not that/RB hungry.
  In fact, those 11.5% of types account for 40% of the tokens in the Brown Corpus!

Page 24: Survey of NLP

Methods & Accuracy

  Rule-based POS tagging             50.0% – 90.0%
  Probability-based (trigram HMM)    55.0% – 95.0%
  Maximum Entropy P(t|w)             93.7% – 82.6%
  TnT (HMM++)                        96.2% – 86.9%
  MEMM tagger                        96.9% – 86.9%
  Dependency parser (Stanford)       97.2% – 90.0%
  Manual (human)                     98% upper bound

“Current part-of-speech taggers work rapidly and reliably, with per-token accuracies of slightly over 97%. [...] Good taggers have sentence accuracies around 55–57%.” (Source: Manning 2011)

Page 25: Survey of NLP

Rule-based Method

Create a list of words with their most likely parts of speech
For each word in a sentence, tag it by looking up its most likely tag
  e.g., dog/NN > dog/VB > dog/VBP
Correct for errors with tag-changing rules (see the sketch below)
  Contextual rules: revise the tag based on the surrounding words or the tags of the surrounding words
    • IN → DT / NEXTTAG NN (IN becomes DT if the next tag is NN)
    • that/IN cat/NN → that/DT cat/NN
  Lexical rules: revise the tag based on an analysis of the stemmed word, in concert with the understanding of derivational rules of English
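A toy illustration of this lookup-then-correct approach. The miniature lexicon and the single contextual rule are assumptions made for the example, not a real rule set.

MOST_LIKELY_TAG = {"that": "IN", "cat": "NN", "the": "DT", "sleeps": "VBZ"}

def tag(tokens):
    # Step 1: lexical lookup of the most likely tag (default NN for unknown words)
    tags = [MOST_LIKELY_TAG.get(t.lower(), "NN") for t in tokens]
    # Step 2: contextual correction rule  IN -> DT / NEXTTAG NN
    for i in range(len(tags) - 1):
        if tags[i] == "IN" and tags[i + 1] == "NN":
            tags[i] = "DT"
    return list(zip(tokens, tags))

print(tag("that cat sleeps".split()))   # [('that', 'DT'), ('cat', 'NN'), ('sleeps', 'VBZ')]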

Page 26: Survey of NLP

Stemming

Affixation
  Regular but not universal
    • -ize → modernize, legalize, finalize
      *newize, *lawfulize, *permanentize
    • un- → unhealthy, unhappy, unstable
      *unsick, *unsad, *unmiserable
    • -s (plural) → cats, dogs, birds
      *oxs (oxen), *mouses (mice), *hippopotamuss (hippopotami or hippopotamuses)

Irregular verbs
  Root form changes for tense/aspect
    • sink → sank → sunk
    • begin → began → begun
    • go → went → gone
    • do → did → done
  Unstable paradigms
    • dive → dove? dived? (usually a dialectal variation)

Page 27: Survey of NLP

Stemming: Variation Predictability

Pluralization via affix (cf. root change, e.g., man → men)

  1. A singular root that does not end in an “s”, “z”, “sh”, “ch”, or “dg” sound or a vowel will take ‘-s’ in the plural form.
     • cat, dog, lab, map, batter, seagull, button, firm, …
  2. A singular root ending in an “s”, “z”, “sh”, “ch”, or “dg” sound will take ‘-es’ in the plural form; if this results in an overlapping orthographic ‘e’, the two collapse.
     • loss + es = losses / bus + es = buses / house + es = houses / …
     • buzz + es = buzzes / waltz + es = waltzes / …
     • ash + es = ashes / match + es = matches / hedge + es = hedges / …
     A. Corollary: A singular root ending in a single ‘z’ will geminate in the plural form.
        • quiz + es = quizzes / …

Predictable variation can be captured with rules, as the sketch below illustrates.
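A rough regex sketch of rules 1, 2, and the corollary. Irregular plurals (oxen, mice, hippopotami) would need an exception list, and vowel-final roots are left to the default case.

import re

def pluralize(noun):
    if re.search(r"(s|z|sh|ch|dge?)$", noun):      # rule 2: sibilant-final roots take -es
        if re.search(r"[aeiou]z$", noun):
            noun += "z"                             # corollary: quiz -> quizz-
        return re.sub(r"e$", "", noun) + "es"       # collapse the overlapping 'e': hedge -> hedges
    return noun + "s"                               # rule 1: everything else takes -s

for n in ["cat", "dog", "loss", "bus", "waltz", "quiz", "match", "hedge"]:
    print(n, "->", pluralize(n))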

Page 28: Survey of NLP

Survey of NLP Module 1

Introduction Tokenization Sentence Breaking

Module 2 Part-of-Speech (POS) Tagging N-gram Analysis

Module 3 Phrase Structure Parsing Syntactic Parsing

Module 4 Semantic Analysis NLP & Ontologies

Page 29: Survey of NLP

N-grams

Probabilistic language modeling
  Goal: determine the probability P of a sequence of words
  Applications:
    POS tagging
      • P(She/PRP bikes/VBZ) > P(She/PRP bikes/NNS)
    Spellchecking
      • P(their cat is sick) > P(there cat is sick)
    Speech recognition
      • P(I can forgive you) > P(I can for give you)
    Machine translation, natural language generation, language identification, authorship (genre) identification, word similarity, sentiment analysis, etc.

Page 30: Survey of NLP

N-grams

N-gram: a sequence of n words
  Unigram: a single isolated word
  Bigram: a sequence of two words
  Trigram: a sequence of three words
  4-gram: a sequence of four words
  …
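A toy maximum-likelihood bigram model over a made-up corpus, showing how such probabilities are estimated by counting.

from collections import Counter

corpus = "the cat sat on the mat . the cat ate .".split()   # illustrative data

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_bigram(w2, w1):
    """Maximum-likelihood estimate of P(w2 | w1) = count(w1, w2) / count(w1)."""
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(p_bigram("cat", "the"))   # 2/3: "the" is followed by "cat" twice and "mat" once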

Resources/demonstrations
  Online n-gram calculators
  Google Books Ngram Viewer
  Automatic random language generation (based on n-gram probabilities of input text)

Page 31: Survey of NLP

N-grams: Scope of Usefulness

In a text…
  The set of bigrams is large, and bigrams occur with high frequencies
  The set of trigrams is smaller than the set of bigrams, and trigrams are also less frequent
  …
  The set of 15-grams is small, and each 15-gram probably occurs only once
  Zipf’s Law (the long-tail phenomenon): the frequency of a word is inversely correlated with its semantic specificity

Related task
  Compute the probability of an upcoming word: “the probability of the next word being w5 given the preceding environment w1 followed by w2 followed by w3 followed by w4.”
  Example: What is the value of P(the | it, is, easy, to, see)?

Page 32: Survey of NLP

N-grams: Scope of Usefulness

What is the value of P(the | it, is, easy, to, see)?

Approach #1: Counting!
  Per Google (as of 22-Oct-2013):
    P(the | it, is, easy, to, see) ≈ count(“it is easy to see the”) / count(“it is easy to see”)
  Problem: not all possible sequences occur very often

Page 33: Survey of NLP

N-grams: Scope of Usefulness

What is the value of P(the | it, is, easy, to, see)?

Approach #2: Estimate with N-grams
  Joint probabilities (chain rule):
    P(w1, w2, …, wn) = P(w1) · P(w2 | w1) · P(w3 | w1, w2) · … · P(wn | w1, w2, …, wn-1)
  Complex, time-consuming, and, in the end, not very helpful

Limitations
  N-gram probability analysis doesn’t give the whole picture
  “Garden path” sentences
    The man that I saw with her bikes to work every day.
    The man that I saw with her bikes was a thief.
  News headlines (“Journalese”)
    Corn maze cutter stalks fall fun across country
    After Earth Lost To Both Fast & Furious And Now You See Me At Friday Box Office
    Jury awards $6.5M in CA case of nozzle thought gun

Page 34: Survey of NLP

Recurring Problem: Non-linearity

Predictive sequence models fail because they assume that:
  Syntax is linear (cf. hierarchical)
    “She sent a postcard to her friend from Australia.”
      • L (linear): She sent a postcard to [her friend from Australia].
      • H (hierarchical): [She sent a postcard] to her friend [from Australia].
  All dependencies are local (cf. long-distance)
    • Which instrument did you play?
      Deconstruction: Determine the value of x such that x is an instrument and you play x
    • Which instrument did your college roommate try to annoy you by playing?
      Deconstruction:
        Define set v that is identical to the set of your roommates
        Define subset x of set v as the set of roommates from college
        Define subset y of set v that played an instrument w
        Define subset z of set v that played w to annoy you
        Determine the value of w

Page 35: Survey of NLP

End of Module 2

Questions?

Page 36: Survey of NLP

Survey of NLP Module 1

Introduction Tokenization Sentence Breaking

Module 2 Part-of-Speech (POS) Tagging N-gram Analysis

Module 3 Phrase Structure Parsing Syntactic Parsing

Module 4 Semantic Analysis NLP & Ontologies

Page 37: Survey of NLP

Phrase Structures

Computational analogy: base-10 arithmetic

  Lexicon:  N → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
            O → + | - | x | =
  Grammar:  N → N O N

  Example derivations:
    (1 + 2) x (3 + 4)  →  3 x 7  →  21
    9 – ((2 x 3) + 1)  →  9 – (6 + 1)  →  9 – 7  →  2

Page 38: Survey of NLP

Phrase Structures

Natural language has a bigger lexicon and more rules
How? Recursion: a phrase defined in terms of itself
  • A noun phrase can be rewritten as (for instance):
      NP → DT N      “the dog”
      N  → N PP      “dog in the yard”
  • A prepositional phrase is rewritten as a preposition (relational term) and a noun phrase:
      PP → P NP
  • These three rules alone allow for infinite recursion!

Example:
  “Put the ring in the box on the table at the end of the hallway.”
  Where is the ring now? Where is it going?

Page 39: Survey of NLP

Phrase Structures

Phrasal rewrite rules
  Additional rules of English
    S  → NP VP            [the dog] [barked]
      • NP → DT N         [the dog]
      • VP → IV           [barked]
    N  → AdjP N           [big] [dog]
    VP → TV NP            [gnawed] [the bone]
    VP → DTV NP NP        [gave] [Mary] [a kiss]
    VP → DTV NP PP        [gave] [a kiss] [to Mary]
    PDV → DTV             [was given]
      • VP → PDV NP PP    [was given] [a kiss] [by the dog]
    VP → VP PP            [went] [to the park]
      • PP → P NP         [to] [the park]

Page 40: Survey of NLP

Phrase Structures

Syntactic tree structure: “The woman called a friend from Australia.”

  Parse #1: The woman [called] a friend [from Australia].
  Parse #2: The woman called [a friend from Australia].

Is this parse predicted by the grammar rules?
  [The woman] called a friend [from Australia].
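The attachment ambiguity above can be reproduced with a toy context-free grammar in NLTK; the grammar fragment is an illustrative assumption, not the slides’ full rule set.

import nltk

grammar = nltk.CFG.fromstring("""
S  -> NP VP
NP -> DT N | NP PP | 'Australia'
VP -> TV NP | VP PP
PP -> P NP
DT -> 'the' | 'a'
N  -> 'woman' | 'friend'
TV -> 'called'
P  -> 'from'
""")

parser = nltk.ChartParser(grammar)
sentence = "the woman called a friend from Australia".split()
for tree in parser.parse(sentence):   # yields both attachment parses
    print(tree)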

Page 41: Survey of NLP

Phrase Structures

Other common sources of recursion
  Complex/non-canonical phrases: VP → AUX VP
    • By this time next month, I [will [have [been [married]]]] for 10 years.
  Complex/non-canonical phrases: NP → GerundVP
    • [Swimming] is fun.                             GerundVP → VBG
    • [Going to the beach] is a great way to relax.  GerundVP → VBG PP
    • [Visiting the cemetery] was very sad.          GerundVP → VBG NP
  Reiteration within rules
    NP → DT AdjP N          “the big dog”
    AdjP → Adj*             “big brown furry”
    AdjP → (Adv*) Adj*      “[awesomely [big]] [really [furry]]”

Page 42: Survey of NLP

Phrase Structures

How do we know phrase structure rules exist?
  Ability to parse novel grammatical sentences
    “They laboriously cavorted with intrepid neighbors.”
  Ability to intuit when a sentence is ungrammatical
    “Like almost eyes feel been have fully indigo.”

How many rules are there?
  Nobody knows! This has been an open problem since the 1950s.
  The statistical universals have been identified: existing phrase structure rules account for 97% of natural language constructions
  Psycholinguists focus on the remaining 3% via the grammaticality/acceptability interface

Page 43: Survey of NLP

Survey of NLP Module 1

Introduction Tokenization Sentence Breaking

Module 2 Part-of-Speech (POS) Tagging N-gram Analysis

Module 3 Phrase Structure Parsing Syntactic Parsing

Module 4 Semantic Analysis NLP & Ontologies

Page 44: Survey of NLP

Online Parsers

Phrase structure parsers
  Probabilistic LFG F-structure parsing
  Link Grammar
  ZZCad

Dependency parsers
  Stanford Parser
  Connexor

Example (Stanford): “The woman called a friend.”

  det(woman-2, the-1)
  nsubj(called-3, woman-2)
  root(root-0, called-3)
  det(friend-5, a-4)
  dobj(called-3, friend-5)
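A comparable dependency parse can be produced with spaCy (a different parser from the Stanford tool shown above). The en_core_web_sm model is assumed to be installed, and relation labels may differ slightly between parsers.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The woman called a friend.")
for tok in doc:
    # print relations in a Stanford-like relation(head-index, dependent-index) format
    print(f"{tok.dep_}({tok.head.text}-{tok.head.i + 1}, {tok.text}-{tok.i + 1})")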

Page 45: Survey of NLP

Long-distance Dependencies

Local
  Which instrument did you play?

    det(instrument-2, which-1)
    dobj(play-5, instrument-2)
    aux(play-5, did-3)
    nsubj(play-5, you-4)
    root(root-0, play-5)

Long-distance
  Which instrument did your college roommate try to annoy you by playing?

    det(instrument-2, which-1)
    dep(try-7, instrument-2)
    aux(try-7, did-3)
    poss(roommate-6, your-4)
    nn(roommate-6, college-5)
    nsubj(try-7, roommate-6)
    xsubj(annoy-9, roommate-6)
    root(root-0, try-7)
    aux(annoy-9, to-8)
    xcomp(try-7, annoy-9)
    dobj(annoy-9, you-10)
    prep(annoy-9, by-11)
    pobj(by-11, playing-12)

Page 46: Survey of NLP

End of Module 3

Questions?

Page 47: Survey of NLP

Survey of NLP Module 1

Introduction Tokenization Sentence Breaking

Module 2 Part-of-Speech (POS) Tagging N-gram Analysis

Module 3 Phrase Structure Parsing Syntactic Parsing

Module 4 Semantic Analysis NLP & Ontologies

Page 48: Survey of NLP

The Syntax-Semantics Interface

Can we automate the process of associating semantic representations with parsed natural language expressions?
Is the association even systematic?

Page 49: Survey of NLP

The Syntax-Semantics Interface

The meaning of an expression is a function of the meanings of its parts and the way the parts are combined syntactically.

  [The cat] chased the dog.
  [The cat] was chased by the dog.
  The dog chased [the cat].

The meaning of [the cat] is fairly stable, but its role in the sentence is determined by syntax.

This is the Principle of Compositionality, the primary tenet of the syntax-semantics interface.

Page 50: Survey of NLP

Compositionality

Semantic λ-calculus
  A notational extension of first-order logic
  The grammar is extended with semantic representations

  Proper names:       (PN; tom) → Tom;  (PN; mia) → Mia
  Intransitive verbs: (IV; λx.snores(x)) → snores
  Transitive verbs:   (TV; λy.λx.likes(x, y)) → likes
  Phrasal rules:
    Sentence:        (S; β(α)) → (NP; α) (VP; β)
    Noun phrase:     (NP; α) → (PN; α)
    Intransitive VP: (VP; α) → (IV; α)
    Transitive VP:   (VP; α(β)) → (TV; α) (NP; β)
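A hypothetical rendering of these composition rules with Python lambdas, building the semantics of “Tom likes Mia” bottom-up (the strings stand in for first-order formulas).

tom, mia = "tom", "mia"                           # proper names denote individuals
snores = lambda x: f"snores({x})"                 # IV: λx.snores(x)
likes = lambda y: lambda x: f"likes({x},{y})"     # TV: λy.λx.likes(x,y)

vp = likes(mia)      # Transitive VP rule: apply the TV semantics to the object NP
s = vp(tom)          # Sentence rule: apply the VP semantics to the subject NP
print(s)             # likes(tom,mia)
print(snores(mia))   # "Mia snores" -> snores(mia)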

Page 51: Survey of NLP

Compositionality

Page 52: Survey of NLP

Event Structure

The problem of determining the number of arguments for a given verb is complicated by the addition of non-essential expressions:

  I ate.
  I ate a sandwich.
  I ate a sandwich in my car.
  I ate in my car.
  I ate a sandwich for lunch.
  I ate a sandwich for lunch yesterday.
  I ate a sandwich around noon.

Linguistic approach: [in my car], [for lunch], [yesterday], and [around noon] are not required arguments of the verb; rather, they are modifiers.

Page 53: Survey of NLP

Event Structure

For that approach to work, we must assert that there are mutually exclusive sets of events and states.

  State: a fact that is true of a single point in time
    Larry died. / *Larry died for two hours.
  Event: a state change
    Activities: have no particular endpoint
      • Larry ran in the park.
    Accomplishments: have a natural endpoint
      • Larry ran to the park.
    Achievements: true of a single point in time but yield a result state
      • Larry found his car.
      • The tire popped.

Page 54: Survey of NLP

Event Structure

Event/state distinctions remove the need to know the number of arguments of a verb. Instead, participants are categorized by thematic role:

  Thematic Role   Definition                                            Example
  AGENT           Volitional causer of an event                         The waiter spilled the soup.
  EXPERIENCER     Experiencer of an event                               John has a headache.
  FORCE           Non-volitional causer of an event                     The wind blew debris into the yard.
  THEME           Participant most directly affected by an event        The skaters broke the ice.
  RESULT          The end product of an event                           They built a golf course…
  CONTENT         The proposition or content of a propositional event   He asked, “Have you graduated yet?”
  INSTRUMENT      An instrument used in an event                        He hit the nail with a hammer.
  BENEFICIARY     The beneficiary of an event                           She booked the flight for her boss.
  SOURCE          Origin of an object of a transfer event               I just arrived from Paris.
  GOAL            Destination of an object of a transfer event          I sailed to Cape Cod.

Page 55: Survey of NLP

Computational Lexical Semantics

Hypernyms/Hyponyms (example hierarchy)

  primate
    simian
      ape
        orangutan
        gorilla
          silverback
        chimpanzee
      monkey
        baboon
        macaque
        vervet
    hominid
      homo erectus
      homo sapiens
        Cro-magnon
        homo sapiens sapiens

Page 56: Survey of NLP

Computational Lexical Semantics

If X is a hyponym of Y, then every X is a Y, but not every Y is an X.
  Example: daffodil is a hyponym of flower.
  Every daffodil is a flower, but not every flower is a daffodil.

If X is a hypernym of Y, then every Y is an X, but not every X is a Y.
  Example: jet is a hypernym of Boeing 737.
  Not every jet is a Boeing 737, but every Boeing 737 is a jet.

Entailment
  X entails Y if, whenever X is true, Y is also true.
  Downward-entailing verbs: hate, dislike, fear, …
  Upward-entailing verbs: see, have, buy, …

Page 57: Survey of NLP

Computational Lexical Semantics

WordNet
  A hierarchical lexical database of open-class synonyms, antonyms, hypernyms/hyponyms, and meronyms/holonyms
    • 115,000+ entries
  Each entry belongs to a synset, a set of sense-based synonyms

  Example: “bank”
    • {08437235} <noun.group> depository financial institution, banking company (a financial institution that accepts deposits) “He cashed a check at the bank.”
    • {02790795} <noun.artifact> bank building (a building in which the business of banking is transacted) “The bank is on the corner of Main and Elm.”
    • {02315835} <verb.possession> deposit (put into a bank account) “She banked the check.”
    • {00714537} <verb.cognition> count, bet, depend, swear, rely, reckon (have faith or confidence in) “He’s banking on that promotion.”
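The “bank” synsets can be inspected programmatically through NLTK’s WordNet interface (the WordNet data package is assumed to be installed; synset numbering differs across WordNet versions).

from nltk.corpus import wordnet as wn

for synset in wn.synsets("bank")[:4]:
    print(synset.name(), "-", synset.definition())

# Hypernyms of the "financial institution" sense, e.g. financial_institution.n.01
print(wn.synset("bank.n.02").hypernyms())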

Page 58: Survey of NLP

Computational Lexical Semantics

Word-sense similarity technology is applied to:
  Intelligent web search
  Question answering
  Plagiarism detection

Word-sense disambiguation (WSD)
  Supervised approach (sketched below)
    Input: hand-annotated corpora (time-intensive and unreliable)
    1. Start with sense-annotated training data
    2. Extract features describing the contexts of the target word
    3. Train a classifier with some machine-learning algorithm
    4. Apply the classifier to unlabeled data
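A minimal sketch of the four supervised steps using scikit-learn; the tiny sense-annotated examples and the bag-of-words features are illustrative assumptions.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# 1. Sense-annotated training data for the target word "bank"
contexts = [
    "he cashed a check at the bank",
    "she deposited her paycheck at the bank",
    "they walked along the river bank",
    "the bank of the stream was muddy",
]
senses = ["FINANCE", "FINANCE", "RIVER", "RIVER"]

# 2. Extract features describing the contexts of the target word
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(contexts)

# 3. Train a classifier
classifier = LogisticRegression().fit(X, senses)

# 4. Apply the classifier to unlabeled data
test = vectorizer.transform(["the fisherman sat on the bank of the river"])
print(classifier.predict(test))   # expected: ['RIVER']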

Page 59: Survey of NLP

Computational Lexical Semantics

Precision and Recall
  Machine-learning algorithms and training models are calibrated and scored with precision and recall metrics
  Precision: How specifically relevant are my results?
    • The number of correct answers retrieved relative to the total number of retrieved answers
  Recall: How generally relevant are my results?
    • The number of correct answers retrieved relative to the total number of correct answers
  F-score
    • The weighted harmonic mean of precision and recall; with equal weighting:

      F = 2 · (precision · recall) / (precision + recall)
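A worked toy example of the three metrics (the document sets are made up for illustration).

retrieved = {"d1", "d2", "d3", "d4"}          # answers the system returned
relevant = {"d1", "d2", "d5"}                 # gold-standard correct answers

true_positives = len(retrieved & relevant)    # 2
precision = true_positives / len(retrieved)   # 2/4 = 0.50
recall = true_positives / len(relevant)       # 2/3 ≈ 0.67
f_score = 2 * precision * recall / (precision + recall)
print(round(precision, 2), round(recall, 2), round(f_score, 2))   # 0.5 0.67 0.57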

Page 60: Survey of NLP

Survey of NLP Module 1

Introduction Tokenization Sentence Breaking

Module 2 Part-of-Speech (POS) Tagging N-gram Analysis

Module 3 Phrase Structure Parsing Syntactic Parsing

Module 4 Semantic Analysis NLP & Ontologies

Page 61: Survey of NLP

NLP & Ontologies

WordNet is a primitive ontology
  Hierarchical organization of concepts
    Noun
      • Act
      • Animal
      • Artifact
      • …
    Verb
      • Motion
      • Perception
      • Stative
      • …

Ontologies are a model-specific mechanism for knowledge representation

Page 62: Survey of NLP

NLP & Ontologies

The input to NLP (for the sake of argument) can be any disparate data.
The output of NLP is an index of extracted linguistic phenomena
  Sentences, words, verb semantics, argument structure, etc.

When aligned to an ontology model, the output of NLP is easily integrated with information extraction efforts
  Semantic concepts (entities, events) are mapped to classes
  Arcs (relations, attributes) are mapped to properties

Page 63: Survey of NLP

NLP & Ontologies

Domain specificity
  In most industry applications, a whole-world representative model is neither required nor useful
  Domain-specific ontologies exploit the set of target entities and properties
    e.g., biomedical ontologies, military ground-force ontologies, etc.

Page 64: Survey of NLP

End of Module 4

Questions?