Unambiguous + Unlimited = Unsupervised Using the Web for Natural Language Processing Problems Marti Hearst School of Information, UC Berkeley UCB Neyman Seminar October 25, 2006 This research supported in part by NSF DBI-0317510

Page 1: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Unambiguous + Unlimited = Unsupervised

Using the Web for Natural Language Processing Problems

Marti Hearst
School of Information, UC Berkeley

UCB Neyman Seminar
October 25, 2006

This research supported in part by NSF DBI-0317510

Page 2: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Marti Hearst, Neyman Seminar, 2006

Natural Language Processing

The ultimate goal: write programs that read and understand stories and conversations. This is too hard! Instead we tackle sub-problems.

There have been notable successes lately:
  Machine translation is vastly improved
  Speech recognition is decent in limited circumstances
  Text categorization works with some accuracy

Page 3: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Automatic Help Desk Translation at MS

Page 4: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

How can a machine understand these differences?

Get the cat with the gloves.

Page 5: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

How can a machine understand these differences?

Get the sock from the cat with the gloves.

Get the glove from the cat with the socks.

Page 6: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

How can a machine understand these differences?

Decorate the cake with the frosting.
Decorate the cake with the kids.
Throw out the cake with the frosting.
Throw out the cake with the kids.

Page 7: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Why is this difficult?

Same syntactic structure, different meanings.

Natural language processing algorithms have to deal with the specifics of individual words.

Enormous vocabulary sizes.
  The average English speaker’s vocabulary is around 50,000 words.
  Many of these can be combined with many others,
  and they mean different things when they do!

Page 8: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

How to tackle this problem?

The field was stuck for quite some time.
  Hand-enter all semantic concepts and relations

A new approach started around 1990:
  Get large text collections
  Compute statistics over the words in those collections

There are many different algorithms.

Page 9: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Size Matters

Recent realization: bigger is better than smarter!
  Banko and Brill ’01: “Scaling to Very, Very Large Corpora for Natural Language Disambiguation”, ACL

Page 10: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Example Problem

Grammar checker example: which word to use?
  <principal> <principle>

Solution: use well-edited text and look at which words surround each use:

I am in my third year as the principal of Anamosa High School.

School-principal transfers caused some upset.

This is a simple formulation of the quantum mechanical uncertainty principle.

Power without principle is barren, but principle without power is futile. (Tony Blair)

Page 11: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Using Very, Very Large Corpora

Keep track of which words are the neighbors of each spelling in well-edited text, e.g.:
  Principal: “high school”
  Principle: “rule”

At grammar-check time, choose the spelling best predicted by the surrounding words.

Surprising results:
  Log-linear improvement even to a billion words!
  Getting more data is better than fine-tuning algorithms!
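
A minimal sketch of the neighbor-counting idea, in Python. It is not Banko and Brill's actual system: the function names, the two-token window, and the sum-of-counts score are assumptions made for illustration, and the training data is just the four example sentences from the previous slide.

```python
import re
from collections import Counter, defaultdict

CONFUSABLES = ("principal", "principle")

def tokenize(text):
    """Lowercase and keep alphabetic runs only (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def train_neighbor_counts(sentences, targets=CONFUSABLES, window=2):
    """For each target spelling, count the words seen within `window` tokens of it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i, tok in enumerate(tokens):
            if tok in targets:
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        counts[tok][tokens[j]] += 1
    return counts

def choose_spelling(context_words, counts, targets=CONFUSABLES):
    """Pick the spelling whose training neighbors overlap most with the context.
    Ties go to the first target listed (an arbitrary choice)."""
    def score(target):
        return sum(counts[target][w.lower()] for w in context_words)
    return max(targets, key=score)

SENTENCES = [
    "I am in my third year as the principal of Anamosa High School.",
    "School-principal transfers caused some upset.",
    "This is a simple formulation of the quantum mechanical uncertainty principle.",
    "Power without principle is barren, but principle without power is futile.",
]
COUNTS = train_neighbor_counts(SENTENCES)
```

With only four sentences the counts are tiny; the point of the slide is that the same procedure keeps improving as the training text grows.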

Page 12: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

The Effects of LARGE Datasets

From Banko & Brill ‘01

Page 13: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

How to Extend this Idea?

This is an exciting result …
BUT relies on having huge amounts of text that has been appropriately annotated!

Page 14: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

How to Avoid Manual Labeling?

“Web as a baseline” (Lapata & Keller 04,05)

Main idea: apply web-determined counts to every problem imaginable.

Example: for t in {<principal>, <principle>}
  Compute f(w-1, t, w+1)
  The largest count wins
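
The "largest count wins" rule can be sketched in a few lines of Python. The `count_fn` parameter stands in for a search engine or n-gram lookup; it, the function name, and the toy counts below are hypothetical and are not taken from Lapata and Keller.

```python
def pick_by_trigram_count(left_word, right_word, candidates, count_fn):
    """Return the candidate t that maximizes f(w-1, t, w+1)."""
    return max(candidates, key=lambda t: count_fn(f"{left_word} {t} {right_word}"))

# Toy stand-in for web counts: invented numbers, for illustration only.
TOY_COUNTS = {
    "school principal said": 120,
    "school principle said": 3,
}

def toy_count(phrase):
    """Look up a phrase count, treating unseen phrases as zero."""
    return TOY_COUNTS.get(phrase, 0)

best = pick_by_trigram_count("school", "said", ["principal", "principle"], toy_count)
```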

Page 15: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Web as a Baseline

Works very well in some cases:
  machine translation candidate selection
  article generation
  noun compound interpretation
  noun compound bracketing
  adjective ordering

But lacking in others:
  spelling correction
  countability detection
  prepositional phrase attachment

How to push this idea further?

Significantly better than the best supervised algorithm.

Not significantly different from the best supervised.

Page 16: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Using Unambiguous Cases

The trick: look for unambiguous cases to start

Use these to improve the results beyond what co-occurrence statistics indicate.

An Early Example: Hindle and Rooth, “Structural Ambiguity and Lexical Relations”, ACL ’90, Comp Ling ’93
  Problem: Prepositional Phrase attachment
    I eat/v spaghetti/n1 with/p a fork/n2.
    I eat/v spaghetti/n1 with/p sauce/n2.

Question: does n2 attach to v or to n1?

Page 17: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Using Unambiguous Cases

How to do this with unlabeled data?

First try:
  Parse some text into phrase structure
  Then compute certain co-occurrences: f(v, n1, p), f(n1, p), f(v, n1)
  Problem: results not accurate enough

The trick: look for unambiguous cases:
  Spaghetti with sauce is delicious. (pre-verbal)
  I eat with a fork. (no direct object)

Use these to improve the results beyond what co-occurrence statistics indicate.

Page 18: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Unambiguous + Unlimited = Unsupervised

Apply the Unambiguous Case idea to the Very, Very Large Corpora idea
  The potential of these approaches is not fully realized

Our work (with Preslav Nakov):
  Structural Ambiguity Decisions
    PP-attachment
    Noun compound bracketing
    Coordination grouping
  Semantic Relation Acquisition
    Hypernym (ISA) relations
    Verbal relations between nouns
    SAT Analogy problems

Page 19: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Applying U + U = U to Structural Ambiguity

We introduce the use of (nearly) unambiguous features:
  Surface features
  Paraphrases

Combined with n-grams
Use from very, very large corpora
Achieve state-of-the-art results without labeled examples.

Page 20: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Noun Compound Bracketing

(a) [ [ liver cell ] antibody ] (left bracketing)
(b) [ liver [ cell line ] ] (right bracketing)

In (a), the antibody targets the liver cell.
In (b), the cell line is derived from the liver.

Page 21: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Dependency Model

right bracketing: [w1 [w2 w3]]
  w2 w3 is a compound (modified by w1)
    home health care
  w1 and w2 independently modify w3
    adult male rat

left bracketing: [[w1 w2] w3]
  only 1 modificational choice possible
    law enforcement officer

[Diagrams: modification arcs over w1 w2 w3 for each bracketing]

Page 22: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Our U + U + U Algorithm

Compute bigram estimates
Compute estimates from surface features
Compute estimates from paraphrases
Combine these scores with a voting algorithm to choose left or right bracketing.

We use the same general approach for two other structural ambiguity problems.
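
The final voting step might look like the following Python sketch. The slides do not spell out the voting rule, so simple majority voting, the `None` convention for "no evidence", and the tie-breaking default are all assumptions.

```python
from collections import Counter

def majority_vote(votes, default="left"):
    """Combine per-feature predictions ('left', 'right', or None for no evidence).

    Features that abstain (None) are ignored. On a tie, fall back to `default`;
    which side to prefer on a tie is an assumption, not something the slides state.
    """
    tally = Counter(v for v in votes if v in ("left", "right"))
    if tally["left"] == tally["right"]:
        return default
    return "left" if tally["left"] > tally["right"] else "right"

# Hypothetical votes from bigram statistics, surface features, and paraphrases.
decision = majority_vote(["left", None, "right", "left"])
```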

Page 23: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Computing Bigram Statistics

Dependency Model, Frequencies
  Compare #(w1,w2) to #(w1,w3)

Dependency Model, Probabilities
  Pr(left) = Pr(w1w2|w2) Pr(w2w3|w3)
  Pr(right) = Pr(w1w3|w3) Pr(w2w3|w3)
  So we compare Pr(w1w2|w2) to Pr(w1w3|w3)

[Diagram: w1 w2 w3 with arcs labeled left and right]

Page 24: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Using ngrams to estimate probabilities

Using page hits as a proxy for n-gram counts
  Pr(w1w2|w2) = #(w1,w2) / #(w2)
    #(w2): word frequency; query for “w2”
    #(w1,w2): bigram frequency; query for “w1 w2”
    smoothed by 0.5

Use χ² to determine if w1 is associated with w2 (thus indicating left bracketing), and same for w1 with w3
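
A Python sketch of the two association scores on this slide, assuming the counts have already been fetched. How the 0.5 smoothing is applied (added to both counts here), the standard 2x2 chi-squared formula, and the `n_total` estimate of the corpus size are assumptions; the slides give none of these details.

```python
def cond_prob(bigram_count, word_count, smooth=0.5):
    """Estimate Pr(w1 w2 | w2) = #(w1,w2) / #(w2), adding `smooth` to both counts."""
    return (bigram_count + smooth) / (word_count + smooth)

def chi_squared(n_ij, n_i, n_j, n_total):
    """Chi-squared association for a 2x2 table built from bigram and word counts.

    a = both words together, b = wi without wj, c = wj without wi, d = neither.
    """
    a = n_ij
    b = n_i - n_ij
    c = n_j - n_ij
    d = n_total - a - b - c
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0
    return n_total * (a * d - b * c) ** 2 / denominator

def bracket_by_dependency(score_w1_w2, score_w1_w3):
    """Dependency model: a stronger w1-w2 association suggests left bracketing.
    Ties fall to 'right' here, which is an arbitrary choice."""
    return "left" if score_w1_w2 > score_w1_w3 else "right"
```

Either score can be passed to `bracket_by_dependency`, since the comparison only needs the w1-w2 association and the w1-w3 association on the same scale.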

Page 25: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Our U + U + U Algorithm

Compute bigram estimates
Compute estimates from surface features
Compute estimates from paraphrases
Combine these scores with a voting algorithm to choose left or right bracketing.

Page 26: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Web-derived Surface Features

Authors often disambiguate noun compounds using surface markers, e.g.:
  amino-acid sequence -> left
  brain stem’s cell -> left
  brain’s stem cell -> right

The enormous size of the Web makes these frequent enough to be useful.

Page 27: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Web-derived Surface Features: Dash (hyphen)

Left dash
  cell-cycle analysis -> left

Right dash
  donor T-cell -> right

Double dash
  T-cell-depletion -> unusable…
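
As an illustration, the dash feature can be checked against a retrieved snippet with two regular expressions. This Python sketch is an assumption about how such a check could be written, not the authors' implementation; the function name and the choice to abstain when both or neither pattern matches are hypothetical.

```python
import re

def dash_vote(w1, w2, w3, text):
    """Vote on the bracketing of 'w1 w2 w3' from hyphenation seen in `text`.

    'w1-w2 w3' suggests left, 'w1 w2-w3' suggests right. A double dash
    ('w1-w2-w3') matches neither pattern, so it yields no vote (None).
    """
    lowered = text.lower()
    a, b, c = (re.escape(w.lower()) for w in (w1, w2, w3))
    left = re.search(rf"\b{a}-{b} {c}\b", lowered)
    right = re.search(rf"\b{a} {b}-{c}\b", lowered)
    if left and not right:
        return "left"
    if right and not left:
        return "right"
    return None
```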

Page 28: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Web-derived Surface Features: Possessive Marker

Attached to the first word
  brain’s stem cell -> right

Attached to the second word
  brain stem’s cell -> left

Combined features
  brain’s stem-cell -> right

Page 29: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Web-derived Surface Features: Capitalization

anycase – lowercase – uppercase
  Plasmodium vivax Malaria -> left
  plasmodium vivax Malaria -> left

lowercase – uppercase – anycase
  brain Stem cell -> right
  brain Stem Cell -> right

Disable this on:
  Roman digits
  Single-letter words: e.g. vitamin D deficiency

Page 30: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Web-derived Surface Features: Embedded Slash

Left embedded slash
  leukemia/lymphoma cell -> right

Page 31: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Web-derived Surface Features: Parentheses

Single-word
  growth factor (beta) -> left
  (brain) stem cell -> right

Two-word
  (growth factor) beta -> left
  brain (stem cell) -> right

Page 32: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Web-derived Surface Features: Comma, dot, semi-colon

Following the first word
  home. health care -> right
  adult, male rat -> right

Following the second word
  health care, provider -> left
  lung cancer: patients -> left

Page 33: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Web-derived Surface Features: Dash to External Word

External word to the left
  mouse-brain stem cell -> right

External word to the right
  tumor necrosis factor-alpha -> left

Page 34: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Other Web-derived Features: Abbreviation

After the second word
  tumor necrosis (TN) factor -> left

After the third word
  tumor necrosis factor (NF) -> right

We query for, e.g., “tumor necrosis tn factor”

Problems:
  Roman digits: IV, VI
  States: CA
  Short words: me

Page 35: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Other Web-derived Features: Concatenation

Consider health care reform
  healthcare: 79,500,000
  carereform: 269
  healthreform: 812

Adjacency model
  healthcare vs. carereform

Dependency model
  healthcare vs. healthreform

Triples
  “healthcare reform” vs. “health carereform”
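
The adjacency and dependency comparisons can be worked through with the three counts shown on this slide. The Python sketch below is illustrative: the function name, the dictionary of counts, and the rule of abstaining on ties are assumptions.

```python
def concatenation_votes(counts, w1, w2, w3):
    """Vote on the bracketing of 'w1 w2 w3' from counts of concatenated word pairs."""
    c12 = counts.get(w1 + w2, 0)   # e.g. "healthcare"
    c23 = counts.get(w2 + w3, 0)   # e.g. "carereform"
    c13 = counts.get(w1 + w3, 0)   # e.g. "healthreform"

    def vote(left_count, right_count):
        if left_count == right_count:
            return None            # no evidence either way
        return "left" if left_count > right_count else "right"

    return {
        "adjacency": vote(c12, c23),    # w1w2 vs. w2w3
        "dependency": vote(c12, c13),   # w1w2 vs. w1w3
    }

# Page-hit counts as given on the slide.
SLIDE_COUNTS = {"healthcare": 79_500_000, "carereform": 269, "healthreform": 812}
votes = concatenation_votes(SLIDE_COUNTS, "health", "care", "reform")
```

Both models vote left for this example, matching the reading [[health care] reform].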

Page 36: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Other Web-derived Features: Reorder

Reorders for “health care reform”
  “care reform health” -> right
  “reform health care” -> left

Page 37: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Other Web-derived Features: Internal Inflection Variability

Vary inflection of second word
  tyrosine kinase activation
  tyrosine kinases activation

Page 38: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Other Web-derived Features: Switch The First Two Words

Predict right, if we can reorder adult male rat as male adult rat

Page 39: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Our U + U + U Algorithm

Compute bigram estimates
Compute estimates from surface features
Compute estimates from paraphrases
Combine these scores with a voting algorithm to choose left or right bracketing.

Page 40: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Paraphrases

The semantics of a noun compound is often made overt by a paraphrase (Warren, 1978)

Prepositional
  stem cells in the brain -> right
  cells from the brain stem -> left

Verbal
  virus causing human immunodeficiency -> left

Copula
  office building that is a skyscraper -> right

Page 41: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Paraphrases

prepositional paraphrases:
  We use: ~150 prepositions

verbal paraphrases:
  We use: associated with, caused by, contained in, derived from, focusing on, found in, involved in, located at/in, made of, performed by, preventing, related to and used by/in/for.

copula paraphrases:
  We use: is/was and that/which/who

optional elements:
  articles: a, an, the
  quantifiers: some, every, etc.
  pronouns: this, these, etc.

Page 42: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Our U + U + U Algorithm

Compute bigram estimates
Compute estimates from surface features
Compute estimates from paraphrases
Combine these scores with a voting algorithm to choose left or right bracketing.

Page 43: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Evaluation: Datasets

Lauer Set
  244 noun compounds (NCs)
  from Grolier’s encyclopedia
  inter-annotator agreement: 81.5%

Biomedical Set
  430 NCs
  from MEDLINE
  inter-annotator agreement: 88% (κ = .606)

Page 44: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Co-occurrence Statistics

Lauer set

Bio set

Page 45: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Paraphrase and Surface Features Performance

Lauer Set

Biomedical Set

Page 46: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Individual Surface Features Performance: Bio

Page 47: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Individual Surface Features Performance: Bio

Page 48: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Results Lauer

Page 49: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Results: Comparing with Others

Page 50: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Results Bio

Page 51: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Results for Noun Compound Bracketing

Introduced search engine statistics that go beyond the n-gram (applicable to other tasks)
  surface features
  paraphrases

Obtained new state-of-the-art results on NC bracketing
  more robust than Lauer (1995)
  more accurate than Keller & Lapata (2004)

Page 52: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Prepositional Phrase Attachment

Problem:
  (a) Peter spent millions of dollars. (noun attach)
  (b) Peter spent time with his family. (verb attach)

Which attachment for quadruple: (v, n1, p, n2)

Results:
  Much simpler than other algorithms
  As good as or better than best unsupervised, and better than some supervised approaches

Page 53: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Noun Phrase Coordination

(Modified) real sentence:

The Department of Chronic Diseases and Health Promotion leads and strengthens global efforts to prevent and control chronic diseases or disabilities and to promote health and quality of life.

Page 54: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

NC coordination: ellipsis

Ellipsis
  car and truck production
  means car production and truck production

No ellipsis
  president and chief executive

All-way coordination
  Securities and Exchange Commission

Page 55: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Results

428 examples from Penn TB

Page 56: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Semantic Relation Detection

Goal: automatically augment a lexical database

Many potential relation types:
  ISA (hypernymy/hyponymy)
  Part-Of (meronymy)

Idea: find unambiguous contexts which (nearly) always indicate the relation of interest

Page 57: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Lexico-Syntactic Patterns

Page 58: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Lexico-Syntactic Patterns

Page 59: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Adding a New Relation

Page 60: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Semantic Relation Detection

Lexico-syntactic Patterns:
  Should occur frequently in text
  Should (nearly) always suggest the relation of interest
  Should be recognizable with little pre-encoded knowledge.

These patterns have been used extensively by other researchers.

Page 61: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Semantic Relation Detection

What relationship holds between two nouns?
  olive oil – oil comes from olives
  machine oil – oil used on machines

Assigning the meaning relations between these terms has been seen as a very difficult problem.

Our solution:
  Use clever queries against the web to figure out the relations.

Page 62: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Queries for Semantic Relations

Convert the noun-noun compound into a query of the form:
  noun2 that * noun1
  “oil that * olive(s)”

This returns search result snippets containing interesting verbs. In this case:
  Come from
  Be obtained from
  Be extracted from
  Made from
  …
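
A small Python sketch of the query construction and of pulling the linking words out of a returned snippet. The function names, the naive "+s" pluralization, and the three-word cap on the wildcard span are assumptions for illustration; the slides do not describe how snippets are parsed.

```python
import re

def relation_queries(noun1, noun2):
    """Build 'noun2 that * noun1' queries for the singular and a naive plural of noun1."""
    return [f'"{noun2} that * {form}"' for form in (noun1, noun1 + "s")]

def extract_linking_text(snippet, noun1, noun2, max_words=3):
    """Return the words between 'noun2 that' and noun1 in a snippet, or None."""
    pattern = (
        rf"\b{re.escape(noun2)} that "
        rf"((?:\w+ ){{1,{max_words}}})"
        rf"{re.escape(noun1)}s?\b"
    )
    match = re.search(pattern, snippet.lower())
    return match.group(1).strip() if match else None

queries = relation_queries("olive", "oil")
link = extract_linking_text("Olive oil is an oil that comes from olives.", "olive", "oil")
```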

Page 63: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Uncovering Semantic Relations

More examples:
  Migraine drug -> treat, be used for, reduce, prevent
  Wrinkle drug -> treat, be used for, reduce, smooth
  Printer tray -> hold, come with, be folded, fit under, be inserted into
  Student protest -> be led by, be sponsored by, pit, be, be organized by

Page 64: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Application: SAT Analogy Problems

Page 65: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Tackling the SAT Analogy Problem

First issue queries to find the relations (features) that hold between each word pair

Compare the features for each answer pair to those of the question pair.
  Weight the features with term count and document counts
  Compare the weighted feature sets using Dice coefficient
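
One common way to extend the Dice coefficient to weighted feature sets is sketched below in Python. The slides do not say which weighted variant was used, so the min-based formula, the function names, and the toy feature weights are assumptions.

```python
def weighted_dice(features_a, features_b):
    """Dice coefficient over weighted features: 2 * sum(min) / (sum(a) + sum(b))."""
    shared = sum(
        min(features_a[f], features_b[f])
        for f in features_a.keys() & features_b.keys()
    )
    total = sum(features_a.values()) + sum(features_b.values())
    return 2 * shared / total if total else 0.0

def best_answer(question_features, candidate_features):
    """Return the name of the answer pair most similar to the question pair."""
    return max(
        candidate_features,
        key=lambda name: weighted_dice(question_features, candidate_features[name]),
    )

# Invented weights, for illustration only.
question = {"include": 3, "of": 1}
candidates = {
    "A": {"include": 1, "of": 1, "and": 2},
    "B": {"cause": 4},
}
choice = best_answer(question, candidates)
```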

Page 66: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Queries for SAT Analogy Problem

Page 67: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Extract Features from Retrieved Text

Verb
  The committee includes many members.
  This is a committee, which includes many members.
  This is a committee, including many members.

Verb+Preposition
  The committee consists of many members.

Preposition
  He is a member of the committee.

Coordinating Conjunction
  the committee and its members

Page 68: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Most Frequent Features for “committee member”

Page 69: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

SAT Results: Nouns Only

Page 70: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Conclusions

The enormous size of the web opens new opportunities for text analysis
  There are many words, but they are more likely to appear together in a huge dataset
  This allows us to do word-specific analysis

To counter the labeled-data roadblock, we start with unambiguous features that we can find naturally.
  We’ve applied this to structural and semantic language problems.
  These are stepping stones towards sophisticated language understanding.

Page 71: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Thank you!

http://biotext.berkeley.edu
Supported in part by NSF DBI-0317510

Page 72: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Using n-grams to make predictions

Say trying to distinguish:
  [home health] care
  home [health care]

Main idea: compare these co-occurrence probabilities
  “home health” vs “health care”

Page 73: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Using n-grams to make predictions

Use search engine page hits as a proxy for n-gram counts
  compare Pr(w1w2|w2) to Pr(w1w3|w3)
  Pr(w1w2|w2) = #(w1,w2) / #(w2)
    #(w2): word frequency; query for “w2”
    #(w1,w2): bigram frequency; query for “w1 w2”

Page 74: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Probabilities: Why? (1)

Why should we use:
  (a) Pr(w1w2|w2), rather than
  (b) Pr(w2w1|w1)?

Keller & Lapata (2004) calculate:
  AltaVista queries:
    (a): 70.49%
    (b): 68.85%
  British National Corpus:
    (a): 63.11%
    (b): 65.57%

Page 75: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Probabilities: Why? (2)

Why should we use:
  (a) Pr(w1w2|w2), rather than
  (b) Pr(w2w1|w1)?

Maybe to introduce a bracketing prior.
  Just like Lauer (1995) did.

But otherwise, no reason to prefer either one.
  Do we need probabilities? (association is OK)
  Do we need a directed model? (symmetry is OK)

Page 76: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Adjacency & Dependency (2)

right bracketing: [w1 [w2 w3]]
  w2 w3 is a compound (modified by w1)
  w1 and w2 independently modify w3

adjacency model
  Is w2 w3 a compound? (vs. w1 w2 being a compound)

dependency model
  Does w1 modify w3? (vs. w1 modifying w2)

[Diagrams: modification arcs over w1 w2 w3 for each case]

Page 77: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Paraphrases: pattern (1)

(1) v n1 p n2 -> v n2 n1 (noun)

Can we turn “n1 p n2” into a noun compound “n2 n1”?
  meet/v demands/n1 from/p customers/n2 ->
  meet/v the customer/n2 demands/n1

Problem: ditransitive verbs like give
  gave/v an apple/n1 to/p him/n2 ->
  gave/v him/n2 an apple/n1

Solution:
  no determiner before n1
  determiner before n2 is required
  the preposition cannot be to

Page 78: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Paraphrases: pattern (2)

(2) v n1 p n2 -> v p n2 n1 (verb)

If “p n2” is an indirect object of v, then it could be switched with the direct object n1.
  had/v a program/n1 in/p place/n2 ->
  had/v in/p place/n2 a program/n1

Determiner before n1 is required to prevent “n2 n1” from forming a noun compound.

Page 79: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Paraphrases: pattern (3)

(3) v n1 p n2 -> p n2 * v n1 (verb)

“*” indicates a wildcard position (up to three intervening words are allowed)

Looks for appositions, where the PP has moved in front of the verb, e.g.
  I gave/v an apple/n1 to/p him/n2 ->
  to/p him/n2 I gave/v an apple/n1

Page 80: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Paraphrases: pattern (4)

(4) v n1 p n2 -> n1 p n2 v (noun)

Looks for appositions, where “n1 p n2” has moved in front of v
  shaken/v confidence/n1 in/p markets/n2 ->
  confidence/n1 in/p markets/n2 shaken/v

Page 81: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Paraphrases: pattern (5)

(5) v n1 p n2 -> v PRONOUN p n2 (verb)

n1 is a pronoun -> verb (Hindle & Rooth, 93)

Pattern (5) substitutes n1 with a dative pronoun (him or her), e.g.
  put/v a client/n1 at/p odds/n2 ->
  put/v him at/p odds/n2

Page 82: Unambiguous + Unlimited = Unsupervised Using the Web for  Natural Language Processing Problems

Paraphrases: pattern (6)

(6) v n1 p n2 -> BE n1 p n2 (noun)

BE is typically used with a noun attachment

Pattern (6) substitutes v with a form of to be (is or are), e.g.
  eat/v spaghetti/n1 with/p sauce/n2 ->
  is spaghetti/n1 with/p sauce/n2
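
Patterns (5) and (6) are simple enough to sketch as query generators in Python. The function names and the final comparison of hit counts are assumptions; the slides only describe the substitutions themselves.

```python
def pattern5_queries(v, p, n2, pronouns=("him", "her")):
    """Pattern (5): replace n1 with a pronoun; hits support verb attachment."""
    return [f"{v} {pronoun} {p} {n2}" for pronoun in pronouns]

def pattern6_queries(n1, p, n2, be_forms=("is", "are")):
    """Pattern (6): replace v with a form of 'to be'; hits support noun attachment."""
    return [f"{be} {n1} {p} {n2}" for be in be_forms]

def attachment_vote(verb_hits, noun_hits):
    """Compare total hits for verb-supporting and noun-supporting paraphrases.
    Returns None on a tie (an assumed convention)."""
    if verb_hits == noun_hits:
        return None
    return "verb" if verb_hits > noun_hits else "noun"

verb_queries = pattern5_queries("put", "at", "odds")
noun_queries = pattern6_queries("spaghetti", "with", "sauce")
```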