Three kinds of web data that can help computers make better sense of human language Shane Bergsma Johns Hopkins University Fall, 2011


Page 1: Three kinds of web data that can help computers make better sense of human language

Three kinds of web data that can help computers make better sense of human language

Shane Bergsma
Johns Hopkins University

Fall, 2011

Page 2: Three kinds of web data that can help computers make better sense of human language


Computers that understand language

“William Wilkinson’s ‘An Account of the Principalities of Wallachia and Moldavia’

inspired this author’s most famous novel.”

Page 3: Three kinds of web data that can help computers make better sense of human language

Research Vision

Robust processing of human language requires knowledge beyond what’s in small manually-annotated data sets

Derive meaning from real-world data:
1) Raw text on the web
2) Bilingual text (words plus their translations)
Part 1: Parsing noun phrases
3) Visual data (labelled online images)
Part 2: Learning the meaning of words

Page 4: Three kinds of web data that can help computers make better sense of human language

Part 1: Parsing Noun Phrases (NPs)

Google: What pages/ads should be returned for the query “washed baby carrots”?

[washed baby] carrots vs. washed [baby carrots]
• [washed baby] carrots: carrots for washed babies
• washed [baby carrots]: baby carrots that are washed

Page 5: Three kinds of web data that can help computers make better sense of human language

Training a parser via machine learning

Input: washed baby carrots
PARSER (with weights, w0) output: [washed baby] carrots
TESTER: INCORRECT in training data

Page 6: Three kinds of web data that can help computers make better sense of human language

Training a parser via machine learning

Input: washed baby carrots
PARSER (with weights, w1) output: washed [baby carrots]
TESTER: CORRECT in gold standard

Training corpus:
• retired [science teacher]
• [social science] teacher
• female [bus driver]
• [school bus] driver
• zebra [hair straightener]
• alleged [Canadian lover]

Page 7: Three kinds of web data that can help computers make better sense of human language

[Banko & Brill, 2001]

Grammar Correction Task

More data is better data (learning curve)

Page 8: Three kinds of web data that can help computers make better sense of human language

Testing a parser on new data

Input: washed baby smell
- Having seen washed [baby carrots] in training…
PARSER (with final weights, wN) output: washed [baby smell]
TESTER: INCORRECT

Big Challenge: For parsing NPs, every word matters
- both parses are grammatical
- we can’t generalize from “washed baby carrots” in training to “washed baby smell” at test time

Solution: New sources of data

Page 9: Three kinds of web data that can help computers make better sense of human language

English Data for Parsing

• Human Annotated: 1 MILLION words
  Penn (Parse-)Treebank [Marcus et al., 1993]
• Bitexts: 1 BILLION words
  Canadian Hansards, etc. [Callison-Burch et al., 2010]
• Web text [N-grams]: 1 TRILLION words
  Google N-gram Data [Brants & Franz, 2006]

Page 10: Three kinds of web data that can help computers make better sense of human language

Task: Parsing NPs with conjunctions
1) [dairy and meat] production
2) [sustainability] and [meat production]

yes: [dairy production] in (1)
no: [sustainability production] in (2)

• Our contributions: new semantic features from raw web text and a new approach to using bilingual data as soft supervision

[Bergsma, Yarowsky & Church, ACL 2011]

Page 11: Three kinds of web data that can help computers make better sense of human language

One Noun Phrase or Two: A Machine Learning Approach

Input: “dairy and meat production” → features: x

x = (…, first-noun=dairy, … second-noun=meat, … first+second-noun=dairy+meat, …)

h(x) = w ∙ x (predict one NP if h(x) > 0)

• Set w via training on annotated training data using some machine learning algorithm
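The decision rule h(x) = w ∙ x can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the feature templates are only the three named on the slide, and the weight values in the usage note are invented for the example rather than learned.

```python
from typing import Dict, List

Features = Dict[str, float]


def extract_features(tokens: List[str]) -> Features:
    """Indicator features for a phrase of the form 'N1 and N2 HEAD'.

    Only the three templates named on the slide are produced here.
    """
    first, _conj, second, _head = tokens
    return {
        f"first-noun={first}": 1.0,
        f"second-noun={second}": 1.0,
        f"first+second-noun={first}+{second}": 1.0,
    }


def h(w: Features, x: Features) -> float:
    """Sparse dot product w . x; features missing from w contribute 0."""
    return sum(w.get(name, 0.0) * value for name, value in x.items())


def predict_one_np(w: Features, x: Features) -> bool:
    """Predict a single NP when h(x) > 0, as on the slide."""
    return h(w, x) > 0
```

For example, with the invented weights `w = {"first-noun=dairy": 0.5, "first-noun=sustainability": -0.7}`, the phrase "dairy and meat production" gets a positive score (one NP) and "sustainability and meat production" a negative one (two NPs). In practice w would be set by a learning algorithm on annotated data.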

Page 12: Three kinds of web data that can help computers make better sense of human language

Leveraging Web-Derived Knowledge

[dairy and meat] production
• If there is only one NP, then it is implicitly talking about “dairy production”
• Do we see this phrase occurring a lot on the web? [Yes]

sustainability and [meat production]
• If there is only one NP, then it is implicitly talking about “sustainability production”
• Do we see this phrase occurring a lot on the web? [No]

• Classifier has features for these counts

Page 13: Three kinds of web data that can help computers make better sense of human language

Search Engine Page Counts for NLP

• Early web work: Use an Internet search engine to get web counts [Keller & Lapata, 2003]

“dairy production”: 714,000 pages
“sustainability production”: 11,000 pages

Problem: Using a search engine is just too inefficient to get data on a large scale

Page 14: Three kinds of web data that can help computers make better sense of human language

Google N-gram Data for NLP

• Google N-gram Data [Brants & Franz, 2006]
– N words in sequence + their count on web:
  …
  dairy producers 22724
  dairy production 17704
  dairy professionals 204
  dairy profits 82
  dairy propaganda 15
  dairy protein 1268
  …
– A compressed version of all the text on web
– Enables new features/statistics for a range of tasks

[Bergsma et al. ACL 2008, IJCAI 2009, ACL 2010, etc.]
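A count feature for the implicit phrase ("dairy production") could be looked up as in the sketch below. The table holds only the six entries listed on the slide, and the `log(count + 1)` scaling and the function names are assumptions made for illustration, not details taken from the talk.

```python
import math
from typing import Dict

# The six bigram counts shown on the slide; the real data set is far larger.
NGRAM_COUNTS: Dict[str, int] = {
    "dairy producers": 22724,
    "dairy production": 17704,
    "dairy professionals": 204,
    "dairy profits": 82,
    "dairy propaganda": 15,
    "dairy protein": 1268,
}


def ngram_count(phrase: str, counts: Dict[str, int] = NGRAM_COUNTS) -> int:
    """Return the stored count, or 0 for an unseen phrase."""
    return counts.get(phrase, 0)


def implicit_phrase_feature(first: str, head: str,
                            counts: Dict[str, int] = NGRAM_COUNTS) -> float:
    """Log-scaled count of '<first> <head>' (assumed scaling: log(count + 1))."""
    return math.log(ngram_count(f"{first} {head}", counts) + 1)
```

Here `implicit_phrase_feature("dairy", "production")` is large while `implicit_phrase_feature("sustainability", "production")` is 0.0, because that phrase is absent from this toy table.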

Page 15: Three kinds of web data that can help computers make better sense of human language

Features for Explicit Paraphrases

Two examples, each of the form ❶ and ❷ ❸:
• dairy and meat production
• sustainability and meat production

Pattern: ❸ of ❶ and ❷
↑ Count(production of dairy and meat)
↓ Count(production of sustainability and meat)

Pattern: ❷ ❸ and ❶
↓ Count(meat production and dairy)
↑ Count(meat production and sustainability)

New paraphrases extending ideas in [Nakov & Hearst, 2005]
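The two paraphrase patterns can be instantiated mechanically from the three slots. The sketch below only builds the query strings whose counts would then be looked up; the function name and the dictionary keys are hypothetical.

```python
from typing import Dict


def paraphrase_queries(n1: str, n2: str, head: str) -> Dict[str, str]:
    """Fill the slide's two patterns for a phrase 'n1 and n2 head'.

    Slots: 1 = first noun, 2 = second noun, 3 = head noun.
    """
    return {
        # Pattern "3 of 1 and 2": a high count suggests one NP.
        "3 of 1 and 2": f"{head} of {n1} and {n2}",
        # Pattern "2 3 and 1": a high count suggests two NPs.
        "2 3 and 1": f"{n2} {head} and {n1}",
    }
```

For "dairy and meat production" this yields "production of dairy and meat" and "meat production and dairy", the two strings counted on the slide.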

Page 16: Three kinds of web data that can help computers make better sense of human language

Training Examples (Human-Annotated Data, small):
• conservation and good management
• motor and heating fuels
• freedom and security agenda

+ Google N-gram Data (Raw Data, HUGE)
→ Feature Vectors x1, x2, x3, x4
→ Machine Learning
→ Classifier: h(x)

Page 17: Three kinds of web data that can help computers make better sense of human language

Using Bilingual Data

• Bilingual data: a rich source of paraphrases
  dairy and meat production
  producción láctea y cárnica
• Build a classifier which uses bilingual features
– Applicable when we know the translation of the NP

Page 18: Three kinds of web data that can help computers make better sense of human language

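Bilingual “Paraphrase” Features

Two examples, each of the form ❶ and ❷ ❸:
• dairy and meat production
• sustainability and meat production

Pattern: ❸ ❶ … ❷ (Spanish)
• dairy and meat production: Count(producción láctea y cárnica)
• sustainability and meat production: unseen

Pattern: ❶ … ❸ ❷ (Italian)
• dairy and meat production: unseen
• sustainability and meat production: Count(sostenibilità e la produzione di carne)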

Page 19: Three kinds of web data that can help computers make better sense of human language

Bilingual “Paraphrase” Features

Two examples, each of the form ❶ and ❷ ❸:
• dairy and meat production
• sustainability and meat production

Pattern: ❶- … ❷❸ (Finnish)
• dairy and meat production: Count(maidon- ja lihantuotantoon)
• sustainability and meat production: unseen

Page 20: Three kinds of web data that can help computers make better sense of human language

Training Examples (Human-Annotated Data, small):
• conservation and good management
• motor and heating fuels
• freedom and security agenda

+ Translation Data (Bilingual Data, medium)
→ Feature Vectors x1, x2, x3, x4
→ Machine Learning
→ Classifier: h(xb)

Page 21: Three kinds of web data that can help computers make better sense of human language

[Diagram: co-training setup with two classifiers. h(xm) uses Training Examples + Features from Google Data; h(xb) uses Training Examples + Features from Translation Data. Both are shown with the same Bitext Examples.]

Bitext Examples:
• coal and steel money
• rocket and mortar attacks
• insurrection and regime change
• North and South Carolina
• business and computer science
• the Bosporus and Dardanelles straits
• the environment and air transport
• pollution and transport safety

In the diagram, “coal and steel money” and “rocket and mortar attacks” are set apart from the other six.

Page 22: Three kinds of web data that can help computers make better sense of human language

[Diagram: co-training step with classifiers h(xm) (+ Features from Google Data) and h(xb)1 (+ Features from Translation Data)]

Training Examples:
• coal and steel money
• rocket and mortar attacks

h(xb)1 is shown with the remaining Bitext Examples, in two groups:
• business and computer science; the Bosporus and Dardanelles straits; the environment and air transport
• insurrection and regime change; North and South Carolina; pollution and transport safety

Page 23: Three kinds of web data that can help computers make better sense of human language

[Diagram: co-training step with classifiers h(xm)1 (+ Features from Google Data) and h(xb)1 (+ Features from Translation Data)]

Training Examples:
• coal and steel money
• rocket and mortar attacks
• business and computer science
• the environment and air transport
• the Bosporus and Dardanelles straits

Remaining Bitext Examples:
• insurrection and regime change
• North and South Carolina
• pollution and transport safety

Co-Training: [Yarowsky’95], [Blum & Mitchell’98]
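A generic co-training loop over two feature views can be sketched as below. This is a simplified illustration, not the procedure from the talk: the toy linear learner, the confidence measure (absolute score), the shared labelled pool, and every feature name are assumptions chosen to keep the example short.

```python
from typing import Dict, List, Tuple

Features = Dict[str, float]
Example = Tuple[Features, Features]   # (monolingual view xm, bilingual view xb)
Labelled = List[Tuple[Example, int]]  # label +1 = one NP, -1 = two NPs


def train(view_data: List[Tuple[Features, int]]) -> Features:
    """Toy linear learner: each weight is the sum of label * feature value."""
    w: Features = {}
    for x, y in view_data:
        for name, value in x.items():
            w[name] = w.get(name, 0.0) + y * value
    return w


def score(w: Features, x: Features) -> float:
    return sum(w.get(name, 0.0) * value for name, value in x.items())


def train_views(labelled: Labelled) -> Tuple[Features, Features]:
    """Train one classifier per view on the same labelled examples."""
    w_m = train([(ex[0], y) for ex, y in labelled])
    w_b = train([(ex[1], y) for ex, y in labelled])
    return w_m, w_b


def co_train(labelled: Labelled, unlabelled: List[Example],
             rounds: int = 3, per_round: int = 1):
    """Each round, each view labels its most confident pool examples."""
    labelled = list(labelled)
    pool = list(unlabelled)
    for _ in range(rounds):
        if not pool:
            break
        weights = train_views(labelled)
        for view in (0, 1):
            w = weights[view]
            ranked = sorted(pool, key=lambda ex: abs(score(w, ex[view])),
                            reverse=True)
            for ex in ranked[:per_round]:
                s = score(w, ex[view])
                if s != 0.0:  # skip examples this view knows nothing about
                    labelled.append((ex, 1 if s > 0 else -1))
                    pool.remove(ex)
    w_m, w_b = train_views(labelled)
    return w_m, w_b, labelled
```

The point of the loop is that an example labelled confidently through one view becomes training data for the other view, so the bilingual classifier can pick up features it never saw in the seed set, and vice versa.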

Page 24: Three kinds of web data that can help computers make better sense of human language

[Figure: Error rate (%) of co-trained classifiers h(xm)i and h(xb)i]

Page 25: Three kinds of web data that can help computers make better sense of human language

[Figure: bar chart of Error rate (%) on Penn Treebank (PTB). Systems compared: Broad-coverage Parsers; Nakov & Hearst (2005); Pitler et al (2010); New Supervised Monoclassifier; Co-trained Monoclassifier h(xm)N. Bar annotations: 800 PTB training examples; 800 PTB training examples; 2 training examples; unsupervised.]

Page 26: Three kinds of web data that can help computers make better sense of human language

Part 1: Conclusion

• Knowledge from large-scale monolingual corpora is crucial for parsing noun phrases
– New paraphrase features
• New way to use bilingual data as soft supervision to guide the use of monolingual features
• Next steps: Use bilingual data even when we don’t know the translations to begin with
– infer translations jointly with syntax
– i.e., beyond bitexts (1B), make use of huge (1T+) N-gram corpora in English, Spanish, French, …

Page 27: Three kinds of web data that can help computers make better sense of human language

Part 2: Using visual data to learn the meaning of words

• Large volumes of visual data also reveal word meaning (semantics), but in a language-universal way

• Humans label their images as they post them online, providing the word-meaning link

• There are lots of images to work with

[from Facebook’s Twitter feed]

Page 28: Three kinds of web data that can help computers make better sense of human language


Part 2: Using visual data to learn the meaning of words

Progress in the area of “lexical semantics”

Task #1: learning translations of words into foreign languages using visual data, e.g.

“turtle” in English = “tortuga” in Spanish

Main contribution: a totally new approach to building bilingual dictionaries

[Bergsma and Van Durme, IJCAI 2011]

Page 29: Three kinds of web data that can help computers make better sense of human language

English Web Images    Spanish Web Images
turtle                tortuga
candle                vela
cockatoo              cacatúa

Page 30: Three kinds of web data that can help computers make better sense of human language

Task #1: Bilingual Lexicon Induction

• Why?
– Needed for automatic machine translation, cross-language information retrieval, etc.
– Poor coverage of human-compiled dictionaries/bitexts
• How to do it with monolingual data only?
– Link words to information that is preserved across languages (clues to common meaning)

Page 31: Three kinds of web data that can help computers make better sense of human language

Clues to Common Meaning: Spelling [Koehn & Knight 2002, many others]

natural-natural
higiénico-hygienic
radón-radon
vela-candle
*calle-candle
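One way to see why spelling is only a partial clue is to score the pairs with a string similarity. The sketch below uses normalized Levenshtein edit distance, which is an assumption made for illustration and not necessarily the measure used in the cited work.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def spelling_similarity(a: str, b: str) -> float:
    """1.0 for identical strings, falling toward 0.0 as edits accumulate."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest
```

Under this measure "calle" (street) scores higher against "candle" than the correct translation "vela" does, which is the failure case the starred pair on the slide points at.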

Page 32: Three kinds of web data that can help computers make better sense of human language

Clues to Common Meaning: Images

candle

calle

vela

Visual similarities:
• high contrast
• black background
• glowing flame

Page 33: Three kinds of web data that can help computers make better sense of human language

Link words by web-based visual similarity

Step 1: Retrieve online images via Google Image Search (in each lang.), 20 images for each word
– Google competitive with “hand-prepared datasets” [Fergus et al., 2005]

Page 34: Three kinds of web data that can help computers make better sense of human language


Step 2: Create Image Feature Vectors

Color histogram features
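A color histogram feature vector can be computed as in the sketch below. The binning scheme (4 bins per RGB channel, 64 bins in total, normalized to sum to 1) and the function name are assumptions for illustration; the talk does not specify the exact histogram used.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each channel in 0..255


def color_histogram(pixels: Sequence[Pixel],
                    bins_per_channel: int = 4) -> List[float]:
    """Normalized joint RGB histogram with bins_per_channel**3 entries."""
    n = bins_per_channel
    hist = [0.0] * (n ** 3)
    for r, g, b in pixels:
        # Map each 0..255 channel value to a bin index in 0..n-1.
        rb, gb, bb = r * n // 256, g * n // 256, b * n // 256
        hist[rb * n * n + gb * n + bb] += 1.0
    total = sum(hist)
    return [v / total for v in hist] if total else hist
```

An image that is mostly black with a small bright region, like the candle photos, would put most of its mass in the darkest bin, which is the kind of regularity such a vector can capture.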

Page 35: Three kinds of web data that can help computers make better sense of human language


Step 2: Create Image Feature Vectors

SIFT keypoint features

Using David Lowe’s software [Lowe, 2004]

Page 36: Three kinds of web data that can help computers make better sense of human language

Step 3: Compute an Aggregate Similarity for Two Words

[Diagram: Vector Cosine Similarity between image pairs, with example values 0.33, 0.55, 0.19, 0.46]
• Best match for one English image
• Avg. over all English images
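The two aggregation steps named here (best match per English image, then the average over all English images) can be sketched as follows. Function names are hypothetical, and each image is assumed to be already represented as a numeric feature vector.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def aggregate_similarity(english_images: List[Vector],
                         foreign_images: List[Vector]) -> float:
    """Average, over English images, of the best cosine match among the
    foreign word's images."""
    if not english_images or not foreign_images:
        return 0.0
    best = [max(cosine(e, f) for f in foreign_images) for e in english_images]
    return sum(best) / len(best)
```

With the slide's example values, an English image whose similarities to the foreign images are 0.33, 0.55, 0.19 and 0.46 would contribute its best match, 0.55, to the average.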

Page 37: Three kinds of web data that can help computers make better sense of human language

Output: Ranking of Foreign Translations by Aggregate Visual Similarities

English   Spanish                    French
rosary    1. camándula:0.151         1. chapelet:0.213
          2. puntaje:0.140           2. activité:0.153
          3. accidentalidad:0.139    3. rosaire:0.150
          …                          …

Page 38: Three kinds of web data that can help computers make better sense of human language

Experiments

• 500-word lists in each language
• Results on all pairs from German, English, Spanish, French, Italian, Dutch
• Avg. Top-N Accuracy: How often is the correct answer in the top N most similar words?
– Lots more details in paper, including how we determine which words are ‘physical objects’
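Top-N accuracy for one language pair can be computed as in the sketch below. The rankings in the usage note are invented toy data, not results from the paper, and the function name is hypothetical.

```python
from typing import Dict, List


def top_n_accuracy(rankings: Dict[str, List[str]],
                   gold: Dict[str, str], n: int) -> float:
    """Fraction of source words whose gold translation appears among the
    first n candidates of its ranked list."""
    if not rankings:
        return 0.0
    hits = sum(1 for word, ranked in rankings.items()
               if gold[word] in ranked[:n])
    return hits / len(rankings)
```

For instance, with the toy rankings `{"candle": ["calle", "vela"], "turtle": ["tortuga", "vela"]}` and gold translations `{"candle": "vela", "turtle": "tortuga"}`, Top-1 accuracy is 0.5 and Top-2 accuracy is 1.0. Averaging this value over language pairs gives the reported metric.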

Page 39: Three kinds of web data that can help computers make better sense of human language

[Figure: bar chart of Average Top-N Accuracy on 14 Language Pairs, with bars for Top-1 and Top-20 on an axis from 0 to 80]

Page 40: Three kinds of web data that can help computers make better sense of human language


Task #2: Lexical Semantics from Images

Can you eat “migas”?

Can you eat “carillon”?

Can you eat “mamey”?

Selectional Preference:

Is noun X a plausible object for verb Y?

[Bergsma and Goebel, RANLP 2011]

Page 41: Three kinds of web data that can help computers make better sense of human language

Conclusion

• Robust NLP needs to look beyond human-annotated data to exploit large corpora
• Size matters:
– Most parsing systems trained on 1 million words
– We use:
  • billions of words in bitexts
  • trillions of words of monolingual text
  • online images: hundreds of billions (⨯1000 words each → 100 trillion words!)

Page 42: Three kinds of web data that can help computers make better sense of human language

Questions + Thanks

• Gold sponsors:

• Platinum sponsors (collaborators):
– Kenneth Church (Johns Hopkins), Randy Goebel (Alberta), Dekang Lin (Google), Emily Pitler (Penn), Benjamin Van Durme (Johns Hopkins) and David Yarowsky (Johns Hopkins)