Three kinds of web data that can help computers make better sense of human language. Shane Bergsma, Johns Hopkins University. Fall, 2011. Computers that understand language.
Three kinds of web data that can help computers make better sense of human language
Shane Bergsma
Johns Hopkins University
Fall, 2011
2
Computers that understand language
“William Wilkinson’s ‘An Account of the Principalities of Wallachia and Moldavia’
inspired this author’s most famous novel.”
3
Research Vision
Robust processing of human language requires knowledge beyond what’s in small manually-annotated data sets
Derive meaning from real-world data:
1) Raw text on the web
2) Bilingual text (words plus their translations)
   Part 1: Parsing noun phrases
3) Visual data (labelled online images)
   Part 2: Learning the meaning of words
4
Part 1: Parsing Noun Phrases (NPs)
Google: What pages/ads should be returned for the query “washed baby carrots”?
[washed baby] carrots: carrots for washed babies
vs.
washed [baby carrots]: baby carrots that are washed
5
Training a parser via machine learning
washed baby carrots → PARSER (with weights, w0) → [washed baby] carrots → TESTER → INCORRECT in training data
6
Training a parser via machine learning
washed baby carrots → PARSER (with weights, w1) → washed [baby carrots] → TESTER → CORRECT in gold standard
Training corpus:
retired [science teacher]
[social science] teacher
female [bus driver]
[school bus] driver
zebra [hair straightener]
alleged [Canadian lover]
…
More data is better data (learning curve)
Grammar Correction Task [Banko & Brill, 2001]
8
Testing a parser on new data
washed baby smell → PARSER (with final weights, wN) → washed [baby smell] → TESTER → INCORRECT
- Having seen washed [baby carrots] in training…
Big Challenge: For parsing NPs, every word matters
- both parses are grammatical
- we can’t generalize from “washed baby carrots” in training to “washed baby smell” at test time
Solution: New sources of data
9
English Data for Parsing
Human Annotated
• 1 MILLION words
Penn (Parse-)Treebank [Marcus et al., 1993]
Bitexts
• 1 BILLION words
Canadian Hansards, etc. [Callison-Burch et al., 2010]
Web text [N-grams]
• 1 TRILLION words
Google N-gram Data [Brants & Franz, 2006]
10
Task: Parsing NPs with conjunctions
1) [dairy and meat] production
2) [sustainability] and [meat production]
yes: [dairy production] in (1)
no: [sustainability production] in (2)
• Our contributions: new semantic features from raw web text and a new approach to using bilingual data as soft supervision
[Bergsma, Yarowsky & Church, ACL 2011]
11
One Noun Phrase or Two: A Machine Learning Approach
Input: “dairy and meat production” → features: x
x = (…, first-noun=dairy, … second-noun=meat, … first+second-noun=dairy+meat, …)
h(x) = w ∙ x (predict one NP if h(x) > 0)
• Set w via training on annotated training data using some machine learning algorithm
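The scoring rule h(x) = w ∙ x can be sketched as follows. This is a minimal illustration, not the system from the talk: it assumes a perceptron as the learning algorithm (the slide only says "some machine learning algorithm"), and `extract_features` is a hypothetical helper whose feature names mirror the ones listed on the slide. The two toy labels come from the bracketings shown earlier.

```python
from collections import defaultdict

def extract_features(np_text):
    """Hypothetical extractor for phrases of the form '<noun1> and <noun2> <head>'."""
    first, _and, second, _head = np_text.split()
    return {
        "bias": 1.0,
        f"first-noun={first}": 1.0,
        f"second-noun={second}": 1.0,
        f"first+second-noun={first}+{second}": 1.0,
    }

def h(w, x):
    """Linear score w . x over a sparse feature dictionary."""
    return sum(w[name] * value for name, value in x.items())

def train_perceptron(examples, epochs=5):
    """Assumed learner: update w only when the current prediction is wrong."""
    w = defaultdict(float)
    for _ in range(epochs):
        for text, label in examples:  # label: +1 = one NP, -1 = two NPs
            x = extract_features(text)
            predicted = 1 if h(w, x) > 0 else -1
            if predicted != label:
                for name, value in x.items():
                    w[name] += label * value
    return w

examples = [
    ("dairy and meat production", 1),            # [dairy and meat] production
    ("sustainability and meat production", -1),  # [sustainability] and [meat production]
]
w = train_perceptron(examples)
print(h(w, extract_features("dairy and meat production")) > 0)
```

With only lexical features like these, nothing learned here transfers to unseen nouns, which is the gap the web-derived count features are meant to fill.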
12
Leveraging Web-Derived Knowledge
[dairy and meat] production
• If there is only one NP, then it is implicitly talking about “dairy production”
• Do we see this phrase occurring a lot on the web? [Yes]
sustainability and [meat production]
• If there is only one NP, then it is implicitly talking about “sustainability production”
• Do we see this phrase occurring a lot on the web? [No]
• Classifier has features for these counts
13
Search Engine Page Counts for NLP
• Early web work: Use an Internet search engine to get web counts [Keller & Lapata, 2003]
“dairy production” 714,000 pages
“sustainability production” 11,000 pages
Problem: Using a search engine is just too inefficient to get data on a large scale
14
Google N-gram Data for NLP
• Google N-gram Data [Brants & Franz, 2006]
– N words in sequence + their count on web:
…
dairy producers 22724
dairy production 17704
dairy professionals 204
dairy profits 82
dairy propaganda 15
dairy protein 1268
…
– A compressed version of all the text on web
– Enables new features/statistics for a range of tasks
[Bergsma et al. ACL 2008, IJCAI 2009, ACL 2010, etc.]
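A count feature of this kind could be computed roughly as below. This is a sketch under assumptions: the tab-separated "n-gram, count" line format and the add-one log transform are illustrative choices, and the table holds only the six rows shown on the slide, so any other phrase counts as unseen here.

```python
import math

# Toy excerpt: the six rows from the slide, in an assumed "ngram<TAB>count" format.
NGRAM_LINES = [
    "dairy producers\t22724",
    "dairy production\t17704",
    "dairy professionals\t204",
    "dairy profits\t82",
    "dairy propaganda\t15",
    "dairy protein\t1268",
]

def load_counts(lines):
    """Parse 'ngram<TAB>count' lines into a dictionary."""
    counts = {}
    for line in lines:
        ngram, count = line.rsplit("\t", 1)
        counts[ngram] = int(count)
    return counts

def log_count(counts, phrase):
    """Log-scaled count feature; unseen phrases map to 0.0."""
    return math.log(counts.get(phrase, 0) + 1)

counts = load_counts(NGRAM_LINES)
print(log_count(counts, "dairy production"))           # frequent phrase, large value
print(log_count(counts, "sustainability production"))  # absent from this toy table
```

A local lookup like this replaces one search-engine query per phrase, which is what makes count features practical at scale.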
15
Features for Explicit Paraphrases
❶ and ❷ ❸: dairy and meat production vs. sustainability and meat production
Pattern: ❸ of ❶ and ❷
↑ Count(production of dairy and meat)
↓ Count(production of sustainability and meat)
Pattern: ❷ ❸ and ❶
↓ Count(meat production and dairy)
↑ Count(meat production and sustainability)
New paraphrases extending ideas in [Nakov & Hearst, 2005]
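The two paraphrase patterns can be generated mechanically from the slots ❶, ❷ and ❸. The sketch below is illustrative only: `paraphrase_queries` is a hypothetical helper that assumes a four-token phrase of the form "❶ and ❷ ❸", and the returned strings would then be looked up in the N-gram counts.

```python
def paraphrase_queries(np_text):
    """Build the two paraphrase strings for a phrase '<n1> and <n2> <head>'."""
    n1, _and, n2, head = np_text.split()
    return {
        # A high count for this string supports the one-NP reading.
        "3 of 1 and 2": f"{head} of {n1} and {n2}",
        # A high count for this string supports the two-NP reading.
        "2 3 and 1": f"{n2} {head} and {n1}",
    }

for phrase in ("dairy and meat production", "sustainability and meat production"):
    print(phrase, "->", paraphrase_queries(phrase))
```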
16
Training Examples [Human-Annotated Data (small)]
conservation and good management
motor and heating fuels
freedom and security agenda
Google N-gram Data [Raw Data (HUGE)]
Feature Vectors: x1, x2, x3, x4
Machine Learning
Classifier: h(x)
17
Using Bilingual Data
• Bilingual data: a rich source of paraphrases
dairy and meat production → producción láctea y cárnica
• Build a classifier which uses bilingual features
– Applicable when we know the translation of the NP
18
Bilingual “Paraphrase” Features
❶ and ❷ ❸: dairy and meat production vs. sustainability and meat production
Pattern: ❸ ❶ … ❷ (Spanish)
dairy and meat production: Count(producción láctea y cárnica)
sustainability and meat production: unseen
Pattern: ❶ … ❸ ❷ (Italian)
dairy and meat production: unseen
sustainability and meat production: Count(sostenibilità e la produzione di carne)
19
Bilingual “Paraphrase” Features
❶ and ❷ ❸: dairy and meat production vs. sustainability and meat production
Pattern: ❶- … ❷❸ (Finnish)
dairy and meat production: Count(maidon- ja lihantuotantoon)
sustainability and meat production: unseen
20
Training Examples [Human-Annotated Data (small)]
conservation and good management
motor and heating fuels
freedom and security agenda
Translation Data [Bilingual Data (medium)]
Feature Vectors: x1, x2, x3, x4
Machine Learning
Classifier: h(xb)
21
[Diagram: two classifiers built from the same training examples]
h(xm): Training Examples + Features from Google Data
h(xb): Training Examples + Features from Translation Data
Training Examples:
coal and steel money
rocket and mortar attacks
Bitext Examples:
insurrection and regime change
North and South Carolina
business and computer science
the Bosporus and Dardanelles straits
the environment and air transport
pollution and transport safety
22
[Diagram: the bitext examples under h(xb)1]
h(xm): Training Examples + Features from Google Data
h(xb)1: Training Examples + Features from Translation Data
Training Examples:
coal and steel money
rocket and mortar attacks
Bitext examples listed under h(xb)1, then shown in two groups:
Group 1: business and computer science; the Bosporus and Dardanelles straits; the environment and air transport
Group 2: insurrection and regime change; North and South Carolina; pollution and transport safety
23
[Diagram: examples listed under both h(xb)1 and h(xm)1]
h(xm)1: Training Examples + Features from Google Data
h(xb)1: Training Examples + Features from Translation Data
Training Examples:
coal and steel money
rocket and mortar attacks
Other examples: business and computer science; the environment and air transport; the Bosporus and Dardanelles straits
Listed under both h(xb)1 and h(xm)1: insurrection and regime change; North and South Carolina; pollution and transport safety
Co-Training: [Yarowsky’95], [Blum & Mitchell’98]
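The general co-training loop can be sketched as below. This is a simplified illustration rather than the procedure from the paper: it assumes a perceptron for each view, a single shared pool of labeled examples, and made-up two-dimensional feature vectors standing in for the monolingual view (index 0) and the bilingual view (index 1).

```python
def score(w, x):
    """Linear score w . x for dense feature lists."""
    return sum(wi * xi for wi, xi in zip(w, x))

def train(view, labeled, epochs=10):
    """Perceptron on one view; labeled is a list of ((x_m, x_b), y) pairs."""
    w = [0.0] * len(labeled[0][0][view])
    for _ in range(epochs):
        for views, y in labeled:
            x = views[view]
            if y * score(w, x) <= 0:  # mistake (or zero score): update
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

def co_train(labeled, unlabeled, rounds=3, per_round=2):
    """Each view in turn labels its most confident unlabeled examples,
    which then become training data for both views (assumed shared pool)."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        for view in (0, 1):
            if not unlabeled:
                break
            w = train(view, labeled)
            unlabeled.sort(key=lambda v: abs(score(w, v[view])), reverse=True)
            confident, unlabeled = unlabeled[:per_round], unlabeled[per_round:]
            for views in confident:
                y = 1 if score(w, views[view]) > 0 else -1
                labeled.append((views, y))
    return train(0, labeled), train(1, labeled)

# Toy data: two labeled seeds, four unlabeled examples (all values invented).
seeds = [(([1.0, 0.0], [1.0, 0.0]), 1), (([0.0, 1.0], [0.0, 1.0]), -1)]
pool = [
    ([3.0, 0.0], [0.6, 0.4]),  # clear in view 0, weak in view 1
    ([0.0, 3.0], [0.4, 0.6]),
    ([0.6, 0.4], [2.0, 0.0]),  # weak in view 0, clear in view 1
    ([0.4, 0.6], [0.0, 2.0]),
]
w_m, w_b = co_train(seeds, pool)
print(w_m, w_b)
```

The point of the two views is that an example one classifier finds easy can be hard for the other, so each supplies labels the other could not have produced confidently on its own.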
24
[Figure: Error rate (%) of co-trained classifiers h(xm)i and h(xb)i]
25
Error rate (%) on Penn Treebank (PTB)
[Bar chart, y-axis 0 to 20. Bars: Broad-coverage Parsers; Nakov & Hearst (2005); Pitler et al (2010); New Supervised Monoclassifier; Co-trained Monoclassifier. Annotations: unsupervised; 800 PTB training examples (twice); 2 training examples; h(xm)N]
26
Part 1: Conclusion
• Knowledge from large-scale monolingual corpora is crucial for parsing noun phrases
– New paraphrase features
• New way to use bilingual data as soft supervision to guide the use of monolingual features
• Next steps: Use bilingual data even when we don’t know the translations to begin with
– infer translations jointly with syntax
– i.e., beyond bitexts (1B), make use of huge (1T+) N-gram corpora in English, Spanish, French, …
27
Part 2: Using visual data to learn the meaning of words
• Large volumes of visual data also reveal word meaning (semantics), but in a language-universal way
• Humans label their images as they post them online, providing the word-meaning link
• There are lots of images to work with
[from Facebook’s Twitter feed]
28
Part 2: Using visual data to learn the meaning of words
Progress in the area of “lexical semantics”
Task #1: learning translations of words into foreign languages using visual data, e.g.
“turtle” in English = “tortuga” in Spanish
Main contribution: a totally new approach to building bilingual dictionaries
[Bergsma and Van Durme, IJCAI 2011]
29
English Web Images    Spanish Web Images
turtle                tortuga
candle                vela
cockatoo              cacatúa
30
Task #1: Bilingual Lexicon Induction
• Why?
– Needed for automatic machine translation, cross-language information retrieval, etc.
– Poor coverage of human-compiled dictionaries/bitexts
• How to do it with monolingual data only?
– Link words to information that is preserved across languages (clues to common meaning)
31
Clues to Common Meaning: Spelling
[Koehn & Knight 2002, many others]
natural-natural
higiénico-hygienic
radón-radon
vela-candle
*calle-candle
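Spelling similarity of this kind is often measured with edit distance. The sketch below uses plain Levenshtein distance as an assumed stand-in (the slide does not say which measure is used) and shows why spelling alone pairs "calle" with "candle" more readily than the correct "vela".

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]

for spanish, english in [("natural", "natural"), ("radón", "radon"),
                         ("vela", "candle"), ("calle", "candle")]:
    print(spanish, english, edit_distance(spanish, english))
```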
32
Clues to Common Meaning: Images
candle
calle
vela
Visual similarities:
• high contrast
• black background
• glowing flame
33
Link words by web-based visual similarity
Step 1: Retrieve online images via Google Image Search (in each lang.), 20 images for each word
– Google competitive with “hand-prepared datasets” [Fergus et al., 2005]
34
Step 2: Create Image Feature Vectors
Color histogram features
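A color histogram feature vector could be built roughly as follows. This is a sketch under assumptions: the image is given as a list of (R, G, B) tuples with values 0 to 255, and the choice of 4 bins per channel is illustrative, not taken from the talk.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Normalized joint RGB histogram over a list of (r, g, b) pixels."""
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        # Map each 0..255 channel value to a bin index 0..bins_per_channel-1.
        rb = r * bins_per_channel // 256
        gb = g * bins_per_channel // 256
        bb = b * bins_per_channel // 256
        hist[(rb * bins_per_channel + gb) * bins_per_channel + bb] += 1.0
    total = len(pixels)
    return [value / total for value in hist] if total else hist

# Toy "image": two black pixels, one white, one pure red.
pixels = [(0, 0, 0), (0, 0, 0), (255, 255, 255), (255, 0, 0)]
hist = color_histogram(pixels)
print(len(hist), max(hist))
```

Normalizing by the pixel count keeps images of different sizes comparable.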
35
Step 2: Create Image Feature Vectors
SIFT keypoint features
Using David Lowe’s software [Lowe, 2004]
36
Step 3: Compute an Aggregate Similarity for Two Words
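[Figure: vector cosine similarities between one English image and several foreign images (example values 0.33, 0.55, 0.19, 0.46). Take the best match for one English image, then average over all English images.]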
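The aggregation described in this step (best cosine match per English image, averaged over all English images) can be sketched as below. The feature vectors are invented two-dimensional toys; real ones would come from the color histogram and SIFT features.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def aggregate_similarity(english_images, foreign_images):
    """For each English image take its best match among the foreign images,
    then average those best-match scores over all English images."""
    best = [max(cosine(e, f) for f in foreign_images) for e in english_images]
    return sum(best) / len(best)

english = [[1.0, 0.0], [0.0, 1.0]]  # toy image vectors for an English word
foreign = [[1.0, 0.0], [1.0, 1.0]]  # toy image vectors for a candidate translation
print(aggregate_similarity(english, foreign))
```

Note that this score is asymmetric: swapping the two image sets can give a different value.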
37
Output: Ranking of Foreign Translations by Aggregate Visual Similarities
English    Spanish                     French
rosary     1. camándula: 0.151         1. chapelet: 0.213
           2. puntaje: 0.140           2. activité: 0.153
           3. accidentalidad: 0.139    3. rosaire: 0.150
           …                           …
38
Experiments
• 500-word lists in each language
• Results on all pairs from German, English, Spanish, French, Italian, Dutch
• Avg. Top-N Accuracy: How often correct answer is in top N most similar words?
– Lots more details in paper, including how we determine which words are ‘physical objects’
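Top-N accuracy as defined above can be computed as in this sketch. The rankings are invented for illustration; only the word pairs (turtle/tortuga, candle/vela, cockatoo/cacatúa) come from the slides.

```python
def top_n_accuracy(rankings, gold, n):
    """Fraction of source words whose gold translation is among the top n candidates."""
    hits = sum(1 for word, ranked in rankings.items() if gold[word] in ranked[:n])
    return hits / len(rankings)

gold = {"turtle": "tortuga", "candle": "vela", "cockatoo": "cacatúa"}

# Hypothetical ranked candidate lists, most similar first.
rankings = {
    "turtle": ["tortuga", "vela", "cacatúa"],
    "candle": ["calle", "vela", "tortuga"],
    "cockatoo": ["tortuga", "vela", "cacatúa"],
}

print(top_n_accuracy(rankings, gold, 1))
print(top_n_accuracy(rankings, gold, 3))
```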
39
Average Top-N Accuracy on 14 Language Pairs
[Bar chart: y-axis 0 to 80; bars for Top-1 and Top-20]
40
Task #2: Lexical Semantics from Images
Can you eat “migas”?
Can you eat “carillon”?
Can you eat “mamey”?
Selectional Preference:
Is noun X a plausible object for verb Y?
[Bergsma and Goebel, RANLP 2011]
41
Conclusion
• Robust NLP needs to look beyond human-annotated data to exploit large corpora
• Size matters:
– Most parsing systems trained on 1 million words
– We use:
  • billions of words in bitexts
  • trillions of words of monolingual text
  • online images: hundreds of billions (× 1000 words each → 100 trillion words!)
42
Questions + Thanks
• Gold sponsors:
• Platinum sponsors (collaborators):
– Kenneth Church (Johns Hopkins), Randy Goebel (Alberta), Dekang Lin (Google), Emily Pitler (Penn), Benjamin Van Durme (Johns Hopkins) and David Yarowsky (Johns Hopkins)