Unsupervised Acquisition of Lexical Knowledge from N-Grams
Dekang Lin (Google), Ken Church (JHU), Heng Ji (CUNY), Satoshi Sekine (NYU), David Yarowsky (JHU), Shane Bergsma (Univ. of Alberta), Kailash Patil (JHU), Emily Pitler (UPenn), Rachel Lathbury (Univ. of Virginia), Vikram Rao (Cornell), Kapil Dalwani (JHU), Sushant Narsale (JHU)
Introduction
  Motivation
  Deliverables
N-gram data sets
Tasks
  Lexical resources
  N-gram search tools
  Distributional clustering
  Applications
Timeline and Evaluation
“There is no better data than more data” – Mercer (1985)
The web: a large data set of text
Resource Requirements
  Collecting & processing data at web scale is beyond the reach of most academic laboratories
Starting Point
  Generous donation: lots of N-grams (Google & NYU)
N-grams can be viewed as a compressed summary of language usage on the web
car 13966, automobile 2954, road 1892, auto 1650, traffic 1549, tragic 1480, motorcycle 1399, boating 823, freak 733, drowning 438, vehicle 417, hunting 304, helicopter 289, skiing 281, mining 254, train 250, airplane 236, plane 234, climbing 231, bus 208, motor 198, industrial 187, swimming 180, training 170, motorbike 155, aircraft 152, terrible 137, riding 136, bicycle 132, diving 127, tractor 115, construction 111, farming 107, horrible 105, one-car 104, flying 103, hit-and-run 99, similar 89, racing 89, hiking 89, truck 86, farm 81, bike 78, mine 75, carriage 73, logging 72, unfortunate 71, railroad 71, work-related 70, snowmobile 70, mysterious 68, fishing 67, shooting 66, mountaineering 66, highway 66, single-car 63, cycling 62, air 59, boat 59, horrific 56, sailing 55, fatal 55, workplace 50, skydiving 50, rollover 50, one-vehicle 48, <UNK> 48, work 47, single-vehicle 47, vehicular 45, kayaking 43, surfing 42, automobile 41, car 40, electrical 39, ATV 39, railway 38, Humvee 38, skating 35, hang-gliding 35, canoeing 35, 0000 35, shuttle 34, parachuting 34, jeep 34, ski 33, bulldozer 31, aviation 30, van 30, bizarre 30, wagon 27, two-vehicle 27, street 27, glider 26, " 25, sawmill 25, horse 25, bomb-making 25, bicycling 25, auto 25, alcohol-related 24, snowboarding 24, motoring 24, early-morning 24, trucking 23, elevator 22, horse-riding 22, fire 22, two-car 21, strange 20, mountain-climbing 20, drunk-driving 20, gun 19, rail 18, snowmobiling 17, mill 17, forklift 17, biking 17, river 16, motorcyle 16, lab 16, gliding 16, bonfire 16, apparent 15, aeroplane 15, testing 15, sledding 15, scuba-diving 15, rock-climbing 15, rafting 15, fiery 15, scooter 14, parachute 14, four-wheeler 14, suspicious 13, rodeo 13, mountain 13, laboratory 13, flight 13, domestic 13, buggy 13, horrific 12, violent 12, trolley 12, three-vehicle 12, tank 12, sudden 12, stupid 12, speedboat 12, single 12, jousting 12, ferry 12, airplane 12, unrelated 11, transporter 11, tram 11, scuba 11, common 11, canoe 11, skateboarding 10, ship 10, paragliding 10, paddock 10, moped 10, factory 10
Tools for searching/exploring the n-gram data
  Getting ideas and inspirations from the data
  Lower the barrier for web-scale lexical learning and ambiguity resolution.
Lexical resources
  List of phrases (multi-word expressions)
  Lexical properties (gender, definiteness, ...)
  Distributional clusterings of words and phrases
Applications
  POS tagging
  Resolving hard syntactic ambiguities
Introduction
  Motivation
  Deliverables
N-gram data sets
Tasks
  Lexical resources
  N-gram search tools
  Distributional clustering
  Applications
Timeline and Evaluation
Google N-gram Version 1
  1 trillion token corpus
Google N-gram Version 2
  Same corpus as Version 1. De-duped and selected a subset of 200 billion tokens
N-grams in news text from LDC
  N-gram extraction done by Satoshi Sekine at NYU
  1.9 billion tokens, no cut-off
N-grams in Wikipedia
  N-gram extraction done by Satoshi Sekine at NYU
  1.7 billion tokens, no cut-off
                 Google V1          Google V2        News86 (NYU)   Wiki. (NYU)
1gram            13,588,391         7,641,641        4,954,449      7,955,927
2gram            314,843,401        296,890,580      75,500,632     92,650,303
3gram            977,069,902        1,078,786,671    351,107,322    377,375,925
4gram            1,313,818,354      1,523,248,748    760,430,455    732,948,749
5gram            1,176,470,663      1,239,404,360    1,108,779,637  1,005,745,628
6gram            -                  -                1,330,094,614  1,172,521,013
7gram            -                  -                1,449,291,893  1,266,198,876
Tokens           1,024,908,267,229  207,447,212,712  1,918,950,546  1,702,102,103
Sentences        95,119,665,584     9,712,670,106    81,934,589     150,910,476
1gram cut-off    200                50               0              0
2-5gram cut-off  40                 10               0              0
Annotation       none               POS              POS, chunk, NE POS, chunk, NE
Sentence de-duping and selection
  Duplicate sentences are discarded
  Only sentences between 20 and 300 bytes long and with less than 20% of the characters being digits or punctuation marks have been selected.
Text transformation
  All digits have been converted to '0's (e.g., "123-45-6789" becomes "000-00-0000"), and URLs and email addresses are substituted with <URL> and <EMAIL> respectively.
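A minimal sketch of this selection-and-normalization step, assuming one sentence per line; the regexes and function names are illustrative, not the workshop's actual preprocessing code:

    import re

    URL_RE = re.compile(r'\bhttps?://\S+|\bwww\.\S+', re.IGNORECASE)
    EMAIL_RE = re.compile(r'\b[\w.+-]+@[\w-]+\.[\w.-]+\b')

    def keep_sentence(s):
        # Selection criteria from the slides: 20-300 bytes, <20% digits/punctuation.
        n = len(s.encode('utf-8'))
        if not (20 <= n <= 300):
            return False
        noisy = sum(1 for ch in s if ch.isdigit() or (not ch.isalnum() and not ch.isspace()))
        return noisy / len(s) < 0.20

    def normalize(s):
        # Substitute URLs and email addresses, then map every digit to '0'.
        s = URL_RE.sub('<URL>', s)
        s = EMAIL_RE.sub('<EMAIL>', s)
        return re.sub(r'\d', '0', s)

    # normalize("SSN 123-45-6789, mail me@example.com")
    # -> 'SSN 000-00-0000, mail <EMAIL>'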
POS-tagged with TnT (Brants 2000).
You can advertise your event here and it will be seen by Action Network users near you . 32386
However , in view of ongoing research , changes in goverment regulations , and the constant flow of information relating to drug therapy and drug reactions , the reader is urged to check the package insert for each drug for any changes in indications and dosage and for added warnings and precautions . 20481
Top of Page © 2005 The Weather Co. | Conditions of Use | Privacy Statement 14084
For each N-gram, we also generated all possible rotations, except when the first word is a stop word (top-100 most frequent words).
  designed for casual trail hiking
  casual trail hiking >< for designed
  trail hiking >< casual for designed
  hiking >< trail casual for designed
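A sketch of how such rotations could be generated; the stop-word set below is a small stand-in for the top-100 most frequent words, and the "><" separator follows the example above:

    # Stand-in for the top-100 most frequent words; the real list is larger.
    STOP_WORDS = {"the", "of", "and", "for", "to", "a", "in", "is", "on", "that"}

    def rotations(ngram):
        # Yield the n-gram itself plus every rotation "suffix >< reversed prefix",
        # skipping rotations that would start with a stop word.
        tokens = ngram.split()
        yield ngram
        for i in range(1, len(tokens)):
            if tokens[i] in STOP_WORDS:
                continue
            yield " ".join(tokens[i:] + ["><"] + list(reversed(tokens[:i])))

    # list(rotations("designed for casual trail hiking")) ->
    # ['designed for casual trail hiking',
    #  'casual trail hiking >< for designed',
    #  'trail hiking >< casual for designed',
    #  'hiking >< trail casual for designed']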
Sort all rotated n-grams.
For any n-gram, all co-occurrence statistics can be computed with a consecutive sequence of rotated n-grams.
n-gram 16579
n-gram >< an build 13
n-gram >< average 12
n-gram >< current the 11
n-gram >< into 15
n-gram >< of set 33
n-gram >< similar 11
n-gram >< the for 92
n-gram >< used 46
n-gram algorithms 14
n-gram approach , >< the 17
n-gram clustering 10
n-gram distance 15
n-gram in 227
n-gram is >< each 33
n-gram language >< back-off 10
n-gram language model can 15
n-gram length >< maximum 21
n-gram method is 18
n-gram model >< statistical 14
n-gram models 1390
n-gram models and 50
n-gram order . >< and 12
n-gram score 29
n-gram statistics for large >< of 10
n-gram to >< the 12
How to get them?
Consider the sentence:
  We augment the naive Bayes model with an n-gram language model to address two ...
Is “model with an n-gram” a phrase?
We need to know its count.
Collecting these counts is computationally expensive, but the n-gram data already have them.
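Because the rotated n-grams are sorted, all entries involving a given phrase sit in one contiguous block that can be found by binary search. A toy in-memory sketch with invented counts; the real data would live in sorted files on disk, and a real lookup would also respect token boundaries:

    import bisect

    def prefix_slice(sorted_entries, phrase):
        # sorted_entries: (rotated_ngram, count) pairs sorted by the n-gram string.
        keys = [ng for ng, _ in sorted_entries]
        lo = bisect.bisect_left(keys, phrase)
        hi = bisect.bisect_left(keys, phrase + "\uffff")  # end of the shared-prefix range
        return sorted_entries[lo:hi]

    # Toy entries (counts invented for illustration):
    entries = [("model with an n-gram >< Bayes naive the", 13),
               ("model with an n-gram language", 42),
               ("model without a prior", 7)]

    # prefix_slice(entries, "model with an n-gram") returns the first two rows;
    # their counts and contexts give the statistics for the candidate phrase.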
A frequent n-gram is not necessarily a phrase
  n-gram language 1248
  n-gram language >< , 34
  ......
  n-gram language >< using 14
  n-gram language >< with 17
  n-gram language >< word 15
  n-gram language model 467
  n-gram language model >< An 13
  ......
  n-gram language model >< statistical 11
  n-gram language model >< the 49
  n-gram language model and 12
  ......
  n-gram language model to 12
  n-gram language modeling 111
  n-gram language modelling 23
  n-gram language models 585
  n-gram language models , 85
  n-gram language models >< and 26
  n-gram language models >< backoff 14
  n-gram language models >< by 11
  n-gram language models >< class 11
  ......
  n-gram language models are 24
  n-gram language models can 10
  ......
  n-gram language models trained 11
  n-gram language models trained on 11
Introduction
  Motivation
  Deliverables
N-gram data sets
Tasks
  Lexical resources
  N-gram search tools
  Distributional clustering
  Applications
Timeline and Evaluation
Many lexical properties are not explicitly marked in words:
  Gender
  Animacy
  Definiteness
  Countability
Learning unobservable properties by counting observable proxies.
Count the possessive pronouns following a word after connectives or verbs [Bergsma 2005].
Pattern: Word CC/V* PRP$
                          its    his    her    their  my
star and ___              2805   2409   1594   250    247
Hollywood star and ___    0      29     13     0      0
variable star and ___     57     0      0      0      0
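A sketch of counting these proxies from POS-tagged n-grams; the 'word_TAG' input format and function names are assumptions for illustration:

    from collections import Counter

    POSSESSIVES = ("its", "his", "her", "their", "my")

    def possessive_counts(tagged_ngrams, noun):
        # tagged_ngrams: iterable of (['word_TAG', ...], count) pairs (format assumed).
        # Pattern from the slide: Word CC/V* PRP$
        counts = Counter()
        for tokens, count in tagged_ngrams:
            split = [t.rsplit("_", 1) for t in tokens]
            for (w0, _), (_, t1), (w2, t2) in zip(split, split[1:], split[2:]):
                if w0.lower() == noun and (t1 == "CC" or t1.startswith("V")) and t2 == "PRP$":
                    counts[w2.lower()] += count
        return {p: counts[p] for p in POSSESSIVES}

    # e.g. possessive_counts(data, "star") would fill in a row like the table above:
    # {'its': 2805, 'his': 2409, 'her': 1594, 'their': 250, 'my': 247}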
Count the relative pronoun after nouns
Pattern: N.* (who|which)

                          who     which
operator ___              43845   29487
linear operator ___       0       550
machine operator ___      366     17
Count the determiner before words that are followed by prepositions or verbs.
Pattern: (a|an) Word [IV]* vs. the Word [IV]*

                    the      a
___ professor of    13920    345901
___ powerset of     2551     64
___ diameter of     255475   154677
___ wife of         292210   10105
Countability
  Co-occurrence with 'much'?
Syntactic head vs. semantic head
  cup of coffee
First and last names
Nouns that are category names
Labels for entities
Most frequently co‐occurring category name
Ideas and needs from Parsing and Speech teams?
Other languages?
Introduction
  Motivation
  Deliverables
N-gram data sets
Tasks
  Lexical resources
  N-gram search tools
  Distributional clustering
  Applications
Timeline and Evaluation
Develop a pattern language
  to allow searches for the specific types of n-grams needed for lexical property acquisition
  to specify the statistics to be collected from the matching n-grams
  to support both ad hoc retrieval and batch processing
Open source implementation
Distribute the tool with Wikipedia N-grams
  1.7B tokens
  1B 5-grams with no cut-off
Atomic patterns: match individual tokens
  Word/Tag: equalTo, inList, regMatch
  Word: belongToCluster, similarTo
Composite patterns
  seq, +, *, ?
  and, or, not
Examples: the determiner of “power”
  (seq (tag = DT)
       (and (tag = NN) (word = power))
       (tag ~ “IN|V.*”))
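A toy Python rendering of the same composite pattern, assuming tokens are (word, tag) pairs; this only illustrates the semantics of seq/and/regex matching, not the actual tool's implementation:

    import re

    def word_eq(w):    return lambda tok: tok[0] == w
    def tag_eq(t):     return lambda tok: tok[1] == t
    def tag_match(rx): return lambda tok: re.fullmatch(rx, tok[1]) is not None
    def and_(*tests):  return lambda tok: all(t(tok) for t in tests)
    def or_(*tests):   return lambda tok: any(t(tok) for t in tests)

    def seq(*tests):
        # Match a fixed-length sequence of token tests against a token list.
        return lambda tokens: len(tokens) == len(tests) and all(
            t(tok) for t, tok in zip(tests, tokens))

    # (seq (tag = DT) (and (tag = NN) (word = power)) (tag ~ "IN|V.*"))
    pattern = seq(tag_eq("DT"), and_(tag_eq("NN"), word_eq("power")), tag_match(r"IN|V.*"))

    print(pattern([("the", "DT"), ("power", "NN"), ("of", "IN")]))       # True
    print(pattern([("the", "DT"), ("power", "NN"), ("failed", "VBD")]))  # True (V.*)
    print(pattern([("more", "JJR"), ("power", "NN"), ("to", "TO")]))     # False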
Introduction
  Motivation
  Deliverables
N-gram data sets
Tasks
  Lexical resources
  N-gram search tools
  Distributional clustering
  Applications
Timeline and Evaluation
Phrase selection
  Counts, POS sequence pattern, distribution of surrounding tags
Feature extraction: N-gram data -> features
  Feature selection
  Feature value: binary, counts, MI, ...
K-means clustering
  Inverted index
  Seed selection
Context features (:l: = left context, :r: = right context) with raw counts:
  :l: a 7454
  :r: , 4657
  :l: on 3837
  :l: , 3470
  :r: and 2504
  :r: </s> 1765
  :r: the 1641
  :r: a 1244
  :l: and 1234
  :l: <s> 1129
  :r: with 1104
  :r: to 990
  :r: or 982
  :l: the 945
  :r: in 940
  :r: . 913
  :l: of 819
  :l: or 814
  :r: for 778
  :l: with 729
  :l: my 717
  :l: your 647
  :r: on 604
  :l: to 587
  :l: board 566
  :l: ur 549
  :r: is 529
  :r: that 504
  :r: ) 483
  :r: i 453
The same kind of features weighted by mutual information:
  :l: erasable 9.673345
  :r: eraser 8.665204
  :r: erasers 8.495943
  :r: drawin 8.420887
  :l: elluminate 8.003080
  :l: 000gsm 7.752731
  :l: mimio 7.585966
  :l: mottle 7.414867
  :l: erase 7.293596
  :r: dusters 6.897158
  :l: magnet 6.747450
  :l: wall-mounted 6.711939
  :l: promethean 6.464373
  :l: kanban 6.330404
  :r: whiteboard 6.208234
  :l: non-magnetic 6.169274
  :r: colourfully 6.158004
  :r: easel 6.125531
  :r: pens 6.099751
  :l: mottled 6.057417
  :l: sided 5.992384
  :l: virtual 5.946798
  :r: brainstorming 5.908857
  :l: low-tech 5.819000
  :l: yre 5.796874
  :r: shovel 5.767510
  :r: refrigerator 5.759568
  :l: ur 5.720380
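A single-machine sketch of K-means over sparse feature vectors like the ones above (dicts of feature -> weight). The MapReduce version distributes the assignment step; the random seeding and cosine/centroid choices here are illustrative, not the workshop's implementation:

    import math, random
    from collections import defaultdict

    def cosine(u, v):
        dot = sum(w * v.get(f, 0.0) for f, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def kmeans(vectors, k, iterations=10):
        # vectors: {phrase: {feature: weight}}; returns {phrase: cluster_id}.
        centroids = random.sample(list(vectors.values()), k)   # seed selection (random here)
        assignment = {}
        for _ in range(iterations):
            members = defaultdict(list)
            for phrase, vec in vectors.items():                # assignment step (the "map")
                best = max(range(k), key=lambda i: cosine(vec, centroids[i]))
                assignment[phrase] = best
                members[best].append(vec)
            for i, vecs in members.items():                    # centroid update (the "reduce")
                centroid = defaultdict(float)
                for vec in vecs:
                    for f, w in vec.items():
                        centroid[f] += w / len(vecs)
                centroids[i] = dict(centroid)
        return assignment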
Introduction
  Motivation
  Deliverables
N-gram data sets
Tasks
  Lexical resources
  N-gram search tools
  Distributional clustering
  Applications
Timeline and Evaluation
Correct tagging errors with the ‘one tag sequence per n-gram’ hypothesis
Determine the tags of out-of-vocabulary words
  Words that never occurred in the training data, e.g. webmail:
  FW 592, JJ 43711, NN 68093, VB 2672, VBP 13
Missing POS tag possibilities in Penn Treebank: e.g., lower bound
JJR JJ 31576, JJR VBN 243992, RBR JJ 75628, RBR VBN 37294, VB JJ 73, VB VBN 68594, VBP VBN 1055
Evaluation of improved tagger
a water pump
  DT|NN|NN|8864 DT|NN|VB|1775 DT|NN|VBP|89 FW|NN|NN|9
touch base with
  NN|NN|IN|2352 VB|JJ|IN|2 VB|NN|IN|13913 VBP|NN|IN|584
go to school
  NN|TO|NN|2 VB|TO|NN|227922 VB|TO|VB|36208 VBP|TO|NN|54820 VBP|TO|VB|23989
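A sketch of the ‘one tag sequence per n-gram’ idea: for each n-gram type, keep only the most frequent tag sequence observed in the tagged data (counts below copied from the examples above):

    def majority_tag_sequence(tag_counts):
        # tag_counts: {ngram: {tag_sequence: count}}; keep only the majority sequence.
        return {ngram: max(seqs, key=seqs.get) for ngram, seqs in tag_counts.items()}

    counts = {
        "a water pump":    {"DT|NN|NN": 8864, "DT|NN|VB": 1775, "DT|NN|VBP": 89, "FW|NN|NN": 9},
        "touch base with": {"NN|NN|IN": 2352, "VB|JJ|IN": 2, "VB|NN|IN": 13913, "VBP|NN|IN": 584},
        "go to school":    {"NN|TO|NN": 2, "VB|TO|NN": 227922, "VB|TO|VB": 36208,
                            "VBP|TO|NN": 54820, "VBP|TO|VB": 23989},
    }
    print(majority_tag_sequence(counts))
    # {'a water pump': 'DT|NN|NN', 'touch base with': 'VB|NN|IN', 'go to school': 'VB|TO|NN'}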
Mistakes are easy to correct when they are the minority.
For some n-grams, however, the tagger may make more mistakes than correct ones.
Example:
water the area
  NN|DT|NN|427 VB|DT|NN|236
go to work
  VB|TO|NN|62 VB|TO|VB|245033 VBP|TO|NN|2 VBP|TO|VB|75892

water the area
  NN|DT|NN|427 VB|DT|NN|236
watered the area
  VBD|DT|NN|45
watering the area
  NN|DT|NN|1 VBG|DT|NN|130

go to work
  VB|TO|NN|62 VB|TO|VB|245033 VBP|TO|NN|2 VBP|TO|VB|75892
go from work
  VB|IN|NN|408 VB|IN|VB|5 VBP|IN|NN|181 VBP|IN|VB|1
Base NP structure
  airport parking shuttle bus stop
Conjunctions
  dining room and atrium
  private room and bath
  room and board allowance
  dining room and bedroom furniture
Require bi-lexical statistics and are mostly ignored by current state-of-the-art parsers.
Potential synergy with the Parse-the-Web team
Frequency counts
  Previously tried, but we have more of them
Distributional clusters as features
  Conjuncts generally belong to the same cluster.
Parsing results also serve as an evaluation of the clusters.
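A sketch of one such cluster feature, assuming a precomputed word-or-phrase -> cluster-id mapping (the mapping and the candidate-conjunct extraction are not shown and are hypothetical here):

    def same_cluster(a, b, cluster_of):
        # Binary feature for a candidate coordination: do the two conjunct heads
        # fall into the same distributional cluster?
        ca, cb = cluster_of.get(a), cluster_of.get(b)
        return ca is not None and ca == cb

    # "dining room and bedroom furniture": the reading [dining room and bedroom] furniture
    # pairs "room" with "bedroom"; the competing reading pairs "room" with "furniture".
    # Compare same_cluster("room", "bedroom", cluster_of)
    #     vs. same_cluster("room", "furniture", cluster_of)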
Introduction
  Motivation
  Deliverables
N-gram data sets
Tasks
  Lexical resources
  N-gram search tools
  Distributional clustering
  Applications
Timeline and Evaluation
Week 1
  Get Hadoop running on Google Academic Cluster
  Extract distributional features from N-grams
  Implement N-gram pattern matching language
  Brainstorm on finding observable proxies for unobservable lexical properties
Week 2-3
  Implement MapReduce version of K-means clustering
  Implement one-tag-sequence per 3-5gram. Identify hard cases.
  Lexical property acquisition, error analysis, iterate
Week 4-5
  Add clusters to n-gram search and result summarization.
  Experiment with syntactic ambiguity resolution
Week 6
  Wrap up and/or pursue opportunities that may arise during the workshop
Adequacy of n-grams?
  Limitations: length, frequency threshold
  Zellig Harris’ Distributional Hypothesis
Ngrams → Linguistics?
Ngrams → Lexicon?
Ngrams → Parsing?
Ngrams → Catalan Constructions
  Conjunction? PPs? NN seqs?
  Catalan: all possible binary trees
Learning Curves:
  No data like more data
  Yes, but how much is enough?
Open source software tools for n-gram search and distributional clustering.
Broad coverage resources
  List of phrases
  Lexical properties
  Phrasal clusters
Better POS tagging
  Improvement over TnT [good]
  State-of-the-art accuracy [excellent]
l f h d bBetter resolution of hard syntactic ambiguitiesUsing n‐gram counts and clusters