Unsupervised Acquisition of Lexical Knowledge from N-Grams

Dekang Lin (Google), Ken Church (JHU), Heng Ji (CUNY), Satoshi Sekine (NYU), David Yarowsky (JHU), Shane Bergsma (Univ. of Alberta), Kailash Patil (JHU), Emily Pitler (UPenn), Rachel Lathbury (Univ. of Virginia), Vikram Rao (Cornell), Kapil Dalwani (JHU), Sushant Narsale (JHU)


Outline
- Introduction: Motivation, Deliverables
- N-gram data sets
- Tasks: Lexical resources, N-gram search tools, Distributional clustering, Applications
- Timeline and Evaluation

“There is no better data than more data” – Mercer (1985)

The web: a large data set of text.

Resource requirements

Collecting and processing data at web scale is beyond the reach of most academic laboratories.

Starting point. Generous donation: lots of N-grams (Google & NYU).

N-grams can be viewed as a compressed summary of language usage on the web.

For example, the fillers of a single n-gram context and their counts:

car 13966, automobile 2954, road 1892, auto 1650, traffic 1549, tragic 1480, motorcycle 1399, boating 823, freak 733, drowning 438, vehicle 417, hunting 304, helicopter 289, skiing 281, mining 254, train 250, airplane 236, plane 234, climbing 231, bus 208, motor 198, industrial 187, swimming 180, training 170, motorbike 155, aircraft 152, terrible 137, riding 136, bicycle 132, diving 127, tractor 115, construction 111, farming 107, horrible 105, one-car 104, flying 103, hit-and-run 99, similar 89, racing 89, hiking 89, truck 86, farm 81, bike 78, mine 75, carriage 73, logging 72, unfortunate 71, railroad 71, work-related 70, snowmobile 70, mysterious 68, fishing 67, shooting 66, mountaineering 66, highway 66, single-car 63, cycling 62, air 59, boat 59, horrific 56, sailing 55, fatal 55, workplace 50, skydiving 50, rollover 50, one-vehicle 48, <UNK> 48, work 47, single-vehicle 47, vehicular 45, kayaking 43, surfing 42, automobile 41, car 40, electrical 39, ATV 39, railway 38, Humvee 38, skating 35, hang-gliding 35, canoeing 35, 0000 35, shuttle 34, parachuting 34, jeep 34, ski 33, bulldozer 31, aviation 30, van 30, bizarre 30, wagon 27, two-vehicle 27, street 27, glider 26, " 25, sawmill 25, horse 25, bomb-making 25, bicycling 25, auto 25, alcohol-related 24, snowboarding 24, motoring 24, early-morning 24, trucking 23, elevator 22, horse-riding 22, fire 22, two-car 21, strange 20, mountain-climbing 20, drunk-driving 20, gun 19, rail 18, snowmobiling 17, mill 17, forklift 17, biking 17, river 16, motorcyle 16, lab 16, gliding 16, bonfire 16, apparent 15, aeroplane 15, testing 15, sledding 15, scuba-diving 15, rock-climbing 15, rafting 15, fiery 15, scooter 14, parachute 14, four-wheeler 14, suspicious 13, rodeo 13, mountain 13, laboratory 13, flight 13, domestic 13, buggy 13, horrific 12, violent 12, trolley 12, three-vehicle 12, tank 12, sudden 12, stupid 12, speedboat 12, single 12, jousting 12, ferry 12, airplane 12, unrelated 11, transporter 11, tram 11, scuba 11, common 11, canoe 11, skateboarding 10, ship 10, paragliding 10, paddock 10, moped 10, factory 10

Deliverables:
- Tools for searching/exploring the n-gram data: getting ideas and inspiration from the data; lowering the barrier for web-scale lexical learning and ambiguity resolution

- Lexical resources: list of phrases (multi-word expressions); lexical properties (gender, definiteness, …); distributional clusterings of words and phrases

- Applications: POS tagging; resolving hard syntactic ambiguities

Outline
- Introduction: Motivation, Deliverables
- N-gram data sets
- Tasks: Lexical resources, N-gram search tools, Distributional clustering, Applications
- Timeline and Evaluation

Google N-gram Version 1: 1 trillion token corpus.

Google N-gram Version 2: same corpus as Version 1, de-duped, with a selected subset of 200 billion tokens.

N-grams in news text from LDC: N-gram extraction done by Satoshi Sekine at NYU; 1.9 billion tokens, no cut-off.

N-grams in Wikipedia: N-gram extraction done by Satoshi Sekine at NYU; 1.7 billion tokens, no cut-off.

                 | Google V1         | Google V2       | News86 (NYU)  | Wiki. (NYU)
1gram            | 13,588,391        | 7,641,641       | 4,954,449     | 7,955,927
2gram            | 314,843,401       | 296,890,580     | 75,500,632    | 92,650,303
3gram            | 977,069,902       | 1,078,786,671   | 351,107,322   | 377,375,925
4gram            | 1,313,818,354     | 1,523,248,748   | 760,430,455   | 732,948,749
5gram            | 1,176,470,663     | 1,239,404,360   | 1,108,779,637 | 1,005,745,628
6gram            | -                 | -               | 1,330,094,614 | 1,172,521,013
7gram            | -                 | -               | 1,449,291,893 | 1,266,198,876
Tokens           | 1,024,908,267,229 | 207,447,212,712 | 1,918,950,546 | 1,702,102,103
Sentences        | 95,119,665,584    | 9,712,670,106   | 81,934,589    | 150,910,476
1gram cut-off    | 200               | 50              | 0             | 0
2-5gram cut-off  | 40                | 10              | 0             | 0
Annotation       | none              | POS             | POS, chunk, NE | POS, chunk, NE

Sentence de-duping and selection: duplicate sentences are discarded, and only sentences between 20 and 300 bytes long, with less than 20% of their characters being digits or punctuation marks, are selected.

Text transformation: all digits are converted to '0' (e.g., “123-45-6789” becomes “000-00-0000”), and URLs and email addresses are substituted with <URL> and <EMAIL> respectively.
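The normalization is easy to approximate. A minimal Python sketch, with the caveat that the regular expressions below are our own stand-ins rather than the exact rules used to build the data:

```python
import re

# Stand-in patterns for the normalization described above (assumptions,
# not the pipeline's actual rules).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")

def normalize(sentence: str) -> str:
    # Replace URLs and email addresses before touching digits,
    # so any digits inside them are not rewritten first.
    sentence = URL_RE.sub("<URL>", sentence)
    sentence = EMAIL_RE.sub("<EMAIL>", sentence)
    return re.sub(r"\d", "0", sentence)

print(normalize("Call 123-45-6789 or mail bob@example.com"))
# Call 000-00-0000 or mail <EMAIL>
```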

POS-tagged with TnT (Brants 2000).

Examples of frequently duplicated sentences and their counts:

You can advertise your event here and it will be seen by Action Network users near you . 32386

However , in view of ongoing research , changes in goverment regulations , and the constant flow of information relating to drug therapy and drug reactions , the reader is urged to check the package insert for each drug for any changes in indications and dosage and for added warnings and precautions . 20481

Top of Page © 2005 The Weather Co. | Conditions of Use | Privacy Statement 14084


For each N-gram, we also generated all possible rotations, except when the first word is a stop word (top-100 most frequent words). Rotations of “designed for casual trail hiking”:

designed for casual trail hiking
casual trail hiking >< for designed
trail hiking >< casual for designed
hiking >< trail casual for designed

Sort all rotated n-grams. For any n-gram, all co-occurrence statistics can then be computed from a consecutive sequence of rotated n-grams.
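A small sketch of the rotation scheme, inferred from the example above; the “><” separator and the reversed-prefix order follow the slides, while the stop-word set is a tiny stand-in for the actual top-100 list:

```python
STOP_WORDS = {"the", "a", "an", "of", "for", "and", "to", "in"}  # stand-in for top-100

def rotations(ngram):
    """Yield the n-gram plus every rotation that does not start with a
    stop word; '><' separates the suffix from the reversed prefix."""
    words = ngram.split()
    yield ngram
    for i in range(1, len(words)):
        if words[i] in STOP_WORDS:
            continue  # rotation would start with a stop word
        yield " ".join(words[i:]) + " >< " + " ".join(reversed(words[:i]))

for r in sorted(rotations("designed for casual trail hiking")):
    print(r)
```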

A consecutive block of sorted rotated n-grams starting with “n-gram”, with counts:

n-gram 16579
n-gram >< an build 13
n-gram >< average 12
n-gram >< current the 11
n-gram >< into 15
n-gram >< of set 33
n-gram >< similar 11
n-gram >< the for 92
n-gram >< used 46
n-gram algorithms 14
n-gram approach , >< the 17
n-gram clustering 10
n-gram distance 15
n-gram in 227
n-gram is >< each 33
n-gram language >< back-off 10
n-gram language model can 15
n-gram length >< maximum 21
n-gram method is 18
n-gram model >< statistical 14
n-gram models 1390
n-gram models and 50
n-gram order . >< and 12
n-gram score 29
n-gram statistics for large >< of 10
n-gram to >< the 12

How to get them? Consider the sentence:

We augment the naive Bayes model with an n-gram language model to address two ...

Is “model with an n-gram” a phrase? We need to know its count.

Collecting these counts is computationally expensive, but the n-gram data already have them.
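Concretely, once the rotated n-grams are sorted, every statistic for a phrase comes from one range scan. A minimal in-memory sketch (the real data would sit on disk and be located the same way with binary search; entries and counts are made up):

```python
import bisect

def cooccurrence_block(sorted_rotations, phrase):
    """All rotated n-grams beginning with `phrase` form one
    consecutive block of the sorted list."""
    lo = bisect.bisect_left(sorted_rotations, phrase)
    hi = bisect.bisect_left(sorted_rotations, phrase + "\uffff")
    return sorted_rotations[lo:hi]

data = sorted([
    "model with an n-gram language\t12",
    "model with an n-gram >< Bayes naive the augment we\t8",
    "model checking >< of\t42",
])
print(cooccurrence_block(data, "model with an n-gram"))
```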

A frequent n-gram is not necessarily a phrase:

n-gram language 1248
n-gram language >< , 34
n-gram language >< using 14
n-gram language >< with 17
n-gram language >< word 15
n-gram language model 467
n-gram language model >< An 13
n-gram language model >< statistical 11
n-gram language model >< the 49
n-gram language model and 12
n-gram language model to 12
n-gram language modeling 111
n-gram language modelling 23
n-gram language models 585
n-gram language models , 85
n-gram language models >< and 26
n-gram language models >< backoff 14
n-gram language models >< by 11
n-gram language models >< class 11
n-gram language models are 24
n-gram language models can 10
n-gram language models trained 11
n-gram language models trained on 11
...

Outline
- Introduction: Motivation, Deliverables
- N-gram data sets
- Tasks: Lexical resources, N-gram search tools, Distributional clustering, Applications
- Timeline and Evaluation

Many lexical properties are not explicitly marked in words:
- Gender
- Animacy
- Definiteness
- Countability

Learning unobservable properties by counting observable proxies.

Gender: count the possessive pronouns following a word after connectives or verbs [Bergsma 2005]. Pattern: Word CC/V* PRP$

                       | its  | his  | her  | their | my
star and ___           | 2805 | 2409 | 1594 | 250   | 247
Hollywood star and ___ | 0    | 29   | 13   | 0     | 0
variable star and ___  | 57   | 0    | 0    | 0     | 0
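Turning such counts into a gender estimate is then simple relative frequency. A toy sketch over the “star and ___” row above; the pronoun-to-class mapping is the usual masculine/feminine/neutral/plural split, which is our reading of Bergsma's setup:

```python
from collections import Counter

# Counts of possessive pronouns observed after "star and ___" (from the slide).
counts = Counter({"its": 2805, "his": 2409, "her": 1594, "their": 250, "my": 247})

GENDER = {"his": "masculine", "her": "feminine", "its": "neutral", "their": "plural"}

total = sum(c for p, c in counts.items() if p in GENDER)
for pronoun, gender_class in GENDER.items():
    print(f"{gender_class:>9}: {counts[pronoun] / total:.2f}")
```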

Animacy: count the relative pronouns after nouns. Pattern: N.* (who|which)

                     | who   | which
operator ___         | 43845 | 29487
linear operator ___  | 0     | 550
machine operator ___ | 366   | 17

Definiteness: count the determiners before words that are followed by prepositions or verbs. Pattern: (a|an) Word [IV]* vs. the Word [IV]*

                 | the    | a
___ professor of | 13920  | 345901
___ powerset of  | 2551   | 64
___ diameter of  | 255475 | 154677
___ wife of      | 292210 | 10105

Countability: co-occurrence with “much”?

Syntactic head vs. semantic head: “cup of coffee”

- First and last names
- Nouns that are category names
- Labels for entities

Most frequently co‐occurring category name

Ideas and needs from the Parsing and Speech teams? Other languages?

Outline
- Introduction: Motivation, Deliverables
- N-gram data sets
- Tasks: Lexical resources, N-gram search tools, Distributional clustering, Applications
- Timeline and Evaluation

Example search: *VBD* and eventually *VBD*

Develop a pattern language:
- to allow searches for the specific types of n-grams needed for lexical property acquisition
- to specify the statistics to be collected from the matching n-grams
- to support both ad hoc retrieval and batch processing

Open source implementation; distribute the tool with the Wikipedia N-grams:
- 1.7B tokens
- 1B 5-grams with no cut-off

Atomic patterns match individual tokens:
- Word/Tag: equalTo, inList, regMatch
- Word: belongToCluster, similarTo

Composite patterns:
- seq, +, *, ?
- and, or, not

Example: the determiner of “power”:

(seq (tag = DT)
     (and (tag = NN) (word = power))
     (tag ~ "IN|V.*"))
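To make the semantics concrete, here is a minimal, hypothetical evaluator for a subset of the pattern language, with patterns written as nested tuples instead of s-expressions; the real tool's syntax and remaining features (inList, belongToCluster, similarTo, and the +/*/? operators) are omitted:

```python
import re

def match_token(pat, word, tag):
    op = pat[0]
    if op == "word":  # (word = X)
        return word == pat[2]
    if op == "tag":   # (tag = X) or (tag ~ REGEX)
        return tag == pat[2] if pat[1] == "=" else bool(re.fullmatch(pat[2], tag))
    if op == "and":
        return all(match_token(p, word, tag) for p in pat[1:])
    if op == "or":
        return any(match_token(p, word, tag) for p in pat[1:])
    if op == "not":
        return not match_token(pat[1], word, tag)
    raise ValueError(f"unknown operator: {op}")

def match_seq(seq_pat, tokens):
    """seq here matches token patterns one-to-one against the n-gram."""
    pats = seq_pat[1:]
    return len(pats) == len(tokens) and all(
        match_token(p, w, t) for p, (w, t) in zip(pats, tokens))

pattern = ("seq", ("tag", "=", "DT"),
                  ("and", ("tag", "=", "NN"), ("word", "=", "power")),
                  ("tag", "~", "IN|V.*"))
print(match_seq(pattern, [("the", "DT"), ("power", "NN"), ("of", "IN")]))  # True
```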

Statistics collected from the matches:
- Raw frequency
- Mutual information
- Summarization of fillers by clusters

Outline
- Introduction: Motivation, Deliverables
- N-gram data sets
- Tasks: Lexical resources, N-gram search tools, Distributional clustering, Applications
- Timeline and Evaluation

Phrase selection: counts, POS sequence pattern, distribution of surrounding tags.

Feature extraction: N-gram data -> features
Feature selection
Feature value: binary, counts, MI, …

K-means clustering: inverted index, seed selection.
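A schematic sketch of the clustering step, using dense toy vectors and plain NumPy; the plan above calls for a MapReduce K-means over sparse context-feature vectors with an inverted index, but the assign/update loop is the same:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # seed selection
    for _ in range(iters):
        # assign every phrase vector to its nearest center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        # refit each center as the mean of its members
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# rows = phrases, columns = context features such as ":l: the" or ":r: of"
X = np.array([[5., 0., 1.], [4., 1., 0.], [0., 6., 2.], [1., 5., 3.]])
print(kmeans(X, k=2))
```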

Context features and raw counts for one word (:l: = word in the left context, :r: = word in the right context):

Left: :l: a 7454, :l: on 3837, :l: , 3470, :l: and 1234, :l: <s> 1129, :l: the 945, :l: of 819, :l: or 814, :l: with 729, :l: my 717, :l: your 647, :l: to 587, :l: board 566, :l: ur 549

Right: :r: , 4657, :r: and 2504, :r: </s> 1765, :r: the 1641, :r: a 1244, :r: with 1104, :r: to 990, :r: or 982, :r: in 940, :r: . 913, :r: for 778, :r: on 604, :r: is 529, :r: that 504, :r: ) 483, :r: i 453

The same kind of features, weighted by mutual information:

:l: erasable 9.673345, :r: eraser 8.665204, :r: erasers 8.495943, :r: drawin 8.420887, :l: elluminate 8.003080, :l: 000gsm 7.752731, :l: mimio 7.585966, :l: mottle 7.414867, :l: erase 7.293596, :r: dusters 6.897158, :l: magnet 6.747450, :l: wall-mounted 6.711939, :l: promethean 6.464373, :l: kanban 6.330404, :r: whiteboard 6.208234, :l: non-magnetic 6.169274, :r: colourfully 6.158004, :r: easel 6.125531, :r: pens 6.099751, :l: mottled 6.057417, :l: sided 5.992384, :l: virtual 5.946798, :r: brainstorming 5.908857, :l: low-tech 5.819000, :l: yre 5.796874, :r: shovel 5.767510, :r: refrigerator 5.759568, :l: ur 5.720380
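The weights are presumably pointwise mutual information between the word and each context feature, consistent with the “Mutual information” bullet earlier (an assumption; the exact weighting may differ). A toy computation with made-up counts:

```python
import math

def pmi(joint, word_total, feat_total, grand_total):
    """PMI(word, feat) = log [ P(word, feat) / (P(word) * P(feat)) ]."""
    return math.log((joint / grand_total) /
                    ((word_total / grand_total) * (feat_total / grand_total)))

# A rare but diagnostic neighbor scores far higher than a common one.
print(pmi(joint=50, word_total=9000, feat_total=60, grand_total=10**9))     # ~11.4
print(pmi(joint=945, word_total=9000, feat_total=10**7, grand_total=10**9)) # ~2.4
```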

Outline
- Introduction: Motivation, Deliverables
- N-gram data sets
- Tasks: Lexical resources, N-gram search tools, Distributional clustering, Applications
- Timeline and Evaluation

Correct tagging errors with the “one tag sequence per n-gram” hypothesis.

Determine the tags of out-of-vocabulary words, i.e. words that never occurred in the training data, e.g. “webmail”:
FW 592, JJ 43711, NN 68093, VB 2672, VBP 13

Missing POS tag possibilities in Penn Treebank: e.g., lower bound 

JJR JJ 31576,  JJR VBN 243992, RBR JJ 75628, RBR VBN 37294, VB JJ 73, VB VBN 68594,  VBP VBN 1055

Evaluation of the improved tagger.

a water pump: DT|NN|NN|8864, DT|NN|VB|1775, DT|NN|VBP|89, FW|NN|NN|9

touch base with: NN|NN|IN|2352, VB|JJ|IN|2, VB|NN|IN|13913, VBP|NN|IN|584

go to school: NN|TO|NN|2, VB|TO|NN|227922, VB|TO|VB|36208, VBP|TO|NN|54820, VBP|TO|VB|23989
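A minimal illustration of applying the hypothesis to the “a water pump” counts above (not the workshop code): tag every occurrence independently, then adopt the majority tag sequence everywhere:

```python
from collections import Counter

# Tag-sequence votes for "a water pump", as shown on the slide.
votes = Counter({
    ("DT", "NN", "NN"): 8864,
    ("DT", "NN", "VB"): 1775,
    ("DT", "NN", "VBP"): 89,
    ("FW", "NN", "NN"): 9,
})

best, count = votes.most_common(1)[0]
print("a water pump ->", "|".join(best), f"({count}/{sum(votes.values())})")
# a water pump -> DT|NN|NN (8864/10737)
```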

Mistakes are easy to correct when they are in the minority.

For some n-grams, however, the tagger may make more mistakes than correct ones.

Example:
water the area: NN|DT|NN|427, VB|DT|NN|236
go to work: VB|TO|NN|62, VB|TO|VB|245033, VBP|TO|NN|2, VBP|TO|VB|75892

Evidence from related word forms and contexts can correct these:

water the area: NN|DT|NN|427, VB|DT|NN|236
watered the area: VBD|DT|NN|45
watering the area: NN|DT|NN|1, VBG|DT|NN|130

go to work: VB|TO|NN|62, VB|TO|VB|245033, VBP|TO|NN|2, VBP|TO|VB|75892
go from work: VB|IN|NN|408, VB|IN|VB|5, VBP|IN|NN|181, VBP|IN|VB|1

Base NP structure: “airport parking shuttle bus stop”

Conjunctions:
- dining room and atrium
- private room and bathroom and board allowance
- dining room and bedroom furniture

These require bi-lexical statistics and are mostly ignored by current state-of-the-art parsers.
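A toy sketch of one way bi-lexical counts can resolve such structures, using the simple adjacency heuristic for three-word compounds; both the heuristic and the counts are illustrative assumptions, not the workshop's method:

```python
def bracket(w1, w2, w3, count):
    """Left-branching [[w1 w2] w3] if (w1, w2) is the stronger
    collocation; otherwise right-branching [w1 [w2 w3]]."""
    if count(w1, w2) >= count(w2, w3):
        return f"[[{w1} {w2}] {w3}]"
    return f"[{w1} [{w2} {w3}]]"

# hypothetical bigram counts standing in for real n-gram lookups
toy = {("plastic", "water"): 100, ("water", "bottle"): 8000}
print(bracket("plastic", "water", "bottle", lambda a, b: toy.get((a, b), 0)))
# [plastic [water bottle]]
```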

Potential synergy with the Parse-the-Web team.

Frequency counts: previously tried, but we have more of them.

Distributional clusters as features: conjuncts generally belong to the same cluster.

Parsing results also serve as an evaluation of the clusters.

Outline
- Introduction: Motivation, Deliverables
- N-gram data sets
- Tasks: Lexical resources, N-gram search tools, Distributional clustering, Applications
- Timeline and Evaluation

Week 1:
- Get Hadoop running on the Google Academic Cluster
- Extract distributional features from N-grams
- Implement the N-gram pattern matching language
- Brainstorm on finding observable proxies for unobservable lexical properties

Week 2-3:
- Implement a MapReduce version of K-means clustering
- Implement one-tag-sequence per 3-5gram; identify hard cases
- Lexical property acquisition, error analysis, iterate

Week 4-5:
- Add clusters to n-gram search and result summarization
- Experiment with syntactic ambiguity resolution

Week 6:
- Wrap up and/or pursue opportunities that may arise during the workshop

Adequacy of n-grams?
- Limitations: length, frequency threshold
- Zellig Harris' Distributional Hypothesis

Ngrams → Linguistics?
Ngrams → Lexicon?
Ngrams → Parsing?
Ngrams → Catalan constructions?

Conjunctions? PPs? NN sequences? Catalan: all possible binary trees.

Learning curves: “No data like more data.” Yes, but how much is enough?

Open source software tools for n-gram search and distributional clustering.

Broad coverage resources:
- List of phrases
- Lexical properties
- Phrasal clusters

Better POS tagging:
- Improvement over TnT [good]
- State-of-the-art accuracy [excellent]

Better resolution of hard syntactic ambiguities:
- Using n-gram counts and clusters