Fine-Grained LinguisticSoft Constraints on Statistical
Natural Language Processing Models
Yuval MartonPh.D. Dissertation DefenseDepartment of Linguistics
University of Maryland
(Diagram: a unified corpus-based model with soft linguistic constraints, linking syntactic constraints (parsing) in statistical machine translation and semantic constraints (phrases) in statistical machine translation.)
Yuval Marton, Dissertation Defense 2
Dissertation Theme
• Hybrid knowledge/corpus-based statistical NLP models using fine-grained linguistic soft constraints
Syntactic(Parsing)
in stat. machine translation
Semantic(Words)
in word-pair similarity tasks
Semantic(Phrases)
in stat. machine translation
Pure vs. Hybrid Models
• Pure models
  – Corpus-based, data-driven, distributional, statistical
    • Statistical Machine Translation
    • Distributional Profiles (Context Vectors)
  – Manually-crafted linguistic knowledge (rules, word grouping by concept), theory-driven
    • Rule-based / syntax-driven machine translation
    • WordNet/thesaurus-based semantic similarity measures
• Hybrid models
  – Here: bias data-driven models with linguistic constraints
Hard and Soft Constraints
• Hard constraints
  – {0,1}: in/out
  – Decrease the search space
  – Theory-driven
  – Faster, slimmer
• Soft constraints
  – [0..1]: fuzzy
  – Only bias the model
  – Data-driven: let patterns emerge
(Diagrams: a hard constraint carves the universe into an in/out region; a soft constraint only grades regions of the universe.)
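The in/out vs. fuzzy distinction above can be made concrete in a toy sketch (the hypotheses, constraint, and weight below are invented purely for illustration, not taken from the dissertation's systems):

```python
# A hard constraint filters hypotheses out of the search space entirely,
# while a soft constraint only adds a weighted feature to each hypothesis's
# score, so the data can still overrule it.

def hard_filter(hypotheses, constraint):
    """Hard constraint: in/out -- violating hypotheses are discarded."""
    return [h for h in hypotheses if constraint(h)]

def soft_score(hypothesis, base_score, constraint, weight):
    """Soft constraint: violating hypotheses survive; matches get a reward."""
    return base_score + (weight if constraint(hypothesis) else 0.0)

# Toy constraint: the hypothesis starts with a capital letter.
capitalized = lambda h: h[:1].isupper()

hyps = ["The cat sat", "the cat sat"]
print(hard_filter(hyps, capitalized))                     # ['The cat sat']
print(soft_score("the cat sat", -2.0, capitalized, 0.5))  # -2.0: no reward, but still in the race
```

Note how the soft version keeps the non-matching hypothesis competitive; whether it wins is decided by the rest of the model's score.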
Fine-GrainedSoft Linguistic Constraints
• Fine granularity is a big deal
  – Soft syntactic constraints in SMT
    • Chiang 2005 vs. Marton and Resnik 2008
    • Negative results → positive results
  – Soft semantic constraints in word-pair similarity ranking
    • Mohammad and Hirst 2006 vs. Marton, Mohammad and Resnik 2009
    • Positive results → better results
  – Soft semantic constraints in paraphrase generation for SMT
    • Callison-Burch et al. 2006 vs. Marton, Callison-Burch & Resnik 2009
Road Map
✓ Hybrid models with soft constraints
  – Pure and hybrid models
  – Hard and soft constraints
  – Fine-grained
• Soft syntactic constraints
  – In statistical machine translation
• Soft semantic constraints
  – In word-pair similarity tasks
  – In paraphrasing for statistical machine translation
• Unified model
Statistical Machine Translation: Hiero
• Chiang 2005, 2007
• Weighted synchronous CFG
  – Unnamed non-terminals: X → <e, f>, e.g., X → <今年 X1, X1 this year>
• Translation model features, e.g., ϕ3 = log p(e|f)
• Log-linear model, plus a rule-penalty feature and "glue" rules
• These trees are not necessarily "syntactic"!
  – Not syntactic in the linguistic sense
(Example alignments: 的竞选 ↔ election; 在初选投票 ↔ voted in the primaries)
Previous (Coarse) Soft Syntactic Constraints
• X → X1 speech ||| X1 discurso
  – What should be the span of X1?
• Chiang's (2005) constituency feature
  – Reward the rule's score if the rule's source side matches a constituent span
  – Constituency-incompatible emergent patterns can still 'win' (in spite of no reward)
  – Good idea, but a negative result
New (Fine-Grained) Soft Syntactic Constraints
• A separate weighted feature for each constituent type, e.g.:
  – NP-only (NP=)
  – VP-only (VP=)
New Constraint Conditions
• VP-only, revisited:
  – We saw VP-match (VP=): reward an exact match of a VP sub-tree span
  – We can also incur a penalty for crossing constituent boundaries, e.g., VP-cross (VP+)
Constraint (Feature) Space
• {NP, VP, IP, CP, …} × {match (=), cross-boundary (+)}
• Basic translation models:
  – For each feature, add (only it) to the default feature set, assigning it a separate weight.
• Feature "combo" translation models:
  – NP2 (double feature): add both NP= and NP+, with a separate weight for each
  – NP_ (conflated feature): ties the weights of NP= and NP+
  – XP=, XP+, XP2, XP_: conflate all labels that correspond to "standard" X-bar Theory XP constituents in each condition
  – All-labels= (Chiang's), All-labels+, All-labels_, All-labels2
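As a rough illustration of how the match and cross-boundary features could fire for one rule application (the function and the span representation below are assumptions for the sketch, not the actual decoder's code):

```python
# Spans are half-open (start, end) word indices over the source sentence.
# label= fires on an exact constituent match; label+ fires when the rule's
# source span partially overlaps a constituent, i.e., crosses its boundary.

def span_features(span, constituents):
    feats = {}
    s, e = span
    for label, (cs, ce) in constituents:
        if (s, e) == (cs, ce):                    # exact match: reward feature
            feats[label + "="] = feats.get(label + "=", 0) + 1
        elif s < cs < e < ce or cs < s < ce < e:  # partial overlap: crossing feature
            feats[label + "+"] = feats.get(label + "+", 0) + 1
    return feats

# Parse of a 5-word sentence: [NP w0 w1] [VP w2 w3 w4]
parse = [("NP", (0, 2)), ("VP", (2, 5))]
print(span_features((2, 5), parse))  # {'VP=': 1}: rule span matches the VP exactly
print(span_features((1, 3), parse))  # {'NP+': 1, 'VP+': 1}: crosses both boundaries
```

Each fired feature then contributes its own weighted term to the log-linear score, which is what makes the constraint set fine-grained.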
Chinese-English Results
• Replicated the Chiang 2005 constituency feature (negative result)
• NP=, QP+, VP+: up to 0.74 BLEU points better
• XP+, IP2, all-labels_, VP2, NP_: up to 1.65 BLEU points better
• Validated on the NIST MT08 test set
(BLEU score: higher = better; *, **: significantly better than baseline; +, ++: better than the replicated Chiang-05)
Arabic-English Results
• New result for Chiang's constituency feature (MT06, MT08)
• PP+, AdvP=: up to 1.40 BLEU better than Chiang's and baseline
• AP2, AdvP2: up to 1.94 better
• Validated on the NIST MT08 test set
(*, **: significantly better than baseline; +, ++: better than Chiang-05)
PP+ Example: Arabic MT06
Source ... (PP (IN ب) (NP (NP (NN تعيين) (NP (NN مندوب) (NP (NNP سوريا) (NNP لدى)))) (DT ال) (NP (NN امم) (NP (NN ال) (JJ متحدة))))))) …
Gloss …(PP (IN in) (NP (NP (NN appointment) (NP (NN representative) (NP (NNP syria) (NNP to)))) (DT the) (NP (NN nations) (NP (NN the) (JJ united))))))) …
Reference [the third decree ordered] the appointment of the syrian representative to the united nations …
Baseline … to appoint syria to the united nations representative …
PP+ … to appoint a representative of syria to the united nations …
Arabic-English Results – MIRA
• Chiang, Marton and Resnik (2008)
• The earlier feature-selection problem is solved here: MIRA tunes all the fine-grained features jointly
Road Map
✓ Hybrid models with soft constraints
  – Pure and hybrid models
  – Hard and soft constraints
  – Fine-grained
✓ Soft syntactic constraints
  – In statistical machine translation
• Soft semantic constraints
  – In word-pair similarity tasks
  – In paraphrasing for statistical machine translation
• Unified model
Semantic Models
• Forget Frege, alternative worlds, <e,t>, …
• To model the meaning of words, we can use:
  – "Pure" models
    • Knowledge-based: manually crafted linguistic resources (dictionary, thesaurus, taxonomies, WordNet)
    • Usage-based: machine-generated distributional profiles (containing word co-occurrence-based information)
  – Hybrid models
    • Bias distributional profiles with soft semantic constraints
      – As we just saw with soft syntactic constraints
      – E.g., use thesaurus "concepts" as word senses with which to alter co-occurrence counts in distributional profiles
Word-Based Distributional Profiles (DPs)
• Distributional Hypothesis (Harris 1940; Firth 1957)
  – DP (context vector) of "bank": which words "bank" occurs next to
• Strength of association
  – Counts, PMI, TF/IDF-based, log-likelihood ratios, …
• Vector similarity (cosine, L1, L2, …)
(Figure: context vectors of "bank" and "tenure" over dimensions linguist, money, river, teller, water, …; α is the angle between them)
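A minimal sketch of word-based DPs and cosine similarity over a toy corpus (the corpus, window size, and raw-count association strength are illustrative choices; as noted above, PMI or log-likelihood ratios could be used instead):

```python
import math
from collections import Counter

def distributional_profile(target, corpus, window=2):
    """Count words co-occurring with `target` within +-window tokens."""
    dp = Counter()
    for sent in corpus:
        for i, w in enumerate(sent):
            if w == target:
                lo, hi = max(0, i - window), min(len(sent), i + window + 1)
                dp.update(sent[lo:i] + sent[i + 1:hi])
    return dp

def cosine(dp1, dp2):
    """Cosine of the angle (the alpha in the figure) between two context vectors."""
    dot = sum(c * dp2[w] for w, c in dp1.items())
    n1 = math.sqrt(sum(c * c for c in dp1.values()))
    n2 = math.sqrt(sum(c * c for c in dp2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

corpus = [["bank", "teller", "money"],
          ["bank", "money", "deposit"],
          ["river", "water", "bank"],
          ["linguist", "tenure", "paper"]]
bank = distributional_profile("bank", corpus)
teller = distributional_profile("teller", corpus)
linguist = distributional_profile("linguist", corpus)
print(cosine(bank, teller) > cosine(bank, linguist))  # True: "bank" is closer to "teller"
```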
Taxonomies and Groupings
• WordNet
  – Synsets
  – Classical relations ("is-a")
  – Arc distance
  – "The tennis problem"
• Thesaurus
  – Flat lists of related words
  – Potentially coarse
  – Implicit relations, potentially non-classical
(Figure: an is-a taxonomy: professor is-a academic job is-a job; CEO is-a industry job is-a job)
Concept-Based Distributional Profiles
Mohammad & Hirst (2006) – Macquarie Thesaurus
• Word-based DP
• Concept-based DP
  – Approximate senses
  – Aggregated
  – Coarse
• "bank" is listed under several concepts
• DP for each sense
(Figure: the word-based DP of "bank" alongside concept-based DPs for RIVER (bank, boat, wave, …) and FIN.INST (bank, dollar, deposit, …), all over dimensions linguist, money, river, teller, water, …)
Concept-Based Distributional ProfilesMohammad & Hirst (2006) – Macquarie Thesaurus
• How similar are "bank" and "wave"?
• Compare all pairs of senses:
  – FIN.INST, PHYSICS
  – FIN.INST, RIVER
  – RIVER, PHYSICS
  – RIVER, RIVER
• Return the closest sense pair
• Problem: bank = wave ??
(Figure: "bank" falls under RIVER (bank, boat, wave, …) and FIN.INST (bank, dollar, deposit, …); "wave" falls under PHYSICS (amp., wave, freq., …))
New: Word/Concept Hybrid Model (Word Sense DP)
• Given the word's word-based DP and its concept-based DPs:
• Bias the DP of "bank" towards the DP of RIVER, creating bank_RIVER
• Create bank_FIN.INST similarly, etc.
(Figure: the word-based DP of "bank" combined with the concept-based DP of RIVER yields the sense DP bank_RIVER)
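One way the biasing could be sketched (hedged: the two toy functions below only illustrate the idea of filtering or reweighting a word DP by a concept DP; the dissertation's actual hybrid models differ in their exact formulation):

```python
def hybrid_filtered(word_dp, concept_dp):
    """Keep word-DP counts only for context words the concept's DP supports."""
    return {w: c for w, c in word_dp.items() if w in concept_dp}

def hybrid_proportional(word_dp, concept_dp):
    """Scale each word-DP count by the concept DP's relative strength there."""
    total = sum(concept_dp.values())
    return {w: c * concept_dp.get(w, 0) / total for w, c in word_dp.items()}

bank = {"money": 4, "river": 3, "teller": 2}   # toy word-based DP of "bank"
RIVER = {"river": 5, "water": 5}               # toy concept-based DP of RIVER
print(hybrid_filtered(bank, RIVER))      # {'river': 3} -> a bank_RIVER sketch
print(hybrid_proportional(bank, RIVER))  # {'money': 0.0, 'river': 1.5, 'teller': 0.0}
```

Either way, the result is a sense-specific DP (bank_RIVER) that stays at the word level of granularity while being biased by concept-level evidence.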
Fine-Grained Soft Semantic Constraints
• Hybrid models combine the best of both: fine-grained, sense-aware, widely applicable
  – bank_FIN.INST ≠ bank_RIVER ≠ wave_RIVER !
• Two hybrid flavors:
  – Hybrid-filtered
  – Hybrid-proportional
• Pros and cons:
  – Word senses: word-based DPs smear senses; concept-based DPs are sense-aware
  – Relations: co-occurrence vs. semantic relatedness
  – Target granularity: word level (fine) vs. aggregated (coarse)
  – Applicability (vocabulary): wide vs. limited
Evaluation: Word-Pair Similarity Task
• Give each word pair a similarity score
  – rooster – voyage: 0.12
  – coast – shore: 0.93
• Same part-of-speech pairs
  – Noun-noun (Rubenstein & Goodenough, 1965; Finkelstein et al., 2002)
  – Verb-verb (Resnik & Diab, 2000)
• Result: a list of pairs ordered by similarity
• Evaluation metric: Spearman rank correlation
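The evaluation metric can be sketched in pure Python (Spearman's correlation is the Pearson correlation of the two rank vectors, with tied values receiving average ranks, per the standard definition):

```python
def spearman(xs, ys):
    """Spearman rank correlation between two parallel score lists."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and vals[order[j + 1]] == vals[order[i]]:
                j += 1                         # extend over a run of tied values
            for k in range(i, j + 1):
                r[order[k]] = (i + j) / 2 + 1  # average rank for the tie group
            i = j + 1
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Gold vs. model similarity scores for three word pairs:
print(spearman([0.12, 0.93, 0.55], [0.2, 0.9, 0.6]))  # 1.0 -- same ranking
```

Only the ordering matters, which is why the task is framed as ranking the pairs rather than matching the absolute scores.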
Word-Pair Similarity Results
Road Map
✓ Hybrid models with soft constraints
  – Pure and hybrid models
  – Hard and soft constraints
  – Fine-grained
✓ Soft syntactic constraints
  – In statistical machine translation
• Soft semantic constraints
  – In word-pair similarity tasks
  – In paraphrasing for statistical machine translation
• Unified model
Words → Phrases
• Extend the word-based semantic similarity measures to "phrases"
  – she declined to provide any other information …
  – police refused to provide any other details …
• So far: see if y is similar to x. Now: find y's similar to x
• Can solve other problems now!
  – Use these extended phrasal DPs to find good paraphrases of unknown "phrases" in machine translation models
(Figure: context vector of the phrase "to provide any other" over dimensions such as information, money, declined, teller, details, …)
Coverage Problem in Statistical Machine Translation
• Trained on parallel text
• Every new test document contains some "phrases" unknown to the model
(Figure: a Spanish-English parallel training corpus next to a Spanish test set whose unknown phrases have no translation.)
Previous Solution: Pivoting
• Use other parallel texts to increase coverage
• Drawback: parallel text is a limited resource!
(Figure: French-Spanish and German-Spanish parallel texts are pivoted through Spanish to cover test-set phrases missing from the Spanish-English table.)
New Solution: Monolingually-Derived Paraphrases
• Use monolingual text to increase coverage
• Resources available in abundance!
(Figure: abundant monolingual Spanish text supplies paraphrases Spanish′, Spanish′′, … for test-set phrases missing from the Spanish-English table.)
Find Paraphrases
• Gather all contexts L _ R for the phrase "to provide any other":
• What else appears between L and R?

Left context (L) __ Right context (R):
  declined [to provide any other] details
  refused [to provide any other] information
  unable [to provide any other] details
  failed [to provide any other] explanation
Find Paraphrases
• Gather all contexts L _ R for the phrase "to provide any other":
• What else appears between L and R?
• Measure distributional similarity to each candidate, e.g., "to provide any other" vs. "to give further"

Left context (L) __ Right context (R):
  declined [to give further] details
  refused [to provide any] information
  unable [to reveal any] details
  failed [to provide further] explanation
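The gathering steps above can be sketched as follows (toy corpus; restricting contexts to a single word on each side is a simplifying assumption of this sketch):

```python
from collections import Counter

def contexts_of(phrase, corpus):
    """All (L, R) single-word context pairs surrounding `phrase` (a token list)."""
    n = len(phrase)
    ctxs = set()
    for sent in corpus:
        for i in range(1, len(sent) - n):
            if sent[i:i + n] == phrase:
                ctxs.add((sent[i - 1], sent[i + n]))
    return ctxs

def fillers(ctxs, corpus, max_len=5):
    """Candidate paraphrases: token sequences seen between any gathered (L, R) pair."""
    cands = Counter()
    for sent in corpus:
        for i, left in enumerate(sent):
            for j in range(i + 2, min(len(sent), i + 2 + max_len)):
                if (left, sent[j]) in ctxs:
                    cands[" ".join(sent[i + 1:j])] += 1
    return cands

corpus = [
    "declined to provide any other details".split(),
    "declined to give further details".split(),
]
phrase = "to provide any other".split()
print(fillers(contexts_of(phrase, corpus), corpus))  # includes "to give further"
```

Each candidate filler would then be scored by distributional similarity to the original phrase before being admitted as a paraphrase.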
Paraphrase Examples (Phrases)
Paraphrase Examples (Unigrams)
Paraphrase Feature Model
• Evidence reinforcement: if there is more than one paraphrase f_i of f, aggregate the score with a "quasi-online" update:
  asim_i = asim_{i-1} + (1 - asim_{i-1}) * sim(f_i, f), where asim_0 = 0
• Analogous to Callison-Burch et al. (2006)
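The quasi-online update above can be written as a tiny function: each additional paraphrase's similarity closes part of the remaining gap to 1, so more evidence can only raise the aggregate score, and it never exceeds 1.

```python
def aggregate_similarity(sims):
    """asim_i = asim_{i-1} + (1 - asim_{i-1}) * sim_i, with asim_0 = 0."""
    asim = 0.0
    for s in sims:
        asim += (1.0 - asim) * s
    return asim

print(aggregate_similarity([0.5]))            # 0.5
print(aggregate_similarity([0.5, 0.5]))       # 0.75 -- a second witness reinforces
print(aggregate_similarity([0.5, 0.5, 0.5]))  # 0.875
```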
English to Chinese Results
• A 29k-line subset was created to emulate a low-density-language setting
(*: better than baseline; +: better than non-hybrid counterpart)
English-Chinese Translation Examples
Spanish to English
Comparison with Corpus Size & Pivoting
Road Map
✓ Hybrid models with soft constraints
  – Pure and hybrid models
  – Hard and soft constraints
  – Fine-grained
✓ Soft syntactic constraints
  – In statistical machine translation
✓ Soft semantic constraints
  – In word-pair similarity tasks
  – In paraphrasing for statistical machine translation
• Unified model
Unified Model
• Soft linguistic constraints in a log-linear model
  – Syntactic
  – Semantic
  – …
• Model score: Σ_i λ_i h_i(x)
• Constraints = add more λ_j h_j(x) terms to the sum:
  Σ_i λ_i h_i(x) + Σ_j λ_j h_j(x)
• h_i: features / constraints
• λ_i: weight / importance of feature i
Unified Model (Soft Syntactic Constraints)
• Straightforward: if Σ_i λ_i ϕ_i(f,e) is a translation model, bias it syntactically, e.g., as follows:
  Σ_i λ_i ϕ_i(f,e) + λ_j ϕ_j(f,e)
  where ϕ_j(f,e) = 1 if the source-language word sequence f is a VP, and 0 otherwise.
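As a toy computation (the feature names and weights below are invented for illustration), the syntactic indicator is just one more weighted term in the log-linear sum:

```python
def loglinear_score(features, weights):
    """sum_i lambda_i * h_i(x) over named features."""
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical hypothesis: base translation-model features plus a soft
# syntactic indicator phi_VP (1 if the source span is a VP, else 0).
feats = {"log_p_e_given_f": -1.2, "rule_penalty": -1.0, "phi_VP": 1.0}
weights = {"log_p_e_given_f": 1.0, "rule_penalty": 0.5, "phi_VP": 0.3}
print(loglinear_score(feats, weights))  # -1.4: the VP match adds +0.3 to the score
```

The same mechanism carries the semantic constraints below: each one is another h_j with its own λ_j, which is what makes the model unified.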
Unified Model (Soft Semantic Constraints)
• Semantic distance of word e in sense s from word e′ in sense s′:

cos(e_s, e′_s′) = [ Σ_i fSense(e,s,w_i)·fSense(e′,s′,w_i)
                  + Σ_i fSense(e,s,w_i)·fWord(e′,w_i)      (cross-term)
                  + Σ_i fWord(e,w_i)·fSense(e′,s′,w_i)     (cross-term)
                  + Σ_i fWord(e,w_i)·fWord(e′,w_i) ] / Z_C

where:
  Σ_i fSense(e,s,w_i)·fSense(e′,s′,w_i) / Z_C = cosSense(e_s, e′_s′)
  Σ_i fWord(e,w_i)·fWord(e′,w_i) / Z_C = K·cosWord(e, e′)
Main Contributions
• A unified corpus-based model with fine-grained linguistic soft constraints:
  – Syntactic (parsing), in statistical machine translation
  – Semantic (words), in word-pair similarity tasks
  – Semantic (phrases), in statistical machine translation
• Evaluated in state-of-the-art end-to-end phrase-based SMT systems
• Distributional paraphrase generation with an evidence reinforcement component
Thanks to…
• Defense Committee:
  – Philip Resnik, Chair/Advisor
  – Amy Weinberg, Advisor
  – William Idsardi, Member
  – Chris Callison-Burch, Special Member (JHU)
  – Bonnie Dorr, Dean's Representative
• Ling Chair:
  – Norbert Hornstein
• Ling Cohort:
  – Ellen Lau
  – Phil Monahan
  – Eri Takahashi
  – Rebecca McKeown
  – Chizuru Nakao
• CLIP Lab:
  – David Chiang, Smara Muresan, Hendra Setiawan, Adam Lopez, Chris Dyer, Asad Sayeed, Vlad Eidelman, Zhongqiang Huang, Denis Filimonov, and many others!