
Deep Learning for Semantic Composition

Xiaodan Zhu∗ & Edward Grefenstette†

∗National Research Council Canada / Queen’s University, [email protected]

†DeepMind, [email protected]

July 30th, 2017


Outline

1 Introduction
    Semantic composition
    Formal methods
    Simple parametric models

2 Parameterizing Composition Functions
    Recurrent composition models
    Recursive composition models
    Convolutional composition models
    Unsupervised models

3 Selected Topics
    Compositionality and non-compositionality
    Subword composition methods

4 Summary


Principle of Compositionality

Principle of compositionality: The meaning of a whole is a function of the meaning of the parts.

While we focus on natural language, compositionality exists not just in language.

Sound/music

Music notes are composed with some regularity, but not randomly arranged, to form a song.

Vision

Natural scenes are composed of meaningful components. Artificial visual art pieces often convey certain meaning with regularity from their parts.


Compositionality is regarded by many as a fundamental component of intelligence in addition to language understanding (Miller et al., 1976; Fodor et al., 1988; Bienenstock et al., 1996; Lake et al., 2016).

For example, Lake et al. (2016) emphasize several essential ingredients for building machines that “learn and think like people”:

    Compositionality
    Intuitive physics/psychology
    Learning-to-learn
    Causality models

Note that many of these challenges are present in natural language understanding. They are reflected in the sparseness encountered when training an NLP model.

Note also that compositionality may be entangled with the other “ingredients” listed above.


Semantic Composition in Natural Language

good → very good → not very good → ...


Figure: Results from (Zhu et al., 2014). A dot in the figure corresponds to a negated phrase (e.g., not very good) in the Stanford Sentiment Treebank (Socher et al., 2013). The y-axis is its sentiment value and the x-axis the sentiment of its argument.

Even a one-layer composition, over one dimension of meaning (e.g., semantic orientation (Osgood et al., 1957)), could be a complicated mapping.


senator → former senator → ...

basketball player → short basketball player → ...

giant → small giant → ...

empty/full → half empty/full → almost half empty/full → ... [1]

[1] See more examples in (Partee, 1995).


Semantic composition in natural language: the task of modelling the meaning of a larger piece of text by composing the meaning of its constituents.

modelling: learning a representation

    The compositionality in language is very challenging, as discussed above. Compositionality can entangle with other challenges such as those emphasized in (Lake et al., 2016).

a larger piece of text: a phrase, sentence, or document.

constituents: subword components, words, phrases.


Introduction

Two key problems:

How to represent meaning?

How to learn such a representation?


Representation

Let’s first very briefly revisit the representation we assume in this tutorial... and leave the learning problem to the entire tutorial that follows.

Love


love, admiration, satisfaction ...
anger, fear, hunger ...


A viewpoint from The Emotion Machine (Minsky, 2006)

Each variable responds to different concepts and each concept is represented by different variables.

This is exactly a distributed representation.


Modelling Composition Functions

How do we model the composition functions?


Deep Learning for Semantic Composition

Deep learning: We focus on deep learning models in this tutorial.

“Wait a minute, deep learning again?” “DL people, leave language alone ...”

Asking some questions may be helpful:

    Do deep learning models provide good function or density approximation, which is what many specific NLP tasks essentially seek to solve? (X → Y)

    Are continuous vector representations of meaning effective for (at least some) NLP tasks? Are DL models convenient for computing such continuous representations?

    Do DL models naturally bridge language with other modalities in terms of both representation and learning? (This could be important.)


More questions:

What NLP problems (e.g., the semantic problems discussed here) can be better handled with DL, and what cannot?

Can NLP benefit from combining DL and other approaches (e.g., symbolic approaches)?

In general, is the effectiveness of DL models for semantics already well understood?


Formal Semantics

Montague Semantics (1970–1973):

Treat natural language like a formal language via

    an interpretation function [[...]], and
    a mapping from CFG rules to function application order.

Interpretation of a sentence reduces to logical form via β-reduction.

High Level Idea

Syntax guides composition, types determine their semantics, and predicate logic does the rest.


Syntactic Analysis            Semantic Interpretation
S  ⇒ NP VP                    [[VP]]([[NP]])
NP ⇒ cats, milk, etc.         [[cats]], [[milk]], ...
VP ⇒ Vt NP                    [[Vt]]([[NP]])
Vt ⇒ like, hug, etc.          λy.λx.[[like]](x, y), ...

Derivation tree for “Cats like milk.”:

    [[like]]([[cats]], [[milk]])
    ├── [[cats]]
    └── λx.[[like]](x, [[milk]])
        ├── λy.λx.[[like]](x, y)
        └── [[milk]]
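To make the application order concrete, here is a toy Python sketch of this style of interpretation; the one-fact model LIKES and the curried like function are invented for this example, not taken from the tutorial:

```python
# A toy sketch of Montague-style interpretation with Python lambdas.
# LIKES is an invented one-fact logical model; [[like]] is curried as
# λy.λx.like(x, y), so syntax dictates the order of function application.
LIKES = {("cats", "milk")}

cats, milk = "cats", "milk"
like = lambda y: lambda x: (x, y) in LIKES   # λy.λx.[[like]](x, y)

vp = like(milk)     # [[Vt]]([[NP]]) = λx.[[like]](x, [[milk]])
print(vp(cats))     # [[VP]]([[NP]]) = [[like]]([[cats]], [[milk]]) -> True
```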


Pros:

Intuitive and interpretable(?) representations.

Leverage the power of predicate logic to model semantics.

Evaluate the truth of statements, derive conclusions, etc.

Cons:

Brittle, requires robust parsers.

Extensive logical model required for evaluation of clauses.

Extensive set of rules required to do anything useful.

Overall, an intractable (or unappealing) learning problem.


Simple Parametric Models

Basic models with a pre-defined function form (Mitchell et al., 2008):

    General form:    p = f(u, v, R, K)
    Add:             p = u + v
    WeightAdd:       p = αᵀu + βᵀv
    Multiplicative:  p = u ⊗ v
    Combined:        p = αᵀu + βᵀv + γᵀ(u ⊗ v)

We will see later in this tutorial that the above models could be seen asspecial cases of more complicated composition models.
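As one concrete reading of these forms, here is a minimal NumPy sketch, treating the weighted models as scalar weightings (a plausible interpretation, not the only one; the vectors and weights below are made up):

```python
import numpy as np

u = np.array([0.2, 0.7, 0.1])        # e.g. the vector for "very"
v = np.array([0.9, 0.3, 0.5])        # e.g. the vector for "good"
alpha, beta, gamma = 0.4, 0.6, 0.5   # illustrative weights

p_add = u + v                                        # Add
p_weight_add = alpha * u + beta * v                  # WeightAdd
p_mult = u * v                                       # Multiplicative (element-wise)
p_combined = alpha * u + beta * v + gamma * (u * v)  # Combined
```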


Results

Reference (R):                The color ran.
High-similarity landmark (H): The color dissolved.
Low-similarity landmark (L):  The color galloped.

A good composition model should give the R-H pair a similarity score higher than that given to the R-L pair. Also, a good model should assign similarity scores that correlate highly (ρ) with human judgments.

Models      R-H similarity  R-L similarity  ρ
NonComp     0.27            0.26            0.08**
Add         0.59            0.59            0.04*
WeightAdd   0.35            0.34            0.09**
Kintsch     0.47            0.45            0.09**
Multiply    0.42            0.28            0.17**
Combined    0.38            0.28            0.19**
UpperBound  4.94            3.25            0.40**

Table: Mean cosine similarities for the R-H pairs and R-L pairs, as well as the correlation coefficients (ρ) with human judgments (*: p < 0.05, **: p < 0.01).


Parameterizing Composition Functions

To move beyond simple algebraic or parametric models we need function approximators which, ideally:

    Can approximate any arbitrary function (e.g. ANNs).
    Can cope with variable-size sequences.
    Can capture long-range or unbounded dependencies.
    Can implicitly or explicitly model structure.
    Can be trained against a supervised or unsupervised objective (or both, i.e. semi-supervised training).
    Can be trained chiefly through backpropagation.

A Neural Network Model Zoo

This section presents a selection of models satisfying some (if not all) of these criteria.


Recurrent Neural Networks

Bounded Methods

Many methods impose explicit or implicit length limits on conditioning information. For example:

    the order-n Markov assumption in NLM/LBL;
    fully-connected layers and dynamic pooling in conv-nets.

Recurrent Neural Networks instead introduce a repeatedly composable unit, the recurrent cell, which both models an unbounded sequence prefix and expresses a function over it.


The Mathematics of Recurrence

Figure: a recurrent cell maps an input wj and the previous state hj−1 to the next state hj and an output f(w1:j).

Building Blocks

    An input vector wj ∈ R^|w|
    A previous state hj−1 ∈ R^|h|
    A next state hj ∈ R^|h|
    An output yj ∈ R^|y|
    fy : R^|w| × R^|h| → R^|y|
    fh : R^|w| × R^|h| → R^|h|

Putting it together

    hj = fh(wj, hj−1)
    yj = fy(wj, hj)

So yj = fy(wj, fh(wj, hj−1)) = fy(wj, fh(wj, fh(wj−1, hj−2))) = ...
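A minimal NumPy sketch of this recurrence, with fh and fy realised as single affine layers (dimensions and initialisation are illustrative assumptions; later sketches in this section reuse these definitions):

```python
import numpy as np

rng = np.random.default_rng(0)
w_dim, h_dim, y_dim = 4, 8, 3
W_wh = rng.normal(scale=0.1, size=(h_dim, w_dim))
W_hh = rng.normal(scale=0.1, size=(h_dim, h_dim))
W_wy = rng.normal(scale=0.1, size=(y_dim, w_dim))
W_hy = rng.normal(scale=0.1, size=(y_dim, h_dim))

def f_h(w_j, h_prev):
    """Next state: h_j = f_h(w_j, h_{j-1})."""
    return np.tanh(W_wh @ w_j + W_hh @ h_prev)

def f_y(w_j, h_j):
    """Output: y_j = f_y(w_j, h_j)."""
    return W_wy @ w_j + W_hy @ h_j

h = np.zeros(h_dim)
for w_j in rng.normal(size=(5, w_dim)):   # a toy length-5 input sequence
    h = f_h(w_j, h)
    y = f_y(w_j, h)
```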


RNNs for Language Modelling

Language modelling

We want to model the joint probability of tokens t1, ..., tn in a sequence:

    P(t1, ..., tn) = P(t1) ∏_{i=2}^{n} P(ti | t1, ..., ti−1)

Adapting a recurrence for basic LM

For vocab V, define an embedding matrix E ∈ R^{|V|×|w|} and a logit projection matrix WV ∈ R^{|y|×|V|}. Then:

    wj = embed(tj, E)
    hj = fh(wj, hj−1)
    yj = fy(wj, hj)
    pj = softmax(yj WV)
    P(tj+1 | t1, ..., tj) = Categorical(tj+1; pj)
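Continuing the NumPy sketch above, one language-model step might look as follows (the vocabulary size, E and W_V are illustrative assumptions):

```python
vocab_size = 10
E = rng.normal(scale=0.1, size=(vocab_size, w_dim))    # embedding matrix
W_V = rng.normal(scale=0.1, size=(y_dim, vocab_size))  # logit projection

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def lm_step(token_id, h_prev):
    w_j = E[token_id]                    # w_j = embed(t_j, E)
    h_j = f_h(w_j, h_prev)               # recurrence
    p_j = softmax(f_y(w_j, h_j) @ W_V)   # distribution over the next token
    return p_j, h_j

p_1, h_1 = lm_step(3, np.zeros(h_dim))   # P(t_2 | t_1 = 3)
```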


Aside: The Vanishing Gradient Problem and LSTM RNNs

RNNs are deep “in time”, so they can seriously suffer from the vanishing gradient issue.

The LSTM configures memory cells and multiple “gates” to control information flow. If properly learned, an LSTM can keep pretty long-distance (hundreds of time steps) information in memory.

Memory-cell details:

    it = σ(Wxi xt + Whi ht−1 + Wci ct−1)
    ft = σ(Wxf xt + Whf ht−1 + Wcf ct−1)
    ct = ft ⊙ ct−1 + it ⊙ tanh(Wxc xt + Whc ht−1)
    ot = σ(Wxo xt + Who ht−1 + Wco ct)
    ht = ot ⊙ tanh(ct)
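A minimal NumPy version of the cell above (peephole weights W_c· included, biases omitted; parameter shapes are assumptions, and rng, w_dim and h_dim come from the earlier RNN sketch):

```python
names = ["Wxi", "Whi", "Wci", "Wxf", "Whf", "Wcf",
         "Wxc", "Whc", "Wxo", "Who", "Wco"]
params = {n: rng.normal(scale=0.1,
                        size=(h_dim, w_dim if n.startswith("Wx") else h_dim))
          for n in names}

def lstm_step(x_t, h_prev, c_prev, p):
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    i_t = sig(p["Wxi"] @ x_t + p["Whi"] @ h_prev + p["Wci"] @ c_prev)  # input gate
    f_t = sig(p["Wxf"] @ x_t + p["Whf"] @ h_prev + p["Wcf"] @ c_prev)  # forget gate
    c_t = f_t * c_prev + i_t * np.tanh(p["Wxc"] @ x_t + p["Whc"] @ h_prev)
    o_t = sig(p["Wxo"] @ x_t + p["Who"] @ h_prev + p["Wco"] @ c_t)     # output gate
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

h_t, c_t = lstm_step(rng.normal(size=w_dim),
                     np.zeros(h_dim), np.zeros(h_dim), params)
```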


Conditional Language Models

Conditional Language Modelling

A strength of RNNs is that hj can model not only the history of the generated/observed sequence t1, ..., tj, but also any conditioning information β, e.g. by setting h0 = β.


Encoder-Decoder Models with RNNs

Figure: an encoder RNN reads the source “Les chiens aiment les os” and a decoder RNN generates the target “Dogs love bones </s>”.

cf. Kalchbrenner et al., 2013; Sutskever et al., 2014

Model p(t1, ..., tn | s1, ..., sm):

    h^e_i = RNN_encoder(s_i, h^e_{i−1})
    h^d_i = RNN_decoder(t_i, h^d_{i−1})
    h^d_0 = h^e_m
    t_{i+1} ∼ Categorical(t; f_V(h^d_i))

The encoder RNN as a composition module

All information needed to transduce the source into the target sequence using RNN_decoder needs to be present in the start state h^d_0. This start state is produced by RNN_encoder, which will therefore learn to compose.
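Continuing the running sketch, a toy encoder-decoder pass (reusing the same cell f_h for both RNNs for brevity; real systems use separately parameterised encoder and decoder, and E_src is an assumed source-side embedding matrix):

```python
src_ids, tgt_ids = [1, 4, 2], [3, 0]                    # toy token ids
E_src = rng.normal(scale=0.1, size=(vocab_size, w_dim))

h_e = np.zeros(h_dim)
for s in src_ids:              # h^e_i = RNN_encoder(s_i, h^e_{i-1})
    h_e = f_h(E_src[s], h_e)

h_d = h_e                      # h^d_0 = h^e_m: the composed source meaning
for t in tgt_ids:              # decoder consumes targets, predicts the next one
    p_next, h_d = lm_step(t, h_d)
```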


RNNs as Sentence Encoders

This idea of RNNs as sentence encoders works for classification as well:

    Data is labelled sequences (s1, ..., s|s|; y).
    The RNN is run over s to produce the final state h|s| = RNN(s).
    A differentiable function of h|s| classifies: y = fθ(h|s|).
    h|s| can be taken to be the composed meaning of s, with regard to the task at hand.

An aside: Bi-directional RNN encoders

For both sequence classification and generation, sometimes a bi-directional RNN is used to encode:

    h←_i = RNN←(s_i, h←_{i+1})        h→_i = RNN→(s_i, h→_{i−1})
    h|s| = concat(h←_1, h→_{|s|})
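A minimal sketch of this bi-directional encoding, again reusing the toy cell f_h in both directions (a real model would use separate forward and backward parameters):

```python
def encode_bidirectional(word_vecs):
    h_fwd, h_bwd = np.zeros(h_dim), np.zeros(h_dim)
    fwd_states, bwd_states = [], []
    for w in word_vecs:                  # left-to-right pass
        h_fwd = f_h(w, h_fwd)
        fwd_states.append(h_fwd)
    for w in word_vecs[::-1]:            # right-to-left pass
        h_bwd = f_h(w, h_bwd)
        bwd_states.insert(0, h_bwd)
    # h_|s| = concat(h<-_1, h->_|s|): a fixed-width sentence vector
    return np.concatenate([bwd_states[0], fwd_states[-1]])

sentence_vec = encode_bidirectional(rng.normal(size=(6, w_dim)))
```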


A Transduction Bottleneck

Figure: the same encoder-decoder, with the entire source sentence squeezed through a single vector.

A single-vector representation of sentences causes problems:

    Training focuses on learning a marginal language model of the target language first.
    Longer input sequences cause compressive loss.
    The encoder gets a significantly diminished gradient.

In the words of Ray Mooney ...

“You can’t cram the meaning of a whole %&!$ing sentence into a single $&!*ing vector!” Yes, the censored-out swearing is copied verbatim.


Attention

Figure: the encoder-decoder again, with the decoder attending over the encoder states while generating “Dogs love bones </s>”.

cf. Bahdanau et al., 2014

We want to use h^e_1, ..., h^e_m when predicting t_i, by conditioning on source words that might relate to t_i:

    1. Compute h^d_i (RNN update)
    2. e_ij = f_att(h^d_i, h^e_j)
    3. a_ij = softmax(e_i)_j
    4. h^att_i = Σ_{j=1}^{m} a_ij h^e_j
    5. h_i = concat(h^d_i, h^att_i)
    6. t_{i+1} ∼ Categorical(t; f_V(h_i))

The many faces of attention

Many variants exist on the above process: early attention (based on h^d_{i−1} and t_i, used to update h^d_i), different attentive functions f_att (e.g. based on projected inner products, or MLPs), and so on. A sketch of the basic process follows.


Attention and Composition

We refer to the set of source activation vectors h^e_1, ..., h^e_m in the previous slides as an attention matrix. Is it a suitable sentence representation?

Pros:

    Locally compositional: vectors contain information about other words (especially with a bi-directional RNN as the encoder).
    Variable-size sentence representation: longer sentences yield a larger representation with more capacity.

Cons:

    A single-vector representation of sentences is convenient (many decoders, classifiers, etc. expect fixed-width feature vectors as input); an attention matrix is not fixed-width.
    Locally compositional, but are long-range dependencies resolved in the attention matrix? Does it truly express the sentence’s meaning as a semantic unit (or is it just good for sequence transduction)?


Recursive Neural Networks

Recursive networks: a generalization of (chain) recurrent networks with a computational graph, often a tree (Pollack, 1990; Francesconi et al., 1997; Socher et al., 2011a,b,c, 2013; Zhu et al., 2015b). A minimal sketch of tree-structured composition follows.
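The sketch reuses the embedding matrix E from the language-model code and composes child vectors bottom-up with one affine-plus-tanh function; the nested-tuple tree encoding is an illustrative assumption:

```python
W_left = rng.normal(scale=0.1, size=(w_dim, w_dim))
W_right = rng.normal(scale=0.1, size=(w_dim, w_dim))

def compose(left_vec, right_vec):
    # the composition function applied at each internal tree node
    return np.tanh(W_left @ left_vec + W_right @ right_vec)

def tree_rnn(node):
    if isinstance(node, int):                # leaf: a word id
        return E[node]
    left, right = node                       # internal node: compose children
    return compose(tree_rnn(left), tree_rnn(right))

# e.g. the parse ((the cat) (sat down)) as nested word ids
sentence_vec = tree_rnn(((0, 1), (2, 3)))
```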


Successfully applied to model structured input data:

    Natural language processing (Socher et al., 2011a,c; Le et al., 2015; Tai et al., 2015; Zhu et al., 2015b)
    Computer vision (Socher et al., 2011b)

How to determine the structures?

    Encode given “external” knowledge about the structure of the input data, e.g., syntactic structures; jointly modelling sentential semantics and syntax is one of the most interesting problems in language.
    Or simply encode a complete tree.


Integrating Syntactic Parses in Composition

Recursive Neural Tensor Network (Socher et al., 2012):

    The structure is given (here by a constituency parser).
    Each node is implemented as a regular feed-forward layer plus a 3rd-order tensor.
    The tensor captures 2nd-degree (quadratic) polynomial interactions of the children, e.g., b_i^2, b_i c_j, and c_j^2.


Results

The models have been successfully applied to a number of tasks such as sentiment analysis (Socher et al., 2013).

Table: Accuracy for fine-grained (5-class) and binary predictions at the sentence level (root) and for all nodes.


Tree-LSTM

Tree-structured LSTM (Le, *SEM-15; Tai, ACL-15; Zhu, ICML-15): an extension of the chain LSTM to tree structures.

If you have a non-binary tree, a simple solution is to binarize it.


Tree-LSTM Application: Sentiment Analysis

Sentiment composed over a constituency parse tree:


Results on Stanford Sentiment Treebank (Zhu et al., 2015b):

Models     Roots  Phrases
NB         41.0   67.2
SVM        40.7   64.3
RvNN       43.2   79.0
RNTN       45.7   80.7
Tree-LSTM  48.9   81.9

Table: Performance (accuracy) of models on the Stanford Sentiment Treebank, at the sentence level (roots) and the phrase level.


Tree-LSTM Application: Natural Language Inference

Applied to Natural Language Inference (NLI): determine whether a sentence entails another, whether they contradict each other, or whether they have no relation (Chen et al., 2017).


Accuracy on the Stanford Natural Language Inference (SNLI) dataset: (Chen et al., 2017)

* Welcome to the poster at 6:00-9:30pm on July 31.


Learning Representation for Natural Language Inference

RepEval-2017 Shared Task (Williams et al., 2017): learn sentence representations as fixed-length vectors.


Tree-LSTM without Syntactic Parses

What if we simply apply recursive networks over trees that are not generated from syntactic parses, e.g., complete binary trees?

    Multiple efforts on SNLI (Munkhdalai et al., 2016; Chen et al., 2017) have observed that such models outperform sequential (chain) LSTMs.
    This could be related to the discussion that recursive nets may capture long-distance dependencies (Goodfellow et al., 2016).


SPINN: Doing Away with Test-Time Trees

Figure: a shift-reduce trace over “the cat sat down” (t = 0 ... 7): words are shifted from the buffer onto the stack and reduced pairwise until a single node, ((the cat) (sat down)), is output to the model for the semantic task. Image credit: Sam Bowman and co-authors. cf. Bowman et al., 2016

Shift-Reduce Parsers:

    Exploit the isomorphism between binary branching trees with T leaves and sequences of 2T − 1 binary shift/reduce actions.
    Shift unattached leaves from a buffer onto a processing stack.
    Reduce the top two child nodes on the stack to a single parent node.

SPINN: Jointly train a TreeRNN and a vector-based shift-reduce parser.

    Training-time trees offer supervision for the shift-reduce parser.
    No need for test-time trees!


Figure: one step of SPINN: a tracker reads the tops of the buffer and stack, predicts a shift or reduce transition, and the composition function reduces the top two stack entries into a single TreeRNN node. Image credit: Sam Bowman and co-authors.

    Word vectors start on the buffer b (top: first word in the sentence).
    Shift moves word vectors from the buffer to the stack s.
    Reduce pops the top two vectors off the stack, applies f_R : R^d × R^d → R^d, and pushes the result back onto the stack (i.e. TreeRNN composition).
    A tracker LSTM tracks the parser/composer state across operations, decides the shift-reduce operations a, and is supervised by both the observed shift-reduce operations and the end task:

    h_t = LSTM(f_C(b_{t−1}[0], s_{t−1}[0], s_{t−1}[1]), h_{t−1})        a_t ∼ f_A(h_t)

A sketch of the shift-reduce composition loop follows.
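This is a hedged sketch reusing compose from the tree sketch above as f_R; the action sequence here is given rather than predicted by a tracker LSTM:

```python
def spinn_compose(word_vecs, actions):
    buffer = list(word_vecs)[::-1]        # top of buffer = first word
    stack = []
    for act in actions:
        if act == "shift":
            stack.append(buffer.pop())    # move a word vector onto the stack
        else:                             # "reduce"
            right, left = stack.pop(), stack.pop()
            stack.append(compose(left, right))   # f_R: TreeRNN composition
    return stack[-1]                      # root vector for the semantic task

# ((the cat) (sat down)): T = 4 leaves, hence 2T - 1 = 7 actions
actions = ["shift", "shift", "reduce", "shift", "shift", "reduce", "reduce"]
root = spinn_compose(E[[0, 1, 2, 3]], actions)
```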


A Quick Introduction to Reinforce

What if some part of our process is not differentiable (e.g. samples from the shift-reduce module in SPINN) but we want to learn with no labels ...

Figure: a latent variable z is introduced between the input x and the output y.

    p(y|x) = E_{p_θ(z|x)} [f_φ(z, x)]    s.t.  y ∼ f_φ(z, x)  or  y = f_φ(z, x)

    ∇_φ p(y|x) = Σ_z p_θ(z|x) ∇_φ f_φ(z, x) = E_{p_θ(z|x)} [∇_φ f_φ(z, x)]

    ∇_θ p(y|x) = Σ_z f_φ(z, x) ∇_θ p_θ(z|x) = ???


The Reinforce Trick (R. J. Williams, 1992)

    ∇_θ log p_θ(z|x) = ∇_θ p_θ(z|x) / p_θ(z|x)   ⇒   ∇_θ p_θ(z|x) = p_θ(z|x) ∇_θ log p_θ(z|x)

    ∇_θ p(y|x) = Σ_z f_φ(z, x) ∇_θ p_θ(z|x)
               = Σ_z f_φ(z, x) p_θ(z|x) ∇_θ log p_θ(z|x)
               = E_{p_θ(z|x)} [f_φ(z, x) ∇_θ log p_θ(z|x)]

This naturally extends to cases where p(z|x) = p(z1, ..., zn|x).

RL vocab: samples of such sequences of discrete actions are referred to as “traces”. We often refer to p_θ(z|x) as a policy π_θ(z; x).
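A minimal Monte Carlo sketch of the estimator: average f_φ(z, x) ∇_θ log p_θ(z|x) over sampled traces. Here p_θ(z|x) is a toy categorical over three actions and the reward function is invented for illustration:

```python
theta = rng.normal(size=3)                 # logits of p_theta(z|x)

def reward(z):                             # stands in for f_phi(z, x)
    return float(z == 2)

def reinforce_gradient(n_samples=1000):
    probs = softmax(theta)
    grad = np.zeros_like(theta)
    for _ in range(n_samples):
        z = rng.choice(3, p=probs)         # sample a trace z ~ p_theta(z|x)
        grad_log_p = np.eye(3)[z] - probs  # grad_theta log softmax(theta)_z
        grad += reward(z) * grad_log_p
    return grad / n_samples

print(reinforce_gradient())  # pushes probability mass toward the rewarded z = 2
```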


SPINN+RL: Doing Away with Training-Time Trees

“Drop in” extension to SPINN (Yogatama et al., 2016):

    Treat a_t ∼ f_A(h_t) as a policy π_θ^A(a_t; h_t), trained via Reinforce.
    Reward is the negated loss of the end task, e.g. the log-likelihood of the correct label.
    Everything else is trained by backpropagation against the end task: the tracker LSTM, representations, etc. receive gradient both from the supervised objective and from Reinforce via the shift-reduce policy.

Figure: induced parses of “a woman wearing sunglasses is frowning .” and “a boy drags his sleds through the snow .”

The model recovers linguistic-like structures (e.g. noun phrases, auxiliary verb-verb pairing, etc.).


Does RL-SPINN work? According to Yogatama et al. (2016):

    Better than LSTM baselines: the model captures and exploits structure.
    Better than SPINN benchmarks: the model is not biased by what linguists think trees should be like; it only has a loose inductive bias towards tree structures.
    But some parses do not reflect the order of composition (see below). A semi-supervised setup may be sensible.

Figure: induced parses of “two men are playing frisbee in the park .” and “family members standing outside a home .”

Some “bad” parses, but not necessarily worse results.


Convolutional Neural Networks

Visual inspiration: How do we learn to recognise pictures? Will a fully connected neural network do the trick?


ConvNets for pictures

Problem: lots of variance that shouldn’t matter (position, rotation, skew, differences in font/handwriting).

Solution: Accept that features are local. Search for local features with a window.

A convolutional window acts as a classifier for local features.

Different convolutional maps can be trained to recognise different features (e.g. edges, curves, serifs).

Stacked convolutional layers learn higher-level features.

Figure: raw image → convolutional layer (first-order local features) → further convolutional layers (higher-order features) → fully connected layer → prediction.

One or more fully-connected layers learn the classification function over the highest level of representation.

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 63 / 119

Page 87: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning

ConvNets for language

Convolutional neural networks fit natural language well.

Deep ConvNets capture:

Positional invariances

Local features

Hierarchical structure

Language has:

Some positional invariance

Local features (e.g. POS)

Hierarchical structure (phrases, dependencies)

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 64 / 119

Page 88: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning

ConvNets for language

How do we go from images to sentences? Sentence matrices!

[Figure: a sentence matrix, one embedding column per word w1 … w5.]
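As a rough sketch of this step (assuming PyTorch; the vocabulary, embedding, and filter sizes below are illustrative, not from the slides), the sentence matrix is simply the stacked word embeddings, and a one-dimensional convolution slides a window over it:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, n_filters, window = 1000, 50, 64, 3

embed = nn.Embedding(vocab_size, emb_dim)
# Conv1d expects (batch, channels, length): channels = embedding dimension.
conv = nn.Conv1d(in_channels=emb_dim, out_channels=n_filters, kernel_size=window)

tokens = torch.tensor([[4, 17, 256, 3, 99]])     # "w1 w2 w3 w4 w5" as token ids
sent_matrix = embed(tokens).transpose(1, 2)      # (1, emb_dim, 5)
features = torch.relu(conv(sent_matrix))         # (1, n_filters, 3)
sent_vec = features.max(dim=2).values            # max-pool over time -> (1, n_filters)
```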

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 65 / 119

Page 89: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning

ConvNets for language

Does a convolutional window make sense for language?

[Figure: a convolutional window spanning all embedding dimensions of a few adjacent words.]

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 66 / 119

Page 90: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning

ConvNets for language

A better solution: feature-specific windows.

[Figure: narrow, feature-specific convolutional windows, one per embedding dimension.]

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 67 / 119

Page 91: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning

Word Level Sentence Vectors with ConvNets

[Figure: the Dynamic Convolutional Neural Network over the sentence “game’s the same, just got more fierce”: projected sentence matrix (s = 7) → wide convolution (m = 3) → dynamic k-max pooling (k = f(s) = 5) → folding → wide convolution (m = 2) → k-max pooling (k = 3) → fully connected layer.]

cf. Kalchbrenner et al., 2014
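A minimal sketch of the k-max pooling step (assuming PyTorch; `kmax_pooling` is a hypothetical helper name): it keeps the k largest values per feature map while preserving their original order, which is what lets the layer retain coarse positional information:

```python
import torch

def kmax_pooling(x: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k largest values along the time axis, preserving order.
    x: (batch, channels, length) -> (batch, channels, k)."""
    idx = x.topk(k, dim=2).indices.sort(dim=2).values  # re-sort to original order
    return x.gather(2, idx)

x = torch.randn(1, 4, 7)           # e.g. 4 feature maps over a 7-word sentence
print(kmax_pooling(x, k=3).shape)  # torch.Size([1, 4, 3])
```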

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 68 / 119

Page 92: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning

Character Level Sentence Vectors with ConvNets

Image credit: Yoon Kim and co-authors.

cf. Kim et al., 2016

Naively, we could just represent everything at the character level.

Convolutions seem to work well for low-level patterns (e.g. morphology).

One interpretation: multiple filters can capture the low-level idiosyncrasies of natural language (e.g. arbitrary spelling), whereas language is more compositional at a higher level.
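A rough sketch of a character-level word encoder in this spirit (assuming PyTorch; the dimensions and the single filter width are illustrative, where Kim et al. use several widths):

```python
import torch
import torch.nn as nn

n_chars, char_dim, n_filters, width = 60, 15, 100, 4

char_embed = nn.Embedding(n_chars, char_dim)
conv = nn.Conv1d(char_dim, n_filters, kernel_size=width)

# A word as a sequence of character ids, e.g. "coooooool".
chars = torch.randint(0, n_chars, (1, 9))
h = conv(char_embed(chars).transpose(1, 2))   # (1, n_filters, 6)
word_vec = torch.relu(h).max(dim=2).values    # max over time -> (1, n_filters)
```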

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 69 / 119

Page 93: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning

ConvNet-like Architectures for Composition

[Figure: ByteNet: a stack of dilated convolutions over the source sequence s0 … s16 feeding a dilated convolutional decoder that generates the target sequence t1 … t17.]

Image credit: Nal Kalchbrenner and co-authors.

cf. Kalchbrenner et al., 2016

Many other CNN-like architectures (e.g. ByteNet from Kalchbrenner et al. (2016)).

Common recipe components: dilated convolutions and ResNet blocks.

These model sequences well in domains like speech, and are beginning to find applications in NLP, so they are worth reading up on.
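A quick illustration of the dilation idea (assuming PyTorch; the channel count is illustrative): a stack of dilated convolutions grows the receptive field exponentially with depth while keeping the sequence length fixed:

```python
import torch
import torch.nn as nn

# Three dilated convolutions: receptive field 1 + 2*(1 + 2 + 4) = 15 positions.
layers = nn.Sequential(
    nn.Conv1d(32, 32, kernel_size=3, dilation=1, padding=1),
    nn.Conv1d(32, 32, kernel_size=3, dilation=2, padding=2),
    nn.Conv1d(32, 32, kernel_size=3, dilation=4, padding=4),
)
x = torch.randn(1, 32, 100)    # (batch, channels, sequence length)
print(layers(x).shape)         # torch.Size([1, 32, 100])
```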

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 70 / 119

Page 94: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning


Unsupervised Composition Models

Why care about unsupervised learning?

Much more unlabelled linguistic data than labelled data.

Learn general purpose representations and composition functions.

Suitable pre-training for supervised models, semi-supervised, or multi-task objectives.

In the (paraphrased) words of Yann LeCun: unsupervised learning is a cake, supervised learning is frosting, and RL is the cherry on top! Plot twist: it’s possibly a cherry cake.

Yes, that’s nice... But what are we doing, concretely?

Good question! Usually, just modelling, directly or indirectly, some aspect of the probability of the observed data.

Further suggestions on a postcard, please!

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 72 / 119

Page 96: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning

Autoencoders

Autoencoders provide an unsupervised method for representation learning:

We minimise an objective function over inputs $x_i$, $i \in \{1, \dots, N\}$, and their reconstructions $x'_i$:

$$J = \frac{1}{2} \sum_{i=1}^{N} \left\| x'_i - x_i \right\|^2$$

Warning: there is a degenerate solution if the $x_i$ can be updated ($\forall i.\ x_i = 0$).
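A minimal autoencoder sketch (assuming PyTorch; the sizes are illustrative) showing the encoder/decoder pair and the reconstruction loss above:

```python
import torch
import torch.nn as nn

encoder = nn.Linear(100, 20)   # compress 100-d inputs to a 20-d code
decoder = nn.Linear(20, 100)

x = torch.randn(8, 100)                 # a batch of fixed inputs (not updated!)
x_rec = decoder(torch.tanh(encoder(x)))
loss = 0.5 * ((x_rec - x) ** 2).sum()   # J = 1/2 * sum_i ||x'_i - x_i||^2
loss.backward()
```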

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 73 / 119

Page 97: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning

Recursive Autoencoders

cf. Socher et al., 2011a

To auto-encode variable-length sequences, we can chain autoencoders to create a recursive structure.

Objective function: minimising the reconstruction error will learn a compression function over the inputs:

$$E_{rec}(i, \theta) = \frac{1}{2} \left\| x_i - x'_i \right\|^2$$

A “modern” alternative: use a sequence-to-sequence model and a log-likelihood objective.

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 74 / 119

Page 98: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning

What’s wrong with auto-encoders?

Empirically, narrow auto-encoders produce sharp latent codes, and unregularised wide auto-encoders learn identity functions.

The reconstruction objective includes nothing about distance preservation in latent space: there is no guarantee that

$$\mathrm{dist}(a, b) \le \mathrm{dist}(a, c) \;\Rightarrow\; \mathrm{dist}(\mathrm{encode}(a), \mathrm{encode}(b)) \le \mathrm{dist}(\mathrm{encode}(a), \mathrm{encode}(c))$$

Conversely, there is little incentive for similar latent codes to generate radically different (but semantically equivalent) observations.

Ultimately, compression ≠ meaning.

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 75 / 119

Page 99: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning

Skip-Thought

Image credit: Jamie Kiros and co-authors. cf. Kiros et al., 2015

Similar to the auto-encoding objective: encode a sentence, but decode the neighbouring sentences.

A pair of LSTM-based seq2seq models with a shared encoder, but alternative formulations are possible.

Conceptually similar to distributional semantics: a unit’s representation is a function of its neighbouring units, except the units are sentences instead of words.

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 76 / 119

Page 100: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning

Variational Auto-Encoders

Semantically Weak Codes

Generally, auto-encoders sparsely encode or densely compress information. There is no pressure to ensure a similarity continuum amongst codes.

Factorized Generative Picture

$$p(x) = \int p(x, z)\, dz = \int p(x \mid z)\, p(z)\, dz = \mathbb{E}_{p(z)}\left[ p(x \mid z) \right]$$

[Figure: graphical model $z \rightarrow x$ with prior $z \sim \mathcal{N}(0, I)$.]

The prior on $z$ enforces a semantic continuum (e.g. no arbitrarily unrelated codes for similar data), but the expectation is typically intractable to compute exactly, and a Monte Carlo estimate of its gradients will have high variance.

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 77 / 119

Page 101: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning

Variational Auto-Encoders

Goal

Estimate, by maximising p(x):

The parameters $\theta$ of a function modelling part of the generative process $p_\theta(x \mid z)$ given samples from a fixed prior $z \sim p(z)$.

The parameters $\phi$ of a distribution $q_\phi(z \mid x)$ approximating the true posterior $p(z \mid x)$.

How do we do it? We maximise $p(x)$ via a variational lower bound (VLB):

$$\log p(x) \ge \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x \mid z)\right] - D_{KL}\left(q_\phi(z \mid x)\,\|\,p(z)\right)$$

Equivalently, we can minimise $\mathrm{NLL}(x)$:

$$\mathrm{NLL}(x) \le \mathbb{E}_{q_\phi(z|x)}\left[\mathrm{NLL}_\theta(x \mid z)\right] + D_{KL}\left(q_\phi(z \mid x)\,\|\,p(z)\right)$$

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 78 / 119

Page 102: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning

Variational Auto-Encoders

Let’s derive the VLB:

$$\begin{aligned}
\log p(x) &= \log \int 1 \cdot p_\theta(x \mid z)\, p(z)\, dz \\
&= \log \int \frac{q_\phi(z \mid x)}{q_\phi(z \mid x)}\, p_\theta(x \mid z)\, p(z)\, dz \\
&= \log \mathbb{E}_{q_\phi(z|x)}\left[ \frac{p(z)}{q_\phi(z \mid x)}\, p_\theta(x \mid z) \right] \\
&\ge \mathbb{E}_{q_\phi(z|x)}\left[ \log \frac{p(z)}{q_\phi(z \mid x)} + \log p_\theta(x \mid z) \right] \\
&= \mathbb{E}_{q_\phi(z|x)}\left[ \log p_\theta(x \mid z) \right] - D_{KL}\left(q_\phi(z \mid x)\,\|\,p(z)\right)
\end{aligned}$$

(The inequality is Jensen’s inequality applied to the concave $\log$.)

For the right $q_\phi(z \mid x)$ and $p(z)$ (e.g. Gaussians) there is a closed-form expression for $D_{KL}(q_\phi(z \mid x)\,\|\,p(z))$.
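For reference (a standard result, not spelled out on the slides): with a diagonal Gaussian posterior $q_\phi(z \mid x) = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$ and prior $p(z) = \mathcal{N}(0, I)$,

$$D_{KL}\left(q_\phi(z \mid x)\,\|\,p(z)\right) = \frac{1}{2} \sum_{j=1}^{d} \left( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \right)$$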

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 79 / 119

Page 103: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning

Variational Auto-Encoders

The problem of stochastic gradients

Estimating $\frac{\partial}{\partial \phi} \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x \mid z)\right]$ requires backpropagating through samples $z \sim q_\phi(z \mid x)$. For some choices of $q$, such as Gaussians, there are reparameterization tricks (cf. Kingma et al., 2013).

Reparameterizing Gaussians (Kingma et al., 2013)

$$z \sim \mathcal{N}(z; \mu, \sigma^2) \quad \text{is equivalent to} \quad z = \mu + \sigma\varepsilon, \quad \text{where } \varepsilon \sim \mathcal{N}(\varepsilon; 0, I)$$

Trivially:

$$\frac{\partial z}{\partial \mu} = 1 \qquad \frac{\partial z}{\partial \sigma} = \varepsilon$$
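A minimal sketch of the trick (assuming PyTorch): the sample is expressed as a deterministic, differentiable function of $\mu$, $\sigma$ and parameter-free noise, so gradients flow back to the encoder parameters:

```python
import torch

mu = torch.zeros(8, 16, requires_grad=True)
log_sigma = torch.zeros(8, 16, requires_grad=True)

eps = torch.randn_like(mu)       # parameter-free noise: eps ~ N(0, I)
z = mu + log_sigma.exp() * eps   # z = mu + sigma * eps, differentiable in mu, sigma

loss = (z ** 2).sum()            # any downstream loss using z
loss.backward()                  # gradients reach mu and log_sigma
print(mu.grad.shape, log_sigma.grad.shape)
```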

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 80 / 119

Page 104: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning

Variational Auto-Encoders for Sentences

1 Observe a sentence $w_1, \dots, w_n$. Encode it, e.g. with an LSTM: $h_e = \mathrm{LSTM}_e(w_1, \dots, w_n)$.

2 Predict $\mu = f^\mu(h_e)$ and $\sigma^2 = f^\sigma(h_e)$ (in practice we operate in log space for $\sigma^2$ by predicting $\log \sigma$).

3 Sample $z \sim q(z \mid x) = \mathcal{N}(z; \mu, \sigma^2)$.

4 Use a conditional RNN to decode and measure $\log p(x \mid z)$. Use the closed-form formula for the KL divergence of two Gaussians to calculate $-D_{KL}(q_\phi(z \mid x)\,\|\,p(z))$. Add both to obtain the maximisation objective.

5 Backpropagate the gradient through the decoder normally based on the log component of the objective, and use the reparameterisation trick to backpropagate through the sampling operation back to the encoder.

6 The gradient of the KL divergence component of the loss with respect to the encoder parameters is straightforward backpropagation.
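Putting the steps together, a rough sentence-VAE sketch (assuming PyTorch; all sizes and layer names are illustrative, and the teacher-forced decoder is simplified):

```python
import torch
import torch.nn as nn

V, E, H, Z = 1000, 64, 128, 32          # vocab, embedding, hidden, latent sizes

embed = nn.Embedding(V, E)
enc = nn.LSTM(E, H, batch_first=True)
f_mu, f_logsig = nn.Linear(H, Z), nn.Linear(H, Z)
z_to_h = nn.Linear(Z, H)
dec = nn.LSTM(E, H, batch_first=True)
out = nn.Linear(H, V)

x = torch.randint(0, V, (4, 10))        # a batch of token-id sentences

# Steps 1-2: encode, predict mu and log(sigma)
_, (h_e, _) = enc(embed(x))
mu, log_sig = f_mu(h_e[-1]), f_logsig(h_e[-1])

# Step 3: reparameterised sample z ~ q(z|x)
z = mu + log_sig.exp() * torch.randn_like(mu)

# Step 4: conditional decoder; teacher forcing on x itself for brevity
# (a real implementation would shift inputs/targets by one position)
h0 = torch.tanh(z_to_h(z)).unsqueeze(0)
dec_out, _ = dec(embed(x), (h0, torch.zeros_like(h0)))
nll = nn.functional.cross_entropy(
    out(dec_out).reshape(-1, V), x.reshape(-1), reduction="sum")

# Closed-form KL against the N(0, I) prior
kl = 0.5 * (mu ** 2 + (2 * log_sig).exp() - 2 * log_sig - 1).sum()

# Steps 5-6: backpropagate the negative VLB through decoder, sample, encoder
(nll + kl).backward()
```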

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 81 / 119

Page 105: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning

Variational Auto-Encoders and Autoregressivity

The problem of powerful auto-regressive decoders

We want to minimise $\mathrm{NLL}(x) \le \mathbb{E}_{q(z|x)}[\mathrm{NLL}(x \mid z)] + D_{KL}(q(z \mid x)\,\|\,p(z))$. What if the decoder is powerful enough to model $x$ without using $z$?

A degenerate solution:

If $z$ can be ignored when minimising the reconstruction loss of $x$ given $z$, the model can safely let $q(z \mid x)$ collapse to the prior $p(z)$ to minimise $D_{KL}(q(z \mid x)\,\|\,p(z))$.
Since $q$ need not depend on $x$ (e.g. the encoder can just ignore $x$ and predict the mean and variance of the prior), $z$ bears no relation to $x$.
Result: useless encoder, useless latent variable.

Is this really a problem?

If your decoder is not auto-regressive (e.g. MLPs expressing the probability of pixels which are conditionally independent given $z$), then no.
If your decoder is an RNN and the domain has systematic patterns, then yes.

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 82 / 119

Page 106: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning

Variational Auto-Encoders and Autoregressivity

What are some solutions to this problem?

Pick a non-autoregressive decoder. If you care more about the latent code than having a good generative model (e.g. document modelling), this isn’t a bad idea, but it is frustrating if this is the only solution.

KL annealing: set $\mathbb{E}_{q(z|x)}[\mathrm{NLL}(x \mid z)] + \alpha D_{KL}(q(z \mid x)\,\|\,p(z))$ as the objective. Start with $\alpha = 0$ (a basic seq2seq model). Increase $\alpha$ to 1 over time during training. Works somewhat, but is an unprincipled change of the objective function.

Set as the objective $\mathbb{E}_{q(z|x)}[\mathrm{NLL}(x \mid z)] + \max(\lambda, D_{KL}(q(z \mid x)\,\|\,p(z)))$, where $\lambda \ge 0$ is a scalar or vector hyperparameter. Once the KL dips below $\lambda$, there is no benefit, so the model must rely on $z$ to some extent. This objective is still a valid upper bound on $\mathrm{NLL}(x)$ (albeit a looser one).
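A small sketch of the last two remedies as loss functions (assuming PyTorch tensors `nll` and `kl` computed as on the previous slides; `alpha` and `lam` are hyperparameters):

```python
import torch

def kl_annealed_loss(nll, kl, alpha):
    """KL annealing: alpha rises from 0 to 1 over the course of training."""
    return nll + alpha * kl

def free_bits_loss(nll, kl, lam):
    """No benefit to pushing KL below lam, so the model must keep some
    information in z. Still a valid (looser) upper bound on NLL(x)."""
    return nll + torch.clamp(kl, min=lam)   # max(lam, KL)

# e.g. alpha = min(1.0, step / 10_000); lam = 3.0
```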

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 83 / 119

Page 107: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning



Compositional or Non-compositional Representation

Such “hard” or “soft” non-compositionality exists at different granularities of text.

We will discuss some models for handling this at the word-phrase level.

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 88 / 119

Page 112: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning


Compositional and Non-compositional Semantics

Compositionality/non-compositionality is a common phenomenon in language.

A framework that is able to consider both compositionality and non-compositionality is of interest.

A pragmatic viewpoint: if one is able to obtain the representation of an n-gram or a phrase in text holistically, it would be desirable for a composition model to have the ability to decide which sources of knowledge to use.

In addition to composition, considering non-compositionality may avoid unnecessarily back-propagating errors that would confuse the word embeddings.

Think of the “kick the bucket” example: its idiomatic meaning (“to die”) is not composed from the meanings of “kick” and “bucket”.

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 89 / 119

Page 114: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning



Integrating Compositional and Non-compositional Semantics

Integrating non-compositionality in recursive networks (Zhu et al., 2015a):

Basic idea: enable individual composition operations to choose information from different sources, compositional or non-compositional (e.g., holistically learned).

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 90 / 119

Page 117: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning

Integrating Compositional and Non-compositional Semantics

Model 1: Regular bilinear merge (Zhu et al., 2015a):

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 91 / 119

Page 118: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning

Integrating Compositional and Non-compositional Semantics

Model 2: Tensor-based merging (Zhu et al., 2015a)

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 92 / 119

Page 119: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning

Integrating Compositional and Non-compositional Semantics

Model 3: Explicitly gated merging (Zhu et al., 2015a):
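The slide’s figure is not reproduced here, but one plausible form of such an explicit gate (an illustrative sketch assuming PyTorch, not necessarily the exact formulation of Zhu et al., 2015a) blends a compositionally built vector with a holistically learned one elementwise:

```python
import torch
import torch.nn as nn

d = 50
gate = nn.Linear(2 * d, d)

def gated_merge(comp, holistic):
    """Blend a compositionally built vector with a holistically
    learned (non-compositional) one via an elementwise sigmoid gate."""
    g = torch.sigmoid(gate(torch.cat([comp, holistic], dim=-1)))
    return g * comp + (1 - g) * holistic

h = gated_merge(torch.randn(1, d), torch.randn(1, d))
```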

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 93 / 119

Page 120: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning

Experiment Set-Up

Task: sentiment analysis
Data: Stanford Sentiment Treebank
Non-compositional sentiment:

Sentiment of n-grams automatically learned from tweets (Mohammad et al., 2013).

Polled the Twitter API every four hours from April to December 2012 in search of tweets with either a positive or a negative word hashtag.
Used 78 seed hashtags (32 positive and 36 negative), such as #good, #excellent, and #terrible, to annotate sentiment.
775,000 tweets that contain at least one positive or one negative hashtag were used as the learning corpus.
Point-wise mutual information (PMI) is calculated for each bigram and trigram.
Each sentiment score is converted to a one-hot vector; e.g. a bigram with a score of -1.5 will be assigned the 5-dimensional vector [0, 1, 0, 0, 0] (i.e., the e vector).

Also used the human annotations that come with the Stanford Sentiment Treebank for bigrams and trigrams.
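For concreteness, the PMI-based score used in such hashtag lexicons is the difference PMI(ngram, positive) − PMI(ngram, negative), which reduces to a log-odds ratio of class counts. A minimal sketch (the counts below are illustrative, not from the corpus):

```python
import math

def sentiment_score(counts, n_pos, n_neg, ngram):
    """score = PMI(ngram, positive) - PMI(ngram, negative),
    which simplifies to a log-odds ratio of the ngram's class counts."""
    pos, neg = counts[ngram]      # occurrences in pos-/neg-hashtag tweets
    return math.log((pos * n_neg) / (neg * n_pos))

counts = {"not good": (120, 480)}   # illustrative counts, not real data
print(sentiment_score(counts, 500_000, 275_000, "not good"))  # negative score
```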

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 94 / 119

Page 121: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning

Results

Models                         sentence-level (roots)   all phrases (all nodes)
(1) RNTN                       42.44                    79.95
(2) Regular-bilinear (auto)    42.37                    79.97
(3) Regular-bilinear (manu)    42.98                    80.14
(4) Explicitly-gated (auto)    42.58                    80.06
(5) Explicitly-gated (manu)    43.21                    80.21
(6) Confined-tensor (auto)     42.99                    80.49
(7) Confined-tensor (manu)     43.75†                   80.66†

Table: Model performance (accuracy) on predicting 5-category sentiment at the sentence (root) level and the phrase level.

¹The results are based on version 3.3.0 of the Stanford CoreNLP.


Integrating Compositional and Non-compositional Semantics

We have discussed integrating non-compositionality in recursive networks.

What if no prior input structures are available?

Remember, we have discussed models that capture hidden structures.

What if a syntactic parse tree is not very reliable?

e.g., for data like social media text or speech transcripts.

In these situations, how can we still consider non-compositionality in the composition process?

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 96 / 119

Page 123: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning

Integrating Compositional and Non-compositional Semantics

Integrating non-compositionality in chain recurrent networks (Zhu et al., 2016).

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 97 / 119

Page 124: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning

Integrating Compositional and Non-compositional Semantics

Non-compositional nodes:

Form the non-compositional paths (e.g., 3-8-9 or 4-5-9).

Allow the embedding spaces of a non-compositional node to be different from those of a compositional node.

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 98 / 119

Page 125: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning

Integrating Compositional and Non-compositional Semantics

Fork nodes:

Summarizing the history so far to support both compositional and non-compositional paths.

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 99 / 119

Page 126: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning

Integrating Compositional and Non-compositional Semantics

Merging nodes:

Combining information from compositional and non-compositional paths.


Integrating Compositional and Non-compositional Semantics

Binarization:

Binarize the composition of in-bound paths (we do not worry too much about the order of merging).

Now we do not need to design different nodes for different fan-ins; instead, parameters are shared across the whole network.

Use the tree-LSTM described above.

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 101 / 119

Page 128: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning


Results

Method              SemEval-13   SemEval-14
Majority baseline   29.19        34.46
Unigram (SVM)       56.95        58.58
3rd best model      64.86        69.95
2nd best model      65.27        70.14
The best model      69.02        70.96
DAG-LSTM            70.88        71.97

Table: Performance of different models in the official evaluation metric (macro F-scores) on the test sets of SemEval-2013 and SemEval-2014 Sentiment Analysis in Twitter, predicting the sentiment of tweet messages.

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 102 / 119

Page 130: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning

Results

Method                             SemEval-13   SemEval-14
DAG-LSTM
  Full paths                       70.88        71.97
  Full – {autoPaths}               69.36        69.27
  Full – {triPaths}                70.16        70.77
  Full – {triPaths, biPaths}       69.55        69.93
  Full – {manuPaths}               69.88        70.58
LSTM without DAG
  Full – {autoPaths, manuPaths}    64.00        66.40

Table: Ablation performance (macro-averaged F-scores) of DAG-LSTM with different types of paths removed.

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 103 / 119

Page 131: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning


Subword Composition

Composition can also be performed to learn representations for words from subword components (Botha et al., 2014; Ling et al., 2015; Luong et al., 2015; Kim et al., 2016; Sennrich et al., 2016).

Rich morphology: some languages have larger vocabularies than others.
Informal text: very coooooool!
Basically: alleviate sparseness!

One perspective for viewing subword models:

Morpheme-based composition: deriving word representations from morphemes.
Character-based composition: deriving word representations from characters (pretty effective as well, even used by itself!).

Another perspective (by model architecture):

Recursive models
Convolutional models
Recurrent models

We will discuss several typical methods here only briefly.

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 105 / 119

Page 133: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning


Subword Composition: Recursive Networks

Morphological Recursive Neural Networks (Luong et al., 2013): extending recursive neural networks (Socher et al., 2011b) to learn word representations through composition over morphemes.

Assumes the availability of morphemic analyses.

Each tree node combines a stem vector and an affix vector.

Figure: context-insensitive (left) and context-sensitive (right) Morphological Recursive Neural Networks.
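A rough sketch of one such composition step (assuming PyTorch; the single shared matrix and the morpheme segmentation of the example are illustrative assumptions, not the exact model of Luong et al.):

```python
import torch
import torch.nn as nn

d = 50
W = nn.Linear(2 * d, d)

def compose(stem, affix):
    """One tree-node composition step: parent = tanh(W [stem; affix] + b)."""
    return torch.tanh(W(torch.cat([stem, affix], dim=-1)))

# e.g. "unfortunately": combine the stem with its prefix, then the suffix
un, fortunate, ly = (torch.randn(1, d) for _ in range(3))
word_vec = compose(compose(fortunate, un), ly)
```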

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 106 / 119

Page 138: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning

Subword Composition: Recurrent Networks

Bi-directional LSTM for subword composition (Ling et al., 2015).

Figure: character RNN for subword composition.

Some more details ...
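A rough sketch of the idea (assuming PyTorch; Ling et al. combine the final forward and backward states, here via a linear projection, and the sizes are illustrative):

```python
import torch
import torch.nn as nn

n_chars, char_dim, hidden, word_dim = 60, 15, 50, 100

char_embed = nn.Embedding(n_chars, char_dim)
bilstm = nn.LSTM(char_dim, hidden, batch_first=True, bidirectional=True)
proj = nn.Linear(2 * hidden, word_dim)

chars = torch.randint(0, n_chars, (1, 7))       # one word, 7 characters
out, _ = bilstm(char_embed(chars))
# final forward state + backward state at the first position
fwd, bwd = out[:, -1, :hidden], out[:, 0, hidden:]
word_vec = proj(torch.cat([fwd, bwd], dim=-1))  # (1, word_dim)
```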


Subword Composition: Convolutional Networks

Convolutional neural networks for subword composition (Zhang et al., 2015).

Figure: character CNN for subword composition.

In general, subword models have been successfully used in a wide variety of problems such as translation, sentiment analysis, and question answering.

You should seriously consider them in situations where the OOV rate is high or the word distribution has a long tail.

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 108 / 119

Page 140: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning



Summary

This tutorial discussed semantic composition with distributed representations learned with neural networks.

Neural networks are able to learn powerful representations and complicated composition functions.

The models can achieve state-of-the-art performance on a wide range of NLP tasks.

We expect further studies to continue to deepen our understanding of such approaches:

Unsupervised models
Compositionality with other “ingredients” of intelligence
Compositionality across modalities
Interpretability of models
Distributed vs./and symbolic composition models
...

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 110 / 119

Page 143: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning

References I

C. E. Osgood, G. J. Suci, and P. H. Tannenbaum. The Measurement of Meaning. University of Illinois Press, 1957.

Richard Montague. “English as a Formal Language”. In: Linguaggi nella società e nella tecnica. Ed. by Bruno Visentini. Edizioni di Comunità, 1970, pp. 188–221.

G. A. Miller and P. N. Johnson-Laird. Language and Perception. Cambridge, MA: Belknap Press, 1976.

J. A. Fodor and Z. W. Pylyshyn. “Connectionism and cognitive architecture: A critical analysis”. In: Cognition 28 (1988), pp. 3–71.

Jordan B. Pollack. “Recursive Distributed Representations”. In: Artif. Intell. 46.1-2 (1990), pp. 77–105.

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 111 / 119

Page 144: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning

References II

Ronald J. Williams. “Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning”. In: Machine Learning 8 (1992), pp. 229–256.

Barbara Partee. “Lexical semantics and compositionality”. In: Invitation to Cognitive Science 1 (1995), pp. 311–360.

Elie Bienenstock, Stuart Geman, and Daniel Potter. “Compositionality, MDL Priors, and Object Recognition”. In: NIPS. 1996.

Enrico Francesconi et al. “Logo Recognition by Recursive Neural Networks”. In: GREC. 1997.

Jeff Mitchell and Mirella Lapata. “Vector-based Models of Semantic Composition”. In: ACL. 2008, pp. 236–244.

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 112 / 119

Page 145: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning

References III

Richard Socher et al. “Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection”. In: NIPS. 2011, pp. 801–809.

Richard Socher et al. “Parsing Natural Scenes and Natural Language with Recursive Neural Networks”. In: ICML. 2011, pp. 129–136.

Richard Socher et al. “Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions”. In: EMNLP. 2011.

Richard Socher et al. “Semantic Compositionality through Recursive Matrix-Vector Spaces”. In: EMNLP-CoNLL. 2012, pp. 1201–1211.

Nal Kalchbrenner and Phil Blunsom. “Recurrent Continuous Translation Models”. In: EMNLP. Vol. 3. 39. 2013, p. 413.

Diederik P. Kingma and Max Welling. “Auto-Encoding Variational Bayes”. In: CoRR abs/1312.6114 (2013).

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 113 / 119

Page 146: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning

References IV

Thang Luong, Richard Socher, and Christopher D. Manning. “Better Word Representations with Recursive Neural Networks for Morphology”. In: CoNLL. 2013.

Saif Mohammad, Svetlana Kiritchenko, and Xiao-Dan Zhu. “NRC-Canada: Building the State-of-the-Art in Sentiment Analysis of Tweets”. In: SemEval@NAACL-HLT. 2013.

Richard Socher et al. “Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank”. In: EMNLP. 2013, pp. 1631–1642.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. “Neural machine translation by jointly learning to align and translate”. In: arXiv preprint arXiv:1409.0473 (2014).

Jan A. Botha and Phil Blunsom. “Compositional Morphology for Word Representations and Language Modelling”. In: ICML. 2014.

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 114 / 119

Page 147: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning

References V

Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. “A convolutional neural network for modelling sentences”. In: arXiv preprint arXiv:1404.2188 (2014).

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. “Sequence to sequence learning with neural networks”. In: Advances in Neural Information Processing Systems. 2014, pp. 3104–3112.

Xiaodan Zhu et al. “An Empirical Study on the Effect of Negation Words on Sentiment”. In: ACL. 2014.

Ryan Kiros et al. “Skip-thought vectors”. In: Advances in Neural Information Processing Systems. 2015, pp. 3294–3302.

Phong Le and Willem Zuidema. “Compositional Distributional Semantics with Long Short Term Memory”. In: *SEM@NAACL-HLT. 2015.

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 115 / 119

Page 148: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning

References VI

Wang Ling et al. “Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation”. In: EMNLP. 2015.

Thang Luong et al. “Addressing the Rare Word Problem in Neural Machine Translation”. In: ACL. 2015.

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. “Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks”. In: ACL. 2015, pp. 1556–1566.

Xiang Zhang and Yann LeCun. “Text Understanding from Scratch”. In: CoRR abs/1502.01710 (2015).

Xiaodan Zhu, Hongyu Guo, and Parinaz Sobhani. “Neural Networks for Integrating Compositional and Non-compositional Sentiment in Sentiment Composition”. In: *SEM@NAACL-HLT. 2015.

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 116 / 119

Page 149: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning

References VII

Xiaodan Zhu, Parinaz Sobhani, and Hongyu Guo. “Long Short-Term Memory Over Recursive Structures”. In: ICML. 2015, pp. 1604–1612.

Samuel R. Bowman et al. “A fast unified model for parsing and sentence understanding”. In: arXiv preprint arXiv:1603.06021 (2016).

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

Nal Kalchbrenner et al. “Neural Machine Translation in Linear Time”. In: CoRR abs/1610.10099 (2016).

Yoon Kim et al. “Character-Aware Neural Language Models”. In: AAAI. 2016.

B. M. Lake et al. “Building Machines that Learn and Think Like People”. In: Behavioral and Brain Sciences (in press) (2016).

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 117 / 119

Page 150: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning

References VIII

Tsendsuren Munkhdalai and Hong Yu. “Neural Tree Indexers for Text Understanding”. In: CoRR abs/1607.04492 (2016).

Rico Sennrich, Barry Haddow, and Alexandra Birch. “Neural Machine Translation of Rare Words with Subword Units”. In: ACL. 2016.

Dani Yogatama et al. “Learning to Compose Words into Sentences with Reinforcement Learning”. In: CoRR abs/1611.09100 (2016).

Xiaodan Zhu, Parinaz Sobhani, and Hongyu Guo. “DAG-Structured Long Short-Term Memory for Semantic Compositionality”. In: NAACL. 2016.

Qian Chen et al. “Enhanced LSTM for Natural Language Inference”. In: ACL. 2017.

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 118 / 119

Page 151: Deep Learning for Semantic Composition - Xiaodan · PDF fileSound/music Music notes are composed with some regularity but not randomly arranged to form a song. Vision ... Deep Learning

References IX

Adina Williams, Nikita Nangia, and Samuel R. Bowman. “A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference”. In: CoRR abs/1704.05426 (2017).

Xiaodan Zhu & Edward Grefenstette DL for Composition July 30th, 2017 119 / 119