
Page 1

Toward the Engineering of Improving Semantic Embedding

Xugang Ye

Page 2

What is Semantic Embedding?

[Figure: a semantic representation s in the semantic space generates an expression e ≈ e′ in the language space through a generative process e ~ P(e | s; Θ); the generative process is approximated by a neural network model.]

Page 3

What is Semantic Embedding?

[Figure: same diagram as Page 2, with the relevance quantity added.]

Relevance (by variational Bayes):

ln P(e; Θ) ≥ E_{q(s)}[ ln P(e, s; Θ) ] − E_{q(s)}[ ln q(s | e) ]

Page 4

Model Architecture

[Figure: two towers, one for the context and one for the expression, map from the language space to the semantic space:
- Parsed words/phrases with word/phrase hashing → count-based representations: 30k and 30k
- Compressed representations: 300 and 300
- Aggregated representations: 300 and 300
- Semantic representations y^(c) and y^(e): 128 and 128
The towers apply mappings f(·) and g(·) with weight matrices W^(·,c) and W^(·,e); at the top, the cosine similarity R(y^(c), y^(e)) and the relevance probability P(y^(e) | y^(c)) are computed.]

Page 5

Relevance probability

P(y^(e) | y^(c)) = exp(γ R(y^(c), y^(e))) / ∫ exp(γ R(y^(c), y)) dy,

where

R(y^(c), y^(e)) = y^(c) · y^(e) / (||y^(c)|| ||y^(e)||) = Σ_i y_i^(c) y_i^(e) / sqrt( Σ_i (y_i^(c))² · Σ_i (y_i^(e))² ),

y_i^(c) = h(z_i^(0,c)), y_i^(e) = h(z_i^(0,e)),   ~ level 0
z_i^(0,c) = Σ_j W_ij^(0,c) x_j^(1,c),  z_i^(0,e) = Σ_j W_ij^(0,e) x_j^(1,e),   W^(0,c), W^(0,e) are fully connected,

x_i^(1,c) = h(z_i^(1,c)), x_i^(1,e) = h(z_i^(1,e)),   ~ level 1
z_i^(1,c) = Σ_j W_ij^(1,c) x_j^(2,c),  z_i^(1,e) = Σ_j W_ij^(1,e) x_j^(2,e),   W^(1,c), W^(1,e) are fully connected,

x_i^(2,c) = h(z_i^(2,c)), x_i^(2,e) = h(z_i^(2,e)),   ~ level 2
z_i^(2,c) = Σ_j W_ij^(2,c) x_j^(3,c),  z_i^(2,e) = Σ_j W_ij^(2,e) x_j^(3,e),   W^(2,c), W^(2,e) are partially connected,

x_i^(3,c), x_i^(3,e) are count-based.
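A minimal numerical sketch of the two formulas above (hypothetical names, numpy only): the cosine similarity of the two semantic vectors, and the relevance probability with the integral replaced by a sum over a small candidate set.

```python
import numpy as np

def cosine_similarity(y_c, y_e):
    # R(y^(c), y^(e)) = y^(c)·y^(e) / (||y^(c)|| ||y^(e)||)
    return float(y_c @ y_e / (np.linalg.norm(y_c) * np.linalg.norm(y_e)))

def relevance_probability(y_c, y_e, candidates, gamma=10.0):
    # P(y^(e) | y^(c)) with the integral approximated by a sum over sampled candidates
    scores = np.array([gamma * cosine_similarity(y_c, y) for y in candidates])
    target = gamma * cosine_similarity(y_c, y_e)
    m = max(scores.max(), target)            # log-sum-exp shift for numerical stability
    return float(np.exp(target - m) / np.exp(scores - m).sum())

# usage: 128-dimensional semantic vectors, 4 sampled candidates (including the observed one)
rng = np.random.default_rng(0)
y_c, y_e = rng.normal(size=128), rng.normal(size=128)
cands = [y_e] + [rng.normal(size=128) for _ in range(3)]
print(relevance_probability(y_c, y_e, cands))
```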

Page 6

Learning

• The gradient of relevance probability:

∇_Θ ln P(y^(e) | y^(c)) = ∇_Θ ln [ exp(γ R(y^(c), y^(e))) / ∫ exp(γ R(y^(c), y)) dy ]

= γ ∇_Θ R(y^(c), y^(e)) − ∇_Θ ln ∫ exp(γ R(y^(c), y)) dy

= γ ∇_Θ R(y^(c), y^(e)) − [ ∫ γ exp(γ R(y^(c), y′)) ∇_Θ R(y^(c), y′) dy′ ] / [ ∫ exp(γ R(y^(c), y)) dy ]

= γ ∇_Θ R(y^(c), y^(e)) − γ ∫ P(y′ | y^(c)) ∇_Θ R(y^(c), y′) dy′,  where P(y′ | y^(c)) = exp(γ R(y^(c), y′)) / ∫ exp(γ R(y^(c), y)) dy

= γ ∫ P(y′ | y^(c)) [ ∇_Θ R(y^(c), y^(e)) − ∇_Θ R(y^(c), y′) ] dy′

≈ γ Σ_{y′ ≠ y^(e)} P(y′ | y^(c)) [ ∇_Θ R(y^(c), y^(e)) − ∇_Θ R(y^(c), y′) ]
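A small numpy check of the last step (hypothetical names): over a sampled candidate set, the gradient of ln P equals the probability-weighted difference of similarity gradients.

```python
import numpy as np

def softmax(v):
    v = v - v.max()
    e = np.exp(v)
    return e / e.sum()

def grad_log_relevance(R, grad_R, e_idx, gamma=10.0):
    """Approximate gradient of ln P(y^(e)|y^(c)) over a sampled candidate set.

    R      : similarities R(y^(c), y_j) for the candidates, shape (K,)
    grad_R : gradient of each R(y^(c), y_j) w.r.t. the parameters, shape (K, D)
    e_idx  : index of the observed expression y^(e) among the candidates
    Implements  gamma * sum_j P(y_j | y^(c)) * (grad R_e - grad R_j).
    """
    P = softmax(gamma * R)                        # P(y_j | y^(c)) on the sample
    return gamma * (grad_R[e_idx] - P @ grad_R)   # valid because sum_j P_j = 1

# quick check against the direct form  gamma * grad R_e - gamma * sum_j P_j grad R_j
rng = np.random.default_rng(1)
R, G = rng.normal(size=5), rng.normal(size=(5, 7))
assert np.allclose(grad_log_relevance(R, G, e_idx=0),
                   10.0 * G[0] - 10.0 * softmax(10.0 * R) @ G)
```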

Page 7

Learning

• The gradient of semantic similarity (via back propagation):

∂R(y^(c), y^(e)) / ∂y_i^(c) = y_i^(e) / (||y^(c)|| ||y^(e)||) − R(y^(c), y^(e)) · y_i^(c) / ||y^(c)||²,
∂R(y^(c), y^(e)) / ∂y_i^(e) = y_i^(c) / (||y^(c)|| ||y^(e)||) − R(y^(c), y^(e)) · y_i^(e) / ||y^(e)||²,

∂R / ∂W_kj^(0,c) = (∂R / ∂y_k^(c)) h′(z_k^(0,c)) x_j^(1,c),   ∂R / ∂W_kj^(0,e) = (∂R / ∂y_k^(e)) h′(z_k^(0,e)) x_j^(1,e),

∂R / ∂x_j^(1,c) = Σ_k (∂R / ∂y_k^(c)) h′(z_k^(0,c)) W_kj^(0,c),   ∂R / ∂x_j^(1,e) = Σ_k (∂R / ∂y_k^(e)) h′(z_k^(0,e)) W_kj^(0,e),

∂R / ∂W_kj^(1,c) = (∂R / ∂x_k^(1,c)) h′(z_k^(1,c)) x_j^(2,c),   ∂R / ∂W_kj^(1,e) = (∂R / ∂x_k^(1,e)) h′(z_k^(1,e)) x_j^(2,e),

∂R / ∂x_j^(2,c) = Σ_k (∂R / ∂x_k^(1,c)) h′(z_k^(1,c)) W_kj^(1,c),   ∂R / ∂x_j^(2,e) = Σ_k (∂R / ∂x_k^(1,e)) h′(z_k^(1,e)) W_kj^(1,e),

∂R / ∂W_kj^(2,c) = (∂R / ∂x_k^(2,c)) h′(z_k^(2,c)) x_j^(3,c),   ∂R / ∂W_kj^(2,e) = (∂R / ∂x_k^(2,e)) h′(z_k^(2,e)) x_j^(3,e).
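A quick finite-difference check of the first pair of formulas (a hedged sketch; names and dimensions are arbitrary):

```python
import numpy as np

def R(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def dR_da(a, b):
    # dR/da_i = b_i / (||a|| ||b||) - R(a, b) * a_i / ||a||^2
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return b / (na * nb) - R(a, b) * a / na**2

rng = np.random.default_rng(2)
a, b = rng.normal(size=6), rng.normal(size=6)
eps = 1e-6
numeric = np.array([(R(a + eps * np.eye(6)[i], b) - R(a - eps * np.eye(6)[i], b)) / (2 * eps)
                    for i in range(6)])
assert np.allclose(dR_da(a, b), numeric, atol=1e-6)
```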

Page 8

Learning

• The objective function:

Suppose the association between y^(c) and y^(e) is observed n_e^(c) times, and let n^(c) = Σ_e n_e^(c). Then, approximated by the multinomial distribution, the probability of observing {n_e^(c): e ∈ E(c)} given c, n^(c), E(c), where E(c) = {e: y^(e) is associated with y^(c)}, is

P({n_e^(c): e ∈ E(c)} | c, n^(c), E(c)) ∝ ∏_{e ∈ E(c)} P(y^(e) | y^(c))^{n_e^(c)}.   (1)

Considering all possible c ∈ C and assuming independence, the joint probability of observing {{n_e^(c): e ∈ E(c)}: c ∈ C} is

P({{n_e^(c): e ∈ E(c)}: c ∈ C} | {(c, n^(c), E(c)): c ∈ C}) ∝ ∏_{c ∈ C} ∏_{e ∈ E(c)} P(y^(e) | y^(c))^{n_e^(c)}.   (2)

Hence a loss function can be constructed as

L = − Σ_{c ∈ C} Σ_{e ∈ E(c)} n_e^(c) ln P(y^(e) | y^(c)).   (3)
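A sketch of loss (3) under assumed data structures (hypothetical names): observed counts n_e^(c) stored per context, and model probabilities P(y^(e) | y^(c)) looked up per pair.

```python
import numpy as np

def count_weighted_loss(counts, probs):
    """L = - sum_c sum_{e in E(c)} n_e^(c) * ln P(y^(e) | y^(c)).

    counts : dict mapping context id -> {expression id: observed count n_e^(c)}
    probs  : dict mapping context id -> {expression id: model probability P(y^(e)|y^(c))}
    """
    loss = 0.0
    for c, n_c in counts.items():
        for e, n in n_c.items():
            loss -= n * np.log(probs[c][e])
    return loss

# usage with toy numbers
counts = {"c1": {"e1": 3, "e2": 1}, "c2": {"e3": 2}}
probs  = {"c1": {"e1": 0.7, "e2": 0.1}, "c2": {"e3": 0.5}}
print(count_weighted_loss(counts, probs))
```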

Page 9

Logic behind the data

[Figure: items in the semantic space admit a complete description, while each observed record carries only a non-complete description {main features, meta features}. The loop over the data is: Query → Infer + retrieve → Judge → Feedback.]

Page 10

Engineering platform

[Figure: the platform connects an open space and a closed space through:
- NLP Knowledge: Parser, Tagger, Tokenizer
- Contents
- Search Engine: Index, Ranking models
- Services: Search, Recommendation, Suggestion, Conversation
- Feedbacks and Instrumentation
- Data Analytics: Statistical models]

Page 11

Search Engine

Closed learning (cold start mining for IR)

[Figure: structured and unstructured contents are mined, via the NLP models (tokenizing, tagging, parsing), into tables of entities/meanings/concepts with main features and meta features. A query is featurized and, together with the ranking models, matched against the indexed documents to produce the retrieved results; the same assets power auto-complete search.]

Page 12

Feedback learning: search & suggestion

[Figure: a query and its retrieved results are logged through instrumentation; data analytics (statistical models) turns the feedback into featurized (query, result, count) pairs, i.e., signals. The NLP models (parsing, tagging, tokenizing) featurize both sides, the context and expression towers f(·) and g(·) map them from the language space into the semantic space, and the learned relevance is used for re-ranking.]

Page 13

Feedback learning: recommendation

[Figure: featurized (query, result, count) pairs serve as signals; contexts and expressions are mapped by f(·) and g(·) from the language space into the semantic space. Model-based recommendation ranks candidate results from these embeddings and is complemented by rule-based recommendation driven by knowledge.]

Page 14

Feedback learning: conversation

[Figure: a language input is processed by the NLP models (tokenizing, tagging, parsing) into tagged form and sent to the search engine (index, ranking models) over the knowledge contents; the retrieved results and the chat log feed the language models (RNNs/LSTM) that generate the response; instrumentation collects the feedback.]

h_t = h(c_t) ∘ o_t, where
i_t = σ(W^(i) h_{t−1} + U^(i) x_t),
f_t = σ(W^(f) h_{t−1} + U^(f) x_t),
o_t = σ(W^(o) h_{t−1} + U^(o) x_t),
g_t = h(W^(g) h_{t−1} + U^(g) x_t),
c_t = c_{t−1} ∘ f_t + g_t ∘ i_t.

L = − Σ_n Σ_t Σ_k t_{t,k}^(n) ln p_{t,k}^(n)  (cross-entropy between the target indicators t_t^(n) and the predicted word distributions p_t^(n))
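A single LSTM step matching the equations above, as a hypothetical numpy sketch (taking σ as the logistic function and h(·) as tanh, which is an assumption):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x_t, h_prev, c_prev, W, U):
    """One step: W[k], U[k] are the recurrent / input matrices for gate k."""
    i = sigmoid(W["i"] @ h_prev + U["i"] @ x_t)   # input gate
    f = sigmoid(W["f"] @ h_prev + U["f"] @ x_t)   # forget gate
    o = sigmoid(W["o"] @ h_prev + U["o"] @ x_t)   # output gate
    g = np.tanh(W["g"] @ h_prev + U["g"] @ x_t)   # candidate cell value
    c = c_prev * f + g * i                         # c_t = c_{t-1} ∘ f_t + g_t ∘ i_t
    h = np.tanh(c) * o                             # h_t = h(c_t) ∘ o_t
    return h, c

# usage: hidden size 4, input size 3
rng = np.random.default_rng(3)
W = {k: rng.normal(size=(4, 4)) for k in "ifog"}
U = {k: rng.normal(size=(4, 3)) for k in "ifog"}
h, c = lstm_step(rng.normal(size=3), np.zeros(4), np.zeros(4), W, U)
```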

Page 15

Demos

• Auto suggest

https://youtu.be/iDcKuOPU1q4

https://www.zillow.com:443/abs/AssignTrialBucket.htm?redirect=www.zillow.com&treatment=POI_TYPEAHEAD&trial=SHO_POI_TYPEAHEAD

• Chat bot

https://youtu.be/JJ6tg94LLw8

https://youtu.be/_Iu6uqaA6to

Page 16

Recurrent DSSM for Sentence Embedding

From x = (x_1, …, x_T) to w = (w_1, …, w_T) to h_T: it is model-based, word-sequence-level featurization, supervised by maximizing the joint likelihood of sequences.

[Figure: the words x_1, …, x_T (vocabulary ≈ 50k) are looked up as word embeddings w_1, …, w_T (dimension 300) and fed through an LSTM chain; the final state h_T is the sentence embedding. A context x^(c), via its context embedding, is also fed into the unit.]

h_t = h(c_t) ∘ o_t, where
i_t = σ(U^(i) w_t + W^(i) h_{t−1} + W^(i,c) x^(c)),
f_t = σ(U^(f) w_t + W^(f) h_{t−1} + W^(f,c) x^(c)),
o_t = σ(U^(o) w_t + W^(o) h_{t−1} + W^(o,c) x^(c)),
g_t = h(U^(g) w_t + W^(g) h_{t−1} + W^(g,c) x^(c)),
c_t = c_{t−1} ∘ f_t + g_t ∘ i_t.

For two sequences,

x = (x_1, …, x_T) → w = (w_1, …, w_T) → h_T,
x′ = (x′_1, …, x′_S) → w′ = (w′_1, …, w′_S) → h′_S,

Similarity: R(h_T, h′_S) = h_T · h′_S / (||h_T|| ||h′_S||), for retrieval.

R(h_T, h′_S), as the main part of the DSSM, measures the sentence-level similarity and is trained from additional signals such as clicks.
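A retrieval sketch on top of the sentence embeddings (hypothetical names; the embeddings would be the final LSTM states h_T and h′_S): rank the candidates by the cosine similarity R.

```python
import numpy as np

def rank_by_similarity(query_emb, candidate_embs):
    """Rank candidate sentence embeddings by cosine similarity R(h_T, h'_S)."""
    q = query_emb / np.linalg.norm(query_emb)
    C = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    scores = C @ q                      # cosine similarity of each candidate with the query
    order = np.argsort(-scores)         # best match first
    return order, scores[order]

# usage: one query embedding and five candidate embeddings (dimension 300 assumed)
rng = np.random.default_rng(4)
order, scores = rank_by_similarity(rng.normal(size=300), rng.normal(size=(5, 300)))
print(order, scores)
```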

Page 17

A Quick Summary on Building and Evaluating Seq-to-Seq Models

Build model

Suppose the source sequence is x = (x_1, …, x_S) and the target sentence is y = (y_1, …, y_T); we want to model the relevance probability P(y | x). By the probability chain rule, we have

P(y | x) = P(y_1, …, y_T | x) = ∏_{t=1}^{T} P(y_t | y_1, …, y_{t−1}, x),

which says we need to model P(y_t | y_1, …, y_{t−1}, x). Suppose y_1, …, y_{t−1}, x are encoded into s_{t−1}; then P(y_t | y_1, …, y_{t−1}, x) ≈ P(y_t | s_{t−1}). Define R(s_{t−1}, y_t) as the similarity function (e.g., R(s_{t−1}, y_t) = s_{t−1}^T M y_t, where M is a projection matrix); then we can further model P(y_t | y_1, …, y_{t−1}, x) as a softmax:

P(y_t | y_1, …, y_{t−1}, x) ≈ exp(γ R(s_{t−1}, y_t)) / Σ_{y′ ∈ all y} exp(γ R(s_{t−1}, y′)).

The model depends on the sequence of state vectors s_1, …, s_{T−1}. By using an LSTM, the sequence of state vectors has the recurrence relation shown below.

Suppose we have the data points {(x^(n), y^(n)): n = 1, …, N}; then, by assuming the independence of the data points, we have the loss function

Loss({(x^(n), y^(n)): n = 1, …, N}) = − (1/N) Σ_{n=1}^{N} ln P(y^(n) | x^(n)).

[Figure: an LSTM chain over the target prefix, initialized from the encoded source x, producing the state vectors s_1, …, s_{T−1}.]

s_t = h(c_t) ∘ o_t, where
i_t = σ(U^(i) y_t + W^(i) s_{t−1}),
f_t = σ(U^(f) y_t + W^(f) s_{t−1}),
o_t = σ(U^(o) y_t + W^(o) s_{t−1}),
g_t = h(U^(g) y_t + W^(g) s_{t−1}),
c_t = c_{t−1} ∘ f_t + g_t ∘ i_t.
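A hypothetical sketch of the softmax step probability with the bilinear similarity R(s_{t−1}, y_t) = s_{t−1}^T M y_t, and of ln P(y|x) as the sum of step log-probabilities (names and shapes are assumptions):

```python
import numpy as np

def step_distribution(s_prev, M, vocab_embs, gamma=1.0):
    """P(y_t | s_{t-1}) over the vocabulary via softmax of gamma * s^T M e."""
    scores = gamma * (s_prev @ M @ vocab_embs.T)   # one score per vocabulary entry
    scores -= scores.max()                         # numerical stability
    p = np.exp(scores)
    return p / p.sum()

def sequence_log_prob(states, targets, M, vocab_embs, gamma=1.0):
    """ln P(y|x) = sum_t ln P(y_t | s_{t-1}); states[t] is s_{t-1}, targets[t] the word id."""
    return sum(np.log(step_distribution(s, M, vocab_embs, gamma)[w])
               for s, w in zip(states, targets))

# usage: state dim 8, word-embedding dim 6, vocabulary of 20 words, a 5-word target
rng = np.random.default_rng(5)
states = rng.normal(size=(5, 8))
M = rng.normal(size=(8, 6))
V = rng.normal(size=(20, 6))
print(sequence_log_prob(states, [3, 7, 1, 0, 19], M, V))
```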

Page 18

Evaluate model

1) Perplexity

Suppose we have reference pairs {(x′^(m), y′^(m)): m = 1, …, M}; then we can define the perplexity:

Perplexity = 2^{ −(1/M) Σ_{m=1}^{M} (1/|y′^(m)|) log₂ P(y′^(m) | x′^(m)) }, where |y′^(m)| is the number of words in y′^(m).

Pros: the perplexity evaluates the model without generating a target sequence for each source sequence, so it naturally avoids the multiple-reference problem.

Cons: it does not look at the actually generated output.
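A direct transcription of the perplexity formula as a small sketch (hypothetical names; log2_probs holds the model's log₂ P(y′^(m) | x′^(m)) and lengths the word counts |y′^(m)|):

```python
import numpy as np

def perplexity(log2_probs, lengths):
    """Perplexity = 2 ** ( -(1/M) * sum_m (1/|y'^(m)|) * log2 P(y'^(m) | x'^(m)) )."""
    log2_probs = np.asarray(log2_probs, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    return 2.0 ** (-np.mean(log2_probs / lengths))

# usage: three reference pairs with sequence log-probabilities (base 2) and lengths
print(perplexity(log2_probs=[-30.0, -45.5, -12.0], lengths=[10, 14, 4]))
```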

2) BLEU

BLEU is a statistical metric that compares a generated target sequence for each source with its

reference target sequence.

2.1) Modified precision:

p_m(y^(m), y′^(m)) = |y^(m) ∩ y′^(m)| / |y^(m)|, the fraction of y^(m) that appears in y′^(m).

Pre({(y^(m), y′^(m)): m = 1, …, M}) = ( ∏_{m=1}^{M} p_m(y^(m), y′^(m)) )^{1/M}, the geometric mean.

2.2) Brevity penalty:

This heuristically captures the idea of recall: the longer the generated text, the more likely it contains the reference components.

BP = 1 if c > r, and exp(1 − r/c) otherwise,

where c = Σ_{m=1}^{M} |y^(m)| and r = Σ_{m=1}^{M} |y′^(m)|.

Putting these together yields

BLEU = BP · Pre.

Pros: it is intuitive and easy to use.

Cons: it is bad for comparing very different systems.

Page 19

Xugang, Nov. 2018

On Feeding Context into the RNN/LSTM Unit of RLM?

In recurrent language model (RLM):

Conditional sequence probability: P(y | x) = ∏_{t=1}^{T} P(y_t | y_1, …, y_{t−1}, x),

Conditional word probability: P(y_t | y_1, …, y_{t−1}, x) = exp(γ R(s_{t−1}, y_t)) / Σ_{y′ ∈ V} exp(γ R(s_{t−1}, y′)), with R(s_{t−1}, y_t) = s_{t−1}^T M y_t.

In the simplest RNN:

State transition mechanism: s_t = h(U y_t + W s_{t−1} + W^(in) v_in).

In a typical LSTM:

State transition mechanism: s_t = h(c_t) ∘ o_t,

where
i_t = σ(U^(i) y_t + W^(i) s_{t−1} + W^(i,in) v_in),
f_t = σ(U^(f) y_t + W^(f) s_{t−1} + W^(f,in) v_in),
o_t = σ(U^(o) y_t + W^(o) s_{t−1} + W^(o,in) v_in),
g_t = h(U^(g) y_t + W^(g) s_{t−1} + W^(g,in) v_in),
c_t = c_{t−1} ∘ f_t + g_t ∘ i_t.

[Figure: LSTM cell diagram showing the gates acting on s_{t−1}, c_{t−1}, and the context input v_in to produce s_t, c_t.]
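To make the question concrete, a hypothetical numpy sketch of one LSTM step in which every gate also receives the context vector v_in through an extra W_in term:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step_with_context(w_t, s_prev, c_prev, v_in, U, W, W_in):
    """Each gate sees the word input w_t, the previous state s_prev, and the context v_in."""
    gate = lambda k, act: act(U[k] @ w_t + W[k] @ s_prev + W_in[k] @ v_in)
    i, f, o = (gate(k, sigmoid) for k in "ifo")    # input, forget, output gates
    g = gate("g", np.tanh)                         # candidate cell value
    c = c_prev * f + g * i
    s = np.tanh(c) * o
    return s, c

# usage: state dim 4, word-embedding dim 3, context dim 2
rng = np.random.default_rng(6)
U    = {k: rng.normal(size=(4, 3)) for k in "ifog"}
W    = {k: rng.normal(size=(4, 4)) for k in "ifog"}
W_in = {k: rng.normal(size=(4, 2)) for k in "ifog"}
s, c = lstm_step_with_context(rng.normal(size=3), np.zeros(4), np.zeros(4),
                              rng.normal(size=2), U, W, W_in)
```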

Page 20

Xugang, Nov. 2018

On Making a Personally Consistent Response

A good example is the SPEAKER model in (Li et al 2017). This model basically treats the embedding vector of personal info as context. As the figure below shows, v^(i) is the embedding vector of person i. The intuition is that when generating y_t (from s_{t−1}), v^(i) also plays a role, given the observation that person i and y_t have significant co-occurrence.

P(y_t | y_1, …, y_{t−1}, x, v^(i)) = exp(γ R(s_{t−1}, y_t)) / Σ_{y′ ∈ V} exp(γ R(s_{t−1}, y′)),

P(y | x, v^(i)) = P(y_1, …, y_T | x, v^(i)) = ∏_{t=1}^{T} P(y_t | y_1, …, y_{t−1}, x, v^(i)).

A natural extension is the SPEAKER-ADDRESSEE model, in which v^(i) is replaced by

v^(i,j) = h(W_1 v^(i) + W_2 v^(j)),

where i is the speaker and j is the addressee.

Note that even if two speakers at test time were never involved in the same conversation in the training data, speakers that are respectively close to them in the embedding space may have been, and this can help model how one speaker should respond to the other.

[Figure: the decoder LSTM consumes the word embeddings and the speaker embedding v^(i) at each step; the speaker embeddings are drawn from a personalized dictionary.]
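A one-line sketch of the SPEAKER-ADDRESSEE combination (hypothetical names; taking h as tanh):

```python
import numpy as np

def speaker_addressee_vector(v_i, v_j, W1, W2):
    # v^(i,j) = h(W1 v^(i) + W2 v^(j)), combining speaker i and addressee j embeddings
    return np.tanh(W1 @ v_i + W2 @ v_j)

# usage: 16-dimensional speaker embeddings
rng = np.random.default_rng(7)
v_ij = speaker_addressee_vector(rng.normal(size=16), rng.normal(size=16),
                                rng.normal(size=(16, 16)), rng.normal(size=(16, 16)))
```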

Page 21

Xugang, Nov. 2018

Deriving Maximum Mutual Information Objective Function for Recurrent Language Model

Consider a response sentence y = (y_1, …, y_T), with message x and context v.

In the recurrent language model (RLM),

P(y | x, v) = P(y_1, …, y_T | x, v) = ∏_{t=1}^{T} P(y_t | y_1, …, y_{t−1}, x, v).

Suppose we use an LSTM to model P(y_t | y_1, …, y_{t−1}, x, v); then

P(y_t | y_1, …, y_{t−1}, x, v) = exp(γ R(s_{t−1}, y_t)) / Σ_{y′ ∈ V} exp(γ R(s_{t−1}, y′)),

where R is a similarity function. For example, R(s_{t−1}, y_t) = (M s_{t−1})^T y_t = s_{t−1}^T M^T y_t, where M is the projection matrix that handles the dimension mismatch between s_{t−1} and y_t. Putting those together yields

P(y | x, v) = ∏_{t=1}^{T} exp(γ R(s_{t−1}, y_t)) / Σ_{y′ ∈ V} exp(γ R(s_{t−1}, y′)),

which is the target probability in the usual seq-to-seq objective function.

Now, consider P(y | x) and still use an LSTM; then

P(y | x) = ∏_{t=1}^{T} exp(γ R(s̃_{t−1}, y_t)) / Σ_{y′ ∈ V} exp(γ R(s̃_{t−1}, y′)),

with state vectors s̃_{t−1} computed without the context v.

We have now parametrized log [ P(y | x, v) / P(y | x) ], which is the target mutual information score in the MMI objective function.

*Note: it has been shown (in Li et al 2016) that maximum mutual information (MMI) models produce more diverse, interesting, and appropriate responses in dialogues.

[Figure: two LSTM chains over the same response prefix, one conditioned on the message and context to produce s_1, …, s_{t−1}, and one without the context to produce s̃_1, …, s̃_{t−1}.]
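A sketch of the MMI score as the difference of the two sequence log-probabilities, assuming the per-token probabilities from the two LSTM runs above are given (hypothetical names):

```python
import numpy as np

def mmi_score(token_probs_with_context, token_probs_without_context):
    """log [ P(y|x,v) / P(y|x) ] = sum_t ln P(y_t|...,x,v) - sum_t ln P(y_t|...,x)."""
    lp_cond = np.sum(np.log(token_probs_with_context))      # ln P(y | x, v)
    lp_marg = np.sum(np.log(token_probs_without_context))   # ln P(y | x)
    return lp_cond - lp_marg

# usage: per-token probabilities of the same response under the two models
print(mmi_score([0.4, 0.2, 0.3], [0.3, 0.1, 0.2]))
```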