
Neural Models for Sequence Chunking

By Feifei Zhai et al.

(IBM Watson)

Presented by Sagar Dahiwala

CIS 601

Agenda

1. Natural language understanding

2. Problem in current system

3. Basic neural networks: RNN, LSTM

4. Implemented Models 1, 2, 3

5. Experiments

6. Conclusion

1. Natural language understanding (NLU)

• NLU tasks such as:

1. Shallow parsing

• Analysis of a sentence that first identifies the constituent parts of sentences (nouns, verbs, adjectives, etc.)

• Links them to higher-order units that have discrete grammatical meanings (noun groups or phrases, verb groups, etc.)

2. Semantic slot filling

• Requires the assignment of representative labels to the meaningful chunks in a sentence.

2. Problem in current system

• Most current deep neural network (DNN) based methods treat this task as a sequence labeling problem.

• Sequence labeling problem
• Words are treated as the basic unit of labeling, rather than chunks.

IOB-based (Inside-Outside-Beginning) sequence labeling

• B – Stands for Beginning of chunk

• O – Outside of any chunk (an artificial class)

• VP – Verb phrase

• I – Inside of chunk, other words within the same semantic chunk

• NP – Noun phrase

Example sentence: “But it could be much worse”
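To make the IOB scheme concrete, here is a minimal Python sketch that decodes IOB tags back into labeled chunks. The tag assignment for the example sentence is an illustrative guess following standard CoNLL-style chunking conventions, not copied from the slides.

```python
# Minimal sketch: decode IOB tags into (label, chunk) pairs.
# The tags below are an illustrative assumption for the example sentence.
tokens = ["But", "it", "could", "be", "much", "worse"]
tags   = ["O", "B-NP", "B-VP", "I-VP", "B-ADJP", "I-ADJP"]

def iob_to_chunks(tokens, tags):
    """Group consecutive B-X / I-X tags into (label, words) chunks."""
    chunks, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append(current)
            current = (tag[2:], [tok])                  # start a new chunk
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(tok)                      # extend the current chunk
        else:                                           # "O" or inconsistent I- tag
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return [(label, " ".join(words)) for label, words in chunks]

print(iob_to_chunks(tokens, tags))
# [('NP', 'it'), ('VP', 'could be'), ('ADJP', 'much worse')]
```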

3. Basic neural networks

1. RNN – Recurrent Neural Network

2. LSTM – Long Short-term memory

3.1 RNN – Recurrent Neural Network
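The original 3.1 slides were diagrams. As a stand-in, here is a minimal NumPy sketch of one recurrent step in a standard Elman-style RNN (an illustration, not code from the paper): the new hidden state depends on the current input and the previous hidden state, so context is carried forward through the sentence.

```python
# Minimal NumPy sketch of a standard RNN step (illustrative, not from the paper).
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """h_t = tanh(W_xh @ x_t + W_hh @ h_prev + b_h)"""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

D, H = 100, 64                                          # embedding dim, hidden dim (assumed)
W_xh = np.random.randn(H, D) * 0.01
W_hh = np.random.randn(H, H) * 0.01
b_h = np.zeros(H)

h = np.zeros(H)
for x_t in np.random.randn(6, D):                       # unroll over a 6-word sentence
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)               # h carries context forward
```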

3.2 LSTM – Long Short-term memory

• Element-wise addition (+)

• Element-wise multiplication (×)
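The element-wise operations above appear in the LSTM cell update. Below is a minimal NumPy sketch of a single LSTM step in the standard formulation (an illustration with assumed dimensions, not code from the paper), marking where the element-wise multiplication (×) and addition (+) occur.

```python
# Minimal NumPy sketch of one LSTM step (standard formulation, illustrative).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b stack the parameters of all four gates."""
    z = W @ x_t + U @ h_prev + b                # pre-activations for the gates
    i, f, o, g = np.split(z, 4)                 # input, forget, output gates, candidate
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c_t = f * c_prev + i * g                    # element-wise × and + (cell state update)
    h_t = o * np.tanh(c_t)                      # element-wise × (hidden state)
    return h_t, c_t

D, H = 100, 64                                  # assumed dimensions
W = np.random.randn(4 * H, D) * 0.01
U = np.random.randn(4 * H, H) * 0.01
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(np.random.randn(D), h, c, W, U, b)
```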

The IOB schema for the labeling problem has two drawbacks:

• There is no explicit model to learn and identify the scope of chunks in a sentence; chunk boundaries are only inferred implicitly.

• Some neural networks (NN) such as RNNs and LSTMs can encode context information, but they do not treat each chunk as a complete unit.

A natural solution to overcome the above two drawbacks is sequence chunking, which has two subtasks:

• Segmentation – identify the scope of the chunks explicitly

• Labeling – label each chunk as a single unit based on the segmentation results

This also matches how humans remember things:

• Phone numbers are not typically seen or remembered as a long string of numbers like 8605554589, but rather as 860-555-4589.

• Birthdates are typically not recalled as 11261995, but rather as 11/26/1995.

4. Model 1

• A single Bi-LSTM handles both segmentation and labeling.

• Average(.) computes the average of the input vectors (the hidden states inside a chunk).

• A softmax layer is used for labeling.

• In Figure 2 of the paper, “much worse” is identified as a chunk of length 2; its hidden states are averaged and passed through the softmax layer to finally get the “ADJP” label (see the sketch below).
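A hedged PyTorch sketch of this labeling step (the tensor shapes, the chunk position, and `label_layer` are illustrative assumptions, not the authors' code): average the Bi-LSTM hidden states inside the identified chunk, then apply a softmax layer over the chunk labels. The Bi-LSTM that produces these hidden states is described on the next slide.

```python
# Hedged sketch of Model I's labeling step; shapes and chunk span are assumed.
import torch
import torch.nn as nn

T, H, n_labels = 6, 128, 4                      # sentence length, Bi-LSTM output size, label count
h = torch.randn(T, H)                           # Bi-LSTM hidden states h_1 .. h_T
label_layer = nn.Linear(H, n_labels)            # weights feeding the softmax layer

# Segmentation identified "much worse" as a chunk of length 2 (last two timesteps here).
chunk_vec = h[-2:].mean(dim=0)                  # Average(.) over the chunk's hidden states
probs = torch.softmax(label_layer(chunk_vec), dim=-1)
label = probs.argmax().item()                   # index of the predicted label, e.g. ADJP
```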

Bi-LSTM

• Given an input sentence x = (x1, x2, …, xT)

• The forward LSTM reads the input sentence from x1 to xT and generates the forward hidden states (→h1, →h2, …, →hT)

• The backward LSTM reads the input sentence from xT to x1 and generates the backward hidden states (←h1, ←h2, …, ←hT)

• Then, for each timestep t, the Bi-LSTM hidden state is the concatenation of the forward and backward states: ht = [→ht ; ←ht]
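As a concrete illustration of this concatenation, here is a minimal PyTorch sketch with assumed dimensions (not the paper's code): `nn.LSTM` with `bidirectional=True` already returns [→ht ; ←ht] at each timestep.

```python
# Minimal PyTorch sketch of the Bi-LSTM described above; dimensions are assumed.
import torch
import torch.nn as nn

T, D, H = 6, 100, 64                            # sentence length, embedding dim, hidden dim
x = torch.randn(1, T, D)                        # one embedded sentence (batch, T, D)

bilstm = nn.LSTM(input_size=D, hidden_size=H, bidirectional=True, batch_first=True)
h, _ = bilstm(x)                                # h[:, t, :] = [forward h_t ; backward h_t]
print(h.shape)                                  # torch.Size([1, 6, 128])
```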

Drawbacks of Model 1

• May not perform well on both the segmentation and labeling subtasks

4. Model 2

• Follows the encoder-decoder framework.

• Similar to Model 1, a Bi-LSTM is employed for segmentation with IOB labels.

• This Bi-LSTM serves as the encoder and creates a sentence representation [→hT ; ←h1], which is used to initialize the decoder LSTM.

• The decoder uses chunks as inputs instead of words.

• For example, “much worse” is a chunk in Figure 3 of the paper, and it is taken as a single input to the decoder.

4. Model 2

• Where g(.) is a CNNMax layer (a convolution over the chunk followed by max pooling).

• Cwj is the concatenation of context word embeddings; the difference from Model 1 is that the decoder input combines {Cxj, Chj, Cwj}.

• The generated hidden states are finally used for labeling by a softmax layer (a hedged sketch of the CNNMax layer follows below).
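A hedged PyTorch sketch of a CNNMax-style g(.): a 1-D convolution over the chunk's word embeddings followed by max pooling over time. The kernel size, all dimensions, and the exact way {Cxj, Chj, Cwj} are combined are illustrative assumptions, not taken from the paper.

```python
# Hedged sketch of a CNNMax-style g(.): convolution over the chunk's word
# embeddings, then max pooling over time. Dimensions are assumed.
import torch
import torch.nn as nn

D, F = 100, 64                                   # word embedding dim, conv feature maps (assumed)
conv = nn.Conv1d(in_channels=D, out_channels=F, kernel_size=2, padding=1)

def cnn_max(chunk_embeddings):
    """(chunk_len, D) word embeddings -> fixed-size chunk vector (F,)."""
    x = chunk_embeddings.t().unsqueeze(0)        # (1, D, chunk_len) as Conv1d expects
    feats = torch.relu(conv(x))                  # (1, F, chunk_len + 1)
    return feats.max(dim=2).values.squeeze(0)    # max pooling over time -> (F,)

Cx_j = cnn_max(torch.randn(2, D))                # g(.) over the 2-word chunk "much worse"
Ch_j = torch.randn(128)                          # assumed: chunk's encoder hidden states (e.g. averaged)
Cw_j = torch.randn(2 * D)                        # assumed: concatenated context word embeddings
decoder_input = torch.cat([Cx_j, Ch_j, Cw_j])    # {Cx_j, Ch_j, Cw_j} fed to the decoder LSTM step
```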

Drawbacks of using IOB labels for segmentation

• It is hard to use chunk-level features for segmentation, such as the length of chunks.

• IOB labels cannot compare different chunks directly.

4. Model 3

• Model III is similar to Model II; the only difference is the method of identifying chunks.

• Model III is a greedy process of segmentation and labeling: first identify one chunk, then label it.

• Repeat the process until all words are processed. As all chunks are adjacent to each other, once one chunk is identified, the beginning point of the next one is also known, and only its ending point needs to be determined (see the sketch below).
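A hedged sketch of this greedy loop in plain Python. Here `pick_end_point` and `label_chunk` are hypothetical stand-ins for the pointer network and the softmax labeler; they are not functions from the paper.

```python
# Hedged sketch of Model III's greedy segment-then-label loop.
# pick_end_point and label_chunk are hypothetical placeholders.
def greedy_chunking(words, pick_end_point, label_chunk):
    chunks, begin = [], 0
    while begin < len(words):
        end = pick_end_point(words, begin)        # choose the ending point of the current chunk
        label = label_chunk(words, begin, end)    # label the identified chunk as one unit
        chunks.append((label, words[begin:end + 1]))
        begin = end + 1                           # chunks are adjacent: the next chunk starts right after
    return chunks
```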

4. Model 3

• Here, a pointer network is used to determine the ending point, where j is the decoder timestep (chunk index).

• The probability of choosing ending-point candidate i is given by a softmax over scores that compare each candidate with the current decoder state (a hedged sketch follows).
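A hedged sketch of this pointer-style scoring, following Vinyals, Fortunato, and Jaitly (2015): W1, W2, and v form a generic additive-attention scorer, an assumption for illustration rather than the paper's exact parameterization.

```python
# Hedged sketch of pointer-network-style scoring: score each ending-point
# candidate i against the decoder state d_j, then softmax over the candidates.
import torch
import torch.nn as nn

H = 128                                          # assumed hidden size
W1 = nn.Linear(H, H, bias=False)                 # transforms candidate encoder states
W2 = nn.Linear(H, H, bias=False)                 # transforms the decoder state d_j
v = nn.Linear(H, 1, bias=False)                  # reduces each score vector to a scalar

def end_point_probs(candidate_states, d_j):
    """candidate_states: (num_candidates, H); d_j: (H,) decoder state at chunk j."""
    scores = v(torch.tanh(W1(candidate_states) + W2(d_j))).squeeze(-1)
    return torch.softmax(scores, dim=-1)         # probability of each ending point i

p = end_point_probs(torch.randn(4, H), torch.randn(H))
print(p, p.argmax().item())                      # most likely ending point index
```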

5. Experiments

• Text Chunking Results

• Comparison with published reports

5. Experiments

• Slot Filling Results
  • Segmentation Results
  • Labeling Results
• Comparison with published reports

Reference (Author / Title – Context – Link)

• Lample et al. (2016), Neural Architectures for Named Entity Recognition – Stack-LSTM and transition-based algorithm – https://arxiv.org/pdf/1603.01360.pdf

• Dyer et al. (2015) – Stack-LSTM – http://www.cs.cmu.edu/~lingwang/papers/acl2015.pdf

• Wikipedia – Softmax layer – https://en.wikipedia.org/wiki/Softmax_function

• Cho et al. (2014), On the Properties of Neural Machine Translation: Encoder-Decoder Approaches – encoder-decoder framework – https://arxiv.org/pdf/1409.1259.pdf

• CS231n, Convolutional Neural Networks (CNNs / ConvNets) – CNN – http://cs231n.github.io/convolutional-networks/

• Nallapati et al. (2016), Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond – encoder-decoder-pointer framework – https://arxiv.org/pdf/1602.06023.pdf

• Vinyals, Fortunato, and Jaitly (2015), Pointer Networks – pointer network – https://arxiv.org/pdf/1506.03134.pdf

• Brandon Rohrer, Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) – RNN-LSTM – https://www.youtube.com/watch?v=WCUNPb-5EYI

• Spoken Language Understanding (SLU) / Slot Filling in Keras – ATIS (Airline Travel Information System) – https://github.com/chsasank/ATIS.keras

Thank You
