
Multi-word Expression Recognition with Few Shot Learning

Wei Chu

A report submitted for the course COMP8755 Individual Project
Supervised by: Dr. Lizhen Qu

The Australian National University

May 2018

© Wei Chu 2018


Except where otherwise indicated, this report is my own original work.

Wei Chu
25 May 2018


Acknowledgments

Thank you to my supervisor, Dr. Lizhen Qu. Your advice and guidance greatly helped me complete this thesis. The two years of study with you made me realize where my passion lies. From the weekly flipped classes to the paper discussion group, all of these hugely deepened my knowledge and helped me discover the things I could do and the things I want to do. I will keep working hard and use my knowledge to benefit more people around us.

Thank you to my course convener, Dr. Peter Strazdins. Because of your weekly meetings, I was able to plan and complete this report quite smoothly. Writing has never been easy for me, and your writing tips, such as staying consistent with the terminology and tenses used, really helped me a lot.

Special thanks to Nikhil Mathew. Thank you for your contribution to and collaboration on the memory retrieval models; without your help I could not have done so much with this project. Thank you for being an excellent study partner these two years; I learned so much from you. Coding is not hard, just stay cool and keep debugging.

Thank you to my family and my dear partner. You are the reason I have made so much progress these years. It was never an easy decision to jump from economics to computer science. I still remember the day I talked to you about my decision; without hesitation, you gave me your full support. I am happy to say that I have found what I love to do.


Abstract

This report introduces a semi-supervised few-shot learning model for multi-word expression (MWE) recognition. The biggest challenge in MWE recognition is that annotated data is limited, which makes it hard for traditional deep learning methods to excel. One remedy is to use a few-shot learning model to leverage a learned distribution and apply it to similar tasks. To provide the support for such a distribution, we propose a knowledge base system that stores and retrieves the relevant information the few-shot model needs. The results show that few-shot learning can recognize almost 70% of MWEs, but the accuracy is relatively low because the number of incorrect MWE predictions is large. We also find that the context of an MWE does not help much in recognizing MWEs, because the contextual variation between the MWE class and the non-MWE class is subtle.


Contents

Acknowledgments

Abstract

1 Introduction
  1.1 MWE landscape
    1.1.1 Characteristics of MWE
  1.2 Challenges in MWE Recognition
  1.3 Report outline

2 Background and Related Work
  2.1 Neural Language Modeling
    2.1.1 Embeddings
    2.1.2 Sequencing Neural Language Model
    2.1.3 Attention Mechanism
  2.2 Meta-learning
    2.2.1 Few-shot learning
    2.2.2 Metric learning
  2.3 Summary

3 Memory-Augmented Few-Shot Networks
  3.1 MWE candidate generator
    3.1.1 Tree search matching MWE corpus
    3.1.2 Spans of contiguous tokens
  3.2 Memory initialization with MWE candidates
    3.2.1 Relational database for text knowledge
    3.2.2 Memory map for vectorized knowledge
    3.2.3 K-Means Clustering
  3.3 Memory retrieval
    3.3.1 Weighted retrieval
  3.4 Few-shot learning
    3.4.1 Siamese learning
    3.4.2 Prototype learning
    3.4.3 Distance metrics
    3.4.4 Weighted scoring with attention mechanism

4 Experiments
  4.1 Experiment settings
    4.1.1 Dataset
  4.2 MWE candidate generator results
    4.2.1 MWE Coverage
  4.3 Retrieving results
  4.4 Few-shot learning results

5 Conclusion
  5.1 Future Work

Bibliography

List of Figures

2.1 Attention mechanism for machine translation (Synced [2017])
4.1 Visualization of retrievals without weighing
4.2 Visualization of retrievals with weighing

List of Tables

4.1 MWE Generator Coverage Rates
4.2 Top Retrievals for query candidates part a (no weighing)
4.3 Top Retrievals for query candidates part b (no weighing)
4.4 Top Retrievals for query candidates part a (with weighing)
4.5 Top Retrievals for query candidates part b (with weighing)
4.6 Experimental results of few-shot models

Chapter 1

Introduction

The penetration of deep learning has brought success to many fields and produced numerous state-of-the-art solutions in computer vision, speech, natural language processing and reinforcement learning. Quite often, achieving superior outcomes with deep learning models requires the support of massive data. There are, however, learning problems with limited access to labeled resources, and training such models is very difficult. Our task, multi-word expression (MWE) recognition, is such a case: we do not have much labeled data available. The recent rise of the meta-learning paradigm provides a way out; learning is still possible with only a few examples. In this work, we cast MWE recognition as a meta-learning problem and propose a semi-supervised learning regime to address the insufficient labeled data problem.

1.1 MWE landscape

Let us start with the question of what qualifies as an MWE. Loosely speaking, an MWE is a word combination that acts as a single linguistic unit. In other words, an MWE should consist of at least two words that together exhibit lexical, syntactic and semantic idiosyncrasy [Moschitti et al., 2014].

Definitions of MWEs vary. For example, Copestake et al. [2002] define MWEs as "nominal compounds, phrasal verbs, idioms and collocations", and Calzolari et al. [2002] define an MWE as the interface between grammar and lexicon. Schneider et al. [2014] integrated the various definitions of MWE and expanded them with some modern usages. In this work, MWEs are defined following Schneider et al. [2014]. Details are discussed in section 1.1.1.

1.1.1 Characteristics of MWE

Understanding the idiosyncrasy of MWEs is very important: it helps us distinguish the subtleties between different MWE types. Specifically, these subtleties provide guidance for the unsupervised learning part, where we need to define the concept of closeness.

Generally speaking, MWEs can be divided into idioms and named entities.


Idioms can be further decomposed into the following specific categories according to Baldwin and Kim [2010].

1. MWE idioms [Baldwin and Kim, 2010]

• Lexical idiomaticity — refers to exotic phrases that did not originate from English, e.g., ad hoc.

• Syntactic idiomaticity — can be considered a violation of general grammar rules, e.g., by and large: "it is adverbial in nature, but made up of the anomalous coordination of a preposition (by) and an adjective (large)" [Baldwin and Kim, 2010].

• Semantic idiomaticity — sometimes also addressed as non-compositional MWEs, meaning that the meaning of the MWE is not derivable from its component tokens, for example, blow hot and cold and kick the bucket.

• Pragmatic idiomaticity — refers to MWEs that are treated as fixed expressions, e.g., good morning.

• Statistical idiomaticity — captures the co-occurrence of words: the set of words is used together very frequently; traffic light is one example.

2. MWE named entities [Schneider et al., 2014] — Named entities are another big category. A named entity is also a combination of words, usually used to indicate names, organizations, places, times, monetary values, etc. [Nadeau and Sekine, 2007]. The inclusion of named entities makes MWE recognition much more generally applicable to other language tasks.

1.2 Challenges in MWE Recognition

Conventional deep learning is extremely data hungry; learning is driven entirely by large amounts of data. Not only must the total scale of the data be large, it must also be large at the per-class level. For our task, MWE recognition, the main impediment is that we do not have sufficient labeled data: very limited resources provide annotated MWE dependencies. To address this issue, we propose a few-shot based learning approach, whose advantage is that it works for learning new concepts from very few labeled examples.

Recognition of MWEs is particularly challenging because it is not easy to generalize over the diverse patterns of MWEs. As discussed in section 1.1.1, different types of MWEs may share no textual, lexical, syntactic or semantic patterns. Recognition is therefore no longer a simple binary classification problem, due to the existence of latent classes.


Latent classes are classes that we cannot observe directly but can instead infer from what we can observe. How to model the distribution of the latent classes of MWEs becomes the main challenge. To address this issue, we propose a semi-supervised learning model that uses nearest neighbours to approximate the distribution of latent classes. The underlying assumption for this model to work is that, for any instance from the MWE class, there should exist another MWE instance that is closer to it than any non-MWE instance.

Our contributions are summarized as follows:

1. We apply few-shot learning to train an MWE classifier with insufficient labeled data.

2. We model the latent spaces with nearest neighbours pulled from semi-supervised results.

1.3 Report outline

In chapter 2 we discuss the related background, with a focus on neural language models and few-shot learning. In chapter 3 we discuss how we constructed the few-shot networks, followed by the experimental results in chapter 4.


Chapter 2

Background and Related Work

This chapter briefly discusses the background of language processing models and their evolution over the deep learning era. We then move to the topic of most concern, few-shot learning; specifically, we introduce a number of variations of metric-based few-shot learning methods.

2.1 Neural Language Modeling

Language modeling is about assigning probabilities to a sequence of words [Jurafsky and Martin, 2009]. In order to construct a probability distribution over words, we have to convert the literal text to a numerical representation; we refer to this conversion process as data encoding.

2.1.1 Embeddings

Data encoding is a big obstacle for language-related tasks, as the general semantic meaning of a word is not easy to represent. Traditional language models that use TF-IDF or bag-of-words encoding schemes usually make predictions that are context-dependent [Jurafsky and Martin, 2009], and unseen words will likely cause problems. Suppose we are asked to predict the word after "open the", and the answer is the word "window", which never appeared in the training data. In this situation, traditional encodings cannot be expected to predict "window". The emergence of neural vector space models has alleviated this issue.

In the neural setting, each word is represented as a vector, and this vector is often referred to as an embedding. The dimension of the embedding typically varies between 50 and 500 [Jurafsky and Martin, 2009]; the larger the dimension, the richer the information the vector can carry. The good side is that the vector representations do capture semantic meaning: similar words are often spatially closer [Pennington et al., 2014]. The downside is that the representations are not continuous in the vector space, leaving some vectors unexplainable.

To obtain word embeddings, Mikolov et al. [2013] proposed a neural model often known as the Word2Vec model. Its biggest contribution is that it automates the feature assembly process through which the semantic meaning of a word can be encoded. The gist of the model is to perform a mapping from a word to its context or from the context to the word. For example, take the sentence snippet the cat sits on the and assume the encoding target is cat. Then a tuple [[the, sits], cat] can be written to represent the context and the target word, respectively. Training is done by feeding in such tuples and learning the mapping from one to the other. Specifically, if the mapping is from context to word, the model is called continuous bag-of-words (CBOW); if the mapping is from word to context, it is called skip-gram. With large amounts of data, the results have revealed excellent performance in representing the semantic and syntactic meanings of words [Pennington et al., 2014].
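As a concrete sketch of the CBOW/skip-gram distinction, the following trains a tiny Word2Vec model with the gensim library (an assumed implementation; the report does not state which toolkit was used). The 200-dimensional embedding size matches the setting later used in section 3.2.2.

```python
from gensim.models import Word2Vec

# A toy corpus of tokenized sentences; real training would use a large corpus.
corpus = [
    ["the", "cat", "sits", "on", "the", "mat"],
    ["the", "dog", "sits", "on", "the", "rug"],
]

# sg=0 trains CBOW (context -> word); sg=1 would train skip-gram (word -> context).
model = Word2Vec(sentences=corpus, vector_size=200, window=2, min_count=1, sg=0)

cat_vector = model.wv["cat"]          # 200-dimensional embedding for "cat"
print(model.wv.most_similar("cat"))   # spatially closest words in the vector space
```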

2.1.2 Sequencing Neural Language Model

Sequencing neural language models take embeddings as inputs and model the conditional probabilities in a time-series fashion. In traditional deep learning settings such as feed-forward networks, information can only travel in one direction; there is no feedback for updating the knowledge already learned. Early work in the 1980s, such as the Hopfield network [Hopfield, 1988], proposed to associate the model with a memory from which it obtains feedback. This gradually evolved into what are now called recurrent neural networks (RNNs). The structure of an RNN is as follows:

$h_t = \sigma_h(W_h x_t + U_h h_{t-1} + b_h)$   (2.1)

$y_t = \sigma_y(W_y h_t + b_y)$   (2.2)

It has a hidden state $h_t$ that stores the feedback; when new information $x_t$ comes in, the hidden state updates accordingly. $y_t$ is the prediction at each time step, computed via a non-linear transformation of the hidden state as shown in (2.2).

Conventional RNNs suffer from the vanishing gradient problem: as the sentence grows longer, the flow of gradients from backpropagation decays dramatically. Hochreiter and Schmidhuber [1997] proposed a remedy called Long Short-Term Memory (LSTM). By introducing a number of gates, LSTM alleviates the vanishing gradient problem. The gates control which part of the information gets into the memory and which part has to be discarded. The specific implementation is formulated in (2.3)–(2.7), which include a forget gate $f_t$, an input gate $i_t$, an output gate $o_t$, a cell state $c_t$ and a hidden state $h_t$. This model has been widely deployed and has achieved many state-of-the-art outcomes (Frome et al. 2013; Koch et al. 2015; Li et al. 2016).


$f_t = \sigma_g(W_f x_t + U_f h_{t-1} + b_f)$   (2.3)

$i_t = \sigma_g(W_i x_t + U_i h_{t-1} + b_i)$   (2.4)

$o_t = \sigma_g(W_o x_t + U_o h_{t-1} + b_o)$   (2.5)

$c_t = f_t * c_{t-1} + i_t * \sigma_c(W_c x_t + U_c h_{t-1} + b_c)$   (2.6)

$h_t = o_t * \sigma_h(c_t)$   (2.7)

The effective performance of LSTM has inspired many variations. The Gated Recurrent Unit (GRU) is considered the most promising alternative to LSTM. Modern neural networks are getting bigger and more complicated, and people have started to demand more efficient training components, which is why the GRU has gained popularity. The GRU does not employ the cell unit of LSTM; instead, the hidden state is directly exposed to subsequent computation without as much gating [Cho et al., 2014]. The formulae of the GRU are as follows.

$z_t = \sigma_g(W_z x_t + U_z h_{t-1} + b_z)$   (2.8)

$r_t = \sigma_g(W_r x_t + U_r h_{t-1} + b_r)$   (2.9)

$h'_t = \sigma_h(W_h x_t + U_h(r_t * h_{t-1}) + b_h)$   (2.10)

$h_t = (1 - z_t) * h_{t-1} + z_t * h'_t$   (2.11)
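Since a bidirectional GRU (BiGRU) later serves as the encoder in chapter 3, the following is a minimal sketch of a BiGRU sequence encoder in PyTorch (PyTorch is an assumption; the report does not name its framework). The 200-dimensional input matches the Word2Vec embedding size used in section 3.2.2.

```python
import torch
import torch.nn as nn

class BiGRUEncoder(nn.Module):
    """Encode a token sequence into a single vector with a bidirectional GRU."""

    def __init__(self, embed_dim=200, hidden_dim=128):
        super().__init__()
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, embedded):            # embedded: (batch, seq_len, embed_dim)
        _, h_n = self.gru(embedded)         # h_n: (2, batch, hidden_dim)
        # Concatenate the final forward and backward hidden states.
        return torch.cat([h_n[0], h_n[1]], dim=-1)   # (batch, 2 * hidden_dim)

# Example: encode a batch of 3 sequences of length 5 with random embeddings.
encoder = BiGRUEncoder()
out = encoder(torch.randn(3, 5, 200))
print(out.shape)   # torch.Size([3, 256])
```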

2.1.3 Attention Mechanism

LSTM-based models are impressive for processing sequential data, but their performance diminishes as the sentence gets longer [Sutskever et al., 2014]. It turns out that the most recently unrolled inputs have larger effects on the cell state. In other words, the memory of an LSTM is unevenly distributed, with more focus on recent inputs. To remedy such imbalance of focus, the attention mechanism was designed.

The gist of attention is using softmax to assign probabilities over dense outputs. The work that really made attention popular came from Google [Sutskever et al., 2014], where the attention mechanism was applied to the machine translation problem. Traditionally, translation was done by an encoder-decoder structure, where the encoder takes the input and encodes it with an LSTM. The decoder takes the result from the encoder, applies another LSTM, and generates the output language step by step. Regardless of input sentence length, the encoder always produces a fixed-size encoding, so information loss is unavoidable as the sentence gets longer. To resolve this issue, Sutskever et al. [2014] suggest incorporating the information generated by each hidden state of the encoder LSTM. As shown in figure 2.1, blue is the encoding process and red is the decoding process. A probability distribution that utilizes all encoder hidden states is computed for every single word the decoder generates. Indeed, attention is essentially a weighing system.

There are a couple of variations of the attention mechanism.


Figure 2.1: Attention mechanism for machine translation (Synced [2017])


They can generally be classified into two types: local-based attention mechanisms and concatenation-based attention mechanisms. Local attention cares about nothing but the sequence itself. Given a question such as "Where is the football?", local attention should contextually allocate more focus to critical words like where and football (Li et al. [2016]). To produce such attention weights, Li et al. [2016] applied a softmax function to the dot product between a word embedding vector and a word-level latent vector, as shown in (2.12).

$\gamma_j = \mathrm{softmax}(v^T g_{q_j})$   (2.12)

Concatenation-based attention models, in contrast, take into account the interaction of one vector with another and generate a score to reflect the strength of the interaction. Chen et al. [2017] applied concatenation-based attention to evaluate the positional influence of the question on the answer, as shown in (2.13), where $e(\cdot)$ is a score function and $p_j$ is the accumulated positional influence.

$a_j = \mathrm{softmax}(e(h_j, p_j))$   (2.13)

Attention is a very flexible trick and can be applied almost anywhere it is needed.
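To make the weighing-system view concrete, here is a minimal sketch of dot-product attention between a query vector and a set of hidden states, written in PyTorch. It is a generic illustration rather than the exact local or concatenation-based variants of Li et al. [2016] and Chen et al. [2017].

```python
import torch
import torch.nn.functional as F

def dot_product_attention(query, hidden_states):
    """query: (dim,); hidden_states: (seq_len, dim).
    Returns attention weights over the sequence and the weighted summary vector."""
    scores = hidden_states @ query              # (seq_len,) raw interaction strengths
    weights = F.softmax(scores, dim=0)          # probabilities over positions
    summary = weights @ hidden_states           # (dim,) weighted sum of hidden states
    return weights, summary

# Example: 6 encoder states of dimension 256 attended by a decoder query.
w, s = dot_product_attention(torch.randn(256), torch.randn(6, 256))
print(w.sum())   # ~1.0 -- the weights form a probability distribution
```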

2.2 Meta-learning

Meta-learning is about learning to learn. It originated to target the knowledge transfer conundrum. The concept is that we train an agent from scratch with a large amount of data and then let the agent perform in a new environment relying on the experience it has just learned. The new environment should assign tasks that have some overlap with the training, so that the agent can leverage the learned distribution knowledge on new tasks. The distribution knowledge is usually sampled from the population and is often represented by the k nearest neighbours from past knowledge.

2.2.1 Few-shot learning

Few-shot learning aims to generalize learning by consulting only a few instances of a concept. Early work emerged in 2006, when Fei-Fei et al. [2006] showed the possibilities of few-shot learning in computer vision tasks: previously learned image classes can contribute to distinguishing a new image class for which only one or a few instances are provided. Since then great progress has been made. There is also prominent work, so-called zero-shot learning, that is used to predict unseen classes. Frome et al. [2013] proposed a simple but effective zero-shot network. They use a score function to rank the closeness of an instance to every class. The score is estimated based on sampled evidence from the images and the semantic representations of the labels, and it shows excellent performance in making predictions for unseen classes. Indeed, for most few-shot learning work, the main idea is to build a score metric that evaluates the distance to each class, with the help of a nearest neighbour support set.


2.2.2 Metric learning

Metric learning is a distance-based strategy for classification problems; there are other methods for few-shot learning, but we do not discuss them in this work. Basically, the metric says that we want instances from the same class to be close and instances from different classes to be far away. It is akin to k nearest neighbours but trainable. The following are three related approaches for applying metric learning.

Siamese Neural Networks  A Siamese neural network is often used for evaluating the similarity between two comparable subjects. The original idea was proposed by Bromley et al. [1993], who suggested using two identical sub-networks to calculate the similarity score between two subjects. Here, identical means that the sub-networks are initialized with the same parameters and weights.

Koch et al. [2015] extended the concept to the field of few-shot learning. In order to learn the category of a given image, first find a few support instances sampled from each category and then put them into a Siamese network for scoring. Instances from the same class should obtain a higher score than those from different classes.

$\phi(x_i, x_j) = \|\theta(x_i) - \theta(x_j)\|$   (2.14)

$w = \mathrm{softmax}(-\phi)$   (2.15)

Matching networks  Matching networks are structured very similarly to Siamese networks. The one-shot version of matching networks was proposed by Vinyals et al. [2016]. Basically, they introduced attention mechanisms into Siamese networks. Given a training example, the model first queries what has been learned and pulls out a support set. The support set includes instances sampled from each category. Instead of taking scores directly from the kernel function (as in (2.14)), they use attention to generate weighted-sum scores according to $a_{*,j} = \phi(x_*, x_j)$, where $*$ represents the query instance and $j$ indexes the elements of the support set for that query.

Prototypical networks  The setting of prototypical networks is quite simple, and it performs very effectively. According to the paper [Snell et al., 2017], every category should have a prototype in the metric space that represents the characteristics of the entire category. To calculate such a prototype, the authors propose taking the average of the support elements belonging to the same class $S_k$,

$c_k = \frac{1}{|S_k|} \sum_{(x_i, y_i) \in S_k} f_\phi(x_i)$

where $f_\phi : \mathbb{R}^D \rightarrow \mathbb{R}^M$ is a mapping with learnable parameters $\phi$. When a new instance arrives, the model assigns it to the category of the nearest prototype.


2.3 Summary

In this chapter we have discussed recurrent neural networks and the embedding model, both of which will later be used to construct the few-shot learning model. Also, different few-shot learning methods will be implemented and tested in this work, for instance Siamese networks and prototypical networks.


Chapter 3

Memory-Augmented Few-ShotNetworks

This chapter describes the process of constructing few-shot learning models for MWE recognition. Distinct from previous work, we employ a knowledge base system from which the few-shot learning model can draw a support set. The purpose of having a support set is to distinguish, distance-wise, the MWE class from the non-MWE class.

The knowledge base system is constructed in the following two steps. First, we take the data and apply a candidate generator to it. The generator outputs candidates along with MWE labels; note that these labels come from the annotations of the training data. The second step is to encode the candidates and move them to the knowledge bases. Depending on the label, we have prepared two knowledge base repositories: an MWE repository and a non-MWE repository.

3.1 MWE candidate generator

The position candidates are generated from two sources: contiguous spans of tokens and matches from the STREUSLE 3.0 corpus [Schneider and Smith, 2015]. Contiguous spans of tokens mainly handle the named entity candidates, and STREUSLE 3.0 mainly handles the candidates that are non-contiguous spans of tokens.

The generator produces each candidate as a tuple associated with an MWE label indicated by the training data. The tuple includes the candidate tokens (which we refer to as the candidate mention), the left context of the candidate, the right context of the candidate, and the gap of the candidate. The gap is used to describe non-contiguous spans of tokens. For example, in the snippet [put, the, box, down], if put down is the candidate, then the box is the gap. Since each candidate is associated with a label, we have effectively converted MWE recognition into a classification problem. The following sentence, obtained from the training data, is used as an example to illustrate the candidate generation process.

Tokens: ["and", "be", "keep", "those", "dogs", "safe", "from", "potential","problems"]


Generator(Tokens) will output:
Position and: [and,], [and, be], [and, be, keep]
Position be: [be,], [be, keep], [be, keep, those], [be, safe]
Position keep: [keep,], [keep, those], [keep, those, dogs], [keep, safe] (ground truth)
Position those: [those,], [those, dogs], [those, dogs, safe]
Position dogs: [dogs,], [dogs, safe], [dogs, safe, from]
Position safe: [safe,], [safe, from], [safe, from, potential]
Position from: [from,], [from, potential]
Position potential: [potential,]
Labels: [keep, safe] has label 1 and all others have label 0

As you can see, the candidates are generated position-wise. Among all these generated candidates, [keep, safe] is the only MWE candidate and the rest are non-MWE candidates, as indicated by the training data. The generation results are akin to token permutations, but we have made slight changes. The good side of generating the spans of all token permutations is that it gives us full coverage of MWEs. The downside is that it is very expensive, in particular for the training process, as the model gets many more candidates to distinguish. For the above sentence, using spans of all tokens would produce 60 more candidates, tripling the number of candidates presented, and more importantly, these extra candidates provide little benefit for prediction. However, without generating all of the token permutations, we might lose coverage of some MWEs. We therefore propose to employ an external MWE corpus for the generator; the aim is to minimize the number of candidates produced by the generator without sacrificing too much coverage. A sketch of the span-based part of the generator is shown below.
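Below is a minimal sketch of the position-wise generation of contiguous-token candidates up to a fixed window (the span source of section 3.1.2 uses a window of 6); the function name is ours and only illustrates the idea. Gappy candidates such as [keep, safe] come from the lexicon matching of section 3.1.1, not from this span generator.

```python
def contiguous_candidates(tokens, max_len=6):
    """Position-wise contiguous spans of up to max_len tokens."""
    candidates = []
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            candidates.append(tokens[start:end])
    return candidates

tokens = ["and", "be", "keep", "those", "dogs", "safe", "from", "potential", "problems"]
print(contiguous_candidates(tokens, max_len=3)[:4])
# [['and'], ['and', 'be'], ['and', 'be', 'keep'], ['be']]
```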

3.1.1 Tree search matching MWE corpus

A tree search is designed for matching against the corpus. The tree is constructed with a BFS algorithm and the search is done with a DFS algorithm. The time complexity of the tree search (just the searching part) is O(log n).

Candidates are split into tokens and sequentially added to the nodes of the tree. Each node carries a termination flag indicating whether it is the end of an MWE.

The matching process works as follows. Given a word, the algorithm looks for the branch that matches that word. From there it looks ahead at the subsequent words in the target sentence and outputs all the matches that hit a termination state. A small sketch of this trie-style matcher is given below.
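The following is a small sketch of such a lexicon tree (essentially a token trie) and its matching step, under the assumption that lexicon entries are lemmatized token sequences; the class and method names are ours.

```python
class LexiconTree:
    """Token trie over MWE lexicon entries; the "$" key marks a complete MWE."""

    def __init__(self, entries):
        self.root = {}
        for entry in entries:                    # entry: list of tokens
            node = self.root
            for token in entry:
                node = node.setdefault(token, {})
            node["$"] = True                     # termination flag

    def matches_at(self, tokens, start):
        """All lexicon matches beginning at position `start` of the sentence."""
        found, node = [], self.root
        for i in range(start, len(tokens)):
            if tokens[i] not in node:
                break
            node = node[tokens[i]]
            if node.get("$"):
                found.append(tokens[start:i + 1])
        return found

lexicon = LexiconTree([["more", "than"], ["as", "soon", "as", "possible"]])
print(lexicon.matches_at(["way", "more", "than", "necessary"], 1))  # [['more', 'than']]
```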

Using the corpus alone for generating candidates is not sufficient. In chapter 4, we test the MWE coverage rate with the corpus only; only about half of the MWE candidates are covered, and the non-covered cases are mostly named entities.

3.1.2 Spans of contiguous tokens

Spans of contiguous tokens are implemented to boost the named entity coverage. The corpus alone shows under-coverage of named entities and also mismatches with variations of MWEs; for example, the corpus might cover the MWE [suppose, to] while the ground truth gives [be, supposed, to] as the MWE. To remedy these under-coverage situations, we include spans of contiguous tokens with a window size of 6.

3.2 Memory initialization with MWE candidates

Memory networks (or knowledge bases) are created to hold the information from candidates. Text information from a candidate is arranged in a relational database, and vectorized candidate information is stored in NumPy with a memory map. In the following sections, we define what the text information of a candidate is and how to construct the vector information of a candidate.

3.2.1 Relational database for text knowledge

A relational database is used for storing the text information of a candidate. For each candidate, the database stores its text information, including the left context, candidate mention, right context, gap (if any), and an indicator of its MWE label. Depending on the label, the record is directed into either the MWE database or the non-MWE database.

Suppose we have a mention candidate ['more', 'than'] from the sentence ['Seem', 'to', 'me', 'like', 'a&e', 'charges', 'way', 'more', 'than', 'necessary', '!']. The left context of this candidate is defined as ['charges', 'way'] and the right context as ['necessary', '!'] if the window size is set to 2, and there is no gap since the mention is a contiguous span of tokens. The candidate ['more', 'than'] is indeed an MWE, so it is allocated to the MWE database.
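As an illustration of how such a store could look, the following uses SQLite (an assumption; the report does not name the database engine), with one row per candidate holding the fields listed above; here a single table with an is_mwe flag stands in for the separate MWE and non-MWE repositories.

```python
import sqlite3

conn = sqlite3.connect("mwe_knowledge_base.db")   # hypothetical file name
conn.execute("""
    CREATE TABLE IF NOT EXISTS candidates (
        id            INTEGER PRIMARY KEY,
        left_context  TEXT,    -- space-joined tokens, window size 2
        mention       TEXT,
        right_context TEXT,
        gap           TEXT,    -- empty for contiguous candidates
        is_mwe        INTEGER  -- 1 for the MWE repository, 0 for non-MWE
    )
""")
conn.execute(
    "INSERT INTO candidates (left_context, mention, right_context, gap, is_mwe) "
    "VALUES (?, ?, ?, ?, ?)",
    ("charges way", "more than", "necessary !", "", 1),
)
conn.commit()
```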

3.2.2 Memory map for vectorized knowledge

Vectorized knowledge consists of concatenated embeddings of the text information of candidates. We apply the Word2Vec model to map tokens to vectors; the embedding size of the Word2Vec model is 200. We construct a candidate vector $v_i \in \mathbb{R}^{1200}$ arranged as follows (the context window is 2 if not specified otherwise).

$v_i = \begin{bmatrix} v_{lc} \\ v_m \\ v_{rc} \\ v_{gap} \end{bmatrix}$   (3.1)

where the left context vector $v_{lc} \in \mathbb{R}^{2 \times 200}$, the mention vector $v_m \in \mathbb{R}^{200}$, the right context vector $v_{rc} \in \mathbb{R}^{2 \times 200}$ and the gap vector $v_{gap} \in \mathbb{R}^{200}$. Concatenation is applied to obtain the context vectors as well as the entire 1200-dimensional vector. For the mention vector, because the number of words in a mention varies, the mean of the word embeddings is taken; the gap is handled in the same manner.
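A minimal sketch of assembling this 1200-dimensional candidate vector and holding the collection in a NumPy memory map might look as follows; get_embedding stands in for a lookup into the trained Word2Vec vectors, and the map size is illustrative. The sketch assumes the contexts have already been padded to the window size of 2.

```python
import numpy as np

EMBED = 200           # Word2Vec embedding size
CANDIDATE_DIM = 1200  # 2*200 (left) + 200 (mention) + 2*200 (right) + 200 (gap)

def candidate_vector(left, mention, right, gap, get_embedding):
    """Concatenate context embeddings with averaged mention/gap embeddings."""
    def avg(tokens):
        if not tokens:
            return np.zeros(EMBED)
        return np.mean([get_embedding(t) for t in tokens], axis=0)

    parts = [get_embedding(t) for t in left]    # 2 left-context embeddings
    parts += [avg(mention)]                     # averaged mention embedding
    parts += [get_embedding(t) for t in right]  # 2 right-context embeddings
    parts += [avg(gap)]                         # averaged gap embedding (zeros if no gap)
    return np.concatenate(parts)                # shape: (1200,)

# Store vectors in a disk-backed memory map (hypothetical capacity of 100000 rows).
memory = np.memmap("mwe_vectors.dat", dtype="float32", mode="w+",
                   shape=(100000, CANDIDATE_DIM))
fake_embedding = lambda token: np.zeros(EMBED, dtype="float32")  # stand-in lookup
memory[0] = candidate_vector(["charges", "way"], ["more", "than"],
                             ["necessary", "!"], [], fake_embedding)
```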


3.2.3 K-Means Clustering

K-Means is deployed to cluster the vectorized knowledge in order to accelerate nearest neighbour searches. Recall that a k nearest neighbour search is used for fetching the support set of a given instance, which means we often send queries to the knowledge base to perform pair-wise distance computations. Comparing a query instance to every candidate in the knowledge base is computationally expensive. Instead, we apply K-Means to cluster the knowledge base so that the search is local and centroid-focused.
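A minimal sketch of this centroid-focused search with scikit-learn is shown below; the number of clusters and the neighbour count are illustrative choices, not values reported in this work.

```python
import numpy as np
from sklearn.cluster import KMeans

vectors = np.random.rand(10000, 1200).astype("float32")   # stand-in knowledge base
kmeans = KMeans(n_clusters=50, n_init=10, random_state=0).fit(vectors)

def nearest_candidates(query, k=5):
    """Search only inside the query's cluster instead of the whole knowledge base."""
    cluster = kmeans.predict(query.reshape(1, -1))[0]
    member_ids = np.where(kmeans.labels_ == cluster)[0]
    dists = np.linalg.norm(vectors[member_ids] - query, axis=1)
    return member_ids[np.argsort(dists)[:k]]   # indices of the k closest members

print(nearest_candidates(np.random.rand(1200).astype("float32")))
```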

3.3 Memory retrieval

The memory networks serve as a long-term dynamic knowledge base for candidate retrieval. Given a query instance, we would like the system to retrieve the most relevant candidates from both the MWE knowledge base and the non-MWE knowledge base.

3.3.1 Weighted retrieval

A weighted k nearest neighbour search is conducted for fetching both MWE candidates and non-MWE candidates; the specification is shown below.

$s(q, r) = \alpha(1 - s_1(q, r)) + (1 - \alpha)[s_2(q, r) + s_3(q, r)]$   (3.2)

$s_1(q, r) = s_1(\beta v_{lc} + \gamma v_m + \beta v_{rc} + \zeta v_g)$   (3.3)

where $q$ and $r$ represent the query instance and the retrieval candidate respectively. $s_1$ is the normalized Euclidean distance ranging from 0 to 1, $s_2$ is the Jaccard string similarity of the candidate mentions, and $s_3$ is the Jaccard string similarity of the part-of-speech tags of the mentions. The final retrieval ranking score weighs $s_1$, $s_2$ and $s_3$ against one another.

For the computation of $s_1$, the Euclidean distance, another weighted combination is applied. Briefly, we want a high level of attention on the gap vector $v_g$, a medium level of attention on the candidate mention vector $v_m$, and a low level of attention on the contexts $v_{lc}$, $v_{rc}$; in other words, $\beta < \gamma < \zeta$. A sketch of this weighted retrieval score follows.
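The following is a small sketch of the combined retrieval score in (3.2), assuming the weighted Euclidean part of (3.3) has already been normalized to [0, 1] and using token-level Jaccard overlap for simplicity; the 0.3/0.7 split follows the weights reported in section 4.3, and the function names are ours.

```python
def jaccard(a, b):
    """Jaccard similarity between two token sets (used for mentions and POS tags)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def retrieval_score(norm_euclidean, query, retrieval, alpha=0.3):
    """Combined score of (3.2): semantic distance plus string/POS overlap.

    norm_euclidean: weighted Euclidean distance of (3.3), normalized to [0, 1].
    alpha=0.3 and (1 - alpha)=0.7 follow the weights reported in section 4.3.
    """
    s2 = jaccard(query["mention"], retrieval["mention"])
    s3 = jaccard(query["pos"], retrieval["pos"])
    return alpha * (1.0 - norm_euclidean) + (1 - alpha) * (s2 + s3)

q = {"mention": ["good", "job"], "pos": ["ADJ", "NOUN"]}
r = {"mention": ["good", "luck"], "pos": ["ADJ", "NOUN"]}
print(retrieval_score(0.2, q, r))   # higher means a more relevant retrieval
```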

3.4 Few-shot learning

The prerequisite for few-shot learning, or for any learning model to work, is that the training set and the test set come from the same distribution. Following this setup, we train the model the same way we test it, that is, training with access to only a few instances. For each query, we train the model to learn the distribution from a support set composed of a few MWE examples and a few non-MWE examples.

We propose to use k nearest neighbours for constructing the support set. The assumption is that, given an MWE, there should exist at least one MWE in the knowledge base that is closer to it than any non-MWE candidate. Likewise, for any non-MWE case, the knowledge base should have at least one non-MWE candidate that is closer to it than any MWE candidate. Based on this assumption, we fetch the top k instances from the non-MWE knowledge base and the MWE knowledge base respectively, composing the support set.

3.4.1 Siamese learning

We use a Siamese network to rank the scores of two bidirectional GRU (BiGRU) outputs. The BiGRU is used to encode the query and its corresponding retrievals. As shown in (3.4), $m(\cdot)$ represents the BiGRU encoder, and the setup is the same for the query $q$ and a retrieval $r_i$. The distance is Euclidean; alternatives are discussed in section 3.4.3. The predicted class is the class of the retrieval with the highest score.

$s_i(q, r_i) = \|m(q) - m(r_i)\|_2$   (3.4)

$y_k = \operatorname{argmax}(s_k(q, r_k))$   (3.5)

The loss function is defined with cross-entropy:

$\ell(\Phi(q, r), y) = -y_{mwe} \log P(Y = y_{mwe} \mid q, r) - y_{non\text{-}mwe} \log P(Y = y_{non\text{-}mwe} \mid q, r)$   (3.6)

3.4.2 Prototype learning

Prototypical networks are used as an alternative method for computing the scores. A given query candidate can be either from the MWE class or from the non-MWE class, and each class can be represented by a prototype. Depending on the distances of the query to the prototypes, a score is assigned.

In the same setting as Siamese learning, the query and the retrievals go through the same BiGRU network for encoding. From the BiGRU outputs of the retrievals (which are a concatenation of MWE retrievals and non-MWE retrievals), we calculate a prototype for each class.

$m(r^+) = \frac{1}{|S_k^+|} \sum_{r' \in S_k^+} m(r')$   (3.7)

$m(r^-) = \frac{1}{|S_k^-|} \sum_{r' \in S_k^-} m(r')$   (3.8)

$y_k = \operatorname{argmax}\big(-\mathrm{dist}(m(q), m(r^+)),\ -\mathrm{dist}(m(q), m(r^-))\big)$   (3.9)

where $m$ represents the encoding function, which could be either a BiGRU or a BiLSTM, $S_k^+$ is the set of MWE retrievals, $S_k^-$ is the set of non-MWE retrievals, and $S_k$ is the support set for the $k$-th query. Formula (3.7) says that, for each class set, an average is computed as the prototype. The predicted class is that of whichever prototype is closer to the query.
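A minimal PyTorch sketch of this prototype scoring is given below; it illustrates (3.7)-(3.9) for a single query with already-encoded support retrievals, and is not the exact training code.

```python
import torch

def prototype_predict(query_enc, mwe_encs, non_mwe_encs):
    """Average each support class, then pick the class of the closer prototype.

    query_enc:    (dim,) BiGRU encoding of the query candidate
    mwe_encs:     (k, dim) encodings of the k MWE retrievals
    non_mwe_encs: (k, dim) encodings of the k non-MWE retrievals
    Returns 1 if the query is predicted to be an MWE, else 0.
    """
    proto_pos = mwe_encs.mean(dim=0)          # (3.7) MWE prototype
    proto_neg = non_mwe_encs.mean(dim=0)      # (3.8) non-MWE prototype
    scores = torch.stack([
        -torch.dist(query_enc, proto_neg),    # negative distance to non-MWE class
        -torch.dist(query_enc, proto_pos),    # negative distance to MWE class
    ])
    return int(torch.argmax(scores))          # (3.9) nearest prototype wins

# Example with random 256-dimensional encodings and a 5-shot support set.
pred = prototype_predict(torch.randn(256), torch.randn(5, 256), torch.randn(5, 256))
print(pred)
```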


3.4.3 Distance metrics

Euclidean-based and radial basis function (RBF) metrics are applied for calculating the similarities. The first two metrics below convert the Euclidean distance into a Euclidean similarity, and the third is the RBF kernel, which has a free parameter $\theta$ related to the variance of the data.

$s_e(q', r'_i) = \dfrac{1}{1 + \mathrm{Euclidean}(q', m(r^+))}$   (3.10)

$s_e(q', r'_i) = -\mathrm{Euclidean}(q', m(r^+))$   (3.11)

$s_e(q', r'_i) = \exp(-\theta \|q' - m(r^+)\|^2)$   (3.12)
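For reference, the three similarity options translate directly into code; a small PyTorch sketch follows, where q is a query encoding, p a class prototype, and the θ value is illustrative.

```python
import torch

def inverse_euclidean(q, p):
    return 1.0 / (1.0 + torch.dist(q, p))                 # (3.10)

def negative_euclidean(q, p):
    return -torch.dist(q, p)                              # (3.11)

def rbf_similarity(q, p, theta=1.0):
    return torch.exp(-theta * torch.sum((q - p) ** 2))    # (3.12)
```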

3.4.4 Weighted scoring with attention mechanism

The sequential encoding scheme of the BiGRU might diminish the effects of individual words. In order to compensate for word-level effects, we add word attention to the model. Specifically, attention is applied to measure the interactions between the query words and the retrieval words.

$s_{attn}(q, r) = \operatorname{argmax}_q(q_{embed} \cdot r_{embed}^T)$   (3.13)

$g(q, r) = \mathrm{softmax}(s_{attn}(q, r))$   (3.14)

Overall, we have introduced a weighted scoring system with attention as follows:

$s(q, r) = w_1 f(q, r) + w_2 g(q, r) + w_3 z(q', r')$   (3.15)

where $s \in \mathbb{R}^{(n,2)}$ is the weighted score, $n$ represents the number of examples, and 2 represents the MWE and non-MWE classes respectively. $f$ is the score from few-shot learning using either Siamese networks or prototypical networks with a selected distance metric, $g$ is the score from the word-level attention, and $z$ is the score from handcrafted features.


Chapter 4

Experiments

4.1 Experiment settings

4.1.1 Dataset

We use the HAMSTER dataset for the experiments. The HAMSTER dataset is a CoNLL-formatted output produced from the STREUSLE corpus [Schneider and Smith, 2015] with some fixes to MWE and dependency parse annotations; a detailed discussion can be found in Chan et al. [2017].

HAMSTER is a relatively small dataset which contains 3812 sentences in total. Within these sentences there are 3896 annotated MWE relations. MWEs most often appear as a combination of 2 words; we do see some MWEs of length 5 or longer, but they are a tiny portion, accounting for 0.19%. The number of gappy MWEs according to the statistics is 536.

The data is split into training, validation and test sets. We randomly sample 70% for training, 15% for validation and 15% for testing. The training set covers 2631 MWEs, the validation set covers 574 MWEs and the test set covers 691 MWEs.

4.2 MWE candidate generator results

The generator is the foundation of everything else in the model, so its quality has to be assured. We evaluate the quality with the coverage rate metric.

4.2.1 MWE Coverage

The coverage rate describes the proportion of real MWE candidates relative to the MWEs in a dataset. Recall that the dataset labels which words are MWEs; a real MWE candidate is a generated MWE candidate that is indeed a ground truth MWE in the dataset. We separately compare the contributions of the lexicon and of spans of contiguous tokens to the coverage rates.

Using the lexicon only, the exact coverage rate is at most 62%, and 73% for an overlap of at least 2 words. We found that the non-covered cases are mostly failures to capture named entities, e.g., Farrell Electric and Dr. Romanick.


Table 4.1: MWE Generator Coverage Rates

Match      | Lexicon | Enriched Lexicon | Spans of contiguous tokens and Lexicon | Spans of contiguous tokens and Enriched Lexicon
Exact      | 0.53    | 0.62             | 0.90                                   | 0.92
≥ 2 words  | 0.61    | 0.73             | 0.997                                  | 0.997

Another observation shows some mismatches due to inconsistencies in defining an MWE. For example, Chan et al. [2017] classify as as possible as an MWE item, while in the lexicon created by Schneider and Smith [2015] this item is not an MWE; the most closely related entry we found is as soon as possible. To reconcile this inconsistency, we attempted to enrich the lexicon: for each of the existing MWE items, we spanned all its variations, but it did not make much difference.

Using the lexicon and spans of contiguous words, the coverage rate is boosted to 90%. According to Table 4.1, an obvious jump is observed when incorporating the spans, rising from 53% to 90%. Employing the enriched lexicon contributes a marginal improvement of around 10% on its own, and this improvement is slashed hugely when applied together with the spans, giving only a 2% boost in coverage. In terms of at-least-2-words matching, the coverage performance is 99.7%. Overall, we use the lexicon and spans of contiguous words as the sources for generating MWE candidates.

We also reconcile inconsistencies in lemmatization. The lexicon is processed with the WordNetLemmatizer, but evidence shows that HAMSTER employed a different lemmatizer. For example, the term on occasions is lemmatized as on occasion using our lemmatization approach, while in the CoNLL data the lemmatization is still on occasions. The number of uncovered words decreases from 371 to 368 out of 3812 if we reconcile the lemmatization approaches; although this is not significant, it will be beneficial when we introduce more external corpus resources, as potential conflicts are reduced.

At this stage, there is not much room for further improvement. What is left, in other words the non-covered cases, are very uncommon MWEs. Examples like pass away in sleep, have surgery, do a job, have problem, be in, give chance, etc., look more like ad hoc combinations of a verb and a noun than idioms that people frequently use. There is a way to further improve the results, generating all permutations of the sentence, but it is too expensive to implement.

4.3 Retrieving results

The quality of retrievals is critical and directly affects the ability of our prediction model. Recall that the training data is obtained by querying the retrieval system: a nearest neighbour search is performed against the knowledge base and the most plausible candidates are fetched. To test and improve the retrieval quality, we focus on tuning the weights between the Euclidean similarity, the Jaccard similarity of the mention, and the Jaccard similarity of the part-of-speech tags of the mention, which together define the similarity relationship between a query and a retrieval.

Without weighing between the semantic, syntactic and string similarities, retrievals are spatially mixed. The top non-MWE candidates and top MWE candidates are not separable in the metric space, as shown in figures 4.1(a) and 4.1(b). This adds noise for the modeling part: a dimension lift with an LSTM or GRU would not be able to generate clear boundaries to distinguish MWE cases from non-MWE cases.

A separability measurement alone is not enough to determine the quality of retrievals. For example, as shown in Table 4.3, the top 2 retrieval matches to the MWE good job are about time and get busy, both semantically and syntactically irrelevant. An additional measurement is added to rescue the situation.

In particular, we take into account word overlaps and part-of-speech overlaps between a query and a retrieval. With this add-on, we observe that top retrievals start to incorporate the overlapping words and part-of-speech information from the query. The catch, however, is that it can deteriorate the discriminator's ability to differentiate MWEs from non-MWEs, as the non-MWE retrievals become more confusing. As can be observed from Table 4.2, top negative matches can be supersets of the query. To resolve this, we assign different weights to the contexts $v_{lc}$, $v_{rc}$ in order to separate out and improve the effect of the mention vector $v_m$.

Special attention is given to gappy MWEs. Intuitively, a gappy MWE should share more commonalities with other gappy MWEs than with non-gappy ones. We adjust the weights to let the gappy features restrict the search, giving more priority to searching for gappy candidates.

The fine-tuning results show big improvements in MWE retrieval, especially for gappy MWEs. In terms of the retrieval parameters, the weights are assigned as follows: 1 to the left and right contexts $v_{lc}$, $v_{rc}$, 1.5 to the mention vector $v_m$, and 2 to the gap vector $v_g$. In terms of the ranking of retrieval results, 0.3 is assigned to the Euclidean distance and 0.7 to the Jaccard similarities. Specifically, for good job we now retrieve good luck (at least one word overlap) and top notch (a semantic match). The gappy idiom keep safe now retrieves keep run, quite down and take time, which are all gappy idioms. Additionally, we see improvements in the negative retrievals as well: the superset issue is largely diminished. At the visualization level, as shown in figure 4.2, the boundaries between the MWE cluster and the non-MWE cluster become much clearer.

4.4 Few-shot learning results

The biggest challenge in training the few-shot model is the class imbalance problem. In total we have 194581 training instances, 98.8% of which are from the non-MWE class and 1.2% from the MWE class. Training on such imbalanced data causes a severe distribution skew toward the over-represented non-MWE class.


Figure 4.1: Visualization of retrievals without weighing. (a) query more than; (b) query accord to; (c) query good job; (d) query keep safe (gappy).


Table 4.2: Top Retrievals for query candidates part a (no weighing)

more than

Rank  MWE Retrievals  non-MWE Retrievals

1 more than more than necessary !

2 the world revolve around more than necessary

3 boy in blue $ NUM3 just to diagnose the

4 boy in blue to rub it

5 bad boy service charge just to rub it

6 rip into more and

7 no harm do the price of a treatment ,

8 to show for service charge just to rub

9 no one for two clean fee (

10 for the sake of for new peddles and proceed

accord to

Rank  MWE Retrievals  non-MWE Retrievals

1 accord to accord to your won statement )

2 bring in accord to your

3 jump - start or after NUM1 pm , bring

4 worst of all the patient ) and

5 work on color me phat ) if

6 get rid of i only bring

7 sum up they even get

8 keep tab on you would most likely

9 here’s to by whom , i do n’t

10 attend to for some reason it will


Table 4.3: Top Retrievals for query candidates part b (no weighing)

good job

1 good luck jana make me feel

2 lucky panda holly be

3 top notch sale men be

4 prominent builder they be

5 hot iron he be

6 fresh design studio it hurt

7 family own and operate it turn out be

8 slice pizza the employee make you feel

9 gold award place smell and the onwer be

10 get busy the food continue to be

keep safe

Rank  MWE Retrievals  non-MWE Retrievals

1 rip off keep those dog safe

2 with god be keep those dog safe

3 out ta dog safe

4 go on those dog safe

5 perfect in every way honest, apart

6 oh - so a block away

7 hang paper a great stay

8 to one’s like so high

9 cut it close manage to keep the veggie

10 mediocre at best be honest , apart


Figure 4.2: Visualization of retrievals with weighing. (a) query more than; (b) query accord to; (c) query good job; (d) query keep safe (gappy).


Table 4.4: Top Retrievals for query candidates part a (with weighing)

more than

Rank  MWE Retrievals  non-MWE Retrievals

1 more than make the drive to see

2 more like year ,

3 less than food be beautiful

4 other than it be, it s

5 rather than try to help their customer

6 rather than a lot to be

7 above all i have never

8 at least ’s go to be anywhere you

9 at least be worth every

10 at lest to be anywhere you

accord to

Rank  MWE Retrievals  non-MWE Retrievals

1 accord to for all

2 get use to meticulous in

3 look forward to and dod not send

4 in large part to thanks for all

5 have to do not send

6 listen to be meticulous in

7 have to for all

8 have to be not

9 listen to ’s in

10 due to , just because you bruise


Table 4.5: Top Retrievals for query candidates part b (with weighing)

good job

Rank  MWE Retrievals  non-MWE Retrievals

1 good luck jana make me feel

2 lucky panda holly be

3 top notch sale men be

4 prominent builder they be

5 hot iron he be

6 fresh design studio it hurt

7 family own and operate it turn out be

8 slice pizza the employee make you feel

9 gold award place smell and the onwer be

10 get busy the food continue to be

keep safe

Rank  MWE Retrievals  non-MWE Retrievals

1 keep run tell on

2 quite down that do

3 take time call wait

4 let know what that

5 call back for it

6 call back know of

7 call back that that

8 turn down never a

9 let know the hell

10 get worse like to


Table 4.6: Experimental results of few-shot models

5-few-shot Prototype Networks           | Num of MWE Recognized | Num of MWE Predicted | F1
BiGRU without FW                        | 0 out of 536          | 0                    | 0.0
FS on mention                           | 80 out of 536         | 1979                 | 0.07
FS on mention & context (concat)        | 240 out of 536        | 14395                | 0.03
FS on mention & context (weigh)         | 113 out of 536        | 5965                 | 0.03
FS on mention & context (weigh & attn)  | 372 out of 536        | 3684                 | 0.18

For training, we use batches drawn by sampling. The sampling ratio is set to 3:1; that is, 75% of the examples in a batch are drawn from the non-MWE class and 25% from the MWE class. Sampling with replacement is enabled, as our experiments showed better results with replacement.
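As an illustration, the following minimal Python sketch shows this sampling scheme; the pool structures, batch size and helper name are placeholders for exposition, not the implementation used in our experiments.

import random

def sample_batch(mwe_pool, non_mwe_pool, batch_size=32, mwe_ratio=0.25):
    """Draw one training batch with a fixed 3:1 class ratio.

    75% of the examples come from the non-MWE pool and 25% from the
    MWE pool; random.choices samples with replacement in both cases.
    The elements of the pools can be any candidate representation.
    """
    n_mwe = int(batch_size * mwe_ratio)
    n_non = batch_size - n_mwe
    batch = random.choices(mwe_pool, k=n_mwe) + random.choices(non_mwe_pool, k=n_non)
    random.shuffle(batch)
    return batch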

The few-shot size is set to 5. We also tested a few-shot size of 10, which revealed little difference. In terms of the choice of network, prototypical networks turn out to outperform Siamese networks. Therefore, our main experiments are conducted with 5-shot prototypical networks.
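For reference, the sketch below shows the scoring step of a prototypical network (Snell et al., 2017), assuming the query and support examples are already embedded as NumPy vectors: each class prototype is the mean of its support embeddings, and the query is scored by a softmax over negative squared Euclidean distances. This is an illustrative reconstruction, not our exact code.

import numpy as np

def prototypical_scores(query_emb, support_embs, support_labels):
    """Class probabilities for a query under a prototypical network.

    support_embs: list of d-dimensional vectors (e.g. the 5 support
    examples per class in the 5-shot setting); support_labels: their
    class labels (MWE / non-MWE).
    """
    classes = sorted(set(support_labels))
    protos = np.stack([
        np.mean([e for e, y in zip(support_embs, support_labels) if y == c], axis=0)
        for c in classes
    ])
    dists = np.sum((protos - query_emb) ** 2, axis=1)  # squared Euclidean distance
    logits = -dists
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return dict(zip(classes, probs))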

We compared the results of few-shot learning to those of a classifier without few-shot learning. A two-layer BiGRU was run directly on candidate mentions and contexts to make MWE predictions. The results (first row of Table 4.6) show that, from the candidate alone, the classifier fails to learn to recognize MWEs at all.
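A minimal PyTorch sketch of such a baseline is given below; the embedding and hidden sizes, and the mean pooling over tokens, are assumptions chosen for illustration rather than the configuration behind Table 4.6.

import torch
import torch.nn as nn

class BiGRUBaseline(nn.Module):
    """Two-layer bidirectional GRU run directly over candidate tokens."""

    def __init__(self, vocab_size, emb_dim=100, hidden_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden_dim, num_layers=2,
                          bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden_dim, 2)  # MWE / non-MWE logits

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer ids of mention (and context) tokens
        states, _ = self.gru(self.emb(token_ids))  # (batch, seq_len, 2*hidden_dim)
        return self.out(states.mean(dim=1))        # mean-pool over tokens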

Within the few-shot learning setup, the candidate mention turns out to be the most important driver for MWE recognition. The effect of context is tested in two ways. First, we concatenate the context BiGRU output with the mention BiGRU output; the number of MWEs recognized triples, but at the cost of roughly ten times more MWEs being predicted. Second, instead of concatenating, we take a weighted average of the mention and context outputs for scoring; this improves the situation, as fewer MWEs are predicted. In general, we do not expect context to contribute as much as the mention: without tuning, context is likely to introduce noise, since the difference between non-MWE contexts and MWE contexts can be very subtle. Finally, we added attention to capture word-level interactions between query tokens and retrieval tokens. The attention turns out to have a very positive effect, increasing the number of MWEs recognized to 372 without a large jump in the number of MWEs predicted.
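The sketch below illustrates both ideas: a weighted combination of the mention and context BiGRU outputs, and a word-level dot-product attention from query tokens to retrieval tokens. The weight value and the use of plain dot-product attention are assumptions made for exposition, not necessarily the exact formulation used in the experiments.

import torch
import torch.nn.functional as F

def combine_mention_context(mention_vec, context_vec, alpha=0.8):
    # Weighted combination of mention and context representations;
    # alpha close to 1 reflects the finding that the mention dominates.
    return alpha * mention_vec + (1.0 - alpha) * context_vec

def token_attention(query_states, retrieval_states):
    # query_states: (q_len, d) BiGRU states of the query candidate.
    # retrieval_states: (r_len, d) BiGRU states of a retrieved example.
    # Each query token attends over the retrieval tokens; the attended
    # vectors can then be matched against the query states when scoring
    # the query-retrieval pair.
    scores = query_states @ retrieval_states.t()  # (q_len, r_len)
    weights = F.softmax(scores, dim=-1)
    return weights @ retrieval_states             # (q_len, d)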


Chapter 5

Conclusion

This paper explored semi-supervised few-shot learning models for MWE recognition. We show that few-shot models can strongly leverage the distributions learned and apply them to similar tasks. The MWE recognition rate is almost 70%; however, the precision is only about 10%. In other words, we make too many wrong predictions, mostly due to sampling bias. One fix is to reduce the sampling rate on the MWE class.

Additionally, we found that, for MWE recognition, the mention tokens are more important than the context tokens, which suggests that MWE recognition is largely about matching the syntactic and semantic properties of mentions. By adding interactions between query tokens and retrieval tokens, we found that the recognition rate improved, which further confirms the importance of mentions for MWE recognition.

5.1 Future Work

A more sophisticated knowledge base system could be designed. The knowledge base system underpins the few-shot model. Currently, our knowledge base system lacks the ability to capture complicated semantic and syntactic structures; as a result, the k-nearest-neighbour search can deviate from the true distribution.
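As an illustration of the retrieval step, the sketch below performs a cosine-similarity k-nearest-neighbour lookup over stored phrase embeddings; the flat matrix layout and the cosine metric are simplifying assumptions rather than a description of the knowledge base's actual index.

import numpy as np

def knn_retrieve(query_vec, kb_vecs, kb_items, k=10):
    # kb_vecs: (N, d) matrix of stored phrase embeddings;
    # kb_items: the N corresponding phrases.
    q = query_vec / (np.linalg.norm(query_vec) + 1e-8)
    kb = kb_vecs / (np.linalg.norm(kb_vecs, axis=1, keepdims=True) + 1e-8)
    sims = kb @ q                      # cosine similarity of each entry to the query
    top = np.argsort(-sims)[:k]        # indices of the k most similar entries
    return [(kb_items[i], float(sims[i])) for i in top]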

Semi-Markov segmentation over sentences could also be introduced. The current model is a purely local classifier that ignores global information such as the segmentation of the sentence. Rather than producing binary predictions for each MWE candidate, we could define the loss function to match the correct segmentation of a sentence.
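As a rough sketch of what segment-level decoding could look like, the dynamic program below finds the highest-scoring segmentation of a sentence given a hypothetical segment_score(i, j) function; both that function and the maximum segment length are illustrative assumptions, not part of the current model.

def semimarkov_viterbi(n_tokens, segment_score, max_len=6):
    # best[j] is the score of the best segmentation of tokens [0, j);
    # back[j] records where that segmentation's last segment starts.
    best = [float("-inf")] * (n_tokens + 1)
    back = [0] * (n_tokens + 1)
    best[0] = 0.0
    for j in range(1, n_tokens + 1):
        for i in range(max(0, j - max_len), j):
            score = best[i] + segment_score(i, j)
            if score > best[j]:
                best[j], back[j] = score, i
    # Walk the back pointers to recover the segment boundaries.
    segments, j = [], n_tokens
    while j > 0:
        segments.append((back[j], j))
        j = back[j]
    return best[n_tokens], list(reversed(segments))

A training loss could then compare the score of the predicted segmentation against that of the gold segmentation, rather than scoring each candidate in isolation.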


Bibliography

Baldwin, T. and Kim, S. N., 2010. Multiword expressions. In Handbook of Natural Language Processing, 267–292. Chapman and Hall/CRC. (cited on page 2)

Bromley, J.; Guyon, I.; LeCun, Y.; Säckinger, E.; and Shah, R., 1993. Signature verification using a "siamese" time delay neural network. In Proceedings of the 6th International Conference on Neural Information Processing Systems, NIPS'93 (Denver, Colorado, 1993), 737–744. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA. http://dl.acm.org/citation.cfm?id=2987189.2987282. (cited on page 10)

Calzolari, N.; Fillmore, C. J.; Grishman, R.; Ide, N.; Lenci, R.; Macleod, C.; and Zampolli, A., 2002. Towards best practice for multiword expressions in computational lexicons. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002), Las Palmas, Canary Islands, 40. (cited on page 1)

Chan, K.; Brooke, J.; and Baldwin, T., 2017. Semi-automated resolution of inconsistency for a harmonized multiword expression and dependency parse annotation. In MWE@EACL, 187–193. Association for Computational Linguistics. (cited on pages 19 and 20)

Chen, Q.; Hu, Q.; Huang, J. X.; He, L.; and An, W., 2017. Enhancing recurrent neural networks with positional attention for question answering. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '17 (Shinjuku, Tokyo, Japan, 2017), 993–996. ACM, New York, NY, USA. doi:10.1145/3077136.3080699. http://doi.acm.org/10.1145/3077136.3080699. (cited on page 9)

Cho, K.; van Merrienboer, B.; Gülçehre, Ç.; Bougares, F.; Schwenk, H.; and Bengio, Y., 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR, abs/1406.1078 (2014). http://arxiv.org/abs/1406.1078. (cited on page 7)

Copestake, A. A.; Lambeau, F.; Villavicencio, A.; Bond, F.; Baldwin, T.; Sag, I. A.; and Flickinger, D., 2002. Multiword expressions: linguistic precision and reusability. In LREC. (cited on page 1)

Fei-Fei, L.; Fergus, R.; and Perona, P., 2006. One-shot learning of object categories. IEEE Trans. Pattern Anal. Mach. Intell., 28, 4 (Apr. 2006), 594–611. doi:10.1109/TPAMI.2006.79. https://doi.org/10.1109/TPAMI.2006.79. (cited on page 9)


Frome, A.; Corrado, G.; Shlens, J.; Bengio, S.; Dean, J.; Ranzato, M.; and Mikolov, T., 2013. Devise: A deep visual-semantic embedding model. In Neural Information Processing Systems (NIPS). (cited on pages 6 and 9)

Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Comput., 9, 8 (Nov. 1997), 1735–1780. doi:10.1162/neco.1997.9.8.1735. http://dx.doi.org/10.1162/neco.1997.9.8.1735. (cited on page 6)

Hopfield, J. J., 1988. Neurocomputing: Foundations of research. chap. Neural Networks and Physical Systems with Emergent Collective Computational Abilities, 457–464. MIT Press, Cambridge, MA, USA. ISBN 0-262-01097-6. http://dl.acm.org/citation.cfm?id=65669.104422. (cited on page 6)

Jurafsky, D. and Martin, J. H., 2009. Speech and Language Processing (2nd Edition). Prentice-Hall, Inc., Upper Saddle River, NJ, USA. ISBN 0131873210. (cited on page 5)

Koch, G.; Zemel, R.; and Salakhutdinov, R., 2015. Siamese neural networks for one-shot image recognition. (cited on pages 6 and 10)

Li, H.; Min, M. R.; Ge, Y.; and Kadav, A., 2016. A context-aware attention network for interactive question answering. CoRR, abs/1612.07411 (2016). http://arxiv.org/abs/1612.07411. (cited on pages 6 and 9)

Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J., 2013. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781 (2013). http://arxiv.org/abs/1301.3781. (cited on page 5)

Moschitti, A.; Pang, B.; and Daelemans, W. (Eds.), 2014. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL. ACL. ISBN 978-1-937284-96-1. (cited on page 1)

Nadeau, D. and Sekine, S., 2007. A survey of named entity recognition and classification. Linguisticae Investigationes, 30, 1 (January 2007), 3–26. http://www.ingentaconnect.com/content/jbp/li/2007/00000030/00000001/art00002. Publisher: John Benjamins Publishing Company. (cited on page 2)

Pennington, J.; Socher, R.; and Manning, C. D., 2014. Glove: Global vectors for word representation. In EMNLP, vol. 14, 1532–1543. (cited on pages 5 and 6)

Schneider, N.; Danchik, E.; Dyer, C.; and Smith, N. A., 2014. Discriminative lexical semantic segmentation with gaps: Running the MWE gamut. TACL, 2 (2014), 193–206. (cited on pages 1 and 2)

Schneider, N. and Smith, N. A., 2015. A corpus and model integrating multiword expressions and supersenses. In HLT-NAACL, 1537–1547. The Association for Computational Linguistics. http://dblp.uni-trier.de/db/conf/naacl/naacl2015.html. (cited on pages 13, 19, and 20)


Snell, J.; Swersky, K.; and Zemel, R. S., 2017. Prototypical networks for few-shot learning. CoRR, abs/1703.05175 (2017). http://arxiv.org/abs/1703.05175. (cited on page 10)

Sutskever, I.; Vinyals, O.; and Le, Q. V., 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27 (Eds. Z. Ghahramani; M. Welling; C. Cortes; N. D. Lawrence; and K. Q. Weinberger), 3104–3112. Curran Associates, Inc. http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf. (cited on page 7)

Synced, 2017. A brief overview of attention mechanism. SyncedReview, Medium. https://medium.com/syncedreview/a-brief-overview-of-attention-mechanism-13c578ba9129. (cited on pages ix and 8)

Vinyals, O.; Blundell, C.; Lillicrap, T. P.; Kavukcuoglu, K.; and Wierstra, D., 2016. Matching networks for one shot learning. CoRR, abs/1606.04080 (2016). http://arxiv.org/abs/1606.04080. (cited on page 10)