Deep Learning for Category-Free Grounded Language Acquisition

Nisha Pillai

[email protected]

Francis Ferraro
University of Maryland, Baltimore County, Baltimore, Maryland

[email protected]

Cynthia Matuszek

[email protected]

Abstract

We propose a learning system in which language is grounded in visual percepts without pre-defined category constraints. We present a unified generative method to acquire a shared semantic/visual embedding that enables a more general language grounding acquisition system. We evaluate the efficacy of this learning by predicting the semantics of ground truth objects and comparing the performance against both a predefined category classifier and a simple logistic regression classifier. Our preliminary results suggest that this generative approach shows promise in language grounding without pre-specifying visual categories such as color and shape.

1 Introduction

Grounded language acquisition, in which linguistic constructs are paired with visual constructs to learn the perceived world, has been of notable interest in the rapidly growing field of robotics. Joint modeling of language and vision (Matuszek et al., 2012; Pillai and Matuszek, 2018), where natural language is paired with sensor information to train visual classifiers, allows learning when both the language space and the perceptual space are novel; that is, such systems are able to learn novel words describing objects, attributes, or actions that are not preexisting in the formal representation language.

While this method learns language groundings from visual features for multiple attribute classes, in most work in this area, classifiers are still trained for specific domains, such as object type or color. However, modeling semantics specific to particular attribute types still constrains language acquisition. Words denote visual classifiers, which are then trained with visual features extracted for a fixed set of semantic categories. The approach is limited to learning predefined visual categories such as color, shape, and object words.

In this work, we present general visual classifiers that learn the language without relying on predefined visual categories. Our method generalizes language acquisition by using novel, generally applicable visual percepts from natural descriptions of real-world objects. Instead of creating classifiers for every high-level category, such as color, shape, and object, we use a combination of features in order to create a general classifier for terms that we observe in language used to describe real-world objects. We use deep generative models to obtain a representative unified visual embedding from the combination of visual features to move away from category-specific language learning constraints (see Figure 1).

Our primary contribution is a proposal to generalize language acquisition by moving away from predefined categories to category-free grounded language acquisition. In order to compare to existing work, we evaluate against attribute-specific color, shape, and object words; however, in contrast to most existing work, the system we present does not rely on these as existing categories. We compare against systems with and without predefined categories.

A high-level view of our approach can be formulated as follows: 1) Join all observed visual features. 2) Use the latent feature discrimination method (Kingma et al., 2014), based on an unsupervised neural variational autoencoder, to extract a meaningful, representative latent embedding from the cumulative feature set. 3) Learn a general visual classifier using the latent embedding created from the cumulative feature set (see Figure 1).


Figure 1: Design diagram of the unified discriminative method. For every object, all available features are extracted and passed through a latent feature generative model to generate a representative embedding of the feature vector. The extracted embedding is then associated with a visual classifier denoted by the corresponding language term.

These approaches are trained and tested on an RGB-D image dataset (Pillai and Matuszek, 2018) created using a Kinect2 sensor and crowdsourced descriptions of the images obtained from Mechanical Turk. Initial experiments show promising results on generalizing language grounding for terms. Our unsupervised neural variational autoencoder approach performs comparably to the predefined category classifier. This suggests that this work has promise in learning more generalized language groundings.

2 Related Work

Notable research exists in the field of grounded language acquisition (Harnad, 1990), in which linguistic constructs are acquired through interacting with the perceptual world. Popular language-vision grounding applications include generating descriptions from images or videos (Vinyals et al., 2015), grounding commands, directions, or action words (Artzi and Zettlemoyer, 2013; Anderson et al., 2018; Misra et al., 2016; Al-Omari et al., 2017; Chai et al., 2018), and visual question answering (VQA) (Antol et al., 2015; Yang et al., 2016; Selvaraju et al., 2017).

Our research is based on Pillai and Matuszek (2018), in which language is learned by jointly connecting it with visual characteristics of real-world objects. We learn visual attributes (Nyga et al., 2017) from a small dataset, which is similar to one-shot learning (Hariharan and Girshick, 2017; Tommasi et al., 2010; Vinyals et al., 2016), which learns from a few samples, and zero-shot learning (Lampert et al., 2014; Elhoseiny et al., 2013), which learns from no samples.

Our experiments are designed to learn color, shape, and object attributes (Berg et al., 2010) from the descriptions of real-world objects without specifying the category of the attributes, whereas Paul et al. (2016) and Tellex et al. (2011) propose to ground spatial concepts, Brawer et al. (2018) learn speech jointly with context, and Yu et al. (2016, 2017) ground natural language referring expressions for objects in images. Learning visual attributes such as color and shape is critical in robot object grasping (Rao et al., 2018; Levine et al., 2018) and manipulation tasks. There are studies that ground language by partitioning feature space by context (Thomason et al., 2016), whereas we intend to learn words moving beyond the attribute types without manually separating them.

Deep learning has been successfully applied to image classification (Krizhevsky et al., 2012), object detection (Hatori et al., 2018; Ren et al., 2015), video classification (Karpathy et al., 2014), image-to-image translation (Isola et al., 2017), linking motion with language (Plappert et al., 2018), and integrating appearance, motion, gaze, and spatial-temporal context (Balajee Vasudevan et al., 2018). However, these approaches require large datasets. Our objective is to achieve strong prediction performance with a small but complex dataset.

Our architecture predicts visual percepts associated with the language by training on the representative latent probability distribution generated from cumulative visual features using a deep generative model. We use a deep generative variational autoencoder with one hidden layer to generate latent embeddings from our visual features. Autoencoding has been applied to a number of tasks, including image-to-image translation (Wang et al., 2018), sign language translation (Cihan Camgoz et al., 2018), 3D shape analysis (Tan et al., 2018), hand pose estimation (Wan et al., 2017), sentence annotations (Ahn et al., 2018), denoising (Mao et al., 2016), and scene understanding (Cadena et al., 2016).

Silberer and Lapata (2014) learn a stacked autoencoder that grounds semantic representations of words by mapping language and vision into an embedding space. Although our objective is similar to theirs, they train stacked autoencoders for every modality, treating them separately and fusing them at the last layer to obtain a meaningful representation, whereas we combine all available raw visual features before feeding them into a deep network with no differentiation among the attribute types. Rohrbach et al. (2016) employ a deep network with a Long Short-Term Memory network (LSTM) to ground textual phrases in images with no, a few, or all grounding annotations available, but in this work, we intend to ground the semantics of words without specifying their attribute types, using the annotations from natural descriptions from Amazon Mechanical Turk users.

3 Background on Latent Feature Discriminative Model

Our research uses a deep generative variational autoencoder model (Kingma et al., 2014) to generate latent embeddings for training visual classifiers. A variational autoencoder consists of an encoder, a decoder, and a loss function. The encoder is a neural network which translates input data X into latent (hidden) variables Z. We can view it as P(Z|X). We use a variational distribution qθ(z|x) to approximate P(Z|X), and this qθ(z|x) is viewed as the encoder.

Figure 2: Sample RGB images in the dataset, as taken with a Kinect2 camera and shown to annotators (Pillai and Matuszek, 2018). In this visually varied dataset, shape and object classification are nontrivial.

The decoder is also a neural network, which attempts to reconstruct X from the latent variables Z. We model it as Pφ(X|Z), where φ are the weight parameters of the decoder.

Here our objective is to learn useful and meaningful latent representations Z from the input data (the inference or encoder network) to utilize in our classification. We approximate the posterior probability using a Gaussian function qθ(z|x), given by:

qθ(z|x) = N(z | µθ(x), diag(σ²θ(x)))

where σ²(x) is a vector of variances, µ(x) is a vector of means, and both µ(x) and σ²(x) are represented as multilayer perceptrons (MLPs).

The quality of the latent representation is enhanced using the loss function L, which is calculated as the sum of the reconstruction error (the expectation of the negative log-likelihood) and the KL divergence between the approximation function and the prior distribution, KL(q(z|x) || p(z)). We can reduce the loss by minimizing this KL divergence. The overall loss function is then:

L = −E[log p(x|z)] + KL(q(z|x)||p(z))
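For concreteness, this loss can be written as a short function. The sketch below is a minimal PyTorch version, assuming a diagonal-Gaussian encoder (outputting µ and log σ²) and a Gaussian decoder scored with mean squared error; the paper does not specify the decoder likelihood, so that choice is an assumption.

import torch
import torch.nn.functional as F

def vae_loss(x_recon, x, mu, log_var):
    # Reconstruction error: expected negative log-likelihood of x under the decoder
    # (mean squared error stands in for a Gaussian decoder likelihood).
    recon = F.mse_loss(x_recon, x, reduction="sum")
    # KL(q(z|x) || p(z)) in closed form for a diagonal Gaussian posterior
    # and a standard normal prior.
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl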

4 Approach

This work is similar to approaches in which language grounding is treated as an association of language tokens (words) with the visual percepts extracted from real-world objects. In previous work (Pillai and Matuszek, 2018), we obtained descriptions of objects, tokenized them, created one visual classifier per category of attribute, and used them to learn real-world objects. The ‘correct’ attribute type was assumed to be the classifier with the best fit to the training data. Here, instead of building classifiers within specific categories, we create and learn a single visual classifier per term from a single general set of features (e.g., instead of learning separate possible classifiers such as both “cube-as-shape” and “cube-as-object” classifiers, we learn a single “cube” classifier).

This involves three steps. First, concatenate all the visual features into a single vector. Second, generate representative and meaningful embeddings from these cumulative feature vectors using a latent discriminative variational autoencoder. Last, learn one visual classifier per token, associating it with the visual embeddings generated using the learned weights of the variational autoencoder (see section 4.2 for details).
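A minimal sketch of these three steps, assuming 3-dimensional averaged-RGB color features and 700-dimensional kernel-descriptor shape features (section 4.1), with encode_with_vae standing in for the trained encoder of section 3; the names and the scikit-learn classifier are illustrative, not the paper's exact implementation.

import numpy as np
from sklearn.linear_model import LogisticRegression

def build_token_classifier(color_feats, shape_feats, labels, encode_with_vae):
    # Step 1: concatenate all observed visual features into one cumulative vector per object.
    X = np.hstack([color_feats, shape_feats])   # shape: (n_objects, 3 + 700)
    # Step 2: map the cumulative features to a latent embedding with the trained encoder.
    Z = encode_with_vae(X)                      # shape: (n_objects, latent_dim)
    # Step 3: train one binary classifier for this token on positive/negative embeddings.
    return LogisticRegression(max_iter=1000).fit(Z, labels)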

4.1 Data Corpus

We use a dataset of images and descriptions that contains color and depth images of 72 real-world objects, divided into 18 classes (Figure 2). Objects include food objects such as ‘potato,’ ‘tomato,’ and ‘corn,’ and children's toys in several shapes such as ‘cube’ and ‘triangle.’ There are an average of 4.5 images collected for every object. The language dataset includes 6000 descriptions (see Figure 3) collected from Amazon Mechanical Turk (approximately 85 descriptions per object).

We use a unigram language model in learning visual classifiers. Visual classifiers are associated with unique words extracted from descriptions. These words are produced by tokenizing the descriptions, filtering stopwords, applying stemming and lemmatization processes, and filtering for domain relevance using tf-idf. For example, for a tomato instance, the description could be “This is an image of red tomato,” giving “image,” “red,” and “tomato” as focal tokens.

A binary classifier is then trained for each of these terms, in which words used in the descriptions form (positive) binary labels. Tf-idf is used to extract meaningful tokens from this set. Empirically, this helps identify domain-meaningful words for which to learn classifiers. Using this metric, the score of a word decreases with the number of documents it appears in, and increases with the number of times it appears in a document. Terms such as “image,” “picture,” and “object” appear in an overwhelming number of documents, whereas relevant terms such as “carrot” or “banana” appear in comparatively few documents.
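The token-selection step can be sketched as follows, assuming scikit-learn's TfidfVectorizer; the built-in English stopword list stands in for the paper's stopword, stemming, and lemmatization pipeline, and the threshold is illustrative rather than the value used in our experiments.

from sklearn.feature_extraction.text import TfidfVectorizer

def select_focal_tokens(descriptions, min_score=0.1):
    # Score each unigram with tf-idf over the crowdsourced descriptions.
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(descriptions)
    # Keep tokens whose best tf-idf score across descriptions clears the threshold.
    scores = tfidf.max(axis=0).toarray().ravel()
    vocab = vectorizer.get_feature_names_out()
    return sorted(w for w, s in zip(vocab, scores) if s >= min_score)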

Figure 3: Object samples and language descriptions collected from Amazon Mechanical Turk annotators. The typographical errors are part of the noisy descriptions obtained from Mechanical Turk.

In order to train classifiers, two categories of visual features are used: COLOR (averaged RGB values extracted from the color image of the object) and SHAPE (kernel descriptors) (Bo et al., 2011), extracted from the color and depth images respectively. Kernel descriptors model size, 3D shape, and depth edges from the RGB-D depth channel, and are efficient in shape and object classification.
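As an illustration of the simpler of the two feature types, the snippet below computes the averaged-RGB COLOR feature for a segmented object; it assumes the object pixels are available as a NumPy image with a boolean mask, and it is a sketch rather than the exact extraction code used for the dataset.

import numpy as np

def color_feature(rgb_image, mask):
    # rgb_image: (H, W, 3) uint8 array; mask: (H, W) boolean array marking object pixels.
    pixels = rgb_image[mask].astype(np.float64)   # (n_object_pixels, 3)
    # Average the R, G, and B channels over the object region: a 3-dimensional COLOR feature.
    return pixels.mean(axis=0)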

4.2 Unified Discriminative Learning Method

Our objective is to associate the observed natural language W with the set of real-world objects O. To learn this grounded association, we create a generalized visual feature embedding from the features extracted from the object instances and use it to learn a general classifier. The components of the unified discriminative model are explained as follows. Language refers to the Amazon Mechanical Turk description used for the object in the figure. The visual groundings section explains the concatenation of visual features and the extraction of the latent embedding using the latent discriminative model. Finally, the category-free visual classifier describes the association between language and visual characteristics. These components are shown in Figure 4.

Language. We use a unigram language model that defines P(W|S), which tokenizes and filters the sentences S into words W. We learn the meaningful and relevant words, w ∈ W, using the tf-idf statistical measure described above. Tf-idf is a well-known metric that measures how relevant a word is in a particular document (in this case, a written description).


Figure 4: Diagram of approaches. The predefined category classifier (a current approach in the literature) learns one visual classifier for every visual category, accepting input from the respective visual characteristics (blue box). Category-free logistic regression is a baseline that learns a single classifier for each word, accepting cumulative visual features as input (green box). On the far right (yellow box), the unified discriminative method described in this paper generates visual classifiers for each word, accepting the latent visual embedding generated from cumulative visual features as input.

Category-free Visual Classifiers. Given a learned embedding Z for word w, we learn a binary classifier Pc(y = 1|Z) for the positive items and Pc(y = 0|Z) for the negative items (see section 5). Z is defined as the visual grounding generated from the cumulative features. In our approach, instead of creating a “red-as-color” classifier by training on color features, we create a unified general classifier for the word “red” by associating it with a generalized probability distribution made from the visual features extracted from the perceived objects.

Visual Groundings. As described above, we are attempting to form a unified probability distribution P(z|o) of latent variables z out of all feature variables and use it as the general embedding needed to learn the visual classifier. We define the cumulative feature input X as ⟨f1, f2, ..., fn⟩, where fi is a type of visual feature extracted from the object o. Here our color features are of dimension 3 and our shape features are of dimension 700, reflecting the comparative complexity of shape and the simple colors of the objects in the dataset. In this research, our challenge is to find an efficient representation of our feature space P(Z|X) for our grounded learning tasks.

We employ a latent feature discriminative variational autoencoder (Kingma et al., 2014) to construct a representative, meaningful low-dimensional embedding, accepting the cumulative feature vector X as input. Employing an encoder function represented by a neural network (see section 3), we learn the encoder weights of the unified discriminative model (UDM) by applying all the training data as input (for more detail on the architecture, see section 5).

The above-mentioned components explain the main parts of our grounded language model. In our framework, for every classifier c, we extract a latent embedding, which is the vision grounding Z, using the positive and negative object instance features. Vectors of the mean µ(x) and the variance σ²(x) extracted from the generator network define the latent embedding Z. We use logistic regression on positive and negative groundings to train the binary classifier language learning model for every token.

5 Experimental Results

We use four-fold cross-validation for our experiments. We have image data X = {X1, X2, ..., Xn}, where Xi is a cumulative vector constructed from features of all categories. In this framework, it is the concatenation of the shape and color features.

We selected positive object instances for every meaningful token selected using tf-idf. We consider an object instance a ‘positive’ example if the object is described by that token in any description. If a token is observed for the first time, we create a new visual classifier named after that token; when new objects are observed with descriptions that include this token, we update the visual classifier with the new positive instances.
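A minimal sketch of this bookkeeping, assuming each object instance carries its crowdsourced descriptions and reusing a tokenizer like the one in section 4.1; the data layout is illustrative.

from collections import defaultdict

def collect_positive_instances(objects, focal_tokens, tokenize):
    # Map each focal token to every object instance whose descriptions mention it.
    positives = defaultdict(list)
    focal = set(focal_tokens)
    for obj in objects:
        mentioned = {tok for desc in obj["descriptions"] for tok in tokenize(desc)}
        for token in focal & mentioned:
            # The first occurrence of a token starts a new classifier; later ones add positives.
            positives[token].append(obj)
    return positives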

In order to find negative training examples, we utilize semantic similarity measures over the descriptions (Pillai et al., 2018; Pillai and Matuszek, 2018). For this purpose, we treated the concatenation of all the descriptions associated with one object as a document, and converted these ‘descriptive documents’ into vector space using the Distributed Memory Model of Paragraph Vectors (PV-DM) (Mikolov et al., 2013a,b). As semantically similar documents have similar representations in vector space, we used the cosine similarity metric to find the most distant paragraph vectors, and selected the respective object instances as negative examples for our token.
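A sketch of this negative-selection step, assuming gensim's Doc2Vec as the PV-DM implementation; the vector size, epoch count, and number of negatives are illustrative rather than the values used in our experiments.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.metrics.pairwise import cosine_similarity

def select_negative_objects(object_docs, positive_ids, n_negatives=5):
    # object_docs maps object id -> concatenation of all its descriptions (one 'descriptive document').
    tagged = [TaggedDocument(words=text.lower().split(), tags=[oid])
              for oid, text in object_docs.items()]
    # Train PV-DM (dm=1) paragraph vectors over the descriptive documents.
    model = Doc2Vec(tagged, dm=1, vector_size=50, min_count=1, epochs=40)
    pos_vecs = [model.dv[oid] for oid in positive_ids]
    # Rank the remaining objects by mean cosine similarity to the positives; keep the most distant.
    candidates = [oid for oid in object_docs if oid not in positive_ids]
    sims = {oid: cosine_similarity([model.dv[oid]], pos_vecs).mean() for oid in candidates}
    return sorted(candidates, key=lambda oid: sims[oid])[:n_negatives]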

The latent feature discriminative method is a variational autoencoder (see section 3) with a single hidden layer. We experimented with hidden layer sizes ranging from 100 to 600 units, and 500 performed the best. The rectified linear unit (ReLU), a good non-linear approximator, is applied as the activation function between the layers. The output layer of the encoder provides the latent embedding for our classification. We experimented with latent embedding lengths ranging from 12 to 100. We learned the weights needed for extracting the latent embedding representation by applying all training data to the latent feature variational autoencoder.
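A minimal PyTorch sketch of an encoder with this configuration (one 500-unit hidden layer, ReLU, latent dimension 50, and a 703-dimensional input from the concatenated 3-dimensional color and 700-dimensional shape features); the exact layer arrangement is an assumption, and training would pair this encoder with a mirror-image decoder under the loss from section 3.

import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, input_dim=703, hidden_dim=500, latent_dim=50):
        super().__init__()
        # Single hidden layer with ReLU activation.
        self.hidden = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        # Separate linear heads produce the Gaussian parameters of q(z|x).
        self.mu = nn.Linear(hidden_dim, latent_dim)
        self.log_var = nn.Linear(hidden_dim, latent_dim)

    def forward(self, x):
        h = self.hidden(x)
        return self.mu(h), self.log_var(h)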

Language acquisition success is measured as a function of the prediction performance of the learned visual classifiers. Our unified discriminative method is compared with two other approaches. First, in the ‘predefined’ category classifier, visual classifiers are trained for every token and feature category, as in previous work: for example, “arch” is trained as “arch-as-color,” “arch-as-shape,” and “arch-as-object” classifiers.

In ‘category-free logistic regression,’ logistic regression classifiers are trained for every token with the concatenated feature set. Here, “arch” is trained as an “arch-classifier,” accepting a concatenated set of all features as its input (see Figure 4 for the high-level design diagram).

Table 1: Overall summary of the F1-score distribution comparisons plotted in Figure 5. The minimum, mean, and maximum of our method (latent dimension 50) are higher than all other approaches. Our method shows promising results in learning language beyond category constraints.

Figure 5: Comparison of the F1-score distribution over all tokens of the unified discriminative method vs. the baselines (leftmost two bars). The minimum, mean, and maximum F1-score performance of UDM using 50 latent dimensions is both high and consistent compared to both baselines and the other latent dimension variants.

In addition to the comparison with the two baselines, we conducted experiments with varying latent dimension sizes to ensure the best quality prediction. We experimented with the unified discriminative method using latent dimensions of 12, 50, and 100 to analyze the performance variation in grounded language prediction.

For all analyses, we used 72 objects and 6000 descriptions. Four-fold cross-validation is applied to all methods. We used the image and language dataset for evaluating all the methods and trained visual classifiers on “color,” “shape,” and “object” words. For every learned classifier, we selected 3–4 positive and 4–6 negative images from the test set; this number varied with semantic distance. If the predicted probability for a test image is above 0.5, it is considered a positive result. We calculated the averaged F1-score over 10 experiments for every word.
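The per-word evaluation can be sketched as below, assuming a scikit-learn-style classifier with predict_proba and a list of held-out test splits for one word; the 0.5 threshold follows the text, while the rest of the scaffolding is illustrative.

from sklearn.metrics import f1_score

def averaged_f1(classifier, test_splits, threshold=0.5):
    # Each split pairs latent embeddings with binary ground-truth labels for one word.
    scores = []
    for Z_test, y_true in test_splits:
        probs = classifier.predict_proba(Z_test)[:, 1]
        y_pred = (probs > threshold).astype(int)   # probability above 0.5 counts as positive
        scores.append(f1_score(y_true, y_pred))
    # Average the F1-score over the repeated trials for this word.
    return sum(scores) / len(scores)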

Language Prediction Probabilities. Table 2 shows the association between visual classifiers and the ground truth after learning the language and vision components through our unified discriminative method. Color classifiers show promising results. The “yellow” classifier is able to predict “yellow” ground truths successfully, as well as “lemon.” In our dataset, the variation of “yellow” objects ranged from bananas to corn, while “purple” objects were limited to eggplant, plum, and cabbage (a wider color range, since eggplants are frequently nearly black).

Table 2: Prediction probabilities of selected visual classifiers (x-axis) against ground truth objects (y-axis) selected from a held-out test set. This confusion matrix exhibits the prediction confidence of the unified discriminative method (UDM) run against real-world objects. Color, shape, and object variations add complexity to the performance.

Compared to color classifiers, object classifiers are able to predict object instances with greater prediction strength. The “lemon” classifier shows a positive association with “yellow” objects, and strong predictive ability on a lemon. The shape features of a carrot are complex compared to a lemon, so it is unsurprising that the predictive power of the learned “carrot” classifier is not strong compared to the “lemon” classifier. From different angles, pictures of carrots show very different shapes, while lemons look almost the same from all angles. The complexity of the features affects the classification accuracy substantially.

Performance comparison for specific words. Table 3 shows the performance comparison of the two baselines and the variants of the unified discriminative method for every meaningful token selected using tf-idf. As mentioned above, there were 4 trials for every method, and every token is tested 10 times in each trial with 3–6 positive and negative test instances. For every token, the F1-score is calculated in every test and averaged to obtain the overall measure, treating every token's test result equally.

The predefined category-specific baseline grounds color-specific language terms exceptionally well compared to the other approaches. On average, color classifiers had an F1-score of 0.792 for the predefined category classifier, 0.578 for category-free logistic regression, 0.51 for the unified discriminative method with latent dimension 12, 0.611 for UDM with latent dimension 50, and 0.511 for UDM with latent dimension 100. However, our method with latent dimension 50 is able to perform better than the category-free logistic regression, where the classifier input is a vector of raw features.

Table 3: Averaged macro F1-score comparison of our unified discriminative method against other approaches for every token. We segment the classifiers by category here for ease of analysis: our UDM models do not consider category types. UDM with latent dimension 50 provides promising performance in grounded language acquisition for all categories. Color-specific visual classifiers perform better than the category-free logistic regression baseline. Object and shape classifiers perform well with our method (UDM) with latent dimension 50 compared to other approaches.

Our method with latent dimension 50 outperforms both baselines for shape classification, with an average F1-score of 0.69, where category-free logistic regression scored 0.52, UDM with latent dimension 12 scored 0.64, UDM with latent dimension 100 scored 0.57, and the category-specific approach scored 0.61.

In the case of object classification, which is comparatively complex, our method with latent dimensions 12 and 50 performs better than the predefined category classifier and category-free logistic regression. The scores of the methods are as follows: predefined category classifier: 0.6744, category-free logistic regression: 0.6163, UDM with latent dimension 12: 0.7154, UDM with latent dimension 50: 0.7537, and with latent dimension 100: 0.6865. While the minimum F1-score for UDM with dimension 50 is 0.626, the baseline predefined category classifier scores as low as 0.246 and category-free logistic regression scores 0.233.

Figure 6: Averaged micro F1-score performance of visual classifiers. The unified discriminative method (UDM) shows improved performance over the predefined category classifier, where classifiers are learned per category, and the category-free logistic regression, where the concatenated feature set is learned per word.

Comparison of macro F1-score distributions for all tokens. Figure 5 shows the distributional comparison of the other approaches with the discriminative model variants. Table 1 shows the overall summary of these distribution comparisons, while the boxplot visualizes the median (middle line in the box), two hinges, two whiskers, and all outliers. The lower and upper hinges outline the 25th and 75th percentiles of the data distribution. All scores are higher than the baselines, with a minimum F1-score of 0.4560 for our method with dimension 50.

Overall micro-averaged F1-score comparisons. Figure 6 shows the micro-averaged F1-score comparison of our method with all other approaches. Our method scores 0.72, which is higher than all other approaches. This indicates that extracting a meaningful embedding from existing features is an efficient way to carry out grounded language prediction.

6 Conclusion

An unconstrained general classifier is essential for learning language in association with the features extracted from objects. In this work, we present an approach for learning a category-free unified language grounding model. Our results indicate that learning a meaningful embedding from the cumulative unified feature set is an effective approach for learning linguistic constructs beyond constrained domains. We demonstrate that such a unified model, when using carefully chosen parameters, can efficiently ground linguistic concepts with unconstrained natural language using sensor data. This reduces the need to learn classifiers with specific category features.

Analysis with our unified discriminative method, which extracts a relevant representation of the feature sets, suggests that the method is effective. The efficient use of such a learning system can potentially reduce the need for selecting important tokens from large corpora. In the future, we intend to run the approach on a more varied and complex dataset. We also intend to compare our results with a deep network classifier with no autoencoder layers. In addition, we plan to compare our models with pre-trained word and ResNet models. We also aim to identify semantic similarity between words using word vectors without visual data.

References

Hyemin Ahn, Timothy Ha, Yunho Choi, Hwiyeon Yoo, and Songhwai Oh. 2018. Text2Action: Generative adversarial synthesis from language to action. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1–5. IEEE.

Muhannad Al-Omari, Paul Duckworth, David C Hogg, and Anthony G Cohn. 2017. Natural language acquisition and grounding for embodied robotic systems. In Proceedings of the 31st National Conference on Artificial Intelligence (AAAI), pages 4349–4356.

Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sunderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. 2018. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2.

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In The IEEE International Conference on Computer Vision (ICCV).

Yoav Artzi and Luke Zettlemoyer. 2013. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics (TACL), 1:49–62.

Arun Balajee Vasudevan, Dengxin Dai, and Luc Van Gool. 2018. Object referring in videos with language and human gaze. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

T. Berg, A. Berg, and J. Shih. 2010. Automatic attribute discovery and characterization from noisy web data. Computer Vision–ECCV 2010.

Liefeng Bo, Kevin Lai, Xiaofeng Ren, and Dieter Fox. 2011. Object recognition with hierarchical kernel descriptors. In Computer Vision and Pattern Recognition (CVPR). IEEE.

Jake Brawer, Olivier Mangin, Alessandro Roncone, Sarah Widder, and Brian Scassellati. 2018. Situated human–robot collaboration: predicting intent from grounded natural language. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 827–833. IEEE.

Cesar Cadena, Anthony R Dick, and Ian D Reid. 2016. Multi-modal auto-encoders as joint estimators for robotics scene understanding. In Robotics: Science and Systems.

Joyce Y Chai, Qiaozi Gao, Lanbo She, Shaohua Yang, Sari Saba-Sadiya, and Guangyue Xu. 2018. Language to action: Towards interactive task learning with physical agents. In IJCAI, pages 2–9.

Necati Cihan Camgoz, Simon Hadfield, Oscar Koller, Hermann Ney, and Richard Bowden. 2018. Neural sign language translation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Mohamed Elhoseiny, Babak Saleh, and Ahmed Elgammal. 2013. Write a classifier: Zero-shot learning using purely textual descriptions. In The IEEE International Conference on Computer Vision (ICCV).

Bharath Hariharan and Ross Girshick. 2017. Low-shot visual recognition by shrinking and hallucinating features. In The IEEE International Conference on Computer Vision (ICCV).

Stevan Harnad. 1990. The symbol grounding problem. Physica D: Nonlinear Phenomena, 42(1-3):335–346.

Jun Hatori, Yuta Kikuchi, Sosuke Kobayashi, Kuniyuki Takahashi, Yuta Tsuboi, Yuya Unno, Wilson Ko, and Jethro Tan. 2018. Interactively picking real-world objects with unconstrained spoken language instructions. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 3774–3781. IEEE.

Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. 2017. Image-to-image translation with conditional adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. 2014. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1725–1732.

Durk P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. 2014. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.

Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. 2014. Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(3):453–465.

Sergey Levine, Peter Pastor, Alex Krizhevsky, Julian Ibarz, and Deirdre Quillen. 2018. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research, 37(4-5):421–436.

Xiaojiao Mao, Chunhua Shen, and Yu-Bin Yang. 2016. Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In Advances in Neural Information Processing Systems, pages 2802–2810.

Cynthia Matuszek, Nicholas FitzGerald, Luke Zettlemoyer, Liefeng Bo, and Dieter Fox. 2012. A Joint Model of Language and Perception for Grounded Attribute Learning. In Proceedings of the 2012 International Conference on Machine Learning, Edinburgh, Scotland.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In International Conference on Learning Representations Workshop Proceedings.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NIPS).


Dipendra K Misra, Jaeyong Sung, Kevin Lee, and Ashutosh Saxena. 2016. Tell me Dave: Context-sensitive grounding of natural language to manipulation instructions. International Journal of Robotics Research (IJRR).

Daniel Nyga, Mareike Picklum, and Michael Beetz. 2017. What no robot has seen before: Probabilistic interpretation of natural-language object descriptions. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 4278–4285. IEEE.

Rohan Paul, Jacob Arkin, Nicholas Roy, and Thomas M Howard. 2016. Efficient grounding of abstract spatial concepts for natural language interaction with robot manipulators. In Proceedings of Robotics: Science and Systems (RSS).

Nisha Pillai, Francis Ferraro, and Cynthia Matuszek. 2018. Optimal semantic distance for negative example selection in grounded language acquisition. In Robotics: Science and Systems Workshop on Models and Representations for Natural Human-Robot Communication.

Nisha Pillai and Cynthia Matuszek. 2018. Unsupervised end-to-end data selection for grounded language learning. In Proceedings of the 32nd National Conference on Artificial Intelligence (AAAI), New Orleans, USA.

Matthias Plappert, Christian Mandery, and Tamim Asfour. 2018. Learning a bidirectional mapping between human whole-body motion and natural language using deep recurrent neural networks. Robotics and Autonomous Systems, 109:13–26.

Achyutha Bharath Rao, Krishna Krishnan, and Hongsheng He. 2018. Learning robotic grasping strategy based on natural-language object descriptions. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 882–887. IEEE.

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99.

Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevor Darrell, and Bernt Schiele. 2016. Grounding of textual phrases in images by reconstruction. In European Conference on Computer Vision, pages 817–834. Springer.

Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. 2017. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 618–626.

Carina Silberer and Mirella Lapata. 2014. Learning grounded meaning representations with autoencoders. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 721–732.

Qingyang Tan, Lin Gao, Yu-Kun Lai, and Shihong Xia. 2018. Variational autoencoders for deforming 3D mesh models. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Stefanie Tellex, Thomas Kollar, Steven Dickerson, Matthew R Walter, Ashis Gopal Banerjee, Seth Teller, and Nicholas Roy. 2011. Approaching the symbol grounding problem with probabilistic graphical models. AI Magazine, 32(4):64–76.

Jesse Thomason, Jivko Sinapov, Maxwell Svetlik, Peter Stone, and Raymond J Mooney. 2016. Learning multi-modal grounded linguistic semantics by playing “I Spy”. In Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI).

Tatiana Tommasi, Francesco Orabona, and Barbara Caputo. 2010. Safety in numbers: Learning categories from few examples with multi model knowledge transfer. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 3081–3088. IEEE.

Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. 2016. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164.

Chengde Wan, Thomas Probst, Luc Van Gool, and Angela Yao. 2017. Crossing nets: Combining GANs and VAEs with a shared latent space for hand pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 680–689.

Yaxing Wang, Joost van de Weijer, and Luis Herranz. 2018. Mix and match networks: Encoder-decoder alignment for zero-pair image translation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. 2016. Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 21–29.

Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. 2016. Modeling context in referring expressions. In European Conference on Computer Vision.


Licheng Yu, Hao Tan, Mohit Bansal, and Tamara L Berg. 2017. A joint speaker-listener-reinforcer model for referring expressions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7282–7290.