Linking WordNet to 3D Shapes

Angel X Chang, Rishi Mago, Pranav Krishna, Manolis Savva, and Christiane Fellbaum
Department of Computer Science, Princeton University

Princeton, New Jersey, USA
[email protected], [email protected], [email protected], [email protected], [email protected]

Abstract

We describe a project to link the Princeton WordNet to 3D representations of real objects and scenes. The goal is to establish a dataset that helps us to understand how people categorize everyday common objects via their parts, attributes, and context. This paper describes the annotation and data collection effort so far as well as ideas for future work.

1 Introduction

The goal of this project is to connect WordNet (Fellbaum, 1998) to 3D representations of real objects and scenes. We believe that this is a natural step towards true grounding of language, which will shed light on how people distinguish, categorize and verbally label real objects based on their parts, attributes, and natural scene contexts.

Our main motivation is to establish a dataset connecting language with realistic representations of physical objects and scenes using 3D computer-aided design (CAD) models, to enable research in computational understanding of the human cognitive process of categorization.

Categorization is the process by which we group entities and events together based on salient similarities, such as shared attributes or functions. For example, the category “furniture” includes tables, chairs and beds, all of which are typical parts of a room or house and serve to carry out activities inside or around the house. Subcategories are “seating furniture,” which includes chairs and sofas, and “sleeping furniture,” which includes beds, bunk beds and futons. Note that some categories have a simple verbal label (a name, like “furniture”), but often category names are compounds (like “sleeping furniture”). Compounding, a universal feature of human language that accounts in part for its infinite generativity, allows us to make up names on the fly whenever we feel the need to distinguish finer-grained categories, such as “college dorm room furniture.” Of course, not all languages share the same inventory of simple labels.

We form and label categories all the time. Categories help us to recognize never-before-seen entities by perceiving and assessing their attributes and functions and, on the basis of similarity to known category members, assign them to a category (Rosch, 1999). Young children in particular learn to form categories by being exposed to an increasing number of different category members and gradually learning whether they belong to one category or another. Importantly, categories allow us to reason: if we know that beds are made for lying down on and sleeping, encountering a new term like “sleigh bed” will tell us that such a bed is likely to have a flat surface on which a person can lie down. Conversely, seeing a sleigh bed for the first time and identifying this salient feature will prompt us to call it a “bed.” Categorization is so fundamental to human cognition that we are not consciously aware of it; however, it remains a significant challenge for computational systems in tasks such as object recognition and labeling.

Parts, attributes, and natural contexts of objects are all involved in category formation. Objects are made of parts, and parts often confer functionality, especially in the broad category of “artifacts.” Thus, seat surfaces are a necessary part of functioning chairs. Parts and the functionality they enable are fundamentally intertwined with categorization (Tversky and Hemenway, 1984).

Beyond their concrete parts, objects are perceived to have a general set of attributes. For example, the distinction between a cup, a goblet and a mug relies less on the presence or absence of specific parts and more on the geometric differences in aspect ratio of the objects themselves.

Lastly, real objects occur in real scenes, meaning that they possess natural contexts within which they are observed. In language, context reflects a variety of aspects beyond functionality, including syntactic patterns and distributional properties. In contrast, physical context is concrete and defined by the sets of co-occurring objects and their relative arrangements in a given scene.

To study these three aspects of category formation, we will connect WordNet at the part, attribute, and contextual level with a 3D shape dataset. The longer-term goal of this project is to ask how people distinguish and categorize objects based on how they name and describe the parts and attributes of the objects. We will focus primarily on the category “furniture.”

2 Existing datasets

There has been much prior work linking WordNet (Fellbaum, 1998) to 2D images. The most prominent effort in this direction is ImageNet (Deng et al., 2009), which structures a large dataset of object images in accordance with the WordNet hierarchy. Our project differs from ImageNet: even though ImageNet is based on the WordNet hierarchy, our focus is on annotating parts and attributes on 3D models rather than on labeling images. Following this, the SUN (Xiao et al., 2010) dataset focuses on scene images, and the Visual Genome (Krishna et al., 2016) dataset defines object relations and attributes in a connected scene graph representation within each image. Another line of work focuses on detailed annotation of objects and their parts — a prominent recent example is the ADE20K dataset (Zhou et al., 2016). However, a fundamental assumption of all this work is that objects and their properties can be adequately represented in the 2D image plane. This assumption does not generally hold, as many object parts, spatial relations between objects, and a full view of object context are hard to infer from the limited field of view of a 2D image.

More recently, there has been some work that links 3D CAD models to WordNet. The ShapeNet (Chang et al., 2015) dataset is a large collection of CAD representations (curating close to 65,000 objects in approximately 200 common WordNet synsets), whereas the SUNCG (Song et al., 2017) dataset contains CAD representations of 45,000 houses, composed of individual objects (in 159 WordNet synsets). These two datasets are both “synthetic” in the sense that the 3D CAD representations are designed virtually by a human expert. A different form of 3D representation is obtained by scanning and 3D reconstruction of real-world spaces. Recent work introduced ScanNet (Dai et al., 2017) and the Matterport3D (Chang et al., 2017) dataset, which both contain 3D reconstructions of various public and private interior spaces (containing 409 and 430 object synsets respectively).

Though both synthetic and reconstructed 3D data are increasingly available, no effort currently exists to connect such 3D representations to WordNet at the part, attribute, and contextual levels. Such a link of WordNet entries to 3D data can provide much richer information than 2D image datasets. Naturally, 3D representations allow us to reason about unoccluded parts and symmetries and about the arrangement of objects in a physically realistic three-dimensional space, and to account for empty space, a critical property of real scenes which is not observable in 2D images. Moreover, 3D representations are appropriate for computationally simulating real spaces and the actions that can be performed within them. The ability to do this is a powerful tool for investigating and understanding actions (Pustejovsky et al., 2016). Therefore, our project aims to annotate 3D models in one of the existing datasets and link them to the appropriate synset in the WordNet database at the part, attribute, and contextual level.

3 Project description

Our project has so far focused on annotating part and attribute information on 3D CAD objects in SUNCG (Song et al., 2017) and linking them to the corresponding WordNet synsets. We are working with a preliminary categorization of the objects performed in prior work, which establishes their connection to WordNet synsets denoting physical objects. However, we plan to refine the granularity of this categorization by introducing finer-grained categories — e.g. partitioning “doors” into “garage doors” and “screen doors” among others.
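As a minimal illustration of how such finer-grained categories could be seeded from WordNet itself, the sketch below (assuming NLTK’s WordNet interface; the choice of door.n.01 is just an example) lists the existing hyponyms of a coarse object synset as candidate subcategories:

```python
from nltk.corpus import wordnet as wn

# Existing finer-grained senses below a coarse object category, e.g. "door";
# these can serve as candidate subcategories when refining object labels.
door = wn.synset('door.n.01')
for hyponym in door.hyponyms():
    print(hyponym.name(), '-', hyponym.definition())
```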

We chose this dataset because the 3D objects in SUNCG (approximately 2,500) are used across a large number of 3D scenes (more than 45,000). This means that for each object, we can automatically establish many contextual observations. This property of the SUNCG dataset differs from other common 3D CAD model research datasets such as ShapeNet (Chang et al., 2015), where each object is de-contextualized. For the latter dataset we would have to additionally compose scenes using available objects in order to acquire observation contexts for the objects.

Figure 1: Interface for labeling object parts. Top left: a stove and range object is displayed to a user. Top right: the user paints the “countertop” part. Bottom left: the user links the “range” part to the corresponding WordNet synset. Bottom right: the fully annotated object with parts in different colors.

We first augment the SUNCG objects with part annotations that are linked to WordNet synsets. We defer the assignment of attributes to the same objects, as it is an easier annotation task in terms of interface design. To perform the part annotation, we designed an interface with a “paint and name” interaction, where the user paints regions of the surface of an object corresponding to a distinct part and assigns a name to that part. The details of the interface and annotation task are described in the following section.

4 Annotation interface

Our interface is designed to allow for efficient annotation of parts by inexperienced workers on crowdsourcing platforms such as Amazon’s Mechanical Turk. The interface is implemented in JavaScript using three.js (a WebGL-based graphics library). It can be accessed on the web using any modern browser, and does not require specialized software or hardware.

Figure 1 shows a series of screenshots from the interface illustrating the annotation process. The user first sees a rotating “turn table” view that reveals the appearance of the object from all directions. The user then types the name of a part in a text panel and drags over the surface of the object to select the regions corresponding to the part. The process is repeated for each part and the final result is a multi-colored painting of the object with corresponding part names for each color. The partitioning of the object’s geometry into different parts is saved and submitted to a data server upon completion. Figure 2 shows a close-up of the interface while an object is in the process of being annotated.

Figure 2: Partially annotated object with parts highlighted in different colors. The “door panel” label is selected, indicating that this part was just annotated.
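The paper does not specify the format in which the painted partitions are submitted to the data server; purely as a hypothetical illustration (all field names and values below are our own assumptions, not the project’s actual schema), a per-object record might pair each named part with its painted segments and, after linking, a WordNet synset:

```python
# Hypothetical submission record; every field name and value here is illustrative.
annotation = {
    "model_id": "suncg_object_0421",       # made-up object identifier
    "annotator": "worker_03",
    "parts": [
        {"name": "door panel",             # freeform text typed by the annotator
         "synset": "panel.n.01",           # filled in later during synset linking
         "segments": [12, 13, 47]},        # indices of painted geometry segments
        {"name": "handle",
         "synset": "handle.n.01",
         "segments": [5]},
    ],
}
```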

In order to enable an efficient multi-level painting interaction, the size of the paint brush can be adjusted by the annotator. The object geometry is pre-segmented at several levels of granularity: a segmentation into surfaces with the same material assignment, a segmentation with a loose surface normal distance criterion, and finally a segmentation into sets of topologically connected components of the object geometry. This multi-level painting allows the speed of labeling to be adjusted to accommodate both small parts (e.g., door handles) and large parts (e.g., countertops on kitchen islands).
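As a rough sketch of the coarsest of these pre-segmentation levels (topologically connected components), the standalone Python snippet below groups triangle faces that share vertices with a small union-find; the actual pipeline operates on the SUNCG meshes and is more involved:

```python
def connected_components(faces):
    """Group triangle faces that share at least one vertex.

    `faces` is a list of (v0, v1, v2) vertex-index triples; returns a list of
    sets of face indices, one set per topologically connected component.
    """
    parent = {}

    def find(v):
        # Path-halving find; setdefault makes unseen vertices their own root.
        while parent.setdefault(v, v) != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    def union(a, b):
        parent[find(a)] = find(b)

    for v0, v1, v2 in faces:          # faces sharing any vertex end up merged
        union(v0, v1)
        union(v1, v2)

    components = {}
    for i, (v0, _, _) in enumerate(faces):
        components.setdefault(find(v0), set()).add(i)
    return list(components.values())


# Two triangles sharing an edge form one component; the third is separate.
print(connected_components([(0, 1, 2), (1, 2, 3), (4, 5, 6)]))  # [{0, 1}, {2}]
```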

The annotators use freeform text for the part names, requiring that we address the problem of mapping this part-name text to a WordNet synset. We implemented a simple algorithm that restricts the candidate synsets to physical objects in the WordNet hierarchy, preferring furniture (since we are dealing with indoor scenes that are predominantly composed of furniture). Given the object category, we can additionally use the meronym relations in WordNet to suggest and rerank possible part synsets.
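A minimal sketch of this kind of candidate restriction and reranking, assuming NLTK’s WordNet interface (the scoring below, and the example inputs “leg” and table.n.02, are simplified stand-ins for the actual algorithm):

```python
from nltk.corpus import wordnet as wn

PHYSICAL = wn.synset('physical_entity.n.01')
FURNITURE = wn.synset('furniture.n.01')


def is_under(synset, ancestor):
    """True if `ancestor` appears on any hypernym path of `synset`."""
    return any(ancestor in path for path in synset.hypernym_paths())


def candidate_part_synsets(part_name, object_synset=None):
    """Rank noun synsets for a free-text part name.

    Candidates are restricted to physical entities; synsets that are part
    meronyms of the object's synset, and synsets under furniture.n.01,
    are ranked first.
    """
    candidates = [s for s in wn.synsets(part_name, pos=wn.NOUN)
                  if is_under(s, PHYSICAL)]
    meronyms = set(object_synset.part_meronyms()) if object_synset else set()

    def score(s):
        return (s in meronyms,            # known part of the object category
                is_under(s, FURNITURE))   # indoor-scene / furniture bias

    return sorted(candidates, key=score, reverse=True)


# Example: map the part name "leg" annotated on a table object.
print(candidate_part_synsets('leg', wn.synset('table.n.02'))[:3])
```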

After applying this algorithm to connect each part name’s text to a WordNet synset, we manually verify, and if needed fix, the inferred link using an interface that displays the WordNet synset assignment for the given part (in the same view as the part annotation view) and allows the user to select a different synset.


Figure 3: A fully completed part annotation example for a car 3D object. Different colors correspond to distinct parts, with corresponding names provided by the annotator on the right.

Figure 4: List of 3D models to be annotated. This is the interface through which one selects a model to annotate or can examine a previously annotated model.

5 Initial annotations and statistics

Five persons have worked on the annotation thus far. Annotating one 3D model takes roughly five minutes on average, with time being mostly a function of the complexity of the particular object. To maintain consistency while annotating the objects in the database, the annotators were instructed to name the geometric and functional parts of the object, not decorations or stylistic elements (e.g., a picture of a fish on a lamp shade).

Using the interface described above, we have so far collected more than 100,000 part instance annotations. An example of a car annotated by a crowd worker is shown in Figure 3. SUNCG includes a total of 2,547 models, of which we have so far annotated 1,021 during the prototyping process for developing our interface (Figure 4). Figure 5 shows several other example objects with their part annotations visualized.

6 Limitations

The initial stages of the annotations were limited by two major factors: the quality of the 3D models, and language itself. A handful of the models in the database had segmentation issues, i.e., errors in how the model was broken up into regions. For example, a segmentation issue for a table is encountered when more than one leg is connected into one region, thus making it impossible, even with the smallest brush size, to label the individual legs. To solve this problem, the segmentation algorithm must be improved. A more widespread problem with the database, however, was the lack of internal structure in many of the models. For example, in bookcases with doors, only the doors would be present in the model, while the internal structures — in this case the shelves — were omitted.

Figure 5: Example objects annotated with parts. Objects are assigned to the following WordNet synsets, from top left: double_bed.n.01, bunk_bed.n.01, door.n.01, desk.n.01, straight_chair.n.01, swivel_chair.n.01.

Some linguistic factors can cause the annotation to be less than straightforward. For example, should the link be to the British or the U.S. word? If we annotate a coffee cup, should we annotate the piece of cardboard around the cup that is designed to protect our hand from the heat of the coffee using its specific but rare name “zarf,” or should we choose a more common but less specific term such as “sleeve” or even “piece of cardboard?” Moreover, some parts do not have proper labels: what should we call the beams that help stabilize some tables other than “support beams?” See Figure 6 for an example.


Figure 6: An example demonstrating some of the limitations of our annotation system. Parts of the table are identified as “storage support beam” and “shelf support panel” for lack of a better term.

7 Future Work

Our work so far has focused on developing the infrastructure and annotation interfaces to collect 3D object part annotations at scale. This part data linked to WordNet is of tremendous potential value, which we plan to investigate as our project continues.

A very interesting direction of work is in building contextual multimodal embeddings. Many object parts and attributes are rarely mentioned in language. For example, a stool doesn’t have a back, but people don’t refer to stools as “backless chairs.” Neither do speakers encode the fact that chairs often have four legs or five wheels; only non-default exemplars might be labeled in an ad hoc fashion as “five-legged chairs,” for example. Furthermore, the physical contexts of objects (see Figure 7) provide richer information than is found in text. In this regard, the text and 3D modalities are complementary and provide an excellent target for building multimodal distributional representations (Bruni et al., 2014). Multimodal embeddings are a promising semantic representation which has been leveraged for various natural language processing and vision tasks (Silberer and Lapata, 2014; Kiela and Bottou, 2014; Lazaridou et al., 2015; Kottur et al., 2016).
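As a toy illustration of one common fusion strategy for such multimodal representations (simple weighted concatenation of independently normalized modality vectors, in the spirit of Bruni et al. (2014); the vectors and dimensions below are placeholders, not our actual data):

```python
import numpy as np


def fuse(text_vec, shape_vec, alpha=0.5):
    """Concatenate L2-normalized text and 3D-context vectors into one embedding."""
    t = text_vec / (np.linalg.norm(text_vec) + 1e-8)
    s = shape_vec / (np.linalg.norm(shape_vec) + 1e-8)
    return np.concatenate([alpha * t, (1.0 - alpha) * s])


# e.g. a 300-d distributional vector for "stool" fused with a 64-d vector
# summarizing its annotated parts and 3D scene contexts (both random here).
stool = fuse(np.random.randn(300), np.random.randn(64))
print(stool.shape)  # (364,)
```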

Another direction for future work is to leverage the object parts and attributes and their correspondences to WordNet to go beyond the set of WordNet synsets and automatically induce new senses, along the lines of recent work on sense induction (Chen et al., 2015; Thomason and Mooney, 2017). For example, we have found that WordNet synsets do not have good coverage of some fairly modern categories of objects that we observe in our 3D object datasets, including iPads, iPhones and various electronic devices such as game consoles.

Figure 7: An example of the same nightstand object (outlined in blue) in two different 3D scene contexts. A contextual embedding afforded by the full 3D representation of the scene within which the nightstand is observed would be a powerful way to analyze and disentangle different usage contexts for common objects.

References

Elia Bruni, Nam-Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. Journal of Artificial Intelligence Research (JAIR), 49:1–47.

Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. 2015. ShapeNet: An information-rich 3D model repository. Technical Report arXiv:1512.03012 [cs.GR], Stanford University — Princeton University — Toyota Technological Institute at Chicago.

Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. 2017. Matterport3D: Learning from RGB-D data in indoor environments. International Conference on 3D Vision (3DV).

Xinlei Chen, Alan Ritter, Abhinav Gupta, and Tom Mitchell. 2015. Sense discovery via co-clustering on images and text. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5298–5306.


Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. 2017. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR).

Christiane Fellbaum. 1998. WordNet. Wiley Online Library.

Douwe Kiela and Leon Bottou. 2014. Learning image embeddings using convolutional neural networks for improved multi-modal semantics. In EMNLP, pages 36–45.

Satwik Kottur, Ramakrishna Vedantam, Jose M. F. Moura, and Devi Parikh. 2016. Visual word2vec (vis-w2v): Learning visually grounded word embeddings using abstract scenes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June.

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael Bernstein, and Li Fei-Fei. 2016. Visual Genome: Connecting language and vision using crowdsourced dense image annotations.

Angeliki Lazaridou, Marco Baroni, et al. 2015. Combining language and vision with a multimodal skip-gram model. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 153–163.

James Pustejovsky, Tuan Do, Gitit Kehat, and Nikhil Krishnaswamy. 2016. The development of multimodal lexical resources. In Proceedings of the Workshop on Grammar and Lexicon: Interactions and Interfaces (GramLex), pages 41–47.

Eleanor Rosch. 1999. Principles of categorization. Concepts: Core Readings, 189.

Carina Silberer and Mirella Lapata. 2014. Learning grounded meaning representations with autoencoders. In ACL (1), pages 721–732.

Shuran Song, Fisher Yu, Andy Zeng, Angel X. Chang, Manolis Savva, and Thomas Funkhouser. 2017. Semantic scene completion from a single depth image. IEEE Conference on Computer Vision and Pattern Recognition.

Jesse Thomason and Raymond J. Mooney. 2017. Multi-modal word synset induction. In IJCAI.

Barbara Tversky and Kathleen Hemenway. 1984. Objects, parts, and categories. Journal of Experimental Psychology: General, 113(2):169.

Jianxiong Xiao, James Hays, Krista A. Ehinger, Aude Oliva, and Antonio Torralba. 2010. SUN database: Large-scale scene recognition from abbey to zoo. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 3485–3492. IEEE.

Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. 2016. Semantic understanding of scenes through the ADE20K dataset. arXiv preprint arXiv:1608.05442.