
Research Statement

Ilker Yildirim

My research is driven by a basic puzzle of perception: How can our perceptual experiences be so rich in content, so robust to environmental variation, and so fast to construct from sensory inputs – all at the same time? When we see an object we do not see it only as a kind of thing – a book, a bookcase, a table, a teacup, a car, or a tree. From a quick glance, the touch of an object, or a brief sound snippet, our brains map raw sensory signals into rich mental representations of physical scenes, shapes, surfaces, and spatial layouts. I study visual and multisensory perception to understand both the brain’s representations of physical objects and the algorithms used to construct them. I study these representations via novel tasks such as recognizing visually presented objects by touch or under heavy occlusion, recognizing objects under unusual viewing conditions, estimating physical object properties (e.g., weight and friction) from how objects interact in a dynamic scene, and predicting the outcomes of those interactions. My work weaves together computational, behavioral, and neural approaches to build a unified understanding of higher-level perception and its links with cognition.

My computational work embraces an old approach to studying how sensory inputs give rise to rich mental representations – the approach of analysis-by-synthesis. In this approach, perception is viewed as the agreement of the physical sensory inputs and their computational re-construction or re-creation from abstract hypotheses. Despite its long history and unique appeal, this approach has lagged behind other approaches to perception. My work modernizes and efficiently re-instantiates analysis-by-synthesis by drawing on methods from multiple paradigms in psychology, neuroscience, and artificial intelligence. These methods include probabilistic programs, deep neural networks, and forward models such as video game engines. Together these methods not only provide a unified account of perception in neural and cognitive terms, but also have applications in building more human-like machine and robotic perceptual systems.

The goal of my behavioral work is to study mental representations using tasks that require combining very limited sensory inputs with internal models of the world to make inferences about unobservable physical properties. I approach this goal with novel experimental paradigms that rely on advanced simulation software and 3D printing technology for visual, non-visual, and multisensory stimulation. For example, in one experiment I showed subjects life-like images of an unknown object (e.g., a chair) draped in a cloth and asked them to make inferences about the hidden object; in another experiment, I had them feel an unfamiliar 3D-printed object and match it to images. These paradigms show that people can infer rich physical object properties from very limited sensory inputs, and provide the data to test specific models.

The goal of my computational neuroscience work is to establish links between neural activity and physical object representations. Because internal models of the physical world can be built from such representations, my work is well positioned to uncover the “mental life of neurons” or to relate neural activity to cognition. I have developed multiple collaborations with primate electrophysiology researchers, which involve a combination of analyzing data they have already collected, jointly planning new experiments to test the models I build, and using the results of those experiments to inspire new model designs. I use these and alternative models to make inferences about the neural activity recorded during tasks that require representing objects, such as vision-based planning to grasp objects.

In the following sections I describe several of these findings, progressing from strictly visual object representations through multisensory representations up to the richest but most abstract forms of physical object representations. At the end of each section, I outline my future research plans.

1. Visual object representations

Vision is often well characterized as the solution to an inverse problem: inverting the process of how three-dimensional (3D) scenes give rise to images to reconstruct the scene that best explains a given observed image. Yet the neural circuits underlying this inverse mapping are poorly understood. Leading neural models of vision concentrate on the task of object identification or categorization, and do not attempt to explain how vision also constructs rich descriptions of scenes with detailed 3D shape and surface appearance. Alternative inverse graphics or analysis-by-synthesis approaches aim to recover 3D scene content, but are typically implemented using iterative computations that are hard to map onto neural circuits and much too slow for the speed of natural perception. It remains a mystery how vision constructs such rich scene descriptions of 3D shape and texture so quickly – on the order of a hundred milliseconds.

Figure 1: (A) Schematic of the recognition model, the generative graphics program, and the way we use the generative model to train the recognition model. (B) Representational similarity matrices arising from each of the neural sites recorded (Freiwald and Tsao, 2010) and from each of the top three layers of the recognition model. (C) The qualitative patterns found in the neural data emerge in our model (“Latents”) and, to a lesser extent, also in an alternative state-of-the-art machine face recognition system (“VGG Face”) trained with millions of images of faces to tell apart thousands of identities. (D) Illustration of the experiments to test models against behavioral data.

To address this challenge, I developed a computational model that recovers 3D object shape and texture from a single image similar to conventional analysis-by-synthesis approaches, but does so quickly using only feedforward computations, without the need for iterative algorithms. The model combines a probabilistic generative model based on a 3D graphics program for image synthesis with a recognition model based on a deep convolutional neural network [5, 4]. As a test case, we apply this model in the domain of face recognition where, as a rarity, data from brain imaging, single-cell recordings, and behavior all come together to constrain models. However, the scheme of the model is more general than faces and can apply to any object class.
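To make the training scheme concrete, here is a minimal sketch (in Python/PyTorch) of the general recipe: the generative graphics program samples latent scene descriptions and renders them into images, and a feedforward recognition network is trained to map those images back to the latents. This is an illustration under simplifying assumptions, not the implementation in [5, 4]: `render_scene` is a hypothetical stand-in for a real 3D renderer, and the latent dimensionality and network architecture are arbitrary.

```python
# Minimal sketch of efficient analysis-by-synthesis training:
# the generative model supplies (latents, image) pairs; the recognition
# network learns the fast feedforward inverse mapping image -> latents.
import torch
import torch.nn as nn

LATENT_DIM = 50   # e.g., face shape + texture coefficients (assumed)
IMG_SIZE = 64

def sample_latents(batch_size):
    """Sample scene hypotheses from the generative model's prior."""
    return torch.randn(batch_size, LATENT_DIM)

def render_scene(latents):
    """Hypothetical forward graphics: latents -> images.
    A real implementation would call a 3D graphics engine; a fixed random
    linear projection is used here only so the sketch runs end to end."""
    gen = torch.Generator().manual_seed(0)
    proj = torch.randn(LATENT_DIM, IMG_SIZE * IMG_SIZE, generator=gen)
    return (latents @ proj).view(-1, 1, IMG_SIZE, IMG_SIZE)

class RecognitionNet(nn.Module):
    """Feedforward CNN that inverts the graphics program in a single pass."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Linear(32 * 16 * 16, LATENT_DIM)

    def forward(self, x):
        return self.head(self.features(x))

model = RecognitionNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):                       # toy training loop
    z = sample_latents(32)                    # ground-truth latents, free from the prior
    x = render_scene(z)                       # synthesized images
    loss = nn.functional.mse_loss(model(x), z)
    opt.zero_grad(); loss.backward(); opt.step()
```

At test time only the trained recognition network is run, which is what makes the inverse mapping fast enough to be compatible with the speed of natural perception.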

We found that this approach provides the best account of a high-level function in the brain (face recognition), spanning neural activity and behavior. Our model (“Latents”) accounted for the qualitative patterns across populations of neurons in the monkey IT cortex (Figure 1B and C) as well as, if not better than, an alternative state-of-the-art machine vision system (“VGG Face”; [28]) that is trained using millions of images of faces to discriminate thousands of identities. We also found that in challenging face recognition tasks (Figure 1D), only our model performed on par with human subjects. Two alternative discriminative models (the VGG Face network and the “Identity” model, which is similar to the VGG Face model but is trained with images rendered by the generative model) performed worse than humans in most cases. Moreover, we found that our model – on a trial-by-trial basis – predicted human performance consistently well across all experiments, again unlike the alternative models. Our results provide a new way to think about perception as explanation [17], and show how visual processing can naturally support a broad range of cognitive operations such as causal inference, imagination, communication, and action planning, in addition to object recognition.

Within the next few years, I plan to further explore rich 3D object representations in the mind and brain by developing computational models and novel experimental setups, as described below.

Figure 2: (A) A compositional space of 3D object shapes. Basic parts can be combined in infinitely different ways to obtain new objects. (B) Schematics of the visual forward model and haptic forward model. The visual forward model projects a 3D model to images using various viewing angles. A haptic forward model outputs grasps of a 3D model (e.g., joint angles of the hand at the time of a stable grasp of the model). (C) Humans’ cross-modal object categorization performance. The left panel shows their performance at the end of the final training block. The right panel shows their cross-modal test performance. (D) Our proposed computational pipeline for cross-modal face recognition. (E) Example simulated grasps of everyday objects. Our system can generate plausible grasps using a large 3D model bank including more than 50,000 shapes. This simulated dataset can spur the development of recognition models for the haptic sense.

Future Research 1: Familiar and unfamiliar object recognition. The empirical literature on object recognition, and in particular on face processing, documents a striking distinction between the processing of familiar and unfamiliar faces [18]. However, the computational foundation of this difference is significantly under-explored. In collaboration with Ken Nakayama (Harvard) and Josh Tenenbaum (MIT), we are developing a unified account of familiar and unfamiliar object recognition, and in particular of face recognition, both computationally and psychophysically [9, 11]. Our computational framework brings together the efficient analysis-by-synthesis idea (Figure 1A) with non-parametric Bayesian methods in order to implement a flexible and growing memory of familiar items. In parallel to these computational efforts, we are designing behavioral experiments that will control a subject’s familiarity with a given object identity, and will then measure their recognition abilities as a function of their level of familiarity.
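As a rough, purely illustrative sketch of what such a growing memory could look like (not our actual implementation), inferred face latents can be assigned to identity clusters under a Chinese-restaurant-process prior, so that an unfamiliar observation opens a new identity while a familiar one sharpens an existing identity. The Gaussian likelihood, the latent dimensionality, and the concentration parameter below are all assumptions made for the sketch.

```python
# Toy non-parametric Bayesian identity memory: each inferred face latent is
# assigned to an existing identity cluster or to a brand-new one (CRP prior).
import numpy as np

ALPHA = 1.0    # CRP concentration: propensity to posit a new identity (assumed)
SIGMA = 0.5    # assumed within-identity noise in latent space

def identity_probs(z, clusters):
    """Posterior over which familiar identity (or a new one) produced latent z."""
    scores = []
    for members in clusters:                      # existing identities
        mean = np.mean(members, axis=0)
        lik = np.exp(-np.sum((z - mean) ** 2) / (2 * SIGMA ** 2))
        scores.append(len(members) * lik)         # CRP weight * likelihood
    scores.append(ALPHA * 1e-3)                   # small base likelihood for a new identity
    scores = np.array(scores)
    return scores / scores.sum()

def update_memory(z, clusters):
    """Greedy (MAP) assignment of a newly inferred latent to the identity memory."""
    k = int(np.argmax(identity_probs(z, clusters)))
    if k == len(clusters):
        clusters.append([z])                      # unfamiliar: open a new identity
    else:
        clusters[k].append(z)                     # familiar: refine an existing identity
    return clusters

memory = []
for _ in range(20):                               # stream of inferred face latents
    memory = update_memory(np.random.randn(50), memory)
print(len(memory), "identities in memory")
```

The point of the non-parametric prior is that the number of stored identities is not fixed in advance; familiarity with an identity then corresponds to how much evidence its cluster has accumulated.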

2. Multisensory object representations

Despite the different sensory modalities through which we perceive objects, we experience a coherent and unified physical world. This observation indicates an important perceptual invariance – namely, modality-invariance. I developed a computational framework of object recognition that integrates sensory inputs across modalities into unified representations and produces modality-invariance as a by-product [10, 3, 2, 6, 8]. This framework has three components. First, it constructs a compositional space of objects or events, such as 3D shapes or spatial sequential patterns (e.g., localized visual flashes or brief sounds moving on a circle). Figure 2A shows an image of an example 3D object and how it is composed from subparts and primitive parts. Second, it projects such modality-independent objects or events to sensory domains using sensory-specific forward models.

For example, it projects 3D shapes to images using a graphics engine, and it projects 3D shapes to touch using a grasping simulator (Figure 2B). Third, in this framework, perception is implemented as Bayesian inference whereby input sensory signals are mapped onto the unobserved modality-independent representations. In my work, I implemented this framework for various computational tasks such as 3D reconstruction from single images, cross-modal (i.e., visual-haptic) recognition of object shape categories [2, 8], and modality-independent reasoning about spatial sequential patterns [6].
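The essential computational structure of the framework can be conveyed with a small sketch: a modality-independent latent object, two toy forward models standing in for the graphics engine and the grasping simulator, and Bayesian inference that combines whichever modalities happen to be observed. Everything here (the linear forward models, the Gaussian likelihood, the discrete shape library) is a simplifying assumption made only for illustration.

```python
# Toy version of the three-part framework: modality-independent shapes,
# modality-specific forward models, and Bayesian inference back to the shape.
import numpy as np

rng = np.random.default_rng(0)
N_SHAPES, SHAPE_DIM = 5, 8
shape_library = rng.normal(size=(N_SHAPES, SHAPE_DIM))   # modality-independent 3D shapes

W_vision = rng.normal(size=(SHAPE_DIM, 20))   # stand-in for a graphics engine
W_haptic = rng.normal(size=(SHAPE_DIM, 6))    # stand-in for a grasping simulator

def log_likelihood(obs, pred, sigma=0.5):
    return -np.sum((obs - pred) ** 2) / (2 * sigma ** 2)

def posterior_over_shapes(visual_obs=None, haptic_obs=None):
    """Combine whichever modalities are observed under a uniform prior over shapes."""
    logp = np.zeros(N_SHAPES)
    for k, shape in enumerate(shape_library):
        if visual_obs is not None:
            logp[k] += log_likelihood(visual_obs, shape @ W_vision)
        if haptic_obs is not None:
            logp[k] += log_likelihood(haptic_obs, shape @ W_haptic)
    p = np.exp(logp - logp.max())
    return p / p.sum()

# Cross-modal transfer: the same latent shape explains both an image and a grasp.
true_shape = shape_library[2]
image = true_shape @ W_vision + rng.normal(scale=0.5, size=20)
grasp = true_shape @ W_haptic + rng.normal(scale=0.5, size=6)
print(posterior_over_shapes(visual_obs=image).argmax())   # most likely shape from vision
print(posterior_over_shapes(haptic_obs=grasp).argmax())   # same shape recovered from touch
```

Because both modalities bottom out in the same latent object, category knowledge learned through one modality transfers to the other, which is exactly the behavioral prediction tested next.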

We behaviorally tested and quantitatively confirmed several of the predictions that these models make: (i) if a person learns to categorize objects based on inputs from one sensory modality (e.g., vision), the person can categorize these same or similar objects when they are perceived through another modality (e.g., touch) (Figure 2C) [2]; (ii) if a person learns to categorize spatial sequential patterns based on inputs from one sensory modality (e.g., audition), the person can categorize these same or similar patterns when they are observed through another modality (e.g., vision) [6]; (iii) a person’s similarity judgments of pairs of unfamiliar objects match within a modality (e.g., the person sees both objects) and across different sensory modalities (e.g., the person sees one object and feels the other) [8].

So far, this work has explored novel and fairly simple categories of objects and patterns. Within the next few years, I plan to extend this framework to settings involving more complex, everyday shapes.

Future Research 2: Cross-modal recognition of complex object shapes. I plan to extend my multisensory object recognition framework to more complex shapes (Figure 2E), and in particular to faces. In collaboration with Christian Wallraven (Korea University), we are designing a cross-modal recognition experiment in which human participants will see an image of a face before feeling a 3D-printed face sculpture and judging whether the two faces belong to the same individual or to two different people. In parallel, I am extending my efficient analysis-by-synthesis model described in § 1 so that the model will project its 3D shape inferences into the haptic modality (Figure 2D). This project will be the first to quantitatively model human performance in a cross-modal complex object recognition task. The combination of computational and behavioral methods in this project will provide a stronger test than our and others’ previous work of whether a unified representation of shape shared between vision and touch suffices to explain human behavior.

Future Research 3: Computational study of object representations in the anterior intraparietal cortex (AIP). What object-level transformations does the brain perform to visually guide grasping? Within the next few years, I plan to infer a detailed picture of the object representations in AIP, the brain region implicated in vision-based grasp planning. Three reasons make AIP of particular interest: (i) AIP visually processes 3D shape [33]; (ii) AIP participates in visual and tactile integration [34]; and (iii) AIP participates in visually guided grasping [31]. I have established a collaboration with Winrich Freiwald (The Rockefeller University) and members of his lab in order to guide the development and evaluation of the models using their existing neural data.

I will build computational models inspired by the efficient analysis-by-synthesis framework. Each model will embody a hypothesis about population-level representations in AIP. Two discrete dimensions will specify all models. First, the ultimate task of a model can be one of the following four: (i) object identification (or categorization), (ii) 3D shape recovery, (iii) grasp type identification, or (iv) pre-shaping the hand for grasping. Second, the inputs to a model can be unisensory or multisensory: (i) vision-only inputs (i.e., images), or (ii) visual-haptic inputs (i.e., both the images and simulated grasps of objects).
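Concretely, the candidate models form a small grid: each is one combination of a task objective and an input regime, so all hypotheses can be instantiated and compared against the AIP data in the same way. The short sketch below simply enumerates that grid; the labels are illustrative placeholders, not final model names.

```python
# Enumerate the planned hypothesis space: task objective x input regime.
from itertools import product

TASKS = ["object_identification", "3d_shape_recovery",
         "grasp_type_identification", "hand_preshaping"]
INPUTS = ["vision_only", "vision_plus_haptics"]

model_space = [{"task": t, "inputs": i} for t, i in product(TASKS, INPUTS)]
for spec in model_space:
    print(spec)   # 8 candidate models, each a distinct hypothesis about AIP
```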

3. Physical object representations

Humans demonstrate remarkable abilities to predict physical events in dynamic scenes, and to infer the physical properties of objects from static images or dynamic videos [20, 24, 26]. As [19] noted, we intuitively know how objects rest on and support each other, what would happen if they collide or fall, and how much force would be required to move them. These observations raise a central question: how do visual inputs lead to rich physical scene understanding? I developed a computational model of physical scene understanding that takes as input real-world images or videos and infers physical properties of objects such as their mass, density, and friction.

Figure 3: (A) Frames from a representative subset of videos that our model and human subjects saw. (B) The integrated physical scene understanding model takes as input real-world videos or images and interprets them using a convolutional neural network and a game engine that approximates expected interactions between physical objects. For a future project, we hypothesize that the convolutional neural network, which serves as a vision pipeline, represents the computations taking place in the visual cortex, and that the simulation engine, which serves as an approximation to intuitive physics knowledge, represents computations taking place in parts of the parietal cortex and the pre-motor cortex. (C) Simulation-based reconstructions of three videos by our model. Note that the model has to infer the physical properties of the objects in the scene. (D) A behavioral task that requires visual and physical understanding of the image. (E) Behavioral and computational results from a match-to-sample task where subjects had to match images of unoccluded objects to fully occluded scenes where an object is draped in cloth.

Based on these inferences, the model makes predictions about the future of the unfolding physical events in the scene [1, 13]. Frames from example real-world scenes that our model can process are shown in Figure 3A.

Figure 3B illustrates the schematic of our model. One of the two core elements of our model is a complex forward model, namely a 3D video game engine [25], operating on an object-based representation of physical properties – including mass, density, spatial position, 3D shape, and friction – and the interactions of these objects estimated using game engine physics. The other core element of our model is a deep neural network that maps visual inputs (e.g., images) to their physical object representations, such as their shape and mass. This neural network is pre-trained on a large image corpus. We fine-tune it using MCMC-based inference results from the generative model. Note that this training procedure is a form of self-supervised or semi-supervised learning where the data required for training are obtained using the model itself, thus avoiding any costly data collection step (e.g., images of objects labeled by their weights). After training, the neural network and the generative model are combined into an integrated physical scene understanding model: the neural network serves as a recognition model – a feedforward vision pipeline that maps images or videos to parameters of the generative model such as mass and friction.

The generative model predicts future outcomes for given inputs, generalizes across different scenarios, and can be used for causal and counterfactual inferences. Figure 3C shows frames from three input videos and frames from the corresponding simulation-based inference results. In behavioral experiments, we showed that both the model and humans can accurately predict the outcomes of object collisions and the relative mass of the objects.
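The self-supervised training loop at the heart of this pipeline can be sketched in a few dozen lines. Here a one-dimensional sliding-block simulator stands in for the game engine and a tiny network stands in for the vision pipeline; in the real system the observations are images or videos and the forward model is a full 3D physics engine, so everything below should be read as a simplified, assumption-laden illustration rather than the published implementation.

```python
# Sketch of self-supervised training: MCMC inference in the generative physics
# model labels its own data, and a feedforward network learns the fast inverse.
import numpy as np
import torch
import torch.nn as nn

rng = np.random.default_rng(1)

def simulate(friction, v0=5.0, dt=0.1, steps=20):
    """Toy forward model: a block slides and decelerates under friction."""
    v, xs = v0, []
    for _ in range(steps):
        v = max(v - friction * 9.8 * dt, 0.0)
        xs.append((xs[-1] if xs else 0.0) + v * dt)
    return np.array(xs)

def mcmc_infer_friction(observed, n_iter=300, step=0.05):
    """Metropolis sampling over friction given an observed trajectory."""
    def logp(f):
        return -np.sum((observed - simulate(f)) ** 2) / (2 * 0.1 ** 2)
    f = 0.5
    cur = logp(f)
    for _ in range(n_iter):
        prop = f + rng.normal(scale=step)
        if 0.0 < prop < 1.0 and np.log(rng.uniform()) < logp(prop) - cur:
            f, cur = prop, logp(prop)
    return f

# 1) The model labels its own training data: observation -> MCMC-inferred friction.
observations, labels = [], []
for _ in range(30):
    traj = simulate(rng.uniform(0.1, 0.9)) + rng.normal(scale=0.05, size=20)
    observations.append(traj)
    labels.append(mcmc_infer_friction(traj))   # label comes from the generative model itself

X = torch.tensor(np.array(observations), dtype=torch.float32)
y = torch.tensor(labels, dtype=torch.float32).unsqueeze(1)

# 2) Fine-tune a feedforward recognition network on those self-generated pairs.
net = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for _ in range(200):
    loss = nn.functional.mse_loss(net(X), y)
    opt.zero_grad(); loss.backward(); opt.step()
# At test time the network maps a new observation to physical parameters in one pass,
# and those parameters seed the forward model for prediction and counterfactuals.
```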

This work opens up exciting future directions interfacing cognitive science, neuroscience, and artificial intelligence.

Future Research 4: Physical object representations and visual object recognition. Does intuitive physical knowledge guide visual object recognition? Despite its centrality for everyday perception and cognition, the interaction between intuitive physical reasoning and visual object recognition is an under-explored research area. Consider a scenario where an object is draped in a cloth and so completely hidden. Can we recognize what that hidden object might be? In certain settings, this judgment is intuitive to us: for example, in Figure 3D, it is easy to tell which of the two objects might be a chair. In collaboration with Josh Tenenbaum (MIT), I am studying such intuitive judgments behaviorally and computationally [12]. Preliminary behavioral and computational results indicate that human participants successfully perform this difficult task, and that a generative model of perception can account for this behavior (Figure 3E).

Future Research 5: Computational study of neural circuits responsible for intuitive physical inference. Although the neural substrates supporting physical inferences such as estimating the velocity or the weight of an object have been studied, how the brain represents interactions of multiple physical objects is severely under-explored. Inspired by the work of Fischer et al. [19], and in collaboration with Winrich Freiwald (The Rockefeller University) and Josh Tenenbaum (MIT), we are planning behavioral experiments with monkeys that will leverage their physical intuitions. For example, in one planned experiment, monkeys will “catch” a ball among multiple balls that are colliding with each other and with walls present in the scene. Single-cell recordings will be performed during these experiments from several pre-determined brain regions, including parietal and pre-motor areas. Within the next few years, this joint work will give me a unique opportunity to computationally characterize the neural activity responsible for intuitive physical inferences in the brain.

References

[1] Yildirim, I.*, Wu, J.*, Lim, J. J., Freeman, B., & Tenenbaum, J. (2015). Galileo: Perceiving physical object properties by integrating a physics engine with deep learning. In Advances in Neural Information Processing Systems (pp. 127-135).
[2] Yildirim, I., & Jacobs, R. A. (2013). Transfer of object category knowledge across visual and haptic modalities: Experimental and computational studies. Cognition, 126(2), 135-148.
[3] Yildirim, I., & Jacobs, R. A. (2012). A rational analysis of the acquisition of multisensory representations. Cognitive Science, 36(2), 305-332.
[4] Yildirim, I., Kulkarni, T. D., Freiwald, W. A., & Tenenbaum, J. B. (2015). Efficient analysis-by-synthesis in vision: A computational framework, behavioral tests, and comparison with neural representations. In Proceedings of the 37th Conference of the Cognitive Science Society.
[5] Yildirim, I., Freiwald, W. A., & Tenenbaum, J. B. (in submission). A neural circuit for analysis-by-synthesis in the primate face processing system.
[6] Yildirim, I., & Jacobs, R. A. (2015). Learning multisensory representations for auditory-visual transfer of sequence category knowledge: A probabilistic language of thought approach. Psychonomic Bulletin & Review, 22(3), 673-686.
[7] Bates, C. J., Yildirim, I., Tenenbaum, J. B., & Battaglia, P. W. (2015). Humans predict liquid dynamics using probabilistic simulation. In Proceedings of the 37th Conference of the Cognitive Science Society.
[8] Erdogan, G., Yildirim, I., & Jacobs, R. A. (2015). From sensory signals to modality-independent conceptual representations: A probabilistic language of thought approach. PLoS Computational Biology, 11(11), e1004610.
[9] Allen, K. R., Yildirim, I., & Tenenbaum, J. B. (2015). A model of familiar and unfamiliar 3D face recognition. NIPS Workshop on Black Box Learning.
[10] Yildirim, I. (2014). From perception to conception: Learning multisensory representations (Doctoral dissertation, University of Rochester).
[11] Allen, K. R., Yildirim, I., & Tenenbaum, J. B. Integrating identification and perception: A case study of familiar and unfamiliar face processing. In Proceedings of the 38th Conference of the Cognitive Science Society.
[12] Yildirim, I., Siegel, M. H., & Tenenbaum, J. B. Perceiving fully occluded objects via physical simulation. In Proceedings of the 38th Conference of the Cognitive Science Society.
[13] Yildirim, I.*, Wu, J.*, Du, Y., & Tenenbaum, J. B. Interpreting dynamic scenes by a physics engine and bottom-up visual cues. 1st Workshop on Action and Anticipation for Visual Learning, European Conference on Computer Vision (ECCV).
[14] Barron, J. T., & Malik, J. (2012). Shape, albedo, and illumination from a single image of an unknown object. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on (pp. 334-341). IEEE.
[15] Kulkarni, T. D., Kohli, P., Tenenbaum, J. B., & Mansinghka, V. (2015). Picture: An imperative probabilistic programming language for scene perception. In Computer Vision and Pattern Recognition.
[16] Barrow, H., & Tenenbaum, J. (1978). Recovering intrinsic scene characteristics. Comput. Vis. Syst., A. Hanson & E. Riseman (Eds.), 3-26.
[17] Dayan, P., Hinton, G. E., Neal, R. M., & Zemel, R. S. (1995). The Helmholtz machine. Neural Computation, 7(5), 889-904.
[18] Jenkins, R., White, D., Van Montfort, X., & Burton, A. M. (2011). Variability in photos of the same face. Cognition, 121(3), 313-323.
[19] Fischer, J., Mikhael, J. G., Tenenbaum, J. B., & Kanwisher, N. (2016). Functional neuroanatomy of intuitive physical inference. Proceedings of the National Academy of Sciences, 201610344.
[20] Sanborn, A. N., Mansinghka, V. K., & Griffiths, T. L. (2013). Reconciling intuitive physics and Newtonian mechanics for colliding objects. Psychological Review, 120(2), 411.
[21] DiCarlo, J. J., Zoccolan, D., & Rust, N. C. (2012). How does the brain solve visual object recognition? Neuron, 73(3), 415-434.
[22] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
[23] Brady, T. F., Konkle, T., Alvarez, G. A., & Oliva, A. (2008). Visual long-term memory has a massive storage capacity for object details. Proceedings of the National Academy of Sciences, 105(38), 14325-14329.
[24] Smith, K. A., & Vul, E. (2013). Sources of uncertainty in intuitive physics. Topics in Cognitive Science, 5(1), 185-199.
[25] Coumans, E. (2010). Bullet physics engine. Open source software: http://bulletphysics.org.
[26] Battaglia, P. W., Hamrick, J. B., & Tenenbaum, J. B. (2013). Simulation as an engine of physical scene understanding. Proceedings of the National Academy of Sciences, 110(45), 18327-18332.
[27] Freiwald, W. A., & Tsao, D. Y. (2010). Functional compartmentalization and viewpoint generalization within the macaque face-processing system. Science, 330(6005), 845-851.
[28] Parkhi, O. M., Vedaldi, A., & Zisserman, A. (2015). Deep face recognition. Proceedings of the British Machine Vision Conference, 1(3), 6.
[29] Yuille, A., & Kersten, D. (2006). Vision as Bayesian inference: Analysis by synthesis? Trends in Cognitive Sciences, 10(7), 301-308.
[30] Jeong, S. K., & Xu, Y. (2016). Behaviorally relevant abstract object identity representation in the human parietal cortex. Journal of Neuroscience, 36(5), 1607-1619.
[31] Schaffelhofer, S., & Scherberger, H. (2016). Object vision to hand action in macaque parietal, premotor, and motor cortices. eLife, 5, e15278.
[32] Janssen, P., & Scherberger, H. (2015). Visual guidance in control of grasping. Annual Review of Neuroscience, 38, 69-86.
[33] Durand, J. B., Nelissen, K., Joly, O., Wardak, C., Todd, J. T., Norman, J. F., ... & Orban, G. A. (2007). Anterior regions of monkey parietal cortex process visual 3D shape. Neuron, 55(3), 493-505.
[34] Andersen, R. A., Snyder, L. H., Bradley, D. C., & Xing, J. (1997). Multimodal representation of space in the posterior parietal cortex and its use in planning movements. Annual Review of Neuroscience, 20(1), 303-330.