NIMBUS: A Hybrid Cloud-Crowd Realtime Architecture for Visual Learning in Interactive Domains

Nick DePalma and Cynthia Breazeal
Personal Robots Group, MIT Media Lab
20 Ames St., Cambridge, MA 02139

Abstract—Robotic architectures that incorporate cloud-based resources are just now gaining popularity [1]. However, researchers have conducted very few investigations into the capabilities of such architectures to support claims of their feasibility. We propose a novel method to exchange quality for speed of response. Further, we back this assertion with empirical findings from experiments performed with Amazon Mechanical Turk, and find that our method improves quality in exchange for response time in our cognitive architecture.

I. INTRODUCTION

Real, working robotic systems must incorporate systemic architecture designs. The simple heterogeneity present in autonomous electro-mechanical (and biological) systems means that interfaces between differing subsystems are inevitable consequences of assembly. But while cognitive architecture research has, by definition, been inspired by biology, neuroscience, cognitive science, or psychology, we must also realize that robots are connected to resources that humans and animals don't have access to. Wireless and network connectivity allows many robots' lives to be connected to internet resources that support the interaction.

Our software framework has been built over the past few decades around the idea of synthetic lifelike agents [2], and was adapted into a more embodied approach that takes into account sensors and motors and incorporates elements of interactive task learning [3] and principles of animation [4]. In the recent past, our architecture has forked into two separate initiatives: one that emphasizes the ability to take advantage of small, embedded, and ever-improving smartphones [5], [6], [7], and one that emphasizes the ability to operate across small- and large-scale computer systems [8], [9]. This paper focuses on a simple pipeline to unify these initiatives and support lightweight calls to resources not available to the robot alone. With this innovation, the robot may decide to connect with individuals on the internet based on the availability of resources and of participants who are not physically present with the agent as it interacts with the world.

Our server, Nimbus¹, operates as an always-available resource for robots in the field and for interaction partners. Nimbus serves to collect and coordinate information from an always-on and resilient data collection interface, to crowdsource on demand when absolutely needed by the robot, and to act as a computational resource for highly intensive processes. This paper documents underlying questions regarding speed of response and security that are highly relevant to interactive domains like human-robot interaction.

¹This resource is always online at http://nimbus.media.mit.edu
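To make the robot-side interface concrete, here is a minimal sketch of how a client might post a snapshot to Nimbus and block until a label arrives. The /label endpoint, JSON fields, and polling protocol are illustrative assumptions, not the documented Nimbus API.

```python
# Minimal sketch of a robot-side label request to Nimbus.
# NOTE: the /label endpoint, field names, and polling protocol are
# illustrative assumptions, not the documented Nimbus API.
import time
import requests

NIMBUS_URL = "http://nimbus.media.mit.edu"  # the always-on server

def request_label(image_path, timeout_s=600.0, poll_s=2.0):
    """Upload a snapshot and block until a label arrives or we time out."""
    with open(image_path, "rb") as f:
        resp = requests.post(f"{NIMBUS_URL}/label", files={"scene": f})
    resp.raise_for_status()
    job_id = resp.json()["job_id"]  # hypothetical job handle

    deadline = time.time() + timeout_s
    while time.time() < deadline:
        status = requests.get(f"{NIMBUS_URL}/label/{job_id}").json()
        if status.get("done"):
            return status["label"]  # e.g., a text label and bounding box
        time.sleep(poll_s)
    return None  # caller can fall back to local perception
```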

The trade-off between the quality of response from people on the Internet and the speed at which a response can be obtained by an autonomous system is still not well understood. Our research group has, over the past few years, built a resilient cloud infrastructure to support robots on the ground and in the field. One of the higher-level features of the system is to retrieve information from databases in the cloud and, when that information is unavailable, to obtain high-quality labels from crowds of people on systems such as Amazon Mechanical Turk or CrowdFlower. In this paper, we measure round-trip times from the robot to the crowd and back and discuss what one can expect from a system that leverages crowdsourcing as a core cognitive resource. We introduce a novel mechanism to improve the quality of the response at the expense of response speed. We conclude with remarks and thoughts about the future of technology built for the purpose of efficient and interactive labeling to support interaction.

Related Work

Few architectures have integrated real-time crowdsourcing and cloud-based resources into their cognitive architectures, and where they have, these systems have not been applied to socially intelligent robotics tasks that support dynamic interactions. Many surveys cover the broad spectrum of research that utilizes crowdsourcing in robotic learning tasks, ranging from perception learning [10] to action learning [11]. For a survey of projects that use crowdsourcing in machine learning and robotic tasks, see the comprehensive reviews by Kehoe et al. [12], [13]. The RoboEarth project vision remains the most mature vision of cloud resources integrating backend architectures to support perception, action, and mapping [14]. Rapyuta, its underlying framework, can be very powerful, but its emphasis on non-interactive tasks makes real-time latency issues secondary for that level of inquiry [15].


[Figure 1. Cognitive architecture for the visual learning problem. An interactive robot front-end runs a perception-action loop in which deictic gestures trigger changes in perception and deictic actions trigger behavioral saccades. The NIMBUS back-end (server farm) provides large-scale data and response management as well as crowd management. Round-trip loop 1 carries agent-cloud interactions (resource or label requests, cloud or crowd responses); round-trip loop 2 carries label requests to, and responses from, the crowd.]

Details on many of these architectures are sparse, with Rapyuta [15] and the work by Osentoski et al. on closed-loop control [16] with a single online participant providing the most concrete details. Other architectures treat crowdsourcing as a secondary or tertiary characteristic of the research process, performed after data collection to support annotation.

The research previously mentioned (except perhaps [15]) does not utilize the cloud as a resource that the robot can autonomously decide to use when other resources are unavailable. Rapyuta schedules available resources between the cloud and the real-time system using specialized scheduling algorithms, in which large tasks can be assigned to cloud-based hardware or to local embedded systems inside the robot, depending on the needs and demands of the entire architecture. Our system, instead, focuses on the labeling task as a source of information that can be made available on demand, either through a social interaction between the interaction partner and the robot or as an interaction it can request online.

In this work we use this crowdsourcing system to support interactive perceptual learning between the robot and its environment. To support this level of inquiry, we also review real-time perceptual learning systems. Raptor [17] is a recent (and closely related) successful dynamic training process that utilizes the power of ImageNet [18], a large database of object categories, to discover early image priors that it can use as in-situ training instances. ImageNet is used entirely for figure-ground patch discovery. These image patches can then be used to train fast whitened HOG-based object detectors dynamically throughout the interaction. This process takes under 30 seconds for a new image patch to reify into an object detector. Our perception-action loop extends this type of system and allows the robot to take deictic actions toward exchanging figure-ground hypothesis enumeration with a social partner to support object discovery. This paper serves to support this perception-action loop (documented more thoroughly in [19], [20], [21]) and focuses primarily on incorporating crowd-based labels into this cognitive architecture.

II. ON-THE-FLY VISUAL LABELING FROM LOCAL AND REMOTE SOURCES

Our system seeks to realize a simple abstraction: perception-action loops help define our visual labeling task interactively, and backend loops completed between the real-time embedded system and our servers support the needs of the interactive system (see Figure 1). We have chosen a visual labeling task because this is an area where early stimuli need to be reified into abstract representations that can be reused within the interaction. Interactive visual labeling, in which a system is able to take early visual stimuli and dynamically create stable object detectors (or, more abstractly, deictic pointers [22], [23]) in real time, is still a large open problem space that is rich with potential impact.

To demonstrate the power of this system, we have built a system that learns labels both through an interaction and through our server architecture, which manages an Amazon Mechanical Turk account. Figure 1 shows the cognitive architecture that surrounds our system. At the interaction level, the system utilizes gaze and deictic gesture to visually guide the robot towards relevant areas in the scene. It uses finite-horizon reasoning and simple low-level cues to extract select image patch foregrounds from the scene. This interactive guiding process allows the robot to be directed towards stimuli it wouldn't normally attend to if it operated in a purely discretized, object-oriented manner, and additionally to use those early stimuli as training instances for socially relevant deictic and lexical referencing. In the event that the user walks away, the robot still has the ability to saccade to new stimuli and collect labels from Amazon Mechanical Turk for a small fee.
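In condensed form, that decision logic looks roughly like the sketch below; the callables are hypothetical stand-ins for the architecture's internal modules, which the paper does not publish.

```python
# Condensed sketch of the labeling policy described above: prefer the
# co-present partner's labels, fall back to the crowd when nobody is
# around. All callables are hypothetical stand-ins for internals.

def label_next_stimulus(saccade, partner_present, ask_partner, crowd_label):
    patch = saccade()              # deictic shift to a salient image patch
    if partner_present():
        return ask_partner(patch)  # socially guided labeling
    return crowd_label(patch, reward_usd=0.25)  # Mechanical Turk fallback
```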

The following two sections introduce a natural tension in the crowdsourcing architecture: the trade-off between speed and quality. We introduce two methods to document the trade-off that we can expect from systems of this type. One method is what we call one-shot labeling, in which the system attempts to collect the label as fast as possible. The other method, called rollover labeling, attempts to reliably obtain a label in the presence of noisy, error-prone crowd responses by rolling over the result from one crowdsourcing participant to another in an attempt to correct and clarify previous labels.
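A rough sketch of the two strategies follows, assuming a blocking `post_task(scene, prior_answer)` helper that returns a single worker's answer; the hand-off format and the agreement-based stopping rule are illustrative assumptions, not details given in the paper.

```python
# Sketch of the two labeling strategies. `post_task` is an assumed
# helper that posts one crowd task (optionally showing the previous
# answer for correction) and blocks until a worker responds.

def one_shot_label(scene, post_task):
    """Fastest path: accept the first worker's answer as-is."""
    return post_task(scene, prior_answer=None)

def rollover_label(scene, post_task, max_rounds=3):
    """Higher-quality path: each worker corrects the previous label."""
    answer = None
    for _ in range(max_rounds):
        refined = post_task(scene, prior_answer=answer)
        if answer is not None and refined == answer:
            break  # consecutive workers agree; treat label as verified
        answer = refined
    return answer
```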

III. TIME-TO-RESPONSE MEASUREMENTS

What's the shortest latency one could expect if the robot needs the label as quickly as possible? And what do we lose if we want the highest quality response?

[Figure 2. Data collection procedure. After a decision to label, the robot snaps the scene and uploads it with the caption "Can you find the object in my hand?". A Mechanical Turk wrapper then returns a label through one of two labeling strategies: one-shot labeling, or labeling with rollover followed by a verify step.]

[Figure 3. Response times of the two backend data collection strategies, in seconds.]

In our case, our experience is that to use crowdsourcing as an on-demand resource, the robot must:

1) take a snapshot from its sensors
2) upload the data, wrapped in a visually legible and descriptive manner, to a server where it can be reliably served to a participant online
3) find a participant to label it for a price²
4) wait for the participant to accept the job and read the instructions
5) await the response

We used our architecture to snap a picture to label from a webcam on a simple robot (see Figure 2). The image and requisite interface details were uploaded to Nimbus, and a process began to synchronously wait for a response. Total round-trip times are reported in Figure 3. Our process used Amazon Mechanical Turk at a competitive rate of $0.25 for a 45-second labeling task. We ran the experiment 20 times for each strategy. A hedged sketch of this posting-and-polling loop follows the footnote below.

²This seems to be the slowest process.
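The following sketch reconstructs steps 3–5 with the modern boto3 MTurk client. The paper predates this client, so this is an illustrative reconstruction rather than the authors' code; the external task URL serving the labeling UI is a placeholder.

```python
# Hedged reconstruction of posting a one-shot labeling HIT and
# synchronously polling for the answer. The ExternalURL is a
# placeholder for the Nimbus-hosted labeling interface.
import time
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

QUESTION_XML = """\
<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>https://nimbus.media.mit.edu/task/REPLACE_ME</ExternalURL>
  <FrameHeight>600</FrameHeight>
</ExternalQuestion>"""

def collect_one_label(poll_s=5.0, timeout_s=1800.0):
    hit = mturk.create_hit(
        Title="Can you find the object in my hand?",
        Description="Draw a bounding box around the object a robot is holding.",
        Reward="0.25",                   # rate used in the experiment
        MaxAssignments=1,                # one-shot: first answer wins
        AssignmentDurationInSeconds=45,  # the 45-second labeling task
        LifetimeInSeconds=int(timeout_s),
        Question=QUESTION_XML,
    )
    hit_id = hit["HIT"]["HITId"]
    t0 = time.time()
    while time.time() - t0 < timeout_s:  # wait for a worker to respond
        r = mturk.list_assignments_for_hit(
            HITId=hit_id, AssignmentStatuses=["Submitted"])
        if r["Assignments"]:
            return r["Assignments"][0]["Answer"], time.time() - t0
        time.sleep(poll_s)
    return None, time.time() - t0  # crowd was too slow for this query
```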

Figure 3 shows that one-shot labeling can be a relatively quick process. Response times for one-shot labeling fell in the range [12.9 s, 520 s] with a median of 37.5 s, while the rollover process fell in the range [95 s, 1217.8 s] with a median of 263.58 seconds, or approximately 4.4 minutes. It is clear that one-shot labeling has the potential to be a very fast process, optimistically taking only about 13 seconds!

IV. QUALITY-SPEED TRADEOFF

On the other hand, what kind of data quality can you expect from a quick 30-second one-shot label acquisition? In previous experiments [8], we found that the quality of online data can vary widely. Using the results acquired from our 20-participant online experiment, we calculated the mean bounding box and used it as a baseline answer. We calculated the mean squared error (MSE) with respect to this baseline and found that rollover significantly reduced the overall error. Additionally, we saw that the responses from rollover showed reduced variance. This suggests that the participants' extra time spent on the task produced a higher-fidelity bounding box around the training instance and thus a higher-quality result. Table I shows that while the answers can begin with many errors, rollover helps correct them, and the mean squared error of the final answers is significantly lower than with one-shot labeling.

Table I. Quality-speed tradeoff between one-shot labeling and labeling with rollover

                      One-Shot Labeling    With Rollover (Final)
MSE from baseline     518.6 px             341.5 px
Standard deviation    22.73 px             18.4 px
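To make the metric concrete, the sketch below computes the mean-bounding-box baseline and the per-strategy MSE and standard deviation. Boxes are assumed to be [x, y, w, h] pixel vectors; the paper does not specify its exact parameterization.

```python
# Sketch of the quality metric: score a strategy's boxes by mean
# squared error against the mean bounding box over all responses.
# The [x, y, w, h] box parameterization is an assumption.
import numpy as np

def mean_box(boxes):
    """Baseline answer: the element-wise mean bounding box."""
    return np.asarray(boxes, dtype=float).mean(axis=0)

def mse_vs_baseline(boxes, baseline):
    """Return (mean, std) of per-response squared error vs. baseline."""
    errs = ((np.asarray(boxes, dtype=float) - baseline) ** 2).mean(axis=1)
    return errs.mean(), errs.std()

# Usage sketch:
# baseline = mean_box(one_shot_boxes + rollover_boxes)
# print(mse_vs_baseline(one_shot_boxes, baseline))  # cf. Table I, col. 1
# print(mse_vs_baseline(rollover_boxes, baseline))  # cf. Table I, col. 2
```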

V. DISCUSSION & FUTURE WORK

The conclusions of this experiment show that, to unify a cognitive architecture that incorporates interactive perception-action loops and interactions between the robot and crowds of people on the Internet, there is a clear need to reason about the speed at which you expect a response (if the robot is presently interacting with a scene) and about the quality of response you can expect. Even in the best circumstance, one-shot labeling can take at least 12.74 seconds and upwards of 4 minutes depending on crowd response latency. Timing and scheduling when, and to which interaction partners, the query must be made is an interesting dynamic challenge for this always-on hybrid architecture.

In performing this experiment, other strategies have made themselves salient. For example, we found that while some of the bounding boxes and labels provided were rife with error, on average, Amazon Mechanical Turk workers provided good results. Our assumption was that one-shot labeling would not, on average, provide good results and that rollover would be needed; instead, we found that given many answers, a researcher can statistically remove outliers based on the results of many parallel queries. This opens up an alternate strategy to strict rollover: parallel one-shot queries that may allow the agent to rely on the wisdom of many to offset the untrustworthiness of individual workers, and then to simultaneously model these workers and weight them appropriately. This parallel process may capitalize on one-shot labeling speed while still keeping high-quality results.
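One way that alternate strategy might look is sketched below: several one-shot queries run in parallel, outlier answers are rejected, and the survivors are averaged. The median-distance rejection rule is one illustrative choice, not a method specified in the paper.

```python
# Sketch of parallel one-shot labeling with statistical outlier
# removal. `post_task(scene)` is an assumed blocking helper that
# returns one worker's [x, y, w, h] box.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def parallel_one_shot(scene, post_task, n_workers=5, z_thresh=2.0):
    # Launch n one-shot queries concurrently; latency stays near the
    # slowest single worker rather than the sum of sequential rounds.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        boxes = list(pool.map(lambda _: post_task(scene), range(n_workers)))
    boxes = np.asarray(boxes, dtype=float)
    center = np.median(boxes, axis=0)              # robust consensus box
    dists = np.linalg.norm(boxes - center, axis=1)
    keep = dists <= dists.mean() + z_thresh * dists.std()  # drop outliers
    return boxes[keep].mean(axis=0)                # averaged final label
```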

As we build social, embodied agents, we stake out a position that emphasizes that many of the "usable representations" must have social utility. These robots will inevitably be left to their own devices at some point in their life cycle, and while we can rely on the idea that unsupervised learning can save us, this work emphasizes the idea that robots are never alone when they are connected to the Internet; it is a simple trade-off to begin interacting with participants online. Additionally, we believe that scaling our research from individual participants, to groups of people, to crowds of individuals allows us to begin to investigate social reasoning at a larger scale. This opens up the idea of relationships between particular robotic agents in the field and particular individuals online, as well as the possibility that an autonomous agent can begin to model the characteristics of individuals and reason about their reliability at a larger scale than ever thought possible with simple human-robot interaction alone.

ACKNOWLEDGEMENTS

This work was partially funded by the MIT Media Lab Consortia and by NSF Grant CCF-1138986. We also want to thank Hae Won Park for her many comments on earlier drafts of this paper, Jeff Orkin for his creativity, and Rob Miller for his leadership in pushing a crowdsourcing agenda that has been inspiring to everyone involved with the Cambridge Crowdfoo Initiative.

REFERENCES

[1] S. Thielman, "Man behind DARPA's robotics challenge: robots will soon learn from each other," The Guardian, 2015. Accessed January 2016.
[2] R. Burke, D. Isla, M. Downie, Y. Ivanov, and B. Blumberg, "Creature smarts: The art and architecture of a virtual brain," 2001.

[3] A. L. Thomaz, Socially Guided Machine Learning. PhD thesis, Massachusetts Institute of Technology, 2006.
[4] C. Breazeal, A. Brooks, J. Gray, M. Hancher, J. McBean, D. Stiehl, and J. Strickon, "Interactive robot theatre," Communications of the ACM, vol. 46, no. 7, pp. 76–85, 2003.
[5] K. Williams and C. Breazeal, "Reducing driver task load and promoting sociability through an affective intelligent driving agent (AIDA)," in Human-Computer Interaction–INTERACT 2013, pp. 619–626, Springer, 2013.
[6] K. J. Williams, J. C. Peters, and C. L. Breazeal, "Towards leveraging the driver's mobile device for an intelligent, sociable in-car robotic assistant," in Intelligent Vehicles Symposium (IV), 2013 IEEE, pp. 369–376, IEEE, 2013.
[7] G. Gordon, C. Breazeal, and S. Engel, "Can children catch curiosity from a social robot?," in Proceedings of the Tenth Annual ACM/IEEE International Conference on Human-Robot Interaction, pp. 91–98, ACM, 2015.
[8] N. DePalma, S. Chernova, and C. Breazeal, "Leveraging online virtual agents to crowdsource human-robot interaction," in Proceedings of the CHI Workshop on Crowdsourcing and Human Computation, 2011.
[9] C. Breazeal, N. DePalma, J. Orkin, S. Chernova, and M. Jung, "Crowdsourcing human-robot interaction: New methods and system evaluation in a public environment," Journal of Human-Robot Interaction, vol. 2, no. 1, pp. 82–111, 2013.
[10] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman, "LabelMe: a database and web-based tool for image annotation," International Journal of Computer Vision, vol. 77, no. 1-3, pp. 157–173, 2008.
[11] A. Sorokin, D. Berenson, S. S. Srinivasa, and M. Hebert, "People helping robots helping people: Crowdsourcing for grasping novel objects," in Intelligent Robots and Systems (IROS), 2010 IEEE/RSJ International Conference on, pp. 2117–2122, IEEE, 2010.
[12] K. Goldberg and B. Kehoe, "Cloud robotics and automation: A survey of related work," EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2013-5, 2013.
[13] B. Kehoe, S. Patil, P. Abbeel, and K. Goldberg, "A survey of research on cloud robotics and automation," Automation Science and Engineering, IEEE Transactions on, vol. 12, no. 2, pp. 398–409, 2015.
[14] "RoboEarth | A world wide web for robots." http://www.roboearth.org. Accessed January 2016.
[15] D. Hunziker, M. Gajamohan, M. Waibel, and R. D'Andrea, "Rapyuta: The RoboEarth cloud engine," in Robotics and Automation (ICRA), 2013 IEEE International Conference on, pp. 438–444, IEEE, 2013.
[16] S. Osentoski, C. Crick, G. Jay, and O. C. Jenkins, "Crowdsourcing for closed loop control," in Proc. of the NIPS Workshop on Computational Social Science and the Wisdom of Crowds, NIPS, 2010.
[17] D. Goehring, J. Hoffman, E. Rodner, K. Saenko, and T. Darrell, "Interactive adaptation of real-time object detectors," in Robotics and Automation (ICRA), 2014 IEEE International Conference on, pp. 1282–1289, IEEE, 2014.
[18] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255, IEEE, 2009.
[19] N. DePalma and C. Breazeal, "Object discovery vs. selection in social action: Benefits of a competitive attention system," International Conference on Intelligent Robots and Systems (IROS), 2015.
[20] N. DePalma and C. Breazeal, "Sensorimotor account of attention sharing in HRI: Survey and metric," 2nd Annual Symposium on Artificial Intelligence and Human-Robot Interaction, 2015.
[21] N. DePalma and C. Breazeal, "Towards bootstrapping socially significant representations through robotic interaction alone: the joint guided search task," in Fifth International Symposium on New Frontiers in Human-Robot Interaction at the Artificial Intelligence and Simulation of Behavior Convention (AISB'16), 2016.
[22] A. G. Brooks and C. Breazeal, "Working with robots and objects: Revisiting deictic reference for achieving spatial common ground," in Proceedings of the 1st ACM SIGCHI/SIGART Conference on Human-Robot Interaction, pp. 297–304, ACM, 2006.
[23] D. H. Ballard, M. M. Hayhoe, P. K. Pook, and R. P. Rao, "Deictic codes for the embodiment of cognition," Behavioral and Brain Sciences, vol. 20, no. 04, pp. 723–742, 1997.