
Voice Control of Fetch Robot Using Amazon Alexa

Purong Liu

Thesis submitted to the Faculty of the
Virginia Polytechnic Institute and State University
in partial fulfillment of the requirements for the degree of

Master of Science
in
Mechanical Engineering

Alexander Leonessa, Chair
Alan Asbeck
Kaveh Akbari Hamed

February 21, 2020
Blacksburg, Virginia

Keywords: Robotics, Voice Control, Alexa, Internet of Things

Copyright 2020, Purong Liu


Voice Control of Fetch Robot Using Amazon Alexa

Purong Liu

ABSTRACT

With the rapid development of computers and technology, virtual assistants (VAs) are becoming more and more common and intelligent. However, virtual assistants such as Apple's Siri, Amazon's Alexa, and Google Assistant do not currently have any physical functions. As an important part of the Internet of Things (IoT), the field of robotics has become a new trend in the usage of VAs. In this project, a mobile robot, Fetch, is connected with the Amazon Echo Dot through Amazon Web Services (AWS) and a local Robot Operating System (ROS) bridge server. We demonstrated that the robot could be controlled by voice commands through Amazon Alexa. Given certain commands, Fetch was able to move in a desired direction as well as track and follow a target object. The follow model was also learned through neural network training, which allows the target position to be predicted on future maps.


Voice Control of Fetch Robot Using Amazon Alexa

Purong Liu

GENERAL AUDIENCE ABSTRACT

Nowadays, virtual personalized assistants (VPAs) exist everywhere around us. For example, Siri or Android VPAs exist on every smartphone, and more and more people are getting household virtual assistants such as Amazon Alexa, Google Assistant, and Microsoft's Cortana. If virtual assistants can connect with objects that have physical functions, like an actual robot, they will be able to provide better services and more functions for humans. In this project, a mobile robot, Fetch, is connected with the Echo Dot from Amazon. This connection allows us to control the robot by voice command. A user can ask the robot to move in a given direction or to track and follow a certain object. In order to let the robot learn how to predict the position of the target when the target is lost, a map is built as an influencing factor. Since a hand-designed algorithm for target position prediction is difficult to implement, we opted to use a machine learning method instead. Therefore, a machine learning algorithm was tested on the following model.


Acknowledgments

First of all, I would like to express my sincere gratitude to my advisor, Prof. Alexander Leonessa, for continuously supporting my research project and Master's study with his immense knowledge and patience. His guidance helped me throughout the research and writing of this thesis. Beyond my advisor, my sincere thanks also go to my thesis committee members, Prof. Alan Asbeck and Prof. Kaveh Akbari Hamed, for their insightful comments, encouragement, and support. I would also like to thank my lab mate Alex Fuge for helping with the new camera case design, and Dr. Garret Burks for his help with editing this thesis.


Contents

List of Figures

List of Tables

1 Introduction
  1.1 Background and Motivation
  1.2 Objective
  1.3 Outline

2 Review of Literature
  2.1 Internet of Things
  2.2 Virtual Assistant
  2.3 IoT Robotics
  2.4 Mapping and Neural Network Training
  2.5 Proposed Work

3 Materials and Methods
  3.1 Fetch Robot
    3.1.1 Camera
    3.1.2 Software
    3.1.3 Track and follow
    3.1.4 Improve Follow Task Performance
    3.1.5 Neural Network Training
    3.1.6 Mapping
  3.2 Amazon Echo Dot
    3.2.1 Skill
    3.2.2 Cloud-based Service
  3.3 Marvin
  3.4 Communication
    3.4.1 Machines
    3.4.2 Devices, ROS and Alexa
  3.5 Summary

4 Results
  4.1 Camera
  4.2 Simulation
  4.3 Robot Test
  4.4 Train Model Evaluation And Mapping
  4.5 Summary

5 Discussion and Future Work
  5.1 Limits
  5.2 Future Work
    5.2.1 Software Upgrade
    5.2.2 Continued Navigation Development
    5.2.3 Other Hardware

6 Conclusions

Bibliography

Appendices

Appendix A Alexa
  A.1 Turtle Simulation

Appendix B Neural Network


List of Figures

1.1 AR tags provided in ROS ar_track_alvar package
1.2 Pokemon Go with AR technique [26]
3.1 Fetch Robot [4]
3.2 PrimeSense Carmine 1.09 and Intel SR 300
3.3 TF tree for camera and AR marker
3.4 Kinematic Model of Fetch's Base and Reference Frame
3.5 ar_track node connection
3.6 ar_track node to ar_follower node
3.7 Velocity Computation Algorithm Flowchart
3.8 Neural Network
3.9 Interaction Model Process
3.10 CloudWatch summary from AWS Lambda
3.11 Device Communication for Alexa Controlled Robot
3.12 Image message conversion
3.13 JSON message conversion
4.1 AR Tag Detection
4.2 Compared camera performance
4.3 Turtle trajectory for AR tag following
4.4 Output comparison
4.5 Example JSON input and output
4.6 Training and validation loss for neural network training process
4.7 Training model evaluation
4.8 Maps from two SLAM methods
4.9 Map built by Cartographer with trajectory
4.10 Trajectory for one loop and two loops
A.1 Turtle simulation for Alexa via AWS Lambda
A.2 Turtle simulation for Alexa control
A.3 Complete JSON input
A.4 Complete JSON output
A.5 Turtle simulation for AR tag following
A.6 Detailed device log
B.1 Training Algorithm [47]


List of Tables

3.1 Camera comparison
3.2 Camera Parameters Comparison
3.3 Neural Network Parameters
3.4 Cartographer 2D SLAM configuration
3.5 Interaction Model Configuration
4.1 Data loss ratio under different situations
4.2 Turtle Simulation Response with AWS Lambda
4.3 Turtle Simulation Response with BST proxy
4.4 Process time for Alexa intent requests


List of Abbreviations

ADA Americans with Disabilities Act
AMR Autonomous Mobile Robot
AR Augmented Reality
AWS Amazon Web Services
FoV Field of View
IDL Interface Definition Language
IMU Inertial Measurement Unit
IoT Internet of Things
NN Neural Network
RFID Radio-Frequency Identification
ROS Robot Operating System
SLAM Simultaneous Localization And Mapping
SSH Secure Shell
TF Transform Frame
URDF Unified Robot Description Format
VA Virtual Assistant
WSN Wireless Sensor Network


Chapter 1

Introduction

In this chapter, the background and the motivation of this project are introduced in Section 1.1, while the objective of the project is presented in Section 1.2.

1.1 Background and Motivation

As one of the most important inventions today, the internet pervades our lives. It is used daily for work and school, as well as in our social lives. Along with the widespread use of the internet, a new trend of internet applications, the Internet of Things, is also developing. The concept of the IoT was first brought up in 1991 by Mark Weiser [64]. The term "Internet of Things" was used for the first time in 1999 by Kevin Ashton [40]. Essentially, the IoT can be considered a network consisting of devices, machines, and objects that are all able to transfer data to each other without requiring human intervention [50].

With the development of smartphones and smart speakers, these applications are becoming more and more common. For example, the camera in front of your house can be monitored by your smartphone, and the bedroom light can be remotely turned on and off using a virtual household assistant. As one of the IoT applications, the field of IoT robotics attracts our attention.

According to research from the University of Michigan School of Information, one of the three main uses of current household VAs involves IoT voice control commands [64]. However, current IoT applications have numerous limitations related to their functionality, specifically regarding power and mobility. Current IoT controls through a VA are limited to stationary appliances such as lights, air conditioners, and switches. While these functions are convenient and make our lives easier, current IoT applications have minimal physical functions and no physical interactions with humans themselves. It would be much more beneficial for elderly individuals as well as people with disabilities if they could control a mobile robot with only their voice, and if they could enable a robot to assist when necessary. For example, an elderly person with a back injury may ask a mobile robot to follow and pick up a dropped item using only voice commands.

Considering the increasing number of older adults, geriatric and general medical service needs are also expected to increase significantly in the coming years. As a result, healthcare robots are beginning to draw more and more scientists' attention [49]. In [49], the authors summarized some of the current research on healthcare robots with applications for older adults and the problems that can arise for elderly individuals who desire to live independently. Three types of healthcare robots are used in the home: assistance robots, companion robots, and monitoring robots. The first type, assistance robots, can help elderly individuals deal with problems such as housework as well as difficulties with mobility and bathing. In addition, companion robots focus on the psychosocial needs of older adults. In [11], 41 publications involving four different robotic systems were reviewed: (1) NeCoRo, a cat-like robot; (2) Bandit, a socially interactive robot with a humanoid torso mounted on a mobile platform; (3) AIBO, a dog-like robot; and (4) Paro, a seal-like robot. NeCoRo, AIBO, and Paro are used to imitate the interaction between humans and animals. The authors of [11] concluded that the use of robots in elderly care seems to have potential to improve quality of life. Lastly, monitoring robots are used to monitor the health conditions of users.

With the same concerns regarding the aging population's ability to remain independent, the authors in [66] developed a daily life support robot, ApriAttenda. This robot can follow a specified person and avoid obstacles. As a result, it is designed not only for elderly care, but also for use in baby-sitting settings and as a shopping assistant in a shopping center. The control algorithm used to follow people is based on a proportional controller: when the person is too far away, the robot moves forward, and when the person is too close, the robot moves backward.

1.2 Objective

Similar to the purpose of ApriAttenda, this project focuses on robot assistants for at-home use. Our goals are to enable the robot to follow a person, to simplify robot control for users through voice control, and to build an intelligent navigation system for better person-following performance. As a first step toward person following, we enabled the robot to follow an AR tag, as shown in Figure 1.1. AR techniques have been used in phone games such as Pokemon Go to simulate the appearance of virtual objects in the real world. Figure 1.2 displays the game interface of Pokemon Go.

Figure 1.1: AR tags provided in ROS ar_track_alvar package [42]

Figure 1.2: Pokemon Go with AR technique [26]

In addition to implementing the following algorithm, IoT robotic control is enabled with an artificial-intelligence VA. The IoT control spares users from learning complicated robot controls: with a VA, users can control the robot via voice commands. To achieve the goals of this project, the Amazon Echo Dot and the Fetch robot were selected.

We chose the Amazon Echo Dot because Alexa is one of the most popular standalone virtual assistants. Amazon held a 61.1% share of the smart speaker market in March 2019 [33], almost three times that of Google, the second largest market share holder. Additionally, Amazon provides many convenient functions on Alexa, such as shopping, searching, and controlling smart electronics. Beyond the provided functions, Amazon also allows developers to customize their desired functions through the Amazon Developer Console. On the console platform, users are able to design Alexa skills.

The Fetch robot was selected because it is designed to be able to traverse ADA-compliant buildings [65]. Fetch is an AMR from Fetch Robotics Inc. and is equipped with a head-mounted RGB-D camera, a 7-degree-of-freedom manipulator, a 2D LiDAR laser scanner, and an adjustable torso. It also contains an internal computer in the base. The computer runs a Linux system, Ubuntu 14.04 LTS, and ROS Indigo. The robot can be remotely controlled as long as it is connected to the wireless network; for remote control, a base computer connected to the same network is needed.

ROS is a package-based middleware for controlling robots. On March 2, 2010, Willow Garage released the first distribution of ROS 1.0 [3]; Indigo is the eighth distribution of ROS 1.0. It provides plenty of useful packages and convenient tools for users. ROS supports four programming languages: C++, Python, Octave, and LISP [45]. In order to enable cross-language development, ROS describes messages in a language-independent IDL. These strictly typed messages are passed through ROS topics. By subscribing to a ROS topic, a ROS node (a computation process in ROS) can retrieve messages from that topic, and a node sends messages by publishing them to a given topic.
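As a minimal illustration of this publish/subscribe mechanism, the following sketch shows a rospy node that both publishes to and subscribes from the same topic; the node and topic names here are illustrative, not taken from this project.

#!/usr/bin/env python
# Minimal rospy publish/subscribe sketch (illustrative names).
import rospy
from std_msgs.msg import String

def callback(msg):
    # Runs whenever a message arrives on the subscribed topic.
    rospy.loginfo("Received: %s", msg.data)

rospy.init_node('example_node')
pub = rospy.Publisher('/chatter', String, queue_size=10)
rospy.Subscriber('/chatter', String, callback)

rate = rospy.Rate(1)  # publish at 1 Hz
while not rospy.is_shutdown():
    pub.publish(String(data='hello'))  # a strictly typed std_msgs/String
    rate.sleep()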

Because of the strictly typed data structure of ROS messages, we needed to convert the output messages from Alexa into ROS messages. One more challenge for this project was to enable the robot to detect and follow an AR tag automatically. To provide better following performance, we let the robot learn the hard-coded follow algorithm, which can be further improved to imitate human following behavior, using neural network learning. Furthermore, we would like to integrate an intelligent navigation system with the robot so that the robot assistant can provide more services. Therefore, a reliable map for precise localization is needed. The map information can also help the robot learn human navigation behavior. In summary, the objectives of the project were:

1. Enable the robot to detect and follow an AR tag.

2. Ensure that Alexa correctly interprets voice commands and sends the appropriate requests to the robot.

3. Confirm that the robot can understand the requests sent by Alexa and perform the tasks.

4. Train the robot to learn the current target-follow navigation behavior using a neural network.

5. Build a reliable map.

1.3 Outline

To achieve the objectives mentioned above, we first designed and validated the experiment. In Chapter 2, the related previously reported literature is reviewed. Next, the materials and methods are discussed in Chapter 3. Chapter 4 illustrates the results of the project, and in the second-to-last chapter, Chapter 5, we discuss the limitations and future work. The final chapter, Chapter 6, summarizes this thesis and gives a conclusion.


Chapter 2

Review of Literature

In this chapter, the history and development of the IoT are discussed. In particular, we emphasize applications of the IoT with virtual assistants and discuss the possible future applications and challenges of the IoT with VAs. Additionally, we focus on robotic applications of the IoT while also presenting previous works in the field of IoT robotics with VAs.

2.1 Internet of Things

The history of the "Internet of Things" began in the early 1990s, when the concept of the IoT was first brought up by Mark Weiser [64]. However, according to [40], the term "IoT" was likely used for the first time by Kevin Ashton in 1999. In Gubbi et al. [25], the authors listed three elements of the IoT. The first is hardware, which usually refers to actuators, sensors, and controllers. The second element is middleware, which consists of storage and data analytics tools for computing. The last element is presentation, which should be a lucid visualization tool that can be designed for various applications. The implementation of IoT middleware rests on five key technologies: (1) RFID, (2) WSN, (3) addressing, (4) data storage, and (5) analytics [25, 58].

(1) RFID systems consist of two parts, readers and tags. Basically, RFID allows for the automatic identification of objects, each of which is assigned a unique tag. One of the common applications of RFID technology can be found in bar-code readers in grocery stores. (2) A WSN is composed of a large number of sensing nodes. These nodes can be used to monitor environmental conditions, such as temperature, or implantable medical devices, such as implantable cardiac defibrillators (ICDs). (3) Addressing schemes are essential for the IoT since they allow devices to be identified by their unique addresses. The ability to identify things was developed on top of wireless technologies such as Wi-Fi and RFID. IPv4 can provide geographically unique identification; however, individual identification is expected to depend on the development of IPv6. (4, 5) Data storage issues are a result of the unprecedented amount of data created by novel emerging fields using the internet. To solve these storage issues, cloud-based storage arose. Along with the development of cloud data storage, the field of cloud-based data analysis is also growing. Based on the progress and development of computing services, cloud-based data analysis and storage are foreseen to be a new trend [25, 58].

With the three elements (hardware, middleware, and presentation), IoT applications can be classified into four groups based on their setting: (1) Personal and home, (2) Enterprise, (3) Utilities, and (4) Mobile [25]. (1) Personal and home applications focus on household and personal electronics. For example, the monitor at a front door can be connected to a VA with a screen. (2) Enterprise usage of the IoT concentrates on work-environment electronics and utility management; many of these applications overlap with personal and home applications and with utilities. A sample usage would be using a smartphone to start the coffeemaker in a conference room rather than at home. (3, 4) Utilities and mobile IoT applications can contribute to smart cities. The IoT in these two groups can be used to manage energy, water, transportation, and logistics; for example, a smartphone can track the location of buses in real time.


In a more recent article, Stankovic [58] highlighted eight current research topics in the IoT: "massive scaling, architecture and dependencies, creating knowledge and big data, robustness, openness, security, privacy, and human-in-the-loop" [58]. Massive scaling includes the storage of the massive amounts of created data and the insufficient supply of IPv4 addresses. Architecture and dependencies relate to how "things" in the IoT are connected and controlled. Creating knowledge and big data focuses on the use of data generated by the IoT; for example, can we utilize the raw data in a way that provides some useful knowledge? The topic of robustness considers how devices deal with deteriorating conditions. A common example of the deterioration problem, which is highlighted in more detail in [58], can be found in the clock synchronization problem. The topic of openness focuses on the accessibility of devices and their control systems. It is important to note that security and privacy problems arise with openness: if a device is easily accessed, it is also possible for others to control the device and steal information from it. Lastly, the human-in-the-loop topic concentrates on problems that arise when humans are involved in the control loop. For example, in an automobile a human dictates the vehicle speed by pushing more or less on a pedal. In addition to discussing the topics described above, Stankovic also detailed the potential challenges related to the human-in-the-loop topic. Four subcategories of human-in-the-loop applications are classified in [58]: systems directly controlled by a human; systems which monitor humans and take proper actions; systems that model humans' physiological parameters; and combinations of the previous three. Three main concerns are mentioned in [58]: (1) the necessity to understand all types of human-in-the-loop control, (2) the need to extend system identification or other techniques so that models of human behavior can be learned, and (3) the need to determine a method that brings human behavior models into the feedback control.

The focus of this thesis is on the voice control function of an AMR, which can be cataloged under the last research topic mentioned above, human-in-the-loop control. This research focuses on one case of supervisory control applications, where the system receives commands and takes action autonomously, then sends feedback and waits for the next command [58].

2.2 Virtual Assistant

Based on the article [25] from 2013, the estimated number of interconnected devices in 2012 was 9 billion, and that number was expected to increase to 24 billion by 2020. According to [12], the latest research from Strategy Analytics shows that the number reached 22 billion in 2018 and is expected to increase exponentially to 40 billion by 2025. Among interconnected devices, smart home devices are expected to be among the fastest growing [7]. The Smart Speaker Consumer Adoption report from March 2019 stated that by the end of 2018, 66.4 million U.S. adults had a smart speaker [33]. In addition, 85% of smart speaker users chose to use either an Amazon Echo or a Google device. The same report also presented the market share for these two device makers: by January 2019, Amazon had a 61.1% market share while Google had 23.9% of the smart speaker market.

As the market for smart speakers continues to expand, researchers have started to investigate how people use VAs. In [52], from June 2018, researchers gathered and analyzed 278,654 voice commands from Alexa users and concluded that 14.7% of voice commands were used for smart homes, i.e., to control appliances such as lights, TVs, or air-conditioning units. Similar research was performed in [10], published in April 2019. By analyzing 193,665 voice commands from Alexa and Google Home users, the researchers found that IoT control commands made up 16.7% of the total. Both articles show that the leading use of voice commands is music, at 25% in [52] and 28.5% in [10]. Predictably, progress in IoT technology propels further IoT development.

As IoT technology evolves, more applications can be incorporated into smart speakers. Kepuska and Bohouta [32] discuss some possible applications of VPAs in the areas of education assistance, medical assistance, robotics and vehicles, and disability systems. For example, robots could be used to deliver medicine to patients in a hospital setting, or individuals with a visual impairment could use smart speakers to shop online. To improve the services provided by VPAs, Veton Këpuska proposes adding more elements to current dialogue-based systems to generate multi-modal dialogue systems, such as the Gesture Model and the Graph Model [32]. The Gesture Model analyzes a user's motion and facial expression and responds based on the analysis. The Graph Model analyzes image and video data and returns appropriate results. User models collect user information, such as preferences, in advance; when the user needs a response from a VPA, the previously gathered information helps guide the final response [32].

As smart speakers become more intelligent and popular, their potential issues are gradually drawing more attention from users. In the article "Integration of Cloud Computing and Internet of Things: A Survey", the authors mention seven concerns related to Cloud IoT applications [12]: (1) privacy, (2) security, (3) large scale, (4) legal and social aspects, (5) reliability, (6) performance, and (7) heterogeneity. The concerns about privacy and security arise because of the possibility of attacks on cloud environments, which can lead to the leakage of user information. The large-scale challenges occur when numerous devices are involved in one scenario; insufficient data storage and device-monitoring security issues are then hard to overcome. Additionally, when cloud services are based on data provided by users from all over the world, various international laws must be complied with. Because the Cloud IoT is mission-oriented, when it receives a request it responds without judgment. For example, if you send a move-forward command to a vehicle even when there is a river directly ahead, the vehicle will carry out the task without hesitation. Therefore, the reliability of an IoT device is of critical importance.

The performance challenge primarily affects real-time applications, such as environmental monitoring. The last challenge, heterogeneity, is a threat to all kinds of applications; it arises when multiple devices with various systems are involved in one task. Finally, the integration of all of these subsystems is also a significant challenge.

The concern about user privacy and data is also mentioned in [44]. That research indicated that Amazon collected and stored some of its users' data [44]. The authors also highlighted a criminal investigation from 2015 in which the police department obtained recordings from an Amazon Echo as relevant evidence. In addition, reliability issues were investigated in the article "Emerging Threats in Internet of Things Voice Services" [36]. As stated in [36], 68.9% of 572,319 investigated audio samples were accurately interpreted. By analyzing the misunderstood samples, the authors found that 41.7% of the errors were due to phonetic confusion, 33.3% to homophones, and 8.3% to compound words, with the remaining 16.7% due to other factors.

2.3 IoT Robotics

As a prospective application of the IoT, indoor robotic control has the potential to create a smart home for elderly individuals or individuals with multiple disabilities. In [9], a smart home control framework was built based on ZigBee, which can establish a personal network. The structure successfully integrated and controlled devices such as a doorbell, fire alarm, light, and refrigerator door. In addition to providing assistance for individuals with disabilities, the IoT can also be used in industrial environments. For example, the voice control of an industrial robot was designed and tested in [43]. In this paper, the author used the Microsoft Speech Engine for speech recognition. The human commands were converted to text, and the converted text was then used to control two industrial robots performing two simple tasks: one robot completed a pick-and-place task while the other performed simple linear welding.

In addition to the Microsoft Speech Engine mentioned above, Amazon Alexa can also be used as a speech recognition engine for research. The Alexa voice assistant can be used for natural language processing when integrated as a voice interface for robotics platforms [28]. The authors demonstrate how Amazon Alexa can be connected with other devices so that VA users can control the devices by voice. One example of these features is demonstrated in Alexa's ability to connect to a Raspberry Pi running ROS via the MQTT network communication protocol. The remote server is composed of Mosquitto, a message broker, and the RedBot platform, as well as a chatbot platform based on Node-RED.

Another research use of Amazon Alexa can be found in speech-to-text conversion, as highlighted in [19]. To explore the collaborative environment between robots and humans, Craig Douglas and Robert Lodder integrated Amazon Alexa with a ROS-based Double telepresence robot for human identification and localization. ROS Gazebo was used for the simulation, and TensorFlow was used for the artificial intelligence that allows the robot to recognize humans in a crowded environment.

In [57], an Amazon Tap speaker was used to activate a lawnmower. Along with the Alexa voice services, the researchers applied a free web-based service, If This Then That (IFTTT), which built simple conditional statements for triggering the lawnmower and controlling its speed. In this case, IFTTT plays the role of the cloud service in the Alexa control scheme. In addition to using Alexa, the researchers also built a web-based GUI to control the speed of the lawnmower.


Another approach to using Alexa to control devices is through the adoption of two AWS services, AWS Lambda and AWS IoT, as demonstrated in [30]. AWS Lambda is a cloud computing service, and AWS IoT allows for bi-directional communication between IoT devices and an AWS cloud service, which is Lambda in this case. Moreover, a Raspberry Pi, a microcomputer, can be used to relay messages from the cloud service to the local network. The authors' intelligent robotic assistance system, equipped with several devices, can then be controlled over the local network, and the devices' states can be updated to the cloud service.

2.4 Mapping and Neural Network Training

In addition to being used for multi-device communication, IoT systems can also be used in the areas of mapping and localization. In the review paper [55], the authors mentioned the concept of the Location of Things (LoT) and its importance for the IoT infrastructure. The LoT acts like a search engine for data and device management, and the integration of the LoT and the IoT is proposed in [41]. In Nath et al., researchers built a voice-based location detection system using an Amazon Echo and an ultrasonic sensor and integrated the system with a smart home. Another voice interface, Google Assistant, was used for indoor navigation in [54]; instead of mobile devices, stationary devices were used for localization.

The work presented in [55] also summarized the methodology and applications of SLAM, which is often used for path planning and obstacle avoidance in robotics. SLAM has been an active research area and has gained even more attention since an autonomous car won the DARPA Grand Challenge in 2005 [60]. The 2D SLAM technique [38] was extended to 3D SLAM using a 3D laser range finder in [48] and to Visual SLAM in [31] by fusing 3D camera data. Multiple SLAM algorithms and ROS packages are currently available. In [23], the authors compared three 2D laser-based SLAM methods: GMapping, Hector SLAM, and Cartographer. Their results indicated that both Hector SLAM and Cartographer had small RMSE and absolute trajectory errors. Further, they indicated that Cartographer was more robust to environment changes than the other options.

Cartographer was released in 2016 and is supported in ROS [27]. It can provide real-time mapping solutions as well as offline mapping from recorded rosbags. To minimize drift from odometry, Cartographer applies a new algorithm that generates several submaps with constraints and landmarks as the local SLAM. Based on the collected submaps, the backend subsystem performs scan matching for loop closure and optimizes the final map and trajectory. Cartographer allows developers to use both 2D and 3D SLAM.

In addition to SLAM, machine learning is another popular tool in the field of robotics. There are different types of machine learning methods, including reinforcement learning, supervised learning, and unsupervised learning [35]. Machine learning is widely applied in many research areas, such as trajectory prediction [20], natural language processing [16], and closed-loop control [24]. In [46], a gradient-based online learning algorithm is used to improve controller performance. The research group in [47] used a neural network algorithm to train a controller model on a unicycle kinematic model before applying the trained model to a quadcopter. The results in [47] indicated that a model trained on a simple system can be applied to a more complicated system.

Machine learning can also be used within SLAM to improve mapping and localization performance. For example, a deep recurrent convolutional neural network is used to improve robot localization in [63], and the authors in [53] proposed a self-localization method supported by a support vector machine algorithm.


2.5 Proposed Work

As summarized in Section 2.1, the development of technologies such as RFID, WSN, and addressing promotes the expansion of the IoT, and IoT applications are increasingly entering our daily lives. Given their pace of evolution, IoT technologies can be applied to many fields in the future, including smart homes, smart cities, and smart offices. Section 2.2 indicates that smart speakers, embedded with a virtual assistant, are among the critical devices in IoT smart home applications and are becoming more prevalent. Together with the popularization of smart speakers, their usage is spreading rapidly, and the VA embedded in a smart speaker can also be used as a speech recognition engine.

Current IoT applications are limited by the devices' mobility and power. Therefore, researchers are interested in exploring new applications of the IoT through VAs, particularly applications related to IoT robotics. In Section 2.3, several methods of connecting VAs and robots were discussed. To control the robot successfully, tools other than a speech recognition engine are needed. Since each device-to-device message transfer affects the response time, we chose the simplest among the methods proposed in the literature. This project connects the VA, Amazon Alexa, with an AMR, Fetch, through AWS Lambda and the BST proxy tool, so that the robot can be directed in a certain direction and can follow a desired target.

Being able to successfully carry out target-following tasks is fundamental to a final intelligent navigation system, and it is also an important task for a robot companion. To assist the user in time, the robot needs to follow a target consistently. However, the hand-coded algorithm is sensitive to noise such as incorrect target detection. In order to improve the performance of the robot companion task, we propose a data-driven neural network algorithm for the robot to learn target-following control. The neural-network-computed model is expected to be more robust to changes in the environment. In addition to learning simple navigation behavior, robot localization is also important for further development of navigation. Therefore, we needed a reliable map for precise localization. The map information can then be used when the robot loses the target and needs to navigate itself back to a safe place.


Chapter 3

Materials and Methods

As mentioned in Chapter 2, there are three main elements of the IoT: hardware, middleware, and presentation. In our project, the hardware is the mobile robot, Fetch, which is made up of actuators, sensors, and embedded communication hardware. Our remote server, Marvin, is the middleware; Marvin performs as a computing tool for voice control and visualization. Last but not least, an Echo Dot embedded with Alexa is used as the presentation, an interpretation tool for various applications.

In this chapter, the materials and methods for the voice control of the robot are discussed. Section 3.1 introduces the configuration and task implementation method of Fetch. Section 3.2 presents how Alexa was used for voice control. Section 3.3 shows the setup of the remote server, Marvin. Section 3.4 focuses on communication between the devices.

3.1 Fetch Robot

Fetch is a commercial robot available from Fetch Robotics Inc., equipped with a mobile base, a manipulator, an adjustable torso, and a head camera [65]. Additionally, a laser scanner is included in the robot's base. The base design limits its mobility and, as a result, Fetch can only be used in an indoor environment. The robot is configured with an embedded computer, which runs Ubuntu 14.04 and ROS Indigo. ROS is a package-based middleware; it is not an actual operating system, even though "operating system" appears in the name. ROS is complex robot control software that uses packages to organize the software it contains. To control the motion of Fetch, we need to send ROS-accepted messages to the corresponding control node. Figure 3.1 displays a picture of a Fetch robot similar to the one used in the experimental work presented in this thesis.

Figure 3.1: Fetch Robot [4]

3.1.1 Camera

Fetch originally had a built-in camera (model PrimeSense Carmine 1.09) on its head. The camera is a 3D RGB depth sensor with a 0.35-3 m operating range and is best calibrated in the range of 0.35-1.4 m [65]. The PrimeSense also generates PointCloud image data, which can show the spatial position of objects relative to the camera [51]. Unfortunately, this original camera was broken and produced no images. The issue was suspected to be a software conflict caused by previous research efforts. Despite attempts to fix the issue by reinstalling the camera drivers and most of the related packages, no images could be obtained from the camera. We removed the camera from the robot in order to test for hardware issues; however, the camera problems still could not be diagnosed. This camera model has been off the market since 2013, when Apple Inc. bought PrimeSense and stopped making the Carmine 1.09. Therefore, a new camera model was needed to replace the Carmine 1.09.

(a) PrimeSense Carmine 1.09 [5]

(b) Intel SR 300 [6]

Figure 3.2: PrimeSense Carmine 1.09 and Intel SR 300

The replacement camera was chosen based on the following metrics: operating range, accuracy, provided data information, and size. We required a similar operating range because the robot needs to see a target within a short range and with high accuracy. Further, the size of the camera determined whether we were able to place it at the original position on the head of the robot. In Carfagni et al. [13], the authors compared the performance of three cameras: the Kinect v2, the Carmine 1.09, and the SR300. Their results indicated that the Carmine 1.09 and the SR300 had similar probing errors P_F and flatness errors F. Further, the SR300 had a smaller sphere-spacing error SS than the Carmine 1.09. Table 3.1 summarizes their results.

Device    Carmine 1.09   SR 300   Kinect v2
P_F [mm]  9.32           8.3      20.13
F [mm]    6.71           6.88     12.58
SS [mm]   26.08          6.05     19.7

Table 3.1: Camera comparison [13]

The operating range of the SR300 is 0.2-1.5 m [17], and the operating range of the Kinect v2 is 0.5-4.5 m. The SR300 measures 110 × 12.6 × 4.4 mm, while the Carmine 1.09 is 180 × 25 × 35 mm [1]; the Kinect v2's size is 360.68 × 139.7 × 165.1 mm. The comparison is displayed in Table 3.2.

Model                   Carmine 1.09    SR300             Kinect v2
Operation Range (m)     0.35-1.4        0.2-1.5           0.5-4.5
Size (mm)               180 × 25 × 35   110 × 12.6 × 4.4  360.68 × 139.7 × 165.1
Image Data: Color       ✓               ✓                 ✓
Image Data: Depth       ✓               ✓                 ✓
Image Data: PointCloud  ✓               ✓                 ✓

Table 3.2: Camera Parameters Comparison

After evaluating the options, it was determined that the Intel RealSense SR300 depth camera satisfied all of the required conditions. As a result, the Intel RealSense SR300, with a new 3D-printed camera case, was substituted for the Carmine 1.09. Images of the two cameras can be seen in Figure 3.2.


3.1.2 Software

ROS provides numerous convenient tools and packages for work related to visualization, simulation, and control. In particular, Rviz is one of the most common tools for visualization. Rviz allows users to view multiple messages in the same window, including camera images, laser scan data, and the robot model. The robot model can be built based on two packages, TF and URDF. URDF describes the unchanging parameters of the robot, such as the radius of the base and where the camera is located relative to the head tilt/pan link, while TF keeps track of the relative position of the robot from one frame to another.
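As a short illustration of TF's role, the sketch below queries the transform between two frames using the tf listener API; 'base_link' is assumed here as the robot's base frame, while 'head_camera_link' is the camera frame named later in this section.

#!/usr/bin/env python
# Sketch: querying a relative pose from TF (frame names are assumptions).
import rospy
import tf

rospy.init_node('tf_example')
listener = tf.TransformListener()
listener.waitForTransform('base_link', 'head_camera_link',
                          rospy.Time(0), rospy.Duration(4.0))
# Translation [m] and rotation (quaternion) of the camera frame
# relative to the base frame.
trans, rot = listener.lookupTransform('base_link', 'head_camera_link',
                                      rospy.Time(0))
rospy.loginfo("camera at %s, orientation %s", trans, rot)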

Along with the packages mentioned above, ar_track_alvar is an important package in this project. With this package, the robot can detect the designed AR tag and receive the tag's position and orientation relative to a chosen frame. The detection of the AR tag has a small pose estimation error of about 4 cm [29]. By changing the subscription frame to the head camera RGB frame, 'head_camera_link', the corresponding relative position messages will be published to the ROS topic /ar_pose_marker. Figure 3.3 highlights the TF tree for a detected AR tag.

Figure 3.3: TF tree for camera and AR marker

Since Fetch is a mobile robot able to interact with the surrounding environment, it also has the potential to hurt people around it or harm itself. Therefore, every step of the experiment was simulated in advance using the turtlesim package. Turtlesim is usually used in ROS tutorials. We chose this package to test the performance of a given task because its movement control topic, /turtle1/cmd_vel, uses the same type of message as the robot's topic, /cmd_vel. Both are geometry_msgs/Twist messages, which contain three linear and three angular components. The topic name /cmd_vel stands for command velocity; the message represents the commanded linear and angular velocity. The information in /cmd_vel can then be written as a matrix, T, in Equation 3.1,

T = \begin{bmatrix} v_x & v_y & v_z \\ \omega_x & \omega_y & \omega_z \end{bmatrix} \qquad (3.1)

where the first row contains the linear components, v_x, v_y, and v_z, and the second row contains the angular components, ω_x, ω_y, and ω_z.
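As a concrete illustration, the short sketch below publishes such a T matrix as a geometry_msgs/Twist message to the turtlesim topic named above; the velocity values are arbitrary examples, not the project's settings.

#!/usr/bin/env python
# Sketch: publishing Equation 3.1's T as a geometry_msgs/Twist
# (illustrative values, not the project's tuned settings).
import rospy
from geometry_msgs.msg import Twist

rospy.init_node('twist_example')
pub = rospy.Publisher('/turtle1/cmd_vel', Twist, queue_size=1)

cmd = Twist()
cmd.linear.x = 0.2    # v_x: forward velocity [m/s]
cmd.angular.z = 0.5   # omega_z: yaw rate [rad/s]
# v_y, v_z, omega_x, and omega_y remain zero for a differential-drive base.

rate = rospy.Rate(10)  # publish at 10 Hz
while not rospy.is_shutdown():
    pub.publish(cmd)
    rate.sleep()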

In order to interact with Alexa, all of the ROS messages need to be converted to JSON messages for Alexa. Accordingly, two more packages were needed: roslibjs and rosbridge_server. More detail on the use of these packages is provided in Section 3.4.
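For orientation, rosbridge wraps each ROS message in a JSON envelope. The snippet below is an illustrative example of the JSON a rosbridge client such as roslibjs would send to publish a Twist, written here as a Python dict; the exact messages used in this project appear in Section 3.4.

import json

# Illustrative rosbridge-protocol envelope for publishing a Twist message.
publish_op = {
    "op": "publish",            # rosbridge operation type
    "topic": "/cmd_vel",
    "msg": {
        "linear":  {"x": 0.2, "y": 0.0, "z": 0.0},
        "angular": {"x": 0.0, "y": 0.0, "z": 0.5},
    },
}
print(json.dumps(publish_op))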

3.1.3 Track and follow

To implement the following function, a follower node, /ar_follower (/ar_follower_turtle for turtlesim), was created to perform the computations. The marker position information was published to the ROS topic /ar_pose_marker as an ar_track_alvar/AlvarMarkers message, which is a customized message type defined in the package. The information needed, such as the position data, could then be retrieved from AlvarMarkers.markers.pose.pose.position. The position information comprises three components, x, y, and z, where x represents the distance from the tag to the camera, y is the horizontal displacement from the center of the FoV, and z is the vertical displacement from the center.
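A minimal subscriber illustrating this retrieval is sketched below. It follows the ar_track_alvar/AlvarMarkers naming used in the text (some ROS releases package the message as ar_track_alvar_msgs instead), and the node name is illustrative.

#!/usr/bin/env python
# Sketch: reading the tag position from /ar_pose_marker.
import rospy
from ar_track_alvar.msg import AlvarMarkers  # ar_track_alvar_msgs on some releases

def marker_callback(msg):
    if not msg.markers:
        return  # no tag detected in this image
    p = msg.markers[0].pose.pose.position
    # p.x: distance to the camera; p.y: horizontal offset in the FoV;
    # p.z: vertical offset (unused for base control).
    rospy.loginfo("tag at x=%.2f, y=%.2f", p.x, p.y)

rospy.init_node('ar_listener')
rospy.Subscriber('/ar_pose_marker', AlvarMarkers, marker_callback)
rospy.spin()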

The position of the AR tag is expressed in the camera frame, which can be considered the robot's frame in this project because the robot's head is stationary with respect to the robot's base. When the robot moves, the robot's frame moves accordingly.

Figure 3.4: Kinematic Model of Fetch's Base and Reference Frame

The kinematic model of the robot is a unicycle model, which can be expressed using the differential equation in Equation 3.2 [59]:

\begin{bmatrix} \dot{x} \\ \dot{y} \\ \dot{\Phi} \end{bmatrix} = \begin{bmatrix} \cos(\Phi) & 0 \\ \sin(\Phi) & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} v \\ \omega \end{bmatrix} \qquad (3.2)

As shown in Figure 3.4 and Equation 3.2, the robot cannot move linearly in the y or z direction or rotate about the x or y axis, so the /cmd_vel message matrix T can be simplified to Equation 3.3:

T = \begin{bmatrix} v_x & 0 & 0 \\ 0 & 0 & \omega_z \end{bmatrix} \qquad (3.3)

Since the z position of the tag does not affect the movement of Fetch, we only need pose.x and pose.y from the position of the AR tag. Figure 3.4 shows the position relationship between the robot and the AR tag.

The mechanism of tracking and following the AR tag in ROS is illustrated in Figure 3.5. Additionally, a combined and simplified version of the two nodes' relationship is displayed in Figure 3.6.

As the figure shows, the image received from the camera is processed in the ROS node /ar_track_alvar, where an ar_track_alvar/AlvarMarkers message is generated. The message is then published to the ROS topic /ar_pose_marker. The node /ar_follower (/ar_follower_turtle for turtlesim) subscribes to /ar_pose_marker and then computes the geometry_msgs/Twist message for the base movement. In the same way, that message is published to the base control topic /cmd_vel (/turtle1/cmd_vel for turtlesim).

When the robot receives a marker position, the computation begins. Although there are three position components, x, y, and z, for this computation we only required the x and y


Figure 3.5: ar_track node connection (squares denote topics and ellipses denote nodes; dark blue = publish, green = subscribe, light blue = publish and subscribe)

Figure 3.6: ar_track node to ar_follower node

positions. We can define a matrix, P, as a simplified input matrix based on the position components, and a matrix O as the expected output matrix, which should be the message sent to /cmd_vel. Therefore, O uses the same format as T.

$$P = \begin{bmatrix} pose.x & 0 \\ 0 & pose.y \end{bmatrix}; \quad O = \begin{bmatrix} o_{11} & 0 & 0 \\ 0 & 0 & o_{23} \end{bmatrix} \qquad (3.4)$$

Due to safety concerns, the output velocity was limited. The minimum speed was set so that the robot was able to move smoothly.

$$Max = \begin{bmatrix} \max\{v_x\} \\ \max\{\omega_z\} \end{bmatrix} = \begin{bmatrix} max_1 \\ max_2 \end{bmatrix}; \quad Min = \begin{bmatrix} \min\{v_x\} \\ \min\{\omega_z\} \end{bmatrix} = \begin{bmatrix} min_1 \\ min_2 \end{bmatrix} \qquad (3.5)$$

A weighting matrix W is defined in Equation 3.6, where scale.x and scale.y are the weighting values for pose.x and pose.y. The goal position components of the AR tag, goal.x and goal.y, form the goal position matrix G.

$$W = \begin{bmatrix} scale.x & 0 & 0 \\ 0 & 0 & scale.y \end{bmatrix}; \quad G = \begin{bmatrix} goal.x & 0 \\ 0 & goal.y \end{bmatrix} \qquad (3.6)$$

To let the robot follow the AR tag, the first step was to check whether the target is within the threshold using Equation 3.7, where H is the offset from the goal position. If the target falls outside of our threshold, we continue to calculate the desired velocity D with Equation 3.8, with goal.y = 0.

$$\lVert H \rVert = \lVert P - G \rVert = \begin{Vmatrix} h_1 & 0 \\ 0 & h_2 \end{Vmatrix} \qquad (3.7)$$

$$D = H * W = (P - G) * W = \begin{bmatrix} (x - goal.x) * scale.x & 0 & 0 \\ 0 & 0 & y * scale.y \end{bmatrix} = \begin{bmatrix} d_1 & 0 & 0 \\ 0 & 0 & d_2 \end{bmatrix} \qquad (3.8)$$

After computing the desired velocity, we then needed to compare it with the limits in Equation 3.5. A flowchart of the complete algorithm is displayed in Figure 3.7, where i ∈ {1, 2}.

Figure 3.7: Velocity Computation Algorithm Flowchart

After the computation, the final geometry_msgs/Twist message will be output in the matrix O, as shown in Equation 3.9,

$$O = \begin{bmatrix} o_1 & 0 & 0 \\ 0 & 0 & o_2 \end{bmatrix} = \begin{bmatrix} linear.x & 0 & 0 \\ 0 & 0 & angular.z \end{bmatrix} \qquad (3.9)$$

The final output will then be published to the topic /cmd_vel as a geometry_msgs/Twist

message.
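A compact sketch of the computation in Figure 3.7 is given below. The goal position, scales, threshold, and speed limits are placeholders, since their actual values are not specified here; only the structure of the algorithm follows the text.

```python
# A sketch of the velocity computation of Figure 3.7 (Equations 3.7-3.9).
# All numeric constants are illustrative placeholders.
def compute_cmd(pose_x, pose_y, goal_x=0.6, threshold=0.05,
                scale_x=1.0, scale_y=1.0,
                limits=((0.05, 0.5), (0.05, 1.0))):  # (min_i, max_i) for i in {1, 2}
    h = (pose_x - goal_x, pose_y)          # offset H from the goal (goal.y = 0)
    d = (h[0] * scale_x, h[1] * scale_y)   # desired velocity D = H * W
    out = []
    for d_i, (min_i, max_i) in zip(d, limits):
        if abs(d_i) <= threshold:
            out.append(0.0)                # within the threshold: stay still
        else:
            mag = min(max(abs(d_i), min_i), max_i)  # clamp to [min_i, max_i]
            out.append(mag if d_i > 0 else -mag)
    return out  # [o_1, o_2] = [linear.x, angular.z] of Equation 3.9
```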

3.1.4 Improve Follow Task Performance

The target follow algorithm in the above section operates as a proportional controller. However, this following model is very sensitive to noise, and it is hard to filter all the noise with a hard-coded algorithm. Besides proportional control, there are other control methods that can be used for object following and navigation. The authors in [37] proposed a saturation feedback controller to simultaneously solve the trajectory tracking and regulation problems for path planning. The controller enables a unicycle-modeled mobile


robot to follow a designed line or circle. Another control strategy for mobile robot object following is based on two discrete PID controllers, one at each wheel [34]. The controllers in [34] consider the situation in which the robot needs to follow a target in a dynamic environment.

In addition to using controllers to smooth the trajectory of the robot and improve the robustness of target following, researchers also aim at intelligent robot navigation. The authors of [61] introduce a cognitive map based on the spatial cognition of objects. The proposed framework decomposes the cognitive map into two parts: one is feature extraction of the object and environment; the other is understanding and reasoning about the environment. This cognitive map allows the robot to approach human spatial cognition. In [21], researchers demonstrate another intelligent navigation system, human-awareness navigation based on the Social Force Model. This navigation system takes into consideration how the robot interacts with humans or obstacles. The goal of [21] is to use this navigation algorithm for a robot companion task. In addition, a socially aware navigation method is proposed in [14]. Unlike the hand-coded heuristic navigation algorithm presented in [21], the authors of [14] applied deep reinforcement learning to let the robot learn how to walk in a pedestrian-rich environment while avoiding collisions.

Machine learning methods are used not only in navigation but also in robotic control. The author of [22] presented a neural network computed-torque controller for a nonholonomic mobile robot. This controller can be applied to trajectory tracking, path following, and posture stabilization. With the neural network controller, a priori dynamic parameters of the robot are no longer needed. Furthermore, the control model improves the performance of the robot drastically. Similarly, in [56], the authors adopt reinforcement learning for a mobile robot to perform corridor following and obstacle avoidance. The robot learns the model from example answers to the task, which are given by computation or controlled by a


human directly. As a result, their robot manages to learn good control laws faster than with the hand-engineered programming process.

Based on the previous work conducted for robot companion tasks, we believe that a neural network computed algorithm has the potential to improve the current following task performance and can be evolved to imitate human following behavior for navigation. Therefore, we proposed a data-driven neural network algorithm and used the experimental data from the hand-coded follow task to train the following algorithm.

3.1.5 Neural Network Training

As described in Section 3.1.3, the base movement control depends only on the position of the marker. If we want to add dependencies such as the location or trajectory to control the movement of the robot, numerous code modifications would be needed. Most importantly, the debugging process would be difficult and time-consuming. Therefore, we would like to let the robot learn a control model itself using neural network training.

The AR tag follow model is used to test the neural network algorithm. We adapted the algorithm in [47] and further simplified it to Algorithm 1. The original algorithm is shown in Appendix B.1.

Algorithm 1 Neural Network Training Process
  Randomly generate validation set X_valid
  for each epoch i do
      for each batch j do
          Generate training set X_batch
          Train neural network using X_batch
      end for
      Compute loss (RMSE)
  end for

To train the best-fit model, we applied a neural network using Keras [15] and TensorFlow [8].


The inputs were the relative positions of the marker, x and y, while the outputs were the command velocities, linear.x and angular.z. The neural network was built with three hidden layers, each containing 256 hidden units. We chose to use Leaky ReLU [39] as the activation function with α = 0.1. The training and test loss were computed using a Root Mean Square Error (RMSE) function, defined as

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n}}, \qquad (3.10)$$

where n is the total number of samples, $y_i$ is the actual value, and $\hat{y}_i$ is the predicted value. Figure 3.8 shows the neural network architecture for the follow task. In addition, the parameters are summarized in Table 3.3.

Parameter   Description                 Value
n_l         Number of hidden layers     3
n_n         Number of hidden units      256
f_a         Activation function         LeakyReLU
α           Parameter for activation    0.1

Table 3.3: Neural Network Parameters

Figure 3.8: Neural Network (input layer: Pose.X, Pose.Y; hidden layers: 3 × 256; output layer: cmd_vel.Linear.X, cmd_vel.Angular.Z)
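A sketch of this network in Keras is shown below. The architecture follows Table 3.3, while the optimizer, batch size, and MSE loss (whose square root gives the reported RMSE of Equation 3.10) are assumptions, and random arrays stand in for the recorded follow-task data.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, LeakyReLU

# Architecture from Table 3.3: 2 inputs (pose.x, pose.y), three hidden
# layers of 256 units with LeakyReLU (alpha = 0.1), and 2 outputs
# (cmd_vel linear.x, angular.z).
model = Sequential()
model.add(Dense(256, input_dim=2))
model.add(LeakyReLU(alpha=0.1))
model.add(Dense(256))
model.add(LeakyReLU(alpha=0.1))
model.add(Dense(256))
model.add(LeakyReLU(alpha=0.1))
model.add(Dense(2))

# Minimizing MSE is equivalent to minimizing RMSE; the square root is
# taken when the loss is reported. The optimizer choice is an assumption.
model.compile(optimizer='adam', loss='mse')

# Dummy data standing in for the recorded (marker position, command) pairs.
X = np.random.rand(1000, 2)
y = np.random.rand(1000, 2)
model.fit(X, y, epochs=100, batch_size=32, validation_split=0.2)
```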


3.1.6 Mapping

Since we would like to apply the map information to predict a target position, a map that can be trusted by the robot is necessary. Fetch Robotics provides a navigation package based on Karto SLAM for building a map. Karto SLAM takes information from LiDAR and uses it to build maps in real time using particle filter localization, which is also known as Monte Carlo Localization (MCL). In [62], Karto SLAM 1.1 has an RMSE of 0.3207 m and a maximum error of 1.21 m. Although the RMSE is small, the maximum error is significantly larger, which indicates that Karto SLAM might be unreliable to use. Therefore, another SLAM package, Cartographer, was tested and compared with Karto SLAM. According to [23], Cartographer had the best performance among the three tested SLAM methods: GMapping, Hector SLAM, and Cartographer. Cartographer can build maps with only laser scan information. To improve the quality of the final map, sensor fusion with IMU data, camera point cloud data, and GPS information can be added at the discretion of the user. To visualize the map generation process in RVIZ, we chose to replay the rosbag data for SLAM. The configuration of Cartographer in this project is listed in Table 3.4. Although point cloud data is available from Fetch, it was not used for map building because point clouds contain an enormous amount of data, and recording them causes buffer-overrun errors. Besides, Karto SLAM cannot do 3D mapping, so only the 2D maps are compared; therefore, the number of point cloud sources is set to 0.

Option                  Value   Related Topic
Provide Odom            True    /odom
Use IMU                 True    /imu1/imu
Use GPS                 False   -
Number of LiDAR         1       /base_scan
Number of Point Cloud   0       -

Table 3.4: Cartographer 2D SLAM configuration

Page 44: Voice Control of Fetch Robot Using Amazon Alexa · Siri, Amazon’s Alexa, and Google Assistant, do not currently have any physical functions. As an important part of the internet

3.2. Amazon Echo Dot 33

3.2 Amazon Echo Dot

The Amazon Echo Dot embedded with Alexa is one of the most popular smart speakers today. Its small size (32 mm × 84 mm × 84 mm) and light weight (163 g) allow us to place the device on the robot or carry it easily. In addition to regular speaker usage, the Echo Dot can also communicate with users and understand the user's voice commands. The Echo Dot has a 7-microphone array tucked underneath the light ring, which enables it to pick up voices from every direction of the room. Typically, the hands-free control is used for searching or controlling Alexa-compatible smart appliances. Using these features, voice commands can be sent to the Alexa Voice Service and be interpreted.

3.2.1 Skill

Amazon provides the Alexa Voice Service to third-party developers so that they can make their devices Alexa-enabled. Additionally, the Alexa Skills Kit is available for developers to customize Alexa skills so that Alexa can complete more complicated tasks.

A custom skill requires the following components: (1) an invocation name for Alexa to identify the skill; (2) intents that represent operations; (3) sample utterances, which are sample commands users might use; and (4) a cloud service to handle intents as structured requests and send back the appropriate response. Besides the mandatory components, a useful optional component, the slot, was also applied in this project. A slot is always assigned a slot type, which clarifies the type of word it accepts.

All of the components above can be configured in the Alexa Developer Console. Since our project includes several extra modules, the cloud service was configured as an external service. The Alexa Developer Console was able to link to the cloud-based service by configuring the endpoint. The Alexa custom skill can then be defined in two parts: part one is the interaction model, which includes the invocation name, intents, utterances, and slots; the second part is the cloud service that handles the requests. Figure 3.9 further illustrates how the interaction model works.

Figure 3.9: Interaction Model Process

The following description highlights the interaction model used in this project (Table 3.5). We chose the invocation name "turtle one" as the simulation skill for this project. The skill contains two intents, MoveBase and FollowIntent. Further, the MoveBase intent has a slot, dir (short for direction), which is defined with the slot type ListOfDirection. This slot type defines four possible direction words: forward, backward, left, and right.

As a result, to use this skill users can say: "Alexa, tell turtle one to go forward."
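In the Alexa Developer Console, such an interaction model is stored as JSON. The snippet below is a hedged sketch of what the model could look like; the type name ListOfTargets for the tag slot and the exact sample utterances are illustrative assumptions, not the project's actual configuration.

```json
{
  "interactionModel": {
    "languageModel": {
      "invocationName": "turtle one",
      "intents": [
        {
          "name": "MoveBase",
          "slots": [{ "name": "dir", "type": "ListOfDirection" }],
          "samples": ["move {dir}", "turn {dir}", "go {dir}"]
        },
        {
          "name": "FollowIntent",
          "slots": [{ "name": "tag", "type": "ListOfTargets" }],
          "samples": ["follow {tag}", "find {tag}"]
        }
      ],
      "types": [
        {
          "name": "ListOfDirection",
          "values": [
            { "name": { "value": "forward" } },
            { "name": { "value": "backward" } },
            { "name": { "value": "left" } },
            { "name": { "value": "right" } }
          ]
        }
      ]
    }
  }
}
```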

The main purpose of this interaction model is to help Alexa map the voice commands from

the user to the correct intent so that the cloud computing process can handle the requests


Skill: TurtleSim        Invocation: Turtle One

Intent   Sample Utterance    Slot   Slot Value
Move     Move {dir}          dir    forward
         Turn {dir}                 backward
         Go {dir}                   left
         Move {dir}                 right
Follow   Follow {tag}        tag    marker
         Find {tag}                 tag
         {quit} following    quit   quit
         {quit} find                stop

Table 3.5: Interaction Model Configuration

sent from Alexa correctly. When the above example command is received by Alexa, two keywords will be processed: "turtle one" and "forward". When Alexa recognizes "turtle one", it will send a skill launch request. The word "forward" will map to the slot type ListOfDirection and then to the MoveBase intent. A MoveBase intent request, along with the slot's value, can then be sent to the cloud service. All of the requests are sent as JSON messages. A sample JSON input is shown in Figure 4.5.

In order to map the keywords to the expected intent, a slot value needs to be unique across intents. A slot value cannot appear in two intents' sample utterances; this would cause confusion for Alexa, which does not have the ability to determine which intent request should be sent. Other than the Alexa skill configuration, the Alexa Developer Console also provides an Alexa simulation page for users to test the skill without an actual Alexa-embedded device or the Alexa app. The simulation page can take both text and voice input. To build a connection between the Alexa skill and the computing services, the endpoint needs to be configured to the internet-accessible service.


3.2.2 Cloud-based Service

For the cloud-based service, we used another one of Amazon's free cloud computing services, the AWS Lambda function. AWS Lambda allows users to upload and run their code without provisioning or managing servers. Additionally, it supports multiple programming languages, including Node.js, Python, and Java. Thus, developers can edit their code using the code editor window in AWS Lambda.

The connection between AWS Lambda and the Alexa skill is bidirectional. The Lambda function connects to the Alexa skill by adding an Alexa Skills Kit trigger. The trigger is linked to one and only one Alexa skill ID. The Alexa Skills Kit trigger also protects the Lambda function so that it cannot be used by others; the function can only be invoked by the specific Alexa skill. To connect the skill to the AWS Lambda function, we need to configure the endpoint of the skill to the Lambda function ARN (Amazon Resource Name).

AWS Lambda can also be used to test and debug the script using various test events. To test the launch request handler, a launch test event first needs to be configured; similarly, we need test events for the intent request handler and the stop request handler. A test event is a manual JSON input. For instance, an intent request test is shown in Figure 3.10. By executing the test event, the Lambda function shows whether the test passed or failed. The JSON output is published, along with the execution duration. It also generates summaries of the error count and duration, as shown in Figure 3.10.
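To make the handler structure concrete, the sketch below shows how such a service dispatches on the Alexa request type. The project's actual service is written in JavaScript (see Section 3.3), so this Python version is purely illustrative of the JSON request/response shape; the slot name and response texts are placeholders.

```python
# An illustrative Alexa request handler; not the project's actual
# (JavaScript) service, just a sketch of the dispatch logic.
def lambda_handler(event, context):
    request = event['request']
    if request['type'] == 'LaunchRequest':
        return build_response('Where do you want to go', end_session=False)
    if request['type'] == 'IntentRequest':
        # e.g., read the direction slot, then publish the matching
        # /cmd_vel message via rosbridge before answering.
        direction = request['intent']['slots']['dir']['value']
        return build_response('Moving ' + direction, end_session=False)
    return build_response('Goodbye', end_session=True)  # SessionEndedRequest, etc.

def build_response(text, end_session):
    return {
        'version': '1.0',
        'response': {
            'outputSpeech': {'type': 'PlainText', 'text': text},
            'shouldEndSession': end_session,
        },
    }
```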

However, in order to test all of the possible requests, users need to construct all of the possible JSON inputs. In our case, at least eight test events are needed, which is time-consuming and inconvenient. Therefore, another server, in this case a BST proxy lambda server, was also used. Bespoken Proxy (bst proxy) is a tool from Bespoken, LLC. This tool allows users to communicate with a local service running on the machine via an Alexa device, the Alexa


(a) Error summary from Lambda (b) Duration summary from Lambda

Figure 3.10: CloudWatch summary from AWS Lambda

simulator, or the Alexa app. The bst proxy lambda command can run a Lambda function as a local service.

To use the BST proxy server, the skill endpoint needs to be configured to the generated public URL for accessing the local service. The URL remains the same for one IP address. Testing and debugging of code can then be done easily via the Amazon Developer Console's simulator. If an error occurs, the simulation page will not display the designed response, and the error message will be shown in the terminal. Instead of configuring all possible JSON inputs, we can let Alexa generate the JSON input from our text or voice command input.

3.3 Marvin

Marvin is a remote server used for controlling Fetch. Although Fetch has a built-in computer to send control commands, it is dangerous to have wires such as an Ethernet cable connected to the robot while moving the base or arm. In order to accomplish the detect-and-follow AR tag task, Fetch needed to be able to move without any restrictions. Therefore, Marvin was used for sending commands to and receiving messages from Fetch.


For the best compatibility, Marvin runs the same versions of Ubuntu and ROS as the robot, which in this case were Ubuntu 14.04 and ROS Indigo. This reduces the risk of conflicts between the robot and the remote server.

The connection between Marvin and Fetch can be built via the ROS_MASTER_URI. We set the ROS_MASTER_URI of Marvin to Fetch's IP address so that all of the ROS-related data from Fetch is sent to Marvin automatically. One of the benefits of this connection is that there is no noticeable latency when using ROS tools such as RVIZ or rqt_graph.

However, ROS_MASTER_URI only allows us to receive data; thus, we cannot send control commands or messages to Fetch. To solve this problem, SSH is used. SSH is a network protocol that enables secure machine-to-machine communication by connecting to a remote host for one terminal session. Using SSH commands temporarily makes the current terminal the destination's terminal, so we are able to run a program or edit code on Fetch. One of the reasons we need SSH is that we need to build a bridge between the robot and Alexa: if the bridge were built on Marvin, Alexa could only communicate with Marvin instead of Fetch. SSH will be explained in more detail in the following section.

Besides being the remote server, Marvin also runs the local computing service for Alexa's requests. As mentioned in Section 3.2.2, the BST proxy tool allows us to run a local service as a cloud service. Instead of using Amazon Lambda as a cloud service, we chose to run Lambda as a local service on Marvin via the BST service, because Lambda cannot provide a stable connection between the ROS WebSocket server and Alexa skills. Along with the BST proxy server, the Alexa simulator can be used for testing and debugging. The error messages, as well as the JSON messages, are published to the terminal directly.


3.4 Communication

Since multiple devices and programming languages are used, communication among them is one of the most essential parts of this project. Figure 3.11 shows a schematic detailing how messages are communicated between devices. The communication between machines is built via SSH, as illustrated in the figure. The communication between the devices and the virtual assistant is built via ROS packages and the cloud server.

Figure 3.11: Device Communication for Alexa Controlled Robot

3.4.1 Machines

In this project the robot was able to be controlled and monitored by Marvin through SSH. However, if Marvin tries to receive a large number of messages from the robot, for example while using RVIZ to view image data from the camera or laser scan data from LiDAR, RVIZ will have approximately a 3-second latency. Consequently, the visualization of ROS messages was done through ROS_MASTER_URI. ROS tools like RVIZ can also be opened directly without SSH.

SSH is mainly used when we need to run a new program or start a new ROS node. There are two new nodes that we needed to start on Fetch. One is the node that executes the detection and following of the AR tag, /ar_follower. The other node, /rosbridge_websocket, is used to build a bridge for transferring messages between Alexa and Fetch. Also, when a test was finished, a control node, /teleop_twist_keyboard, was created to allow the robot to be moved with a keyboard in order to bring it back to the starting position. Another use of SSH was to modify code remotely: if an error occurred when a script was run remotely, the code could be edited immediately.

It is important to note that this communication is not bidirectional. Marvin can send

commands to the robot as well as retrieve messages from the robot. However, the robot

cannot do the same thing because Marvin is not configured to be SSH-enabled.

3.4.2 Devices, ROS and Alexa

For the purpose of smooth communication among devices, several message conversions and transfers need to be done, because both ROS and Alexa only accept certain types of messages. One of the message conversions is done in the ROS package ar_track_alvar. In ROS, images from the sensor are sent to a ROS hardware driver and passed around in the ROS message format sensor_msgs/Image. Unfortunately, the ROS image message is not convenient for image processing. Thus, the package ar_track_alvar converts the ROS message, sensor_msgs/Image, to an OpenCV image, cv::Mat, using the package cv_bridge.

The sensor_msgs/Image is published to a CvBridge node and converted to an OpenCV image [2]. The new image message is then republished over ROS. This conversion is bidirectional. Figure 3.12 shows the message conversion flow.
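A minimal sketch of this conversion using the cv_bridge Python API is shown below; the callback body is a placeholder for the actual marker-detection processing.

```python
# A sketch of the sensor_msgs/Image -> OpenCV conversion via cv_bridge.
import rospy
from cv_bridge import CvBridge, CvBridgeError
from sensor_msgs.msg import Image

bridge = CvBridge()

def image_callback(msg):
    try:
        # Convert the ROS image message into an OpenCV BGR image (cv::Mat).
        cv_image = bridge.imgmsg_to_cv2(msg, desired_encoding='bgr8')
    except CvBridgeError as err:
        rospy.logerr(err)
        return
    # ... image processing (e.g., AR marker detection) would happen here ...

rospy.init_node('image_listener')
rospy.Subscriber('/head_camera/rgb/image_raw', Image, image_callback)
rospy.spin()
```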

Figure 3.12: Image message conversion

Other than image message conversion, the project also requires JSON message conversion for interacting with Alexa. The request sent to the service and the response sent back from the service are in JSON format, which is not readable by ROS without conversion. Fortunately, ROS provides a library, rosbridge_library, to translate between JSON strings and ROS messages [18]. Although rosbridge_library allows the conversion of JSON messages and ROS messages, it leaves the transport layer to another package, rosbridge_server. This package provides a WebSocket as a transport layer, which has low latency and allows bidirectional communication. Although Amazon Lambda supports several programming languages, BST proxy only supports JavaScript. Consequently, the cloud service for the Alexa skill is written in JavaScript, and another ROS package, roslibjs, is needed for the cloud service to communicate with ROS. This relationship is shown in Figure 3.13.
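For illustration, the JSON that ultimately reaches rosbridge_server follows the rosbridge protocol. The sketch below shows an equivalent exchange in Python using the websocket-client package; the project's service performs the same operations through roslibjs, and port 9090 is rosbridge's default.

```python
# A sketch of publishing /cmd_vel through rosbridge's JSON protocol.
import json
from websocket import create_connection  # pip install websocket-client

ws = create_connection('ws://localhost:9090')  # default rosbridge_websocket port

# Advertise the topic, then publish a Twist that drives the base forward.
ws.send(json.dumps({'op': 'advertise', 'topic': '/cmd_vel',
                    'type': 'geometry_msgs/Twist'}))
ws.send(json.dumps({'op': 'publish', 'topic': '/cmd_vel',
                    'msg': {'linear': {'x': 0.5, 'y': 0.0, 'z': 0.0},
                            'angular': {'x': 0.0, 'y': 0.0, 'z': 0.0}}}))
ws.close()
```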

3.5 Summary

In this chapter, the materials involved in this project were introduced. The main devices discussed include the robot, Fetch; the speech recognition engine, Alexa; and the remote server computer, Marvin. In addition, how these materials are connected and communicate with each other was explained. The messages were converted using ROS packages, and the devices were connected via the SSH network protocol and a WebSocket transport layer, allowing them to communicate with each other. The communication within the devices was tested, and the results will be highlighted in the next chapter.


Chapter 4

Results

This chapter will focus on the results from both simulations and experimental testing with the Fetch robot. First, the performance of the SR300 was evaluated. We then simulated the follow and speech recognition tasks on the remote server. The last results demonstrate the voice control tests on the real robot.

4.1 Camera

Since the camera plays a key role in completing the detect and follow functions, the first step was to test the performance of the new camera, the SR300. Using the ROS package ar_track_alvar, multiple AR tags were able to be detected. The position relationship can then be visualized in RVIZ. The camera image and the visualized transform frames are presented in Figure 4.1.

To test the performance of the SR300, we used a moving AR tag held by a person while we recorded the marker messages. The tag was moved randomly within the FoV, mainly changing position in the x and y directions within the approximate sensor range of the two camera models. Since the position of the tag was controlled by a human, we could not control the speed to an exact value. We tried to move the tag steadily, which made it easier to eliminate outliers from the raw data. When the data had a sudden jump in position in any direction, it could be determined to be a false detection. If there were empty marker messages, the camera did not detect any AR tag.


(a) Multiple tag detection with RGB image (b) Visualized TF for multiple tag detection

Figure 4.1: AR Tag Detection

The following table shows the efficiency of the SR300. Since the Intel camera package allows developers to determine the launch mode of the camera, both the RGB-D mode and the Kinect-type mode were tested. The efficiency of the original camera model was tested with a new camera at the end of the project. Since the PrimeSense Carmine 1.09 originally launched with PointCloud2, we only evaluated the Kinect-type mode.

Table 4.1: Data loss ratio under different situations

Method   Background   Camera Launch Type   Total Data   Empty Data   Ratio
Track    Noisy        Kinect Type          26914        22428        83.33%
                      RGBD                 26823        9515         35.47%
         White        Kinect Type          26989        26827        99.40%
                      RGBD                 26827        542          2.02%
Follow   White        Kinect Type          27003        25565        94.67%
                      RGBD                 26999        178          0.66%

Table 4.1 shows that the SR300 in RGBD mode provided a reliable tracking function. With a clear white background, the AR tag was visible 97.98% of the time. Further, when the AR tag was placed against a noisy background, which is normal in a lab environment, the detection efficiency was 33.45% lower. Both experiments show acceptable


results. However, with the Kinect-type mode, the camera data loss reached 99.40% and 83.33% for the white and noisy backgrounds, respectively. Meanwhile, the original camera model had an 8.33% data loss over 15 minutes of tracking. In addition to the detection function alone, the detection efficiency while following was also verified; the data loss reached a minimum during the following task.

The Kinect-type mode had a large offset between the RGB image and the depth image, which is used to create the distance data in the Point Cloud Library, and therefore we were unable to trust the recorded data. To address these issues, we attached the AR tag to a slightly larger piece of white paper and found that the detection efficiency improved: the empty data ratio dropped from 35.47% to 2.02% (Table 4.1). The white paper forms a small area of white background around the target tag. Consequently, the background builds an obvious contrast with the edge of the AR tag, making it easier for the camera to detect the corners and blocks of the tag. However, since the tag is held by a person and moves in arbitrary directions, it can sometimes move out of the FoV of the camera, which causes most of the data loss. During the following test, the robot moves with the target, so the tag is less likely to be out of view because the robot adjusts its position in order to keep a certain distance from the tag. As a result, the detection efficiency increases for the following task. Figures 4.2a and 4.2b show the detected marker positions, where x is the distance from the tag to the camera, and y and z are the horizontal and vertical distances from the center of the FoV, respectively. Based on the position change of the marker, the moving speed could be calculated. The peak positions in Figures 4.2c and 4.2d indicate the empty marker messages.

According to the results above, we can conclude that the Intel SR300 is an appropriate substitute for the PrimeSense Carmine 1.09.


(a) Marker position from SR300 (b) Marker position from Carmine 1.09

(c) Marker moving speed from SR300 (d) Marker moving speed from Carmine 1.09

Figure 4.2: Compared camera performance

4.2 Simulation

As mentioned above, Fetch is a mobile robot that can only be used on a flat surface. In addition, Fetch weighs 250 lbs (113.3 kg) [65]. It would be very dangerous if Fetch hit someone in the room or accidentally went down the stairs. Due to these safety concerns, it was first necessary to simulate results prior to conducting our experimental tests.

The first simulation was to track and follow an AR tag using turtlesim. The simulated turtle moved based on the position of the detected marker. Figure A.5 displays the simulation


for track and follow.

Figure 4.3: Turtle trajectory for AR tag following

We recorded the actual command velocity published to /turtle1/cmd_vel, which controls the movement of turtlesim, as well as the AR tag position published to /ar_pose_marker. The expected trajectory was computed based on Equation 3.7. The actual and expected trajectories are compared in Figure 4.3. The two trajectories basically overlap, with a slight misalignment caused by the mismatched updating frequencies of the camera and the control topic: the camera frequency fluctuated between 25 Hz and 30 Hz, while the updating rate of /turtle1/cmd_vel was set to a constant 30 Hz. Overall, the turtle was able to follow the AR tag as expected based on the detected relative position.

In addition to conducting the following simulation, turtlesim was also used to conduct simulations that evaluated the ROS and Alexa connection as well as the speech recognition


tasks. Similar to the methods used earlier, we used the turtle to imitate the movement of a real robot. The Alexa Developer Console test page was used for sending commands. If ROS and Alexa were not connected, the Alexa simulation page sent back the designed message, yet the turtle would not perform the task. Only when the ROS and Alexa connection was built would the turtle perform the requested mission, including moving in a direction or following the AR tag.

When using the AWS Lambda function as the cloud server, the connection between ROS and Alexa did not work well. During the event test, the console displayed that the rosbridge was not connected even though the rosbridge_server had launched. The turtle simulation results using AWS Lambda are displayed in Table 4.2; the actual simulation window can be found in Appendix A, Figure A.1. Alexa received the correct message, but the turtle did not move as expected.

Table 4.2: Turtle Simulation Response with AWS Lambda

User Command                       Alexa Response                                  Rostopic Response (linear / angular)
Alexa start turtle one             Where do you want to go                         NaN / NaN
Alexa tell turtle one go forward   Moving Forward ... Please say another command   NaN / NaN
Alexa tell turtle one turn left    Turning left ... Please say another command     NaN / NaN

As a result, we decided to use the bst proxy server as the "cloud" server. The simulation results, with the responses from the Alexa simulator and the ROS topic /turtle1/cmd_vel, are listed in Table 4.3. The actual simulation window using bst proxy is shown in Figure A.2. As shown in Figure A.2, the trajectory of the turtle's movement matches the text input commands. To distinguish the forward and backward commands, the forward command was set to move one step while the backward command was set to move two steps.


Table 4.3: Turtle Simulation Response with BST proxy

User Command                        Alexa Response            ROS Topic Response
                                                              linear       angular
Alexa start turtle one              Where do you want to go   NaN          NaN
Alexa tell turtle one go backward   Going backward            {-2, 0, 0}   {0, 0, 0}
Alexa tell turtle one go forward    Going forward             {1, 0, 0}    {0, 0, 0}
Alexa turn left                     Turning left              {0, 0, 0}    {0, 0, 1.6}
Alexa go forward                    Going forward             {1, 0, 0}    {0, 0, 0}
Alexa turn right                    Turning right             {0, 0, 0}    {0, 0, -1.5}

The speech recognition simulation implies that the bst proxy server provides a stable connection between ROS and Alexa. Alexa receives and interprets commands correctly when a proper command is sent. With the simulations complete, the algorithm and the Alexa voice control were ready to be tested on Fetch.

4.3 Robot Test

After validating our process through simulations, we were able to conduct experiments on the robot. Before testing the voice control with Alexa, we first made sure that the track and follow task could be performed as expected. After applying the designed algorithm, the expected geometry_msgs/Twist messages were computed from the marker position data. The result was then compared with the actual messages published to /cmd_vel, as shown in Figure 4.4.

Applying the RMSE Equation 3.10, the computational result is shown in Equation 4.1,

$$\mathrm{RMSE}_{cmd} = \begin{bmatrix} \mathrm{RMSE}_{linear.x} \\ \mathrm{RMSE}_{angular.z} \end{bmatrix} = \begin{bmatrix} 1.5402 \times 10^{-13} \\ 4.202 \times 10^{-13} \end{bmatrix} \qquad (4.1)$$


Figure 4.4: Output comparison

The RMSE values in Equation 4.1 are close to zero. The error might come from the mismatched time steps: the /cmd_vel topic update frequency is set to 60 Hz, but the marker position topic's rate ranges from 25 Hz to 30 Hz. To make the expected and actual messages comparable, the actual movement data is downsampled to half of the original sample size.
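The alignment can be sketched as follows, with random arrays standing in for the recorded logs:

```python
# A sketch of the comparison: the 60 Hz /cmd_vel log is downsampled by a
# factor of two to line up with the ~30 Hz expected output. Dummy data
# stands in for the recorded logs.
import numpy as np

actual = np.random.rand(600)    # hypothetical 60 Hz actual /cmd_vel log
expected = np.random.rand(300)  # hypothetical ~30 Hz expected output

actual_ds = actual[::2]         # keep every other sample
n = min(len(actual_ds), len(expected))
rmse = np.sqrt(np.mean((actual_ds[:n] - expected[:n]) ** 2))  # Equation 3.10
print(rmse)
```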

Since the error was very small, we can conclude that the following task was performed as

expected.

After completing the track and follow test, we moved on to the voice control test. The skill is activated correctly with the invocation name. The four direction commands, as well as the follow and stop-following commands, work well via text input on the Alexa simulation page. Figure 4.5 shows simplified JSON input and output. The complete JSON input and output can be found in Figures A.3 and A.4 in Appendix A.

In addition to testing the designed commands, the default commands to exit Alexa were also tested. Since the messages need to be converted multiple times, the response times


Input:
    "request": {
      "type": "IntentRequest",
      "intent": {
        "name": "MessageIntent",
        "slots": {
          "item": {
            "name": "Item",
            "value": "forward"
          }
        }
      }
    }

Output:
    "response": {
      "outputSpeech": {
        "type": "SSML",
        "ssml": "<speak> Going forward</speak>"
      },
      "shouldEndSession": false
    }

Figure 4.5: Example JSON input and output

could be affected. According to the device log on the Amazon Developer Console test page, we estimated the response times for the four request types, which are presented in Table 4.4.

Table 4.4: Process time for Alexa intent requests

Intent     Sample Command                    Process Time (ms)
Launch     Alexa start Fetch                 934.69
MoveBase   Alexa ask Fetch move forward      867.77
Follow     Alexa tell Fetch follow the tag   920.53
Stop       Stop                              930.29

The sample device log is shown in Appendix A, Figure A.6. Because the speech time varies based on the designed response output, we only consider the response time from TextMessage to RequestProgressingCompleted. The process time was determined as the average over 15 requests. All four types of intent requests took less than 1 second to process, which means the robot is able to react within 1 second of receiving a command.

The final purpose of the project was to send commands via voice instead of text. Therefore,


the voice commands using the Amazon Echo Dot were verified. Unfortunately, since we used the BST proxy as the local server, we were unable to receive the device log from AWS. Instead of the response time, the correctness of the speech recognition was tested. Of the 40 examined voice commands, 5 were launch requests, 3 were stop requests, 16 were movement requests, and the other 16 were tag-following requests. After completing the study, we found that all of the launch and stop requests were interpreted correctly. The movement commands had only one failure case, where Alexa did not recognize the direction in the command and exited the skill. Half of the follow requests failed. Based on Alexa's responses, the word "follow" was recognized, but Alexa thought that the "tag" or "AR tag" was a social media hashtag or account and intended to follow that account, instead of letting the robot complete the desired track and follow task. This situation can be improved if the command is said with the invocation name of the skill, for example, "Alexa, tell Fetch to follow tag." Similarly, when we want to stop the following, the command should be "Alexa, tell Fetch to stop following tag."

4.4 Trained Model Evaluation and Mapping

Even though the current following algorithm works fine, we would like to further improve the model using a neural network (NN). After applying Algorithm 1 and running for 100 epochs, we obtained our best trained model and a loss log for all epochs. After 100 epochs, the training loss was reduced to 0.29958 and the validation loss was reduced to 0.30184. Figure 4.6 shows the decrease of the training loss and the validation loss with more epochs.

At the first epoch, both the training and validation losses are high, which means that the trained model at epoch 0 underfits the problem. Underfit models usually cause


Figure 4.6: Training and validation loss for neural network training process

poor generalization and unreliable predictions. After epoch 50, the validation loss seemed to converge and remained at a certain level for the rest of the training, while the training loss continued to decrease. This indicates that the model had a high risk of overfitting after epoch 50. The overfitting problem can also lead to a loss of generalization of the model. Therefore, our best model was found to be around epoch 50.

After applying the best trained model to the test data, we were able to evaluate the performance of the model. Figure 4.7 compares the actual output from the experiment and the output from the model. A positive linear velocity indicates that the robot moved forward, and a negative velocity means that the robot was moving backward. There was bad fitting in the region where the robot had a positive linear velocity; the results in this region are displayed in Figure 4.7b. The actual output of this region shows poor continuity, which is highly suspected to be noise from unexpected input. When the robot was moving forward, the AR tag was smaller in the view. Consequently, it was easier for the robot to


pick up the environmental noise as input. It was difficult to filter out this kind of noise in the designed algorithm because the input was within the expected input range, which was the range of the camera. However, the neural network computed model was able to ignore the noise and create a smoother trajectory.

Applying Equation 3.10, the error is computed and displayed in Equation 4.2,

$$\mathrm{RMSE}_{cmd} = \begin{bmatrix} \mathrm{RMSE}_{linear.x} \\ \mathrm{RMSE}_{angular.z} \end{bmatrix} = \begin{bmatrix} 0.0368 \\ 0.0383 \end{bmatrix} \qquad (4.2)$$

From the evaluation results, we can see that the proposed NN performs as well as the original algorithm. In addition, the NN behaves more robustly in dealing with noise and generates smoother trajectories.

(a) Training model evaluation with test data (b) Detail for bad fitting region

Figure 4.7: Training model evaluation

To achieve the final goal of the navigation system, we needed an accurate map for robot localization in addition to a good following model. Through the application of the Karto SLAM method described previously, the map in Figure 4.8a was generated. As can be seen in the figure, the produced map has a noticeable drift.


(a) Map built by Karto SLAM (b) Map built by Cartographer

Figure 4.8: Map from two SLAM method

The map built by Cartographer is shown in Figure 4.8b. When compared with the map from Karto SLAM, it is immediately clear that the Cartographer map has significantly less drift. In addition to generating maps, Cartographer can generate the trajectory of the robot while building a map. The trajectory corresponding to Figure 4.8b is shown in Figure 4.9, where the green dot is the starting position and the red dot is the stopping position.

At the beginning of the map generation process, Cartographer also has some drift, as shown in Figure 4.10. Both graphs start at the same position and have the same trajectory at the beginning, but the trajectories start to drift after a certain time. Instead of generating a map with two drifted loops, as displayed in Figure 4.8a, Cartographer closes the loop based on laser matching and constraints.

Based on the output maps from the two SLAM tools, we chose Cartographer to generate the map for robot localization.


Figure 4.9: Map built by Cartographer with trajectory

4.5 Summary

In this chapter, the performance of the camera was validated. The Intel SR300 was able to track the AR tag continuously using the RGB images; therefore, the SR300 was used for all experimental work presented in this thesis. Due to the complicated and crowded lab environment and the potential danger of hitting people or objects in the lab, the turtlesim tool in ROS was used for simulation. The simulation results validated that the follow algorithm was successful: the turtle moved or rested according to the marker's position. The turtle simulation also confirmed that the bst proxy server offered a more stable connection between the Alexa custom skill and ROS, and the turtlesim's movement could be controlled by Alexa via bst proxy. On the basis of the previous simulation, we changed the controlled topic from /turtle1/cmd_vel to /cmd_vel and mounted the SR300 in place of the Carmine 1.09. By


Figure 4.10: Trajectory for one loop and two loops

saying the appropriate commands, Fetch can be controlled to move or to follow an AR tag. Also, a neural network algorithm was used to train the target following model, which was validated with small error. The neural network computed model showed good robustness to the environment and generated more continuous movement commands. After comparing different SLAM tools for map generation, we found that Cartographer had the most stable performance. Hence, we chose Cartographer to build a map for precise robot localization.


Chapter 5

Discussion and Future Work

There were several limitations that we encountered while completing this project, which will be discussed in more detail in this chapter. Additionally, the robot and the methods presented in this thesis could have other possible applications. In the second section of this chapter, these possible future applications will be explored.

5.1 Limits

One main limitation of the robot is that Fetch is designed to be an indoor robot that moves on wheels. Therefore, the movement of the robot on an uneven surface is restricted: if there are stairs or holes on Fetch's path, the robot will stop in front of the stairs or get stuck in the holes. Also, while completing the detect and follow AR tag test, the camera sometimes lost the target because the camera frequency is restricted to 30 Hz, which results in some data loss.

In addition to the hardware restrictions, the robot also has multiple software limitations. The Ubuntu and ROS versions that were used were old and no longer supported; updates and support have not been available since April 2019. In addition, a lot of new functions are not provided for ROS Indigo. For example, eval for roslaunch, which evaluates Python expressions, can only be used in ROS versions later than Kinetic. Besides, ROS 1.0 has a dependency conflict with Python 3, which greatly limits the range of


applications. Although Python 2.7 works well so far, it reached end of life in January 2020. There are multiple ways to use Python 3 with ROS 1. One of the most recommended ways is to use it in a virtual environment, since a system installation of Python 3 would remove all current ROS packages, which could disrupt all the work that has been done so far. File backups are essential before any updates take place.

5.2 Future Work

5.2.1 Software Upgrade

As mentioned above, there are currently numerous limitations for Fetch, with the majority of the restrictions coming from outdated software. Therefore, one of the biggest jobs is to upgrade the software for future use. All of the important files need to be backed up, since the update from Ubuntu 14.04 + ROS Indigo to Ubuntu 18.04 + ROS Melodic is not supported by Fetch Robotics, Inc. There is no direct way to upgrade from Ubuntu 14.04 to 18.04: first we would need to upgrade to Ubuntu 16.04, and then to 18.04. This process will be time-consuming, and the result cannot be guaranteed. In addition, this upgrade will not fix the compatibility issue with Python 3. Currently, ROS 2 supports Python 3, but ROS 2 is not completely ready for use.

5.2.2 Continued Navigation Development

After upgrading the operating system, we can continue to develop an intelligent navigation system. As a first approach to the final goal, we let the robot learn human following behaviors. Instead of an answer computed by a given program, the example solution was given by human control commands. The robot was controlled by a human with a


joystick to follow the target. Similarly, the experimental data was used for NN training. In this case, we expected the robot to imitate human navigation behavior without a given algorithm. It is important to note that obstacle avoidance during following is not currently considered. Implementing obstacle avoidance will be necessary in future work to enable robot–human coexistence in various environments and will help the robot assistant behave more like a human assistant.

Furthermore, the map information can be integrated with the navigation system and Alexa control. If the robot knows where it is on the map, it can predict the possible location of the target it should follow. In addition, the robot can be used for object delivery with the map. By enabling navigation with Alexa, we can send the desired destination to the robot using voice commands.

5.2.3 Other Hardware

In addition to navigation, applications of other hardware can be developed in the future. First of all, the manipulator was not used in this project at all, yet it is one of the most useful features of Fetch. The arm can reach higher than 180 cm and ±90° from the center, and it is designed to support up to 6 kg. It can be used to pick up objects or simply to hold certain objects. The robot can be expected to help elderly individuals solve problems in their daily lives; for example, if the user needs assistance with picking up items around the home, the manipulator will be needed.


Chapter 6

Conclusions

The IoT has an enormous number of possibilities as the internet develops. As one of the most important components of the IoT, smart speakers are becoming more and more popular. In addition to playing music, smart speakers are also virtual assistants, which understand certain voice commands and take actions based on those commands, such as searching the weather, setting timers or alarms, and turning lights on and off. However, the IoT nowadays is limited by mobility: the facilities that a VA can control are usually stationary. We believe that the new trending applications of the IoT in the field of robotics will be able to break down those limitations. This belief is rooted in the idea that robots can be designed with more diverse and numerous functions than typical household utilities. For example, if a robot is equipped with wheels or legs, it can move and complete tasks around a home or workplace. If a robot is equipped with arms and grippers, it can pick up or hold items. If a robot is equipped with a laser, it can tell us how far away an obstacle is.

In this project, we were able to successfully connect a mobile robot, Fetch, to an IVA, Alexa. The IVA was able to understand given commands and control the robot accordingly. The robot could be directed to move in a given direction. It could start to detect an AR tag without following it, and it was also able to follow the AR tag while maintaining a certain distance. Based on the hand-engineered algorithm, we applied a data-driven neural network algorithm to develop a similar following model. The neural network architecture can be further improved for human navigation behavior imitation. The neural network trained


model performed as well as the original algorithm. Moreover, it is less sensitive to the noise

and manage to compute a more continuous control commands. In order to generate a map

for robot localization, two SLAM tools were tested and compared. Cartographer barely has

drift and provide 3D SLAM if needed while Karto SLAM has larger drift and the drift will

not be eliminated by loop closure.
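For reference, the following is a minimal sketch of the kind of proportional tag-following controller summarized above. It assumes ar_track_alvar is configured to report marker poses in the robot base frame and that the base accepts Twist commands on /cmd_vel; the gains and target distance are illustrative, not the values used in this work.

#!/usr/bin/env python
# Minimal sketch of proportional AR-tag following (illustrative gains).
import rospy
from geometry_msgs.msg import Twist
from ar_track_alvar_msgs.msg import AlvarMarkers

TARGET_DIST = 1.0        # desired following distance in meters (assumed)
K_LIN, K_ANG = 0.5, 1.5  # proportional gains (assumed)

pub = None

def on_markers(msg):
    if not msg.markers:
        return  # tag lost; a real controller would predict or stop here
    # Assumes the tracker's output_frame is the robot base frame
    # (x forward, y left), configurable in ar_track_alvar.
    pose = msg.markers[0].pose.pose
    cmd = Twist()
    # Drive forward/backward to hold the target distance,
    # and turn to keep the tag centered.
    cmd.linear.x = K_LIN * (pose.position.x - TARGET_DIST)
    cmd.angular.z = K_ANG * pose.position.y
    pub.publish(cmd)

if __name__ == "__main__":
    rospy.init_node("ar_tag_follower_sketch")
    pub = rospy.Publisher("/cmd_vel", Twist, queue_size=1)
    rospy.Subscriber("/ar_pose_marker", AlvarMarkers, on_markers)
    rospy.spin()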


Bibliography

[1] Primesense 3d sensors.

[2] Wiki: cvbridge. URL http://wiki.ros.org/cv_bridge.

[3] Wiki:distributions. URL http://wiki.ros.org/Distributions.

[4] Fetch robotics. URL https://fetchrobotics.com/robotics-platforms/fetch-mobile-manipulator/.

[5] Primesense carmine 1.09. URL http://xtionprolive.com/primesense-carmine-1.09.

[6] Intel sr 300. URL https://click.intel.com/media/catalog/product/cache/1/image/9df78eab33525d08d6e5fb8d27136e95/p/s/ps-blasterx_senz3d_front_1.png.

[7] Number of connected devices reached 22 billion, where is the revenue?, May 2019. URL

https://www.helpnetsecurity.com/2019/05/23/connected-devices-growth/.

[8] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig

Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat,

Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal

Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat

Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens,

Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay

Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin

Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.

[9] Raafat Aburukba, AR Al-Ali, Nourhan Kandil, and Diala AbuDamis. Configurable

zigbee-based control system for people with multiple disabilities in smart homes. In 2016

International Conference on Industrial Informatics and Computer Systems (CIICS),

pages 1–5. IEEE, 2016.

[10] Tawfiq Ammari, Jofish Kaye, Janice Y Tsai, and Frank Bentley. Music, search, and

iot: How people (really) use voice assistants. ACM Transactions on Computer-Human

Interaction (TOCHI), 26(3):17, 2019.

[11] Roger Bemelmans, Gert Jan Gelderblom, Pieter Jonker, and Luc De Witte. Socially assistive robots in elderly care: A systematic review into effects and effectiveness. Journal

of the American Medical Directors Association, 13(2):114–120, 2012.

[12] Alessio Botta, Walter De Donato, Valerio Persico, and Antonio Pescapé. Integration of

cloud computing and internet of things: a survey. Future generation computer systems,

56:684–700, 2016.

[13] Monica Carfagni, Rocco Furferi, Lapo Governi, Michaela Servi, Francesca Uccheddu,

and Yary Volpe. On the performance of the intel sr300 depth camera: metrological and

critical characterization. IEEE Sensors Journal, 17(14):4508–4519, 2017.

[14] Yu Fan Chen, Michael Everett, Miao Liu, and Jonathan P How. Socially aware motion

planning with deep reinforcement learning. In 2017 IEEE/RSJ International Conference

on Intelligent Robots and Systems (IROS), pages 1343–1350. IEEE, 2017.

[15] François Chollet et al. Keras. https://keras.io, 2015.


[16] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and

Pavel Kuksa. Natural language processing (almost) from scratch. Journal of machine

learning research, 12(Aug):2493–2537, 2011.

[17] Intel Corporations. Intel realsense camera sr300 data sheet revision 1.0, 2016.

[18] Christopher Crick, Graylin Jay, Sarah Osentoski, Benjamin Pitzer, and Odest Chadwicke Jenkins. Rosbridge: Ros for non-ros users. In Robotics Research, pages 493–504.

Springer, 2017.

[19] Craig C Douglas and Robert A Lodder. Human identification and localization by robots

in collaborative environments. Procedia Computer Science, 108:1602–1611, 2017.

[20] Esther Calvo Fernández, José Manuel Cordero, George Vouros, Nikos Pelekis,

Theocharis Kravaris, Harris Georgiou, Georg Fuchs, Natalya Andrienko, Gennady Andrienko, Enrique Casado, et al. Dart: a machine-learning approach to trajectory prediction and demand-capacity balancing. SESAR Innovation Days, Belgrade, pages 28–30,

2017.

[21] Gonzalo Ferrer, Anais Garrell, and Alberto Sanfeliu. Robot companion: A social-force

based approach with human awareness-navigation in crowded environments. In 2013

IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1688–

1694. IEEE, 2013.

[22] Rafael Fierro and Frank L Lewis. Control of a nonholonomic mobile robot using neural

networks. IEEE transactions on neural networks, 9(4):589–600, 1998.

[23] Maksim Filipenko and Ilya Afanasyev. Comparison of various slam systems for mobile

robot in an indoor environment. In 2018 International Conference on Intelligent Systems

(IS), pages 400–407. IEEE, 2018.


[24] Nicolas Gautier, J-L Aider, Thomas Duriez, BR Noack, Marc Segond, and Markus Abel. Closed-loop separation control using machine learning. Journal of Fluid Mechanics, 770:442–457, 2015.

[25] Jayavardhana Gubbi, Rajkumar Buyya, Slaven Marusic, and Marimuthu Palaniswami.

Internet of things (iot): A vision, architectural elements, and future directions. Future

generation computer systems, 29(7):1645–1660, 2013.

[26] Xavier Harding. The ’pokémon go’ improved ar mode is now on iphone and android

- here’s how to use it, Oct 2018. URL https://www.mic.com/articles/191915/

pokemon-go-improved-ar-mode-iphone-android.

[27] Wolfgang Hess, Damon Kohler, Holger Rapp, and Daniel Andor. Real-time loop closure

in 2d lidar slam. In 2016 IEEE International Conference on Robotics and Automation

(ICRA), pages 1271–1278. IEEE, 2016.

[28] Alejandro Hidalgo-Paniagua, Andrés Millan-Alcaide, Juan P Bandera, and Antonio

Bandera. Integration of the alexa assistant as a voice interface for robotics platforms.

In Iberian Robotics conference, pages 575–586. Springer, 2019.

[29] Pengju Jin, Pyry Matikainen, and Siddhartha S Srinivasa. Sensor fusion for fiducial tags:

Highly robust pose estimation from single frame rgbd. In 2017 IEEE/RSJ International

Conference on Intelligent Robots and Systems (IROS), pages 5770–5776. IEEE, 2017.

[30] Jan Jungbluth, Rolf Krieger, Wolfgang Gerke, and Peter Plapper. Combining virtual

and robot assistants-a case study about integrating amazon’s alexa as a voice interface

in robotics. In Robotix-Academy Conference for Industrial Robotics (RACIR) 2018,

page 5. Shaker, 2018.

[31] Niklas Karlsson, Enrico Di Bernardo, Jim Ostrowski, Luis Goncalves, Paolo Pirjanian, and Mario E Munich. The vslam algorithm for robust localization and mapping. In

Proceedings of the 2005 IEEE international conference on robotics and automation,

pages 24–29. IEEE, 2005.

[32] Veton Kepuska and Gamal Bohouta. Next-generation of virtual personal assistants

(microsoft cortana, apple siri, amazon alexa and google home). In 2018 IEEE 8th

Annual Computing and Communication Workshop and Conference (CCWC), pages 99–

103. IEEE, 2018.

[33] Bret Kinsella and Ava Mutchler. Smart speaker consumer adoption report 2019. Technical report, voicebot.ai, 2019. URL https://voicebot.ai/wp-content/uploads/2019/03/smart_speaker_consumer_adoption_report_2019.pdf.

[34] Adrian Korodi, Alexandru Codrean, Liviu Banita, Vlad Ceregan, Anamaria Butaru,

and Radu Carnaru. Object following control for wheeled mobile robots. In Proceedings

of the 9th WSEAS International Conference on International Conference on Automation

and Information, pages 338–343. World Scientific and Engineering Academy and Society

(WSEAS), 2008.

[35] Sotiris B Kotsiantis, I Zaharakis, and P Pintelas. Supervised machine learning: A review

of classification techniques. Emerging artificial intelligence applications in computer

engineering, 160:3–24, 2007.

[36] Deepak Kumar, Riccardo Paccagnella, Paul Murley, Eric Hennenfent, Joshua Mason,

Adam Bates, and Michael Bailey. Emerging threats in internet of things voice services.

IEEE Security & Privacy, 2019.

[37] Ti-Chung Lee, Kai-Tai Song, Ching-Hung Lee, and Ching-Cheng Teng. Tracking control of unicycle-modeled mobile robots using a saturation feedback controller. IEEE

transactions on control systems technology, 9(2):305–318, 2001.


[38] John J Leonard and Hugh F Durrant-Whyte. Simultaneous map building and localization for an autonomous mobile robot. In IROS, volume 3, pages 1442–1447, 1991.

[39] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve

neural network acoustic models. In Proc. icml, volume 30, page 3, 2013.

[40] Friedemann Mattern and Christian Floerkemeier. From the internet of computers to

the internet of things. In From active data management to event-based systems and

more, pages 242–259. Springer, 2010.

[41] Rajdeep Kumar Nath, Rajnish Bajpai, and Himanshu Thapliyal. Iot based indoor

location detection system for smart home environment. In 2018 IEEE International

Conference on Consumer Electronics (ICCE), pages 1–3. IEEE, 2018.

[42] Scott Niekum. Ros wiki, Dec 2013. URL http://wiki.ros.org/ar_track_alvar.

[43] J Norberto Pires. Robot-by-voice: Experiments on commanding an industrial robot

using the human voice. Industrial Robot: An International Journal, 32(6):505–511,

2005.

[44] Douglas A Orr and Laura Sanchez. Alexa, did you get that? determining the evidentiary

value of data stored by the amazon® echo. Digital Investigation, 24:72–78, 2018.

[45] Morgan Quigley, Ken Conley, Brian Gerkey, Josh Faust, Tully Foote, Jeremy Leibs, Rob

Wheeler, and Andrew Y Ng. Ros: an open-source robot operating system. In ICRA

workshop on open source software, volume 3, page 5. Kobe, Japan, 2009.

[46] Nathan Ratliff, Franziska Meier, Daniel Kappler, and Stefan Schaal. Doomed: Direct

online optimization of modeling errors in dynamics. Big data, 4(4):253–268, 2016.


[47] Hailin Ren, Jingyuan Qi, and Pinhas Ben-Tzvi. Learning flatness-based controller using

neural network. In ASME 2019 Dynamic Systems and Control Conference. American

Society of Mechanical Engineers Digital Collection, 2019.

[48] Luis Riazuelo, Moritz Tenorth, Daniel Di Marco, Marta Salas, Dorian Gálvez-López,

Lorenz Mösenlechner, Lars Kunze, Michael Beetz, Juan D Tardós, Luis Montano, et al.

Roboearth semantic mapping: A cloud enabled knowledge-based approach. IEEE

Transactions on Automation Science and Engineering, 12(2):432–443, 2015.

[49] Hayley Robinson, Bruce MacDonald, and Elizabeth Broadbent. The role of healthcare

robots for older people at home: A review. International Journal of Social Robotics, 6

(4):575–591, 2014.

[50] Margaret Rouse. What is internet of things (iot)? - definition from whatis.com,

Jul 2019. URL https://internetofthingsagenda.techtarget.com/definition/Internet-of-Things-IoT.

[51] Radu Bogdan Rusu and Steve Cousins. 3d is here: Point cloud library (pcl). In 2011

IEEE international conference on robotics and automation, pages 1–4. IEEE, 2011.

[52] Alex Sciuto, Arnita Saini, Jodi Forlizzi, and Jason I Hong. Hey alexa, what’s up?: A

mixed-methods studies of in-home conversational agent usage. In Proceedings of the

2018 Designing Interactive Systems Conference, pages 857–868. ACM, 2018.

[53] Yosuke Senta, Yoshihiko Kimuro, Syuhei Takarabe, and Tsutomu Hasegawa. Machine learning approach to self-localization of mobile robots using rfid tag. In 2007

IEEE/ASME international conference on advanced intelligent mechatronics, pages 1–6.

IEEE, 2007.

[54] David Sheppard, Nick Felker, and John Schmalzel. Development of voice commands in digital signage for improved indoor navigation using google assistant sdk. In 2019 IEEE

Sensors Applications Symposium (SAS), pages 1–5. IEEE, 2019.

[55] Rathin Chandra Shit, Suraj Sharma, Deepak Puthal, and Albert Y Zomaya. Location

of things (lot): A review and taxonomy of sensors localization in iot infrastructure.

IEEE Communications Surveys & Tutorials, 20(3):2028–2061, 2018.

[56] William D Smart and L Pack Kaelbling. Effective reinforcement learning for mobile

robots. In Proceedings 2002 IEEE International Conference on Robotics and Automation

(Cat. No. 02CH37292), volume 4, pages 3404–3410. IEEE, 2002.

[57] José A Solorio, José M Garcia-Bravo, and Brittany A Newell. Voice activated semi-autonomous vehicle using off the shelf home automation hardware. IEEE Internet of

Things Journal, 5(6):5046–5054, 2018.

[58] John A Stankovic. Research directions for the internet of things. IEEE Internet of

Things Journal, 1(1):3–9, 2014.

[59] Chin Pei Tang. Differential flatness-based kinematic and dynamic control of a differentially driven wheeled mobile robot. In 2009 IEEE International Conference on Robotics

and Biomimetics (ROBIO), pages 2267–2272. IEEE, 2009.

[60] Sebastian Thrun, Mike Montemerlo, Hendrik Dahlkamp, David Stavens, Andrei Aron,

James Diebel, Philip Fong, John Gale, Morgan Halpenny, Gabriel Hoffmann, et al.

Stanley: The robot that won the darpa grand challenge. Journal of field Robotics, 23

(9):661–692, 2006.

[61] Shrihari Vasudevan, Stefan Gächter, Viet Nguyen, and Roland Siegwart. Cognitive

maps for mobile robots—an object based approach. Robotics and Autonomous Systems,

55(5):359–371, 2007.


[62] Regis Vincent, Benson Limketkai, and Michael Eriksen. Comparison of indoor robot

localization techniques in the absence of gps. In Detection and Sensing of Mines,

Explosive Objects, and Obscured Targets XV, volume 7664, page 76641Z. International

Society for Optics and Photonics, 2010.

[63] Sen Wang, Ronald Clark, Hongkai Wen, and Niki Trigoni. Deepvo: Towards end-to-end visual odometry with deep recurrent convolutional neural networks. In 2017 IEEE

International Conference on Robotics and Automation (ICRA), pages 2043–2050. IEEE,

2017.

[64] Mark Weiser. The computer for the 21st century. IEEE pervasive computing, 1(1):

19–25, 2002.

[65] Melonee Wise, Michael Ferguson, Derek King, Eric Diehr, and David Dymesich. Fetch

and freight: Standard platforms for service robot applications. In Workshop on au-

tonomous mobile service robots, 2016.

[66] Takashi Yoshimi, Manabu Nishiyama, Takafumi Sonoura, Hideichi Nakamoto, Seiji

Tokura, Hirokazu Sato, Fumio Ozaki, Nobuto Matsuhira, and Hiroshi Mizoguchi. Development of a person following robot with vision based target detection. In 2006

IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5286–

5291. IEEE, 2006.


Appendices


Appendix A

Alexa

A.1 Turtle Simulation

Figure A.1: Turtle simulation for Alexa via AWS Lambda


Figure A.2: Turtle simulation for Alexa control


Figure A.3: Complete JSON input

Figure A.4: Complete JSON output
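Because the JSON screenshots do not reproduce in text form, the following is a minimal sketch of the general shape of such an exchange as handled by an AWS Lambda function. The intent name "MoveIntent" and the "direction" slot are placeholders rather than the exact names used in this project; only the envelope structure follows the standard Alexa Skills Kit format.

# Hypothetical AWS Lambda handler illustrating the Alexa request/response
# envelope; "MoveIntent" and the "direction" slot are placeholder names.
def lambda_handler(event, context):
    request = event["request"]
    if request["type"] == "IntentRequest" and \
            request["intent"]["name"] == "MoveIntent":
        direction = request["intent"]["slots"]["direction"]["value"]
        speech = "Moving " + direction
    else:
        speech = "Sorry, I did not understand that."
    # Standard Alexa Skills Kit response envelope.
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech},
            "shouldEndSession": True,
        },
    }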


Figure A.5: Turtle simulation for AR tag following


Figure A.6: Detailed device log


Appendix B

Neural Network

Figure B.1: Training Algorithm [47]
