
An intuitive motion-based input model for mobile devices

Mark Richards

Thesis submitted for the degree of Masters of Information Technology (Research) to the School of Information Systems at the Queensland University of Technology, Australia.

December, 2006.

Keywords

Input Model, Human-Computer Interface, Mobile Device, User Interface, Input Devices, Interaction Styles, Automated Survey, DirectX Mobile, DirectShow, Windows Mobile, Computer Vision, Image Processing, Edge Detection, Object Detection, Motion Tracking, Scene Analysis, Human Movement, ARToolkit, Augmented Reality.

Publication arising from this research

Richards, M., Dunn, T.L. and Pham, B. "Developing a Motion Input Model for Mobile Devices." HCI International 2007. LNCS Digital Library (LNCS, http://www.springer.com/lncs).

In Preparation:

"Automated Data Collection – User Testing on Mobile Devices"

"Implementing Reactive Object Detection on Windows Mobile"

"Analogue Input: Using a video and light to augment buttons"

Contents

1. INTRODUCTION
   1.1. Research Problem
   1.2. Research Aims & Objectives
   1.3. Research Questions
   1.4. Research Rationale
   1.5. Significance of Research
   1.6. Limitations of Research
   1.7. Organisation of Thesis

2. RESEARCH METHODOLOGY
   2.1. Research Approach
   2.2. Requirement Analysis
   2.3. Research Breakdown
      2.3.1. Communication model creation
      2.3.2. Survey Construction/Consideration
      2.3.3. Implementation upon desktop
      2.3.4. Implementation upon mobile
   2.4. Deliverable Outcomes
   2.5. Reliability of Results
   2.6. Problems Encountered
   2.7. Ethical Considerations

3. REQUIREMENT SPECIFICATIONS
   3.1. User Requirements
   3.2. Scope of Communications
      3.2.1. Low-Level
      3.2.2. Textual Input
         3.2.2.1. Hex
         3.2.2.2. Graffiti & Unistrokes
      3.2.3. High Level
         3.2.3.1. DyPERS
         3.2.3.2. Profiling Usage
   3.3. Motion and Error Detection
      3.3.1. Movement Tracking
         3.3.1.1. Axial
         3.3.1.2. Rotational
      3.3.2. Error Detection and Correction
         3.3.2.1. Gait Phases
   3.4. Detection Algorithms
      3.4.1. Edge Detection
      3.4.2. Object Detection
   3.5. Algorithm to determine appropriateness
   3.6. Model Design
      3.6.1. Interaction Breakdown
      3.6.2. Input Types (High level commands)
      3.6.3. Input Motions
   3.7. Gesture Recognition
      3.7.1. Profiling users
      3.7.2. Input Prediction

4. SURVEY CONSTRUCTION
   4.1. Aim
   4.2. Survey Structure
   4.3. Survey One – Initial Data Collecting
      4.3.1. Design
         4.3.1.1. Survey One: Part One – Understanding the Users
         4.3.1.2. Survey One: Part Two – Gauging User's Reactions
      4.3.2. Participants
      4.3.3. Data Collection Methods
      4.3.4. Ethical Considerations
      4.3.5. Initial Analysis
      4.3.6. Particulars of Note
   4.4. Development of an Autonomous Survey on a PDA
      4.4.1. An automated survey versus traditional means
      4.4.2. Smartphone Development
         4.4.2.1. Device Information
         4.4.2.2. .NET CF and Embedded C++
         4.4.2.3. Camera API
      4.4.3. Input mediums of a PDA
         4.4.3.1. Textual Input
         4.4.3.2. Touch Input
         4.4.3.3. Audio
         4.4.3.4. Video
         4.4.3.5. Motion
         4.4.3.6. Other
      4.4.4. Information Storage
         4.4.4.1. Video and Audio
         4.4.4.2. Other Data
      4.4.5. Questions and Survey Design
         4.4.5.1. Survey
         4.4.5.2. Questions
      4.4.6. Survey Environments
         4.4.6.1. Appropriate environments for use of an automated survey
         4.4.6.2. Preparing environments for more meaningful video results
      4.4.7. Common Obstacles
      4.4.8. Device Resources
      4.4.9. Automated Survey Summary
   4.5. Survey Two – Testing User Decisions and Reactions
      4.5.1. Design
      4.5.2. Data Collection Methods
      4.5.3. Survey Conditions
      4.5.4. The Tests
         4.5.4.1. Functionality Required
         4.5.4.2. Choosing Data on the Screen
         4.5.4.3. Adjustment to Counteract the Changing Image
         4.5.4.4. Modification – Dealing with a Warped World
         4.5.4.5. Confirmation with the Faces
         4.5.4.6. Scrolling Functionality
         4.5.4.7. Simulating a Phone Call
      4.5.5. Participants
      4.5.6. Additional Details/Observations
   4.6. Survey and Concept Summary

5. BASIC MODEL CREATION
   5.1. Collected Data Classification
   5.2. Situational Motions
   5.3. Model Expandability
   5.4. Inappropriate Commands
   5.5. The Base Model
      5.5.1. Confirmation
      5.5.2. Movement
      5.5.3. Choosing
      5.5.4. Selection
   5.6. Summary of Model Creation

6. PROTOTYPE DEVELOPMENT
   6.1. Using DirectShow
      6.1.1. DirectShow Filters
   6.2. Desktop Development
      6.2.1. Using a Filter to detect squares
         6.2.1.1. The Augmented Reality Toolkit
         6.2.1.2. DirectShow Filter Basics
      6.2.3. Mapping to 3D
      6.2.4. Tracking motion from cube transformations
   6.3. Windows Mobile 5 Development
      6.3.1. Porting detection filters
      6.3.2. Camera Initialization
      6.3.3. Image Output
      6.3.4. Data Display
   6.4. Prototype Summary

7. CONCLUSIONS AND FUTURE WORK
   7.1. Answers to Research Questions
      7.1.1. What functionalities of a phone's features are appropriate candidates to be used as parts of a motion input scheme?
      7.1.2. Is it possible to construct a rational and useable mapping scheme for phone inputs?
      7.1.3. Can people adapt to using motion gestures as an input medium and what are considered suitable (not embarrassing, over-exertive) motions to perform?
      7.1.4. How uniformly do people perform motions given to them (different people, slight difference in movement) and can these variations be adapted to?
      7.1.5. How suitable are images (collected by the embedded mobile cameras) for in-depth image processing?
      7.1.6. Can real-time performance of image detection algorithms and movement calculations on Smartphones™ be achieved?
      7.1.7. Will tracking movement critical to this project unexpectedly interfere with the normal usage of the phone?
   7.2. Contributions to Research
   7.3. Limitations
   7.4. Potential Applications
   7.5. Future Work

APPENDIX A – SAMPLE INPUTS
APPENDIX B – INPUT TYPE BREAKDOWN
APPENDIX C – SURVEY ONE HANDOUT
APPENDIX D – SURVEY ONE PARTICIPANT BREAKDOWN
APPENDIX E – SURVEY TWO AUDIO
APPENDIX F – SAMPLE SURVEY TWO RESULTS
APPENDIX G – BASE INPUT COMPRESSION
APPENDIX H – BASE SITUATIONS
APPENDIX I – SAMPLE MOTION MODEL

BIBLIOGRAPHY

List of Figures

Figure 1-1: Thesis Breakdown
Figure 3-1: The Hex Interface
Figure 3-2: Hex in Action
Figure 3-3: Natural Letter Matching
Figure 3-4: Letter Subsets
Figure 3-5: Single Strokes
Figure 3-6: DyPERS in Action
Figure 4-1: Basic Outline of Survey Situation
Figure 4-11: Modification Image, High Brightness
Figure 4-12: A Happy Face
Figure 4-13: Sad Face
Figure 4-14: The Functionality Test and the Text within
Figure 5-1: Example of how the Information fits together from the Model
Figure 6-1: GraphEdit
Figure 6-2: Image Data with Alpha Channel (Note the 00's)
Figure 6-3: Image Data without an Alpha Channel
Figure 6-4: Translation of Two-Dimensional Screen Data to Direct3D Polygon Format
Figure 6-5: Image without and with Display Filter applied
Figure D-1: Age of Participants, Survey One
Figure D-2: Nationality of Participants, Survey One
Figure D-3: Education of Participants, Survey One
Figure D-4: Employment of Participants, Survey One
Figure I-1: Model Map, Direction Down
Figure I-2: Model Map, Direction Up
Figure I-3: Model Map, Direction Left
Figure I-4: Model Map, Direction Right

List of Tables

Table 1: Header of Video Filenames for Survey Two
Table 2: Movement to Input Mapping
Table 3: Precise Input to Motion Mapping
Table 4: Input Breakdown
Table 5: Sample 1, Survey Two Motion Breakdown
Table 6: Input Compression Part 1, Survey Two
Table 7: Input Compression Part 2, Survey Two
Table 8: Context Relationships

Acronyms Used

.NET Microsoft .NET Framework

.NETcf Microsoft .NET Compact Framework

API Application Programming Interface

ASF Advanced Streaming Format

ATL Active Template Library

AVI Audio Video Interleaved

BDA Broadcast Driver Architecture

BGR Blue, Green, Red (Inverted Image Format)

BGRA Blue, Green, Red, Alpha (Inverted Image Format)

COM Component Object Model

DLL Dynamic Link Library

FOV Field of View

GAPI Graphics Application Programming Interface

GPS Global Positioning System

GUID Globally Unique Identifier

LoG Laplacian of Gaussian

MAP Most Appropriate Polygon

MFC Microsoft Foundation Classes

MPEG4 Moving Picture Experts Group Standard 4

MSDN Microsoft Developer Network

PDA Personal Digital Assistant

RGB Red, Green, Blue (Image Format)

SDK Software Development Kit

USB Universal Serial Bus

WDM Windows Driver Model

XVid Digital Video Compression Format based on DivX (MPEG-4)

Statement of Authorship

The work contained in this thesis has not been previously submitted to meet requirements for an award at this or any other higher education institution. To the best of my knowledge and belief, this thesis contains no material previously published or written by another person except where due reference is made.

Signature:

Date:

Abstract

Traditional methods of input on mobile devices are cumbersome and difficult to use. Devices have become smaller, while their operating systems have become more complex, to the extent that they are approaching the level of functionality found on desktop computer operating systems. The buttons and toggle-sticks currently employed by mobile devices are a relatively poor replacement for the keyboard and mouse style user interfaces used on their desktop computer counterparts. For example, when looking at a screen image on a device, we should be able to move the device to the left to indicate we wish the image to be panned in the same direction.

This research investigates a new input model based on the natural hand motions and reactions of users. The model developed by this work uses the generic embedded video cameras available on almost all current-generation mobile devices to determine how the device is being moved, and maps this movement to an appropriate action.

Surveys using mobile devices were undertaken to determine both the appropriateness and efficacy of such a model, as well as to collect the foundational data with which to build the model. Direct mappings between motions and inputs were achieved by analysing users' motions and reactions in response to different tasks.

Once the framework was completed, a proof of concept was created on the Windows Mobile platform. This proof of concept leverages both DirectShow and Direct3D to track objects in the video stream, maps these objects to a three-dimensional plane, and determines device movements from this data.

This input model holds the promise of being a simpler and more intuitive method for users to interact with their mobile devices, and has the added advantage that no hardware additions or modifications to existing mobile devices are required.

1. Introduction

Traditionally, mobile devices have had a very restricted range of input models for user interaction, with most current devices still relying on buttons for user input. More advanced devices may also include 4- to 9-way toggle switches; however, these work on an identical input model. A major advancement, and the only concept that can claim to have broken away from button-style input on these devices while achieving significant market penetration, is the touch-screen and stylus interface. These devices, however, often simply display "virtual buttons" that act in a similar way to the physical kind.

Scratchpad writing employs a different approach, whereby the user inputs information as if writing with ink and paper. This is typically text-based input: the device devotes processor power to handwriting recognition and converts written text directly to digital text. Several methods have been devised to further increase input speed. Graffiti [29] is one such method, which simplifies inputs and decreases the chance of error. Another is Unistrokes [12], a method that breaks letters into single lines, meaning there is no need to lift the stylus at all. Such advancements aid people who own touch-screen devices in their everyday interactions with the device.

Many mobile devices currently in use (with or without a touch screen) incorporate digital image/video capture into their architecture. This opens up a new medium through which users can transmit information to the device. Image processing algorithms can be used to process video information sent to the device by the user, with total freedom of movement along all three axes. This opens up a new dimension to the touch-screen concepts discussed in the previous paragraph, while removing two restrictions: the small input environment and the need for an external input device (the stylus).

This new input dimension allows motions to be tracked from the hand of the user holding the device. This movement can supply information directly to the device, and can be taken advantage of as a new form of input, using the device itself and how it is moved as the focus.

1.1. Research Problem

This project investigated the viability of a motion-based input model on mobile devices as a whole. The focus has been on current (and next) generation Microsoft mobile devices. As this is a new area of research, there is little published work regarding the development of a framework mapping possible inputs to device functionality. Therefore, this project began with the development of a suitable framework to meet the needs of mobile device users, with the potential to make significant advances in the way users interact with these devices. The framework developed incorporates appropriate commands, extensibility, ease of use and a gentle learning curve. Such a design should ensure it has relevance and advantages over currently available input methods.

Devices currently use button presses as the major input medium. This can often be a problem, as the buttons are often small and difficult to use (this is becoming more and more apparent with full keyboards being implemented on mobile devices). Many of these inputs are for simple functions that realistically do not require button presses, as the user's intent can be gathered from how they move the device about. Being small multifunctional devices, they are naturally held and moved in different ways by the user depending on the desired use. Detecting this movement, and hence the user's intent, allows the removal of one step of the input process and a more streamlined experience.

1.2. Research Aims & Objectives

The ultimate goal of this project was to develop a functional prototype that interprets the user's motion input from an embedded camera and translates it into device instructions. This can be broken down into four factors:

• Investigating a movement/motion framework that is intuitive to end users and adds a respectable number of functional possibilities for the phone.
• Creating a base framework by surveying a wide variety of mobile phone users and mapping functionality to motions based on their responses.
• Developing an application that processes camera data and works out vector motions.
• Developing an application that takes this motion data and compiles it for use with the framework above.

1.3. Research Questions

The following questions were investigated in this project:

• What functionalities of a phone's features are appropriate candidates to be used as parts of a motion input scheme?
• Is it possible to construct a rational and useable mapping scheme for phone inputs?
• Can people adapt to using motion gestures as an input medium, and what are considered suitable (not embarrassing, over-exertive) motions to perform?
• How uniformly do people perform motions given to them (different people, slight differences in movement), and can these variations be adapted to?
• How suitable are images (collected by the embedded mobile cameras) for in-depth image processing?
• Can real-time performance of image detection algorithms and movement calculations on Smartphones™ be achieved?
• Will tracking movement critical to this project unexpectedly interfere with the normal usage of the phone?

1.4. Research Rationale

This path of research was chosen for a number of reasons, including:

• The market penetration of mobile devices with embedded cameras that can be used for more than just taking pictures/video.
• The lack of true user interface advancements on such mobile devices.
• The recent popularisation of Microsoft and Java mobile devices, opening up development to additional users.
• The recently released Windows Mobile 5, which significantly advances the ability to interact with the camera of a device.
• Recent research papers showing some of the possibilities that can be achieved by using a camera as the input stimulus.

1.5. Significance of Research

Mobile phone development is moving at a very rapid pace, with these devices integrating more and more functionality with each generation of phone. Therefore, it is desirable to explore alternative concepts that the enhanced power and functionality make available. A principal component of phone architecture that has remained relatively unchanged is the way users interface with the devices: typically via the number pad buttons, soft keys or a joystick. While it is commonly understood that a 103+ key desktop keyboard delivers more throughput than a mouse (or other common interface device), the same does not hold for the more limited 12-20 keys commonly available on smaller mobile devices.

There appears to be a growing requirement for other, more efficient interfacing models, especially those that target the large proportion of less electronically savvy mobile phone end users. Without an external device to aid input, motion and voice commands are the obvious choices. Voice commands are already implemented in many commercial devices, and most motion research currently relies on external devices to collect the movement information. Therefore, creating a fully integrated solution that is easily expandable would prove useful in both the commercial and research fields.

Creating a functional input model that is compatible with mobile devices as a whole would allow further inroads into natural input interfaces. A device-independent model could be adapted to a multitude of devices, such as wearable equipment (including, but not limited to, wristwatches and heads-up devices).

1.6. Limitations of Research

The experimental implementation of this project is obviously limited by currently available hardware devices. In addition, there are a few other areas of limitation that should be noted:

• The acceptance of such a model by owners and potential owners of applicable devices.
• Only Microsoft Smartphones™ have been targeted, not Symbian™ or Linux-based devices, as the development paths for these devices would be totally different.
• The lack of a mature Camera API may limit the number of Smartphone™ devices on which motion detection will work.
• The tuning of suitable means to track motion on devices, because of their limited battery and CPU power.
• Limited funding has restricted testing and the availability of hardware for the development stage.
• Limited time has restricted the development of a complete package, limiting the output of this work to the completion of an initial framework.

1.7. Organisation of Thesis

This thesis is organised into major categories documenting the steps taken through the course of the research. There are three major sections: the collection of data, the building of the model, and an implementation phase. This is outlined in the diagram below (Figure 1-1).

Figure 1-1: Thesis Breakdown

A brief overview of the chapters follows:

Chapter 2 – Research Methodology

This chapter breaks down what needs to happen for this research to be considered a success. It covers the early planning of how the research should take place and which steps need to happen in what order. Early dissection of some important topics (communication) occurs so that further understanding can be achieved.

Chapter 3 – Requirement Specifications

Information regarding what users would need from an input model is discussed here, along with further discussion of motion itself and how best to recognise it with the resources available.

Chapter 4 – Survey Construction

Documents the procedures taken to create and conduct surveys to collect user data for the model. Two separate surveys were undertaken, and their purposes and results are recorded here.

Chapter 5 – Basic Model Creation

With data available, a model to start mapping motions to inputs became possible. This chapter discusses the design of the model itself, how it can be used, and situations where the model might not be the best input model to use. Early mappings that incorporate findings from the surveys are included.

Chapter 6 – Prototype Development

With a plausible model in place, effort was directed at determining whether the mobile devices of the day were capable of recording and processing video efficiently enough to track motion. Approaches had been discussed throughout the thesis, and by this stage an approach had been determined. This chapter discusses the processes undertaken to achieve a working prototype using these means.

Chapter 7 – Conclusions and Findings

This chapter sums up the research performed by discussing the contributions made in the process, as well as uses for the findings in both further research and commercial settings.

2. Research Methodology

This chapter breaks down the proposed research at the highest level and discusses the expected outcomes. Early thoughts on the separate sections of the research (data, model and implementation) are also presented, along with brief information on how they were carried out. Problems and difficulties are also discussed.

2.1. Research Approach

This research can be broken into multiple significant stages that are independent of each other and required different approaches to reach a conclusion. These are:

• Defining Project Scope (Deliverable Outcomes, Sect 2.4)
• Requirement Analysis (Sect 2.2)
• Motion and input classification (Chapter 3)
• User surveys (Chapter 4)
• Communication model creation (Sect 2.3.1)
• Implementation upon desktop (Sect 2.3.3)
• Implementation upon mobile (Sect 2.3.4)

Strong separation between the tasks was maintained to ensure that clear goals remained in case of complications. This approach ensured the tasks were completed and delivered significant research contributions despite the inevitable unforeseen problems encountered.

2.2. Requirement Analysis

A general understanding of how to use and analyse both motions and inputs needed to be developed so that classifications could be created. These classifications were built upon as the project progressed.

The communication model was initially devised from user input and suggestions gathered via a specially designed survey. The results of this survey were melded into an initial model, which was tested and revised through further user feedback into a suitable model. This model was then incorporated into a final design and integrated into the prototype. Before the surveys were created, classifications and further understanding needed to be developed: a solid set of base data (inputs to be understood and motions to be read) needed to be outlined and broken down. A qualitative approach was taken throughout the development of the model.

The set of inputs created was varied, and encompassed a great deal of the functionality of the phone at both low and high level. Once this was done, the inputs were classified into similar groups, on the premise that similar inputs would have similar motions. Similar motions may look different by sight alone; what makes them similar is similarity in the properties of the motion: short/long, number of direction changes, and so on.

Breaking development into desktop and mobile stages was possible because of the similarity of developing .NET/COM applications on both platforms, especially when developing at the low level. An identical development suite and emulators ensured an increase in productivity while remaining relevant to the final goal. The increased processing power of the desktop allowed positive results to be seen earlier, confirming the right path was being taken. Changes and tweaks were then made to increase performance to the degree that the algorithms perform well enough on the less powerful mobile platform.

A large portion of the motion analysis research is already available [16], making the desktop development stage experimental, involving testing the performance and reliability of available algorithms. The majority of the work was tested for performance and for the suitability of approaches for porting to a mobile device. Preparation for the phone port was experimental and included some exploratory phases, as there is little previous work in this field. Changes to currently available algorithms were examined to cater to mobile devices, which typically have poorer cameras and less processing power.

2.3. Research Breakdown

The three stages (Survey, Model and Implementation) relied on each other for the completion of the research. This section gives a brief introduction to each of these stages.

2.3.1. Communication model creation

A suitable communication model for end users is extremely important, and will ultimately dictate whether such a project is truly viable for general use. Simplified, this model consists of a list of commands and/or situations, and a corresponding list specifying how the user is to achieve or act upon each one, as sketched below.
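As a minimal illustration of this two-list structure, the following C++ sketch keys classified motions to device commands. The motion and command names are illustrative placeholders, not the thesis's actual identifiers, and the sketch assumes a detection layer has already classified the raw movement into a discrete motion.

```cpp
#include <map>
#include <string>

// Hypothetical motion classes produced by the detection layer.
enum class Motion { TiltLeft, TiltRight, PushForward, PullBack, ShakeTwice };

// The core of the communication model: each recognised motion maps to the
// device command (or situation) it should trigger.
std::map<Motion, std::string> BuildBaseModel()
{
    return {
        { Motion::TiltLeft,    "pan_left"  },
        { Motion::TiltRight,   "pan_right" },
        { Motion::PushForward, "zoom_in"   },
        { Motion::PullBack,    "zoom_out"  },
        { Motion::ShakeTwice,  "cancel"    },
    };
}
```

Keeping the mapping as a simple lookup table in this way is what makes such a model easy to extend: adding a command becomes a data change rather than a code change.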

The first step taken towards developing such a communication model was to collect research data regarding the various possible input movements. Analysing the collected data enabled the creation of an initial prototype of the model. A methodology was created to classify the collected information and to manage its storage and recall (see Section 3.5). Experiments with users were then conducted to determine the general suitability of the model, with user input suggesting possible modifications. With the classifications and the breakdown of inputs into categories, guidelines were created that suggest appropriate motions for inputs depending on their category (covered in Sections 4-6).

2.3.2. Survey Construction/Consideration

Due to limited resources, the scale of the conducted surveys was small. Because of the nature of the information required, it was simply not sufficient to request written answers to the posed questions; movement data was most efficiently captured on video. This further stretched resources and time, simply because only one set of answers could be collected at a time in such a manner.

The initial plan entailed finding 5-15 volunteers and supplying them with the survey details the day before, so that they could think about their answers (motions) overnight. These answers were videotaped with audio prompts so they could be archived and analysed.

Common properties of motions were collected over the entire group of volunteers to determine overall motion information. This was then verified in the second survey.

2.3.3. Implementation upon desktop

Creating an initial image processing setup on the desktop allowed the possible results of the work to be seen far earlier; in that sense it can be considered a prototype of the final work. Developed primarily in C++, a step-by-step approach to the actual final process of retrieving the required data could be used. These steps were:

• Retrieving data from the camera.

This was emulated by the use of a web camera, since web cameras typically give a lower image quality, much like phone cameras. Since the data is a stream, it was stored and analysed frame by frame, with the aid of markers around the room.

• Collecting edge information from the camera.

Processing the currently stored frame, edges were found in such a way that the resulting image is not overly complex, which simplifies the next step in the process. This was done in multiple ways to find the most efficient process (Figure 2-1).

• Creating vertex data from the edge image.

Finding true vertices in each image allowed the comparison of sequential frames, because vertices will generally only change in location, and the change will be consistent throughout the image. This vertex data was stored in arrays to allow quick and efficient access.

• Comparing to previous data to obtain a result.

Comparing the arrays of sequential frames made it possible to find general directions of movement. A robust method was required that ignored false data and averaged the vertex information over several frames. Some commands incorporated multiple directions, and hence past information was stored as well. The performance of this method depends on the complexity of the arrays of temporal data collected. A sketch of this comparison step follows this list.
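As a rough illustration of the comparison step, the C++ sketch below pairs each vertex with its nearest neighbour in the previous frame and averages the offsets to obtain a dominant motion vector. The distance threshold and data layout are assumptions made for illustration; the actual implementation additionally rejected false data and smoothed the result over several frames.

```cpp
#include <vector>

struct Vertex { float x, y; };

// Estimate the dominant frame-to-frame motion: match each current vertex to
// its nearest neighbour in the previous frame, then average the offsets.
Vertex EstimateMotion(const std::vector<Vertex>& prev,
                      const std::vector<Vertex>& curr)
{
    Vertex total = {0.0f, 0.0f};
    int matched = 0;
    for (const Vertex& c : curr) {
        float bestDist = 1e9f;
        Vertex bestOffset = {0.0f, 0.0f};
        for (const Vertex& p : prev) {
            float dx = c.x - p.x, dy = c.y - p.y;
            float d = dx * dx + dy * dy;
            if (d < bestDist) { bestDist = d; bestOffset = {dx, dy}; }
        }
        if (bestDist < 400.0f) {          // discard matches beyond ~20 pixels
            total.x += bestOffset.x;
            total.y += bestOffset.y;
            ++matched;
        }
    }
    if (matched > 0) { total.x /= matched; total.y /= matched; }
    return total;  // average screen-space offset of the scene
}
```

Note the sign convention: if the scene appears to move left in the image, the device itself has been moved to the right.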

The development platform was a relatively powerful computer with a web camera attached to record information. Such a platform gave us the ability to emulate the final product with a generally more flexible infrastructure. It also allowed a workable solution to be found first, after which effort could go into improving efficiency and making the approach more applicable to the final platform.

To test edge detection performance in .NET, a simple application was created to load an image and dynamically detect edges given certain thresholds. To improve speed and performance, much of the code was written as 'unsafe' (unmanaged) to avoid .NET overheads (such as garbage collection and memory management). This was an attempt to approximate the environment that would be used on mobile devices.

Figure 2-1: A Dynamic Canny Edge Detection Algorithm using .NET
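The test application itself was C# with 'unsafe' pointer loops over the bitmap data. As a language-neutral illustration of the per-pixel work involved, here is a minimal C++ sketch of the gradient-and-threshold core of an edge detector; it uses a Sobel approximation rather than the full Canny pipeline, which additionally performs Gaussian smoothing, non-maximum suppression and hysteresis thresholding.

```cpp
#include <cstdint>
#include <cstdlib>

// Gradient-magnitude edge pass over a grayscale frame. Border pixels are
// skipped; 'threshold' plays the role of the adjustable threshold in the
// test application.
void DetectEdges(const uint8_t* src, uint8_t* dst, int w, int h, int threshold)
{
    for (int y = 1; y < h - 1; ++y) {
        for (int x = 1; x < w - 1; ++x) {
            const uint8_t* p = src + y * w + x;
            // 3x3 Sobel kernels for horizontal and vertical gradients.
            int gx = -p[-w-1] - 2*p[-1] - p[w-1] + p[-w+1] + 2*p[1] + p[w+1];
            int gy = -p[-w-1] - 2*p[-w] - p[-w+1] + p[w-1] + 2*p[w] + p[w+1];
            int mag = abs(gx) + abs(gy);   // cheap |G| approximation
            dst[y * w + x] = (mag > threshold) ? 255 : 0;
        }
    }
}
```

Avoiding per-pixel bounds checks and allocations, as this loop does, is the same motivation the thesis cites for using 'unsafe' code in .NET.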

2.3.4. Implementation upon mobile

Porting the desktop implementation to a mobile application depended greatly on considerations made earlier. Relying on low-level C++ programming earlier in the project facilitated the conversion. The greatest change was moving from web camera data to the embedded camera of the devices. Code changes were required to improve performance on the more restricted mobile platform, but generally, most of the code carried over from the original desktop implementation. Figure 2-2 illustrates the porting of Figure 2-1 to a mobile device emulator.

A simplified proof of concept of the desktop implementation was created that performed edge detection. This example used GAPI; however, the final version used the DirectShow implementation within Windows Mobile 5.

Figure 2-2: Edge Detection upon Mobile Emulator
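For readers unfamiliar with DirectShow, the sketch below shows the standard COM pattern for assembling a capture graph. It is a generic, desktop-style example under stated assumptions (COM already initialised, camera source filter already created), not the thesis's actual filter code; the Windows Mobile 5 port differs mainly in how the camera source filter is instantiated.

```cpp
#include <dshow.h>
#pragma comment(lib, "strmiids.lib")

// Build a graph that previews video from an existing camera source filter.
// Assumes CoInitialize(Ex) has already been called.
HRESULT BuildPreviewGraph(IBaseFilter* pCamera)
{
    IGraphBuilder* pGraph = NULL;
    ICaptureGraphBuilder2* pBuilder = NULL;

    HRESULT hr = CoCreateInstance(CLSID_FilterGraph, NULL, CLSCTX_INPROC_SERVER,
                                  IID_IGraphBuilder, (void**)&pGraph);
    if (FAILED(hr)) return hr;

    hr = CoCreateInstance(CLSID_CaptureGraphBuilder2, NULL, CLSCTX_INPROC_SERVER,
                          IID_ICaptureGraphBuilder2, (void**)&pBuilder);
    if (SUCCEEDED(hr)) hr = pBuilder->SetFiltergraph(pGraph);

    // Add the camera and let the builder connect its preview pin through any
    // intermediate transform filters to a video renderer.
    if (SUCCEEDED(hr)) hr = pGraph->AddFilter(pCamera, L"Camera");
    if (SUCCEEDED(hr)) hr = pBuilder->RenderStream(&PIN_CATEGORY_PREVIEW,
                                                   &MEDIATYPE_Video,
                                                   pCamera, NULL, NULL);
    // Interface release omitted for brevity.
    return hr;
}
```

A custom transform filter inserted between the camera and the renderer is where per-frame processing such as edge detection takes place.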

Page 28: An intuitive motion-based input model for mobile devicesAn intuitive motion-based input model for mobile devices Mark Richards Thesis submitted for the degree of Masters of Information

An intuitive motion based model for mobile devices – Research Methodology

28 | P a g e

2.4. Deliverable Outcomes There were well defined deliverables expected from each stage to show definite progression in

the work. These were (stages highlighted in the previous section):

• Stage 1

A detailed and appropriate map showing basic functionality and the recommended motions for

that functionality should be deliverable. Documentation on how this was achieved and how it

is suitable for applying to other inputs must be present. Multiple repetitions and documented

changes/advancements should also be presentable as well as user feedback through the

process.

These maps should be easily applied by other designers for their own input motion techniques

and therefore be simple to understand and the steps behind them straight forward to

implement.

• Stage 2

A working prototype that maps direction vectors to the direction a web camera is moving.

This should display and record the information on the direction being moved as well as do

some simple filtering of garbage data collected. This code should be designed in such a way

that it is possible to transfer it relatively simply between the desktop and a mobile device.

Ultimately the two should be capable of being worked on simultaneously.

• Stage 3

Stage 3 consists of a mobile implementation of stage 2, as well as applications to test and showcase the functionality. This is achieved by using the in-built camera of the mobile devices to track the motion in which the device is being moved. Given the limited resources available on these devices, the software performing this task should use few resources, so that the device can still operate as a phone without significant slowdown caused by the application. Some of these applications should also incorporate data collected from stage 1.

2.5. Reliability of Results

Results from stages 2 and 3 were verified by test and demonstration, since the original data (the direction the devices were moved) can easily be checked against the results (the direction the algorithms


believe the device has moved in). Applications were created during the course of the detection development to test that movement was being tracked correctly. The scope of these applications gradually increased as development progressed.

Originally, the applications developed simply showed this movement information so that multiple people could judge the correctness of the detection routines. These applications included, firstly, the rendering of boxes (drawn upon tracked sections of the image, Section 7.2.1) that moved as the device moved; a simple 3D compass that pointed in the direction the device was being moved; and finally an application that drew (in a 2D representation) the movement 'strokes' the device made as it was moved. These applications are shown throughout Chapter 7 as demonstrations of the development process.

The success of stage 1 was judged by taking the designed model, applying it to certain application types (such as the web browser in Appendix I) and then determining whether it was indeed useful. This was to be performed by querying end users, but the release of VueFlo (http://www.theunwired.net/?item=videoview-htc-vueflo-easy-navigation-technology) on the HTC Athena mobile device allowed a different approach. VueFlo uses the same concept but with specialist hardware; the model uses nearly identical motions to this software, and the overall very positive reviews of VueFlo also show that the model allows for reliable motions to be used.

2.6. Problems Encountered

Being a fast-moving and relatively new field, the possible risks are generally not well defined. The following are the risks identified at the outset of this project which were encountered during the course of the work.

• Redundancy of research

With the release of Windows Mobile 5 [22] came a wealth of undocumented functionality (in particular, interaction with the camera) and improved interaction over older versions of the mobile operating system. With this improved functionality, some of the earlier work (in particular, code operating the camera) was made redundant. Most of this older work had to be restarted using the new methods of interaction.

• Processing power


Currently available devices (to myself and others) simply do not have enough processing power to perform the tasks required, regardless of how optimised the code is. What has been developed stresses the devices used significantly, and hence less powerful devices currently on the market may struggle. Significantly more powerful devices are just becoming available that will supersede the current ones, making this work far more effective.

• Camera imagery

The images produced by devices are currently of relatively low quality, and frame rates are quite low. Again, newer devices will help alleviate this problem through hardware improvements and allow this work to more closely mimic its desktop counterpart. Lower quality video means less information can be extracted from it, and therefore less accurate motion detection.

2.7. Ethical Considerations

The actual development and testing of the model required no special permissions or considerations in regard to people or animals. However, while developing survey questions it was necessary to consider cultural backgrounds, because certain motions may be considered offensive or degrading. The ethical statement included in the transcript of Survey 1 is in Appendix C.


3. Requirement Specifications

When developing a model aimed at normal, everyday users, consideration has to be given to ensuring it suits the diverse requirements of the target users. Discussion of these requirements is included in Section 3.1. Possible expansions of the requirements once the model and implementation have matured are also discussed, along with factors that are of no importance to the end user, such as the question of how best to find motion from video.

3.1. User Requirements

To be considered a success, this project needed to meet users' needs in such a way that the created framework offers significant improvements over existing input frameworks. Requirements to achieve these goals included:

• Easy to remember commands

• Logical actions

• Rapid access to functions

• Reliable input parsing

• Easily extendable

• Inter-operable with other input methods

Easy to remember commands – Short and simple commands that make sense to the user are highly practical and far more likely to be used than commands that make no sense (e.g. moving the device up, left, then forward to choose an option to the left is confusing and should be avoided at all costs). The same applies to command time: a long command with many instructions is inferior to a short one- or two-stroke command.

Logical actions – Moving the device left to perform a command that naturally involves a left inclination is far more logical than an ambiguous set of commands. This can be applied to more complex actions as well. The natural reaction to an overly loud volume is to move the device away from the ear, so this would be a logical command for that action.

Rapid access to functions – If a hierarchical system is used to access commands, then the traversal through these menus should be streamlined. This is to decrease the time taken to


access commands. This suggests that shortcuts for the most commonly used commands should be available.

Reliable input parsing – Handling ambiguous commands is a necessity with such input methods. Should the device drop commands it cannot fully understand, or try to derive the user's intended meaning? And if the latter is chosen, should the input be compared against the most commonly used commands, given rankings, and judged by which actions it resembles?

Easily extendable – Both commands and actions must be able to be easily added to the

completed work.

Inter-operable with other input methods – Users may wish to also use other available

communication methods with their devices. The framework should respect this and keep such

methods available for use at all times.

3.2. Scope of Communications

Applying meaning to a user's motion is the crux of this project, and as time has passed the importance of the different levels of communication the user can perform has become more apparent. Initially, the scope of this project was centred on the input of text, which has taken a far less important role over the duration of this work.

Communication with the device can incorporate simple, computer-like instructions or more complex motions. Such motions can be mapped to specific functionality depending on the current situation and context.

3.2.1. Low-Level

Low-level communications, much like their language counterpart, are simple phrases that perform a pre-defined action. In the typical computer domain, actions such as deleting a file, clicking a button or closing a window would be considered low-level interactions by the user [1, 16, 28].

Many human motions can also be considered low-level actions. Nodding the head in confirmation or denial is one such example. A simple Yes/No situation will occur often while


interacting with a mobile device. Input for such events can easily be transferred to a simple

movement of the device that imitates the head nodding [4].

Other basic motions can also be applied to other low level situations. Simple examples

include:

• Rolling the device down/up to scroll through a list of options.

• Moving the device along a plane to pan around a displayed image.

• A simple shake of the device to choose an option or to acknowledge an event.

Many of these simple inputs currently require forces applied to numerous input receptors

(buttons, scroll wheels, toggle sticks and scratchpads) simply because of the limitations of

these switches. This can cause inconsistency problems between applications and extra effort

as the user tries to operate multiple input mechanisms simultaneously. A single input model

based on a user’s natural actions will provide significant progress.

Identifying when the user’s motion is to be read and interpreted may be problematic; therefore, a way to activate the reading of motion may be advantageous. A

simple push of a button to enable/disable the reading of camera motion was investigated in the

early stages of development.

3.2.2. Textual Input

T9 [34] and other button-press methods are not the only ways a user can input text into a device. Motion, much like the natural writing of text, can be applied to a mobile device as an input mechanism. Whether it is the moving of a pen-like implement (stylus) over a flat surface, or the moving of the device itself, text input has always been a significant portion of human-computer interaction [13].

3.2.2.1. Hex

Hex [40] is currently under development at the University of Glasgow as a text-entry human-computer interface. A typical PDA device is connected to an accelerometer (a device that detects movement, and in particular device rotation) while the user is presented with a graphical sphere of hexagons as an input aid. A dot is manipulated upon the screen by tilting the device and letting the dot ‘fall’ in a certain direction.


Figure 3-1: The Hex Interface

Letters are grouped in the hexagonal grids according to certain guidelines that are designed to aid text input speed. Vowels are grouped together at the top, since they are commonly accessed. Other common groupings are bundled together, with the least common set being given the down-left position (Figure 3.1). Upon entering a new hexagon, the device switches to the new screen and again splits the letters, this time into single letters so one can be entered.

A predictive model is applied for fuzzy inputs to aid entry (when input falls close to two hexagons, the likelihood of the next letter group and the closeness to the border are examined to determine the user’s choice). A more recent addition is a second predictive model that aids choosing the next letter by making it easier to tilt to: letters less likely to be used take more effort to reach (rolling uphill, for example).

This results in a method where each letter can be accessed with two tilts of the device, while providing a graphical aid for the user in their adoption of the technique. This model has already been applied to mobile devices with good results (Figure 3.2); however, the wide use of this approach will depend on the various manufacturers embedding an accelerometer into their products.


Figure 3-2: Hex in Action

3.2.2.2. Graffiti & Unistrokes

Graffiti [29] and Unistrokes [12] are two similar stylus text input mechanisms that attempt to simplify the English alphabet. They were created in an attempt to increase speed and further differentiate the letters to aid machine recognition.

Letters inputted in such a manner are required to be single strokes, so that multiple strokes cannot be misinterpreted or matched to the wrong letter. This limits the number of strokes available, typically to somewhere in the region of 5-8 unique strokes, depending on the implementation. This is obviously insufficient for the 26-letter alphabet; therefore other factors have to be included. The most common of these are the reversal of a stroke and the rotation of strokes. Some strokes by nature can start at their end point and work back, making a new stroke to use. Rotating strokes 90 degrees around their centre also increases the stroke count. Obviously, symmetric strokes cannot have both applied, since they would result in identical strokes.

Figure 3-3: Natural Letter Matching


When matching letters to relevant strokes, speed and familiarity are the dominant factors in decision making. For example (Figure 3.3), letters that closely resemble strokes are often paired with them. Where this could not be applied, subsets of the letter’s strokes are considered (Figure 3.4), so that the chosen stroke is one of the strokes made while writing the letter.

Figure 3-4: Letter Subsets

Common letters are also given quicker strokes to increase input speed (Figure 3.5).

Figure 3-5: Single Strokes

Simple strokes translate well when combined with the motion detection discussed in the previous section. Familiar patterns are a large boon for new users learning the new input method and will ease migration.

While these approaches also show promise, they require mobile devices equipped with a stylus and touch screen, which restricts their applicability.

3.2.3. High Level

High-level input goes further than simple one-word actions and attempts to communicate events or situations. During the early stages of my research it became apparent that this was by far the least explored area of study. It also has the most potential for future work, as well as the ability to spread into other fields.

3.2.3.1. DyPERS

DyPERS [25] is an augmented reality system initially developed at the Massachusetts Institute of Technology. The system scans the environment the user is examining for key objects; if it finds such objects, triggers are fired. In its current form this usually entails launching short video clips about the noticed object.


The system learns visual cues and can therefore be taught by the user to recognise objects in the world around them (Figure 3.6). Such a system could be embedded into the framework of this research to aid intelligent input (see Section 3.7.2). For example, detecting a specific business card with appropriate visual cues upon it (logos, names, etc.) could automatically call the phone number related to that business card.

The DyPERS system shares some similarities with the work presented in this thesis, particularly in image recognition triggering events; however, it relies on identifiable visual elements being present rather than on tracking device movement. We therefore feel our approach has more potential and scope for usefulness as an input model.

Figure 3-6: DyPERS in Action

3.2.3.2. Profiling Usage

Each user has a unique style when using a device, and hence would benefit from having functions available to them that differ from those of other users. For example, Person A may spend a significant amount of time listening to music on their device, while Person B may use personal information management tools to a greater extent. Such usage patterns can be monitored, and specific communication paths may be opened accordingly.

Predictive models based on a user’s past activities can map common actions to common activities [31]. This makes device usage not only simpler for the end user but also more natural. A user may turn the phone 90 degrees to the right to start the music player, if this is a common task for them (the user may turn the device instinctively to insert headphones into the jack). Such functionality could be learned over time and adapted into the input model of the phone.


Models could also be applied less dynamically, much like a shortcut system. Upon opening a general ‘programs’ menu, favourite programs can be assigned directions that open them, again based on how commonly they are used.

3.3. Motion and Error Detection

Transferring image data from a device’s camera (to be processed by the device) enables us to recognise changes in location and direction within the 3D environment of the phone. Having only a single camera without a free-moving (omni-directional) lens meant additional processing time and the possibility of errors. Using a variety of already-researched techniques and methods minimised such problems.

3.3.1. Movement Tracking

Processing captured images allows us to track the movement of the camera and its user. In three-dimensional space there are two kinds of movement: traversal along one or more axes, or rotation around those axes.

3.3.1.1. Axial

Axial traversal is the simplest motion to track with simple camera equipment. It involves the device/camera travelling in a set of directions that can be mapped in the three-dimensional planes along the optical axis of the camera. While moving, the device cannot rotate, and the focal point on the horizon must remain constant (obviously an imaginary point in the distance if not facing the horizon).

Upon receiving a processed image (Section 7.2.1), the pictures were scanned for the remaining edges, which were broken down into straight lines with given start and end points (pixel locations). From this information, vector sets were created that outline the information in the image. Not only is this an efficient way to store the image data, it also gives a method to directly compare images with those previously captured and stored [14].

Vectors of similar length and direction were matched between frames (rotation can interfere with this, which is why it is dealt with separately in the following section). With multiple vectors, these changes can be confirmed and considered consistent. When start and end points change but lengths remain constant, it can be determined that the device has moved left, right, up or down, depending on the new values. Movement into or out of the picture can be determined by similar line directions but an increase or decrease in vector length, as the sketch below illustrates.
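The following C++ sketch illustrates this classification, assuming line segments have already been matched between the previous and current frames; the Segment structure, the thresholds and the averaging are simplifications, and the directions returned are in image space (scene content shifts opposite to the device’s own motion):

    #include <algorithm>
    #include <cmath>
    #include <string>
    #include <vector>

    struct Segment { float x1, y1, x2, y2; };

    static float length(const Segment& s)
    {
        return std::hypot(s.x2 - s.x1, s.y2 - s.y1);
    }

    // Votes on the dominant axial motion given segment pairs matched between
    // two frames. Consistent length growth/shrinkage indicates depth motion;
    // otherwise the average midpoint displacement decides the direction.
    std::string classifyAxialMotion(const std::vector<Segment>& prev,
                                    const std::vector<Segment>& curr)
    {
        size_t n = std::min(prev.size(), curr.size());
        if (n == 0) return "none";
        float dx = 0, dy = 0, dLen = 0;
        for (size_t i = 0; i < n; ++i) {
            dx   += ((curr[i].x1 + curr[i].x2) - (prev[i].x1 + prev[i].x2)) / 2;
            dy   += ((curr[i].y1 + curr[i].y2) - (prev[i].y1 + prev[i].y2)) / 2;
            dLen += length(curr[i]) - length(prev[i]);
        }
        dx /= n; dy /= n; dLen /= n;                 // average over matched pairs
        const float kMove = 2.0f, kZoom = 1.0f;      // illustrative thresholds
        if (std::fabs(dLen) > kZoom)                 // lengths changed together:
            return dLen > 0 ? "in" : "out";          // motion along the optical axis
        if (std::fabs(dx) >= std::fabs(dy))
            return dx > kMove ? "right" : (dx < -kMove ? "left" : "none");
        return dy > kMove ? "down" : (dy < -kMove ? "up" : "none");
    }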

3.3.1.2. Rotational


Tracking rotational movement with a standard camera and lens is a far less reliable process. To truly track rotation direction and amount, a sense of image depth has to be established. Many new concept devices are currently being announced with two cameras, and while there is no information on whether or how they can interoperate, this could potentially be utilised to generate a sense of three-dimensional vision.

The currently available single-lens devices require compromises to be made when attempting to track rotational movement. The currently accepted method of tracking depth is to increase the field of view (FOV) of the camera. A typical camera has a FOV of around 45-50 degrees directly in front; increasing this value to around 80 degrees helps give the image a sense of depth, and is achieved by special ‘fish-eye’ lenses or convex mirrors [14]. Unfortunately, this is quite impractical for a phone, and hence exploratory vector mathematics was applied to examine the practicality of rotation tracking on a device with a typical FOV.

Such mathematics relied on the vector changes discussed previously (Section 3.3.1.1) being inconsistent across the image. For example, when rotating the device to the left, a vector on the left of the image would increase in size while moving closer to the centre of the image, while a vector on the right would decrease in size and move closer to the right edge of the image.
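A sketch of this asymmetry test is given below; the split at the image midline, the threshold and the assumption of pre-matched segments are illustrative simplifications:

    #include <cmath>
    #include <string>
    #include <vector>

    struct Seg { float x1, y1, x2, y2; };

    static float segLength(const Seg& s)
    {
        return std::hypot(s.x2 - s.x1, s.y2 - s.y1);
    }

    // Compares how matched segments on each half of the image change in length
    // between frames: opposite growth on the two sides suggests rotation.
    std::string classifyRotation(const std::vector<Seg>& prev,
                                 const std::vector<Seg>& curr,
                                 float imageWidth)
    {
        float leftGrowth = 0, rightGrowth = 0;
        for (size_t i = 0; i < prev.size() && i < curr.size(); ++i) {
            float mid    = (prev[i].x1 + prev[i].x2) / 2;      // segment midpoint x
            float growth = segLength(curr[i]) - segLength(prev[i]);
            (mid < imageWidth / 2 ? leftGrowth : rightGrowth) += growth;
        }
        const float kRotate = 1.0f;                            // illustrative threshold
        if (leftGrowth >  kRotate && rightGrowth < -kRotate) return "rotate-left";
        if (rightGrowth > kRotate && leftGrowth  < -kRotate) return "rotate-right";
        return "none";
    }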

3.3.2. Error Detection and Correction

Being able to judge when a user has made a minor mistake in their inputs, and adapting accordingly, has become an increasingly important feature of recent input models. This project’s scope included spelling and movement errors (low, medium and high level inputs, Section 2.3). Determining what the user intended with their motion not only makes the model more attractive to users, but also increases speed and, obviously, accuracy.

3.3.2.1. Gait Phases

Gait phases are a breakdown of the human walking cycle in which each step is broken into eight segments [10]. Research has shown that independent motion (entering information) typically occurs during the ending steps of the cycle (as the person finishes their step and puts their foot onto the ground). If human sway while walking is to be taken into account while receiving input, this indicates that the device will typically be moving downwards while the input is occurring.


To take this a step further, human motion can be monitored while no input is being entered. Because of the walking phase and the above findings, it can be assumed that most motion inputs will occur during this downward motion; therefore, in most inputs, a minor downwards motion can be ignored. Since walking is typically cyclic and consistent, this collected information can be applied while input is being received to derive the correct input.
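A minimal sketch of how such compensation might look is given below; the smoothing factor and the idle/input distinction are assumptions for illustration, not part of the implementation described in this thesis:

    // Illustrative gait compensation: while no input gesture is active, learn
    // the typical downward drift of the device; while a gesture is being read,
    // subtract that drift so walking sway is not mistaken for intended motion.
    struct GaitCompensator {
        float drift = 0.0f;   // learned vertical sway per frame
        float alpha = 0.05f;  // smoothing factor (assumed value)

        void  observeIdle(float dy)  { drift += alpha * (dy - drift); }
        float correctInput(float dy) { return dy - drift; }
    };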

3.4. Detection Algorithms

The two most important factors when deciding on a suitable detection technique are noise filtering and processing speed. When detection is performed on a mobile device with many restrictions, these factors become even more important: slower processing speeds require more efficient algorithms to be effective, while mediocre cameras result in increased image noise, hence the need to filter this noise out.

Five common algorithms have been examined for effectiveness:

• Zero Crossing Detector (Laplacian of Gaussian/LoG) [35]

• Roberts Cross Edge Detector [3]

• Canny Edge Detector [5]

• Compass Edge Detector (Prewitt) [11]

• Sobel Edge Filter [3]

LoG algorithms are probably the most computationally expensive of all the examined methods, though some variations are reasonably fast (Difference of Boxes, for example [35]). A Laplacian operator is initially applied to measure the changes of intensity in an image, and zero crossing marks the places where this value crosses the zero line (changes from a positive value to negative, or the reverse). These are typically edges within an image, but occasionally are not true edges (and are referred to as 'features'). Noisy pictures are usually smoothed first using a Gaussian filter [11]. Such a detector may be appropriate for scene change functionality (Section 3.5.2) but is too expensive for standard detection.

The Roberts Cross detector is a small and fast detector relying on two 2x2 matrices for detection after colour data has been removed from the image. One problem with this method


is that it is far more reliable at finding 45-degree edges than horizontal and vertical ones, which is not ideal for low-resolution images. The small matrix masks are also highly susceptible to noise.

The Canny detector is much like LoG, starting by applying a Gaussian filter to blur the image and remove noise. Small, but slightly more advanced, matrix passes like those of Roberts Cross are then applied. A third pass is then made to keep only the 'maximum' of the edges detected (resulting in edges only one pixel wide). Canny detection handles noise well and delivers good edges, but has trouble when multiple edges converge at one point, making one edge appear disconnected while merging the resulting edges into a single edge. This method is also very intensive, as seen in the many passes needed over an image.

Compass edge detection uses eight small matrices to determine both edge gradient and orientation in separate images. This could be useful on more powerful processors, as the information could also be used to track edges over a series of frames. However, on smaller devices, the two output images are difficult to handle. Prewitt detection is also very vulnerable to noise.

The Sobel detector is similar to the Roberts Cross detector in that it uses two small matrices to detect edges in a grey-scaled image; however, this time horizontal and vertical edges are favoured. This has the repercussion of making the filter less vulnerable to picture noise than Roberts Cross while retaining its speed.

It is apparent that both strong noise filtering and fast edge detection cannot truly be obtained together, and sacrifices in both areas must be made to choose the best algorithm. Sobel has been chosen as the most appropriate edge detection algorithm in the working environment of this project. As the computational power of mobile devices increases, it may become possible to switch to Canny or zero crossing detection.

3.4.1. Edge Detection

The input images were processed by passing a 3x3 (or larger) matrix over each pixel and its surrounding neighbours [5, 42]. If certain patterns are found in these matrices, it can usually be determined that the pixel is part of an edge in the image. Typically, edges are identifiable


when there is a sudden change of colour or luminosity of neighbouring pixels, and that change

travels consistently in a line.

The Sobel edge detection matrices (Figure 3.7) are considered by many to be the most efficient method of processing edges, and should therefore take precedence as the solution of choice on devices with limited processing power [7].

Figure 3-7: Sobel Horizontal and Vertical Edge Detection Matrices

Considering the small size of the image capturing devices currently embedded inside mobile devices, other algorithms may be considered depending on performance. The option to capture in grey-scale to further increase performance is another possibility.

Even edge-processed images can remain complex and difficult to perform tracking methods upon if a large number of edges is present. Bi-level masks can be applied to such images to reduce their complexity so that only the important features of the image remain (Figure 3.8). Such images are then ready to be processed to retrieve relevant and valuable information; a sketch of this pipeline follows the figure.

Figure 3-8: Original Image → Grayscale & Sobel Filters → Bi-level Mask
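A minimal C++ sketch of the Figure 3.8 pipeline is given below. The masks are the standard Sobel kernels; the threshold value and the |Gx| + |Gy| magnitude approximation are illustrative choices, not the project's exact implementation:

    #include <cstdint>
    #include <cstdlib>
    #include <vector>

    // Applies the 3x3 Sobel masks to an 8-bit greyscale image and reduces the
    // gradient magnitude to a bi-level (0/255) edge mask, as in Figure 3.8.
    std::vector<uint8_t> sobelBiLevel(const std::vector<uint8_t>& img,
                                      int w, int h, int threshold = 128)
    {
        static const int gx[3][3] = { {-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1} };
        static const int gy[3][3] = { {-1,-2,-1}, { 0, 0, 0}, { 1, 2, 1} };
        std::vector<uint8_t> out(img.size(), 0);

        for (int y = 1; y < h - 1; ++y) {
            for (int x = 1; x < w - 1; ++x) {
                int sx = 0, sy = 0;
                for (int j = -1; j <= 1; ++j)        // convolve the 3x3
                    for (int i = -1; i <= 1; ++i) {  // neighbourhood
                        int p = img[(y + j) * w + (x + i)];
                        sx += gx[j + 1][i + 1] * p;
                        sy += gy[j + 1][i + 1] * p;
                    }
                // |Gx| + |Gy| approximates the gradient magnitude cheaply,
                // avoiding a square root on processing-limited devices.
                out[y * w + x] = (std::abs(sx) + std::abs(sy) > threshold) ? 255 : 0;
            }
        }
        return out;
    }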


3.4.2. Object Detection

Initially, it was thought that edge-detection algorithms would be the most appropriate way to approach this problem. Straight edges are commonly attributed to static objects in the environment (tables, buildings, roads) and are relatively easy to track between frames. Problems arose when trying to distinguish rotation from translation, since the edges had no reference point to compare against when determining rotation. This meant having to examine each edge and compare its changes with every other edge: if an edge grew smaller on one side of the image while one on the other side grew, it was presumed rotation was being performed. This added complexity and often gave unreliable results, as the reference edges were frequently inappropriate for comparison. A simpler method was therefore required.

Shape detection algorithms avoid this problem, as each vertex found has more reference data to compare against. Squares seemed the most appropriate shape to track, as they are easily identified in an image and polygons of all orientations can easily be mapped back to an original square. Comparing rotation information on a square is also far easier to manage.

Hough shape detection and the Augmented Reality Toolkit (Section 7.2.1.1) were both examined for their shape detection abilities. Both were more than capable of detecting squares in the image, but the ARToolkit™ contained multiple libraries to aid additional functionality, such as conversion to three-dimensional information (Section 7.2.3).

The ARToolkit™ was therefore deemed the best processing to apply to motion detection. In fact, it offered too much processing, so a stripped-down version of the toolkit appeared to be the best solution to the problem.


3.5. Algorithm to Determine Appropriateness

When parsing video, it is common to find multiple squares in the scene. While the best result would come from taking all of these squares into account when determining motion, this is not viable with current computational power. It is therefore best to concentrate on the movement of only one square in the scene. To achieve the best results, the squares need to be judged to determine which will give the best results when tracked, i.e. their appropriateness. Variables such as where the square is on the screen, its current orientation and its size were all taken into account when determining appropriateness.

To decide this, an algorithm was devised to determine the 'most appropriate polygon' (MAP). Each polygon returned by the ARToolkit's detection routine was given a score grading how appropriate the polygon would be to track. Quick tests are then performed to determine whether the MAP of the current frame is likely to be the same as the MAP from the frame before (by comparing the size of the polygon on the screen, its centre and its vertex information). If so, the changes between the transformed polygons that relate to these squares can be calculated. If not, a search through the other square data on the screen is performed in an attempt to find the previous polygon. If a possible candidate is found, the calculations are applied to that polygon before moving on to the new MAP for the next processed frame.

To determine a MAP score, the following factors were included:

α – Appropriate size (it is easier to detect changes on polygons taking up more screen space).

β – Vertex distance from the screen edge (the further from the edge, the higher the chance the square will still exist in the next frame; if the polygon is cut off by the screen edge it can no longer be detected). Determined by adding the X*10 and Y*10 distances of each vertex to the closest screen corner and taking the smallest result of the four vertices.

∆ – Edge length (it is easier to track rotation in both directions if the edges are not short). The result is the sum of (100/edge length) over each edge.

θ – Chance it is a true polygon (0.0-1.0). (The ARToolkit returns confidence results; the lower the confidence, the less likely this is a square, and the more likely it will not be detected in the next frame.)


Examination, along with trial and error, was employed to determine an algorithm that gives the

strongest results.

The algorithm used to derive a MAP score in the work presented here is:

(α/10 + β/10 - ∆) * θ.
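As an illustration, the following C++ sketch computes this score for a detected quadrilateral. It encodes one reading of the β definition above (the smallest per-vertex 10·|dx| + 10·|dy| distance to the nearest screen corner), and the Quad structure is an assumption for this sketch rather than the ARToolkit's actual data type:

    #include <algorithm>
    #include <cmath>

    struct Quad {
        float x[4], y[4];   // vertex positions in screen co-ordinates
        float area;         // screen area covered by the polygon (alpha)
        float confidence;   // ARToolkit-style confidence, 0.0 - 1.0 (theta)
    };

    // MAP score: (alpha/10 + beta/10 - delta) * theta.
    float mapScore(const Quad& q, float screenW, float screenH)
    {
        float beta = 1e9f;
        for (int v = 0; v < 4; ++v) {
            float dx = std::min(q.x[v], screenW - q.x[v]);  // to nearest side
            float dy = std::min(q.y[v], screenH - q.y[v]);
            beta = std::min(beta, dx * 10 + dy * 10);       // worst-placed vertex
        }
        float delta = 0;
        for (int v = 0; v < 4; ++v) {
            int n = (v + 1) % 4;                            // next vertex
            float len = std::hypot(q.x[n] - q.x[v], q.y[n] - q.y[v]);
            if (len > 0) delta += 100.0f / len;             // short edges penalised
        }
        return (q.area / 10 + beta / 10 - delta) * q.confidence;
    }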

3.6. Model Design

Creating an input model for mobile devices required breaking down the available movement-derived input information, as well as creating input categories into which the possible inputs can be placed.

3.6.1. Interaction Breakdown

To determine possible relations between input types and the general movements that can be performed on a mobile device, a list of commonly used functions was collected for use as test cases. These were collected over a four-day period by myself and three other Microsoft Smartphone™ users; we each noted the functionality used on our devices over this time. This enabled initial work on classifications to be performed.

The complete collection of inputs is available in Appendix A.

3.6.2. Input Types (High-Level Commands)

The inputs from Appendix A can be classified into eight general input types (Figure 3.9). These help generalise and categorise the inputs into distinct types so that general motion types can be applied to them. This adds a layer of logic to the model, since if a particular input can be classified, a general indication of the motion needed to perform it can be derived.


Figure 3-9: Input Types

From initial analysis, eight types of input have been classified. Many of these classifications may seem superficially similar; however, further justification and explanation is given below.

Choosing – Navigating around a set of options/possibilities to find the most appropriate. Such inputs are generally performed by directional keys or a toggle stick. A limited set of options is given to choose from.

e.g.: Choosing an option on a menu or a list with a pre-defined number of items.

Selection – Selecting one of an unrestricted set of items. This can often be considered the selection of an item within the user’s created content.

e.g.: Selecting a circle drawn in a vector graphics program.


Confirmation – A message sent back to the device denoting that the user has understood the current situation and agrees/disagrees with it. Usually performed by pressing a yes key or a no key; these are often the call/end-call keys.

e.g.: Ensuring the user wants to delete a photo in the photo album after the user requests a delete.

Adjustment – Minor changes to an already confirmed/selected situation. This is often performed by the directional keys or toggle stick.

e.g.: Increasing the contrast of a displayed photo, or increasing the value in a number box by one.

Moving – Changing the contents displayed on the screen by scrolling. Again, an input usually covered by the directional keys or toggle stick.

e.g.: Panning a picture across to see the other side.

Functionality – Accessing a program/ability of the device. Soft menus, dedicated keys or the

act of opening a menu, choosing and selecting.

e.g.: Opening up the photo album.

Menu Access – Displaying menus that list available actions in current situation. This is

commonly a specialised key.

e.g.: Opening up zoom menu to zoom in on picture.

Modification – Significant changes to a currently available item. This can be performed with keypads, direction keys or a toggle stick.

e.g.: Flood-filling the water in a picture with red, or changing the input type from numerical to English.

Preliminary classifications of such inputs are available in Appendix B.


3.6.3. Input Motions

Motions that are applied to inputs can be further described by breaking the input movements down into their singular motions (Figure 3.10) and then examining the properties of these motions. The most significant properties can help define an input type (all inputs of a specific type share similar properties).

Figure 3-10: Input Motions


Direction: The general directions used in three-dimensional space from the user’s perspective (up, down, left, right, towards the user, away). Most commands will use multiple (two or three) directions to increase the total number of combinations available. Conjunctions of directions (up-right, for example) can be employed as well. Directions are easily classified as a list, which most likely provides the most important/prominent classification. Algorithms to detect motion will be able to gather direction information directly.

Values: Up, Down, Left, Right, Towards, Away

Rotation: Movement around the phone’s axes. This will be the most difficult information to gather from current-day devices and hence will play only a minor role in the initial model. Rotation does play a key role in several functions when trying to mimic human motion (e.g. making a phone call requires the user to put the phone to their ear, which requires rotation).

Values: Rotate Left, Rotate Right, Tilt Up, Tilt Down

Speed: The speed at which a motion is performed is a subtle indicator of what type of command is being input into the device. A slow motion indicates a softer response than one performed faster and with more confidence. Hence, identical motions performed at different speeds can indicate similar inputs but with slightly different outcomes: a slow motion may indicate the user wants to fast-forward through a music track, while a quicker motion can indicate skipping the track altogether. Detecting speed via a video stream cannot be precise because of depth issues, but limiting the available values makes it more viable.

Values: Fast, Slow


Angles: When adding another direction to an input, an angle is formed (Figure 3.11). These angles can be classified as obtuse, acute or rounded. Obtuse and acute angles are obviously dependent on direction, while rounded indicates many minor direction changes over a small area. Tests to decide what constitutes a rounded corner will have to be devised, and the additional processing power required for such a task means they should play only a minor role in the input model. The other angle types can easily be determined via vector mathematics, as sketched after Figure 3.11.

Values: Acute, Obtuse, Rounded

Figure 3-11: Angle Types
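The sketch below illustrates the vector mathematics for the acute/obtuse cases, measuring the corner between the reversed incoming stroke direction and the outgoing one; detecting rounded corners (many tiny consecutive direction changes) is deliberately omitted, and the structure is an illustrative assumption:

    #include <algorithm>
    #include <cmath>
    #include <string>

    struct Vec2 { float x, y; };

    // Classifies the corner formed where stroke direction d1 ends and stroke
    // direction d2 begins, via the angle between -d1 and d2.
    std::string classifyAngle(Vec2 d1, Vec2 d2)
    {
        float dot = (-d1.x) * d2.x + (-d1.y) * d2.y;
        float len = std::hypot(d1.x, d1.y) * std::hypot(d2.x, d2.y);
        if (len == 0) return "undefined";                  // degenerate stroke
        float c   = std::max(-1.0f, std::min(1.0f, dot / len));
        float deg = std::acos(c) * 180.0f / 3.14159265f;   // corner angle
        return deg < 90.0f ? "acute" : "obtuse";
    }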

Acceleration: Speed variations within a motion are obviously beneficial in situations where dynamic changes are common. While changing numerical values, acceleration would indicate an increase in how fast the numbers increase/decrease, while a slow-down would also slow the change to enable more precise input. Since we are relying on only two speed values (fast and slow), this limits the possibilities of acceleration to faster, slower or no change. Acceleration is a function of the speed tracking, and its reliability will be directly linked to that of speed.

Values: Faster, Slower, No Change

Length: The length of an accumulation of direction motions can range from minor movements to large sweeping motions. Smaller motions require less effort and should dominate inputs. One should also take into account that many inputs would require specific distances to be moved to mimic motion (moving the phone from pocket to ear).


To keep things simple, ‘short’ and ‘long’ will be used as values. To determine length, the product of speed and the time taken to enter the input is used; inputs that exceed a set value are considered long, otherwise they are considered short.

Values: Short, Long

Scene Change: A unique input type. A scene change is not related to motion; instead, the device keeps track of the general video seen and looks for key indicators or significant changes that mark a scene change.

Changes to the environment, from the darkness of being put into a handbag or pocket to the possibility of seeing a phone number on a business card, could all be considered scene changes. Many changes would be unique to the input they are designed for, since there really is no limit to what can constitute a change. Therefore, consideration must be given to restricting their use, since such a property could easily get out of control. That said, huge possibilities abound depending on the power of the operating hardware and software.

Values (Current): Dark, Light

As can be seen from the above, some of these properties are directly related to a combination of other properties; therefore not all of them have to be tracked. Further investigation was conducted to show which are the easiest to track and which supply the richest information.

3.7. Gesture Recognition

When a user operates a mobile device, there are many actions the user performs that are directly related to the current situation of the device. These movement actions have to be preceded or followed by additional interaction with the phone to perform the task. Tracking the movement of the device in these situations allows the removal of this extra step altogether, and hence greatly improves the user’s experience with the phone [17, 23].

Such communication can work in both directions: the user can react to an event with a relevant motion, or the device can react to a motion event generated by the user.


There are many such situations that can occur while working with a mobile phone. Possible examples include:

User → Device

A user takes the mobile device out of their trouser pocket while the key lock is enabled. The device notices the movement and disables the key lock once the device is in front of the user.

A user dials a number and puts the phone to their ear. The device makes an outgoing call to the number entered.

Device → User

The phone receives an incoming call and starts ringing. The user moves the phone to their ear to answer the call.

The device receives an incoming message. The user moves the device to a viewing position, and the device shows the message.

Such situation-specific communications can also be applied to more highly defined environments such as mobile gaming, where game movements or actions could imitate real-life movements (sitting down, walking in a direction).

3.7.1. Profiling Users

Each user may exhibit specific “quirks” while performing actions, and these unexpected actions may need to be assessed and adapted to while the user interacts with the device. Some users may find it more comfortable to move the device to the left than to the right and hence can move it faster in a particular direction, or perhaps arthritis inhibits the user’s ability to rotate the device. Even factors such as whether the user is left- or right-handed, or how the user grips the device, can interfere with the device’s usage in extreme circumstances.


Therefore, a user profile should be created to store the unique information about a user. This can be created by a simple configuration program that asks the user to partake in some simple tests to determine their abilities and how they would naturally interact with the device. A test could be as simple as getting the user to move the device in a requested direction and comparing this with the actual direction moved. Weights can then be applied to each direction to determine the intended direction of motion [37], as sketched below.
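One simple way such weights could be derived is sketched below: the fraction of calibration attempts in each direction that were read correctly becomes that direction’s weight. The structure and values are illustrative assumptions:

    #include <array>

    // Per-direction calibration profile: indices 0-3 stand for up, down,
    // left and right. A direction the user performs reliably keeps a weight
    // near 1.0; unreliable directions are down-weighted when disambiguating.
    struct DirectionProfile {
        std::array<int, 4> asked{};
        std::array<int, 4> matched{};

        void record(int dir, bool correct)
        {
            ++asked[dir];
            if (correct) ++matched[dir];
        }

        float weight(int dir) const
        {
            return asked[dir] ? float(matched[dir]) / asked[dir] : 1.0f;
        }
    };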

More exhaustive tests, such as getting the user to manipulate a ball into a small hole, could be used to judge the user’s reactions and how they move the device in certain situations. All of this can be profiled and stored to aid in determining the user’s true intentions.

3.7.2. Input Prediction

Being able to predict what task the user is most likely to perform next greatly assists the error detection process. This is most notable in applications that require spelling. Internal dictionaries can be accessed within the device to determine what word the user is entering, depending on what has been entered so far [31, 39]. Weights can be applied to the likelihood of certain letters being entered; these letters are then compared to what was actually entered, and this comparison gives a final result of what the user most likely intended the letter to be.

If the motion does not closely match any of the likely letters, it can be assumed that whatever was actually entered is the intended input, and the dictionary does not come into play. Otherwise, the weighted letters are compared to the input entered: a value of the weighting multiplied by the closeness to the weighted letter is created, and whichever letter ends up with the highest value is inserted.
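The sketch below illustrates this weighting-times-closeness comparison; the likelihood and closeness inputs, the fallback threshold and the function shape are illustrative assumptions rather than the thesis implementation:

    #include <map>

    // Combines the dictionary's likelihood of each candidate next letter with
    // how closely the recognised motion matched that letter (both 0.0 - 1.0).
    // If no candidate matches well enough, the raw recognition result stands.
    char predictLetter(const std::map<char, float>& likelihood, // from dictionary
                       const std::map<char, float>& closeness,  // from recogniser
                       char rawResult, float minCloseness = 0.5f)
    {
        char  best      = rawResult;
        float bestScore = 0.0f;
        bool  anyClose  = false;
        for (const auto& [letter, like] : likelihood) {
            auto it = closeness.find(letter);
            if (it == closeness.end() || it->second < minCloseness) continue;
            anyClose = true;
            float score = like * it->second;                // weighting * closeness
            if (score > bestScore) { bestScore = score; best = letter; }
        }
        return anyClose ? best : rawResult;                 // keep what was entered
    }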

Input prediction can also be combined with the user profiles (Section 3.7.1) to increase the success rate, by comparing input motions to the most commonly used applications or answers the user gives on their device. This information would be unique to each user, and therefore prediction would differ between users.


4. Survey Construction

Surveys allow the collection of both the qualitative and quantitative data required for creating a motion-based input model. Quantitative data helps define the precise motions users will typically use to input their commands, while the more subjective (qualitative) data users supply helps define the inputs users typically employ and which should be applied to the model. Both kinds of data were gathered to inform the model and how it can work.

To create a basic model, it was decided that it would be best to place users in situations that closely represented actual situations in which such a model could be used. To achieve this, it was decided that using a mobile device as the actual focus of the surveys was the best course of action. Participants would interact with a mobile device when placed in specific situations, and this information would be recorded for study and comparison of results.

Seeing how users responded in this environment would allow some early work on defining the model’s direction to be completed. This would then lead into a far more specialised survey that placed the user in far more controlled situations, so that specific information on specific situations could be collected.

To develop a model that can be used by the general populace, an understanding of how users typically use a mobile device was required. Once this was understood, steps could be taken to improve this interaction using the motion concept.


4.1. Aim

The goals of the surveys were two-fold. Firstly, there was the discovery of whether such an input method was in fact viable for the public to use. To achieve this goal, a reasonable percentage of the survey participants would have to understand the concepts behind motion input, judged by how they responded to the survey. If people required explanations for tasks that asked them to interact with the device via motion instead of traditional means, the result would be considered a failure. In such a case, the entire concept of motion input would have to be rethought.

Otherwise, if the participants embraced the concept (either naturally or after additional clarification), their responses to the survey would judge how well they could adapt to a motion-based environment. If a subject understood the concept in theory but struggled to apply it in action (typically reverting to older methods of interaction to respond to a situation), the input interaction would be considered only a mild success.

The second goal of the surveys was to create the basis of an interaction model for simple (low-level) inputs, as well as a few phone-specific high-level inputs. Because of the constraints applied by the first success criterion, this information could only be gathered from participants who responded positively to the first criterion and were able to give meaningful responses to the survey. Information gathered from the remaining participants would be used to improve the overall understanding process.

4.2. Survey Structure

Early in the creation of the survey, it was decided to break the entire process into two specific surveys (one qualitative and one quantitative). A qualitative pilot survey would be used to gain basic information about the participants themselves, their prior mobile device usage and their general understanding of how they would use a motion-based input model. This would be obtained by placing them in some informal situations and seeing how they reacted. The testing procedure is discussed in Section 4.3.


The information gathered was then used as the basis for a quantitative survey, informed by how users responded to the motion-based situations. A formal survey in a controlled environment was then applied to a larger user base to gain detailed information. This information has been used to create a formal model that defines how the input types from Section 3.6.2 can be mapped directly to simple motion inputs.

4.3. Survey One – Initial Data Collecting

The pilot survey was designed to gain a broader picture of the general population's grasp and understanding of a motion-based input model in its simplest form. A general understanding of the users' background and knowledge was gathered by survey; this was performed in order to profile the users before their grasp of the input model was recorded. This information was then used as a baseline in creating the second survey.

The information collected would be used to study the feasibility of such a model, as well as to define some important guidelines on how a representative sample of the population perceives interacting with a mobile device via motion.

4.3.1. Design

As the pilot survey targeted a population that would almost certainly have no familiarity with this input concept, the survey had to be created in a way that would not intimidate the participants. As such, an informal approach to data collection was employed. General questions were asked of the user, and simple tasks were performed. In both sections, users were offered plenty of time so that they were comfortable with what they were being asked. Sections 4.3.1.1 and 4.3.1.2 describe the two parts of the survey, which seek to understand the users and assess their reactions.

4.3.1.1. Survey One: Part One – Understanding the Users

To understand the users, their past experiences and proficiency with mobile devices were collected. This information was gathered via a simple written multiple-question survey (Appendix C). These questions included:


How long had the person owned a mobile device?

The longer a user works with a specific technology, the greater the likelihood the user will embrace new concepts the device can offer. Knowing this length of time allows us not only to see whether this applies here, but also gives us a guide and reference to their proficiency with mobile devices. If people struggle to embrace the concept, there is a high possibility that such an input model would have difficulty being accepted or adopted by the public as a whole.

What previous models of mobile phones had the user used?

With competing brands using differing interfaces and key interaction concepts, there is a possibility that people will have a specific mindset about what functionality is available and how to access it. Such a mindset will have been subconsciously trained over time, especially if the user has developed a brand loyalty and has only been exposed to one family of input concepts. These mindsets are very likely to carry over to a motion framework, so a motion framework should give weight to the input concepts of more popular devices. Tracking how people with differing mindsets interact with a phone will become a large factor in designing an input model.

What level of competency did they believe they had reached with these phones?

More experienced users may typically employ shortcuts to access functionality; this familiarity may carry over to motion. On the other hand, inexperienced users may be ignorant of functionality, and this could be detrimental to their understanding of the motion-based survey. Users might instinctively try to apply these shortcut concepts to the motion model since they are so used to performing them. An example would be locking a mobile phone so the input keys do not respond: a typical shortcut where the motion to apply the input has no grounding in logic, only efficiency.

Then again, users with knowledge of shortcuts may already understand that multiple paths can lead to the same result. Such users may be more open to the concepts of a motion model.


What functionality was available on their current phones, and was it taken advantage of?

Tying in with the previous question, knowing what the device can do plays a huge part in what the user does with the device, and how they go about doing it. Knowing a device can do something aids the user in visualising how to perform that function. If the user has trouble grasping the concept in general, then their input (motion) may be flawed.

Key profile questions are also included to see if certain demographics react differently to each

other given the same situation.

4.3.1.2. Survey One: Part Two – Gauging Users' Reactions

Users were given part two of the survey along with the written part one, so that they had ample time to examine what they were going to be asked and what they were actually partaking in by performing the second part. The surveys were supplied to the participants the day before the actual survey so they could fill in part one when they saw fit, as well as understand what would be asked of them in part two. The goal was to find how users would incorporate motion into some very simple situations that could occur on a mobile device.

Users were allowed to use their own mobile devices as props to act out the situations presented to them. This ensured familiarity with the device being used. The device was held upside down to ensure that buttons did not interfere with the interaction and that only motion was used. Users were recorded over the left shoulder as they performed the motions (Figure 4.1). A whiteboard was behind the user at all times so that background information did not interfere with the recorded video. The video was recorded by hand so that the framing could follow the device in case it was obstructed by the user.

Because of the survey's informal nature, users were allowed to ask questions beforehand to gain a better understanding of what was required. During the survey itself, further information was also supplied if the user appeared to be struggling.


Figure 4-1: Basic Outline of Survey Situation

The simple situations to act out were designed to cover each of the motion input types previously defined (choosing, selection, confirmation, adjustment, movement, functionality, menu interaction and modification). These tests were:

1. Selecting an option from a vertical list displayed to the user (movement, choosing, selection)

2. Saying yes to a prompt (confirmation)

3. Saying no to a prompt (confirmation)

4. Selecting an option from a vertical list (movement, choosing, selection, menu interaction)

5. Rotating an object on the screen to the left (adjustment)

6. Increasing a displayed number by two (adjustment)

7. Increasing device volume (functionality)

8. Pan right while viewing an image (movement)

9. Reloading a webpage (functionality)

The questions asked are also included in Appendix C. Information was recorded digitally, with all of the motions for a user recorded to one file. Voice prompts were used to indicate the start of each motion and allowed a motion to be repeated if the user became confused part of the way through.

4.3.2. Participants

Thirty participants were interviewed for this initial pilot survey. All thirty answered part one of the survey, while twenty-five of these completed part two successfully.

Participants were aged from seventeen to thirty-six and were from a variety of backgrounds

with the common factor being the ownership of a mobile phone. The breakdown of user

information is included in Appendix D in quantitative format.

4.3.3. Data Collection Methods

Part One


Surveys were distributed as hardcopy with ample space for users to record information. These

were processed into a spreadsheet so information could easily be compared, before being filed

away. The survey itself is included verbatim in Appendix C. Every user was supplied with a

copy of the survey at least a day before it was expected to be completed, giving the

participants ample time to consider their answers.

Eight of the thirty users had to be re-supplied with the survey during the time allocated for

survey part two since they had failed to bring their originally supplied copy. These were

completed before part two commenced.

Part Two

Each video file was prefixed with the participant's name. The videos were encoded in Windows Media Video 9 with LAME MP3 audio encoding, and each participant's collection of videos was stored in a separate folder. Since survey sheets were marked and contained occasional notes by myself and the participants (user feedback, and personal observations during recording, usually concerning participant confusion and video retakes), the information related to each video was also stored in a text file following the same naming convention.

Typically the video files were under two minutes in length, though in situations where additional information (instructions and questions) was required, they ran longer.

4.3.4. Ethical Considerations

All forms were supplied and described to the participants (Appendix C), and included information on who to contact regarding ethical issues. Participants were also given ample time prior to commencement of the survey to voice any objections to what was being asked. No complaints were received.

4.3.5. Initial Analysis

It was apparent at this early stage that there were two distinct groups of users: those who had little trouble applying motion to the simple input tasks supplied, and those who seemed uncomfortable or confused by the concept.


The confused users typically either looked up for guidance or prompts (attributable once again to the informal nature of the survey) or attempted to use the traditional features of their own phone to perform the tasks (by turning the phone over and using the keys). This was often accompanied by a comment along the lines of 'this is how I usually do it', indicating they were not comfortable with the motion concept in general.

Data collected from users who understood the requirements of the survey appeared to be relatively consistent; many of these users' motions followed the same movement paradigms for the same questions. One point of interest is that a lot of users over-emphasised their motions, perhaps to ensure the motions were captured, or in the belief that such large motions would make it easier for the device to understand. Large motions like these are not what was being looked for, and were therefore not appropriate natural motion. With the exception of this scaling issue, the supplied answers were consistent.

The most common answers supplied were:

• Please perform a natural motion that you believe would best express the choosing of an object in a vertical list (demonstrate movement down and up the list and the choosing of the object).

This was indicated by Direction Up and Down (y-axis) to move through the list, followed by Direction Away from the user (z-axis) to select.

• A motion to confirm (say yes to) an action.

A combination of Direction Up and Down and Rotation Forward and Back on the x-axis.

• A motion to deny (say no to) an action.

This was Rotation Left and Right along the y-axis and z-axis, with no direction information like the yes example above. This is the first instance of rotation and direction confusion.

• Rotate an object on the screen to the left.

Rotation Left along the z-axis. This appeared to be the simplest input for people to articulate.

• There is the number 18 in a box; how would you increase it by 2 (to 20)?


This resulted in many different responses, but the most common was Direction Up followed by Direction Down to the starting point, performed twice. Directions seemed confusing to a lot of people; perhaps such precise inputs would require better knowledge of the model beforehand.

• How would you increase the device's volume while talking on the phone?

This was generally answered by moving the device Direction Down. Such a motion would move the device away from the ear and make the phone conversation harder, so it is not the ideal answer.

• Reload a web page that is currently being viewed.

Shaking the device was by far the most popular answer here. Perhaps this reflected the confusion of users who did not know how to express the answer, or maybe it mimicked the reload symbol typically seen in web browsers. I assumed the former.

• Answer an incoming phone call.

Most users moved the device up to their ear. Understanding of this question was significantly better than average; in hindsight, it should have been the first or second question on the survey. But because the survey was conducted during holidays, when people were readily available, it was decided not to restructure it, but instead to carry this finding into the next data collection stage.

• Pan right while viewing an image.

Direction Right (x-axis) was the most common answer.


It became apparent that the order in which the questions were presented was an important factor in the outcome of the pilot survey. Motions already in everyday use (e.g. answering the phone) and questions with direction prompts (e.g. panning right) were answered more often, and with less confusion, than the less defined questions. Ordering the questions along this perceived difficulty scale may have resulted in more successful completions of the survey.

Nineteen of the thirty participants could be classified as comfortable with the concept of applying motion to register inputs on their mobile device. The remaining eleven all shared the same objection and mindset: apprehension about using motions. This could most likely be overcome by a gentler familiarisation process.

4.3.6. Particulars of Note

• Many of the users who had no problem completing the survey expressed surprise that prior participants had told them the survey was confusing.

• Often when users asked to redo a specific motion question, their second and third attempts were identical, placing emphasis on the importance of natural responses. These motions were also always comparable to other participants' responses. When asked afterwards why they had needed to redo them, a typical response was that they felt the answer was not good enough.

• The variance in answers was significantly less than expected.

• Reloading a web page caused the most confusion, even though it was placed last in the list (in an attempt to lessen the impact of its obscurity).

• Two participants performed the exact same motion for all of the inputs. One appeared to be trying to emulate double-tapping a mouse, while the other performed a simple shake each time.


4.4. Development of an Autonomous Survey on a PDA

Autonomous surveys follow a far more quantitative path than traditional face-to-face surveys, as the information given to the user and the results recorded are free from third-party influence. An automated survey is strictly between the participant and the device, so additional preparations must be made; had this development commenced earlier, it would have made the actual survey task a lot easier. Such a survey could be used to follow up the initial survey and collect far more exact data from users now that the basics had been established.

4.4.1. An automated survey versus traditional means

Both methodologies have advantages, and these were examined to determine which path offers not only the simplest passage, but the most useful final result. Therefore, a study of what information needed to be collected was performed to determine which type would yield the most useful results.

Typically, the scope of the data collected is an important factor in determining the usefulness of an automated survey. If there is only a limited set of answers a user can choose from, then automated surveys are useful, but so would be simple multiple choice. Therefore the type of data collected also becomes a factor. If one can take advantage of the mechanisms available in the mobile device while collecting and leveraging the data, then automated surveys suddenly become a very useful tool in the process.

4.4.2. Smartphone Development

There are many Smartphones™ on the market at this time competing for market supremacy. These include Microsoft (http://www.microsoft.com/windowsmobile/smartphone/default.mspx/) phones (created by third parties and re-sold), Symbian™ phones (typically Sony Ericsson and Nokia), Palm (http://www.palmone.com/) and BlackBerry™ (http://www.blackberry.com/) devices. Most companies (with the possible exception of Motorola) have placed their bets on one of the above platforms. This makes the industry not only very fragmented but very fast moving, as new standards and development ideas are constantly being introduced.


4.4.2.1. Device Information

Smartphones™ are often considered a convergence of a regular mobile phone and a PDA (Fig 4.2). Microsoft Smartphones™ are usually developed offshore (typically in China by companies such as HTC) and resold in separate regions as rebadged phones by companies such as I-Mate (http://www.carrierdevices.com.au/), Orange (http://www.orange.co.uk/), O2 (http://www.o2.com/) and QTek (http://www.qtek.fi/).

Figure 4-2: I-Mate Smartphone2 vs O2 Xphone vs Orange SPV E200 (HTC Voyager)

While it varies between devices, most run a Texas Instruments OMAP or Intel XScale processor in the range of 132-624MHz, with 32-128 megabytes of RAM for storage and the same range of ROM to hold the operating system [41]. Devices have an ISO-compliant key layout as well as a 4/8/9-way toggle-stick and hardware buttons to operate the camera or take voice notes. Most devices are tri-band capable and have methods for external storage (MMC, SD cards or mini-SD cards) and a variety of communication modules (Bluetooth, infra-red, GPRS).


The operating system itself has gone through many revisions over the years, originating with Windows CE 1.0, which was developed back in 1996 for miniature devices. It has undergone many revisions (CE through to 5.0) and branches (Windows Mobile for PocketPC, Windows Mobile for PocketPC – Phone Edition and Windows Smartphone™). Each of these has had sub-versions as well (Smartphone™ 2002, 2003, 2003 SE). The operating system comes with functionality quite similar to a base install of Windows 98, offering a contact book, solitaire, Internet Explorer and phone-specific options such as dedicated SMS and MMS programs. The devices can be synchronised with a desktop computer via ActiveSync, and information can be shared between devices, usually through a USB cradle or Bluetooth adapter.

Of particular interest is the recent release of Windows Mobile 5 [22], which had been living under the codename "Magneto" for the last year. It has been designed as a total solution for the mobile device market, working across PocketPC™, PocketPC™ with phone capabilities and Smartphones™. Along with yet another new generation of mobile hardware, it will be an interesting development.

4.4.2.2. .NET CF and Embedded C++

This project utilises the Microsoft .NET Compact Framework (.NET CF), first introduced with Smartphone™ 2003/PocketPC 2003 as a successor to Embedded C++ 4.0. It follows many of its parent's (Microsoft .NET) design traits, such as garbage collection, consistent typing and delegate event handling [38]. It also has the major bonus of integrating into the Microsoft Visual Studio development environment.

The use of the .NET CF for this project does carry some significant weaknesses, which in general can be addressed by the use of unmanaged C++. The .NET framework languages are executed via the Common Language Runtime (CLR), making them inherently slower than compiled machine code, and this can become very noticeable on mobile devices. In addition, the .NET CF is a significantly cut-down version of the full .NET Framework, with a large proportion of the class libraries unavailable [18].

The .NET CF's ability to interact with other languages [18] allows compiled C++ (via linked native libraries) to be called from the C# runtime, thereby significantly improving performance.
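As a minimal sketch of this interop (the library name MotionLib.dll and the function ProcessFrame are hypothetical, introduced purely for illustration), a native routine can be declared in C# via P/Invoke and called directly:

    using System;
    using System.Runtime.InteropServices;

    class NativeInterop
    {
        // Declares a routine exported (extern "C") from an unmanaged C++ DLL.
        // "MotionLib.dll" and ProcessFrame are illustrative names only.
        [DllImport("MotionLib.dll")]
        static extern int ProcessFrame(byte[] pixels, int width, int height);

        static void Main()
        {
            byte[] frame = new byte[176 * 144];          // one grayscale frame
            int result = ProcessFrame(frame, 176, 144);  // runs at native speed
            Console.WriteLine("Native result: " + result);
        }
    }

Because each managed-to-native call carries a marshalling cost, the usual design is to pass whole frames across the boundary rather than individual pixels.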


This project also utilises the open compact framework initiative [27], which is an attempt to add functionality to the existing .NET Compact Framework. Functions made available this way, such as load/save dialog boxes, are used by nearly every .NET CF developer.

4.4.2.3. Camera API

Cameras currently available on Microsoft Smartphones™ have a significant bonus for developers over most other camera-phones, in that they run at the software level of the operating system rather than on an independent hardware chip. This means that a developer can add hooks to the operating system to directly affect or read information coming from the camera. Unfortunately, while this is possible, the API for working with the camera is not documented, making implementation a matter of trial and error. The release of Windows Mobile 5 has addressed this issue.

An open-source initiative [9] to interoperate with the camera was produced, though it works only on certain devices. Its implementation is very close to the .NET solution in .NET CF version 2. Unfortunately, this solution was not sufficient for the goals of this research. The Windows Mobile 5 SDK allows interaction with these cameras via DirectShow, though documentation is extremely sparse. The procedures for using the camera and DirectShow are included in Chapter 7.

4.4.3. Input mediums of a PDA

Using an electronic device gives a lot more freedom and more avenues to explore when conducting a survey. As these devices become more and more convergent and aim to become the ultimate all-in-one device, their functionality increases. Using these additional features allows a far more complete experience when collecting data and, in the long run, if designed well, makes the data collection task far easier.


4.4.3.1. Textual Input

In developing an automated survey, there are multiple mechanisms the developer can monitor on the device to collect information. With touch screens, written input has become a possibility, but that alone is not a sufficient tool since, in the participant's mind, writing an answer on paper would be just as easy, possibly more so. But if there is additional information that needs to be derived from this written text, then it can be collected by a device.

Examples of additional information that can be gathered from text written to a PDA include:

• The stroke order of lines

• Previous answers (erased)

• The time taken to write out the answer

• Response time before writing commenced

• If any periods of thought (non-action) occurred during the written answer.

All of this additional information can be informative when trying to parse additional meaning from text and, once again, can be very useful in the right situation; for example, when trying to read the thought processes behind an answer instead of just the answer supplied.
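As a minimal sketch of how this could be captured on the .NET CF (the file path and log format are illustrative assumptions, not the implementation used in this research), the stylus events of a form can be written out with timestamps:

    using System;
    using System.IO;
    using System.Windows.Forms;

    public class StrokeLogger : Form
    {
        StreamWriter log = new StreamWriter(@"\strokes.txt");
        DateTime questionShown = DateTime.Now;

        public StrokeLogger()
        {
            MouseDown += new MouseEventHandler(OnDown);
            MouseMove += new MouseEventHandler(OnMove);
            MouseUp += new MouseEventHandler(OnUp);
        }

        // Every event carries its position and the time elapsed since the
        // question appeared, so pauses show up as gaps between records.
        void Log(string kind, int x, int y)
        {
            double ms = (DateTime.Now - questionShown).TotalMilliseconds;
            log.WriteLine(kind + " " + x + "," + y + " +" + ms + "ms");
        }

        void OnDown(object s, MouseEventArgs e) { Log("down", e.X, e.Y); }
        void OnMove(object s, MouseEventArgs e) { Log("move", e.X, e.Y); }
        void OnUp(object s, MouseEventArgs e)   { Log("up", e.X, e.Y); }
    }

Stroke order falls out of the sequence of down/move/up records, response time is the delay before the first pen contact, and erased answers survive in the log even after they disappear from the screen.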

4.4.3.2. Touch Input

A device's touch screen is not limited to textual input. Selection of items can be performed via touch as well as by text. Displaying information to the user and then requesting feedback by letting the user interact with the displayed information can be much more intuitive; for example, when making a participant choose between pictures all displayed on the screen at the same time. Touch also allows implements other than pencils/pens to be used as the input mechanism (i.e. fingers). Again, details such as response time can be recorded. Other information, such as where the participant chose the option, also becomes available (and is easily recordable, as the selection point can be tracked to the pixel).


Other ways of using touch become available that are impossible on paper. Inputs such as sliding information around on the screen are possible with an interactive medium. A user could be asked to sort a jumble of numbers into order; in this example, the process of achieving the final answer is just as important as the answer itself. With pen and paper there is no viable way to collect this information. Without an interactive medium it would have to be done physically and recorded on video, making the survey process itself significantly more complex.

4.4.3.3. Audio

Devices have in-built microphones capable of capturing audio much like any standard voice recorder. This audio by itself is not a sufficient reason to lean towards an automated survey. But with processing power, the audio can be processed during the course of the survey, allowing the survey to be moulded depending on the audio. This could be as simple as filtering out unneeded audio, or physically modifying the path the survey takes depending on the audio.

If loud background noise is being recorded while the survey takes place, then the volume of audio output by the device can be raised dynamically to match. Surveys could wait for a significant quiet period (the participant has finished talking) before going on to the next question. This audio could also interact directly with other cues (visual, for example), with what the user sees being directly related to what they say.
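A minimal sketch of the first idea follows, assuming a helper that captures a buffer of 16-bit PCM microphone samples (CaptureAmbientSamples is hypothetical; waveOutSetVolume is the standard wave API exported by coredll.dll on Windows CE):

    using System;
    using System.Runtime.InteropServices;

    class AdaptiveVolume
    {
        [DllImport("coredll.dll")]
        static extern int waveOutSetVolume(IntPtr device, uint volume);

        // Hypothetical helper returning a buffer of 16-bit PCM mic samples.
        static short[] CaptureAmbientSamples() { return new short[8000]; }

        static void AdaptToAmbientNoise()
        {
            short[] samples = CaptureAmbientSamples();
            long sum = 0;
            foreach (short s in samples) sum += Math.Abs((int)s);
            double level = (double)sum / samples.Length / short.MaxValue; // 0..1

            // Lift output volume with ambient noise; low word = left channel,
            // high word = right channel, each in the range 0x0000-0xFFFF.
            uint channel = (uint)(0x4000 + level * 0xBFFF);   // floor at ~25%
            waveOutSetVolume(IntPtr.Zero, (channel << 16) | channel);
        }
    }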

4.4.3.4. Video

Recording video from a device allows capture from a first-person perspective (i.e. directly what the user sees). This perspective can capture a lot of additional information that static camera locations would fail to pick up.


Usage information can also be retrieved by analysing video taken by the camera. Fingers and hands can interfere with the video; while this could be considered a hindrance, it also offers information. It can reveal not only how the device is being held during a question, but even which hand it is being held in. Swapping of hands is also easy to pick up in such video. The angles a device is held at can also be compared between participants.

Angles can also change between questions, and this can give information about the responses being supplied. Sharp, sudden movements of the device can indicate a person under pressure, while slower motions can indicate a calmer response. With questions specifically designed to place the participant under duress, or to make them feel comfortable, this information can be invaluable.

Many devices also include cameras that point towards the user's face. These cameras are typically small and used for voice chat, but they can be valuable sources of information for observing the user's natural physical responses to questions.

4.4.3.5. Motion

By processing the data between frames of video it is possible to detect the general motion of a device. This data can be collected to read natural motion reactions to questions and situations. It also allows questions to be created that accept movement as a response. The use of motion is examined in depth throughout this document.
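A minimal sketch of the simplest form of this, differencing two consecutive grayscale frames to yield an overall activity level (the frame size and noise threshold are illustrative):

    class MotionEstimator
    {
        const int Width = 176, Height = 144;

        // Fraction of pixels that changed noticeably between two frames;
        // values near 0 suggest a still device, values near 1 a large movement.
        public static double MotionLevel(byte[] previous, byte[] current)
        {
            int changed = 0;
            for (int i = 0; i < Width * Height; i++)
                if (System.Math.Abs(previous[i] - current[i]) > 25) // noise gate
                    changed++;
            return (double)changed / (Width * Height);
        }
    }

Estimating the direction of motion requires more than this, such as tracking features or markers between frames, as discussed later in this document.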

4.4.3.6. Other

There are other resources available on mobile devices that can be leveraged. Wireless communication protocols such as Bluetooth and infra-red can be used to control external devices automatically. Bluetooth is commonly used to interact with headsets, while many applications exist to make PDAs act like a TV remote (http://www.pdawin.com/tvremote.html). Being able to control and fire off events on other devices like this can bring a true level of interactivity to a survey.


Other devices (with no need to be portable) can be used to add elements not available, or simply not physically possible, on a mobile device. Infra-red can be used to start recording on a web-camera to enhance the video information collected by another stream, or it could be used much like a remote to pause playback/recording on a VCR [36]. Surround-sound setups can be used to track reactions, and communication protocols can be used by the device to start such audio on a nearby computer.

GPS functionality and tracking could be used to determine location and movement during a survey. Such environmental information can influence results, and having it at hand allows the researcher to study this. GPS by itself can also be used as a mechanism to collect answers; questions can be designed asking participants to travel to certain locations. For example, the time taken to get from location A to location B is easily tracked using such a mobile device.

Other standard information can easily be taken advantage of during a survey. Time-related elements are easily captured, such as the time taken to complete a survey (or even limits on the time allowed to answer it), the time and date a survey took place, or data such as dates of birth.

4.4.4. Information Storage

These devices are capable of recording large amounts of information at relatively fast rates. This has the downside of leaving a lot of information to parse through if the storage is not carefully planned. Hence, an effective and obvious methodology for storing data is required.

4.4.4.1. Video and Audio

Care has to be taken when writing the data collected by automated means. Devices have limited storage space, so judicious use of this resource is required. Video in particular, even at low quality settings and resolution, can consume an unacceptable several megabytes per minute. Even with large storage cards, these devices can struggle to push so much data onto a card at such a high rate, as the video throughput can often exceed the write speed of the card. This results in buffers filling and doing the only thing possible: dropping data, and hence dropping frames from the saved video.
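Some rough, illustrative arithmetic (assuming 16-bit pixels) shows why even the modest capture resolution used later in this research strains a card's sustained write speed if left uncompressed:

    class ThroughputEstimate
    {
        static void Main()
        {
            int width = 176, height = 144;  // capture resolution used in this work
            int bytesPerPixel = 2;          // assumed 16-bit colour
            int fps = 20;

            long bytesPerSecond = (long)width * height * bytesPerPixel * fps;
            // About 1 MB/s uncompressed, comparable to or above the sustained
            // write speed of typical storage cards of this era, hence the
            // full buffers and dropped frames described above.
            System.Console.WriteLine(bytesPerSecond);   // 1,013,760 bytes/s
        }
    }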

Other problems resulting from this include battery drain (even on solid-state media) and increased downtime, as data has to be transferred over to another medium more frequently as storage fills quicker. Obviously the solution to this is data compression, and thankfully the additional overhead in processing power is well within the means of current-generation devices to perform on the fly, be it hardware or software encoded.

Considering that the data collected on devices is typically of lower quality, lossy compression techniques yield good results when applied to the video stream. Filters applied to the stream can be much harsher if only certain, clearly defined pieces of information are required from it. Gray-scaling or even two-tone video can be valid options if only specific features are being looked for. These techniques can decrease data usage dramatically.
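A minimal sketch of these two filters, applied to a packed 24-bit RGB frame (the buffer layout is an assumption for illustration):

    class FrameFilters
    {
        // Grayscale: one luma byte per pixel, a third of the RGB size.
        public static byte[] ToGrayscale(byte[] rgb)
        {
            byte[] gray = new byte[rgb.Length / 3];
            for (int i = 0; i < gray.Length; i++)
            {
                int r = rgb[i * 3], g = rgb[i * 3 + 1], b = rgb[i * 3 + 2];
                gray[i] = (byte)((r * 77 + g * 151 + b * 28) >> 8); // approx. luma
            }
            return gray;
        }

        // Two-tone: a single bit of information per pixel (stored here as a
        // byte for simplicity), suitable when only shapes are of interest.
        public static byte[] ToTwoTone(byte[] gray, byte threshold)
        {
            byte[] mono = new byte[gray.Length];
            for (int i = 0; i < gray.Length; i++)
                mono[i] = gray[i] >= threshold ? (byte)255 : (byte)0;
            return mono;
        }
    }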

Processing video in real time while the survey takes place also allows only the required data to be extracted and recorded as it happens, typically in a non-video format. For example, the darkness of the environment during the survey can easily be averaged and stored as a single numerical value, instead of recording video for later human interpretation, or for external automated processing that could have been done during recording in the first place.

4.4.4.2. Other Data

As with other means of data collection, stored results should be convenient and easy to access, interpret and compare. Therefore binary streams of data should be free of redundant data and clearly named, possibly even self-describing (with, for example, XML tags).
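As a minimal sketch (the element names and file path are illustrative, not the schema used in this research), such a self-describing record can be produced with the System.Xml classes available in the .NET CF:

    using System.Xml;

    class ResultWriter
    {
        static void WriteRecord()
        {
            // Assumes the \results folder already exists on the device.
            XmlTextWriter xml = new XmlTextWriter(@"\results\CH042.xml",
                                                  System.Text.Encoding.UTF8);
            xml.Formatting = Formatting.Indented;
            xml.WriteStartElement("testResult");
            xml.WriteElementString("test", "Choosing");
            xml.WriteElementString("participant", "42");
            xml.WriteElementString("videoFile", "CH042.asf");
            xml.WriteElementString("durationSeconds", "21");
            xml.WriteEndElement();
            xml.Close();
        }
    }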

Archiving information can be automated as well, and achieved via multiple means, from simply storing a second copy of the data on the device after the first is recorded, to streaming low-bandwidth data to an external mechanism for storage.


4.4.5. Questions and Survey Design

Surveys developed for mobile devices simply cannot follow traditional computer-based usability survey guidelines because of the lack of many features, such as keyboards and mice (Sect 4.4.3). A mobile device also adds many limitations not commonly encountered (such as screen size, Sect 4.4.2.1). Therefore, there are additional factors to take into account during both the design and collection stages that do not occur when creating surveys on traditional computers.

4.4.5.1. Survey

Automated surveys can use the above input techniques to take full advantage of the device. Ideally the participant should be told as little as possible about the technologies available. This way the participants will hopefully be ignorant of these mechanisms and not attempt to over-compensate for perceived weaknesses in the devices (speaking louder or slower than normal, for example). It became apparent in both surveys undertaken in this research that several participants over-compensated their movements to ensure they were captured. This was most noticeable with large, over-emphasised motions.

With such a device there are plenty of avenues to travel, but with the resources available, we can make the surveys fun! Enjoyable, stress-free environments result in more realistic answers than an environment the participant simply wants to get out of [6].

4.4.5.2. Questions

When posing a question, resources should be used to their fullest. The devices are capable of writing out and speaking questions, and can use animated diagrams to explain what is required and expected of the participant. Being automated should mean as little involvement by the researcher as possible, so the questions should be developed to handle the possible scenarios that can occur. If the person does not understand, then a repeat of the question, or a way to find further information via the device, should be available.

During the course of these surveys, when people were confused about what they had to do, their answers were unrealistic. With better-defined questions, the number of flawed answers will decrease. A review of the answers given showed that questions designed to be as clear and instinctive as possible resulted in the highest percentage of usable answers.


If the participant makes a mistake, then there should be a plan of action available to either skip or repeat the question. Other issues that need to be considered should this scenario occur include:

• Does the stored file get overwritten?

• Is there a way to avoid having to take the entire survey again just to get the one answer?

• Is a second attempt at the question going to give more valuable data, or are instinctive

responses more important?

Such details and the repercussions of such actions must be considered when designing

questions.

4.4.6. Survey Environments

When designing questions to take advantage of the available input mechanisms, the environments these surveys take place in should be designed to interfere with the data as little as possible, and wherever possible to actually facilitate its collection.

While a truly mobile survey cannot be guaranteed to take place in an ideal environment, this does not mean automated surveys cannot. In the former case, question design and data collection techniques have to be developed to ensure the data is affected by external stimuli as little as possible. But with automated surveys in controlled environments, the data may be much more valuable if it is indeed affected by the environment.

4.4.6.1. Appropriate environments for use of an automated survey

In standard surveys, the environment is typically designed to provoke a certain feeling in the participant so that answers are supplied while the participant is in a certain state of mind. Normally these environments are designed to ensure the participant is not nervous (calming and relaxing environments), is thinking clearly (organised and clean) or perhaps is in a specific mindset (poster placements around the room, or conveniently placed food and beverages).

Such environments apply again to automated surveys depending on what information is to be

collected. But additional factors should be considered when preparing environments, in

particular when collecting video.


With the video being collected typically of lower quality, the environment can be set up to enhance the information retrieved. Since controlled environments are typically rooms, the features in the room should be clearly identifiable. Objects should be easy to spot in the video; this can be achieved with strongly contrasting colours or unique, easily identifiable shapes. Objects such as books have easy-to-recognise shapes (rectangles) and are not limited by colour; such items remain identifiable even when the camera is moving or the image is blurry.

Strongly contrasting walls and floors, along with straight lines, also make video information easier to process. Unique items that are evenly spaced and uncluttered are also easier to notice in a video stream.

Quiet environments make audio easier to record, so the absence of humming computers or

additional electronic devices can make this data easier to process.

4.4.6.2. Preparing environments for more meaningful video results

When preparing environments to aid data collection from automated surveys, certain items can assist in the collection of information, in particular motion. With the device rapidly moving and rotating, the easiest way to process this information is to have a reference image always on screen that can identify the device's location, viewing angle and rotation. Placing such 'icons' around the environment can greatly assist with this motion tracking process. Not only can the device's properties be collected when an icon is on the screen, but transformations between frames (icon to icon) can also be tracked.

To gather this information, the icons would have to be unique and difficult to confuse or misinterpret. This allows the researcher to determine where the device is being pointed, but the icons would also have to adhere to other properties to provide rotation information. Icons would have to be designed in such a way that their rotation can always be determined. While many designs offer this, the easiest way is to make it possible to identify which part of the icon is up (Figure 4.3).


Figure 4-3: A Sample Icon with a Tilde Representing Up

Icon spacing should also be devised so that the video stream is not over-cluttered with information (typically not more than one icon in the picture at once), but is sufficient that the stream is not totally devoid of such information.

4.4.7. Common Obstacles

Many people are intimidated by electronics, and this is one of the major hurdles of an automated electronic survey; if such demographics are a major target of the research, then one must seriously consider whether an automated survey is in fact the best way to proceed. Question design can ease this problem by making the survey itself more fun, and by gently guiding such users into using the device by staging the questions by levels of perceived difficulty.

If this demographic is being targeted, additional input streams can be recorded simply to collect whatever data is available, even if it is not of the type requested. Audio is the best backup solution, as people who are confused, and aware that the device is recording their voice, can often supply useful information.


The direct opposite can also pose problems, as people who are comfortable with such devices can be extremely set in their ways of interacting with them. If people insist on interacting with the devices via their comfortable means, usually buttons or touch, then it must be decided how to deal with this. Should it be recorded at all? Again, survey design plays a major part here.

As with all software engineering, automated surveys need to be tested. When posed with certain situations, participants might respond in unexpected ways; if these responses are not handled, surveys may record wrong information or crash/hang. Black-box testing with testers drawn from the survey demographic is by far the best way to ensure the surveys are stable and handle all extreme situations appropriately.

4.4.8. Device Resources

With a device capable of communicating information to its users in a variety of ways, there is little reason not to take advantage of these mechanisms. Simultaneous text and audio can be used to outline what is required of the participant to answer questions, and imagery can easily be included in the questions. Repeating a question in exactly the same manner as it was originally delivered is simple, and should be available before actual recording takes place.

If a device has phone capabilities, there is no reason why these functions should not be taken advantage of. Allowing the device to simulate phone calls, and letting the user give responses as if they were talking on the phone, may in fact make them a lot more comfortable with the entire process.

4.4.9. Automated Survey Summary

Automated surveys such as the ones undertaken in this research can be of great benefit to the researcher if they are designed to take advantage of the resources offered by the devices. Using such a survey also contributes to familiarisation with video programming on these mobile devices, and such information aids greatly during development.

As development for mobile devices becomes more common, I can foresee frameworks for such surveys being developed, as development time is the only significant detriment of this survey style. Once such a framework is developed and this path is open to most researchers, the true benefits will be seen.

4.5. Survey Two – Testing User Decisions and Reactions

The data gathered from the first survey pointed out two specific factors that had to be eliminated to gain more useful information: the surveys needed to be far more formal, meaning no more answering of questions during the actual survey, and the users had to be less familiar with the devices they were using as props.

It would be expected that participants would ask fewer questions in the following survey if they had partaken in survey one. However, since the planned number of participants increased, the issue would remain, so the users would have to be better informed of what was expected of them. For this larger-scale survey, users were supplied with a document, via a web page, explaining the concepts of a motion-based input model to examine and understand. Questions would be fielded and answered prior to the survey taking place.

To ensure a more controlled environment, allowing users to use their own mobile devices as the basis for the survey had to be scrapped. Using a device with no notable buttons (a PDA) for all participants covered this and eliminated the urge for users to attempt to use buttons as input.

Since the pilot survey had a significant success rate and the possibility of such a model was

deemed valid there was a need to actually define the model with user input. Therefore a better

defined and further controlled survey was required to collect this information.

Survey two was conducted approximately four months after the original survey and was developed as a fully automated process (described above). The goal of the survey was to observe how a user reacted to specific situations that the device itself presented. This ensured a level of realism the first survey could not provide; for example, the device could ring when it wanted the user to answer it, or change its volume to see how the user reacted to quiet and loud noises.


Again the situations were split into input types (Section 3.5.2), but were better defined and more specific than in the original survey. Each of these situations is described in more detail in this section.

4.5.1. Design

The informality of the initial survey had to be removed and quantitative data had to be collected. This meant that all the information regarding motions had to be supported with hard data (in this case video). The users also had to receive exactly the same information from the questions and derive their own understanding from it. These factors suggested using a single device as the focal point of the entire survey.

The device used to collect information from the users was an HTC Universal, branded as an I-Mate JasJar [41]. The JasJar is a Windows Mobile 5 device that can be used as a PDA (Figure 4.5) or a mini-laptop-like device (Figure 4.4). After demonstrating the device to users in PDA mode while running the native camera application, it was noticed that the lack of buttons made people interact with it via motion in multiple situations. When queried about this phenomenon, users responded that these were instinctive motions, indicating that they were not consciously aware they were performing them. This was exactly the information that was required.

Figure 4-4: Universal in Laptop (landscape) mode


The devices were programmed to run applications that supplied information to the users via traditional means (audio and text), and the reactions to these applications could be recorded. We believe this automated approach is a much better platform for carrying out the quantitative surveys required for this aspect of the research. Taking this approach offered multiple bonuses:

• Consistent results: all participants were supplied with the same device and identical scenarios to respond to. The lack of external influences would greatly increase the quality of the data retrieved.

• An early test bed for developing applications on these devices would be achieved. Much

of the information at this stage could be passed on to later stages of design and

development.

• All data could be retrieved and stored automatically in digital form, simplifying the

storage.

• Participants were supplied with an unfamiliar device with a limited amount of buttons to

interact with. This would maximize the possibility of the participants actually using

motion to express their input.

• The device's internal camera could be tested for capturing motions in a real-life scenario.

Figure 4-5: Universal in PDA (Portrait)


A significant downside was that there would be an increased time between surveys, as the development of these automated surveys needed to be completed before they could be used. Considering that the time spent in development at this stage would prove beneficial later, it was deemed a minor sacrifice.

4.5.2. Data Collection Methods

Each video was recorded internally by the device inside an ASF wrapper with no audio stream. Video was encoded with the Microsoft MPEG4 compressor at default properties and a 176x144 resolution. Each of the tests was stored in a separate video file with the following filename headers (Table 1).

Table 1: Header of Video Filenames for Survey Two

Test Header

Choosing ‘CH’

Adjustment ‘AD’

Modification ‘MO’

Confirmation ‘CO’

Functionality ‘FU’

Functionality2 ‘FN’

These headers were appended with an incrementing number denoting the participant. The number was stored in the device's registry and incremented by one at the end of each set of tests, ensuring it survived when files were copied across to a more permanent medium (a laptop), which was generally done after every three to four participants. The video was then batch-converted from the ASF container to AVI using the XviD encoder with the quality quantiser set at 7 (very high considering the video quality), and saved directly to an external 3.5-inch hard drive in a caddy connected via USB 2.0.
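A minimal sketch of this numbering scheme follows (the registry key and value names are illustrative; Microsoft.Win32.Registry is available in .NET CF 2.0):

    using Microsoft.Win32;

    class FileNaming
    {
        const string KeyPath = @"Software\MotionSurvey";   // illustrative key

        // Builds e.g. "CH12.asf" from the two-letter test header in Table 1.
        public static string NextFileName(string testHeader)
        {
            RegistryKey key = Registry.LocalMachine.CreateSubKey(KeyPath);
            int participant = (int)key.GetValue("Participant", 0);
            key.Close();
            return testHeader + participant + ".asf";
        }

        // Called once at the end of a participant's full set of tests.
        public static void NextParticipant()
        {
            RegistryKey key = Registry.LocalMachine.CreateSubKey(KeyPath);
            int participant = (int)key.GetValue("Participant", 0);
            key.SetValue("Participant", participant + 1);
            key.Close();
        }
    }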

The HTC Universal is equipped with two embedded cameras. One is used as a standard low-resolution video/photo mechanism, while the other points towards the user and is aimed purely at voice calls. While using both cameras might have been possible, it was decided to record purely from the external (first) camera. While capable of capturing resolutions up to 640 by 480 pixels, at that resolution the frame rate is limited (around 10-15 frames per second) and is also susceptible to frame skipping, like most software-driven cameras. Therefore all captures were performed at 176 by 144 pixels throughout the survey phase.


At this resolution, it was possible to capture at greater than 20 frames per second with more than acceptable image clarity and limited blurring from fast movement. The data from each test was recorded and encoded on the fly by the device into an Advanced Streaming Format (ASF) container. Only video was captured during the tests, so no audio stream was recorded (Figure 4.5). This caused minor playback issues on desktop platforms, so the video was transcoded into an MPEG4 AVI container to ease compatibility and playback. All video attributes (depth, size) were kept during this transcoding, and the bit rate of the videos was slightly increased to compensate for the different compression techniques used by the filters, ensuring the videos were as close as possible to the originals. MEncoder (http://www.mplayerhq.hu/) was used for conversion into the MPEG4 format.
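A command of the following form (reconstructed here for illustration rather than quoted from the project notes) performs this kind of conversion, with fixed_quant=7 corresponding to the quantiser value mentioned above:

    mencoder CHdJK.asf -nosound -ovc xvid -xvidencopts fixed_quant=7 -o CHdJK.avi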

With each participant partaking in six different tests (Section 4.5.4), each video clip was saved to a separate file with a unique identifier mapped to the person's name (e.g. CHdJK.asf, the CH denoting Choosing; Section 4.5.2). After transcoding, the clips were organised according to the test that produced them, allowing easier viewing of videos of the same input type.

Figure 4-5: Sample Capture

4.5.3. Survey Conditions

The entire motion survey was conducted within the Multimedia Lab (Room S710) during the Easter holidays of 2006 to ensure a quiet and uninterrupted environment. Every participant was alone during the survey, as the device prompted the user with each question in turn and recorded all data. I remained behind a partition to ensure I did not distract the user or exert any influence; this included ignoring all questions asked.

With video now being collected from the device instead of over the person's shoulder, there was a need for the environment to be easily identifiable. Also, for collecting quantitative data there was a need to be able to put values to this information. To facilitate this, a relatively bare room was selected as the location of the survey. Easily identifiable markers were placed around the room so that motion could be tracked more easily at a later date (Fig 4.6).

Figure 4-6: Controlled Room Layout with Markers

These markers were placed in an attempt to gather all the necessary information about how users moved the device in response to the situations. Blind tests were conducted to determine how movement could be tracked in the environment. A third party took snapshots from various locations in the room at different angles to test the efficiency of this environment, and the markers greatly eased the determination of where the camera was located when the pictures were taken. Rotation and linear movement from the point of view of the user could be tracked for the front half of the room with high success. Motion was easier to track since the translation between markers helped highlight the changes in camera position. Floor and ceiling colouration differences also aided rotation and location detection.


4.5.4. The Tests

Each test created was designed to concentrate on a specific input type. Each situation was designed to simulate a common real-life scenario.

4.5.4.1. Functionality Required

Each of the input types to be tested required the creation of a test application with appropriate interface functionality and a recording facility. To some extent this restricted the testing approach used for particular input types. The applications ran on a timer that presumed the user had responded to a situation. However, in situations where multiple paths could be taken (scrolling down or up through a list), this could not be compensated for.

The audio from each of the tests is available in Appendix E. These were recorded as separate sound bites and played by the device before the commencement of each test.


4.5.4.2. Choosing Data on the Screen

In this test, the participant was given the device and asked to control it in such a way as to select a series of asterisks that appeared on the screen in a set order (Fig 4.7.). The test was designed to see how the user would move the device to reach a specified end point from the start position.

Figure 4-7: The Choosing Test (Emulator shot)

The test always had a simple rendering of a mouse cursor in the middle of the screen to denote the user's start point, and an asterisk was shown to denote the end point. This end point changed every 3 seconds, giving the user a new target to aim at, once again starting from the cursor in the middle. The test queried seven of the eight standard directions the cursor could be moved in (Fig 4.8.).

Figure 4-8: The Eight-Way Movement

The asterisks were displayed in ascending numerical order as indicated in Figure 4.8. It was assumed that the differentiation between up and down movements would be the most significant factor in this test; these directions were therefore kept separate in an attempt to observe reactions.


There are generally two schools of thought when holding a device in front of you and wanting to indicate a direction in the Y axis; these often depend on one's perspective of the situation. Many users would want to move or roll the device down to reach square seven, but a minority would actually want to move the device up, much like the controls of an aeroplane, where pushing forward tilts the nose down and hence sends the plane downwards.

Many games (first-person shooters in particular) compensate for such a difference by giving an option to invert the Y-axis, allowing the user to push the mouse away from them (up on the surface) to look down. This seems the most appropriate way to handle this particularity.

I had originally assumed that the great majority of people would in fact roll the device to generate movement, simply because of the lesser body movement required. This turned out to be false, as 39 of the people surveyed relied on linear movement to indicate choosing. Interestingly, 14 of these people used very large motions to indicate this movement; whether this was just to emphasise their response or what they naturally believed is not known. Seventeen people relied on rolling the device; four of these used the inverted method explained above. It should also be noted that a form of deceleration was detected in some of these videos, with the movement slowing down when approaching the target.

Eight people offered no video/motion response. This was attributed to confusion, with five of these people deciding to touch the screen instead and the remaining three offering no response at all.

Eight people obstructed the camera with their fingers while the information was being recorded, while five held the device the wrong way around, so no valid data could be collected from them. Three people's data was unreadable because there was not sufficient focus on the markers around the room. No one offered alternative motions to indicate the movement portion of choosing the asterisks.

Of the 57 usable responses, 41 also performed notable actions to select the items after moving to them. These were: slowly pushing the device away from the body (8), a single quick flick away from the body (20), a double flick (9), a left/right shake (3) and a left 90 degree rotation (1).


The only unexpected but usable response from this test was the rotation to select an item, which was an interesting outcome.

4.5.4.3. Adjustment to Counteract the Changing Image

This phenomenon was the catalyst for devising these automated tests. Capturing video while the device was in portrait mode actually rotated the image 90 degrees (a device limitation, I presume). People, including myself, performed what was natural when seeing this rotated image (Fig 4.9.). We all tried to correct it.

Much like with a camera viewfinder, where you try to get the perfect angle by rotating the device, the same was occurring here as people naturally rotated the device in an attempt to match what was being seen. Of course this was futile, as all it accomplished was further rotation of the camera itself, resulting in the identical image. This made a perfect test of how people respond to a simple adjustment situation. But the test itself was even deeper.

Figure 4-9: Adjustment Test. Notice the LCD Screen

The data being streamed through the camera was still not being displayed properly even when the rotation was taken into account. Somewhere along the filtergraph of the data stream, the image format was getting lost, resulting in the image actually being inverted along the X axis. This is very hard to notice because of the rotation already present in the image. So the participants' second attempt to correct the image was now also of interest.


Participants were given the device while it explained the instructions. The information being shown was a live stream coming directly from the device's camera and was being recorded, so attempts to correct the image could be seen. The goal was to see which way the devices were being rotated and what follow-up attempts happened after the participants noticed the failure of their correction attempts.

Given that participants always faced the same objects in the room, the objects shown through the viewfinder had the same impact on every user's rotation decision. Choices of movement therefore came down entirely to the participant's thought process and were not unevenly influenced by what was seen on the screen. For example, in Figure 4.9 the LCD monitor in the background is a focal object, so users might try to correct the orientation of that first.

Confusion did not play a part in this test, as all users who received a viewable image did in fact try to move the device to correct it. Problems once again occurred with people who placed fingers over the camera; some of these realised the problem themselves before a significant amount of time had expired. Seven responses were not usable due to a delayed reaction caused by a finger covering the camera; the other nine finger-affected responses were usable.

Every one of the usable responses started by rotating the device to the left (Fig 4.10).

Figure 4-10: Device Rotation

Differences were noted in the speed and amount of rotation. These factors can both be attributed to the user's certainty that their action would rectify the image. Of the 73 usable responses, 25 stopped after the one rotation, while 38 attempted to rotate right after the left rotation failed before giving up. The remaining 10 used multiple different rotation attempts to achieve the result. The overwhelming result was that rotating the device was deemed the same as rotating an object displayed upon the device.


4.5.4.4. Modification - Dealing with a Warped World

This is another test where the user views an image and the way they respond to it is recorded. This time a static image of a building was chosen (Fig 4.11.) and various image transforms were placed on it to see how the user reacted.

Figure 4-11: Modification Image, High Brightness

Various effects were placed upon the image as it was being shown to the participants to gauge

their reactions. Initially the image was shown unmodified as a baseline, then the image goes

through multiple stages of brightness to see how the users respond. There are three steps the

brightness test goes though, the above figure being the highest (Fig 4.11.).

After the brightness tests, the image was returned to normal and the contrast was increased in two steps to determine whether any natural motion occurred to counteract a stronger image. After this, the image was once again returned to normal and then flickered, to test whether any motion occurred because of the flicker.

Again, relying on natural reactions instead of instructions resulted in a high success rate in the collected data. Smothered video was again an unavoidable issue (9 occurrences), but 52 respondents gave usable data; only minor or insignificant motion was recorded for the remaining respondents. It was noted that some of these participants with poor responses actually relied on head movements instead of device movements in an attempt to counter the changes, something that was non-existent in the previous test.


Brightness, in general, was countered by moving the device away from the body as a natural

reaction. The shaking response was also popular here in an attempt to rectify the situation.

Acceleration of the movement away in this test appeared slower than in other tests.

Increasing the contrast resulted in a similar response; therefore a movement like these two could not be mapped to a direct action. They could only be interpreted as the user wanting a change to occur in the image (back to normal). From there the user would have to be queried about the problem.

The flicker provided more usable results, as slight rotations of the device were performed (presumably so the user could look at the device at a slightly different angle).

The result of this test suggests that no definite motion could be applied to modification

situations, only that certain motions should be watched for to determine that there may be a

problem and that further interaction is needed to ascertain the exact problem.


4.5.4.5. Confirmation with the Faces

In this test the participants were shown a collection of images showing either happy or sad faces, with the simple request of agreeing with the happy faces (Fig 4.12.) while disagreeing with the sad (Fig 4.13.). The goal of this survey was to discover how participants interpreted yes and no (confirmation answers) as motion.

Figure 4-12: A Happy Face

Eight faces were shown in succession from multiple sources (cartoons, photos, renders, faces moulded out of snow), with the users given two seconds to respond to each face. The faces appeared in the following order for everyone: Happy, Happy, Sad, Happy, Sad, Sad, Sad and Happy. A set order of faces was chosen to greatly ease the processing of the recorded video data.

Figure 4-13: Sad Face


72 respondents provided at least some usable data (at least one happy and one sad response). And while respondents answered consistently within their own tests, the scope of responses across participants was large. Many different motion types were used to signify yes and no, and there was no standout (most popular) response.

• Shake device left/right for yes, up/down for no (32)

• Move device left/right for yes, up/down for no (7)

• Move device left for yes, right for no (24)

• Turn device left for yes, right for no (9)

The singular direction inputs did not make any sense to me at first, until I realised people were imitating the inputs on a message box with a yes/no (ok/cancel) option. Movers appeared to prefer the single direction inputs while rotators liked the multiple directions. What has to be noted is that there is already a large amount of emphasis on shaking motions for other commands. How will one differentiate between yes and help if this path is taken?

4.5.4.6. Scrolling Functionality

The concept of this test was to get the participants to silently read a block of text on the screen. The actual text consisted of just over 500 words describing various input types on mobile devices. This was more text than could fit on the screen at once; therefore it was scrolled up so the users could attempt to read it all (Fig 4.14.).

Figure 4-14: The Functionality Test and the Text within.


The idea behind this test was that the scrolling speed increased at a constant rate until it was basically impossible to keep up. With such a design there would be a stage where the lines being read were scrolling up faster than the eyes were moving down after completing each line. The video was recorded to determine what motion happened once the participants started 'falling behind' the scroll speed and, hopefully, tried to compensate for it with a reactive motion of the device that attempted to slow down, or possibly even reverse, the scrolling.

This test proved highly successful as reactive motion by users was very significant in the

results. 72 of the respondents performed a notable motion as the scrolling speed of the text

began to overwhelm their reading. The increase in speed was subtle enough that it was not

noticeable until it was basically too late, but the instinctive motion during this subtle

acceleration could be seen.

From the initial reading position, all 72 of these respondents naturally tilted the device upwards as a means to counteract the scrolling speed. This could be visualised as trying to increase the gravity applied to the text by moving the device to a more vertical position. Observations also showed that, along with this device rotation, head rotation occurred as well, but the head rotation occurred in both directions.

This was an indication of natural motion at its finest. I queried several people after the survey about what they had done; several were aware they were tilting the device up as a counter-measure, but most were not. People who were aware were generally those who were tilting their head downwards as a countermeasure.


4.5.4.7. Simulating a Phone Call

This test was a collection of minor situations that could be applied to making a phone call, all rolled into one test. The step-by-step plan of this survey was:

• The participant placing the device on the table.

This was to ensure the device was in a ‘neutral state’ before the test commenced.

• The device ringing to simulate a phone call.

A traditional ringer was used to ensure the participants understood that the phone was

supposed to be ringing and they should answer it.

• The participant picking the phone up to answer the call.

The user would pick the device up and place it to their ear as if answering a phone call.

• The device starting a basic conversation.

The device played out a generic, subtle telemarketer's spiel to start off the conversation.

• The device asking a simple voice question that the user presumably would answer ‘no’ to.

The device asked the participant if they wanted household insurance. The attempt was to pose a question and situation that the participants would instinctively answer without any thought process, which could perhaps have some motion impact on the device.

• The device’s volume suddenly getting very loud.

The device’s telemarketer becomes upset because of this no answer. Since all this audio was

in fact player over the device’s external speaker it was in fact significantly louder than what

would be possible.

• The participant responding to this sudden loud noise.

The user would instinctively move the device away from their ear, but how they did so was the point of interest: directly away from the ear, or down? Hopefully the user would not be too surprised and drop the device.


• The device’s conversation getting angry with the user and hanging up.

The phone call was ending.

• The device playing the disconnected tone.

The phone call was officially over; how would the user respond?

• The participant hanging up the phone.

The user performing some action to indicate they had hung up from their end as well.

The idea behind this test was to get general reactions for a set of situations placed together to

examine the combination of motions to reach an end goal. Each of the perceived inputs was

minor and hopefully instinctive. Quantitative data was not to be collected from this test; it was

merely to see how a motion input system could be applied to a more complex real world

scenario.

The test did not start well for 18 of the people, as they picked the phone up and placed it to their ear the wrong way around. This made the video hard to track, as the camera was typically pointed right at their ear or hair. These situations were discarded. The remaining participants were able to provide at least some usable information.

The remaining 62 users all picked the device up and placed it to their ears as expected when answering a phone. However, the insurance question elicited few usable motion responses, as most of the movement could be attributed to laughing. Some minor head shaking to indicate 'no' was detected in six of the responses.

Upon the increase in volume, 45 participants moved the device directly away from their ear

while seven moved the device in a general downwards direction. Ten users did not react to the

volume. No one dropped the device.


48 users moved the device away from their ear and placed it upon the table. A situation where an external object (the table) may have affected the result was not presumed to be part of the survey, since a table cannot always be relied upon as part of the hanging-up process. Four users placed the phone in their laps after 'hanging up'. Two placed the device into their pocket while the remaining eight kept the device in their hand.

4.5.5. Participants

Eighty users were selected for the second survey. Twenty-seven remained from the original survey while the remaining 53 were new. These new participants were supplied with the original survey as an introduction to the prior work.

While user information for these new participants was not collected, it can be stated that their backgrounds made the tested population less diverse, particularly in the age category, as the majority of the new participants fitted into the 23-26 and 31-34 categories, though seven people aged 40+ were added. The IT and construction fields saw the biggest growth in job positions.

Little information was supplied to the participants prior to the survey, and they were asked to try to keep the tests confidential in an attempt not to corrupt the responses of people who had yet to participate. The participants were informed that information would be collected via a PDA that they were to interact with and that I would not be responding to questions during the survey.

4.5.6. Additional Details/Observations

A few people had trouble handling the device because of its size and weight. I imagine this did have an impact on the answers given, but there was no way around this issue.

Related to the above, some people were aware of the cost and fragility of the device (the screen had cracked and had to be replaced prior to the surveys). Both of these issues would have had minor influences on the responses given.


Several people assumed that the statement 'hold the device at a 45 degree angle' was an indication to roll the device so that the top of the device slanted to their left. This was a minor issue as the motion captured could still be processed. Tilting the top of the device downwards occurred naturally since the device was always held below eye level.

4.6. Survey and Concept Summary

Simple motion concepts were tested upon a wide variety of the mobile-device-using population to see how these users embraced the concept. Originally, simple motions were tested in an informal environment. Participants were asked how they would perform simple functions. Their motions were recorded and other reactions jotted down.

On first look it appeared that people either understood the motion concept or just could not understand why they should bother; they had keys and other input mechanisms to rely upon instead. While this was not ideal, there was a definite base that could use the model (greater than 60 percent of surveyed candidates). This number might be increased with a gentler learning curve and possibly prior training in how motion could be advantageous.

This increased number was actually achieved during the course of the second survey. It became more and more apparent that, given the right circumstances, anyone would use motion to try to modify the outcome they wished to achieve. Even if they were not consciously aware of this movement, it could still be taken advantage of; after all, such sub-conscious movement is natural motion in its purest form. When users were informed of this movement after the tests they became much more receptive to the entire concept.

This suggests that even if users do not embrace the entire idea of using motions at first, it might still be a useful tool to augment their everyday usage. As suggested by the second survey, people became more comfortable with the idea over time as its advantages were demonstrated first-hand. This can only aid the further adoption of the concept. If less responsive users realise their inputs are being guided by motions as well, then they should be more receptive to the positives of the model in its entirety.


5. Basic Model Creation

With the information collected from the surveys, it is possible to start mapping certain motion categories to input types and then further classify motions to specific inputs. This is performed by examining how often certain commands were used by the users and in what situations they were used. If the survey data shows an overwhelming response to a specific input by participants using the same motions, then that motion is likely to be mapped to that input.

The goals in creating the model are three-fold:

• To create a motion model that is intuitive and simple.

We wish to augment the traditional input schemes by allowing users to perform actions to

signify their intentions. These actions would be movements that come naturally to the user

and take minimum processing and effort to perform.

• To cover a significant portion of the day-to-day actions performed upon a mobile device.

Motions for the most common actions that users wish to perform will allow true comparisons

on how this model performs against competitors. This will also give solid guidelines for

possible implementations.

• To allow the expansion of the model to be both logical and simple.

The model needs to be able to cope with new input and motion types being added to it. This means that the model must be well-defined so designers know where to start the placement of their concept inside the model. It will also be clear how to easily include their ideas through the model with minimal changes to the model itself. The model as a whole should be a good guide on how the designers' inputs could be mapped to motion.


Some natural motions are a combination of multiple motions (rotate left, then rotate right) and could cause confusion with an input that is only rotate left, since both contain the same information. The rotate left could therefore fire off a command before the rotate right could occur. To stop this from happening, many commands require a neutral state after them to indicate the end of the input.
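As an illustration of this neutral-state rule (a minimal sketch with names of my own choosing, not the model's implementation), a recogniser can buffer polled motion samples until a neutral sample arrives, so a combined 'rotate left, rotate right' gesture is matched as a whole rather than firing on its 'rotate left' prefix:

#include <vector>

enum Motion { Neutral, RotateLeft, RotateRight, MoveLeft, MoveRight };

class GestureBuffer {
    std::vector<Motion> pending;       // motions seen since the last neutral state
public:
    // Feed one polled sample; returns true when a complete gesture is ready.
    bool feed(Motion m, std::vector<Motion>& gesture) {
        if (m != Neutral) {
            pending.push_back(m);      // gesture still in progress
            return false;
        }
        if (pending.empty())
            return false;              // device is simply idle
        gesture = pending;             // the neutral state closed the gesture
        pending.clear();
        return true;                   // caller may now match the full sequence
    }
};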

5.1. Collected Data Classification

The first step in pulling the collected information together is to classify each type of motion used by the participants during the survey. An example of these has been tabulated in Appendix F. This shows the input types and where they were used. The information is sorted by when they were used (survey).

This table shows data that was consistent throughout the survey, with simple-to-classify motions being far more popular (in particular directional movements). Participants often used very similar inputs to end up with their result. This information can be restructured simply, allowing us to examine what types of motions are commonly associated with certain types of input (Appendix G). This data shows that it was very easy to apply certain motions to input events. The table also supplies additional information that was obtained from examining the inputs (confidence of answer, for example).

5.2. Situational Motions

As discussed earlier (Section 3.5.1.), certain inputs do not make sense in certain situations (i.e. panning a picture to the right while listening to music, since there is no image to pan). This brings up two important points:

The model must take into account the current situation of the device

Since many motions will make no contextual sense in some situations, there must be limits and

controls on when certain inputs can occur. This also means any software must be aware of the

current usage state of the device.


Motions can represent multiple inputs as long as those inputs are in mutually exclusive

situations.

We can map the motion Direction Right, Y-Axis to multiple inputs, but each of these inputs must not interfere with the others. With the current limitations of multi-tasking on Windows Mobile (one application running on top, others in the background) this mutual exclusivity is in fact increased (only one application can have control at a time), allowing increased reuse of motions across situations.
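A minimal sketch of this idea follows; the contexts, motions and bindings are hypothetical examples rather than the model's actual tables. The lookup is keyed on both the active situation and the motion, so one motion resolves to different inputs in mutually exclusive situations:

#include <map>
#include <string>
#include <utility>

enum Context { MusicPlayer, PhotoViewer };
enum MotionType { DirectionRightY, ShakeLeftRight };

// (active situation, motion) -> named input command
typedef std::map<std::pair<Context, MotionType>, std::string> BindingTable;

std::string resolve(const BindingTable& table, Context active, MotionType m) {
    BindingTable::const_iterator it = table.find(std::make_pair(active, m));
    return it == table.end() ? "" : it->second;   // empty: no meaning in this context
}

// Usage: the same motion carries two meanings without conflict, e.g.
//   table[std::make_pair(PhotoViewer, DirectionRightY)] = "Pan Image Right";
//   table[std::make_pair(MusicPlayer, DirectionRightY)] = "Next Track";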

Currently defined situations are available in Appendix H, which describes the situations examined and their relationships with other situations.

5.3. Model Expandability

The goal must be to avoid boxing designers in with respect to the options they can choose for their inputs. This means that designers should always have motion options available to choose from, and hopefully enough of them that at least one is feasible for the required input. We do not want designers deciding upon illogical inputs simply because there is nothing else available for them to choose.

Therefore commands pre-defined by the model should be spread across the motion types. This

increases the chance the designers can use a motion that is similar to what is decreed optimal

without having to overwrite a pre-existing command, which should be an absolute last resort.


5.4. Inappropriate Commands

Commands need to be judged for appropriateness before even being applied to the model. There are inputs for which currently available input mechanisms are simply better at performing the task. For example, drawing a picture using a stylus on the touch-screen to simulate a pencil will be far more effective than the most natural motion replacement, which would be to emulate a pencil with the entire device. Being able to see what is drawn directly with the stylus, and the stylus' physical similarity to a pencil, are positives a motion-based scheme cannot effectively replicate.

There are also situations that are not just difficult like the above example, but are in fact totally

inappropriate. If one is using the device’s camera functionality and has a perfect snapshot, one

does not wish to have to perform some motion on the device to take the picture. Not only will

one lose the perfect picture, but the motion is more than likely to blur the picture. In such a

situation it is best that the input model (and application that performs the tracking) are disabled

entirely in favour of a more traditional input model.

5.5. The Base Model

To create a motion for an input, certain information needs to be defined: an input, the context of this input and a proposed motion. The thought process that could be applied is shown in Figure 5.1. When the designer has a certain situation in which the input is valid, this is the beginning of defining a context. With this context in mind, it can be checked whether the motion is already being used. If the motion is not being used then it can be applied to the input wanted by the designer. With a motion mapped to an input, a result is achieved.


Figure 5-1: Example of how the Information fits together from the Model

The breakdown of the motions in the model is available in Appendix I. The process for

creating these inputs follows.

Once the input and a relative motion have been defined, the input needs to be placed in a category. If this is a pre-existing category, then the designer must compare each of the categories that it is related to. If any of these categories contains the same motion as the one proposed, then it is recommended that the motion not be used. It is possible that the conflicting inputs may be exclusive from each other, as the categories might be related in separate ways, but it is still recommended that motions are not replicated.

Once this comparison is set up, it will be possible to quickly check which inputs are in fact available, and if any of these is appropriate it can easily be added. If the input belongs to a unique category, then this category has to be added to the model and examined to see which situations already in the model it can relate to. Once this is complete, the examination of conflicts can begin.


Scaling of inputs is an important factor. It was generally observed that certain motions were performed with slightly different factors affecting them. This can be visualised simply by imagining a user moving the device along the x-axis faster to indicate that they wish the image displayed on the screen to pan across faster. These factors are indicated in Appendix G as adjusters. For inputs that accept varying levels of 'amount', these adjusters aid in determining that information.

A basic model must cover fundamental information. To achieve this, certain input types must be covered, including Confirmation, Movement, Choosing and Selection. A type such as Adjustment (adjusting what?) has such a large scope that it cannot be blanket-covered by a model; the expandability of the model should cater for it instead.

5.5.1. Confirmation

With confirmation only covering two possible inputs (there is no choice to back out of the option, à la cancel), it is an easy motion to map. Yes and no are polar opposites and the inputs showed this. Regardless of the choice of motion, 'No' used the same motion type as 'Yes' but with opposite parameters.

For example:

• Yes – Move Left, No - Move Right

• Yes – Move Up then Down, No – Move Left then Right

• Yes – Rotate Up then Down, No – Rotate Left then Right

Being in an exclusive category (a confirmation dialog will always be on top and cannot be avoided until it is answered), confirmation inputs do not have to worry about stealing input motions away from other types. Therefore any of the inputs can be selected. If allowing multiple inputs to be used as one command, then we should ensure these inputs cannot be confused with one another (to ensure the highest accuracy). Moving Left is part of one 'Yes' input and one 'No' input, therefore only one of these input paths should be used. Allowing multiple inputs suggests that taking both inputs 2 and 3 is the best course of action for the confirmation inputs 'Yes' and 'No'.

5.5.2. Movement

Movement is interesting, as it can be applied to many different contexts and situations, since many items can be moved in many different ways. However, the entire idea of moving an item was


easy for participants to grasp and their answers were consistent. The single most important consideration is what is being moved. For example, when viewing a picture on the screen and moving the device to the left to simulate a left movement, one is in fact not moving the picture but moving the viewport left. Such information needs to be explained to avoid confusion.

Truly analogue movement appears best avoided, as its detection would require a far more precise detection method to work solidly. Therefore eight-way movement appears to be the best course of action, with rotation only allowed directly about the three axes:

Examples:

Table 2: Movement to Input Mapping

Move Device Left -> Move Target Left
Move Device Up -> Move Target Up
Move Device Up Right -> Move Target Up Right
Rotate Device Left (X Axis) -> Rotate Target Left (X Axis)
Rotate Device Down (Y Axis) -> Rotate Target Down (Y Axis)
Rotate Device Left (Z Axis) -> Rotate Target Left (Z Axis)

The inverted phenomenon discussed earlier is best avoided until the model can accept user preferences, so the above examples are best suited for the base model.

5.5.3. Choosing

The base model of choosing assumes we are choosing from a vertical list of items. Again, depending on the context of the situation this can be vastly different, but handling the hierarchical menus available on devices is the most common choosing exercise.

In this situation there is always a default value set. This gives us a base point from which to traverse the list (again, a hierarchical menu); the options in any situation are to go up, to go down or to pick an option. It is possible to have scales to choose how far to go up or down, and if we keep consistent with the survey results (and in fact our previously defined base type, movement) we can use movement to traverse up and down the list, with the amount of distance determining how far. This can be achieved by moving to the next item, then after a set time determining whether we are still moving; if so, movement through the list continues.


Therefore:

Table 3: Precise Input to Motion Mapping

Activate Item -> Push Device (Move Forward then Back on Z axis)
Move Up List -> Move Device Direction Up
Move Down List -> Move Device Direction Down

Since we are polling the device for inputs at regular intervals, the model should accept continued movement up as continued traversal of the list. However, any time the device is brought back towards the start position, this would be seen as movement in the opposite direction, and hence as an attempted input in the opposite direction.
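A sketch of this polling rule (the structure and names are mine, purely for illustration): each poll interval advances the highlighted item while the device remains displaced in the same direction, and a neutral reading leaves the selection where it is:

enum Direction { NoMovement, DirUp, DirDown };

struct ListCursor {
    int index;    // currently highlighted item
    int count;    // total items in the (hierarchical) menu list
};

// Called once per poll interval with the direction detected in that interval.
void step(ListCursor& c, Direction d) {
    if (d == DirUp && c.index > 0)
        --c.index;                      // sustained upward movement keeps climbing
    if (d == DirDown && c.index < c.count - 1)
        ++c.index;                      // sustained downward movement keeps descending
    // NoMovement is the neutral state: the selection stays put.
}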

5.5.4. Selection

Selection brought forward the notion of where to start. The most forgiving answer is, after every accepted input, to return the start point of a selection to the middle of the screen. As an example, after double-clicking an icon on a desktop and running the application, once the user quits that application the mouse pointer returns to the middle of the screen.

This again allows us to easily embrace an input concept that has already been defined by choosing: move the selection in a direction and continue to move in that direction while the device's motion continues in that same direction. This allows us to use the directional input of motion as our stimulus and the activation process of choosing. Selection and movement are exclusive from each other except for the process of de-selection, so they can share inputs. An item must be selected before it can be modified or moved, so once an item is selected it can change to the movement structure.


Step-by-Step Desktop Example:

• User selects a square in a textbox in Microsoft PowerPoint (No change)

• User selects a point inside the selected textbox (cursor changes to the movement arrow

while mouse is held down)

• User moves textbox around

• User deselects textbox (releases mouse)

Such selections can be replicated very closely by hand motions, except for the actual holding down of the mouse button, which serves as a two-state switch. Therefore a de-select input that turns off this switch is required (we already have activation to turn it on).

5.6. Summary of Model Creation

Stipulations and exceptions were continually found while developing a base model for the system. It is simple to define what the inputs are and to give them motions; however, placing these within a consistent model remains a challenging task. Such problems were typically solved by simplifying the process (for example, limiting directions to eight), while others required stipulations to cover the entire model (for continual scrolling up, does the user have to keep moving their arm up, or can they return back down and then resume up?). Such decisions had to be made to keep the model as simple as possible without being driven by implementation hassles.

Breaking down the data collected in the survey was also a huge task, one made much less troubling by the input classifications. These meant that only the types of motions needed to be collected, and not exact data on distance or speed. Such details could be supplied with very little precision (in fact simple descriptors were sufficient, e.g. slow, fast or sharp).

Overall, a model has been created that is simple enough to apply at the application level, or

even at the OS level.


6. Prototype Development

To demonstrate the detection of motion and its application to a situation, a prototype had to be developed. A desktop implementation was to be developed first as a proof-of-concept, which would then be shrunk down and implemented on a mobile platform. The goal was to successfully track motion over a series of frames on a mobile platform in such a way that the performance of the device was not significantly impacted (allowing the user to perform other tasks at the same time).

6.1. Using DirectShow

DirectShow [2] is commonly known as the video and audio processing component of Microsoft's DirectX Framework [8, 21]. It is also commonly considered the most complex and difficult-to-learn aspect of the framework. This is evident in that much of the functionality available to desktop developers is simply non-existent in the mobile framework because of implementation hardships.

DirectShow uses a plug-in-type architecture, taking full advantage of the ideals behind COM, Microsoft's Component Object Model, which allows different components to talk and interact with each other. The way this works can be easily visualised in a desktop application such as GraphEdit. This application allows the user to visually design the graph the developer wishes the video information to pass through (Figure 6.1). Every stage from the initial source (camera/file) through to the final destination (screen/file) is part of this DirectShow filter graph, along with everything in between.

Figure 6-1: GraphEdit


The concept of creating the graph itself is similar between the two platforms (Windows Desktop and Windows Mobile), but many commonly used features either do not exist or have limited functionality in their mobile incarnation. This, coupled with limited debug libraries, can make the development of a filter a much more drawn-out process.

6.1.1. DirectShow Filters

DirectShow filters are the middle ground between the input video information and the output. Typically filters are used to perform image enhancements/modifications such as sharpening, cropping and resizing before displaying the results to the user. Information is sent to the filter as a Media Sample that contains header information such as size and encoding type. This can be used to obtain a reference to the image itself, as a byte stream. With this information, a new media sample can be created (either by directly copying this information or creating your own) and the required modifications can be made to the data if needed.

Once a filter is plugged in, all information will travel through it. For example, a darkening filter will typically copy the header information and then go through the image data decreasing the values by a fluctuating amount (and therefore darkening the image). Knowing the data type tells you the image format, which values to decrease (colour, bit depth of the image, whether it contains an alpha channel) and the most efficient way to go about it.
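A sketch of such a darkening pass, assuming a 32-bit BGRA frame (the amount subtracted is illustrative): the filter walks the byte stream lowering each colour byte while leaving the alpha byte untouched:

void darken(unsigned char* pixels, int width, int height, unsigned char amount) {
    const int bytesPerPixel = 4;                          // B, G, R, A
    for (int i = 0; i < width * height; ++i) {
        unsigned char* p = pixels + i * bytesPerPixel;
        for (int c = 0; c < 3; ++c)                       // skip p[3], the alpha channel
            p[c] = (p[c] > amount) ? p[c] - amount : 0;   // clamp at black
    }
}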

DirectShow filters are typically compiled as their own library and interact with applications via COM. The GraphEdit application available in the DirectShow SDK allows interaction with these filters via a drag-and-drop interface for easy testing and debugging. These libraries are typically registered through the Windows registry and can be dynamically loaded by the operating system when required.

Such filters run independently of other applications in separate threads and provide a great starting ground for a motion-based input implementation, as they can run in the background, independently of other tasks, and can be probed when required. This is a far superior grounding compared to the more obtrusive step-by-step approach typically taken by image detection solutions. DirectShow tends to use a significant portion of CPU time to ensure it processes as much frame data as possible, but it will drop frames if falling behind. This is essential on lower speed devices.


6.2. Desktop Development

DirectShow, with its modular format, allows the filter to be separated totally from the two endpoints of the video displaying process, in this case the input (camera) and output (screen). Each of these is considered a component with pins that can easily be connected to other components' pins to create the filtergraph. Each pin sends a specific type of data, and to connect two pins, the component whose pin is receiving the data must confirm that it can accept the data type of the component sending the data. This is done when the components first try to connect; therefore once a graph is complete, the data should be able to flow freely, as all components can accept the data they are given.
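A sketch of assembling such a graph in code rather than in GraphEdit, using the standard desktop DirectShow interfaces (error handling is omitted, and pCapture and pMotionFilter are assumed to be filter instances created elsewhere):

#include <dshow.h>

HRESULT buildAndRun(IBaseFilter* pCapture, IBaseFilter* pMotionFilter) {
    IGraphBuilder* pGraph = NULL;
    ICaptureGraphBuilder2* pBuild = NULL;
    IMediaControl* pControl = NULL;

    // Create the filter graph manager and the capture graph helper.
    CoCreateInstance(CLSID_FilterGraph, NULL, CLSCTX_INPROC_SERVER,
                     IID_IGraphBuilder, (void**)&pGraph);
    CoCreateInstance(CLSID_CaptureGraphBuilder2, NULL, CLSCTX_INPROC_SERVER,
                     IID_ICaptureGraphBuilder2, (void**)&pBuild);
    pBuild->SetFiltergraph(pGraph);

    // Add the components, then let the builder connect compatible pins:
    // camera -> motion filter -> default video renderer.
    pGraph->AddFilter(pCapture, L"Camera");
    pGraph->AddFilter(pMotionFilter, L"Motion Filter");
    HRESULT hr = pBuild->RenderStream(&PIN_CATEGORY_CAPTURE, &MEDIATYPE_Video,
                                      pCapture, pMotionFilter, NULL);

    // Start the data flowing through the graph.
    pGraph->QueryInterface(IID_IMediaControl, (void**)&pControl);
    pControl->Run();
    return hr;
}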

The BDA (Broadcast Driver Architecture) and WDM (Windows Driver Model) [26] drivers that video/audio devices use to interoperate with Windows all use DirectShow to transport data. Therefore an interface to get this data is provided, and a filter is created with a pin that can accept the data passed to it by the driver. Hence activating devices is typically easy, and getting them running through GraphEdit is trivial.

6.2.1. Using a Filter to Detect Squares

Instead of re-inventing the wheel, multiple shape detection algorithms were tested to check their performance on mobile devices. Creating a Hough shape detection algorithm was an easy task, but its performance on low resolution images was not acceptable enough to be relied upon for motion detection. The Augmented Reality Toolkit was also examined for its basic functionality, and while it was not easy to convert over to the PocketPC platform, its performance was impressive.

6.2.1.1. The Augmented Reality Toolkit

The ARToolkit [15] is a library of functionality designed to process scene information for specific features (known as icons). These icons are based on simple black squares with a black and white pattern inside. Upon detection of a possible marker, the contents inside it can be examined to see if it is the marker being looked for; if so, the orientation of the icon can be determined by comparing the picture inside the image to the original icon picture. This gives a 3D orientation of the icon and therefore additional meaning within a 2D space (picture/video image).


Typically this allows additional 3D information (3D models) to be overlaid on top of the displayed image in a realistic manner at the icon point, based on the orientation information. Possible applications include gaming that interacts with the surrounding environment or something as simple as a location guide. But such functionality is obviously not required when detecting motion.

As explained earlier, obtaining three-dimensional information from within a flat image aids significantly in the tracking of motion between frames. The ARToolkit was a good basis for obtaining information that could be transformed into this three-dimensional form, but it also performed functions that were unnecessary at this step (examining inside the detected squares). While this information would be useful for scene-information input (Section 3.5.2.), it was not necessary for motion-based information.

As a first step, using a stripped-down version of the ARToolkit as a proof-of-concept was considered viable. This has been done before on mobile devices [33], but in a way that was obtrusive to the end user, could not migrate between devices well, and used a great deal of back-end and unreleased code. A mobile filter framework was developed that could incorporate the ARToolkit's stripped-down detection routines, convert this information to three-dimensional data and display the tracking of this to screen. This is further described in the following sections.
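To indicate the style of call these stripped-down routines are built around, the following sketch uses the classic ARToolkit API; camera-parameter and pattern loading are omitted, and the threshold and marker-width values are illustrative:

#include <AR/ar.h>

void findIcon(ARUint8* frame, int iconId) {
    ARMarkerInfo* markers;
    int markerCount;

    // Threshold the frame and locate candidate black squares.
    if (arDetectMarker(frame, 100, &markers, &markerCount) < 0)
        return;

    for (int i = 0; i < markerCount; ++i) {
        if (markers[i].id != iconId)
            continue;                    // not the icon being tracked
        double centre[2] = { 0.0, 0.0 };
        double pose[3][4];               // 3x4 transform of the icon
        arGetTransMat(&markers[i], centre, 80.0, pose);
        // pose now holds the icon's 3D orientation relative to the camera.
    }
}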

6.2.1.2. DirectShow Filter Basics

Filters take an input video or image stream and modify it in some way to produce a final result. Such filters are required in the mobile framework to enhance its lacking functionality, but as a middle ground, and as a separate entity that interacts via COM, a lot more is possible. Thankfully, developing these filters is pretty much identical to developing their desktop variants. They can be loaded inside applications or registered through the registry in a similar fashion to their desktop counterparts. The significant difference comes with debugging and designing these filters.


The first step is to define exports in a definitions file (DllMain, DllGetClassObject,

DllCanUnloadNow, DllRegisterServer and DllUnregisterServer). These are simply

methods that can be called from other programs using the DLL and should be included in all

filters.

LIBRARY "FilterName"

EXPORTS

DllMain PRIVATE

DllGetClassObject PRIVATE

DllCanUnloadNow PRIVATE

DllRegisterServer PRIVATE

DllUnregisterServer PRIVATE

Usually the filter should be a transformation filter and therefore extend CTransformFilter as well as CPersistStream to gain the base functionality needed. Being a COM object, a GUID should also be supplied to give it uniqueness.

DEFINE_GUID(CLSID_FilterName,

0x00000000, 0x0000, 0x0000, 0x00, 0x00, 0x0, 0x00, 0x00, 0x00, 0x00, 0x00);

Several methods should then be overridden to gain basic functionality.

From CPersistStream:

HRESULT ScribbleToStream(IStream *pStream);
HRESULT ReadFromStream(IStream *pStream);
STDMETHODIMP GetClassID(CLSID *pClsid);

And from CTransformFilter:

HRESULT Transform(IMediaSample *pIn, IMediaSample *pOut);
HRESULT CheckInputType(const CMediaType *mtIn);
HRESULT CheckTransform(const CMediaType *mtIn, const CMediaType *mtOut);
HRESULT DecideBufferSize(IMemAllocator *pAlloc,
                         ALLOCATOR_PROPERTIES *pProperties);
HRESULT GetMediaType(int iPosition, CMediaType *pMediaType);


Transform is obviously where all the work is done. Given references to two media samples (one input, one output), the simplest transform is to copy directly from the input to the output. If the general information is going to remain the same (image size/format), with only the image data being distorted, then calling GetPointer on a media sample gives a pointer directly to the byte stream of the image, allowing one to write to it directly to make changes. Depending on what the filter does, one might copy the data over before writing, or read and change it dynamically.
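A sketch of the simplest such Transform follows; the class name is hypothetical and error handling is trimmed. Both byte streams are obtained through GetPointer and the input is copied to the output, with any detection or drawing happening on the output bytes afterwards:

HRESULT CMotionFilter::Transform(IMediaSample *pIn, IMediaSample *pOut) {
    BYTE *pSrc = NULL, *pDst = NULL;
    pIn->GetPointer(&pSrc);                  // byte stream of the input frame
    pOut->GetPointer(&pDst);                 // byte stream of the output frame

    long len = pIn->GetActualDataLength();
    memcpy(pDst, pSrc, len);                 // pass-through copy of the image data
    pOut->SetActualDataLength(len);

    // ...examine or modify pDst here (e.g. run the square detection)...
    return S_OK;
}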

This image data is in a somewhat unintuitive format to work with. While the image data does go from left to right, it starts at the bottom of the image and works up, and is typically stored in a BGR or BGRA (blue, green, red, alpha channel) format. Therefore the image data may need modification before the filter's algorithm is applied (Figures 6.2. and 6.3.).

Figure 6-2: Image Data with Alpha Channel (Note the 00’s)

Figure 6-3: Image Data without an Alpha Channel
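Where an algorithm expects a conventional top-down RGB image, a conversion pass along these lines can be run first (a sketch assuming 24-bit BGR data with no row padding; real DIB rows are padded to 4-byte boundaries):

void FlipAndSwapChannels(const BYTE *pSrc, BYTE *pDst, int width, int height)
{
    int stride = width * 3;   // bytes per row for 24-bit data

    for (int y = 0; y < height; ++y)
    {
        const BYTE *srcRow = pSrc + (height - 1 - y) * stride;  // read rows bottom-up
        BYTE *dstRow = pDst + y * stride;                       // write rows top-down

        for (int x = 0; x < width; ++x)
        {
            dstRow[x * 3 + 0] = srcRow[x * 3 + 2];  // R comes from BGR byte 2
            dstRow[x * 3 + 1] = srcRow[x * 3 + 1];  // G stays in the middle
            dstRow[x * 3 + 2] = srcRow[x * 3 + 0];  // B comes from BGR byte 0
        }
    }
}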


Having an image to work upon, and draw directly to, is generally how detection algorithms work, so a memory copy from the input to the output media sample is typically the first step. The next decision is whether the byte stream format is suitable for the algorithm as it stands, or whether a flipped image with swapped colour channels should be created first (possibly with the alpha field removed as well).

But the true advantage of having a filter perform these modifications is that one is not limited to an image as the only output. Interfaces can be added to the filter so that it can be queried for information, or have parameters passed to it to change how it works on the fly. Such an interface requires a separate GUID (distinct from the filter's own identifier) because its methods are called through a separate object reference: instead of the filter object, the actual COM object is called directly.

DEFINE_GUID(IID_IFilterName, 0x00000000, 0x0000, 0x0000, 0x00, 0x00, 0x00,

0x00, 0x00, 0x00, 0x00, 0x01);

And then define the methods that can be called externally.

DECLARE_INTERFACE_(ICEARTFilter, IUnknown)

{

STDMETHOD(SetFiltering) (THIS_ BOOL set) PURE;

};

This would be an obvious first method to create, allowing the program using the filtergraph to turn the filtering of the image on or off whenever it wants. The corresponding method is implemented as normal to store the setting, and a check is inserted into the Transform method to decide whether it should actually perform the transform or not.
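A minimal sketch of that pair (m_bFiltering is a hypothetical member variable; m_csFilter is the filter-wide lock inherited from CTransformFilter):

STDMETHODIMP CDetectionFilter::SetFiltering(BOOL set)
{
    CAutoLock lock(&m_csFilter);   // avoid racing a Transform in progress
    m_bFiltering = set;
    return S_OK;
}

// Inside Transform, after the input has been copied to the output:
//     if (!m_bFiltering)
//         return S_OK;            // pass the frame through untouched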


6.2.3. Mapping to 3D

The square information retrieved from the filter contains only the screen co-ordinates of the vertex pixels. This two-dimensional representation needs to be transformed into a relative three-dimensional meaning (Figure 6.4).

Figure 6-4: Translation of Two-Dimensional Screen Data to Direct3D Polygon Format.

The first step to perform is the translation of pixel space to an absolute -1 to +1 value on both the x and y axes. This is the default format used by Direct3D and allows us to use its data structures and methods to check changes (transforms) between frames. The next step is to calculate a Z depth for each point of the polygon. To ensure consistency the detected polygon is always considered a square (Section 6.2.1), so Z values need to be determined that make the item square in three dimensions.
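As a sketch of the first step (px and py are pixel co-ordinates; frameWidth and frameHeight are the capture dimensions):

// Map a pixel co-ordinate (px, py) to Direct3D's -1..+1 range.
// Screen y grows downward, so the y axis is flipped.
float nx = (2.0f * px) / frameWidth - 1.0f;
float ny = 1.0f - (2.0f * py) / frameHeight;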

To determine these values, triangulation needs to occur based on line length. This is easiest when the line closest to the screen is given a Z-depth of 0; since we are working with squares, where all side lengths are equal, this will be the longest line on screen. The other Z values can then be determined by comparing each line's apparent length (its screen data length) against our three-dimensional side length (the longest line).

The ARToolkit includes functionality in its code base to convert this screen polygon to three-dimensional space, but it returns OpenGL [30] co-ordinate data. Changing this to the Direct3D co-ordinate system, so that a consistent orientation is kept, is a simple task.
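Since OpenGL uses a right-handed co-ordinate system and Direct3D a left-handed one, the conversion amounts to negating the Z component (and reversing the triangle winding order when the data is drawn). A sketch, using a placeholder vector type:

struct Vec3 { float x, y, z; };

// Convert an ARToolkit/OpenGL point into Direct3D's left-handed system.
Vec3 GLToD3D(const Vec3 &gl)
{
    Vec3 d3d = { gl.x, gl.y, -gl.z };
    return d3d;
}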


6.2.4. Tracking motion from cube transformations

To track motion in a video stream, two core pieces of state need to be stored: the direction the device has already been determined to be travelling in, and the location of key objects in the last frame. With this data, we find where those objects are located in the current frame of video and compare this against the last frame. A change in a matching object's location suggests movement in the direction relating to that change. Acceleration and direction change can be determined from the previous motion information carried forward, and the motion data is then updated with this new information.

More complicated information, such as rotation and movement into or out of the screen, requires more advanced interpretation. For this project, I decided to convert key objects available in the scene to three-dimensional information and track changes in the 3D planes. Such logic limits the type of objects that can be tracked to those that can easily be converted into such information, so this logic tracks polygon information in the scene to read motion. All polygons seen in the scene are determined to be squares and are mapped as square polygons on the 3D plane, complete with their rotations and orientations.

With this three-dimensional information available, motion changes can be tracked between frames by checking the X, Y and Z axis change of each vertex in the MAP (Section 3.4). To gain a more accurate representation of this information, it should be averaged over a series of frames, both to lessen the impact of the MAP possibly changing and to avoid random twitching that could come from either the user or the device.

To perform this averaging, motion and rotation vectors of the change between frames are created. A relative time sample is then defined as a polling period (for example, half a second), and the average over this period is taken as the motion actually performed. Since the filter created runs at 30 frames per second and captures objects over that period, a half-second polling period amounts to 15 frame-to-frame transformations to compare between.
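A sketch of this averaging step (MotionSample and the queue are hypothetical structures):

#include <vector>

struct MotionSample { float dx, dy, dz; };

// Average the per-frame motion vectors gathered during one polling period
// (up to 15 entries at 30 fps over half a second; 'loss' frames simply
// leave the queue shorter).
MotionSample AveragePollingPeriod(const std::vector<MotionSample> &queue)
{
    MotionSample avg = { 0.0f, 0.0f, 0.0f };
    if (queue.empty()) return avg;

    for (size_t i = 0; i < queue.size(); ++i) {
        avg.dx += queue[i].dx;
        avg.dy += queue[i].dy;
        avg.dz += queue[i].dz;
    }
    avg.dx /= queue.size();
    avg.dy /= queue.size();
    avg.dz /= queue.size();
    return avg;
}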


With low resolutions and less consistent frame rates than on a more powerful desktop counterpart, this extended period also allows the adoption of 'loss' periods, where the filter fails to find objects in a frame it is supplied (usually due to blurring of the image). Frames where objects are not found are simply not stored in the queue, so the polling period has fewer transformations to look at.

With the queue resetting every half a second, the impact of visual flaws, such as two polygons being detected far apart and one then being lost off the edge, is limited. If we lose a polygon because it falls off the edge, and then find a new polygon shortly after, the impact on the motion we detect is only minor (one frame of transformation); any other motion detected in the queue is likely to outweigh it.

A sliding queue was considered, so that polling occurred more often and previous data was reused, but the advantages appeared minor when compared to the additional calculations required.

6.3. Windows Mobile 5 Development

Upon the release of Windows Mobile 5, a consistent model was created for application developers to programmatically access a mobile device's camera. Interfaces have been made available for both the .NET managed languages and the original C++ Pocket SDK. Now armed with the ability to call the camera directly from an application, a coder can begin to move their applications to the mobile sector via DirectX, or even develop desktop and mobile applications from very similar code trees.

Originally, this was both difficult and messy: there was no direct linking of the camera to the operating system, so each third party responsible for developing handsets had their own unique libraries and hooks to incorporate camera software into the OS. With no documentation and limited support, attempted development in this sector was both difficult and unreliable.

Now, with the ability to interface with the camera directly (in particular through the Pocket SDK; the .NET Compact Framework is still lacking), developers can use COM and DirectShow to process incoming video, much like their desktop counterparts.


Using DirectShow on a mobile device is essentially identical to using it on the desktop, so Section 6.2.1.2 applies to mobile use as well.

6.3.1. Porting detection filters

Typically, filters are required in the mobile framework to provide functionality that is available natively on the desktop platform. Since they are designed as separate entities that interact via COM, re-adding a lot of this functionality is possible, and in fact a necessity for the filters to be useful. The interaction is similar to their desktop variants (Section 6.2.1); the significant difference comes with debugging and designing these filters.

Many filters compiled with the debug flag will simply not register properly, even though RegSvrCE suggests they were successfully registered. This, coupled with much tighter memory requirements, generally slower devices with lower quality video, and once again a limited set of available libraries, means there simply is not the same turn-around time in creating these filters as on the desktop. Complicated filters need to be smart with their memory, as pointers easily get corrupted in the still-young DirectShow mobile SDK, and they need to have an obvious flow, since debugging is a far more time-consuming process.

To compile code for Windows Mobile 5 devices, the Windows Mobile 5 Pocket SDK must be installed: http://www.microsoft.com/downloads/details.aspx?FamilyID=83a52af2-f524-4ec5-9155-717cbe5d25ed&DisplayLang=en

Compiling filters uses the following libraries: strmiids.lib, strmbase.lib, d3dmguid.lib, d3dm.lib and ddguid.lib. All DirectShow libraries are included in this SDK and do not need to be recompiled (no source is supplied anyway).

When transferring a filter from a desktop implementation over to a mobile solution, several particulars must be adhered to: use of the ATL and MFC frameworks often causes problems, and wchar_t must not be compiled as a built-in type.


The testing and debugging of filters can be particularly difficult, as a mobile filter will not register (and therefore cannot be used) if compiled in debug mode, so all work must be done with the NDEBUG flag set (release mode). This means that only limited information can be gathered from the filter while testing; combined with the very limited development environment, debugging becomes a messy task.

Once the .dll (or .ax, depending on setup) is copied across to the device, it must be registered in the registry. RegSvrCE is no longer supplied with the SDK, and while a program can self-register a library, registering it globally allows every application to use it. Registering via the registry will usually place the filter in the following location:

[HKEY_CLASSES_ROOT\Filter\{Supplied GUID}]

A library can also be registered from inside the program that uses it, by looking up and invoking the DllRegisterServer export directly:

HMODULE hLib = LoadLibrary(L"filter.dll");
FARPROC pfnRegister = GetProcAddress(hLib, L"DllRegisterServer");
if (pfnRegister != NULL)
    ((HRESULT (STDAPICALLTYPE *)(void))pfnRegister)();

6.3.2. Camera Initialization

Typically on the desktop platform, the components of the machine are enumerated through (probed one by one) to find a camera connected to the machine. While this method is possible on mobile devices, there is typically no point, as the device being used to record the video is already known (the embedded camera).

Another interesting point of the mobile DirectShow implementation is that there is no included PropertyBag object, something that is essential for adding a camera to a filtergraph programmatically. Thankfully the necessary information is available on MSDN [20] and easy to replicate.

HRESULT hr;
CComVariant varCamName;
CPropertyBag PropBag;                  // IPropertyBag implementation replicated from MSDN [20]
CComPtr<IPropertyBag> pPropertyBag;
CComPtr<IBaseFilter> pSrcFilter;
// m_pGB is assumed to be a CComPtr<IGraphBuilder> member holding the filtergraph.

CoInitialize(NULL);


hr = m_pGB.CoCreateInstance(CLSID_FilterGraph, NULL, CLSCTX_INPROC);

if( FAILED(hr))

{

Msg(TEXT("Failed to create filter graph. hr = 0x%08x"), hr);

}

pSrcFilter.CoCreateInstance( CLSID_VideoCapture );

pSrcFilter.QueryInterface( &pPropertyBag );

varCamName = L"CAM1:";

if( varCamName.vt != VT_BSTR ) {

return E_OUTOFMEMORY;

}

PropBag.Write( L"VCapName", &varCamName );

pPropertyBag->Load( &PropBag, NULL );

pPropertyBag.Release();

hr = m_pGB->AddFilter(pSrcFilter, L"Video Capture");

if (FAILED(hr))

{

return hr;

}

The shortcut “CAM1:” is a direct reference to the first camera on a mobile device and should be implemented on all Windows Mobile 5 devices. Once the capture filter has been added to the graph, video can be started and stopped via the IMediaControl and IMediaEvent interfaces, once they too have been obtained from the graph.

CComPtr<IMediaControl> m_pMediaControl;

CComPtr<IMediaEvent> m_pMediaEvent;

m_pGB.QueryInterface(&m_pMediaControl);

m_pGB.QueryInterface(&m_pMediaEvent);

Starting and stopping capture is performed via:

m_pMediaControl->Run();

m_pMediaControl->Stop();


Encoding video also uses this machinery, as media events are constantly polled while the encoding is taking place. Once certain events occur (stopped), the video should be in the process of being encoded, but it is not until these events are actually processed that the encoding is known to have completed.

long lEventCode;
LONG_PTR lParam1, lParam2;
int count = 0;

do

{

pMediaEvent->GetEvent( &lEventCode, &lParam1, &lParam2, INFINITE );

pMediaEvent->FreeEventParams( lEventCode, lParam1, lParam2 );

if( lEventCode == EC_STREAM_CONTROL_STOPPED ) {

OutputDebugString( L"Received a control stream stop event" );

count++;

}

} while( count < 1);

6.3.3. Image Output

DirectShow on the desktop has a call-back mechanism to retrieve the frame currently being processed in the filtergraph stream. This can be used to process the data directly into a bitmap object, or to send it elsewhere for further processing. Sadly, the mobile version of DirectShow has no available way to grab this image.

Therefore the filters themselves have to offer up this functionality, and again COM is an effective way to achieve it. Once the filter has the IMediaSample it is working upon, a pointer directly to the image is available. Creating a function to access this pointer is trivial, but causes problems because there may not be an IMediaSample available when the function is called. An additional data structure should therefore constantly store the last worked-on image in the stream. Copying the byte stream across to a new memory location while the IMediaSample is being processed gives a constantly available image that is far more immune to corruption than the pointer location of the IMediaSample image. The COM method can then simply return a pointer to this memory location, and whenever the location is checked it will contain the last processed, or currently processed, frame.
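A sketch of this arrangement (m_pLastFrame, m_cbFrame and m_csImageLock are hypothetical members of the filter):

// Inside Transform, once the output frame is complete:
//     CAutoLock lock(&m_csImageLock);
//     memcpy(m_pLastFrame, pDst, m_cbFrame);   // stable copy of the frame

// Accessor exposed on the filter's custom COM interface:
STDMETHODIMP CDetectionFilter::GetLastFrame(BYTE **ppFrame)
{
    CAutoLock lock(&m_csImageLock);

    if (m_pLastFrame == NULL)
        return E_POINTER;          // no frame has been processed yet

    *ppFrame = m_pLastFrame;       // last processed (or in-progress) frame
    return S_OK;
}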

6.3.4. Data Display

Once the data has been processed and information created, there are two obvious ways to display it. The first is to draw it directly to the image itself and display the new image: since the filtergraph


works with two IMediaSamples (in and out), the changes can be drawn to the out image as it is being created; if the graph then ends in a renderer, these changes can be seen (Figure 6.5).

Figure 6-5: Image without and with Display Filter applied

The other option is to read and interpret the data by storing it into memory structures and then displaying it in an entirely new context. With this setup, both sections of the application must be aware of the data types the information is stored in.

6.4. Prototype Summary

Creating a working and efficacious program that took advantage of the motion-based input concept was a journey that took far longer than expected. The tools and the hardware were available; however, significant difficulties were experienced due to the paucity of available expert knowledge and the lack of maturity of the mobile development platform compared to the desktop.

The .NET Compact Framework v2 [19], much proclaimed by Microsoft and with great potential in my eyes at the time (rose-coloured glasses, in hindsight), proved to be severely lacking when trying to push the boundaries of mobile development. It remains a tool for developing standard applications on the quickest development cycle, with little to no way to access device resources outside the scope of a typical application window.

The solution was to return to the C++-orientated Pocket SDK, where development time and effort are significantly higher than in .NET, but the functionality is available. For example, the camera inter-operability in the .NET Compact Framework is nothing more than a call to the camera application developed by the device manufacturer to record video to a file for a certain


amount of time. There was absolutely no way to use this functionality in this project. The Pocket SDK, however, allows true access to the camera from inside an application proper, which makes motion detection on the device a reality.

Microsoft Visual Studio, while a great tool, had many shortcomings when it came to mobile development. At first the tool is great, but debugging applications becomes painful when trying to step through code. Even over a direct USB 2.0 connection, the time taken to process a line of code is significant (multiple seconds per line). When trying to debug code that is tens of thousands of lines in size (like the ARToolkit) this is unusable. Performance improves when using an emulated device on the debug machine, but such a device does not have a camera, so in my circumstances this was not an option.

This was compounded by major drawbacks in the Pocket SDK itself. It is mainly pre-compiled, and the only source supplied is header files, which leaves the developer spending much of their time stepping through the generated machine language via disassembly. Trouble also occurred because much of the code (the DirectShow filter in particular) would not operate with debug flags set, meaning even less information is supplied while trying to debug: many variable values cannot be tracked, and branching code is difficult to work through.

Undoubtedly the above are major contributors to the lack of advancement in the field of application development using the camera on mobile devices. But the base work has been completed, and its performance is more than acceptable given the above circumstances. Better cameras and more memory in the future will only improve this prototype's success.


7. Conclusions and Future Work

With all three segments of the research resulting in varying levels of success, it can be concluded that there is a huge amount of potential in this field. This chapter summarises the paths this research can take in the future, as well as outlining the new paths the research has opened up.

7.1. Answers to Research Questions

7.1.1. Answer to Question 1 - What functionalities of a phone's features are appropriate candidates to be used as parts of a

motion input scheme?

Throughout the research it was discovered that many of a phone's functions were in fact suitable targets for motion-based inputs. Many of these inputs already incorporated movements by the end user that could be harnessed (moving the phone to answer it, for example).

Section 5.4 describes several inputs that are inappropriate for an input model.

7.1.2. Answer to Question 2 - Is it possible to construct a rational and useable mapping scheme for phone inputs?

As demonstrated in Section 5.5 it is in fact possible to create a useable mapping scheme for

phone inputs. Appendix I demonstrates a small sample of inputs that can be mapped to

motions.

7.1.3. Answer to Question 3 - Can people adapt to using motion gestures as an input medium and what are considered suitable (not embarrassing or over-exertive) motions to perform?

The results from Survey One (Section 4.3) demonstrate that, when instructed to use motions instead of traditional input means, users were generally more than capable of adapting straight away. The few who struggled were capable after further encouragement and explanation.


Survey Two results (Section 4.5) show that users definitely moved the devices in specific ways

subconsciously and these could be used as inputs as well.

7.1.4. Answer to Question 4 - How uniformly do people perform motions given to them (different people, slight differences in movement) and can these variations be adapted to?

It was discovered that a significant percentage of users performed very similar movements in an attempt to get a given end result (Sections 4.4.9 & 4.5.6). Very few users deviated from a specific command, and few users had any significant peculiarities while performing these motions. As long as the device moved and rotated in the same general directions, the motion was tracked the same.

7.1.5. Answer to Question 5 - How suitable are images (collected by the embedded mobile cameras) for in-depth image processing?

Generally, the images collected by the in-built cameras were not very suitable for in-depth processing, and therefore a significantly different approach had to be taken to track movement information (Sections 6.2.1.1 & 6.2.3). In good conditions the low quality video collected was sufficient, but in many situations the filters used were not capable of tracking information. This brought forward the concept of switching between detection algorithms depending on the situation (something only very briefly discussed in this thesis); such switching would be the best way to collect movement information in the absence of higher quality image and video data.

7.1.6. Answer to Question 6 - Can real-time performance of image detection algorithms and movement calculations on Smartphones™ be achieved?

Surprisingly, it can. It is definitely not an easy feat, but with significant knowledge of device development and of how to perform the detection, even memory- and processor-intensive algorithms are possible, as shown in Section 6.3.

7.1.7. Answer to Question 7 - Will tracking movement critical to this project unexpectedly interfere with the normal usage

of the phone?


Testing showed that this was in fact a possibility, as occasionally the phone hung while detecting motion and receiving a phone call at the same time. This was generally due to the amount of memory in use and the processing of video. I would imagine that more memory (becoming common in devices) and better quality video requiring less processing time will improve this.

7.2. Contributions to Research

Three significant contributions to research have been covered within this document. First and foremost is the design and commencement of an input model that relies on motion as the stimulus. Related models have been developed in the past, but little work has been carried out on motion-based mobile device input. This model encompasses a far wider scope than those of the past (typically just text input), and does not confine implementers of the model to limited motions; the only limitation lies in what data can be gathered and interpreted from the camera. The findings also show how users typically interact with the device in general situations, so they can be used as a guideline to the motions users prefer.

Secondly, I have developed and documented a process for developing DirectShow filters on mobile devices. Information on creating filters for the desktop is extremely limited, and actual filter development on mobiles is very close to non-existent, even in the research field. Information regarding mobile filters that provide information back to the program using them does not exist, so to my knowledge this is a new field. Such filters are capable of a great deal, and work on a framework that offers up great performance and options once it is understood. These options remain on the mobile platform once the implementation limitations are overcome by either code or design.

Finally, there is the development of automated surveys to aid in the information collection procedure. Such surveys, when designed properly, are capable of collecting large amounts of data that standard surveys cannot, at only a slightly increased cost. Findings and experiences from the creation and execution of such surveys are all included. Further understanding of the capabilities of the devices, and of how to work with them, will greatly enhance the directions surveys can take. Mobile kiosks can already be used as a form of data collection from users; moving


this to a smaller, portable device with many more input mechanisms (camera and audio as well as touch) opens up many more opportunities. Add data such as location, time and scene information, which is only lightly covered here, and the possibilities are endless.

7.3. Limitations

Situations occurred (particularly in the implementation phase) that hindered the progression of this project. While such troubles were expected when working with smaller devices, the workarounds are far from ideal. For the implementation to be truly viable, several things should (and hopefully will) happen with the next iteration of mobile devices and the Microsoft operating system.

• An increased viable resolution without the dropping of frames

Detecting valid information at low resolutions (176x144) is very difficult. You lose a lot of

the data that can be found when working at a higher resolution. In particular it is very hard to

pick up shadows at the low resolution, something that usually gives very good polygon

information on the desktop. This needs to come with improved performance so there is no

drop in frame rate, as a good frame rate is required to properly keep track of motion.

• Better direct access to memory

Suddenly losing pointer information inside Windows Mobile 5 played havoc with development times, as it simply did not make sense that data was getting corrupted. Only once it became apparent that the OS itself was to blame (trying to clean up memory) did solutions become clear. Sadly, the solution was to use more memory and processor time to ensure that the pointers being scrapped were continually accessed, or swapped to yet another pointer. Such awkward work-arounds should not be necessary, and better documentation from Microsoft to inform developers is required.

• A more complete SDK

Much functionality just isn’t there when it really should be. I found many instances where I

had to rewrite functions already available for the desktop counterpart (to handle very obvious

functionality).


7.4. Potential Applications

Although this model was originally designed to encompass the entire operating system, it became more and more apparent over time that the model would work at the application level as well. Such a design allows developers to use the model to its fullest, since they have complete control over the context and no competing pre-defined motions at the OS level. With complete control over the model and what to define, applications can take the fullest advantage of it.

Applications which could employ such a model are numerous; those that revolve around movement and rotation are the obvious beneficiaries. Applications which employ 3D renders can benefit from the ability of users to truly walk around a model and get a rotated display, while moving forward and back would zoom the item in focus. Again, these are natural reactions to get a specific result, which can easily be mapped to motion.

Simple navigation through menus also remains a powerful use of the model. The ability to

move through options using only the single hand that holds the device is advantageous in

many situations, something that is not possible with a touch-screen interface.

Following the same path, applications that generally try to avoid direct input from the user can also benefit from the model. GPS navigation systems for motorised vehicles use signals to determine direction and are usually situated on the dashboard at eye level. At this level the device's camera would have much the same view as the user. This can produce much finer information than what is available via GPS; minor turns and direction information can be used to augment the input information given by the GPS satellites.


Location-based gaming would also benefit from the model. Many devices' interfaces are not built for gaming at all, so a way to gather information from the user quickly is sadly lacking. Augmenting a motion-based input scheme on top of these games could make them much more intuitive: walking forward in the world could very well translate to the walking forward of your avatar in a virtual world.

7.5. Future Work

Once these hurdles are overcome, there are many different directions for this project. The obvious one is further work on the algorithm that collects motion information. However, with DirectShow a far more interesting approach can be taken. Since DirectShow works by plugging in and pulling out components, there is little reason why a series of detection filters could not be used. Scanning the image information should be able to tell which filters will work better than others (a lack of black, or a low resolution, would indicate that the filter developed for this research would not work optimally).

This scanning of information could itself be a filter placed before the detection filter in the graph. If a detection filter is failing to find much information, it could be swapped out for another that should be more successful. All these filters would need to communicate with other components in the same way, so a base framework for detection filters would have to be developed, as sketched below. All the filters would report back the same data (be it location information for the object being tracked or, at a higher level, actual motion vectors themselves), but their inner workings would be totally different.
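A hypothetical sketch of the common interface such a framework might define, in the same style as the filter interface of Section 6.2.1.2 (it would need its own IID defined via DEFINE_GUID):

DECLARE_INTERFACE_(IDetectionFilter, IUnknown)
{
    // Latest motion vector derived from the tracked objects, if any.
    STDMETHOD(GetMotionVector) (THIS_ float *pdx, float *pdy, float *pdz) PURE;

    // Confidence of recent detections (0 = failing, 1 = reliable); the graph
    // manager would use this to decide when to swap filters.
    STDMETHOD(GetConfidence) (THIS_ float *pConfidence) PURE;
};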

Such a framework would also be useful in domain-specific situations. If a device is more likely to be used near an assembly line, the filters can be fine-tuned for that information, and the more appropriate filters given priority.

Such development, along with the development of the domain-specific inputs required by applications, would create the best combination for a true motion-based input system for mobile devices.


Appendix A – Sample Inputs

These inputs are designed as a test bed of sample inputs to be used throughout the research procedure.

Common (Global) Motions ( A )

(AA) Scroll Up

(AB) Scroll Down

(AC) Scroll Left

(AD) Scroll Right

(AE) Select Option

(AF) Number Input

(AG) Left Soft Key

(AH) Right Soft Key

(AI) Power

Web Browsing ( B )

(BA) Go To

(BB) Refresh

(BC) Previous Link

(BD) Next Link

(BE) Follow Link

(BF) Zoom

(BG) Favourite Menu

(BH) Options Menu

(BI) Home

Photo Album ( C )

(CA) Change View (Details/Thumbnails)

Picture Viewing (from Photo Album) ( D )

(DA) Pan Up

(DB) Pan Down

(DC) Next Image

(DD) Pan Left

(DE) Pan Right

(DF) Previous Image

(DG) Zoom In

(DH) Zoom Out

In Call ( E )

(EA) End Call

(EB) Mute Call

(EC) On Hold

(ED) Increase Volume

(EE) Decrease Volume

Outside Call ( F )

(FA) Answer Call

(FB) Key Lock

Phone Book/ Contacts ( G )

(GA) Go to Details

(GB) Call Number

(GC) New Information

(GD) Store Edited Info


Media Player ( H )

(HA) Open File/Playlist

(HB) Play

(HC) Stop

(HD) Mute

(HE) Next File

(HF) Previous File

(HG) Volume Up

(HH) Volume Down

(HI) Randomise Playlist

(HJ) Clear Playlist

Voice Notes ( I )

(IA) Start Recording

(IB) End Recording

(IC) New Note

(ID) Replay Note

(IE) Next Note

(IF) Previous Note

Calendar/Task Scheduler ( J )

(JA) Change Time View

(JB) Next (Day/Week/Month)

(JC) Mark as Completed

(JD) Previous (Day/Week/Month)

(JE) Add Task

(JF) Remove Task

File Manager ( K )

(KA) View Type

(KB) New Directory

(KC) Switch Storage

(KD) Cut File

(KE) Copy File

(KF) Paste File

(KG) Change File Properties

(KH) Device

Text Input ( L )

(LA) New Line

(LB) Symbol List

(LC) Change Input Type (T9/abc)

(LD) Change Input Language

(LE) Upper/Lowercase/Caselock

Camera ( M )

(MA) Change Filter

(MB) Zoom In

(MC) Zoom Out

General Interactions ( N )

(NA) Exit Application

(NB) Landscape/Portrait Mode


Appendix B – Input Type Breakdown

This table breaks down the inputs listed in Appendix A and classifies them into their

appropriate input types.

Table 4: Input Breakdown

T Choosing Selection Confirm Adjust Movement Function Menu Modify

AA ●

AB ●

AC ●

AD ●

AE ●

AF ●

AG ●

AH ●

AI ●

BA ●

BB ●

BC ●

BD ●

BE ●

BF ●

BG ●

BH ●

BI ●

CA ●

DA ●

DB ●

DC ●

DD ●

DE ●


DF ●

DG ●

DH ●

EA ●

EB ●

EC ●

ED ●

EE ●

FA ●

FB ●

GA ●

GB ●

GC ●

GD ●

HA ●

HB ●

HC ●

HD ●

HE ●

HF ●

HG ●

HH ●

HI ●

HJ ●

IA ●

IB ●

IC ●

ID ●

IE ●

IF ●

JA ●

JB ●


JC ●

JD ●

JE ●

JF ●

KA ●

KB ●

KC ●

KD ●

KE ●

KF ●

KG ●

KH ●

LA ●

LB ●

LC ●

LD ●

LE ●

MA ●

MB ●

MC ●

NA ●

NB ●


Appendix C – Survey One Handout

Would you like to take a survey?

Part 1 - General

The following questions are presumed to be answered to the best of the participants’

knowledge.

How long have you been using mobile phones?

Please supply a list of mobile phones you have used in the past.

What functionality does your current phone contain (if known) and which of these do/would

you use?

Functionality                                  Has   Use
Make calls                                     □     □
Send/Receive SMSes                             □     □
Send/Receive MMSes                             □     □
Use Voice Notes                                □     □
Play games                                     □     □
Store Contact Information                      □     □
Plan Appointments/Schedule, Track Calendar     □     □
View Pictures                                  □     □
Play Music                                     □     □
Watch Movies                                   □     □
Transfer data to PC                            □     □
Take Photos/Movies                             □     □
Send/Receive Emails                            □     □


Browse Web                                     □     □

Do you typically operate your device one-handed, or two-handed?

Do you believe you are confident with the day-to-day usage of your phone? Yes / No

Part 2 - Motion (Hand Focus)

The following information is all motion based and therefore has to be captured via video. The questions are designed to be slightly abstract, and the answers are expected to be the same. Participants will be given a small box (as a mobile phone mock-up) to manipulate in an attempt to give answers to the following questions. There is no right or wrong answer; answers are simply what the participants believe would be the most appropriate physical input for the question.

If the participant does not understand a question, it is to be skipped. It is also assumed that the participant has prior knowledge of the following questions and has had time to formulate the answers that seem most appropriate to them.

Please perform a natural motion that you believe would best represent choosing an object in a vertical list (demonstrate movement down and up the list, and the choosing of the object).

A motion to confirm (say yes) to an action.

A motion to deny (say no) to an action.

Rotate an object on the screen to the left.

There is the number 18 in a box; how would you increase this by 2 (to 20)?

How would you increase the device's volume while talking on the phone?


Reload a web page that is currently being viewed.

Answer an incoming phone call.

Pan right while viewing an image.

If you have any concern over ethical issues in regards to this survey, please contact the

Queensland University of Technology Research Ethics Officer on (07) 3864 2340 or via email:

[email protected].


Appendix D – Survey One Participant Breakdown

Age      15-18  19-22  23-26  27-30  31-34  35-37
Count    3      5      11     4      3      4

Figure D-1: Age of Participants, Survey One

Australian Born Anglo-Saxon: 8
Australian Born Chinese: 4
Mainland Chinese (incl. HK): 3
Singaporean: 3
United Kingdom: 2
Vietnamese: 2
Brazilian: 2
Samoan: 2
Nigerian: 1
New Zealander: 1
Korean: 1

Figure D-2: Nationality of Participants, Survey One


Completed High School: 11
TAFE/Community College: 6
Undertaking Degree: 5
Completed Degree: 4
Undertaking/Completed Postgraduate: 2
Year 10: 2

Figure D-3: Education of Participants, Survey One

Study Only: 4
Building/Construction: 4
Maintenance/Cleaning: 3
Catering: 3
Unemployed: 3
Combination: 3
IT: 3
Legal/Accounting: 2
Office: 2
Managerial: 1
Engineering: 1
Tourism/Travel: 1

Figure D-4: Employment of Participants, Survey One


Appendix E – Survey Two Audio

1: In this test, please hold the device at a 45-degree angle and control it so that the pointer in the centre of the screen moves to and selects the asterisks around the screen.

2: A picture will be displayed in this test. Please hold the device at a 45-degree angle, view the picture and react naturally.

3: Please hold the device in a comfortable position away from objects. Video will be taken from the camera and displayed to you, much like a viewfinder. Look at and interact with this viewfinder.

4: A selection of faces will appear on the screen. While holding the device at a 45-degree angle, please react by agreeing with the happy faces and disagreeing with the sad. Try to incorporate device motion into this.

5: Hold the device at a 45-degree angle. Text will be displayed on the screen; attempt to speed read it.

6: Place this device on the table. It will ring, so react naturally to this. Hang up the phone when you get the engaged signal. This will be repeated twice, so please repeat your actions.


Appendix F – Sample Survey Two Results

The results on the following pages show a visualisation of the recorded video collected for the six different tests performed in the second survey. They are classified by the survey they were recorded in and by the input the motion is suspected to correspond to; classifications and visualisations are included. The occurrence column is a value of up to 10 denoting how common that motion was for that command, so all motions for one input should add up to 10. This sample data shows the information collected for users trying to select the first asterisk in the first test.

Failed attempts are not included in this data; the reasons for these failed inputs are summarised in Section 4.4.6.


Table 5: Sample 1, Survey Two Motion Breakdown

Survey: Choosing. (The Visualization column of the original table contained images and is omitted here.)

Assumed Input | Input Type | Motion Explanation | Motion Type | Occurrence
Traverse to Up-Left Asterisk | Choosing | Move Direction Left, Move Direction Up | Direction | 4
Traverse to Up-Left Asterisk | Choosing | Move Direction Up, Move Direction Left | Direction | 3
Traverse to Up-Left Asterisk | Choosing | Move Direction Up-Left | Direction | 3
Selecting Asterisk | Choosing | Push Forward | Direction | 2
Selecting Asterisk | Choosing | Push Forward, Return to neutral | Push Forward | 3
Selecting Asterisk | Choosing | Rotate Down (Y-axis) | Rotation | 1
Selecting Asterisk | Choosing | Nothing | - | 2
Traverse to Left Asterisk | Choosing | Move Direction Left | Direction | 8
Traverse to Left Asterisk | Choosing | Rotate Left (X-axis) | Rotation |


Appendix G – Base Input Compression

Table 6: Input Compression Part 1, Survey Two

Input Type | Possible Motions | Adjuster
Confirmation (Yes) | Rotation Left then Right (X Axis); Direction Left; Rotation Left (Z Axis) | Confidence of Answer: Speed of Rotation/Direction, Angle of Rotation
Confirmation (No) | Rotation Up then Down (Y Axis); Direction Right; Rotation Right (Z Axis) | Confidence of Answer: Speed of Rotation/Direction, Angle of Rotation
Movement (Move Left) | Direction Left; Rotation Left (X Axis) | Amount of Movement: Length of Direction, Acceleration of Direction, Angle of Rotation
Movement (Move Right) | Direction Right; Rotation Right (X Axis) | Amount of Movement: Length of Direction, Acceleration of Direction, Angle of Rotation
Movement (Move Down) | Direction Down; Rotation Down (Y Axis) | Amount of Movement: Length of Direction, Acceleration of Direction, Angle of Rotation
Movement (Move Up) | Direction Up; Rotation Up (Y Axis) | Amount of Movement: Length of Direction, Acceleration of Direction, Angle of Rotation
Movement (Rotate Up (Y)) | Rotation Up (Y Axis); Rotation Down (Y Axis) | Amount of Rotation: Angle of Rotation
Movement (Rotate Down (Y)) | Rotation Down (Y Axis); Rotation Up (Y Axis) | Amount of Rotation: Angle of Rotation
Movement (Rotate Left (X)) | Rotation Left (X Axis) | Amount of Rotation: Angle of Rotation
Movement (Rotate Right (X)) | Rotation Right (X Axis) | Amount of Rotation: Angle of Rotation
Movement (Rotate Left (Z)) | Rotation Left (Z Axis) | Amount of Rotation: Angle of Rotation
Movement (Rotate Right (Z)) | Rotation Right (Z Axis) | Amount of Rotation: Angle of Rotation


Table 7: Input Compression Part 2, Survey Two

Input Type | Possible Motions | Adjuster
Choosing (First Item Selected) | Direction Forward and Back; Direction Forward and Back * 2; Rotation Down (Y Axis) | None
Choosing (Item Up One) | Direction Up, Rotation Up (Y Axis), then Direction Forward and Back; Direction Forward and Back * 2; Rotation Down (Y Axis) | None
Choosing (Item Up Multiple) | Direction Up; Direction Up - Direction Down * n; Rotation Up (Y Axis); Rotation Up - Rotation Up (Y Axis), then Direction Forward and Back; Direction Forward and Back * 2 | Amount of Items Up: Length of Direction, Angle of Rotation
Choosing (Item Down One) | Direction Down, Rotation Down (Y Axis), then Direction Forward and Back; Direction Forward and Back * 2; Rotation Down (Y Axis) | None
Choosing (Item Down Multiple) | Direction Down; Direction Down - Direction Down * n; Rotation Down (Y Axis); Rotation Down - Rotation Up (Y Axis), then Direction Forward and Back; Direction Forward and Back * 2 | Amount of Items Up: Length of Direction, Angle of Rotation
Selection (Start of Selection) | Start at Centre, Direction towards Point of Interest, then Direction Forward and Back; Direction Forward and Back * 2 | Distance Travelled in Selection: Length of Movement
Selection (Deselect) | Direction Left; Direction Right | None


Appendix H – Base Situations

Table 8: Context Relationships

ID Situation Related To

1 Base OS (Home Screen) 6, 9, 14

2 Music Player 12

3 Web Browser 8, 12, 14

4 Calendar (Week View) 5, 6, 14

5 Calendar (Month View) 4, 6, 14

6 Calendar (Daily View) 4, 5, 1, 14

7 Photo Album 8, 12, 14

8 Image Viewing 3, 7, 12

9 Clock Application 1

10 Phone Call 11

11 Contact List 10, 14

12 File Manager 2, 3, 7, 8, 14

13 Confirmation Window

14 Item is selected 1, 3, 4, 5, 6, 7, 11, 12

15 In hierarchical menu

16 Selecting an item (currently unselected)

Page 146: An intuitive motion-based input model for mobile devicesAn intuitive motion-based input model for mobile devices Mark Richards Thesis submitted for the degree of Masters of Information

An intuitive motion based model for mobile devices – Bibliography

146 | P a g e

Appendix I – Sample Motion Model

Figure I-1: Model Map, Direction Down


Figure I-2: Model Map, Direction Up


Figure I-3: Model Map, Direction Left


Figure I-4: Model Map, Direction Right


Bibliography

1. Amant, R. S., Horton, T. E., & Ritter, F. E. (2004). Model-based evaluation of cell phone menu interaction. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 343-350). Vienna, Austria: ACM Press.

2. Blome, M., & Wasson, M. (2002, July). DirectShow: Core Media Technology in Windows XP Empowers You to Create Custom Audio/Video Processing Components. MSDN Magazine, Vol. 17, No. 7.

3. Boyle, R., & Thomas, R. (1988). Computer Vision: A First Course. Oxford, UK: Blackwell Scientific Publications.

4. Brewster, S., Lumsden, J., Bell, M., Hall, M., & Tasker, S. (2003). Multimodal 'eyes-free' interaction techniques for wearable devices. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 473-480). Ft. Lauderdale, Florida, USA: ACM Press.

5. Canny, J. (1986). A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell., 8(6), 679-698.

6. Charmandari, E., et al. (2005). Endocrinology of the stress response. Annual Review of Physiology, 67, 259-284.

7. Chen, J. S., & Medioni, G. (1989). Detection, localization, and estimation of edges. IEEE Trans. Pattern Anal. Mach. Intell., 11(2), 191-198.

8. Chesnut, C. (2006). /cfMDX: Windows Mobile DirectX and Direct3D. Retrieved 21/3/2006 from http://www.mperfect.net/cfMDX/

9. Cowburn, N. (2004). Using the Integrated Camera in HTC Devices from Managed Code. Retrieved 12/04/2005 from http://blog.opennetcf.org/ncowburn/PermaLink,guid,5f0ebbac-8199-4ad1-aaa5-5e84af695359.aspx


10. Crossan, A., Murray-Smith, R., Brewster, S., Kelly, J., & Musizza, B. (2005). Gait phase effects in mobile interaction. In CHI '05 Extended Abstracts on Human Factors in Computing Systems (pp. 1312-1315). Portland, OR, USA: ACM Press.

11. Davies, E. (1990). Machine Vision: Theory, Algorithms and Practicalities. London, UK: Academic Press.

12. Goldberg, D., & Richardson, C. (1993). Touch-typing with a stylus. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 80-87). Amsterdam, The Netherlands: ACM Press.

13. Goodman, J., Venolia, G., Steury, K., & Parker, C. (2002). Language modeling for soft keyboards. In Eighteenth National Conference on Artificial Intelligence (pp. 419-424). Edmonton, Alberta, Canada: American Association for Artificial Intelligence.

14. Gluckman, J., & Nayar, S. K. (1998). Ego-motion and omnidirectional cameras. Paper presented at the Sixth International Conference on Computer Vision.

15. Human Interface Technology Lab, University of Washington, WA, USA. (2005). ARToolkit Home Page. Retrieved 1/11/2005 from http://www.hitl.washington.edu/artoolkit/

16. MacKenzie, I. S., & Buxton, W. (1992). Extending Fitts' law to two-dimensional tasks. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 219-226). Monterey, California, United States: ACM.

17. Marentakis, G., & Brewster, S. A. (2004). A Study on Gestural Interaction with a 3D Audio Display. Paper presented at Mobile HCI 2004, University of Strathclyde, Glasgow, Scotland.

18. Microsoft Corporation. (2003). Smart Client Developer Center. Retrieved 12/03/2005 from http://msdn.microsoft.com/smartclient/community/cffaq/default.aspx

19. Microsoft Corporation. (2005). .NET Compact Framework. Retrieved 9/7/2005 from http://msdn2.microsoft.com/en-us/netframework/aa497273.aspx


20. Microsoft Corporation. (2005). MSDN Home Page (Australia - English). Retrieved 11/8/2005 from http://msdn2.microsoft.com/en-au/default.aspx

21. Microsoft Corporation. (2005). Windows Mobile DirectX and Direct3D. Retrieved 17/3/2006 from http://www.mperfect.net/cfMDX/

22. Microsoft PressPass. (2005). Microsoft Releases Windows Mobile 5.0. Retrieved 11/05/2005 from http://www.microsoft.com/presspass/press/2005/may05/05-10WindowsMobile5PR.asp

23. Nesbat, S. B. (2003). A system for fast, full-text entry for small electronic devices. In Proceedings of the 5th international conference on Multimodal interfaces (pp. 4-11). Vancouver, British Columbia, Canada: ACM Press.

24. Nodelman, V. (2004). OOP via C++, C#...? In Proceedings of the 9th annual SIGCSE conference on Innovation and technology in computer science education (pp. 255-255). Leeds, United Kingdom: ACM Press.

25. Oliver, N., & Pentland, A. (1999). DyPERS: Dynamic Personal Enhanced Reality System. Retrieved 26/09/2005 from http://research.microsoft.com/~nuria/dypers/dypers.htm

26. Oney, W. (2002). Programming the Microsoft® Windows® Driver Model (2nd ed.). Redmond, WA, USA: Microsoft Press.

27. OpenNETCF Consulting, LLC. (2003-2004). OpenNETCF.org. Retrieved 02/03/2005 from http://www.opennetcf.org/CategoryView.aspx?category=Home

28. Paelke, V., Reimann, C., & Stichling, D. (2004). Kick-up menus. In CHI '04 extended abstracts on Human factors in computing systems (pp. 1552-1552). Vienna, Austria: ACM Press.

29. Palm, Inc. (2005). Graffiti 2 Writing Software. Retrieved 12/6/2005 from http://www.palm.com/us/products/input/graffiti2.html

30. Silicon Graphics, Inc. (2006). OpenGL – The Industry Standard for High Performance Graphics. Retrieved 13/3/2006 from http://www.opengl.org/

31. Strachan, S., et al. (2004). Dynamic Primitives for Gestural Interaction. Paper presented at Mobile HCI 2004, University of Strathclyde, Glasgow, Scotland.

32. Stringfellow, C. V., & Carpenter, S. (2005). An introduction to C# and the .NET framework. J. Comput. Small Coll., 20(4), 271-273.

33. Studierstube, Graz University of Technology, Austria. (2005). Handheld Augmented Reality. Retrieved 4/11/2005 from http://studierstube.icg.tu-graz.ac.at/handheld_ar/

34. Tegic Communications, Inc. (2005). T9 Text Input. Retrieved 7/6/2005 from http://www.t9.com

35. Vernon, D. (1991). Machine Vision. Upper Saddle River, NJ, USA: Prentice-Hall.

36. Vito Technology. (2006). VITO Remote – Pocket PC infrared universal remote control. Retrieved 13/2/2006 from http://www.vitotechnology.com/en/products/remote.html

37. Wigdor, D., & Balakrishnan, R. (2003). TiltText: Using tilt for text input to mobile phones. In Proceedings of the 16th annual ACM symposium on User interface software and technology (pp. 81-90). Vancouver, Canada: ACM Press.

38. Wilkens, L. (2003). The joy of teaching with C#. J. Comput. Small Coll., 19(2), 254-264.

39. Williamson, J., & Murray-Smith, R. (2004). Pointing without a pointer. In CHI '04 extended abstracts on Human factors in computing systems (pp. 1407-1410). Vienna, Austria: ACM Press.

40. Williamson, J., & Murray-Smith, R. (2005). Hex: Dynamics and Probabilistic Text Entry. In Switching and Learning in Feedback Systems. Springer.

41. Windows for Devices. (2005). Smartphones™ and Pocket PC™ Phones Quick Reference Guide. Retrieved 12/05/2005 from http://www.windowsfordevices.com/articles/AT2468909181.html

42. Zhu, Q. (1992). Improving edge detection by an objective edge evaluation. In Proceedings of the 1992 ACM/SIGAPP Symposium on Applied computing: technological challenges of the 1990's (pp. 459-468). Kansas City, Missouri, United States: ACM Press.