
Real Time Image Processing manuscript No. (will be inserted by the editor)

Ricardo Jota · Bruno Araújo · Luís Bruno · João M. Pereira · Joaquim A. Jorge

IMMIView: A multi-user solution for design review in real-time

the date of receipt and acceptance should be inserted later

Abstract IMMIView is an interactive system that relies on multiple modalities and multi-user interaction to support collaborative design review. It was designed to offer natural interaction in visualization setups such as large scale displays, head mounted displays or TabletPC computers. To support architectural design, our system provides content creation and manipulation, 3D scene navigation and annotations. Users can interact with the system using laser pointers, speech commands, body gestures and mobile devices.

In this paper, we describe how we designed a system to answer architectural user requirements. In particular, our system takes advantage of multiple modalities to provide natural interaction for design review. We also propose a new graphical user interface adapted to architectural user tasks, such as navigation or annotation. The interface relies on a novel stroke-based interaction supported by simple laser pointers as input devices for large scale displays. Furthermore, input devices like speech and body tracking allow IMMIView to support multiple users. Moreover, they allow each user to select different modalities according to their preference and to the adequacy of each modality for the task at hand.

We present a multi-modal fusion system developed to support multi-modal commands in a collaborative, co-located environment, i.e., with two or more users interacting at the same time on the same system. The multi-modal fusion system listens to inputs from all the IMMIView modules in order to model user actions and issue commands. The multiple modalities are fused by a simple rule-based submodule developed in IMMIView and presented in this paper.

User evaluations performed on IMMIView are presented. The results show that users feel comfortable with the system and suggest that users prefer the multi-modal approach to more conventional interactions, such as mouse and menus, for the architectural tasks presented.

VIMMI group, INESC-ID
Department of Information Systems and Computer Science
IST/Technical University of Lisbon
Rua Alves Redol, 9, 1000-029 Lisboa, Portugal.
E-mail: [email protected], [email protected], [email protected], [email protected], [email protected]

1 Introduction

The IMPROVE project focused on the review of 3D models in a multi-user, loosely coupled collaborative environment. The project included architects and automotive users who presented us with specific requirements divided into three areas: navigation, annotations and object editing. Users also defined three scenarios where different artifacts were required for interaction and visualization. In the first scenario, users requested 3D head-mounted displays to physically move around a virtual model. In the second scenario, users were required to go on-site and use a tablet PC to annotate and edit an architectural model. Finally, in the third scenario, users asked to be able to loosely collaborate while interacting with a large scale display. Our challenge in the project was to create a system that provided the desired functionality, supported such diverse scenarios and was still recognized as a single system. Moreover, the system was required to be collaborative, co-located and distributed, which also meant that the system needed to be multi-user.

Our contribution to IMPROVE was IMMIView: a multi-user system for real-time design review. Its graphical user interface is based on pen strokes, instead of point and click. We found that this kind of graphical user interface could be applied to all the requested scenarios. That is, the system can be used with 3D head-mounted displays, a large-scale display or multiple tablet PCs networked to form a collaborative environment, while the graphical user interface only needs minor adjustments. For distributed collaboration, the system includes a communication backbone that is responsible for the synchronization between instances, even if the system is being used on different output devices. For co-located collaboration, the system supports multiple inputs that it is able to mix by means of the multi-user fusion module described in this paper.


To enable interaction with the GUI, IMMIView includes a number of modalities: pen or laser interaction, speech recognition, mobile devices and body tracking. All of them can be used in conjunction to produce new multi-modal dialogs more appropriate to some scenarios. For example, in the large scale display scenario, the multi-modal system allows users to combine body gestures with laser input and voice recognition to resize objects.

To measure the success of IMMIView, we conducted tests with users where quantitative and qualitative data, as well as important comments and suggestions, were recorded. The users were handed a questionnaire based on the standardized ISONORM 9241 Part 10 (usability questionnaire) and the results suggest that IMMIView: (i) is suited for the main tasks (annotation, 3D editing, navigation), (ii) conformed with user expectations related to the interaction techniques and multi-modal resources provided, and (iii) is suited to learning, presenting in general a comfortable learning curve.

The rest of the paper is structured as follows. First we discuss the related work. Afterwards, we detail the functionality, the system architecture and the graphical user interface of IMMIView. Next, we describe the modalities along with the system support for multi-user interaction and multi-modal fusion. User tests and other deployments are presented in the results section, followed by our conclusions and future work.

2 Related Work

During the last decade, several studies showed that virtual and augmented reality environments are suitable to address collaborative design review for architectural design applications (Iacucci and Wagner (2003); Dvorak et al (2005); Wang (2007)). Tasks supported by existing applications include 3D annotations (Jung et al (2002); Kadobayashi et al (2006); Kleinermann et al (2007); Isenberg et al (2006)), object placing in virtual environments (Drettakis et al (2007); Broll (2003)), 3D navigation (Broll (2003); Kleinermann et al (2007)) and augmented information visualization (Ishii et al (2002)). In (Drettakis et al (2007)), a 2D desktop application is presented to place virtual content using a top view in a stereo visualization environment. The evaluation of the system was focused on the rendering quality, but also showed that the top view does not allow users to place objects with precision. Both (Kadobayashi et al (2006)) and (Jung et al (2002)) propose 2D desktop applications where several users can interact with the same virtual environment and share annotations, using a billboard representation (Kadobayashi et al (2006)) or by sketching directly over the 3D model (Jung et al (2002)). Recently, (Kleinermann et al (2007)) proposed an enriched annotation system for virtual environments where multimedia contents such as images and videos are supported and it is possible to define navigation paths using 3D landmarks placed along the 3D scene.

During the site survey analysis, architects deal with large quantities of 2D information, such as notes, maps and photos. To support this step of the architectural process, several applications were designed for tabletop interaction and augmented reality. Ishii (Ishii et al (2002)) presents the Luminous Table, where several users can interact, and takes advantage of a tangible interface to place virtual objects on top of 2D drawings. They also complemented their system with a 2D projection over the physical objects to visualize simulated effects such as wind patterns, traffic or virtual shadows. The ARTHUR project (Broll (2003)) also takes advantage of tangible interfaces and augmented reality. It provides a setup where several users, wearing head mounted displays, can interact with virtual buildings. Using freehand gestures and 3D pointers, users can open 3D floating menus that allow shape creation and selection. A comparison between tangible interfaces and a traditional user interface was performed using a 3D stereo working bench setup (Kim and Maher (2005)), showing that tangible interfaces help coordinate multiple actions. While these systems present some solutions for architectural design, they are not suited for design review and do not offer annotations, navigation and shape editing in the same environment. In addition, augmented reality can limit the user in conceptual design review, since it requires the usage of visualization devices with a limited field of view, such as head-mounted displays. Furthermore, tabletops present a good environment for collaboration, but they do not reproduce the experience of a real (non-virtual) navigation. Considering the price decrease of projection systems, large scale displays can be used to mix the advantages of both visualization systems. They do not constrain the user's field of view and can be used to reproduce the experience of a real navigation.

Architects are used to interacting and performing modeling activities using 2D GUI-based applications. However, these applications are usually designed on top of complex WIMP (Windows, Icons, Mouse, Pointing) based interfaces, which do not naturally adapt to 3D environments or large scale displays. For large scale displays, the whiteboard analogy is considered more appropriate to describe interactive graphical displays because it affords drawing and sketching. This analogy was the subject of research with the introduction of 2D calligraphic interfaces. Taking advantage of emerging input devices such as tablet PCs, systems such as SKETCH (Zeleznik et al (1996)) and Teddy (Igarashi et al (1999)) better support 3D modeling by taking advantage of the designer's sketching skills. Both systems allow direct drawing on a 3D view using a graphical 2D symbol-based grammar or contour sketching to invoke modeling operations. This approach was later followed by the GIDeS modeling project (Pereira et al (2003)), which explores new mechanisms, such as a suggestive interface, to support the sketch interpretation.


The SmartPaper system (Shesh and Chen (2004)) combines 3D reconstruction with sketch-based interaction. More recent work (de Araujo and Jorge (2003); Nealen et al (2005)) enhances this approach by providing complex 3D modeling operators based on contour and feature-line editing. While sketch-based interfaces are well adapted to 3D modeling, they do not entirely avoid conventional GUI primitives to activate other functionality such as navigation. Jacoby and Ellis (Jacoby and Ellis (1992)) present 2D menus over a 3D map view adapted to navigation in virtual three-dimensional content. However, limitations of the virtual environment technology used contribute additional constraints to menu design. As menu alternatives, several 2D applications such as Hopkins (1991) and Callahan et al (1988) propose circular menus. Holosketch (Deering (1995)) uses a similar approach where all the functionality is exposed using 3D icons organized in a circular layout. In IMMIView, we chose to combine sketching, gestures and voice modalities in order to reduce the need for traditional menus. We also propose a novel graphical user interface, more adapted to large scale displays.

Recently, different alternatives were proposed to replace or adapt traditional input devices for interaction with large scale displays. Several approaches (Buxton et al (2000); Grossman et al (2002); Bae et al (2004)) proposed designing curves on large screens by mimicking the tape drawing technique used in the car industry. Cao and Balakrishnan (2004) developed an interface based on a wand (a colored stick) tracked by two cameras to interact with large scale displays. Other approaches use laser pointers as input devices for large scale displays (Lapointe and Godin (2005); Davis and Chen (2002); Oh and Stuerzlinger (2002)). Most of them try to adapt the WIMP concept to the laser input device, but run into problems due to lack of precision and unstable, jittery movements. While these approaches explicitly address large scale displays, none of them allows different users to interact with the display simultaneously or addresses architectural scenarios.

3 IMMIView

IMMIView is our solution for architectural design review of 3D models. The system was built to meet the architects' requirements, providing functionality that allows users to perform the following tasks in the virtual environment: navigation, annotation, 3D editing and collaborative review. The system architecture of the application relies on the AICI Framework (Hur et al (2006)) and on our own framework, called the IMMIView framework, which provides 3D content visualization, supports different input devices (marker tracking system, laser pointers, mice, keyboards and others) and implements interaction module concepts, such as gesture recognizers or multi-modal fusion.

IMMIView offers a graphical user interface that, when compared to traditional window-based desktop applications, proposes an alternative layout relying on stroke-based interaction instead of the common point-and-click metaphor. The next subsections describe our solution in detail, providing a complete description of the architects' requirements and the functionality provided by our system.

3.1 Architectural System Requirements

Customer review of architectural 3D models is one of the major tasks performed by architects during a project life-cycle. This section details the system requirements, both on functionality and on interaction metaphors, given by the architects for that task. For each functionality group (navigation, annotations, 3D editing and collaborative review), interaction requirement details are explained in the following subsections.

The project review usually takes place at the office, where customers review the design alternatives taking into account information collected during on-site visits. Therefore, the system should be able to present different interactive design alternatives to support discussion between the architects and the customer. IMMIView should allow users to view the design proposal on desktop screens, 3D head mounted displays or large scale displays. Moreover, new 3D content and materials can be added to the scene and the lighting can be changed according to the time of the day. Users must be able to annotate parts of the model using text and drawings. Finally, the system must allow navigation over the virtual 3D content.

2D and 3D content interaction should be natural and support different media types (annotations, 3D models, pictures, 2D drawings). Moreover, the interface should support sketch-based interaction for most of its functionality, taking advantage of user familiarity with pen-based devices such as stylus tablets, interactive pen displays or other kinds of pointing devices. Thus, the interface should resemble the paper/pencil metaphor, but enhanced with multimodal dialogues. In addition, tracking of user bodies and movements allows the system to monitor users' actions, such as 2D or 3D gestures, and creates further possibilities to support multimodal user input.

In conclusion, users should be able to interact with the system using speech, gestures, the pen-based device on the tablet PC or a laser pointer when interacting with the large scale display.

3.1.1 Navigation

To explore the architectural 3D model, users need adequate means of navigation. The system should provide general approaches like flying, walking and examining. Flying and examining are dedicated to navigation within the virtual space, since the observer is detached from his/her physical location.


The detachment provides the user a means to easily reach locations normally not accessible during an early review session. The flying mode offers a better perspective with the maximum degree of freedom, allowing the user to perform zoom operations on the virtual elements, while examining provides a navigation technique that focuses on a single object and allows rotation and zoom around the selected object.

Walking, on the other hand, may provide a more natural and realistic way for the user to explore the model. For example, in the walking navigation mode, exploring the virtual scene requires the same actions as in reality: walking around and inside the 3D virtual building as if it were a real one. By using this modality, the user experiences the architectural artifacts from a first person point of view, and has to deal with elements such as stairs, walls, furniture and doors.

3.1.2 Annotations

Architect designers often take notes (audio, visual or documentary) which might be helpful at a later design stage. Annotations allow users to attach their comments and thoughts to a model entity. Thus, notes possess the character of an addendum defining what cannot be expressed in any other way. By capturing design intentions and modification suggestions, and by documenting the reasons for the alterations applied to the model, the user is able to identify areas of interest and create annotations, either in visual (drawings) or in multimedia format (audio/video). The functionality of an annotation system can be separated into two basic functionalities: (i) creating the note content and (ii) adding the note to the corresponding object. The user interface has to provide the following functionalities: (i) a mechanism to hide and unhide annotations; (ii) efficient means to create annotations; (iii) the possibility to include additional documents in annotations, such as construction plans or sketches; (iv) filtering of annotations and (v) deletion of annotations.

Fig. 1 User annotating using a laser pen to draw onscreen

3.1.3 3D Editing

The system should include functionality to visually validate architectural models, but also the opportunity to change design prototypes by applying minor modifications. Therefore, design modification should address the creation and editing of objects' geometry. In particular, users should be able to create the following graphics primitives: cube, sphere, cone, cylinder and plane. Moreover, to manipulate existing objects, the system should provide mechanisms to select objects and perform geometric transformations, like translation, rotation and scaling.

3.1.4 Collaborative Review

IMMIView should provide a set of tools and modalities to the architects in order to review building conceptions in a collaborative environment, i.e., with other architects or customers. For collaborative review, users are required to be present at the same location. In this case, visualization is done using the large scale display scenario, because of its large size and resolution. Although navigation, verification and modification tools are relevant, during collaborative review it is more important to allow users to annotate at the same time, thus providing multi-user annotation functionality in the collaborative review process.

Therefore, annotation and sketching should be possible for all participants, allowing each user to annotate at the same time. Moreover, the control of navigation should be accessible to all users; for example, a user should be able to change the view in order to share it with other participants. Users interacting with the system at the same time are free to select different modalities. This means that one user can move physically in front of the large screen and use a particular workspace where he can use his own widgets to examine objects of interest, while another user is using different modalities like gestures, speech, pointing or tracing strokes to create annotations.

3.2 Our System Architecture

The IMMIView application provides innovative multi-modal interaction for 3D content visualization using a bus architecture, as depicted in Figure 3. The system relies on two frameworks: the AICI Framework (Hur et al (2006)) for visualization and tracking support and the IMMIView Framework for interaction support. This framework is similar to the Studierstube project (Schmalstieg et al (2002)); however, it relies on OpenSG (2009) instead of OpenInventor, allowing cluster-based visualization on a tiled display system. InstantReality (IGD and ZGDV (2009)) presents similar functionality based on OpenSG and X3D description.


Fig. 2 Two users collaborating on a Large Scale Display.

Fig. 3 The IMMIView System Architecture

However, IMMIView multi-modal support requires deeper access to the application event loop than the one proposed by the X3D sensor-based architecture.

The AICI Framework is responsible for 3D rendering using display configurations such as head mounted displays, multi-projector screens or TabletPCs. The Visualization Manager is based on the OpenSG (2009) scenegraph and extended to support physically based rendering and advanced lighting using High Dynamic Range images. It also enables 3D stereo visualization and is complemented with a tracking component based on OpenTracker (Reitmayr and Schmalstieg (2005)). The tracking support enables 3D stereo using head tracking. The framework also provides a simple Widget Manager for immersive environments. Its event loop only supports binding tracking to the visualization and the usage of traditional input devices such as mice and keyboards.

Our system is based on the AICI Framework, thus our visualization is required to execute inside the AICI main thread. We also run the event manager main function inside this thread. The reason for doing this is that the event manager main function executes the callback functions, which might include visualization-related actions that are required to run inside the AICI main thread. All other modules run on their specific threads.

Because our system is based on events, we do not need to synchronize the multiple threads, except for the visualization thread, which is already synchronized inside the AICI Framework.

3.2.1 Events

The main component of our system, the event manager, relies on the IMMIView framework, which was designed to offer innovative multi-modal interaction techniques. Because the AICI event loop does not support input devices such as laser pointers or body tracking, we felt the need to develop our own event manager. Therefore, the architecture of IMMIView relies on an event manager module that includes a bus where the other modules can publish or subscribe to events. The event manager implements a simple API that offers two functions: publish and subscribe. Using these two functions, the event manager knows who is interested in an event type and is able to forward the information to all subscribing modules. The publish function is called whenever new information is available in the IMMIView modules. To simplify event management, each event type is identified by a string corresponding to its type. For example, each module that subscribes to the "laserEvent" type gets called back by the event manager when the laser input module publishes new information.

The event manager must run on a single thread, so that the callbacks are executed serially, enforcing the correct order of callbacks. Therefore, it implements a waiting line that is filled inside the publish function and consumed in the event manager main thread function (which eventually calls the subscribed callbacks).
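The sketch below illustrates this publish/subscribe pattern with a string-keyed bus and a waiting line drained from the main thread. It is only a minimal illustration: the actual IMMIView event manager is part of a C++/OpenSG application, and names such as EventManager, dispatch_pending and the example event types are assumptions.

```python
# Minimal sketch of the string-keyed publish/subscribe bus described above.
# The real IMMIView event manager is C++/OpenSG code; names and event types
# here ("laserEvent", "strokePoint") are illustrative only.
import queue
import threading
from collections import defaultdict

class EventManager:
    def __init__(self):
        self._subscribers = defaultdict(list)  # event type -> callbacks
        self._pending = queue.Queue()          # waiting line filled by publish()
        self._lock = threading.Lock()

    def subscribe(self, event_type, callback):
        with self._lock:
            self._subscribers[event_type].append(callback)

    def publish(self, event_type, payload):
        # May be called from any module thread; it only enqueues the event.
        self._pending.put((event_type, payload))

    def dispatch_pending(self):
        # Called from the main (visualization) thread so that callbacks run
        # serially and may safely touch the scene graph.
        while True:
            try:
                event_type, payload = self._pending.get_nowait()
            except queue.Empty:
                break
            with self._lock:
                callbacks = list(self._subscribers[event_type])
            for callback in callbacks:
                callback(payload)

# Example: a converter module turning raw laser points into higher-level events.
if __name__ == "__main__":
    bus = EventManager()
    bus.subscribe("laserEvent", lambda p: bus.publish("strokePoint", p))
    bus.subscribe("strokePoint", lambda p: print("stroke point:", p))
    bus.publish("laserEvent", (0.4, 0.7))  # e.g. published by the laser module
    bus.dispatch_pending()                 # run once per frame in the main loop
```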

According to their event behaviour, the IMMIView modules can be organized into three different classes: publishers, consumers and converters. Publishers are sources of information that do not require further information from the IMMIView system to update their status. Each publisher informs the event manager what kind of event it produces. Whenever there is new information, they insert it into the event manager's publish waiting line. Examples are the multi-user laser module, data proxies from hand-held devices and body tracking modules. Consumers require information to change their state. Therefore, they subscribe a callback function for a certain type of event. For example, the visualization module subscribes to navigation-type events in order to change its camera parameters. Other consumer modules include the annotation manager module, the shape creation and manipulation module and the widget module that provides the menu interface. Finally, some modules act as both consumers and publishers. They subscribe to multiple events, such as laser input and voice commands, and compose those into higher level events such as object selection or navigation actions. We call these modules converters.


Modules included in this class are detailed in the nextsection.

3.2.2 User Interaction

User interaction is supported by converter-type modules that listen to simple events and, in return, publish higher level events into the event manager. We have defined the following converter modules:

– Body Gesture Recognizer: Analyzes the tracking data obtained from a real-time marker-based motion capture of the user and publishes recognized gestures. To obtain user data, we track the user's head and arms and send the information to the recognizer, which is constantly trying to recognize body gestures performed by each user. With this information we can, for example, navigate using arm gestures.

– Cali Gesture Recognizer: Cali is a 2D symbol recognizer. It receives data from 2D input devices such as pens, mice or lasers. It triggers actions whenever a user draws a gesture. For example, the main menu can be opened by drawing a triangle gesture.

– Interaction Concept Module: This module is able to recognize a set of interaction behaviours and generate higher level interaction primitives. It supports object selection or pointing, drag-and-drop manipulation and lasso-based selection. Since this module is also aware of the context of the application and the entities present in the scene, it can provide contextual information about user actions, useful for the Multimodal Box.

– Multimodal Box: This module provides an abstract definition of multimodal interaction using a rule-based inference mechanism. Using all the information travelling on the event bus and a predefined grammar, the multimodal box is able to compose interaction using several modalities, such as body gestures plus voice commands, or mixing laser interaction with mobile devices to create new annotations.

To support collaborative design review between several instances of the IMMIView application, the original AICI framework was extended with an XML communication backbone based on the XMLBlaster (2007) middleware. The information related to the functional components, i.e., annotations, shapes, widgets and navigation, can be shared with other instances of the IMMIView system. Thanks to a centralized data coordinator, several applications can view the same data and manipulate it using different visualization systems. For example, several users can interact with a large scale display while the scene content is also edited remotely by another user using a head mounted display.
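As a rough illustration of what travels over this backbone, the sketch below serializes a shared annotation update as XML and hands it to a publish function. It is a sketch only: publish_to_backbone stands in for the XMLBlaster client, and the element and attribute names are assumptions, not the actual IMMIView message schema.

```python
# Illustrative serialization of a shared annotation update for the XML backbone.
# publish_to_backbone is a placeholder for the middleware client; the element
# and attribute names are assumptions, not the real IMMIView schema.
import xml.etree.ElementTree as ET

def annotation_update_message(note_id, text, anchor_xyz, author):
    msg = ET.Element("update", component="annotation", action="create")
    note = ET.SubElement(msg, "note", id=str(note_id), author=author)
    ET.SubElement(note, "text").text = text
    x, y, z = anchor_xyz
    ET.SubElement(note, "anchor", x=str(x), y=str(y), z=str(z))
    return ET.tostring(msg, encoding="unicode")

def publish_to_backbone(topic, xml_payload):
    # Placeholder for the call that forwards the message to the other
    # IMMIView instances (large scale display, tablet PC, HMD, ...).
    print(f"[{topic}] {xml_payload}")

if __name__ == "__main__":
    xml_msg = annotation_update_message(42, "Check this facade", (1.5, 0.0, 3.2), "userA")
    publish_to_backbone("immiview.annotations", xml_msg)
```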

3.3 Graphical User Interface

The IMMIView application offers a GUI for annotations, navigation, 3D editing and collaborative review. Our GUI is based on a set of menus that overlay the 3D scene; they were adapted to the different visualization environments handled by the IMPROVE project. The GUI proposes an alternative layout, when compared to traditional window-based desktop applications. It relies on stroke-based interaction instead of the common point-and-click metaphor. The stroke-based interaction was selected considering architects' sketching skills and its ease of use when interacting with TabletPCs or other hand-held pen-based computers and interactive whiteboards, such as laser interaction on large-scale displays.

The IMMIView application interprets a stroke in the following ways:

– Users can sketch symbols (2D gestures recognized by the CALI module) to launch IMMIView menus. For example, by drawing a triangle, the main menu appears where the gesture was recognized;

– To activate and select options of the GUI, the user draws a stroke that crosses over the options; crossing selects an option on a menu (see the sketch after this list);

– To select objects or bring up menus specific to the type of a 3D object (i.e., shape or annotation), the user sketches a lasso over the object, surrounding it. This selects the object and pops up a context menu.
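To make the crossing metaphor concrete, the sketch below shows one way an option can be considered activated when a stroke crosses its circular hotspot. This is an illustrative test, not the actual IMMIView widget code; the hotspot geometry and the function names are assumptions.

```python
# Illustrative crossing test: an option fires when consecutive stroke points
# move from outside its circular hotspot to inside it (or vice versa).
# Not the actual IMMIView widget code; geometry and names are assumptions.
import math

def inside(point, center, radius):
    return math.dist(point, center) <= radius

def stroke_crosses_option(stroke, center, radius):
    """stroke: list of (x, y) points in screen coordinates."""
    for prev, curr in zip(stroke, stroke[1:]):
        if inside(prev, center, radius) != inside(curr, center, radius):
            return True  # the stroke crossed the option boundary
    return False

if __name__ == "__main__":
    option_center, option_radius = (120.0, 80.0), 18.0
    stroke = [(90.0, 80.0), (110.0, 80.0), (130.0, 80.0)]  # sweeps across the option
    print(stroke_crosses_option(stroke, option_center, option_radius))  # True
```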

The IMMIView functionality is exposed through circular layout menus, similar to a torus shape. The menu options can use a textual or iconic representation and are activated by crossing the option. This solution replaces the use of the point-and-click metaphor, which is less adapted to pen-like input devices. The functionality accessible through the GUI is identifiable by different semi-transparent menu background colors and additional textual tool-tips that appear when sketching over an option. Menus are labelled using captions; however, the background color makes it easy to identify the scope of the functionality provided by the menu: annotations, navigation, object creation or transformation and system configuration all have different background colors.

Fig. 4 Main menu (left) and Notes Menu with the severalareas of interaction (right)


Fig. 5 Creating and Editing Annotations using the IM-MIView GUI

The textual information provided by the tooltips is valuable when interacting with the IMMIView application, since it is the basis of the voice command based interaction, beyond its traditional usage to disambiguate iconic representations of the GUI.

Figure 4 depicts two different menu examples available in IMMIView. On the left, the main menu is depicted using a green background and shows a set of textual options (up to eight options per menu). On the right, the annotation menu is depicted using a yellow background and presents iconic options plus a special interactive area to draw annotations. Menus can take advantage of their layout to propose interactive areas related to the task at hand. Starting from the main menu, all menus and their options are accessible within two levels. Some menus cluster related functionality by providing a left lateral menu, visible on the annotation menu in Figure 4. Finally, to support multiple user interaction on large scale display collaborative scenarios, several menus can be opened, moved and controlled using the peripheral options located on the top right of each circular menu.

3.3.1 Annotation Menu

The functionality related to annotations is available in the yellow menu or by selecting annotations already present in the scene. Figure 5 presents the several steps involved in annotation creation. The content of the note can be drawn in the central interactive area of the Annotation Menu (top left). To place a note, the user sketches a path from the placing button to the desired 3D location (top right). The annotation snaps automatically to objects. Notes are represented in the scene as floating post-its with anchors (bottom left). They can be edited, deleted or hidden by selecting them with a lasso. This action brings up an Annotation Menu dedicated to the selected note (bottom right).

Fig. 6 Three-mode navigation menu: 1st Person, Mini-Map, Explore.

Fig. 7 Creating simple shapes with the menu and editing their attributes

3.3.2 Navigation Menu

The Navigation Menu proposes three different ways to explore the 3D scene (see Figure 6). The first mode is a first-person-like navigation which is presented as a set of icons. This view allows the camera to move forward, move backward, turn left, turn right and control pitch (left picture). This mode is more suitable for local navigation and similar to a flying mode. The second mode is based on a mini-map and compass representation (middle). The user can sketch directly over the top view of the map located at the center of the menu, dragging its position. It is also possible to control the orientation by rotating the surrounding compass area. This mode enables fast global navigation. Finally, a third mode is offered by the navigation menu to explore a particular object of the scene (right). Using a track-ball-like representation, the user can zoom and rotate around an object of the 3D scene. Similarly to annotations, to select the target object the user draws a stroke between the top left menu option and the desired object.

3.3.3 Shape and Transformation Menu

To create simple shapes in the 3D scene, the user needs to open the Shape Menu.


Spheres, cubes, cones, cylinders and planes can be created and placed anywhere by sketching a path from the menu icon to the desired location, similarly to annotations. Moreover, these shapes can be deleted or transformed geometrically. To do this, users select a shape using the lasso, which brings up the transformation shape menu. The transformation menu provides translation, scale and rotation options. Figure 7 depicts the creation of a sphere using the shape menu and the selection of a shape with the corresponding transformation menu.

4 Multimodal solution

IMMIView was designed to support different visualization scenarios such as large scale displays, head mounted displays or tablet PCs. However, the user interface differences between scenarios were required to be minimal, so that users could switch from one scenario to another without the need to learn a new interface. Furthermore, the user interface should be flexible in order to enable several users to collaborate using different scenarios. For example, a user using a large scale display could collaborate with another using a tablet PC while sharing the same interaction concepts. The need for a common set of interaction metaphors led us to create a multimodal system, where a principal modality, which allows for the basic interaction, is available in all scenarios, and secondary modalities are available in each scenario, taking advantage of each scenario's qualities. The availability of different modalities also means that interaction dialogs can be expanded with multimodal commands. The modality fusion was implemented in such a way that how actions are accessed can easily be re-configured using a script file, thus supporting different scenarios without recompiling the system. The following subsections explain how each modality was implemented, its intent and how the modality fusion was achieved.

4.1 Inputs and Modalities

IMMIView offers several input devices for interaction. For example, the laser pointer is used for generic interaction, to cross menus or to annotate 3D models. We developed a set of secondary input devices to enhance the interaction and support multi-user scenarios. The laser can be used in combination with speech-based interaction to activate menu options, or with mobile devices to create and insert multimedia note content. Using body tracking with speech, users can navigate and edit objects naturally. Figure 8 shows a user interacting with IMMIView using laser and body tracking.

4.1.1 Laser Input

As presented before, our GUI is based on stroke input. On a tablet PC, strokes are executed using a pen.

Fig. 8 User interacting with IMMIView. Left: the user opens a circular menu. Right: the user navigates through the virtual world.

However, the pen is cumbersome and does not allow direct input on large scale displays. Therefore, we adopted laser pens to draw strokes directly on the large scale display, similar to the pen functionality on the tablet PC. The laser position is captured using image processing techniques. We use an IR-sensitive camera to reduce the image noise and improve laser detection. The image is captured and filtered to identify high intensity pixels, which we consider to be the laser position. Afterwards, each laser position is sent to the application, which translates this information into cursor information and constructs strokes. Further information regarding how lasers support multiple users is detailed in Section 4.2. Figure 9 depicts the three main steps of the laser recognition algorithm.

Fig. 9 Laser detection algorithm steps: Acquisition, Filtering and Stroke Matching
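As a rough illustration of the filtering step, the sketch below thresholds a grayscale camera frame and reports the centroid of the bright pixels as the laser position. The threshold value and function names are assumptions for this sketch, not the actual IMMIView implementation.

```python
# Illustrative version of the filtering step: threshold an IR camera frame and
# report the centroid of high-intensity pixels as the detected laser position.
# The threshold value and function names are assumptions for this sketch.
import numpy as np

def detect_laser(frame, threshold=230):
    """frame: 2D uint8 array from the IR-sensitive camera.
    Returns (row, col) of the detected laser spot, or None if no spot is found."""
    bright = frame >= threshold
    if not bright.any():
        return None
    rows, cols = np.nonzero(bright)
    return float(rows.mean()), float(cols.mean())

if __name__ == "__main__":
    frame = np.zeros((480, 640), dtype=np.uint8)
    frame[100:103, 200:203] = 255   # simulated laser spot
    print(detect_laser(frame))      # approximately (101.0, 201.0)
```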

4.1.2 Speech based Interaction

The IMMIView speech system enables users to invoke vocal commands to control the application and to perform navigation, modeling, annotation creation and configuration tasks. Speech is mostly used on three occasions: to navigate the menu-based graphical interface, to enhance laser interaction with direct commands, for example "Hide this", and to support body tracking navigation. To improve recognition, IMMIView's speech module informs the application clients of interface actions. This enables the speech client to reduce grammar complexity by only interpreting commands available on screen. For example, if a menu is opened by the client, the corresponding speech actions are added to the grammar. Likewise, whenever a menu is closed, the speech sub-module informs the client and the corresponding commands are removed from the active grammar.
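The sketch below illustrates the idea of keeping the recognition grammar in sync with the menus currently on screen. It is a simplification of the behaviour described above; the class name, command strings and matching logic are assumptions.

```python
# Simplified illustration of the dynamic grammar idea: the set of accepted
# phrases grows and shrinks as menus are opened and closed, so the recognizer
# only has to match commands that are currently visible on screen.
# Class and command names are assumptions for this sketch.
class ActiveGrammar:
    def __init__(self, global_commands):
        self._global = set(global_commands)  # always available ("open main menu", ...)
        self._per_menu = {}                   # menu name -> its commands

    def menu_opened(self, menu, commands):
        self._per_menu[menu] = set(commands)

    def menu_closed(self, menu):
        self._per_menu.pop(menu, None)

    def accepts(self, phrase):
        phrase = phrase.lower().strip()
        if phrase in self._global:
            return True
        return any(phrase in cmds for cmds in self._per_menu.values())

if __name__ == "__main__":
    grammar = ActiveGrammar(["open main menu", "hide this"])
    print(grammar.accepts("create note"))   # False: annotation menu not open yet
    grammar.menu_opened("annotation", ["create note", "delete note"])
    print(grammar.accepts("create note"))   # True
    grammar.menu_closed("annotation")
    print(grammar.accepts("create note"))   # False again
```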


4.1.3 Mobile Device Input

The post-it metaphor was frequently mentioned whenever annotations were discussed during user interviews. This metaphor, to some extent, is available in the GUI: by writing an annotation and dragging it to the desired location. Even so, on the large scale display, the size of the display made it difficult for users to write an annotation and then accurately place it. Therefore, a mobile device was used to simulate physical post-it notes. The user can draw something on a mobile device, or choose a picture, and then, using the laser, point at the large scale display and an annotation is created at that position (see Figure 10).

Fig. 10 Mobile device metaphor example.

4.1.4 Body Tracking

Body tracking enables us to use gestures and body poses to interact more naturally. With head mounted displays, body tracking gives the view position of the user. On large scale displays, body tracking allows users to navigate and edit objects through gestures. To navigate, users issue a spoken command to enter the mode and then point to where they want to fly to. Users control speed, pitch and roll by arm position and inclination. These afford a simple metaphor to fly over the scene. To change shapes, the user selects an object and then activates the change mode via a spoken command. In this mode, arm gestures are reflected in the object's size and shape. For example, to shrink an object, we can select it, utter the "Edit object" command and then move both arms together to scale down the selected shape.

The arm position and inclination control speed, pitch and roll, offering a simple metaphor to fly over the scene. To resize objects, a user selects an object and then issues a speech command to activate the mode; afterwards, arm gestures are reflected on the object. Figure 11 depicts the gestures allowed by tracking.

Our tracking setup uses 4 cameras with infrared beamers to detect infrared-reflective spherical markers. Our tracking recognizer module receives the labeled position of each reflective marker from the tracking system. Knowing each marker, we are able to recognize postures and gesture commands by computing the geometrical relationship between hands and shoulders.

Fig. 11 Example of actions supported by body tracking: Top- Fly mode with speech commands. Bottom left - Moving anobject. Bottom Right - Scaling an object.
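The following toy example hints at the kind of geometric computation involved: deriving fly-mode controls from labeled hand and shoulder markers. The marker labels, thresholds and the speed/roll mapping are assumptions; the real recognizer works on the output of the marker-based optical tracker.

```python
# Toy example of deriving fly-mode controls from labeled marker positions
# (hands and shoulders). Marker labels and the speed/roll mapping are
# assumptions; the real recognizer uses the optical tracker's output.
import math

def arm_vector(markers, hand, shoulder):
    hx, hy, hz = markers[hand]
    sx, sy, sz = markers[shoulder]
    return (hx - sx, hy - sy, hz - sz)

def fly_controls(markers):
    """Map the right arm's elevation to speed and the height difference
    between both hands to roll. Returns (speed, roll) in arbitrary units."""
    vx, vy, vz = arm_vector(markers, "right_hand", "right_shoulder")
    length = math.sqrt(vx * vx + vy * vy + vz * vz) or 1.0
    elevation = math.asin(max(-1.0, min(1.0, vy / length)))  # raised arm -> faster
    speed = max(0.0, elevation / (math.pi / 2))
    roll = markers["right_hand"][1] - markers["left_hand"][1]  # tilt between hands
    return speed, roll

if __name__ == "__main__":
    markers = {
        "right_shoulder": (0.2, 1.5, 0.0), "left_shoulder": (-0.2, 1.5, 0.0),
        "right_hand": (0.6, 1.9, 0.3), "left_hand": (-0.6, 1.4, 0.3),
    }
    print(fly_controls(markers))
```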

4.2 Multi-User Support

IMMIView supports both distributed and co-located collaboration. In distributed collaboration, each user interacts with a different instance of the system, and the instances are synced in real-time. Since each user uses their own input setup, this scenario does not present real multimodal issues. On the other hand, when two or more users collaborate on the same system (co-located), the modalities must be aware that there are different users. Multi-user support works seamlessly when each user is using different modalities; for example, one user can be using the mobile device and laser interaction while the other is using voice commands to navigate menus. Each user can execute any functionality without interfering with the others. One exception is navigation, because it transforms the system's view of the model. If users need to navigate to different parts of the model, then the suggested collaboration model is the distributed one. However, early tests showed that two users use the same modality more often than planned. Again, for most modalities multi-user support can be trivially achieved. Voice commands can be disambiguated using separate microphones for each user, and the tracking system can also be used to follow multiple users. Laser interaction, however, needed to be adapted to multi-user settings. At the input level, we need to disambiguate between multiple inputs and identify laser inputs as continuous strokes. At the application level, we need to provide feedback so that users can identify their own input.


4.2.1 Disambiguating Laser Input

Our main problem was finding out which events belonged to which stroke. The captured image identifies the position of all active lasers, but we need to match the identified positions with the positions obtained in the previously captured image. Using a Kalman filter, we are able to detect how many users are interacting and maintain their interaction state. The Kalman filter is a well known method for stochastic estimation which combines deterministic models and statistical approaches in order to estimate the variable values of a linear system (Welch and Bishop (2006)). In our system, we use this technique to estimate and predict possible laser positions. The cameras work as clients which are able to identify laser positions. The registration of the camera and its calibration provide a homography that translates camera coordinates to application coordinates. Using the camera homographies, points are translated to application coordinates and then sent to a single server which is responsible for collecting the information from all cameras and matching the input information with the active strokes. Using the Kalman filter's predictive behavior, it is possible to match points of the same laser, even when the laser crosses several cameras. The matching identifies when strokes are initiated, remain active and are terminated. The advantage of this approach lies in its support for multi-user interaction. Using several filters, one for each active laser, we can identify strokes and compute their status. Figure 12 depicts this workflow. If there is a prediction without a matching input, we conclude that the stroke has finished. If a point is not matched to any estimation, we assume that a new stroke has started and use the event coordinates as the stroke's first position. If there is a match between a point and an estimation, the corresponding stroke is updated with a new point and remains active. Thanks to this disambiguation algorithm for our laser input device, several users can interact with our GUI using several lasers at the same time on the large scale display.

Fig. 12 Workflow of our stroke estimation approach
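The sketch below captures the matching logic of this workflow, with a simplified constant-velocity predictor standing in for the per-stroke Kalman filter. The distance threshold and data structures are assumptions made for illustration.

```python
# Simplified version of the stroke matching step. A constant-velocity predictor
# stands in for the per-stroke Kalman filter; the matching threshold and the
# data structures are assumptions made for this sketch.
import math

class Stroke:
    def __init__(self, point):
        self.points = [point]
        self.velocity = (0.0, 0.0)

    def predict(self):
        x, y = self.points[-1]
        vx, vy = self.velocity
        return (x + vx, y + vy)

    def extend(self, point):
        x, y = self.points[-1]
        self.velocity = (point[0] - x, point[1] - y)
        self.points.append(point)

def match_frame(active_strokes, detected_points, max_dist=30.0):
    """Assign this frame's laser points to active strokes.
    Returns (continued strokes, new strokes); unmatched strokes are killed."""
    remaining = list(detected_points)
    continued = []
    for stroke in active_strokes:
        prediction = stroke.predict()
        candidates = [p for p in remaining if math.dist(p, prediction) <= max_dist]
        if candidates:                      # prediction matched: continue the stroke
            best = min(candidates, key=lambda p: math.dist(p, prediction))
            stroke.extend(best)
            remaining.remove(best)
            continued.append(stroke)
        # else: prediction without matching input -> the stroke is finished
    new_strokes = [Stroke(p) for p in remaining]   # unmatched points start new strokes
    return continued, new_strokes

if __name__ == "__main__":
    strokes = [Stroke((100.0, 100.0))]
    strokes, new = match_frame(strokes, [(105.0, 102.0), (400.0, 300.0)])
    print(len(strokes), len(new))   # 1 continued stroke, 1 new stroke
```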

4.3 Modality Fusion

The IMMIView prototype exposes a set of functionality accessible through the fusion of several modalities. These composite interactions are defined using a rule-based definition which is managed by our multi-modal module, referred to as the MultiModal Box in the System Architecture section above. This module is structured into two parts: the first part is the set of rules defining all possible interactions, and the second part is the state of the module, which is represented by all valid tokens. By tokens, we refer to all the information which can be input from a given modality or recognizer, or produced as a consequence of activating a rule in our multi-modal module.

4.3.1 Representing Multimodal Interaction

Our multi-modal module defines interaction using a set of rules. Each rule is divided into a set of preconditions and actions. In order to apply a rule, all its preconditions need to be fulfilled, which results in a set of actions changing the status of the multi-modal module. Our system supports the following types of preconditions, which represent abstract concepts supported by the module:

Tokens: an abstract knowledge representation, which can represent a user body action or gesture such as "pointing", a speech command or even an application mode. Tokens can be enriched with attributes, allowing them to represent any kind of event from the user or from other interaction modules existing in the application, such as recognizers or specialists.

Objects: represent identified system entities over which the user can perform actions. These objects have a class associated with them, identifying a subset of objects. For example, in our system we support the notions of "shapes", "annotations", "anchors" and "widgets".

Regarding the possible set of actions which can be applied by a rule, just two types of actions are supported by our definition. The first are operators that manage the data matching the preconditions, allowing the rule to remove tokens or objects or to generate new ones, changing the status of the module. The second are commands that activate functionality of the IMMIView system provided by other modules, as depicted by the IMMIView architecture.

The following list presents three examples of rules used by our multi-modal system, where T<>, O<> and C<> represent tokens, objects and commands respectively. The first rule represents the launch of the main menu using the voice command "Open Main Menu". As a result of activating this rule, the command is given to our widget manager and the token is removed. The token is an abstract piece of information with no dependency on the modality. In our implementation, the voice module generates the token T<openMainMenu> when the voice command is recognized.


On the other hand, this token is also generated when a triangular shape is recognized by our 2D shape recognizer.

Rule 1: T<openMainMenu> ⇒ C<widget2D:TOK0>, -T<TOK0>

Rule 2: T<moveUp>, O<widget> ⇒ C<widget2D:OBJ:MoveUP>, -T<TOK0>

Rule 3: T<selectThis>, T<pointingOverObject> ⇒ C<Object3D:select:ATT10>, C<Context:menu:ATT10>, -T<TOK0>

The second rule illustrates the activation of an option named MoveUp on a widget object. Finally, the last rule shows an example of a multi-modal interaction to select an object using a pointing gesture combined with a voice command. For each rule, the actions can use references to the data matched by a precondition: TOKn refers to the nth token in the precondition list of a rule, OBJn refers to the identifier of the nth object, and ATTnm refers to the mth attribute of the nth token.

4.3.2 Inferring Multimodal Interaction

The Multimodal module receives information from the other components in the form of new tokens or new objects to update its status. A temporal duration is assigned to each piece of knowledge in order to know until when the information is valid for the multimodal status. Currently, several IMMIView sub-components feed this module as independent specialists:

– The selection component identifies when objects are selected by the lasso metaphor.

– The observer component is responsible for informing when a user is pointing over an object, by analyzing laser inputs continuously and notifying when users are entering or leaving shapes or annotations.

– Gestures such as the triangular shape are recognized by the CALI (Fonseca et al (2002)) shape recognizer.

– The application widget manager notifies which interface objects or menus are available in the user workspace.

– The speech recognizer feeds the knowledge base with recognized speech commands as simple textual tokens.

When input data is received, it is automatically processed by our inference system, which tries to fulfill the preconditions of the existing rules using the status of the multimodal system. If, at a given time, all the preconditions of a rule are available, the rule is applied, executing its actions. If, before the end of its validity period, a piece of input data is not used by any rule, it is discarded by the system. This solution allows interaction to be defined in an abstract way, independent of the type of modality, and also avoids having to deal with the sequential ordering of preconditions within a rule. By applying this set of rules, we are able to define the behavior of the system, combining several modalities in a flexible and extensible way.
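To make the mechanism concrete, the sketch below implements a stripped-down version of this token/rule matching with time-based expiry. The rule encodes a simplified form of Rule 3 above; the class names, the 2-second validity window and the action callback are assumptions for this illustration.

```python
# Stripped-down illustration of the rule-based fusion: tokens expire after a
# validity window, and a rule fires once all of its preconditions are present.
# The rule encodes a simplified form of Rule 3; class names and the 2 s
# validity window are assumptions for this sketch.
import time

class Token:
    def __init__(self, name, attributes=None, validity=2.0):
        self.name = name
        self.attributes = attributes or {}
        self.expires_at = time.time() + validity

class FusionBox:
    def __init__(self, rules):
        self.rules = rules          # list of (precondition names, action callable)
        self.tokens = []

    def add_token(self, token):
        self.tokens.append(token)
        self._infer()

    def _infer(self):
        now = time.time()
        self.tokens = [t for t in self.tokens if t.expires_at > now]  # discard expired
        for names, action in self.rules:
            matched = []
            for name in names:
                tok = next((t for t in self.tokens
                            if t.name == name and t not in matched), None)
                if tok is None:
                    break
                matched.append(tok)
            else:                               # all preconditions fulfilled
                action(matched)
                for t in matched:               # consume the matched tokens
                    self.tokens.remove(t)

def select_object(tokens):
    target = tokens[1].attributes.get("object_id")  # from the pointing token
    print("select object", target, "and open its context menu")

if __name__ == "__main__":
    box = FusionBox([(["selectThis", "pointingOverObject"], select_object)])
    box.add_token(Token("pointingOverObject", {"object_id": "shape_7"}))
    box.add_token(Token("selectThis"))   # voice command arrives -> the rule fires
```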

5 User Tests

IMMIView was tested by twenty-two users during three different rounds of user tests within the European IMPROVE project. The three rounds were performed to evaluate both the interaction techniques and the mix of modalities proposed by our system. During each round, users were exposed to multimodal interaction using speech recognition, laser input and body tracking. Each user test session included three single-user tasks and one multi-user (collaborative) task. Single-user tasks were designed with different degrees of difficulty: the first one comprised three easy steps related to navigation and to creating and editing annotations. The second one included nine medium-difficulty subtasks covering navigation and creating, selecting and manipulating notes and 3D objects. Finally, the third task added more specific steps such as geometric transformations of 3D objects, including scaling, rotation and translation. For the collaborative task, we asked two users to execute the first task at the same time. The tests were conducted using a 4 x 2.25 meter tiled display with 4096x2304 pixel resolution. The input modalities included voice recognition with wireless microphones, stroke interaction using laser pointers and body gesture tracking using a marker-based optical infrared system. The data sources of the user tests were a usability questionnaire given to each user based on the standardized ISONORM 9241 Part 10, as well as user comments, observation notes and video analysis.

5.1 Evaluation based on ISONORM 9241 (part 10)

The standardized ISONORM 9241 - Part 10 (Ergonomic requirements for office work with visual display terminals) provides requirements and recommendations related to the hardware, software and environment attributes that contribute to usability, and the ergonomic principles underlying them. Through the use of questionnaires, the goal was to obtain feedback on the users' experience related to the following seven principles of this standard: (1) Suitability for the Task, (2) Self Descriptiveness, (3) Controllability, (4) Conformity with User Expectations, (5) Error Tolerance, (6) Suitability for Individualization and (7) Suitability for Learning. Users were asked to rate each question from 1 (the least favorable case) to 6 (the most favorable case). The results are presented in Table 1.

Globally, the average results for all seven principles are above the mean value (3.0). So, in general, the system seemed suited to the needs of users in the three main tasks (navigation, annotation and 3D editing). The "Suitability for the task" results (average 4.49) show that users found the system in general easy, fast and natural with respect to the use of widgets, the different modalities (strokes, gestures and speech) and the different input devices (laser pointer and tracker).


                                     Average   Std. Deviation
Suitability for the task               4.49        0.79
Self Descriptiveness                   4.56        0.63
Controllability                        4.26        0.68
Conformity with User Expectations      4.27        0.83
Error Tolerance                        3.32        0.89
Suitability for Individualization      4.50        0.39
Suitability for Learning               4.62        0.51

Table 1 Results of the ISONORM questionnaire on the scale 1 to 6.

The "Self Descriptiveness" results (average 4.56) indicate a moderate level of user satisfaction: users appeared able to differentiate and understand system functionalities through the codes and metaphors employed. The "Controllability" results (average 4.26) revealed that users could control the input devices, modalities, widgets and other objects relatively well. However, the accuracy of the speech recognizer, tracker and laser pointer, and some of the interaction metaphors, could be improved. The "Conformity with User Expectations" results (average 4.27) are positive, but some issues remain. Users found the correct usage of speech commands very important for invoking commands, but in their opinion this method was too error prone. Users also found that they were unable to express themselves in the notes given the space provided. The "Error Tolerance" results (average 3.32), although positive, presented the lowest scores of all the ISONORM principles. Some actions performed by users are somewhat difficult to revert. Regarding the navigation task, it is not easy for the user to correct some actions in flying mode using gestures. Using gestures, geometric transformations and the creation and placement of notes in the correct position are difficult to achieve. The "Suitability for Individualization" results (average 4.50) revealed that the system is relatively easy to adapt to the user's needs, thanks to the availability of different modes (strokes, speech and gestures) to perform the same tasks; for example, users found it easy to customize their workspace with widgets. The "Suitability for Learning" results (average 4.62) are positive, but some problems were found: users had some difficulty remembering new speech commands or which arm gestures had to be executed for a given geometric transformation or navigation action.

5.2 Users Tasks Performance

Regarding the functionality proposed by our system, quantitative data was collected from the tests on the navigation and annotation tasks to identify the level of performance of each multimodal metaphor. To perform the navigation task, the user could use the following modes: Menus and Strokes/Laser, Menus and Speech, and Gestures and Speech. Table 2 reports the error rate and the time per usage for these three combinations of multimodal modes.

From these results, we can conclude that users made more errors on the navigation task when using the "Menus and Speech" mode and spent more time using the "Gestures and Speech" mode. The first observation is due to failures of the speech recognition system in recognizing certain commands: some users had Scottish regional accents that were not completely compatible with the configured American-English grammar. The second observation is due to the fact that users spent more time and cognitive effort adjusting their position and orientation when using gestures in flying mode. The results revealed that users made fewer errors using the "Menus and Strokes/Laser" mode, which is the interaction approach most similar to traditional desktop applications. On the other hand, using speech commands over the opened menus took less time to perform the tasks because, when the speech recognizer works well, the time to activate a command is very short.

To perform the annotation manipulation tasks over the note objects, users could invoke multimodal commands like "select this", "delete this" or "hide this". This could be done using the following modes: Menus and Strokes/Laser or Pointing and Speech. Table 3 reports the error rate and the time spent by users for each multimodal combination.

Although the "Pointing and Speech" mode was chosen by the great majority of the users, its error rate and time spent are higher than with the more traditional "Menus and Strokes" mode. Failures of the speech recognition system and the lack of accuracy when picking note objects that are far away (or small) are the reasons for the lower performance of the "Pointing and Speech" mode. It is important to highlight that users did not make any errors using the "Menus and Strokes" mode.
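For context, the per-mode figures reported in Tables 2 and 3 can be derived from logged interaction attempts roughly as in the following sketch. The log format and values are invented for illustration and do not come from the actual test data.

```python
from collections import defaultdict
from datetime import timedelta

# Hypothetical interaction log: (mode, errors during the attempt, duration).
log = [
    ("Menus/Laser",     0, timedelta(minutes=1, seconds=42)),
    ("Menus/Speech",    2, timedelta(seconds=55)),
    ("Gestures/Speech", 1, timedelta(minutes=1, seconds=48)),
    ("Menus/Laser",     1, timedelta(minutes=1, seconds=33)),
]

errors = defaultdict(int)
durations = defaultdict(timedelta)
usages = defaultdict(int)

for mode, error_count, duration in log:
    errors[mode] += error_count
    durations[mode] += duration
    usages[mode] += 1

for mode in usages:
    print(f"{mode}: {errors[mode] / usages[mode]:.2f} errors/usage, "
          f"{durations[mode] / usages[mode]} time/usage")
```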

5.3 Multi-modal Preferences for users tasks

IMMIView allows users to interact with the system using different modalities, so that each user can perform the same task using the combination of modalities best fitted to their preferences or skills. Also, if a user had difficulties with a particular action, he could promptly switch to another modality with which he felt more comfortable.

Table 2 Navigation performance data by mode.

                   Error/Usage   Time/Usage (mm:ss)
Menus/Laser        0.27          1:37
Menus/Speech       1.80          0:59
Gestures/Speech    0.89          1:50

Table 3 Annotation performance data by mode.

                   Error/Usage   Time/Usage (mm:ss)
Menus/Laser        0.00          0:30
Pointing/Speech    0.84          0:51


The results presented below were gathered from three different tasks: navigation, note manipulation and geometric transformations of 3D objects. The combinations of modalities used were the following: laser stroke interaction plus menus, speech interaction plus menus, and arm gestures or laser pointing combined with speech commands.

5.3.1 User Multimodal preferences for navigation

Regarding navigation, we collected data on the first, second and third preferred modality combinations used to perform seven tasks of varying degrees of difficulty. These included (a) Laser and Menus (using laser strokes to open menus and to activate menu options), (b) Speech and Menus (invoking commands using speech and context menus) and (c) Arm Gestures and Speech (performing the navigation fly task using arm gestures complemented with speech commands to change navigation parameters such as velocity). The results are presented in Table 4.

Table 4 Percentage of user navigation choices by modality combination.

                   1st Choice   2nd Choice   3rd Choice
Menus/Laser        32.73%       54.55%       100.00%
Speech/Menus       27.27%       27.27%       0.00%
Gestures/Speech    40.00%       18.18%       0.00%
% from previous    -            40.00%       4.55%

From these results, we conclude that users employed different combinations of modalities to perform the same kind of tasks. The first choice is balanced among the three combinations, which illustrates the system's flexibility in accommodating user preferences. When users chose a second combination of modalities (40% of the number of first choices), a slight majority (54.55%) opted for the laser pointer plus menus combination. This solution is the most similar to the desktop environment, so users might feel more comfortable interacting with it and thus fall back to it when having difficulties. Only one user elected to combine three modalities to perform a particular task.

5.3.2 User Multi-modal preferences for annotation

Regarding annotations, we collected data on the users' first and second choices of modality combination used to perform five tasks of different degrees of difficulty. The combinations of modalities are: Laser and Menus (using laser strokes to open note menus and to activate their options) and Pointing and Speech (pointing at a note object and invoking speech commands to change the state of the note). The commands "select this", "delete this" and "hide this" are examples of speech commands. The results are presented in Table 5.

Table 5 Percentage of user note manipulation choices by modality combination.

                   1st Choice   2nd Choice
Laser/Menus        0.00%        100.00%
Pointing/Speech    100.00%      0.00%
% from previous    -            24.49%

The "Pointing and Speech" combination was picked by all users as the first choice to perform annotation tasks. This interaction modality seemed very natural to users and reflects a similar usage in the "real" world. The second choice (24.49% of first choices) was Laser plus Menus. The reason for this second choice was related to the difficulty the speech recognizer had in handling the strong accents of some users, which caused them to fall back on more familiar, or less troublesome, modalities.

5.3.3 User Multimodal preferences for geometry manipulation

For geometric transformations of 3D objects (translation, rotation and scale over the three axes), we collected data on the first and second choices of modality combinations used to perform the three proposed tasks. These included (a) Laser plus Menus (using laser strokes to open geometric transformation menus and to activate their options/operators) and (b) Speech plus Arm Gestures, whereby geometric transformation commands are composed of two parts: first, the kind of transformation (rotation or scale) is invoked using a voice command; second, the user quantifies and controls the transformation using arm gestures. The results are presented in Table 6.

Table 6 Percentage of user geometric transformation choices by modality combination.

                       1st Choice   2nd Choice
Laser/Menus            11.54%       100.00%
Speech/Arm Gestures    88.46%       0.00%
% from previous        -            30.77%

The "Speech/Arm Gestures" combination was picked first by the majority of users (88.46%) to perform geometric transformations on 3D objects. Users classified this kind of interaction as natural and direct. One reason might be that users could tune the transformation better using gestures than using stroke interaction and menus.
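As a rough illustration of how such a two-part command could be composed, the sketch below maps a spoken transformation type to a gesture-driven amount. The mapping functions, sensitivities and field names are assumptions made for illustration only, not IMMIView's actual gesture processing.

```python
# Illustrative mapping from a tracked arm gesture to a transformation amount.
# The speech command fixes the kind of transformation; the gesture quantifies it.

def gesture_to_rotation(hand_dx: float, sensitivity_deg_per_m: float = 90.0) -> float:
    """Map horizontal hand displacement (meters) to a rotation angle (degrees)."""
    return hand_dx * sensitivity_deg_per_m

def gesture_to_scale(hands_distance: float, rest_distance: float = 0.5) -> float:
    """Map the distance between both hands to a uniform scale factor."""
    return max(0.1, hands_distance / rest_distance)

def apply_transformation(kind: str, gesture_sample: dict) -> dict:
    """Compose a transformation command from the spoken kind and a gesture sample."""
    if kind == "rotate":
        return {"op": "rotate_y", "angle_deg": gesture_to_rotation(gesture_sample["hand_dx"])}
    if kind == "scale":
        return {"op": "scale", "factor": gesture_to_scale(gesture_sample["hands_distance"])}
    raise ValueError(f"unsupported transformation: {kind}")

# Example: the user says "rotate", then moves a hand 0.25 m to the right.
print(apply_transformation("rotate", {"hand_dx": 0.25}))  # {'op': 'rotate_y', 'angle_deg': 22.5}
```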

To conclude, IMMIView is a system with strong multimodal and multi-user components that integrates scalable and redundant technologies. It is clear that having different modalities and interaction techniques benefits the users.


Having redundant modalities to access the same functionality is important to better fit and adapt the interaction to the users' wishes. The use of speech commands while pointing at a particular object was considered a powerful interactive resource. In the majority of cases, laser plus menus was the second choice. Navigation using arm gestures and speech commands was also appreciated by many users, who noted the flexibility and versatility of the multimodal combinations available.

5.4 Deployments

During this research project, other setups were also deployed, although no user tests were executed on them; they served as technical achievements prior to the final user tests. Early on, the communication backbone was tested with two IMPROVE system instances, a large scale display in Portugal and another in Germany. This first test allowed us to verify that the system could support distributed collaborative actions and that annotations passed from one instance to the other without notable delay. Tests with mixed setups, involving different artifacts, were also deployed. A Tablet PC combined with head mounted displays allowed users to navigate using the head mounted display and interact using the Tablet PC. Moreover, co-located collaboration was also tested with one user running the system on a Tablet PC and another user interacting with the large scale display (Figure 13). Finally, to test the mobility of our system - one of the user requirements - the system was deployed on an architectural site in Glasgow, using a simple large scale display (one projector and a rented projection screen). Here, the speech and laser modalities were also available for interaction, illustrating the versatility of our proposed design environment.
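As a minimal sketch of the kind of event exchange performed when synchronizing instances, the snippet below serializes an annotation event and sends it to peer instances. The message schema, field names, addresses and raw UDP transport are assumptions for illustration; they are not the actual backbone protocol, which relies on message-oriented middleware.

```python
import json
import socket
import time
import uuid

# Hypothetical wire format for synchronizing an annotation between instances.
def make_annotation_event(text: str, position: tuple, author: str) -> bytes:
    event = {
        "type": "annotation.created",
        "id": str(uuid.uuid4()),
        "author": author,
        "text": text,
        "position": position,          # (x, y, z) in scene coordinates
        "timestamp": time.time(),
    }
    return json.dumps(event).encode("utf-8")

def publish(event: bytes, peers: list) -> None:
    """Send the serialized event to every known peer instance over UDP."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for host, port in peers:
            sock.sendto(event, (host, port))

# Example: broadcast a note to a remote instance (the address is a placeholder).
publish(make_annotation_event("Check this facade", (1.2, 0.0, 3.4), "architect-1"),
        peers=[("192.0.2.10", 9000)])
```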

6 Conclusion

This work presented IMMIView, an interactive system relying on multimodal and multi-user interaction to support collaborative design review of 3D models. Our task analysis identified user requirements for architectural design review and the functionalities that should be implemented to enable users to carry out their main tasks: navigation, annotation, 3D editing and collaborative review. To integrate these functionalities, we proposed a system architecture based on a set of modules coordinated around an event bus where functional components, interaction experts and visualization modules coexist. Many experiments and decisions were made in order to implement a flexible and modular system that offers 3D visualization, support for different kinds of input devices, interaction using different modalities (pointing, gestures, speech and strokes) and a graphical user interface.

Fig. 13 Deployment setups. Top: single projector display with multiple users and mobile device support. Bottom: large scale display and Tablet PC collaboration.

Several issues were addressed in the design phase of the system to permit a coherent and consistent data flow between its different components.

IMMIView offers several alternative interfaces so that users can optimize their interaction according to their goals and tasks. A novel stroke interface was adapted to work within the limitations of laser pointers on large scale displays, presenting an innovative architectural environment similar to an augmented whiteboard. In addition, both arm-gesture-based interaction and speech commands were available, giving users control over their actions and offering more natural or productive metaphors for their tasks. To manage simultaneous multimodal information, an abstract knowledge representation and an inference mechanism were designed to handle multimodal action ambiguities, even when several users are interacting at the same time. This mechanism avoids constraining user expressivity when several users interact through different modalities at the same time.

To assess our approach and the tasks supported by our prototype, three different kinds of user tests were performed.


Based on ISONORM 9241 - Part 10, users' opinions were collected about different principles of interactive design. According to the results, users expressed positive feedback for all the principles. The "Error Tolerance" principle presented the lowest results and will be the subject of future work. The performance results allowed us to conclude that combining our GUI with the laser pointer is the technique with which users made the fewest errors in both the navigation and annotation tasks. On the other hand, users made more errors when using speech commands to activate the GUI; however, for navigation tasks this option was the fastest. The analysis of the combinations of modalities showed that users' first choices for the navigation task are balanced among the three different modes (menus/laser, speech/menus and gestures/speech). This revealed that users feel it is natural to use the different modalities offered by our system and that they choose the alternative that best fits their goals. Regarding our annotation system, all users preferred the pointing and speech modalities over the laser and menus combination; usually the second option was only used if the first failed. For the geometric transformation task, the great majority of users preferred the speech and arm gestures combination over the laser and menus combination. These results show that the system is flexible with respect to user preferences and goals and suitable for the architectural requirements.

For future work, we would like to increase the information available to the knowledge base. This would help us to further disambiguate multimodal commands using knowledge of which user is doing what, and would improve multi-user support in IMMIView. We are also collecting data to better identify which actions and modalities users employ in the different phases of their tasks. By analyzing usage patterns, the system could predict user actions and improve their performance.

Acknowledgements Ricardo Jota was supported by the Portuguese Foundation for Science and Technology, grant reference SFRH/BD/17574/2004. Bruno Araujo was supported by the Portuguese Foundation for Science and Technology, grant reference SFRH/BD/31020/2006. We would like to thank Jose Pedro Dias for his work on IMMIView.
