


IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 10, NO. 4, JUNE 2008 585

A Vision-Based Augmented-Reality System for Multiuser Collaborative Environments

Yen-Hsu Chen, Tsorng-Lin Chia, Senior Member, IEEE, Yeuan-Kuen Lee, Shih-Yu Huang, and Ran-Zan Wang

Abstract—This work presents a novel vision-based augmented-reality system for applications in multiuser collaborative environments. The kernel technology of this vision-based system locates the cameras that are used to represent and simulate the positions of multiple viewers. Camera calibration based on computer vision is employed during the camera-locating process. Applications in multiuser collaborative environments allow viewers to observe from various positions and in numerous directions; however, traditional calibration approaches are not well suited to these cases. A novel calibration pattern based on pseudo-random arrays is designed for multiuser collaborative applications. The pattern has a simple and regular structure, allows features to be extracted easily, achieves robust recognition using only local information, and does not limit viewer positions and directions. Experimental results indicate that the proposed system provides an effective platform for applications in multiuser collaborative environments.

Index Terms—Augmented reality, camera calibration, multiuser collaborative environment, pseudo-random arrays.

I. INTRODUCTION

AUGMENTED REALITY (AR) technology is a human–machine interface technology that displays information in a seamless fashion, combining real scene images and artifact information to increase the understanding of viewed objects. Therefore, AR has been widely used in various applications, such as medicine, guidance, design, education, and entertainment. Currently, more complex applications are deployed in 3-D multiuser collaborative environments, which require a seamless and consistent display for multiple users.

Three major methods are used for combining and displaying real scene images and artifact information. 1) A head-worn display captures real scene images via a camera and then combines the information using a computer. The resulting image is displayed on a small screen mounted on a helmet or eyeglasses, and the user views the combined image through a semi-opaque mirror. This viewing method is called video see-through [1].

Manuscript received August 23, 2007; revised December 24, 2007. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Zhengyou Zhang.

Y.-H. Chen, Y.-K. Lee, and S.-Y. Huang are with the Department of Computer Science and Information Engineering, Ming Chuan University, Taoyuan 333, Taiwan, R.O.C. (e-mail: [email protected]; [email protected]; [email protected]).

T.-L. Chia is with the Department of Computer and Communication Engineering, Ming Chuan University, Taoyuan 333, Taiwan, R.O.C. (e-mail: [email protected]).

R.-Z. Wang is with the Department of Computer Engineering and Science, Yuan Ze University, Chungli 320, Taiwan, R.O.C. (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TMM.2008.921741

Another viewing method displays artifact information on a small screen and then projects it onto a semi-opaque lens, where the real scene is combined with the artifact information. This viewing type is called optical see-through. 2) A projection display projects the artifact information directly onto a real object via a projector; a viewer does not need a special device to watch the real scene. 3) A handheld display captures a real scene image using a camera mounted on a handheld device, and then combines and displays the information on the handheld device screen. These systems consist of the following components: camera, projector, display screen, and/or lenses/mirrors. Among these components, the projector and lens are the most expensive. Therefore, a handheld device with a camera and screen is the most attractive option in terms of cost and universality.

From a technological viewpoint, the kernel of AR technology is how to properly align the coordinate systems of the real and virtual worlds. For instance, how can one know the position of a viewer holding the device in the real world? That is, where is the camera presenting the real world to the viewer? Answering this question is the aim of camera calibration in computer vision. Camera calibration estimates intrinsic and extrinsic camera parameters using the corresponding relationships of calibration objects between the 3-D real world and the 2-D image space. Traditional solutions include grid patterns [2], circles [3], [4], color lines [5], time-varying multiple patterns, fiducial markers [6]–[8], and encoded-spatial patterns [9]. The five benchmark requirements used to examine the effectiveness of a calibration object in a 3-D multiuser collaborative environment are as follows:

B1: suitable for low-resolution capture devices;
B2: can be extracted easily from captured images;
B3: only needs local and partial information of the calibration object during the recognition process;
B4: each viewing position and direction relative to the calibration object can be determined uniquely;
B5: supports estimation of the viewing position and direction with sufficient accuracy.

Generally, a camera mounted on a handheld device (e.g., a webcam, PDA, or Tablet PC) has low spatial and color resolution [10]. Hence, the calibration pattern generated by the projection of the calibration object, i.e., the grid pattern, must be sufficiently large and regular. The calibration pattern should also use few colors, to suit the low color resolution of the camera; a binary pattern is a fine choice for applications using handheld devices. Moreover, the calibration pattern should be composed of primitive geometric features, such as lines, rectangles, circles, characters [11], grids [12]–[14], and color stripes, which are suited to low-resolution capture devices. The calibration pattern passes through image preprocessing to extract the pattern's features, so the features in an image must be easy to extract.


TABLE I: BRIEF SUMMARY OF EXISTING CALIBRATION PATTERNS FOR 3-D MULTIUSER COLLABORATIVE ENVIRONMENTS

Using surface patterns for camera calibration provides more information than using line patterns, so higher accuracy of the camera parameters can be obtained. Conversely, a line is the simplest feature, and its extraction offers good accuracy, correctness, and execution time. Lines also have projective invariants that can withstand the distortion introduced by perspective projection. To take advantage of both kinds of patterns, a grid pattern formed by straight lines is introduced: the lines can be easily extracted, and the locations of the grid are accurately determined from the intersections of the extracted lines. The length and width of the grid offer the basis for estimating the camera's location. Therefore, a squared grid pattern composed of lines is the best choice for a fast feature extraction method.

For the limited viewing range of the camera, a small viewed area on the calibration pattern must provide meaningful information. Relatively small marks, such as circles, fiducial markers (ARToolKit), rectangles, text, and grids, can be recognized when their full structure is visible; however, they do not work in partial or occluded situations, even when more patterns are placed in the working area. Fiala and Shu [15] employ an array of 19 × 9 ARTag markers and extract accurate camera parameters. However, this calibration pattern is not seamless, as the camera cannot capture a full marker under some viewing conditions in multiuser collaborative environments. Their approach also needs more than ten images to calibrate the camera, which is not suitable for real-time applications. Conversely, larger marks, such as grid patterns [2], color lines, time-varying multiple patterns, and encoded-spatial patterns, can be used under partial viewing. However, ambiguity still exists, reducing the correctness of recognition. Matsunaga and Kanatani [16] described an optimal grid pattern based on the cross ratio. Their pattern is seamless, so in theory it can be recognized under partial viewing. However, the cross ratio takes real values, and the cross ratio of a calibration object is sensitive to any numerical change and is influenced by noise. When the width of the grid is perturbed, the matching error rises because of this sensitivity. Also, a real-valued cross ratio cannot be used directly as an index for database searching. Although a tolerance range can be set for the real values, their numerical distribution is nonuniform, so it is difficult to determine an accommodating tolerance for matching. A better method is to utilize a pseudo-random array (PRA) as the calibration pattern, which offers a well-defined theory, a simple pattern, and a solution for partial viewing. Table I presents a brief summary of existing calibration patterns for 3-D multiuser collaborative environments. In general, no existing method fulfills all five requirements. For example, methods based on encoded-spatial patterns have difficulty extracting the spatial codes, due to the small pattern required. Moreover, most methods using fiducial markers do not work when a marker's projection is partial and the working table is big, unless a large number of markers are distributed throughout the whole working table. The goal of the proposed method is to design a calibration pattern that fulfills the five assessment requirements for multiuser collaborative environments.

This work presents a novel calibration pattern based on a modified pseudo-random array (MPRA). The pattern is a 2-D binary-like pattern composed of two basic squared grids. Therefore, the features extracted from the proposed MPRA pattern are robust to noise and suitable for use with a low-resolution capture device. This new calibration pattern not only determines the position of the viewed partial pattern within the entire calibration pattern, it also avoids the problems associated with traditional PRAs applied in 3-D multiuser collaborative environments. Additionally, a perspective projection invariant is adopted to compute accurate camera locations, thereby generating a proper image combining the real and virtual scenes.

The remainder of this paper is organized as follows. The proposed modified PRA is described in Section II.


TABLE II: PRIMITIVE POLYNOMIALS OF DEGREE m ≤ 20 [18].

Fig. 1. Feedback shift register of degree m = 4.

The feature extraction and recognition procedure using the MPRA is introduced in Section III. The vision-based locating techniques are described in Section IV. Section V presents experimental results, and Section VI presents conclusions.

II. MODIFIED PSEUDO-RANDOM ARRAYS

A. Pseudo-Random Sequences

Pseudo-random sequences (PRSs) are binary sequences also called pseudo-noise sequences, maximal-length shift-register sequences, and m-sequences [17]. A primitive polynomial h(x) of degree m is used to construct a pseudo-random sequence of length n = 2^m - 1. Table II presents primitive polynomials of degree m up to 20 [18].

A primitive polynomial can be realized as a feedback shift register that provides a mechanism for generating the pseudo-random sequence. Fig. 1 illustrates a feedback shift register defined by a primitive polynomial of degree m = 4. The feedback shift register consists of an exclusive-OR gate and four small boxes representing memory elements. At each time unit the register shifts its contents by one position, and the vacated element receives the exclusive-OR of the elements selected by the nonzero terms of h(x).

Initially, each of the memory elements in the feedback shift register of degree m contains a 0 or 1, with the exception of the all-zero initial state, which would produce the all-zero sequence and is therefore invalid. The maximum number of possible initial states is thus 2^m - 1. After 2^m - 1 time units, the feedback shift register produces a pseudo-random sequence of length n = 2^m - 1. Table III shows the 15 pseudo-random sequences produced by h(x) = x^4 + x + 1 under all 15 possible initial states.
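As a concrete illustration, the following minimal Python sketch generates one period of the m = 4 sequence, assuming the primitive polynomial h(x) = x^4 + x + 1 from Table III and the linear recurrence a[k+4] = a[k+1] XOR a[k] as one standard realization of the feedback shift register (the exact register layout of Fig. 1 is not recoverable from the text):

def generate_prs(seed=(0, 0, 0, 1), length=15):
    """Return one period (2^m - 1 = 15 bits) of the m-sequence for h(x) = x^4 + x + 1."""
    a = list(seed)                      # initial contents of the four memory elements
    assert any(a), "the all-zero initial state is invalid"
    while len(a) < length:
        a.append(a[-3] ^ a[-4])         # a[k+4] = a[k+1] XOR a[k]
    return a[:length]

print("".join(map(str, generate_prs())))  # -> 000100110101111

With the seed (0, 0, 0, 1), this reproduces the sequence 000100110101111 used as the running example later in this section.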

TABLE III: USING h(x) = x^4 + x + 1 TO PRODUCE DIFFERENT PSEUDO-RANDOM SEQUENCES BASED ON DIFFERENT INITIAL STATES

Pseudo-random sequences have the following window property [17]: each binary m-tuple produced by a window of width m sliding along a pseudo-random sequence is unique. Using this property, one can determine the corresponding location of a binary m-tuple within the pseudo-random sequence whenever m consecutive bits are observed.
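The window property can be exercised directly: building a lookup table from every cyclic m-tuple to its starting index turns m observed bits into a position. The sketch below uses the paper's length-15 example sequence; the observed tuple is an arbitrary illustration.

prs = [int(b) for b in "000100110101111"]
m = 4

# Build a lookup table from each cyclic m-tuple to its starting index.
window_index = {}
for k in range(len(prs)):
    window = tuple(prs[(k + i) % len(prs)] for i in range(m))
    assert window not in window_index, "window property violated"
    window_index[window] = k

observed = (0, 1, 1, 0)            # four consecutive bits read somewhere in the sequence
print(window_index[observed])      # -> 5, the position where 0110 starts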

B. Pseudo-Random Arrays

For applications in 2-D space, a pseudo-random array (PRA) [19], [20] can be constructed by folding a pseudo-random sequence of length n = 2^m - 1 = n1 × n2, where n1 and n2 are relatively prime and both greater than 1. Consider a PRS of length n, a = (a_0, a_1, ..., a_{n-1}). The PRA B of size n1 × n2 is defined as

B(i, j) = a_k,  where i = k mod n1, j = k mod n2  (k = 0, 1, ..., n - 1)   (1)

that is, the sequence is written along the diagonals of the array. For example, when m = 4, n1 = 3, n2 = 5, and the pseudo-random sequence is 000100110101111, the corresponding PRA is

B = 0 1 1 1 1
    0 0 1 1 0
    0 1 0 0 1   (2)
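The folding can be checked with a few lines of Python. The indexing i = k mod n1, j = k mod n2 follows the reconstruction of (1) above (the standard diagonal construction of [20]) and reproduces the 3 × 5 array shown in (2):

prs = [int(b) for b in "000100110101111"]
n1, n2 = 3, 5                          # relatively prime factors of 15

B = [[None] * n2 for _ in range(n1)]
for k, bit in enumerate(prs):
    B[k % n1][k % n2] = bit            # write the sequence along the diagonal

for row in B:
    print("".join(map(str, row)))      # 01111 / 00110 / 01001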

Pseudo-random arrays also have a window property [20]: when a window of fixed size is slid over a pseudo-random array to produce a rectangular bit pattern, each such bit pattern is unique.

For the purpose of locating a camera, the PRA has the advantages of simplicity and fast processing. However, the window property of PRAs no longer holds when the bit pattern can be observed from different viewpoints and directions. For example, the left-top 2 × 2 bit pattern of the array in (2) is the same as its left-bottom 2 × 2 bit pattern after a counterclockwise rotation of 90°. Therefore the traditional PRA cannot be used in collaborative environments that allow arbitrary observation locations. The next section presents a new method to overcome this problem.

C. Proposed PRA

This section first presents the proposed scheme, called the modified PRA (MPRA). Then the window property of the MPRA is introduced. Finally, how to locate the viewer (or camera) using the window property of the MPRA is described.


Consider a pseudo-random sequence s with degree m and length n = 2^m - 1. An MPRA M of size 2n × n can be constructed using the following equation:

M = [ s
      c_1
      s
      c_2
      ...
      s
      c_n ]   (3)

where c_i is the result of circularly shifting the one's complement of s to the right i times. All even rows of M, M_0, M_2, ..., are the same as the pseudo-random sequence s, and the i-th odd row of M is c_i.
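A short sketch of the construction in (3) follows. The shift convention (the i-th odd row is the one's complement of s shifted right i times) is an assumption where the text is garbled, but it matches the 30 × 15 size reported in Fig. 2.

def circular_shift_right(bits, t):
    t %= len(bits)
    return bits[-t:] + bits[:-t]

def build_mpra(s):
    n = len(s)
    comp = [1 - b for b in s]                        # one's complement of s
    rows = []
    for i in range(1, n + 1):
        rows.append(list(s))                         # even row: the PRS itself
        rows.append(circular_shift_right(comp, i))   # odd row: shifted complement
    return rows                                      # 2n rows of n bits each

s = [int(b) for b in "000100110101111"]
mpra = build_mpra(s)
print(len(mpra), "x", len(mpra[0]))                  # -> 30 x 15, as in Fig. 2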

The MPRA has two major properties that are used in this work.

a) Window property: each squared bit pattern produced by sliding a window of fixed size over the MPRA with degree m is unique. This property was verified by exhaustive enumeration on a PC.

b) Complement property: the one's complement of the PRS s retains the unique-window property, except that the all-ones m-tuple 11...1 is replaced by the all-zeros m-tuple 00...0. The complemented and shifted rows can therefore be applied to determine the row number in the MPRA.

First, this work extracts a selected squared bit pattern from the MPRA and checks whether the bit pattern contains identical bit strings in one of the two directions. If identical bit strings exist, their direction is the horizontal one, because they correspond to the even rows of the MPRA. The horizontal location of the selected bit pattern can then easily be found by the window property of the PRS. Using the shifted odd rows of the selected bit pattern, the vertical position is located from the number of shifts. Therefore, the location of the selected bit pattern in both the horizontal and vertical directions is identified. This solves the problem of traditional PRAs, which do not allow a viewer to observe the bit pattern from arbitrary locations and viewing directions.
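The following sketch decodes the location of a viewed sub-block using the two properties, under the same assumptions as the construction sketch above, plus the additional assumptions that the window is (m + 1) × (m + 1) and that the block has already been rotated into the pattern's axis alignment; handling of rotated views is omitted.

def build_window_index(s, m):
    """Map every cyclic m-tuple of s to its starting position (window property)."""
    n = len(s)
    return {tuple(s[(k + i) % n] for i in range(m)): k for k in range(n)}

def locate_block(block, s, m):
    """block: (m+1) x (m+1) bits cut from the MPRA, axis-aligned.
    Returns (row, col) of the block's top-left corner in the MPRA."""
    n = len(s)
    index = build_window_index(s, m)
    # Even MPRA rows all equal s, so duplicated rows inside the block mark them.
    seen = {}
    for off, r in enumerate(block):
        seen.setdefault(tuple(r), []).append(off)
    even_offsets = max(seen.values(), key=len)
    odd_offsets = [o for o in range(len(block)) if o not in even_offsets]
    # Column: the first m bits of an even row form an m-window of s.
    col = index[tuple(block[even_offsets[0]][:m])]
    # Row: complementing an odd row recovers a shifted m-window of s.
    d = odd_offsets[0]
    p = index[tuple(1 - b for b in block[d][:m])]
    shift = (col - p - 1) % n + 1          # right shift of that odd row, in 1..n
    return (2 * shift - 1) - d, col        # odd row with shift i sits at index 2i - 1

# Rebuild the MPRA as in the previous sketch, cut out a 5 x 5 block, and decode it.
s = [int(b) for b in "000100110101111"]
comp = [1 - b for b in s]
mpra = []
for i in range(1, len(s) + 1):
    mpra.append(list(s))
    mpra.append(comp[-i:] + comp[:-i])     # one's complement shifted right i times
block = [r[6:11] for r in mpra[9:14]]
print(locate_block(block, s, m=4))         # -> (9, 6)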

D. Generating the Camera Calibration Pattern From MPRA

A straightforward method for generating the camera calibration pattern from the MPRA is to transform element 0 of the MPRA into a black square and element 1 into a white square. The world coordinate system describing the multiuser collaborative environment is defined with one axis along the row direction and the other along the column direction of the MPRA, and its origin is at the bottom-left corner of the MPRA. The MPRA pattern with degree m = 4 and size 2n × n = 30 × 15 is composed of black-and-white blocks [Fig. 2(a)].

Fig. 2. Recognizable MPRA pattern defined by the MPRA with degree m = 4 and size 2n × n = 30 × 15. (a) Original MPRA pattern. (b) Diamond MPRA pattern. (c) Base pattern. (d) Proposed MPRA pattern.

However, detecting the straight lines is difficult when continuous black or white blocks occur in the horizontal or vertical direction, and such runs may appear in the original MPRA pattern. To overcome this drawback, we utilize a black-and-white interlaced pattern as the base pattern of the MPRA [Fig. 2(c)]. The square shapes are then changed to diamond shapes in the original MPRA pattern [Fig. 2(b)]. Finally, the proposed MPRA pattern [Fig. 2(d)] is obtained by the exclusive-OR of the diamond MPRA pattern and the base pattern. Therefore, edge detection can easily obtain the boundary lines of the blocks, since the base pattern is black-and-white interlaced. In addition, the center colors of the blocks can be obtained accurately using the proposed MPRA pattern; the colored blocks are translated into the corresponding bit pattern because all square blocks on the base pattern have the same size.
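As a rough visual aid, the sketch below rasterizes the calibration image under one plausible reading of Fig. 2: an interlaced (checkerboard) base is XOR-ed, inside each cell, with a diamond-shaped mask carrying the MPRA bit. The cell size and the exact diamond placement are assumptions, not the paper's specification.

import numpy as np

def render_mpra_pattern(mpra, cell=20):
    rows, cols = len(mpra), len(mpra[0])
    img = np.zeros((rows * cell, cols * cell), dtype=np.uint8)
    yy, xx = np.mgrid[0:cell, 0:cell]
    c = (cell - 1) / 2.0
    diamond = (np.abs(yy - c) + np.abs(xx - c)) <= c        # rotated-square mask
    for r in range(rows):
        for q in range(cols):
            base = (r + q) % 2                              # interlaced base block
            block = np.full((cell, cell), base, dtype=np.uint8)
            block[diamond] ^= mpra[r][q]                    # XOR diamond with base
            img[r*cell:(r+1)*cell, q*cell:(q+1)*cell] = block
    return img * 255                                        # 0 = black, 255 = white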

III. FEATURE EXTRACTION AND RECOGNITION

To reduce the computational complexity and the effect of noise in the recognition step, this work applies several preprocessing steps. This section then describes the approach for recognizing MPRA images.

A. Image Binarization

An appropriate color model has advantages for image processing. In this work, RGB color images are first captured by the camera and then transformed into gray levels. The transformation defined by NTSC is

Y = 0.299R + 0.587G + 0.114B   (4)

where Y is the gray level of the image. A predefined threshold is then applied to transform the gray-level image into a binary image.
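A minimal sketch of this preprocessing step is given below; the threshold value of 128 is an assumption, since the paper only states that a predefined threshold is used.

import numpy as np

def binarize(rgb, threshold=128):
    """rgb: H x W x 3 array with values in [0, 255]; returns a 0/1 image."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    gray = 0.299 * r + 0.587 * g + 0.114 * b        # NTSC luminance, eq. (4)
    return (gray >= threshold).astype(np.uint8)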


Fig. 3. Boundary lines of black-and-white blocks of an MPRA pattern captured by the camera.

B. Edge Detection

To obtain the boundary lines of the black-and-white blocks in the MPRA pattern, this work applies the Laplacian operation to extract edges from the binary image of the MPRA pattern. The Laplacian operation [21], the simplest isotropic derivative operation, is rotation invariant and is therefore well suited to arbitrary camera viewing locations and directions in 3-D multiuser collaborative environments. This work keeps only the edges of the MPRA pattern to avoid measuring an excessive number of feature lines. Fig. 3 illustrates the boundary lines of the black-and-white blocks in the MPRA pattern captured by the camera, where the white pixels are the extracted edges.
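The edge step can be sketched with the common 4-neighbor discrete Laplacian kernel (the paper does not specify which Laplacian mask it uses); nonzero responses on the binary image mark block boundaries, as in Fig. 3.

import numpy as np
from scipy.ndimage import convolve

LAPLACIAN = np.array([[0,  1, 0],
                      [1, -4, 1],
                      [0,  1, 0]])

def laplacian_edges(binary_img):
    response = convolve(binary_img.astype(np.int32), LAPLACIAN, mode="nearest")
    return (response != 0).astype(np.uint8)         # white pixels = edges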

C. Line Detection

To obtain the relationship between the camera coordinate system and the world coordinate system, one usually finds features in the image and then calibrates the camera location with respect to the world coordinate system. In the proposed method, line detection is simplified into a parametric space transformation based on the Hough transform [22]. This work then uses the intersections of a set of feature lines to calculate the transform relationship between the two coordinate systems.

There are many boundary lines for the black-and-white blocks in a recognizable MPRA pattern. The boundary lines can be applied to track and recognize features for calibrating the virtual camera and generating the corresponding synthetic images. After the Laplacian operation is applied to detect edge points, an excessive number of edge points may be generated, making it difficult to link straight lines. This problem can be overcome by using the Hough transform, which transfers the (x, y) image space to the (ρ, θ) parameter space, where a line in the image space maps to a point. The transformation is defined as

ρ = x cos θ + y sin θ   (5)

Consider a feature line composed of numerous collinear edge points; the infinite set of lines passing through each edge point can be mapped to a curve in the parameter space by (5). These curves all intersect at the same point (ρ0, θ0) in the parameter space, and that point represents the feature line in the image space. Therefore, this work evaluates (5) for each edge point over a range of θ values to generate the corresponding ρ, and accumulates the number of votes at each (ρ, θ). Line detection can then be treated as finding the local maxima in the accumulator of the parameter space.
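A compact sketch of the voting scheme follows: every edge pixel votes for the (ρ, θ) cells of all lines through it, and the strongest accumulator cells are returned as candidate boundary lines. The resolution parameters and the simple peak picking are illustrative choices, not the paper's settings.

import numpy as np

def hough_lines(edge_img, n_theta=180, n_rho=400, n_peaks=10):
    ys, xs = np.nonzero(edge_img)
    h, w = edge_img.shape
    rho_max = np.hypot(h, w)
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    acc = np.zeros((n_rho, n_theta), dtype=np.int32)
    for x, y in zip(xs, ys):
        rhos = x * np.cos(thetas) + y * np.sin(thetas)         # eq. (5)
        bins = ((rhos + rho_max) / (2 * rho_max) * (n_rho - 1)).astype(int)
        acc[bins, np.arange(n_theta)] += 1                     # one vote per theta
    peaks = np.argsort(acc.ravel())[::-1][:n_peaks]            # strongest cells
    rho_idx, theta_idx = np.unravel_index(peaks, acc.shape)
    rhos = rho_idx / (n_rho - 1) * (2 * rho_max) - rho_max
    return list(zip(rhos, thetas[theta_idx]))                  # (rho, theta) per line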

D. Bit Assignment and Recognition

To find a block of size (m + 2) × (m + 2) from a captured MPRA pattern, this work applies the Hough transform to classify feature lines. The recognizable MPRA pattern consists of two feature line types. One type comprises the horizontal and vertical lines generated by the base pattern; the other type is the oblique lines produced by the diamond pattern. Because the former lines are longer than the latter, the former are obtained first by the Hough transform. The latter lines can be filtered out by the property that they intersect the horizontal and vertical lines of the blocks in the MPRA pattern. Therefore, this work only uses the horizontal and vertical lines to locate a block of size (m + 2) × (m + 2) from the extracted pattern.

This work then checks whether a "real" (m + 2) × (m + 2) block has been found from the two groups of parallel lines, meaning that no lines are missing from the extracted block. A process is required to verify whether lines are missing and to fill in the omitted straight lines. Because the cross ratio is invariant under perspective projection, this work determines whether a missing line exists among five known lines by calculating the cross ratio of the feature lines. We assume that at most three lines are missing among five adjoining feature lines in the extracted block. For one straight line of the other group, the five intersection points with the five known lines are considered, together with the intersections of the missing parallel lines. In Table IV, the situations in which lines are missing from the extracted MPRA pattern are characterized into four types, and 35 different situations are generated.

For example, the type that lacks one line gives four possible situations, depending on which position the missing line occupies among the five extracted feature lines. A table of cross ratios for all 35 situations is first constructed. Table V presents examples for four selected situations from Table IV. The cross ratios of the five feature lines are calculated to determine which situation applies. After identifying the situation, the cross ratio involving the missing lines can be derived; hence the position of a missing line is the solution of a linear equation, defined as follows.

a) For the first possible ordering of the three points:

(6a)


TABLE IV: FOUR TYPES OF SITUATIONS UNDER THREE LACKED LINES IN AT MOST FIVE KNOWN LINES

TABLE V: CROSS RATIOS USED TO VERIFY THE LACKED LINES FOR FOUR SITUATIONS IN TABLE IV

b) For the second possible ordering of the three points:

(6b)

c) For the third possible ordering of the three points:

(6c)

To fill in an omitted straight line, another straight line of the same group must be identified, and the position of the missing intersection on that line is calculated in the same way. The equation of the omitted straight line is then determined from the two computed points.
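For reference, the cross ratio used in this verification can be computed as below, using one common convention, (AC·BD)/(BC·AD), for four collinear points given by their scalar positions along the line; the paper's exact convention and its table of 35 tabulated values are not reproduced here.

def cross_ratio(a, b, c, d):
    """Cross ratio of four collinear points given by their 1-D coordinates."""
    return ((c - a) * (d - b)) / ((c - b) * (d - a))

print(cross_ratio(0, 1, 2, 3))   # -> 1.333...; equally spaced lines give 4/3,
                                 # and the value is preserved under perspective projection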

Finally, adjoining straight lines are selected from the two parallel line groups to construct a block of size (m + 2) × (m + 2). The color at the central point of each block is sampled and transformed into a bit, with black as 0 and white as 1. The resulting bit pattern's position in the MPRA pattern is identified using the method described in Section II-C, which also determines which of the two feature line groups corresponds to the horizontal direction and which to the vertical direction.

IV. VISION-BASED LOCATING

This section first defines the coordinate systems in the collaborative environment and then describes how to obtain the extrinsic parameters of the camera. In the proposed method, a cheap and simple camera (i.e., a webcam) is employed. Its shutter and focus cannot be adjusted, so its intrinsic parameters can be calibrated in advance to reduce the execution time of camera calibration at run time. The method for rendering images corresponding to cameras at different viewing positions and directions is also discussed.

A. Homographic Matrix Setup

Fig. 4 presents the entire collaborative environment. Assume that the world coordinate system is defined on the plane of the work table. It is a right-handed coordinate system with axes X_w, Y_w, Z_w and origin O_w. The partial pattern observed by the viewer defines a pattern coordinate system, whose plane is composed of the extracted block; its relationship with the world coordinate system is a pure translation. The three axes of the pattern coordinate system are parallel to X_w, Y_w, and Z_w, respectively, and the offsets of its origin can be acquired from the MPRA after recognizing the bit pattern. In addition, the camera coordinate system represents the structure and geometry of the defined virtual camera; it is a left-handed coordinate system with axes X_c, Y_c, Z_c and origin O_c at the camera center. The relationship between a point P_c in the camera coordinate system and the corresponding point P_w in the world coordinate system can be defined as

P_c = R P_w + T   (7)

The image coordinate system is defined by the image plane, which has a normal vector along the optical axis and lies at distance f from O_c. The relationship for any point (X_c, Y_c, Z_c) of the camera coordinate system projected onto the image plane is

(u, v) = (f X_c / Z_c, f Y_c / Z_c)   (8)


Fig. 4. Coordinate systems in 3-D collaborative environments.

To integrate the transforms, homogeneous coordinates are used to obtain

s [u, v, 1]^T = A [R | T] [X_w, Y_w, Z_w, 1]^T   (9)

where A is the intrinsic matrix obtained by the offline calibration and s is a nonzero scalar.

B. Extrinsic Parameters

The unknown values in (9) are the elements of the rotation matrix R and the translation vector T. This work calculates the directions of the three world coordinate axes, expressed in the camera coordinate system, based on the vanishing points corresponding to pairs of parallel feature lines supplied by the MPRA pattern.

1) Vanishing Points: The location of the camera is estimated using vanishing points. A vanishing point is the intersection of a set of projected parallel lines. The vector from the origin of the camera coordinate system to a vanishing point in the image plane is parallel to the corresponding parallel lines in the world coordinate system. When the X_w and Y_w axes correspond to the horizontal and vertical lines of a recognizable MPRA pattern, the directions of the three world coordinate axes in the camera coordinate system can be obtained from the vanishing points.

2) Rotation Matrix: Let the direction of the X_w axis in the camera coordinate system be r_x. The vanishing point v_x, i.e., the intersection of all projected lines that are parallel to the X_w axis of the recognizable MPRA pattern, can be calculated in the image. The unit vector r_x is then derived by normalizing the vector from the camera origin to v_x:

r_x = v_x / ||v_x||   (10)

Similarly, the direction of the Y_w axis in the camera coordinate system is r_y; the intersection v_y of all projected lines that are parallel to the Y_w axis is calculated and normalized by

r_y = v_y / ||v_y||   (11)

Because the world coordinate system is a right-handed coordinate system, the direction of the Z_w axis in the camera coordinate system can be calculated using the cross product

r_z = r_x × r_y   (12)

Since the camera coordinate system is a left-handed coordinate system, the vector r_z is adjusted accordingly by

(13)

Consequently, the relationship from the world coordinate system to the camera coordinate system can be represented by the rotation matrix

R = [ r_x  r_y  r_z ]   (14)
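A sketch of this computation under the reconstruction above: the unit vectors toward the two vanishing points give two pattern-axis directions in the camera frame, and their cross product gives the third. The focal length f, the principal point at the image origin, and the omission of the left-handed fix-up in (13) are assumptions of this sketch.

import numpy as np

def rotation_from_vanishing_points(vp_x, vp_y, f):
    """vp_x, vp_y: vanishing points (u, v) with the principal point at the origin;
    f: focal length in pixels. Columns of the result are the pattern axes in the
    camera frame (re-orthogonalization is omitted)."""
    rx = np.array([vp_x[0], vp_x[1], f], dtype=float)
    rx /= np.linalg.norm(rx)                        # eq. (10)
    ry = np.array([vp_y[0], vp_y[1], f], dtype=float)
    ry /= np.linalg.norm(ry)                        # eq. (11)
    rz = np.cross(rx, ry)                           # eq. (12)
    rz /= np.linalg.norm(rz)
    return np.column_stack([rx, ry, rz])            # eq. (14)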

3) Translation Vector: The translation vector T is the displacement between the origins of the world and camera coordinate systems. Suppose two points P1 and P2 lie on the MPRA pattern, and the distance from P1 to P2 is d. Their projections onto the image plane are p1 and p2, respectively. The corresponding points of P1 and P2 in the camera coordinate system are Q1 and Q2, which must lie on the viewing rays through p1 and p2, respectively. Thus, Q1 and Q2 can be represented as (refer to Fig. 5)

Q1 = t1 u1,   Q2 = t2 u2   (15)

where u1 and u2 are the unit direction vectors of the rays through p1 and p2, and t1 and t2 are unknown scalars. Points Q1 and Q2 can also be represented as

Q1 = R P1 + T,   Q2 = R P2 + T   (16)


Fig. 5. Geometric configuration for determining the translation vector.

Because Q1 - Q2 = t1 u1 - t2 u2 and Q1 - Q2 = R (P1 - P2), therefore

t1 u1 - t2 u2 = R (P1 - P2)   (17)

Equation (17) is an over-determined system, and it can be solved for t1 and t2 using the pseudo-inverse method

[t1, t2]^T = (A^T A)^{-1} A^T R (P1 - P2),   A = [ u1  -u2 ]   (18)

Because T = Q1 - R P1,

T = t1 u1 - R P1   (19)
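The reconstruction above translates directly into a least-squares computation; the sketch below uses the variable names introduced in (15)-(19) rather than the paper's (unrecoverable) notation.

import numpy as np

def translation_from_two_points(P1, P2, u1, u2, R):
    """P1, P2: two pattern points in world coordinates; u1, u2: unit viewing-ray
    directions of their projections in the camera frame; R: rotation from (14)."""
    A = np.column_stack([u1, -u2])                   # 3 x 2 system of eq. (17)
    t = np.linalg.pinv(A) @ (R @ (P1 - P2))          # pseudo-inverse, eq. (18)
    Q1 = t[0] * u1                                   # P1 expressed in the camera frame
    return Q1 - R @ P1                               # T = Q1 - R P1, eq. (19)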

C. Rendering Image

The rendered image in the proposed system is generated by OpenGL. To apply the 3-D graphics API, this work must calculate three parameters that describe the viewpoint.

• The viewpoint's origin: the position of the virtual camera in the global coordinate system.

• The viewpoint's direction vector: the direction in which the virtual camera is pointed. An alternative is to define the point at which the optical axis of the camera passes through the plane of the MPRA pattern; the viewpoint's direction is then constructed from the viewpoint's origin and this intersection point.

• The viewpoint's orientation vector: the "up" direction of the virtual camera.

The three parameters, expressed in the OpenGL coordinate system, are then passed to the gluLookAt function of OpenGL to render the image. Fig. 6 presents the geometric configuration for rendering an image using OpenGL.
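With the three parameters in hand, the OpenGL call is straightforward; the sketch below assumes PyOpenGL, with eye, center, and up already expressed in the OpenGL coordinate system as described next.

from OpenGL.GL import GL_MODELVIEW, glLoadIdentity, glMatrixMode
from OpenGL.GLU import gluLookAt

def set_viewpoint(eye, center, up):
    """eye: viewpoint origin; center: point on the optical axis in the pattern
    plane; up: orientation vector of the virtual camera."""
    glMatrixMode(GL_MODELVIEW)
    glLoadIdentity()
    gluLookAt(eye[0], eye[1], eye[2],
              center[0], center[1], center[2],
              up[0], up[1], up[2])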

Since the coordinate system used by OpenGL is defined as a right-handed coordinate system, the relationship between the OpenGL coordinate system and the camera coordinate system must be re-established.

Fig. 6. Geometric configuration for rendering an image using OpenGL.

A point in the camera coordinate system is mapped to a point in the OpenGL coordinate system by a rotation matrix and a translation vector. Therefore,

(20)

The rotation and translation from the camera coordinate system to the OpenGL coordinate system are then calculated using the method described in Section IV-B.

(21)

Because the three viewpoint parameters must correspond to the OpenGL coordinate system, we infer (22) from (20) and (21) and obtain

(22)

Moreover, the viewpoint's origin in the OpenGL coordinate system is

(23)

The viewpoint's direction vector is defined by the viewpoint's origin and the point at which the optical axis of the camera intersects the plane of the MPRA pattern, both expressed in the OpenGL coordinate system. The viewpoint's orientation vector is the transformation of the camera's up axis into the OpenGL coordinate system.

V. EXPERIMENTAL RESULTS

A. Rendering Images

This work presents a new method that determines the viewing position and direction of the viewer from the markers of the MPRA pattern and generates the rendered images according to the extrinsic parameters of the camera.


Fig. 7. Rendered images observed from eight different angles.

The test setup uses an IBM PC running Windows XP with a Pentium 4 CPU at 3.0 GHz and 1 GB of RAM. Fig. 7 shows the results of rendering images at different viewing positions. First, a 3-D virtual scene consisting of a castle, two trees, and a green lawn is constructed. The MPRA with degree 4 is generated and placed on a work table to define a 3-D collaborative environment; each primitive block of the MPRA has a fixed physical size. The castle and trees in the 3-D virtual scene are placed on the plane defined by the MPRA in the range from block coordinates (0, 0) to (7, 6). The experiment simulates eight different viewing angles, and the distances between the MPRA and the cameras are about 55 to 75 cm. The odd rows in Fig. 7 show the real images of the MPRA pattern captured from the handheld devices, and the even rows show the corresponding rendered images. The current system handles the total processing at 5 Hz for four users.

B. Accuracy Analysis

To verify that the calculation of the extrinsic parameters is accurate and sufficiently stable, a bit pattern is recognized to determine its location in the MPRA and the extrinsic parameters of the camera. The locations of the four corners of the recognized bit pattern in the image are derived by (8) and compared with the true locations in the captured images. A camera with a focal length of 16.498 mm is used to capture the MPRA pattern at angles from 0° (top view) to 60°, where the distances between the camera and the MPRA pattern are 60–150 cm. Each location is captured 100 times. A known distance, i.e., the diagonal length of the block, is utilized to assess the accuracy of the positional determination. The mean error between the estimated and actual length is 2.59 mm, and the standard deviation of the errors is 0.86 mm. Based on this error analysis, the calculated extrinsic parameters of the proposed system are accurate and stable. The plot in Fig. 8 shows the distribution of the distance error for the corresponding angles and distances. These experimental results show that a short distance (about 60 cm) under large viewing angles and a large distance (about 150 cm) under a small viewing angle generate large errors. In the medium range of viewing distances and angles, the errors are below 4 mm.

Fig. 8. Distribution of distance errors for the corresponding angles and distances.

Fig. 9. Distribution of error standard deviations for the corresponding angles and distances.

Fig. 9 illustrates the distribution of the standard deviations of the error for the corresponding angles and distances. The following observations are based on these two plots.

• At general viewing placements (i.e., at a distance of 80–120 cm and an angle of 20° to 50°), the distance error and its standard deviation are low.

• Special viewing placements, namely the top view (viewing angle near 0°) and a large viewing angle at a short distance, generate a large distance error.

• All error standard deviations are below 3 mm at all viewing locations.

C. Sensitivity to Lighting and Occlusion

To evaluate the effects of lighting changes on the proposed method, five lighting conditions were designed to illustrate the differences. Five real input images captured under normal, light, dark, nonuniform, and shadowed lighting conditions are displayed in Fig. 10(a), (c), (e), (g), and (i), respectively. The results after binarization are shown in Fig. 10(b), (d), (f), (h), and (j). Because each line is formed by the black and white blocks, and each corner is determined by the intersection of two long lines, the effects caused by the five lighting conditions are all small.

The proposed pattern has the window property, so it can work with the partial views caused by occlusion. When the scene has no occlusion, the method selects a useful (m + 2) × (m + 2) block, as shown in Fig. 11(a). A set of real input images taken from the same scene, with a cube placed at different occluding locations, is shown in Fig. 11(b), (c), and (d).


Fig. 10. Examples of image binarization under five lighting conditions: (a) normal, (b) light, (c) dark, (d) nonuniform, and (e) shadowed. (Top) The tested images and (bottom) the results after image binarization.

Fig. 11. Examples of a cube at different occluding locations and the corresponding extracted (m+2) × (m+2) blocks. (a) No occlusion. (b) Occlusion at the right-bottom corner. (c) Occlusion at the middle part. (d) Occlusion at the left-top corner.

The extracted (m + 2) × (m + 2) block, displayed in red, is adaptively selected to avoid the occlusion. Therefore, the proposed method not only tolerates the effects of lighting changes but also overcomes the difficulties raised by occlusion.

VI. CONCLUSIONS

This work presents a real-time vision-based augmented-reality system for applications in multiuser collaborative environments. The proposed method produces patterns of different sizes from an MPRA depending on user requirements, removes the need to set up the locations of working environments, and eliminates the ambiguity in the viewing relationship with the calibration patterns.

The camera location can be accurately estimated by viewing only a partial region of a large pattern, based on the window property of the MPRA. This overcomes the disadvantage of ARToolKit, whose viewpoint is limited because it must see at least one whole small marker. Additionally, the MPRA pattern is constructed using two basic elements and can be treated as a binary pattern, which is suitable for low-resolution cameras because of its robustness to noise. Because the markers are square grids, they can be detected easily via straight-line detection. Therefore, the proposed method has the advantages of easy image processing, cheap fabrication, and easy operation.

REFERENCES

[1] W. R. Sherman and A. B. Craig, Understanding Virtual Reality: Interface, Application, and Design. San Mateo, CA: Morgan Kaufmann, 2003.

[2] T. L. Chia, Z. Chen, and C. J. Yueh, “A method for rectifying grid junctions in grid-coded images using cross ratio,” IEEE Trans. Image Process., vol. 5, no. 8, pp. 1276–1281, Aug. 1996.


[3] G. Jiang and L. Quan, “Detection of concentric circles for camera calibration,” in Proc. Tenth IEEE Int. Conf. Computer Vision, Oct. 17–21, 2005, vol. 1, pp. 333–340.

[4] J.-S. Kim, P. Gurdjos, and I.-S. Kweon, “Geometric and algebraic constraints of projected concentric circles and their applications to camera calibration,” IEEE Trans. Pattern Anal. Machine Intell., vol. 27, no. 4, pp. 637–642, 2005.

[5] C. Sinlapeecheewa and K. Takamasu, “3-D profile measurement by color pattern projection and system calibration,” in Proc. IEEE Int. Conf. Industrial Technology, Dec. 11–14, 2002, vol. 1, pp. 405–410.

[6] M. Fiala, “ARTag, a fiducial marker system using digital techniques,” in Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, Jun. 20–25, 2005, vol. 2, pp. 590–596.

[7] T. Kawano, Y. Ban, and K. Uehara, “A coded visual marker for video tracking system based on structured image analysis,” in Proc. IEEE Int. Symp. Mixed and Augmented Reality, Oct. 7–10, 2003, pp. 262–263.

[8] D. F. Abawi, J. Bienwald, and R. Dorner, “Accuracy in optical tracking with fiducial markers: An accuracy function for ARToolKit,” in Proc. IEEE Int. Symp. Mixed and Augmented Reality, Nov. 2–5, 2004, pp. 260–261.

[9] P. Vuylsteke and A. Oosterlinck, “Range image acquisition with a single binary-encoded light pattern,” IEEE Trans. Pattern Anal. Machine Intell., vol. 12, no. 2, pp. 148–164, Feb. 1990.

[10] M. Mohring, C. Lessing, and O. Bimber, “Video see-through AR on consumer cell-phones,” in Proc. IEEE Int. Symp. Mixed and Augmented Reality, Nov. 2–5, 2004, pp. 252–253.

[11] D. Wagner and I. Barakonyi, “Augmented reality Kanji learning,” in Proc. IEEE Int. Symp. Mixed and Augmented Reality, Oct. 7–10, 2003, pp. 335–336.

[12] G. Bianchi, C. W. M. Harders, P. Cattin, and G. Szekely, “Camera-marker alignment framework and comparison with hand-eye calibration for augmented reality applications,” in Proc. IEEE Int. Symp. Mixed and Augmented Reality, Oct. 5–8, 2005, pp. 188–189.

[13] P. F. Sturm and S. J. Maybank, “On plane-based camera calibration: A general algorithm, singularities, applications,” in Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, Jun. 23–25, 1999, vol. 1, pp. 432–437.

[14] Z. Zhang, “Flexible camera calibration by viewing a plane from unknown orientations,” in Proc. Seventh IEEE Int. Conf. Computer Vision and Pattern Recognition, Sep. 20–27, 1999, vol. 1, pp. 666–673.

[15] M. Fiala and C. Shu, Fully Automatic Camera Calibration Using Self-Identifying Calibration Targets, National Research Council of Canada, Tech. Rep. NRC 48306/ERB-1130, Nov. 2005.

[16] C. Matsunaga and K. Kanatani, “Optimal grid pattern for automated matching using cross ratio,” in IAPR Workshop on Machine Vision Applications (MVA 2000), Tokyo, Japan, Nov. 2000, pp. 561–564.

[17] E. R. Berlekamp, Algebraic Coding Theory. Walnut Creek, CA: Aegean Park, 1984.

[18] S. Lin and D. J. Costello, Error Control Coding: Fundamentals and Applications. Englewood Cliffs, NJ: Prentice-Hall, 1983.

[19] S. Lloyd and J. Burns, Finding the Position of a Subarray in a Pseudo-Random Array, HP Tech. Rep. HP-91-159, Oct. 1991.

[20] F. J. MacWilliams and N. J. A. Sloane, “Pseudo-random sequences and arrays,” Proc. IEEE, vol. 64, no. 12, pp. 1715–1729, Dec. 1976.

[21] R. C. Gonzalez, P. Wintz, and S. L. Eddins, Digital Image Processing Using MATLAB. Reading, MA: Addison-Wesley, 2002.

[22] G. A. Baxes, Digital Image Processing: Principles and Applications. New York: Wiley, 1994.

Yen-Hsu Chen received the B.S. and M.S. degrees in computer science and information engineering from Ming Chuan University, Taiwan, in 2005 and 2007, respectively. He is currently pursuing the Ph.D. degree at the Institute of Computer Science and Engineering, National Chiao Tung University, Hsinchu, Taiwan. His research interests include image processing and computer vision.

Tsorng-Lin Chia (SM'05) received the B.S. degree in electrical engineering from Chung Cheng Institute of Technology, Taoyuan, Taiwan, in 1982, and the M.S. and Ph.D. degrees in computer science and information engineering from National Chiao Tung University, Hsinchu, Taiwan, in 1986 and 1993, respectively.

He was on the faculty of the Department of Electrical Engineering at Chung Cheng Institute of Technology from 1993 to 2000. Currently, he is a Professor and Dean of the School of Information Technology at Ming Chuan University, Taoyuan, Taiwan. His research interests include image processing, pattern recognition, computer vision, and parallel algorithms and architectures. He is a senior member of the IEEE Signal Processing Society and the IEEE Computer Society.

Yeuan-Kuen Lee received the B.S., M.S., and Ph.D. degrees, all in computer and information science, from National Chiao Tung University, Hsinchu, Taiwan, R.O.C., in 1989, 1991, and 2002, respectively.

From 1993 to 1995, he was a Lecturer at Aletheia University, Taipei, Taiwan. He is currently an Assistant Professor with the Department of Computer Science and Information Engineering, Ming Chuan University, Taiwan. His research interests are in the areas of interactive media, media security, digital steganography, and steganalysis.

Shih-Yu Huang received the B.S. degree in information engineering from Tatung Institute of Technology, Taiwan, R.O.C., in 1988, and the M.S. and Ph.D. degrees from the Department of Computer Science, National Tsing Hua University, Taiwan, in 1990 and 1995, respectively.

From 1995 to 1999, he worked in the Telecommunication Laboratories of Chunghwa Telecom Co., Ltd., Taiwan. In October 1999, he joined the Department of Computer Science and Information Engineering, Ming Chuan University, Taiwan. His current interests are image compression, visual communication, and steganography.

Ran-Zan Wang received the B.S. degree in computer engineering and science in 1994 and the M.S. degree in electrical engineering and computer science in 1996, both from Yuan-Ze University, Taiwan, R.O.C. In 2001, he received the Ph.D. degree in computer and information science from National Chiao Tung University, Hsinchu, Taiwan.

He is currently an Associate Professor in the Department of Computer Engineering and Science at Yuan Ze University, Taiwan. His recent research interests include media security, image processing, and pattern recognition.

Dr. Wang is a member of Phi Tau Phi.