Kinect in Neurorehabilitation: Computer Vision System for Real Time Hand and Object Detection and Distance Estimation

Matija Štrbac, Marko Marković, and Dejan B. Popović, Member, IEEE

978-1-4673-1572-2/12/$31.00 ©2012 IEEE

Abstract— This paper presents image processing and scene analysis methods that can provide artificial vision of interest for the automatic selection of hand trajectory and prehension. The new algorithm, which uses data from the Kinect sensor, allows real-time detection of the hand of a person grasping an object on a working table in front of that person. The outputs are the real-world coordinates of the hand and the object. The image processing is done in Matlab over the depth image stream taken from the Microsoft Kinect as the sensory input. Results show that in the presented system setup our program is capable of tracking hand movements in the transverse plane and estimating the hand and object positions in real time, with an estimation error that is tolerable for the selection of a stimulation paradigm that could control the hand trajectory.

Index Terms— Microsoft Kinect, functional electrical therapy, RANSAC optimization, hand tracking, object detection

I. INTRODUCTION

Thanks to its low cost and its ability to track movements and voices, and even identify faces with the appropriate PC or XBOX algorithm, all without the need for any additional sensors, the Microsoft Kinect [1] has since its launch date transformed not only computer gaming but also many other applications such as robotics and virtual reality [2]. This sensor has found many practical applications in education and healthcare, and even in physical rehabilitation [3-6]. In our study, we tested the applicability of the Microsoft Kinect as a sensory input device and part of the computer vision system in neurorehabilitation.

Functional electrical therapy (FET) [7-8] integrates intensive exercise, functional electrical stimulation and motivation to relearn the manipulation and grasping that are missing as a result of stroke. One attainable improvement to the common stimulation paradigm is an artificial perception system which allows subconscious selection of the stimulation properties.

Manuscript received May 25, 2012. The work on this project was partly supported by the Ministry of Education and Science, Government of the Republic of Serbia (Project no. 175016). This research was conducted at the Laboratory for Biomedical Engineering and Technology (http://bmit.etf.rs) at the Faculty of Electrical Engineering, University of Belgrade.

M. Štrbac is with the Signals and Systems Department, Faculty of Electrical Engineering, University of Belgrade, Serbia (e-mail: [email protected]).

M. Marković is with the University Medical Center, Göttingen, Germany.

D. B. Popović is with the Signals and Systems Department, Faculty of Electrical Engineering, University of Belgrade, Serbia, and Aalborg University, SMI, Aalborg, Denmark.

Some recent studies [9-11] proved that different systems and methods could be used to solve the task of object identification and automatic grasp type selection. Although some of these systems proved almost faultless in the grasp type classification process, due to the hardware, i.e. the sensors they were composed of, they featured low precision and low robustness in the prehension information, i.e. the hand to object position and distance estimation.

The next step in the research was the inclusion of a new sensor that could retrieve the missing information, i.e. the real-world coordinates of the hand and the object, for the automatic selection of the stimulation parameters. This sensor, as well as the corresponding algorithm for hand and object detection, would have to guarantee spatial resolution good enough to prepare the subject for grasping, and temporal resolution good enough for real-time tracking of average-speed hand movements. In November 2010 Microsoft released a new low-cost sensor for natural interaction in the computer gaming environment, and that sensor later proved capable of solving many difficult computer vision tasks. Therefore, in this paper we present the results of using the Microsoft Kinect as a sensory input together with some novel computer vision algorithms for extracting the information of interest, as well as the image processing and geometrical modeling that make up the core of the real-time algorithm.

II. METHODS AND MATERIALS

The Microsoft Kinect sensor consists of an infrared laser emitter, an infrared camera and an RGB camera. The laser source emits a single beam which is split into multiple beams by a diffraction grating to create a constant pattern of speckles projected onto the scene. This pattern is captured by the monochrome CMOS sensor and is correlated against a reference pattern [12].


Fig. 1. Microsoft Kinect for XBOX 360 - a sensory input device with many practical applications. Adapted from [12].


As a result, we have as sensory input two corresponding image streams from the Kinect sensor at a resolution of 640x480 pixels, one representing the RGB image while the other is a grayscale depth image with 11-bit amplitude resolution. Our algorithm utilizes the depth information only, and all the following image processing algorithms are performed after resizing the depth stream data to 320x240 pixels for the sake of real-time computation.
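As a minimal Matlab sketch of this preprocessing step (not code from the paper), assuming rawDepth holds one 640x480 frame of raw 11-bit Kinect values, and using a commonly quoted disparity-to-depth approximation in the spirit of [13]:

    % Illustrative preprocessing sketch; rawDepth and the calibration
    % constants are assumptions, not values taken from this paper.
    d = double(rawDepth);                    % raw 11-bit values as doubles
    Z = 1 ./ (-0.0030711 * d + 3.3309495);   % approximate depth in meters
    Z = imresize(Z, [240 320]);              % downsample to 320x240 for real time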

The working range of the Kinect for Xbox depth sensor is between 0.8 m and 3.5 m. Hence, to optimize the working area in our experimental setup, we chose to place the Kinect on a 1.5 m high camera stand on top of the table, overlooking the table with around a 45 degree tilt with respect to the transverse plane (Fig. 2). This should also ensure easy inclusion of the Kinect sensor in the existing functional electrical therapy setup, by simply placing the stand with the sensor on the table next to the subject.

RANSAC and discrimination of the table pixels

Both the object and the hand identification processes rest on the task of detecting the bounds of the table in the image. After initial calibration of the sensor [13], estimation of the real-world coordinates in the depth image matrix is a matter of simply mapping the matrix indices and the depth values to the corresponding coordinates. Since the geometry of the table dictates that the coordinates of all the pixels belonging to the table must satisfy a plane equation, its position can be determined by the inverse process of finding all the pixels that belong to the biggest plane in sight of the camera. The table plane in the presented system setup takes up to 60% of the image, and it is easy to deduce that this assumption suits the immediate purpose of this system.
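For illustration, the index-to-coordinate mapping can be sketched with a standard pinhole camera model; the intrinsics fx, fy, cx and cy are assumed to come from the calibration step [13], and all variable names are ours, not the paper's:

    % Map depth-image indices to real-world coordinates (pinhole model).
    % Z is the depth frame in meters; fx, fy, cx, cy are camera intrinsics.
    [rows, cols] = size(Z);
    [u, v] = meshgrid(1:cols, 1:rows);   % column and row index of every pixel
    X = (u - cx) .* Z / fx;              % real-world X for every pixel
    Y = (v - cy) .* Z / fy;              % real-world Y for every pixel
    P = [X(:), Y(:), Z(:)];              % Mx3 list of 3D points, one per pixel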

The algorithm that is praised in the literature as the best solution for most problems of fitting a geometric model to experimental data is Random Sample Consensus (RANSAC) [14].

Fig. 2. System setup and the image streams from the Microsoft Kinect (RGB above, depth below).

Fig. 3. The RANSAC algorithm for plane detection is used to form the table background image (upper) at the start of the acquisition and later to detect a hand in front of that plane (middle). The region that has no contact with the table edges is declared the object of interest (lower).



It is also possibly the most widely used robust estimator in the field of computer vision, with many different forms and proposed optimizations for computational savings [15-17]. However, we propose a novel optimization for the identification of the largest plane in the image that allows additional computational savings. The basic RANSAC algorithm for plane detection consists of two steps. In the first, subsets are randomly selected from the input data and the model parameters fitting each sample are computed. The size of the random samples is the smallest sufficient for determining the model parameters. In our algorithm this means randomly distributing N triplets of points across the image and defining N planes from these triplets with the plane equation. In the second step, the quality of each set of model parameters is evaluated on the full data set. Different cost functions may be used [18] for the evaluation, the standard one being the number of inliers, i.e. the number of data points consistent with the model. In our case, this is the number of pixels whose 3D coordinates satisfy, up to some small threshold limit, the mathematical equation of the constructed plane. Calculating whether those 76800 image pixels meet the conditions for all N constructed planes is a difficult computational task for a program that should work in real time. Therefore, we introduce an optimization in this step.
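A minimal sketch of this basic two-step loop, before the proposed optimization, could look as follows; P is the Mx3 point list from the mapping sketch above, and N (number of triplets) and tol (inlier distance threshold) are free parameters of the sketch:

    % Illustrative Matlab sketch of basic RANSAC plane detection.
    function [bestN, bestD, bestCount] = ransacPlane(P, N, tol)
        M = size(P, 1);
        bestCount = 0; bestN = [0 0 1]; bestD = 0;
        for k = 1:N
            idx = randperm(M, 3);                 % one random triplet
            p1 = P(idx(1),:); p2 = P(idx(2),:); p3 = P(idx(3),:);
            n = cross(p2 - p1, p3 - p1);          % plane normal from the triplet
            if norm(n) < eps, continue; end       % skip collinear triplets
            n = n / norm(n);
            d = -dot(n, p1);                      % plane model: n*x + d = 0
            dist = abs(P * n' + d);               % point-to-plane distances
            count = nnz(dist < tol);              % consensus set size (inliers)
            if count > bestCount
                bestCount = count; bestN = n; bestD = d;
            end
        end
    end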

The idea was to use the information about the N constructed planes without calculating all the inliers for every plane. Instead of parametric plane equations, we represent the planes by their normal vectors and form clusters of vectors based on the angle between them. After that, we only need to calculate the number of inliers for the planes that belong to the largest cluster. The plane from that cluster with the highest number of image pixels satisfying its parametric equation is declared to be the table of interest.
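A sketch of this clustering idea, assuming normals is an Nx3 matrix of the unit normal vectors of the candidate planes and angTol is an angular tolerance in degrees; the greedy grouping shown here is only one possible way to implement the step:

    % Illustrative sketch: largest cluster of nearly parallel plane normals.
    function members = largestNormalCluster(normals, angTol)
        N = size(normals, 1);
        taken = false(N, 1);
        members = [];
        for i = 1:N
            if taken(i), continue; end
            % angle between unit normals; abs() treats n and -n as one plane
            ang = acosd(min(abs(normals * normals(i,:)'), 1));
            cluster = find(ang < angTol & ~taken);
            taken(cluster) = true;
            if numel(cluster) > numel(members)
                members = cluster;                % keep the largest cluster so far
            end
        end
    end

Only the planes indexed by members then go through the full inlier count.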

Here, it is important to note that this additional clustering step in the RANSAC algorithm is possible because we have adopted the assumption that the table makes up more than 50% of the image, which corresponds to the system setup and the problem we are trying to solve. In this study the intermediate clustering step in the RANSAC algorithm resulted in a smaller error rate and a significantly lower time complexity for the same number of random samples N.

Hand and object tracking

Object identification is a straightforward process of morphological operations over the black and white image formed as a result of the RANSAC algorithm for table detection. Under the assumption that only one object on the table in the perspective of the IR camera has no contact with the table edges, it is easy to extract the object primitive in only two steps. The first is black and white padding of all the closed contours in the image to fill the object hole in the table, and the second is subtraction of the table image without the fill from the filled table image. The central pixel of that primitive is declared the reference point of the object in the computation of its coordinates. The real-world coordinates of that pixel represent the object as a whole.
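These two steps map naturally onto standard morphological operations; a sketch, assuming tableBW is the binary table mask produced by the RANSAC stage and using the centroid as a stand-in for the central pixel of the primitive:

    % Illustrative sketch of the two-step object extraction.
    filled   = imfill(tableBW, 'holes');     % fill the hole left by the object
    objectBW = filled & ~tableBW;            % filled minus original = object
    stats    = regionprops(objectBW, 'Centroid');
    refPixel = round(stats(1).Centroid);     % reference point of the object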

The first necessary step for perceiving the hand in the depth stream is the formation of a background image based on an average estimation of the table pixels.

Fig. 4. Outline of the table calculated on the background image and its real-time estimation when there is a hand in front; Sobel filter based edge detection and the primitive that displays the pixels estimated to belong to the hand of the subject.



The background image is formed from the first 10 image frames of the depth stream of the Kinect sensor, and therefore during the first 400 ms after the start of the acquisition process. The presumption is that during this period the hand of the subject is still not above the table. Hence, this background image should mark out all the pixels in the viewed image that belong to the table, with the only exception being the objects that are lying on the table.
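A sketch of this background formation; getDepthFrame() is a hypothetical placeholder for whatever routine grabs one resized depth frame from the sensor:

    % Illustrative sketch: average the first 10 depth frames.
    nFrames = 10;
    acc = zeros(240, 320);
    for k = 1:nFrames
        acc = acc + double(getDepthFrame()); % hypothetical frame grabber
    end
    background = acc / nFrames;              % per-pixel average depth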

This background is filtered through a low-threshold Sobel edge filter and used to estimate the table edges. The algorithm for the detection of the table boundaries works on the same principle as RANSAC, but instead of modeling the plane from the 3D data and then finding the largest one, it models lines based on the background edge primitive and searches for the two longest lines in that image. The mathematical models of the two longest table edges in the background image are saved and used for the later real-time hand detection. In our experiments, this method proved more adjustable and robust than the Hough transform.
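Under the same illustrative assumptions as the plane-detection sketch, the line search over the Sobel edge primitive could look like this; the 0.01 threshold, N and tol are placeholder values:

    % Illustrative sketch: RANSAC-style search for lines among edge pixels.
    edges = edge(background, 'sobel', 0.01); % low-threshold Sobel edge map
    [yy, xx] = find(edges);                  % coordinates of edge pixels
    pts = [xx, yy];
    best = 0;
    for k = 1:N
        idx = randperm(size(pts, 1), 2);     % two random edge pixels
        p1 = pts(idx(1),:); p2 = pts(idx(2),:);
        dvec = (p2 - p1) / norm(p2 - p1);    % unit direction of candidate line
        nrm = [-dvec(2), dvec(1)];           % unit normal of candidate line
        dist = abs((pts - p1) * nrm');       % point-to-line distances
        count = nnz(dist < tol);             % edge pixels supporting this line
        if count > best, best = count; lineModel = [p1, dvec]; end
    end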

After forming the background image and the line models, the edges of the table are estimated for every frame of the input stream by finding the pixels in the Sobel-filtered table primitive that satisfy the mathematical equations of the given line models. The concept is that these edges will feature a gap during program execution whenever there is an arm over the table. As a further verification of whether that gap really represents the human arm and is not due to sensory input error or faulty calculations during image processing, the mean depths of the surrounding pixels on the two sides of the table edge are compared. We expect the arm to be the only entity in the view angle of the IR sensor that intersects the table edges and has continuous depth through the detected gap. The arm is extracted from the image by performing an image fill operation starting from the central pixel of the gap to the closest edge in the Sobel-filtered depth stream image.
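A sketch of this verification and fill, where Z is the current depth frame, gapCenter = [row col] is the central pixel of the detected gap, sideA and sideB index small pixel neighborhoods on the two sides of the table edge, and depthTol is a small tolerance; all of these names are illustrative assumptions:

    % Illustrative sketch of gap verification and arm extraction.
    if abs(mean(Z(sideA)) - mean(Z(sideB))) < depthTol  % depth continuous across the edge?
        edges  = edge(Z, 'sobel', 0.01);     % Sobel edge map of the current frame
        filled = imfill(edges, gapCenter);   % flood-fill starting from the gap center
        armBW  = filled & ~edges;            % pixels estimated to belong to the arm
    end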

As the reference point on the extracted hand, whose coordinates we use for the distance calculations and as the real-world position information, we declared the median pixel in a heuristically defined area around the tip of the hand, meaning the pixel furthest from the edge of the table. A consequence of this method of calculating the reference point is that the elbow angle should never be greater than 90 degrees. For smaller angles this estimation results in a fairly acceptable reference point.
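One way to realize this heuristic is with a distance transform; a sketch, assuming armBW is the binary arm mask from the previous step and edgeBW marks the detected table-edge pixels:

    % Illustrative sketch: pick the arm pixel farthest from the table edge.
    distEdge = bwdist(edgeBW);               % distance of every pixel to the edge
    distEdge(~armBW) = -Inf;                 % consider arm pixels only
    [~, refIdx] = max(distEdge(:));          % farthest arm pixel
    [refRow, refCol] = ind2sub(size(armBW), refIdx);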

Fig. 5. Real-time estimation of the real-world coordinates and the hand to object distance along the horizontal axis (left) and down the vertical axis (right) of the transverse plane, with intervals labeled on the graph paper.



III. RESULTS

A program with graphical feedback was created in Matlab in order to test whether this algorithm is capable of tracking hand movement in real time, and an experiment was conducted to evaluate the error in the hand to object distance calculations. The average speed of the presented image processing methods for hand and object tracking on a personal laptop (Intel Core i5 2430M @ 2.4 GHz, 4 GB of RAM) proved to be less than 10 ms; hence, we can conclude that the speed of this system will be constrained mainly by the response speed of the Kinect sensor itself, which is 30 Hz.

In quasi-static conditions, the object and hand identification algorithms yielded very consistent estimates of their real-world coordinates. The mean estimation errors for all the coordinates in the working space were less than 0.1 cm. In dynamic conditions, the hand reference point varied by over 1 cm with the hand movement and with changes in its orientation. The main cause is the algorithm's choice of this reference point and its dependence on the angle at the elbow and the distance of the hand to the sensor. However, this is still a small error, bearing in mind that the purpose of this system is to provide artificial perception and to automate the control of the functional electrical stimulation system.

The hand to object distance estimates were analyzed in real time (Fig. 5), and the system proved more sensitive to movements along the horizontal axis than to movements along the vertical axis. This is a side effect of the sensor position and orientation in this system setup and of the change of the IR sensor resolution with distance [12]. Despite this, in the presented system setup, the estimation error in the 80x60 cm area where the hand of the subject is expected to be (10 to 30 cm above the table) was never higher than 1 cm.

Here, it should also be noted that for distances lower than 10 cm along the horizontal axis (with respect to the captured image) and 5 cm along the vertical axis the algorithm is unable to detect the object, since the hand and the object will most often overlap in the perspective of the IR camera. However, if we assume the static nature of the targeted object, this does not affect the distance calculation and object position estimation when the object is on the further part of the table, because the last estimated coordinates will be used for the later calculations. Therefore, the Kinect sensor should usually be placed on the part of the table closer to the affected hand than to the object of interest.

IV. CONCLUSION

The proposed image processing algorithms and the hand and object coordinate estimation methods proved satisfactory for the real-time needs of automatically planning and tracking hand to object movements in functional electrical therapy. The position estimation errors and the time complexity of the proposed algorithm were at the hardware limits of the Kinect sensor, i.e. a 1 cm error in the working range at around 30 fps. Therefore, additional optimizations in the image processing could not produce lower estimation errors or faster computation. Nevertheless, there are a few preconditions that should exist in the upper extremity FET system setup for this algorithm to perform properly:

i. Only one object at a time should be placed on the table in front of the subject for the grasping exercise

ii. The other hand should never be above the table

iii. The object is grasped along a direct path and the angle at the elbow is never greater than 90 degrees

iv. Only immobile objects could be used in the exercise

As a result of the incorporation of these hypotheses into the image processing algorithm, its applicability for hand and object detection in a different system setup is relatively low. However, the proposed RANSAC optimization could be useful whenever the object we are trying to model is the largest with the given geometry in the perceived image.

In the end, it is important to say that for now this system has not yet been attached to the stimulator and still needs to be tested for practical clinical applicability in functional electrical therapy. The reader should note that we have only tested the applicability of the Microsoft Kinect and the concept of artificial perception for the automatic choice of the stimulation parameters.

REFERENCES

[1] Kinect for Xbox. [Online]. Available: http://www.xbox.com/en-US/kinect. [Accessed 25 May 2012].

[2] F. Ryden, H. J. Chizeck, S. N. Kosari, H. King and B. Hannaford, "Using Kinect™ and a Haptic Interface for Implementation of Real-Time Virtual Fixtures," in RSS Workshop on RGB-D Cameras, 2011.

[3] L. Gallo, A. P. Placitelli and M. Ciampi, "Controller-free exploration of medical image data: Experiencing the Kinect," in Computer-Based Medical Systems, 2011.

[4] Y.-J. Chang, S.-F. Chen and J.-D. Huang, "A Kinect-based system for physical rehabilitation: A pilot study for young adults with motor disabilities," Research in Developmental Disabilities, pp. 2566-2570, 2011.
[5] H.-m. J. Hsu, "The Potential of Kinect in Education," International Journal of Information and Education Technology, pp. 365-370, 2011.

[6] A. P. Lanari Bo, M. Hayashibe and P. Poignet, "Joint Angle Estimation in Rehabilitation with Inertial Sensors and its Integration with Kinect," in EMBC, 2011.

[7] M. B. Popović, D. B. Popović, T. Sinkjær, A. Stefanović and L. Schwirtlich, "Restitution of reaching and grasping promoted by functional electrical therapy," Artificial Organs, pp. 271-275, 2002.

[8] M. B. Popović, D. B. Popović, T. Sinkjær, A. Stefanović and L. Schwirtlich, "Clinical Evaluation of Functional Electrical Therapy in Acute Hemiplegic Subjects," J Rehab Res Develop, pp. 434-454, 2003.

[9] Đ. Klisić, M. Kostić, S. Došen and D. B. Popović, "Control of Prehension for the Transradial Prosthesis: Natural-like Image Recognition System," Journal of Automatic Control, no. 19, pp. 27-31, 2009.

[10] S. Došen and D. B. Popović, "Transradial Prosthesis: Artificial Vision for Control of Prehension," Artificial Organs, 2010.
[11] M. Štrbac and M. Marković, "Stereovision system for estimation of the grasp type for electrotherapy," Serbian Journal of Electrical Engineering, vol. 8, no. 1, pp. 17-25, 2011.

[12] [Online]. Available: http://www.engadget.com/2010/06/13/microsoft-kinect-gets-official/
[13] K. Khoshelham and S. O. Elberink, "Accuracy and Resolution of Kinect Depth Data for Indoor Mapping Applications," Sensors, pp. 1437-1454, 2012.

[14] J.-M. Gottfried, J. Fehr and C. S. Garbe, "Computing Range Flow from Multi-modal Kinect Data," in Advances in Visual Computing, 2011, pp. 758-767.
[15] M. A. Fischler and R. C. Bolles, "Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography," Communications of the ACM, pp. 381-395, 1981.



[16] O. Chum and J. Matas, "Randomized RANSAC with Td,d test," in British Machine Vision Conference, 2002.
[17] O. Chum and J. Matas, "Optimal Randomized RANSAC," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1472-1482, 2008.

[18] D. Nister, "Preemptive RANSAC for Live Structure and Motion Estimation," in IEEE International Conference on Computer Vision, 2003.

[19] P. H. S. Torr and A. Zisserman, "MLESAC: A New Robust Estimator with Application to Estimating Image Geometry," Computer Vision and Image Understanding, pp. 138-156, 2000.
