ANU College of Engineering & Computer Science

Visual Processing for Symbol Recognition in Prosthetic Vision Systems
Lachlan Horne, NICTA, [email protected]
Supervisors: Nick Barnes, Chris McCarthy, Xuming He

Introduction

Bionic Vision Australia is developing retinal implants which will be used to treat certain types of visual impairment. These implants stimulate the retina to produce sensations similar to sight (phosphenes) where photoreceptor cells are not functioning.

However, these implants are limited in their ability to restore sight. Current prosthetic vision technology provides very low visual acuity for any significant field of view. Bionic eye recipients are unable to interpret symbolic information such as signs except in controlled conditions.

Allowing the user to enlarge (zoom) the image of the environment can help, but this demands a large degree of user interaction and still requires the user to locate the object in the environment, which is difficult. The purpose of our research is to develop systems which address this issue using computer vision techniques.

In this poster I show two methods we have developed for automatically enhancing the resulting phosphene image to allow for identification and understanding of signs and other symbolic information.

Even with a relatively high number of phosphene locations (over 1000 in this example) details are not clear.

Object Detection

We used a sliding-window object detector with HOG (Histogram of Oriented Gradients) features to implement a street sign detector, trained with images of Australian street signs. The system is first trained on a set of images to produce a classifier, which is used in the object detector.

Training phase: positive examples (around 2,000 images) and negative examples (around 20,000 images) are fed to AdaBoost training, which produces a window classifier used in a cascade classifier with adaptive boosting.
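
As a rough illustration of this training phase, the sketch below builds a window classifier from HOG features using scikit-image and scikit-learn. The directory names, window size, HOG parameters, and the use of a single AdaBoost ensemble in place of the full cascade are illustrative assumptions, not the exact configuration used here.

```python
import glob
import numpy as np
from skimage.io import imread
from skimage.transform import resize
from skimage.feature import hog
from sklearn.ensemble import AdaBoostClassifier

WINDOW = (64, 64)  # assumed detection window size

def hog_features(image):
    """Resize a window to a fixed size and compute HOG features."""
    window = resize(image, WINDOW, anti_aliasing=True)
    return hog(window, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), channel_axis=-1)

def load_examples(pattern, label):
    """Load every image matching `pattern` and attach a class label."""
    feats = [hog_features(imread(path)) for path in glob.glob(pattern)]
    return feats, [label] * len(feats)

# Positive windows contain street signs; negatives are background crops.
pos_X, pos_y = load_examples("positives/*.png", 1)   # ~2,000 images
neg_X, neg_y = load_examples("negatives/*.png", 0)   # ~20,000 images

X = np.array(pos_X + neg_X)
y = np.array(pos_y + neg_y)

# A single boosted ensemble of weak classifiers stands in for the cascade.
window_classifier = AdaBoostClassifier(n_estimators=200)
window_classifier.fit(X, y)
```

In practice the boosted stages are chained into a cascade so that most background windows are rejected cheaply by the early stages.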

We can then take a series of images (for example, from a camera worn by the user) and, in each image, test regions against our classifier. By testing many regions at different locations and sizes, we can take those classified as matching the target object and construct bounding boxes for the target objects in the image.
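
The scan itself can be sketched as follows, reusing the hypothetical hog_features helper from the training sketch above; the window sizes and stride are illustrative, and a real detector would also merge overlapping detections (for example with non-maximum suppression).

```python
def detect(image, classifier, scales=(64, 96, 128), stride=16):
    """Scan the image with square windows of several sizes and
    return (x, y, size) boxes that the classifier accepts."""
    boxes = []
    h, w = image.shape[:2]
    for size in scales:
        for y in range(0, h - size + 1, stride):
            for x in range(0, w - size + 1, stride):
                window = image[y:y + size, x:x + size]
                if classifier.predict([hog_features(window)])[0] == 1:
                    boxes.append((x, y, size))
    return boxes
```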

Sliding windows and detected bounding boxes

Once the bounding boxes are found we can zoom in on those regions to give the user an enhanced image of the target object. This allows street signs to be recognised: in informal tests, a normally sighted individual correctly identified 72% of signs from the simulated phosphene image using this method.
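
A minimal sketch of this zoom step, assuming an (x, y, size) box from the detector sketch above and using OpenCV:

```python
import cv2

def zoom_to_box(image, box):
    """Crop the detected region and scale it to fill the frame."""
    x, y, size = box
    crop = image[y:y + size, x:x + size]
    return cv2.resize(crop, (image.shape[1], image.shape[0]),
                      interpolation=cv2.INTER_LINEAR)
```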

However, we found this approach to enhancement to be overly restrictive. The training stage is time-consuming and requires a large amount of varied training data, and a separate classifier would have to be trained for each type of sign. This limits flexibility and makes the approach less useful for a real bionic vision system.

Image Segmentation

To overcome the inflexibility of the object detection approach, we developed a system based on image segmentation.

A graph is produced based on the colour values (RGB) of each pixel in the image (a). A node in the graph (b) corresponds to an image pixel, and each node is connected to 4 adjacent nodes. Each edge in the graph has a weight associated with it, which is calculated from the difference in colour values between the pixels it connects. Then, the graph is split into regions of minimum energy - that is, edges are removed in closed paths that split the original graph and minimise the total edge weights in each resulting graph. This effectively divides the graph into connected regions which are grouped by colour. We then join regions based on colour and output the region around the fixation point (c) (which is taken to be in the centre of the image) to be converted to a phosphene representation (d).
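
A minimal sketch of this pipeline is given below, using Felzenszwalb's graph-based segmentation from scikit-image as a stand-in for the energy-minimising split described above; the scale, sigma, and min_size parameters are illustrative assumptions.

```python
import numpy as np
from skimage.segmentation import felzenszwalb

def fixated_region(image):
    """Segment the image into colour-coherent regions and return the
    region under the central fixation point, with the rest blacked out."""
    # Graph-based segmentation: pixels are nodes, edge weights come from
    # colour differences, and the graph is split into coherent regions.
    labels = felzenszwalb(image, scale=100, sigma=0.8, min_size=50)
    cy, cx = labels.shape[0] // 2, labels.shape[1] // 2
    mask = labels == labels[cy, cx]
    # Keep only the fixated region before phosphene rendering.
    out = np.zeros_like(image)
    out[mask] = image[mask]
    return out
```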

We found this approach to be much more flexible than the object detection approach.

(a) Source image

(b) Graph generated from pixel values

(c) Fixated region

(d) Resulting phosphenised image

We tested the system with a video of text in a typical indoor setting. The camera swept from left to right across the sign shown at left. Below is the simulated phosphene result for different letters from a rectangular 30x35 arrangement (top) and a hexagonal 98-phosphene arrangement (bottom).
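
For reference, the sketch below shows one simple way to simulate a rectangular phosphene grid. It is not the Lieby et al. simulator used to generate the poster images; the grid orientation (35 rows by 30 columns), phosphene size, and blur are assumptions.

```python
import numpy as np
import cv2

def phosphenise(gray, rows=35, cols=30, out_size=(350, 300)):
    """Render a grayscale image as a rows x cols grid of circular
    'phosphenes' whose brightness follows the local mean intensity."""
    # Mean intensity per grid cell.
    cell = cv2.resize(gray, (cols, rows), interpolation=cv2.INTER_AREA)
    out = np.zeros(out_size, dtype=np.float32)
    step_y, step_x = out_size[0] / rows, out_size[1] / cols
    radius = int(min(step_x, step_y) / 2)
    for r in range(rows):
        for c in range(cols):
            centre = (int((c + 0.5) * step_x), int((r + 0.5) * step_y))
            cv2.circle(out, centre, radius, float(cell[r, c]), -1)
    # Blur the discs so they look like diffuse phosphenes.
    return cv2.GaussianBlur(out, (0, 0), sigmaX=radius / 2)
```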

We also tested with a variety of symbols in indoor and outdoor settings. Shown below are examples of this, with results phosphenised with the 30x35 grid layout.

Acknowledgments

Thanks to Paulette Lieby and Adele Scott for providing the simulated phosphene vision system used to generate images in this poster, described in:

P. Lieby, N. Barnes, C. McCarthy, N. Liu, H. Dennett, J.G. Walker, V. Botea, and A.F. Scott. Substituting depth for intensity and real-time phosphene rendering: Visual navigation under low vision conditions. EMBC, 2011.

Thanks also to Gary Overett of the Automap project for providing training data for our object detection system.