
Multiprocessor Architecture for Image Processing

Under the guidance of

Dr. Anshul Kumar

Mayank Kumar 2006EE10331

Pushpendre Rastogi 2006EE50412

Introduction

Signal processing, particularly image/video processing on embedded platforms, requires high-end processors to implement complex algorithms while meeting real-time deadlines.

Power consumption and cost are the major obstacles to massive deployment of embedded processing nodes, e.g. surveillance camera networks and traffic monitoring and control.


FPGAs/reconfigurable ASICs provide a promising solution to this problem, because specific hardware can be designed to exploit the parallelism in the algorithm.

However, this approach has several shortcomings:

Gates get used up when complex algorithms are implemented.

Implementing sequential algorithms directly on an FPGA is highly inefficient.

Our approach

Design a multiprocessor architecture to facilitate the processing of high-resolution image/video frames:

Design of the processing element (PE), or node processor, customized to handle pixel/region-level operations efficiently.

Given the PE, design of the architecture for interconnecting these processors, and design of the input/output hardware.

Novelty

With an array of processors, we exploit the parallelism available from processing different regions of a frame on different processors.

Within each processor, sequential algorithms are implemented efficiently by providing an application-specific instruction set.

Locally sequential and globally parallel.

Locally Sequential Globally Parallel

This applies to any class of algorithms that are window-based and essentially operate on regions of the image rather than on the image as a whole, e.g.:

Image change detection for surveillance applications

Optic flow, motion estimation, filtering, etc.
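To illustrate the locally sequential, globally parallel idea, here is a minimal C sketch. It is not the authors' code. It simulates PEs that each filter one horizontal strip of a frame with a 3x3 mean filter. Each simulated PE copies only its own rows plus a one-row halo above and below, which is all a 3x3 window needs, and then works through its strip sequentially. The frame size (W, H), the filter, and the function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define W 8   /* hypothetical frame width  */
#define H 12  /* hypothetical frame height */

/* Simulates one PE: it copies only rows [r0-1, r1+1) of the frame (its strip
   plus a one-row halo) into local memory, then applies a 3x3 mean filter to
   rows [r0, r1). Frame-border pixels are passed through unchanged. */
static void pe_filter_strip(unsigned char frame[H][W],
                            unsigned char out[H][W], int r0, int r1)
{
    unsigned char local[H][W];
    int lo = r0 > 0 ? r0 - 1 : 0;
    int hi = r1 < H ? r1 + 1 : H;
    assert(0 <= r0 && r0 <= r1 && r1 <= H);
    for (int y = lo; y < hi; y++)
        memcpy(local[y - lo], frame[y], W);

    for (int y = r0; y < r1; y++) {
        int ly = y - lo; /* row index inside the local buffer */
        for (int x = 0; x < W; x++) {
            if (y == 0 || y == H - 1 || x == 0 || x == W - 1) {
                out[y][x] = local[ly][x];
                continue;
            }
            int sum = 0;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++)
                    sum += local[ly + dy][x + dx];
            out[y][x] = (unsigned char)(sum / 9);
        }
    }
}

/* Returns 1 if filtering in n_pe strips gives the same image as filtering
   the whole frame on a single processor. */
static int strips_match_whole_frame(int n_pe)
{
    unsigned char frame[H][W], whole[H][W], split[H][W];
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            frame[y][x] = (unsigned char)((y * 37 + x * 11) % 256);

    pe_filter_strip(frame, whole, 0, H);            /* one processor   */
    for (int p = 0; p < n_pe; p++)                  /* n_pe processors */
        pe_filter_strip(frame, split, p * H / n_pe, (p + 1) * H / n_pe);
    return memcmp(whole, split, sizeof whole) == 0;
}
```

For example, `strips_match_whole_frame(3)` returns 1. Three strip-local PEs produce exactly the whole-frame result. The halo grows with the window: a (2r+1)x(2r+1) window needs r extra rows on each side of a strip.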

We chose “Image change detection using Background Modeling” as a test algorithm.

Work Done

Hardware part: initial architecture, its drawbacks, change of platform, new architecture, implementation.

Software part: algorithm analysis and implementation, fixed-point Matlab simulation, C implementation.

Initial Architecture

[Figure: Initial architecture on the Virtex II Pro: camera, video ADC, RGB conversion, PowerPC, a 3x3 array of M1 processors (array topology), memory accessed through the MPMC, video DAC, monitor]

Architectural Drawbacks

The Multi-Port Memory Controller (MPMC) could handle only a limited number (2-4) of parallel accesses from different processors. Solution: use BRAM for parallel access.

The image format on the XUPV30 is interlaced, so the whole frame must be stored, which would use up all available BRAMs. Solution: use a board that provides progressive data.

Moreover, most digital cameras these days provide progressive image data.

Change of Platform

We switched to the Xilinx ML401 Virtex Video Starter Kit (VSK), which provides progressive video input, much more BRAM, and Matlab/Simulink as a design platform for designing at a higher abstraction level.

However, switching platforms cost time because of the associated learning curve.

New Architecture

[Figure: New architecture: camera, video ADC, VIO_in, custom memory controller (Verilog module), array of block RAMs, array of processor networks, VIO_out, video DAC, monitor]

Description and Implementation

The ML401 VSK provides two FPGAs: a Xilinx XUP2V7 for image input/output and the Xilinx ML401 for developing the application.

VIO_in and VIO_out are reference designs that sandwich the user-level design. VIO_in provides progressive image data.

We designed a custom memory controller suited to our needs. It writes data to FIFOs implemented in BRAMs.

Custom Memory controller

Takes H_sync, v_sync, rst and Pixel_clk as inputs and selects a target FIFO into which the incoming data are written.

Each BRAM stores the image data for 4 lines.

It first empties the queue, reading out the result computed in the previous iteration.

The other end of the FIFO is read by the Microblaze processor over Fast Simplex Link (FSL) links.
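The FIFO selection can be modelled in software as below. This is a hypothetical C model, not the Verilog module. It rests on several assumptions:

- rst and v_sync reset a line counter, and each H_sync advances it.
- Consecutive 4-line blocks (one BRAM each) are dealt out round-robin to NUM_FIFOS FIFOs. NUM_FIFOS = 6 is an arbitrary choice.
- On every Pixel_clk, the incoming pixel is written to the FIFO returned by `mc_target_fifo`.

```c
#include <assert.h>

#define LINES_PER_FIFO 4 /* from the slide: one BRAM holds 4 lines */
#define NUM_FIFOS      6 /* hypothetical number of BRAM FIFOs      */

/* Hypothetical software model of the controller's write-select state. */
typedef struct {
    int line; /* current line within the frame */
} mc_state;

/* rst or v_sync: start of a new frame. */
static void mc_reset(mc_state *s) { s->line = 0; }

/* H_sync: the current line is complete, move to the next one. */
static void mc_hsync(mc_state *s) { s->line++; }

/* FIFO that receives the pixels of the current line: lines are grouped in
   blocks of LINES_PER_FIFO and the blocks are assigned round-robin. */
static int mc_target_fifo(const mc_state *s)
{
    assert(s->line >= 0);
    return (s->line / LINES_PER_FIFO) % NUM_FIFOS;
}
```

With these settings, lines 0-3 go to FIFO 0, lines 4-7 to FIFO 1, and line 24 wraps around to FIFO 0 again.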

Processor Network

Each processor network consists of one master processor and 1-7 slave processors.

The master processor reads data from the FIFO and distributes the work among the slave processors.

We demonstrated this using 3 processors: 1 master and 2 slaves.

Processor Network Basic Design

We connected the master processor to a UART to establish a serial link for input/output.

The master processor is connected to slave processors that all run the same algorithm.

It takes input from the UART and passes it to the different slaves.

The master processor distributes work by sending different regions of the image to different processors.
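One simple way for the master to divide rows among its slaves is a balanced contiguous split, sketched below in C. The function name and the balancing rule are assumptions for illustration. The slides do not specify how the regions are sized.

```c
#include <assert.h>

/* Hypothetical work split: the master divides `rows` image rows among
   `n_slaves` (1-7) slaves as contiguous blocks whose sizes differ by at most
   one row. Slave k gets rows [*first, *first + *count). */
static void slave_region(int rows, int n_slaves, int k, int *first, int *count)
{
    assert(n_slaves >= 1 && n_slaves <= 7);
    assert(k >= 0 && k < n_slaves && rows >= 0);
    int base = rows / n_slaves;  /* rows every slave receives          */
    int extra = rows % n_slaves; /* the first `extra` slaves get one more */
    *count = base + (k < extra ? 1 : 0);
    *first = k * base + (k < extra ? k : extra);
}
```

For example, with 10 rows and 3 slaves, the slaves receive rows 0-3, 4-6 and 7-9. With the demonstrated 2-slave network, a 4-line block splits into two halves of 2 rows each.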

Software Architecture

We studied the adaptive background mixture model [1], [2].

We analysed the algorithm for parallelism exploitation, length of code for implementation, memory requirements to store data, and feasibility.
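Here is a rough worked example of the memory-requirement analysis. Suppose, hypothetically, that each pixel keeps N Gaussians (the number of components in the background mixture model) and that each Gaussian stores a weight, a mean and a variance as one 32-bit word each:

$$ S_{\text{pixel}} = 3N \times 4\ \text{bytes}, \qquad S_{\text{strip}} = 4\,W\,S_{\text{pixel}} $$

Here S_strip is the model state for a 4-line strip of width W. For N = 3 and a hypothetical line width W = 640 pixels, S_pixel = 36 bytes, so a 4-line strip needs 4 x 640 x 36 = 92,160 bytes of model state.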

The Algorithm

Models each region of the image frame as a sum of N Gaussians, each with its own weight.

Updates the model when a new frame arrives. Makes the foreground/background decision depending on which Gaussian distribution (k) the current pixel value belongs to.

Effectively models repetitive changes in the background.

Resistant to noise and to slow illumination variations.
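The update step can be sketched in C as below. The sketch follows the per-pixel formulation of the adaptive background mixture model in [1]. It is a floating-point illustration, not the authors' code, and it makes these assumptions:

- The constants K (the N of the slides), ALPHA, T_BG, INIT_W, INIT_VAR and MIN_VAR are hypothetical.
- The learning factor rho = ALPHA / w is an assumed simplification. The original formulation scales ALPHA by the Gaussian likelihood instead.
- The pixel is classified against the current model before the model is updated.

```c
#include <assert.h>

#define K        3     /* Gaussians per pixel (hypothetical)              */
#define ALPHA    0.01  /* learning rate alpha (hypothetical)              */
#define T_BG     0.7   /* background weight threshold T (hypothetical)    */
#define INIT_W   0.05  /* weight given to a newly created Gaussian        */
#define INIT_VAR 900.0 /* variance given to a newly created Gaussian      */
#define MIN_VAR  4.0   /* variance floor (assumption) to avoid collapse   */

typedef struct { double w, mean, var; } gaussian;

/* Insertion sort by w/sigma, descending; comparing w*w/var avoids sqrt. */
static void sort_by_fitness(gaussian g[K])
{
    for (int i = 1; i < K; i++) {
        gaussian t = g[i];
        int j = i - 1;
        while (j >= 0 && g[j].w * g[j].w / g[j].var < t.w * t.w / t.var) {
            g[j + 1] = g[j];
            j--;
        }
        g[j + 1] = t;
    }
}

/* Classifies intensity x against one pixel's mixture (kept sorted by
   w/sigma), then updates the mixture. Returns 1 for foreground. */
static int mog_update(gaussian g[K], double x)
{
    /* Background = the first b components whose weights sum past T_BG. */
    int b = 0;
    double cum = 0.0;
    while (b < K && cum <= T_BG)
        cum += g[b++].w;

    /* First component within 2.5 standard deviations of x. */
    int match = -1;
    for (int k = 0; k < K && match < 0; k++) {
        double d = x - g[k].mean;
        if (d * d < 6.25 * g[k].var)
            match = k;
    }
    int foreground = (match < 0 || match >= b);

    if (match < 0) {
        /* No match: replace the least probable component. */
        g[K - 1] = (gaussian){ INIT_W, x, INIT_VAR };
    } else {
        for (int k = 0; k < K; k++)
            g[k].w = (1.0 - ALPHA) * g[k].w + (k == match ? ALPHA : 0.0);
        double rho = ALPHA / g[match].w; /* assumed stand-in for alpha*eta */
        assert(rho > 0.0 && rho <= 1.0);
        g[match].mean += rho * (x - g[match].mean);
        double d = x - g[match].mean;
        g[match].var = (1.0 - rho) * g[match].var + rho * d * d;
        if (g[match].var < MIN_VAR)
            g[match].var = MIN_VAR;
    }

    double sum = 0.0; /* renormalise the weights */
    for (int k = 0; k < K; k++)
        sum += g[k].w;
    for (int k = 0; k < K; k++)
        g[k].w /= sum;
    sort_by_fitness(g);
    return foreground;
}
```

Consider a pixel whose dominant Gaussian sits at intensity 100. A value of 101 is classified as background, while 250 matches only a low-weight component and is reported as foreground.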

Fixed Point Matlab simulation

Using the Fixed-Point Toolbox, we redefined our variables and constants in Q format.

Data types:

Weights/other constants: DataTypeMode: Fixed-point: binary point scaling; Signed: true; WordLength: 32; FractionLength: 31

Pixel data: DataTypeMode: Fixed-point: binary point scaling; Signed: true; WordLength: 32; FractionLength: 23

Fixed-point calculations:

RoundMode: nearest; OverflowMode: wrap; ProductMode: SpecifyPrecision; ProductWordLength: 32; ProductFractionLength: 23; SumMode: SpecifyPrecision; SumWordLength: 32; SumFractionLength: 23; CastBeforeSum: true
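The settings above can be mirrored in plain C with 32-bit integers, which is roughly what a port to the Microblaze needs. The sketch below is a hypothetical illustration, not the toolbox itself or the authors' C code. Weights use 31 fraction bits and pixel data 23 fraction bits. Products are reduced to 23 fraction bits, rounded to nearest (ties toward +infinity), and wrapped on overflow. Sums operate on values already cast to the common format. The helper names are assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define FRAC_W   31 /* weights/other constants: 32-bit, 31 fraction bits */
#define FRAC_PIX 23 /* pixel data, products and sums: 23 fraction bits   */

/* Keep the low 32 bits as a two's-complement value (OverflowMode: wrap). */
static int32_t wrap32(int64_t x)
{
    uint32_t u = (uint32_t)x;
    return u <= 0x7FFFFFFFu ? (int32_t)u
                            : (int32_t)(u - 0x80000000u) - INT32_MAX - 1;
}

/* floor(x / 2^s) for s >= 0, without relying on >> of negative numbers. */
static int64_t floor_shift(int64_t x, int s)
{
    return x >= 0 ? x >> s : -((-x - 1) >> s) - 1;
}

/* Real value -> fixed point with `frac` fraction bits, round half up.
   Weights must be below 1.0 to fit in 31 fraction bits. */
static int32_t to_fix(double v, int frac)
{
    double scaled = v * (double)((int64_t)1 << frac) + 0.5;
    int64_t t = (int64_t)scaled;  /* truncates toward zero   */
    if ((double)t > scaled)
        t--;                      /* ... so correct to floor */
    return wrap32(t);
}

/* a (frac_a bits) * b (frac_b bits) -> frac_out bits, nearest + wrap. */
static int32_t fix_mul(int32_t a, int frac_a, int32_t b, int frac_b,
                       int frac_out)
{
    int s = frac_a + frac_b - frac_out; /* fraction bits to drop  */
    int64_t p = (int64_t)a * b;         /* exact 64-bit product   */
    assert(s > 0 && s < 63);
    return wrap32(floor_shift(p + ((int64_t)1 << (s - 1)), s));
}

/* Sum of two values in the same format (CastBeforeSum), wrapped. */
static int32_t fix_add(int32_t a, int32_t b)
{
    return wrap32((int64_t)a + b);
}
```

For example, a weight of 0.5 times a pixel value of 100.0 gives exactly 50.0 in the pixel format. By contrast, 200.0 x 2.0 = 400.0 exceeds the signed 23-fraction-bit range of [-256, 256) and wraps to -112.0. This is the silent behaviour that OverflowMode wrap implies.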

Matlab simulation

C implementation

The code was ported to Xilinx Platform Studio to run on the Microblaze processors.

Simulations show equivalent results. All the PEs contain the same code; they operate on different data coming from different regions of the image.

Pitfalls

The Xilinx VSK design suite promises high-level design of image/video processing using Simulink. We tried it, but it does not provide enough granularity for our design needs. The design becomes very complex to debug, and the sample designs are very hard to tweak.

Xilinx EDK should be used for this kind of design.

Conclusions

We designed different parts of our proposed architecture: the input/output path, the custom memory controller, and the basic processor network.

We simulated and implemented the test algorithm on a network of processors as a proof of concept.

We learnt the FPGA design flow and hardware/software co-design.

Future work

In this work, we used Microblaze processors. Their instruction set is not optimized for pixel/region-based image processing, and they have many extra features that could be trimmed.

A custom processor designed for these applications would need less FPGA area and be more efficient.

References

[1] C. Stauffer, W. E. L. Grimson, "Adaptive Background Mixture Models for Real-Time Tracking," AI, MIT, 1999.

[2] P. Wayne Power, Johann A. Schoonees, "Understanding Background Mixture Models," Image and Vision Computing New Zealand, 2002.

[3] P. Huerta, J. Castillo, J. I. Martinez, "A Microblaze Based Multiprocessor SoC," 2007.

[4] Xilinx, Microblaze Processor Reference Guide, v7.0, UG081.

[5] Xilinx, Virtex II Pro User Guide.

[6] Xilinx, Video Starter Kit (VSK) User Guide.

[7] Xilinx, XAPP529: Connecting Customized IP to the Microblaze Soft Processor Core Using the FSL Link.

[8] Xilinx, EDK 9.1i Microblaze Tutorial: A Getting Started Guide.

[9] Xilinx, White Paper: Multiprocessor on XPS.