Multiprocessor Architecture for Image Processing
Under the guidance of
Dr. Anshul Kumar
Mayank Kumar 2006EE10331
Pushpendre Rastogi 2006EE50412
Introduction
Signal processing on embedded platforms, particularly image/video processing, requires high-end processors to run complex algorithms within real-time deadlines.
Power consumption and cost are the major obstacles to large-scale deployment of embedded processing nodes, e.g. surveillance camera networks, traffic monitoring and control.
FPGAs and reconfigurable ASICs offer a promising solution to the above problem: specific hardware can be designed to exploit the parallelism in an algorithm.
However, there are shortcomings:
Gates get used up when complex algorithms are implemented.
Implementing sequential algorithms directly on an FPGA is highly inefficient.
Our approach
Design a multiprocessor architecture to facilitate the processing of high-resolution image/video frames.
Design a PE (processing element), or node processor, customized to handle pixel/region-level operations efficiently.
Given the PE, design the architecture for interconnecting these processors, along with the input/output hardware.
Novelty
With an array of processors, we exploit the parallelism available from processing different regions of a frame on different processors.
Within each processor, sequential algorithms are implemented efficiently by providing an application-specific instruction set.
Locally sequential and globally parallel.
Locally Sequential Globally Parallel
Applies to any class of algorithms that are window based and operate on regions of the image rather than on the image as a whole:
Image change detection for surveillance applications
Optic flow, motion estimation, filtering, etc.
We chose “Image change detection using Background Modeling” as the test algorithm.
Work Done
Hardware part:
Initial architecture
Drawbacks
Change of platform
New architecture
Implementation
Software part:
Algorithm analysis and implementation
Fixed-point Matlab simulation
C implementation
Initial Architecture
[Figure: block diagram titled "Array Topology". Labels: Camera, Video ADC, Virtex II Pro, RGB Conversion, PowerPC, a 3x3 grid of M1 blocks, Memory, Video DAC, MPMC, Monitor.]
Architectural Drawbacks
The multiprocessor memory controller can handle only a limited number (2-4) of parallel accesses from different processors.
Solution: use BRAM for parallel access.
The image format on the XUPV30 is interlaced, so the whole frame must be stored, which would use up all available BRAMs.
Solution: use a board that provides progressive data. Moreover, digital cameras these days all provide progressive image data.
Change of Platform
We switched to the Xilinx ML401 Virtex Video Starter Kit, which provides:
Progressive video input
Much more BRAM
Matlab/Simulink as a design platform for designing at a higher abstraction level
However, switching platforms cost time because of the associated learning curve.
New Architecture
[Figure: block diagram of the new architecture. Labels, in order: Camera, Video ADC, VIO_in, Custom Memory Controller (Verilog module), Array of Block RAM, Array of Processor Network, VIO_in, Video DAC, Monitor.]
Description and Implementation
The ML401 VSK provides two FPGAs:
Xilinx XUP2V7 for image input/output
Xilinx ML401 for developing the application
VIO_in and VIO_out are reference designs that sandwich the user-level design and provide progressive image data.
We designed a custom memory controller suited to our needs. It writes data to FIFOs implemented using BRAMs.
Custom Memory Controller
Takes H_sync, v_sync, rst and Pixel_clk as inputs and selects a target FIFO to write the incoming data to.
Each BRAM stores the image data for 4 lines.
It first empties the queue, reading out the result computed in the last iteration.
The other end of each FIFO is read by a Microblaze processor over FSL links.
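The FIFO selection described above can be sketched as a small software model. This is only an illustration in C, not the Verilog module itself: the struct, the function names, the assumption that H_sync marks the end of a line, and the round-robin wrap over `num_fifos` are all hypothetical. Only the figure of 4 lines per BRAM comes from the slide.

```c
#include <assert.h>
#include <stdint.h>

#define LINES_PER_BRAM 4u  /* from the slide: each BRAM holds 4 image lines */

/* Hypothetical software model of the controller's line counter. */
typedef struct {
    uint32_t line;       /* index of the current line within the frame */
    uint32_t num_fifos;  /* number of BRAM FIFOs (assumed parameter) */
} mem_ctrl_t;

/* v_sync is assumed to mark the start of a frame: restart the line count. */
static void on_v_sync(mem_ctrl_t *mc) { mc->line = 0; }

/* H_sync is assumed to mark the end of a line: move on to the next line. */
static void on_h_sync(mem_ctrl_t *mc) { mc->line++; }

/* FIFO that should receive the pixels of the current line.
 * Groups of 4 consecutive lines share a FIFO; wrapping back to FIFO 0
 * after the last one is an assumption. */
static uint32_t target_fifo(const mem_ctrl_t *mc)
{
    assert(mc->num_fifos > 0);
    return (mc->line / LINES_PER_BRAM) % mc->num_fifos;
}
```

With 3 FIFOs, lines 0-3 would go to FIFO 0, lines 4-7 to FIFO 1, lines 8-11 to FIFO 2, and line 12 would wrap back to FIFO 0.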
Processor Network
Each processor network comprises one master processor and 1-7 slave processors.
The master processor reads data from the FIFO and distributes the work among the slave processors.
We demonstrated this using 3 processors: 1 master and 2 slaves.
Processor Network Basic Design
We connected the master processor to a UART to establish a serial link for input/output.
The master processor is connected to slave processors, all of which run the same algorithm.
It takes input from the UART and passes it to the different slaves.
The master processor distributes work by sending different regions of the image to different processors.
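One way the master could split a frame among its slaves is by horizontal bands. The sketch below is an assumption for illustration: the slides do not say how regions are shaped, and `region_t` and `region_for_slave` are hypothetical names. It leaves out the actual transfer to the slaves.

```c
#include <assert.h>

/* A horizontal band of the frame: rows [first_row, first_row + num_rows). */
typedef struct {
    int first_row;
    int num_rows;
} region_t;

/* Band assigned to slave number `slave` (0-based) out of `num_slaves`
 * for a frame of `height` rows. When the height does not divide evenly,
 * the first (height % num_slaves) slaves each get one extra row. */
static region_t region_for_slave(int height, int num_slaves, int slave)
{
    assert(num_slaves > 0 && slave >= 0 && slave < num_slaves);
    int base  = height / num_slaves;
    int extra = height % num_slaves;
    region_t r;
    r.first_row = slave * base + (slave < extra ? slave : extra);
    r.num_rows  = base + (slave < extra ? 1 : 0);
    return r;
}
```

For the demonstrated setup of 2 slaves and, say, a 480-row frame, slave 0 would get rows 0-239 and slave 1 rows 240-479.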
Software Architecture
Studied the Adaptive Background Mixture Model [1], [2].
Analyzed the algorithm for:
Parallelism that can be exploited
Length of code needed for implementation
Memory required to store data
Feasibility
The Algorithm
Models each region of the image frame as a sum of N Gaussians, each with an attached weight.
Updates the model when a new frame arrives.
Makes the foreground/background decision depending on which Gaussian distribution (k) the current pixel data belongs to.
Effectively models repetitive changes in the background.
Resistant to noise and slow illumination variations.
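The update and decision steps can be sketched for a single grayscale pixel as follows. This is a simplified floating-point illustration in the spirit of [1], not the project's code: the constants, the use of one learning rate for weights, means and variances, and the weight-threshold background test are all assumptions, and the names are hypothetical.

```c
#include <assert.h>

#define K         3        /* Gaussians per pixel (assumed) */
#define ALPHA     0.05f    /* learning rate (assumed) */
#define INIT_VAR  900.0f   /* variance given to a new Gaussian (assumed) */
#define INIT_W    0.05f    /* weight given to a new Gaussian (assumed) */
#define BG_WEIGHT 0.3f     /* minimum weight to count as background (assumed) */

typedef struct { float w[K], mu[K], var[K]; } pixel_model_t;

/* Update the model with pixel value x.
 * Returns 1 for foreground, 0 for background. */
static int update_pixel(pixel_model_t *m, float x)
{
    int match = -1;
    /* A Gaussian matches if x lies within 2.5 standard deviations. */
    for (int k = 0; k < K; k++) {
        float d = x - m->mu[k];
        if (match < 0 && d * d < 6.25f * m->var[k]) match = k;
    }
    if (match < 0) {
        /* No match: replace the lowest-weight Gaussian with one centred on x. */
        int lo = 0;
        for (int k = 1; k < K; k++) if (m->w[k] < m->w[lo]) lo = k;
        m->mu[lo] = x; m->var[lo] = INIT_VAR; m->w[lo] = INIT_W;
    } else {
        /* Match: pull the mean and variance towards the new sample. */
        m->mu[match] += ALPHA * (x - m->mu[match]);
        float d = x - m->mu[match];
        m->var[match] += ALPHA * (d * d - m->var[match]);
    }
    /* Decay all weights, reinforce the matched one, then renormalise. */
    float sum = 0.0f;
    for (int k = 0; k < K; k++) {
        m->w[k] = (1.0f - ALPHA) * m->w[k] + (k == match ? ALPHA : 0.0f);
        sum += m->w[k];
    }
    assert(sum > 0.0f);
    for (int k = 0; k < K; k++) m->w[k] /= sum;
    return !(match >= 0 && m->w[match] >= BG_WEIGHT);
}
```

A pixel close to a heavily weighted Gaussian is reported as background, while a value that matches none of the Gaussians is reported as foreground and seeds a new low-weight Gaussian.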
Fixed Point Matlab simulation
Using the Fixed-Point Toolbox, we redefined our variables and constants in Q format.
Data Types:
Weight/other constants: DataTypeMode: Fixed-point: binary point scaling, Signed: true, WordLength: 32, FractionLength: 31
Pixel data: DataTypeMode: Fixed-point: binary point scaling, Signed: true, WordLength: 32, FractionLength: 23
Fixed Point Calculations
RoundMode: nearest
OverflowMode: wrap
ProductMode: SpecifyPrecision
ProductWordLength: 32
ProductFractionLength: 23
SumMode: SpecifyPrecision
SumWordLength: 32
SumFractionLength: 23
CastBeforeSum: true
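A C port has to reproduce these settings by hand. The sketch below shows one possible way to multiply under them, assuming 32-bit words with 23 fraction bits for pixel data and products and 31 fraction bits for weights. The type and function names are hypothetical, and the code relies on arithmetic right shift of negative values and wrapping narrowing casts, which mainstream compilers provide but the C standard leaves implementation-defined.

```c
#include <assert.h>
#include <stdint.h>

#define PIX_FRAC 23  /* FractionLength of pixel data, products and sums */
#define W_FRAC   31  /* FractionLength of weights and other constants */

typedef int32_t q23_t;  /* signed, 32-bit word, 23 fraction bits */
typedef int32_t q31_t;  /* signed, 32-bit word, 31 fraction bits */

/* Convert a small integer to Q23; the format only covers (-256, 256). */
static q23_t q23_from_int(int v)
{
    assert(v > -256 && v < 256);
    return (q23_t)(v * (1 << PIX_FRAC));
}

/* Q23 * Q23 -> Q23. Adding half an LSB before the shift rounds to nearest;
 * the cast back to 32 bits discards high bits (wrap on overflow). */
static q23_t q23_mul(q23_t a, q23_t b)
{
    int64_t p = (int64_t)a * b;              /* 46 fraction bits */
    p += (int64_t)1 << (PIX_FRAC - 1);
    return (q23_t)(p >> PIX_FRAC);
}

/* Q31 weight * Q23 pixel value -> Q23, same rounding and wrapping. */
static q23_t q31_mul_q23(q31_t w, q23_t x)
{
    int64_t p = (int64_t)w * x;              /* 54 fraction bits */
    p += (int64_t)1 << (W_FRAC - 1);
    return (q23_t)(p >> W_FRAC);
}
```

For example, a weight of 0.5 is stored as `1 << 30` in Q31, and multiplying it by `q23_from_int(100)` yields `q23_from_int(50)`.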
Matlab simulation
C implementation
The code was ported to Xilinx Platform Studio to run on the Microblaze processors.
Simulations show equivalent results.
All the PEs contain the same code; they operate on different data coming from different regions of the image.
Pitfalls
The Xilinx VSK design suite promises high-level design of image/video processing using Simulink.
We tried using it, but it does not provide enough granularity for our design needs.
Designs become very complex to debug.
The sample design is very tough to tweak.
Xilinx EDK should be used for this kind of design.
Conclusions
We designed the different parts of our proposed architecture:
Input/output
Custom memory controller
Basic processor network
We simulated and implemented the test algorithm on a network of processors as a proof of concept.
We learnt the FPGA design flow and hardware/software co-design.
Future work
In this work, we used Microblaze processors.
Their instruction set is not optimized for pixel/region-based image processing.
They have many extra features that can be trimmed.
Design a custom processor suited to these applications:
Needs less FPGA area
More efficient
References
[1] Adaptive Background Mixture Models for Real-time Tracking, Chris Stauffer, W.E.L. Grimson, AI Lab, MIT, 1999
[2] Understanding Background Mixture Models, P. Wayne Power, Johann A. Schoonees, Image and Vision Computing NZ, 2002
[3] A Microblaze based Multiprocessor SoC, P. Huerta, J. Castillo, J.I. Martinez, 2007
[4] Xilinx Microblaze Processor Reference V7.0, UG081
[5] Xilinx Virtex II Pro User Guide
[6] Xilinx Video Starter Kit (VSK) User Guide
[7] Xilinx XAPP529: Connecting Customized IP to the Microblaze Soft Processor Core using FSL Link
[8] EDK 9.1i Microblaze Tutorial: A Getting Started Guide
[9] Xilinx White Paper: Multiprocessor on XPS