A TMD-MPI/MPE Based Heterogeneous Video System
by Tony Ming Zhou
Supervisor: Professor Paul Chow
April 2010
Abstract:
Advancements in FPGA technology have enabled large-scale reconfigurable hardware system designs. In recent years, heterogeneous systems comprised of embedded processors, memory units, and a wide variety of IP blocks have become an increasingly popular approach to building future computing systems. The TMD-MPI project has extended the standard software message passing interface, MPI, to the scope of FPGA hardware design. It provides a new programming model that enables transparent communication and synchronization between tasks running on heterogeneous processing devices in the system. In this thesis project, we present the design and characterization of a TMD-MPI based heterogeneous video processing system comprised of hardware peripheral cores and software video codecs. By hiding low-level architectural details from the designer, TMD-MPI improves development productivity and reduces the level of difficulty. In particular, with the abstraction TMD-MPI provides, the software video codec approach is an easy entry point into hardware design. The primary focus is the functionality and the different configurations of the TMD-MPI based heterogeneous design.
Acknowledgements
I would like to thank my supervisor, Professor Paul Chow, for his patience and
guidance, and the University of Toronto and the Department of Engineering Science
for the wonderful five-year journey that led me to this point. Special thanks go to the
TMD-MPI research group, in particular to Sami Sadaka, Kevin Lam, Kam Pui Tang,
and Manuel Saldaña. Last but not least, I would like to thank my family and my
friends for always being there for me. Stella, Kay, Grace, Amy, Rui, Qian, Chunan, and
David, you have painted the colours of my university life.
Glossary

MPI: Message Passing Interface
API: Application Program Interface
FIFO: First-In-First-Out
NetIf: Network Interface
MPE: Message Passing Engine
TMD: Originally meant Toronto Molecular Dynamics machine, but this definition was rescinded as the platform is not limited to Molecular Dynamics. The name was kept in homage to earlier TM-series projects at the University of Toronto
VGA: Video Graphics Array
RGB: Red-Green-Blue Colour Model
FPS: Frames Per Second
DVI: Digital Visual Interface
FSL: Xilinx Fast Simplex Link
HDL: Hardware Description Language
TX: Transmission/Transmitting
RX: Reception/Receiving
MPMC: Multi-Ported Memory Controller
BRAM: Xilinx Block RAM
PLB: Processor Local Bus
Contents

1 Introduction
  1.1 Motivation
  1.2 Objectives
2 Background
  2.1 Literature Review
  2.2 Distributed/Shared Memory Approaches
  2.3 The Building Blocks
3 Methods and Findings
  3.1 The Video System in Software
  3.2 The Video System on FPGA
    3.2.1 System Block Diagram
    3.2.2 Distributed Memory Model
    3.2.3 Shared Memory Model
4 Discussions and Conclusions
  4.1 Software vs. Hardware
  4.2 Conclusions and Future Directions
References
Appendix A: Video System – Software Prototype
Appendix B: Video System – Hardware System
Appendix C: File Structure
1 Introduction
1.1 Motivation
Chip development has become increasingly difficult due to transistor
physical scaling limitations. Parallel processing stands out as one of the best
alternative solutions for performance improvements. Message Passing Interface, or
MPI, is a specification for an API that allows computers to communicate with one
another. After over a decade of development, it has become the de facto standard for
communications among software processes that model a parallel program with
distributed memory.
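For readers unfamiliar with the model, the sketch below illustrates the rank-based point-to-point communication that MPI programs are built on. It uses only standard MPI calls and is not specific to TMD-MPI; the payload and rank numbers are arbitrary.

    /* Minimal sketch of MPI point-to-point messaging (standard MPI,
       not TMD-MPI specific). Compile with an MPI wrapper, e.g. mpicc. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, data = 0;
        MPI_Init(&argc, &argv);               /* join the MPI environment */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* each process gets a rank */

        if (rank == 0) {
            data = 42;
            MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* to rank 1 */
        } else if (rank == 1) {
            MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                         /* from rank 0 */
            printf("rank 1 received %d\n", data);
        }
        MPI_Finalize();
        return 0;
    }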
Hardware engines are generally better suited for parallel applications
compared to software. Modern day FPGA technology has enabled advances in
hardware design. With the aid of HDLs and the FPGA's reprogrammability, software
programs can now be accelerated in hardware without the high cost of ASIC design.
However, unlike low-level hardware design, high-level system integration can
be complex and time consuming. Professor Paul Chow and his research group at the
University of Toronto have built a lightweight subset implementation of the MPI
standard called TMD-MPI. It provides software and hardware middleware layers of
abstraction for communication, enabling portable interaction between embedded
processors, computing engines (CEs), and x86 processors. Previous work demonstrated
that TMD-MPI is a feasible high-level programming model for multiple embedded
processors, but complex systems with heterogeneous processing units had yet to be
tested [1].
1.2 Objectives
The development of TMD-MPI is still in its infancy compared to the MPI
standard; implementations and characterizations of designs are lacking. This
undergraduate thesis project attempts to fill this gap by means of the design and
characterization of a TMD-MPI based heterogeneous video system.
Once a simple, feasible heterogeneous system had been successfully demonstrated,
this thesis focused on expanding the software element network to exploit more
parallelism.
2 Background
2.1 Literature Review
Although heterogeneous systems offer numerous performance and energy
advantages, design complexity remains a major factor limiting their use. Successful
designs require developer expertise in multiple languages and tools. For instance, a
typical FPGA heterogeneous system engineer must know HDLs, software coding, the
interface details between source and destination engines, CAD tools, and
vendor-specific FPGA details. Ideally, a specialized hardware/software element
of the system could be designed independently from the other elements, yet still be
portable and easily integrated into the overall system. TMD-MPI achieves this by
abstracting away the details of coordination between different task-specific
elements, and in addition it provides an easy-to-use entry point.
Similar attempts have been made by OpenFPGA and NSF CHREC:
a) OpenFPGA released its General API Specification 0.4 in 2008 in an attempt to
propose an industry-standard API for high-level language access to reconfigurable
FPGAs in a portable manner [2]. The scope of TMD-MPI is much larger: the types of
interaction in GenAPI are very limited, as it focuses only on low-level x86-FPGA
interaction and does not deal with higher levels.
b) NSF CHREC, on the other hand, developed a conceptually similar framework
adopting the message-passing approach. A closer inspection reveals the
differences: the hardware and software elements in their SCF heterogeneous design
are statically mapped to one another [3]. In contrast, the mapping of TMD-MPI nodes
is dynamically defined, meaning that point-to-point communication paths can be
redirected at run-time for more versatility.
2.2 Distributed/Shared Memory Approaches
The primary goal of this thesis is functionality rather than performance.
Speed and performance considerations aside, two high-level approaches can be
adopted.
The first is a distributed memory system, where every processing unit is
equipped with local memory. Because local data is not accessible by ranks
other than its owner, the video frames must be passed as messages from one
rank to another. Video streaming has a unidirectional flow of data, which makes it a
well-suited application for the distributed memory approach.
The second is a shared memory approach, where all the processing units
share a common memory space. Provided that the video data is properly
managed in memory by a special engine and the memory interface is not port-limited,
the memory contents are accessible by all the processing units. As a result,
assigning tasks to processing units can be as simple as passing memory addresses.
The desired video frame has a size of 640 px by 480 px, in 32-bit RGB format,
which is equivalent to 1200 kilobytes. If the desired frame rate is 30 FPS, a
network that carries full frames as messages must handle roughly
1200 KB x 30 = 35 MB/s. The shared memory approach introduces a far less
traffic-intensive way of communicating by passing 32-bit addresses as messages:
the simplest application may only need to broadcast a base memory address
to all processing units, resulting in total network data traffic of 32 bits per rank.
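To make the traffic difference concrete, the fragments below contrast the two message payloads. This is an illustrative sketch only; the helper names and rank arguments are not identifiers from the actual system.

    #include <mpi.h>

    #define FRAME_WORDS (640 * 480)  /* one 32-bit word per pixel: 1200 KB */

    /* Distributed model: the whole frame travels over the network. */
    static void send_frame_distributed(unsigned *frame, int dest)
    {
        MPI_Send(frame, FRAME_WORDS, MPI_UNSIGNED, dest, 0, MPI_COMM_WORLD);
    }

    /* Shared model: only a 32-bit base address travels over the network. */
    static void send_frame_shared(unsigned frame_base_addr, int dest)
    {
        MPI_Send(&frame_base_addr, 1, MPI_UNSIGNED, dest, 0, MPI_COMM_WORLD);
    }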
The drawbacks of the shared memory approach are longer memory access
times and proneness to data corruption. By comparison, the shared memory
approach is more asynchronous in nature: if a section of memory is assigned to
more than one codec, race conditions can cause premature or delayed
memory updates. Moreover, if the FPGA is not robust and a bit flips in a
message on the network, a bad memory address in the shared memory model is
more likely to result in a catastrophic failure than a bad pixel in the distributed
memory model.
2.3 The Building Blocks
In this project, point-to-point communication channels are implemented
using Xilinx Fast Simplex Links (FSLs), which are essentially FIFOs. The FIFO is a
powerful abstraction for on-chip communication and is able to handle the bandwidth
of this video system. Because FSLs are unidirectional, they are implemented in pairs
for the transmission and reception of data. In the special case of a CE's interface
with the TMD-MPE, an extra pair of command FIFOs is required exclusively for MPI
commands (Fig. 1).
Figure 1. The FIFO pairs act as point-to-point communication channels.
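On the Microblaze side, FSLs are accessed through the Xilinx mb_interface.h macros. The fragment below is a minimal sketch of raw word transfers on FSL channel 0, shown only to make the FIFO abstraction concrete; in the actual system these accesses are wrapped by the TMD-MPI library, and the channel number is illustrative.

    #include <mb_interface.h>

    /* Read one 32-bit word from RX FSL 0 and echo it to TX FSL 0. */
    void fsl_echo_word(void)
    {
        unsigned int word;
        getfsl(word, 0);   /* blocking read from FSL channel 0 */
        putfsl(word, 0);   /* blocking write to FSL channel 0  */
    }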
The software elements are implemented using Xilinx Microblaze soft-core
processors. The message passing protocol is brought in at compile time by including
the TMD-MPI library in the source code. An additional message passing engine, or MPE,
was also created to perform a subset of the message passing functions in hardware.
The hardware elements can be designed in any HDL, but always require a TMD-MPE
to connect to the network.
Figure 2. Simplified NetIf diagram showing the TX and RX muxes.
Left, RX half with RX FIFOs. Right, TX half with TX FIFOs.
To enable dynamic mapping of the nodes, extra state machines would be
required in each element/core. The Network Interface (NetIf) block provides these
path-routing functions, eliminating the need for extra states in each core.
Another issue arises as the system grows large: the Microblaze soft processor has
only eight FSL channels available, limiting the number of nodes that can be connected
for communication. The NetIf block also solves this issue by extracting the
channel multiplexing away from the Microblazes. It functions as a multiplexer controlled
by the destination rank in the TX half, and a demultiplexer controlled by the source
rank in the RX half (Fig. 2). In terms of connections, one side of the NetIf connects
to the current node through a pair of Xilinx FSLs; the other side may be
connected to all the other NetIfs for maximum freedom (Fig. 3a), or to fewer channels
for improved performance at the cost of reduced system visibility (Fig. 3b).
Figure 3. Different ways of interconnecting the nodes; a NetIf is attached to each element:
a (left), b (right).
Lastly, this thesis project was built upon a video system made available by
Professor Paul Chow's summer student Jeff Goeders. The three-node system is a
scalable TMD-MPI based video processing framework; it currently supports video
streaming from the VGA port to the DVI port on the Xilinx Virtex-5 board [5].
3 Methods and Findings
3.1 The Video System in Software
Implementing TMD-MPI requires first understanding how to use the MPI
specification. Logically, the first step was to build a video frame
processor prototype entirely in software.
This prototype application utilizes the multiple CPU cores available on a PC;
if the available cores are insufficient, the extra ranks are simulated
as additional processes on the existing cores. In the application, input and output
video frames were replaced with bitmap images in the same RGB format, the
processing codec was coded in C++, and both the distributed memory and shared
memory models were implemented with the aid of readily available software
libraries.
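The prototype follows the master/worker structure detailed in Appendix A. The skeleton below is a simplified sketch of that structure; the task array layout follows Appendix A, but the tag value and loop bounds are illustrative.

    #include <mpi.h>

    #define TASKSIZE 8
    #define TAG      1   /* illustrative tag value */

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            /* Rank 0: build a task descriptor and hand one to each worker. */
            int task[TASKSIZE] = {0, 1, 640 * 480, 0, 0, 0, 0, 0};
            for (int dest = 1; dest < size; dest++)
                MPI_Send(task, TASKSIZE, MPI_INT, dest, TAG, MPI_COMM_WORLD);
        } else {
            /* Workers: wait for a task, then run the selected codec on it. */
            int task[TASKSIZE];
            MPI_Recv(task, TASKSIZE, MPI_INT, 0, TAG, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            /* ... decode task[] and apply the codec (see Appendix A) ... */
        }
        MPI_Finalize();
        return 0;
    }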
Results:
Two instances of the application are shown below in Figures 4 and 5. Memory
activity has been omitted from the figures. The black arrows in Figure 4 symbolize the
flow of actual frame data as messages, whereas the black arrows in Figure 5
symbolize the flow of memory pointers as messages. Although streaming is better
suited to a distributed memory model, the streaming and parallel processing
applications can be interchanged between the two models.
Figure 4. Streaming using the distributed memory message passing model (codec stages: add noise, invert colour, flip image).
Figure 5. Parallel processing using the shared memory message passing model (data is passed to ranks as pointers in memory; the processed data is sent to the output node as a pointer).
3.2 The Video System on FPGA
3.2.1 System Block Diagram:
Figure 6. The heterogeneous video processing system block diagram
The system block diagram in Figure 6 lays out the overall picture of the
heterogeneous video processing system. The Network Interface Block in the centre,
composed of many interconnected NetIfs, is shown as one big block for simplicity. The
region above the NetIf Block consists of the hardware elements; hardware CEs
interface with the system network through TMD-MPEs. Rank 1 is a video decoder
that takes the VGA input and places it onto the system network, and Rank 2 is a video
receiver that takes the video frames from the network and stores them in the external
memory. All of the software elements reside below the NetIf Block: the
special Rank 0 process and a network of Microblazes running several specialized
codec-related processes labelled Rank 3-N. Software elements interface with the system
network through the TMD-MPI software library.
The 256 MB of external memory is not only ported for Rank 2's frame
storage and the DVI-out core's video output; it is also useful when a hardware
engine's local memory is insufficient, or when a shared memory model is implemented.
The MPMC is the memory interface between the system and the external memory.
In MPI, nodes identify each other by rank; sending and receiving
messages requires source and destination ranks. Rank 0 acts as the central command
centre of the system: it initializes all the ranks at the start of runtime, and it
configures and reconfigures the mapping between ranks during runtime. In the
codec network of Microblazes labelled Rank 3-N, the hardware settings may or may
not be identical, depending on the peripheral devices needed by each process. In
general, the quickest and most efficient way to implement the codec processes is to
duplicate a fully functioning Microblaze along with its peripherals and settings, and
load each copy with different source code for codec-specific functions.
3.2.2 Distributed Memory Model:
Jeff Goeders' framework demonstrated video streaming from a video decoder
core (Rank 1) to a memory storage core (Rank 2), with the video eventually output
to a DVI port. The communication channel between the two cores is established
using the TMD-MPI rank-to-rank model described earlier. Therefore, a Microblaze
(Rank 3) can be inserted into the streaming path between the two cores simply by
changing the destination rank of Rank 1 and the source rank of Rank 2. Functionally,
the Microblaze must perform the following tasks (a simplified loop is sketched after
the list):
- Receive frames from Rank 1 in units of 640 x 480 frames
- Apply the video codec effects
- Send the modified frames out to Rank 2 in units of 640 x 480 frames
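A plausible shape for this loop is sketched below, assuming the frame buffer resides in external memory and using colour inversion as a stand-in codec; the function name and buffer handling are illustrative, not the project's actual code.

    #include <mpi.h>

    #define FRAME_WORDS (640 * 480)   /* one 32-bit pixel per word */

    /* Hypothetical Rank 3 codec loop: receive a frame from the decoder
       (Rank 1), process it, and forward it to the storage core (Rank 2). */
    void codec_loop(unsigned int *frame /* buffer in external memory */)
    {
        for (;;) {
            MPI_Recv(frame, FRAME_WORDS, MPI_UNSIGNED, 1, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            for (int i = 0; i < FRAME_WORDS; i++)
                frame[i] = ~frame[i];        /* e.g. colour inversion */
            MPI_Send(frame, FRAME_WORDS, MPI_UNSIGNED, 2, 0, MPI_COMM_WORLD);
        }
    }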
The requirements above pose several challenges. First, the Microblaze
must operate at more than double the rate at which Rank 1 and Rank 2 respectively
send and receive data, and extra clock cycles are needed for the video processing
itself. The most direct solution is to parallelize the task by dividing each frame into
smaller units that are distributed equally among multiple Microblazes.
The second challenge is the local memory constraint on the Microblaze.
Messages are sent and received in units of whole frames, 1200 KB in size. Unlike the
hardware engines, the Microblaze suffers from its higher-level interaction with the
FIFOs: hardware engines can receive and send data one 32-bit FIFO entry at a time,
but a Microblaze must receive each 1200 KB message as a whole. The
distributed memory and BRAM available to the Microblaze are both too small for local
storage; the only option is the external off-chip memory. Note that using an external
memory slightly violates the principles of a true distributed local memory model.
Results:
As expected, a single-Microblaze codec suffers from poor performance.
Although the soft-core processor operates at the same 100 MHz system clock
frequency, the effective video streaming rate works out to only 1-10 MHz (Fig. 7).
Figure 7. The Microblaze as the bottleneck in the streaming path.
Three factors contribute to the slowdown of the Microblaze. First,
Microblazes interface with the Xilinx FSLs less efficiently than hardware does; the
extra clock cycles needed by the Microblaze recur for each 32-bit data entry, i.e.
for each pixel on the FIFO. Second, due to the large size of a video frame (1200 KB), a
local memory approach is not applicable, so an external off-chip memory is used instead.
Because the external memory must share the PLB bus with other peripheral devices, bus
arbitration and the extra traffic introduce significant delay (Fig. 8); furthermore,
the memory is off the FPGA chip, and both the complex MPMC interface and the long
physical distance translate into more delay. The last factor is the implicit sequential
execution of instructions in a conventional processor [1]. However, the contribution of
this last factor is small, since a well-designed video codec should be well cached and
pipelined by the processor.
Figure 8. The peripheral devices that are connected to the Microblaze in this video system.
The number of remaining ports on the MPMC limits the number of codec
Microblazes to six. Even if they were perfectly parallelized, the combined effective
frequency would be 60 MHz, still short of the speed of the other cores. Clearly, the
design needed a new direction: the shared memory model.
3.2.3 Shared Memory Model:
The shared memory approach, as described in an earlier section, enjoys the
benefit of significantly reduced network traffic at the cost of memory access time.
Implementations of the model may vary, but the key idea is that only a
finite number of messages is sent to the codec Microblazes. In this project, six 32-bit
words are sent to each codec Microblaze, carrying the base and high
addresses of the video frame, the type of codec to run, and other control signals. The
number of Xilinx FSL interface delay cycles associated with these six messages is
negligible compared to the per-pixel delay cycles of the
distributed memory model.
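A plausible shape for this control exchange is sketched below. The six-word layout (base address, high address, codec selector, and control words) follows the description above, but the exact field order and the function name are assumptions.

    #include <mpi.h>

    /* Sketch of the shared memory control message: six 32-bit words sent
       to one codec Microblaze. The field order here is an assumption. */
    void send_control(int codec_rank, unsigned base, unsigned high,
                      unsigned codec_type)
    {
        unsigned ctrl[6] = { base, high, codec_type, 0, 0, 0 };
        MPI_Send(ctrl, 6, MPI_UNSIGNED, codec_rank, 0, MPI_COMM_WORLD);
    }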
Note that the shared memory approach is at a disadvantage in memory access
time only when compared to a true distributed memory model. Due to the limited
local memory available to the Microblazes, an external memory has been used for the
distributed memory system in this project. Given that, a simple analysis shows that
the two models exhibit practically identical memory access times: the distributed
memory Microblaze first writes to memory as it receives the data, then reads the data
back for processing and for transfer to the next node; the shared memory Microblaze
first reads the data for processing, then updates the memory contents. Exactly one
read and one write take place in both models.
The same framework by Jeff Goeders was used as the groundwork on which the
shared memory model is built. The codec-related Microblazes are placed in the
system without any source or destination changes to Rank 1 and Rank 2. Memory space
management becomes a crucial task in this model: the memory spaces of the video
storage core and the DVI-out core are separated from each other, and the codec
Microblaze's memory accesses span both spaces, since it is the agent that transfers the
data from one space to the other (Fig. 9).
Figure 9. Microblaze spans two spaces as both the video processor and transferor.
Results:
The speed improvement is evident and, as far as the measurements show,
linearly scalable (Fig. 10). The frame rate was measured for up to four Microblazes,
and a linear trend was observed. The codec effects include a darkening
effect, addition of noise, colour inversion, and colour change. The spread in the data is
expected, because different codecs were applied during the multiple runs.
The FPS was measured by a special function within each Microblaze, as
opposed to being measured at the DVI-out core. Therefore, despite the limited
number of available MPMC ports, the FPS benchmark for a larger number of
Microblazes can still be simulated by reducing the task handled by each Microblaze.
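A measurement of this kind can be done with the free-running xps_timer counter mentioned in Appendix B. The sketch below assumes the 100 MHz system clock and a hypothetical counter register address; neither the address nor the function is taken from the project's source.

    #include <stdio.h>

    #define CLK_HZ 100000000u   /* assumes the 100 MHz system clock */

    /* Hypothetical pointer to the xps_timer count register. */
    static volatile unsigned int *TIMER_COUNTER =
        (volatile unsigned int *)0x83C00008;

    /* Report frames per second since the previous call. */
    void report_fps(unsigned int frames_done)
    {
        static unsigned int last_ticks = 0;
        unsigned int now = *TIMER_COUNTER;
        unsigned int ticks = now - last_ticks;
        if (ticks > 0)
            printf("FPS: %u\n",
                   (unsigned)((unsigned long long)frames_done * CLK_HZ / ticks));
        last_ticks = now;
    }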
In the shared memory model, most of the message passing occurs between the
special Rank 0 node and the codec Microblazes. Extra measures must be taken to
ensure the proper sequencing of tasks for data correctness; as a result, the software
code in this model tends to be more complex.
Figure 10. Measured frame rate versus the number of codec Microblazes.
4 Discussions and Conclusions
4.1 Software vs. Hardware
All of the software cores except Rank 0 are replaceable with hardware
cores. One of the main goals is to assess the effectiveness of the software approach
to TMD-MPI programming.
The functionality of the TMD-MPI library far exceeds that of its hardware
counterpart, the TMD-MPE. There are 25 MPI commands available in the TMD-MPI
library, whereas the TMD-MPE supports only 3 MPI commands in total: synchronous
send, asynchronous send, and receive.
For the design of these codecs, given the same task and clock, a specialized
hardware engine is expected to outperform the software process. Hardware engines
can be better optimized for specialized tasks and are better suited to parallel
applications; processors, on the other hand, are general-purpose oriented, making
them less suited to specialized tasks. Moreover, the experimental results in the previous
sections suggest that the software processes incur more delay, which is
another major factor limiting a software process's performance.
In terms of development speed and difficulty, the software method has clear
advantages because of better scalability and reduced compile time. For instance,
functionally different processors are structurally identical in hardware, so scaling
the software processes is as simple as duplicating an existing Microblaze and its
settings. Compared to hardware development, compile and debugging times in
software are significantly reduced: on average, regenerating the system bitstream
after modifying a codec takes 1 minute for a software change but 90 minutes for a
hardware change.
Well-designed hardware engines should occupy less area, but software-core-based
codecs have lower development costs. The cost function should not be
evaluated simply on chip area and development cost; there are many other
contributing factors. Thus, the cost comparison is inconclusive.
The comparisons drawn above are summarized in the following table:
               Software       Hardware
Functionality  Very good      Bad
Performance    Slow           Very fast
Development    Fast           Slow
Cost           Inconclusive   Inconclusive
4.2 Conclusions and Future Directions
This thesis project is an implementation of a TMD-MPI based heterogeneous
video processing system. More specifically, the video processing units were
implemented using Microblaze soft-core processors executing C programs in
parallel. The characterization and analysis have demonstrated that TMD-MPI is a
feasible and efficient approach to heterogeneous system design.
The performance drawbacks of the software processes limited the data
throughput. Since TMD-MPI provides a scalable way to enable parallel processing,
performance can be improved by duplicating the current software processes. Although
the shared memory model outperformed the distributed memory model, a true
distributed local memory model was not achieved, so the comparison is
slightly unfair. Finally, the shared memory model provides more abstraction for the
developer: by passing memory addresses as messages, it involves less network
traffic and fewer hardware modifications. Based on experience throughout this
project, the shared memory model is the more scalable solution.
The TMD-MPI programming model is common to both software and hardware.
For the MPI commands, a simple script could be written to convert between TMD-MPI
and TMD-MPE commands. As C-to-HDL technology advances, automatic software
code conversion becomes possible. The TMD-MPI approach suggests the possibility of an
efficient, automated method of hardware development for designers with little
hardware background.
The following is a list of future directions:
- Implement more MPI commands in the TMD-MPE for better hardware
functionality. Since many functions are built upon the three basic ones
available in the MPE, a higher-level MPE that utilizes the TMD-MPE's basic
functions could be introduced.
- Expand the software codec network, and try more complex structures and
parallelization for characterization. Because of the limited number of
MPMC ports, the codec network of Microblazes may adopt both the
distributed and shared memory models. As the control signals for Rank 0
get complicated, local hierarchy methodologies like the tree structure in
Fig. 3b might be needed.
- Implement a multi-board system to explore a higher level of scalability.
The building pieces are already available: Sami Sadaka has a homogeneous
video processing system, and Kevin Lam has a gigabit Ethernet bridge.
- Develop a cost-effective method of automated C-to-HDL conversion so that
developers can enjoy both the benefits of efficient software development and
hardware-accelerated performance. One might wish to conduct a study
of tools such as the Nios II C-to-Hardware Acceleration Compiler, Impulse C,
and FPGAC. Although a completely different research topic, it is one of
great interest for the TMD-MPI project.
References:
[1] M. Saldaña, A. Patel, C. Madill, and P. Chow, "MPI as an Abstraction for Software-Hardware Interaction for HPRCs," Second International Workshop on High-Performance Reconfigurable Computing Technology and Applications (HPRCTA), Austin, TX, USA, 2008.
[2] OpenFPGA, "OpenFPGA General API Specification 0.4." [Online]. Available: http://www.openfpga.org/Standards%20Documents/OpenFPGA-GenAPIv0.4.pdf [Accessed: Feb. 20, 2010].
[3] V. Aggarwal, R. Garcia, A. George, and H. Lam, "SCF: A Device- and Language-Independent Task Coordination Framework for Reconfigurable, Heterogeneous Systems," HPRCTA, Portland, Oregon, November 15, 2009.
[4] MPICH2: High-Performance and Widely Portable MPI. [Online]. Available: http://www.mcs.anl.gov/research/projects/mpich2/ [Accessed: Feb. 20, 2010].
[5] Jeffrey Goeders, "A Scalable, MPI-Based Video Processing Framework," University of Toronto, August 2009.
[6] Manuel Saldaña, "Message Passing Engine (MPE) User's Guide," ArchES Computing, September 2009.
Appendix A
Video System - Software Prototype

Description
This software MPI application is a picture frame parallel processing program. There are currently five defined ranks in the system, with two distinct tests; the number of ranks can easily be expanded. The project infrastructure was organized in Microsoft Visual Studio/C++, and it must run under the MPICH2 environment. A multi-core computer is not currently necessary, because MPICH2 can simulate one as a multi-process single-core program. However, instructions for running the program on multiple computers are given below.

Key Functions
void MPE_Master(): Executed by Rank 0 only. The function first creates and initializes the shared memory, then defines the tasks to be performed and assigns them to the other ranks.
void MPE_Slave(): The slave function is executed by all non-zero ranks. It contains a polling loop that continually polls for tasks from Rank 0; each received task is translated and the corresponding codec is executed.
void Codec(int add, int size, int rank): The codec function takes three parameters: "add" determines the base address in memory at which processing starts, "size" is the size of the frame section to process, and "rank" determines which codec to run.

Instructions for running on a single computer:
1) Set up MPICH2.
2) Go to the working directory, debug folder.
3) Run the MPI program in the command prompt:
   "mpiexec -n 5 source.bmp mpi_test1.exe"
   "mpiexec -n 5 source2.bmp mpi_test2.exe"
   Format: mpiexec -n arg1 arg2 executable
   arg1 - the number of processes/ranks
   arg2 - the input picture frame to be processed
   executable - your MPI program generated by the compiler

Instructions for running on multiple computers:
1) Make sure that the MPICH2 versions are the same on all computers/machines.
2) Copy the executable to the same directory on each machine (node).
   For example, "C:\Program Files\MPICH2\examples\cpi.exe".
3) Set network connections: ensure that each machine can share its files with the
   other computers.
4) Set Windows Firewall: ensure that Windows Firewall allows file sharing by
   checking the option.
5) Add the MPICH2 path to the Windows user variables and system variables.
6) Run the MPI program in the command prompt.
   For example: "mpiexec -hosts 2 domainnameA 1 domainnameB 1 c:\program files\mpich2\examples\cpi.exe"
Software Variable Definitions:
1) Tasks are MPI messages of type int array, assigned by Rank 0 through the MPI_Send
   command. Example task declaration:
   "int taskname[TASKSIZE] = {0, 1, 640*480, 0, 0, 0, 0, 0};"
   Format: t[TASKSIZE] = {
     0: source rank,
     1: destination rank,
     2: size of frame to access,
     3: memory address/pointer of the frame,
     4-7: unused }
2) Tags are transferred with each MPI message and are used to determine the type of
   message an MPI command carries.
   TAG: the message only contains info about the source and destination.
   TAG_ext: the extended version of the previous tag; in addition to the source and
   destination rank, the memory address and size are carried in the message.
   dieTAG: signal to shut down the current core.
3) A rank is a unique identity for each process. Rank 0 is always the system control
   centre, responsible for initializing the system and assigning tasks.

Current Tests:
Test 1: The memory pointer is passed from codec to codec; the entire frame is operated on.
Test 2: Different memory pointers are assigned by Rank 0. The frame is divided into three sections, which are handled by three different ranks/codecs/processes.
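As a companion to the description above, the fragment below sketches one possible body for Codec(); the shared_mem pointer and the effect implementations are illustrative stand-ins, not the project's actual code.

    /* Hypothetical sketch of Codec(): "add" is the starting pixel offset,
       "size" is the number of pixels to process, and "rank" selects the
       effect. shared_mem is an assumed pointer to the shared frame buffer. */
    extern unsigned int *shared_mem;

    void Codec(int add, int size, int rank)
    {
        for (int i = add; i < add + size; i++) {
            switch (rank) {
            case 1: shared_mem[i] = ~shared_mem[i]; break; /* invert colour */
            case 2: shared_mem[i] ^= 0x00808080;    break; /* colour change */
            default:                                break; /* pass through  */
            }
        }
    }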
Appendix B
Video System - Hardware System

Description
The hardware system is a heterogeneous video processing system built upon Jeff Goeders' video streaming framework. The software ranks provide a scalable solution to video processing. This project has been developed as a proof-of-concept implementation. The hardware engines interface with the network through TMD-MPE v1.0, and the soft-core processors interface with the network through TMD-MPI library v1.0.

The Block Diagram:
Instructions for generating the bitstream:
1) First source the Xilinx ISE 10.1 suite: "source /opt/xilinx10.1/10.1.sh"
2) In the ./xps_streaming and ./xps folders, type:
   "make -f system.make clean" - cleans the created netlist
   "make -f system.make libs" - generates the software libraries
   "make -f system.make bits" - generates the unmodified bitstream

Distributed model, streaming example:
3) In the project root folder, ./xps_streaming needs to be renamed to ./xps
4) Execute the Python script to generate the final bitstream: "./compile.py ./streaming.cfg".
   Find the generated bitstream in ./bit.

Shared memory model example:
3) In the project root folder, the system is built in ./xps
4) Execute the Python script to generate the final bitstream: "./compile.py ./compile.cfg".
   Find the generated bitstream in ./bit.

Tips for adding additional ranks:
- The NetIf channels must be connected in the correct order; each channel number
  corresponds to the NetIf number it connects with.
- The order in which the new cores are defined in the ./*.cfg file is the order of
  the ranks (0 being the first item, 1 next, 2 follows, etc.).
- The new rank definition must be added to the ./rt/rt_m0_b0_f0.mem file; the new
  rank number is added at the beginning, since the routing table is defined in
  reverse order.
- The number of routing table words must increase by 2 every time a new rank is
  added; this parameter, called C_NUM_OF_WORDS, can be found in ./xps/system.mhs.
- The new rank should always be initialized by Rank 0 to ensure the correct order
  of operations and prevent race conditions.
DIP Switches:
The DIP switches at the bottom right of the Virtex-5 board are used for both debugging and system settings.
SW_Pin1 & SW_Pin2: Used for the UART mux; supports up to 4 Microblazes.
SW_Pin3 & SW_Pin4: Used only in the shared memory model; must be set before the system initializes.

SW_Pin[4:3]  Codec Effect
0            no modifications to the video
1            introduces a dotted pattern to the video
2            causes colour inversion of the video
3            divides the screen into four sections and the following codecs are applied: dotted
SW_Pin 5-8: The four-bit control signal for the debug mux; signals from different cores are displayed on the GPIO LEDs based on this value:

SW_Pin[8:5]  Signal displayed on LEDs
0            vga_in_to_fsl_0_o_DBG_H_CNT
1            vga_in_to_fsl_0_o_DBG_V_CNT
2            vga_in_fsl_to_mpe_0_o_DBG_CS
3            vga_in_fsl_to_mpe_0_o_DBG_SEND_CNT
4            vga_mpe_to_ram_0_o_DBG_RECV_CS
5            vga_mpe_to_ram_0_o_DBG_RECV_CNT
6            vga_mpe_to_ram_0_o_DBG_PLB_CS
7            vga_mpe_to_ram_0_o_DBG_WRITE_FRAME_CNT
8            vga_in_analyze_0_o_DBG_H_SYNC_WIDTH
9            vga_in_to_fsl_0_o_DBG_V_SYNC_WIDTH

Rank 0 - The Control Microblaze:
This software process is responsible for initializing the system and directing the traffic for the codec processes.
Parameter     Description
BASEADDR      The base address defined for the video frame to be processed
HIGHADDR      The high address defined for the video frame to be processed
TFT_ADDR      The starting address of the DVI-out memory space
TMR_ADDR      The address of the xps_timer counter
GPIO_DIP_SW   The address to read SW_Pin[4:3]
FPS_DISPLAY   The enable signal for displaying the FPS information
DEBUG_LEVEL   The debugging level; see the source code for more details
DEBUG_REPS    The number of repetitions for certain debugging stages

Other Tips:
- A null-modem RS-232 cable is required for the Virtex-5 board.
- The system infrastructure can be updated purely through the system.mhs and
  system.mss files.
- Initializing the DVI can sometimes fail if the port is plugged in; simply remove
  it and plug it back in after the configuration is complete.
Appendix C
File Structure
The design is located in /work/zhoutony/video_proc_Microblaze/
File/Folder                Description
EDK-XUPV5-LX110T-Pack      XUP Virtex-5 board definitions
arches-mpi                 TMD-MPI software library
compile.py                 Script to compile source code
compile.cfg                Configuration file for the shared memory model
streaming.cfg              Configuration file for the distributed memory model
doc                        Documentation folder
sim_scripts                Scripts and files necessary for simulation
xps                        Shared memory model project files
xps_streaming              Distributed memory streaming project files
src/mb0_streaming.c        Rank 0 code for the distributed memory model
src/mb1_streaming.c        Rank 3 code for the distributed memory model
src/mb0_multi_main.c       Rank 0 code for the shared memory model
src/mbx_multi_main.c       Rank 3-N code for the shared memory model