Throughput Exploration and Optimization of a Consumer Camera Interface for a Reconfigurable Platform
By: Floris Driessen ([email protected])
11-12-2013
Introduction
• Video applications on embedded platforms
• Use of accelerators
• Faster
• Energy efficiency
• USB camera
Platform of Interest - ZYNQ
• Zedboard by Digilent
• Xilinx Zynq platform
− Dual core ARM Cortex A9
− Programmable logic
• 512 MB RAM
• USB connectivity
• HDMI output
• USB camera
Naïve implementation
• Software
1. Read camera frame
2. Copy frame to DMA region
3. Perform HW accelerated operation (Sobel)
4. Copy result from DMA region
5. Show result
• Separate DMA region needed due to lack of DMA
drivers
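Step 3 offloads a Sobel filter to the programmable logic. A software reference is useful for checking the accelerator's output. The sketch below is one possible reference, not the accelerator's actual definition: the function name `sobel_reference`, the 8-bit grayscale input, the |gx| + |gy| magnitude approximation, the clamp to 255, and the zeroed border are all assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Software Sobel reference on an 8-bit grayscale image.
 * Assumptions (not taken from the slides): magnitude is approximated as
 * |gx| + |gy| and clamped to 255; border pixels are set to 0. */
void sobel_reference(const uint8_t *src, uint8_t *dst, int width, int height)
{
    assert(src && dst && width >= 3 && height >= 3);

    /* Clear the output so the untouched border reads as 0. */
    for (int i = 0; i < width * height; i++)
        dst[i] = 0;

    for (int y = 1; y < height - 1; y++) {
        for (int x = 1; x < width - 1; x++) {
            const uint8_t *p = src + y * width + x;

            /* Horizontal gradient: right column minus left column. */
            int gx = -p[-width - 1] + p[-width + 1]
                     - 2 * p[-1]    + 2 * p[1]
                     - p[width - 1] + p[width + 1];

            /* Vertical gradient: bottom row minus top row. */
            int gy = -p[-width - 1] - 2 * p[-width] - p[-width + 1]
                     + p[width - 1] + 2 * p[width]  + p[width + 1];

            int mag = abs(gx) + abs(gy);
            dst[y * width + x] = (uint8_t)(mag > 255 ? 255 : mag);
        }
    }
}
```

Running this on a captured frame and comparing it against the accelerator result gives a quick sanity check before any timing work.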
[Figure: Zynq platform block diagram showing ARM Core 0, ARM Core 1, USB, Linux RAM, DMA RAM, and programmable logic]
Bottleneck Study
• Performance limit
• Converting the format
• Camera output to
accelerator input
• Copying from/to DMA region
• Mmap
− Not cached
• Frame capturing
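The uncached mmap access can be illustrated with a small mapping helper. This is a minimal sketch, assuming the DMA region is reached by mapping a device file; the function name `map_region` is hypothetical, and on the Zynq the path would typically be `/dev/mem` with the offset set to the physical base address of the reserved DMA region (both assumptions, not taken from the slides).

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Map `length` bytes of `path`, starting at the page-aligned `offset`,
 * into the process. Returns NULL on failure.
 * O_SYNC is requested because mappings of /dev/mem opened this way are
 * typically not cached on ARM Linux, which makes many small accesses slow. */
void *map_region(const char *path, off_t offset, size_t length)
{
    assert(path && length > 0);

    int fd = open(path, O_RDWR | O_SYNC);
    if (fd < 0)
        return NULL;

    void *p = mmap(NULL, length, PROT_READ | PROT_WRITE, MAP_SHARED,
                   fd, offset);
    close(fd); /* the mapping stays valid after the descriptor is closed */

    return (p == MAP_FAILED) ? NULL : p;
}
```

Because every access to such a mapping goes straight to memory, it pays to touch it in few, large transfers rather than many small ones.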
Possible improvements
• Exploiting scratchpad
• A frame would not fit
• DMA driver support
• Not feasible within time frame of project
• Optimize the current implementation
• Copying data
• Converting format
• Capturing camera frame
Format conversion
• Naïve implementation
• Combined conversion and copy
− Writing small chunks to mmap'ed memory (slow)
• Split conversion and copy
• OpenCV mixChannels
• NEON interleaving
• ARM SIMD
• Next slide
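The split approach can be sketched in portable C: convert into a staging buffer in ordinary memory first, then move the whole result to the destination in one bulk copy. The function name `convert_then_copy` and the 0xff padding byte are assumptions for illustration; this is not the code used in the project.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Convert RGB24 to RGB32 in a staging buffer in normal (cached) memory,
 * then copy the result to `dst` with a single memcpy.
 * `dst` would be the mmap'ed DMA region in the setup described above.
 * Returns 0 on success, -1 if the staging buffer cannot be allocated. */
int convert_then_copy(const uint8_t *rgb24, uint8_t *dst, size_t num_pixels)
{
    assert(rgb24 && dst);
    if (num_pixels == 0)
        return 0;

    uint8_t *staging = malloc(num_pixels * 4);
    if (!staging)
        return -1;

    /* Conversion: many small writes, all to cached memory. */
    for (size_t i = 0; i < num_pixels; i++) {
        staging[4 * i + 0] = rgb24[3 * i + 0];
        staging[4 * i + 1] = rgb24[3 * i + 1];
        staging[4 * i + 2] = rgb24[3 * i + 2];
        staging[4 * i + 3] = 0xff; /* padding byte; value is an assumption */
    }

    /* Copy: one large transfer to the destination. */
    memcpy(dst, staging, num_pixels * 4);
    free(staging);
    return 0;
}
```

Keeping the small writes in cached memory is what separates this from the combined variant, which writes each converted chunk directly to the mmap'ed region.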
Implementation   Convert + copy [s]     Speed-up
Naïve            1.95                   1x
Split            0.28 + 0.04 = 0.32     6.1x
OpenCV           0.05 + 0.04 = 0.09     21.7x
NEON             0.04                   50.6x
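The speed-up column is the naïve time divided by the time of each implementation. The helper below is a small sketch of that arithmetic; the name `speedup` is hypothetical.

```c
#include <assert.h>

/* Speed-up of an optimized implementation relative to a baseline,
 * both given as execution times in seconds. */
double speedup(double baseline_s, double optimized_s)
{
    assert(baseline_s > 0.0 && optimized_s > 0.0);
    return baseline_s / optimized_s;
}
```

With the rounded times from the table, `speedup(1.95, 0.32)` is about 6.1 and `speedup(1.95, 0.09)` is about 21.7, matching the slide. For NEON, `speedup(1.95, 0.04)` gives 48.75 rather than 50.6, so the slide's figure was presumably computed from unrounded timings.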
NEON RGB24 to RGB32 conversion example

Address  RGB24  RGB32
0x00     R0     R0
0x01     G0     G0
0x02     B0     B0
0x03     R1     x
0x04     G1     R1
0x05     B1     G1
0x06     R2     B1
0x07     G2     x
..       ..     ..

Register contents after the deinterleaving load:
d0   R7 R6 R5 R4 R3 R2 R1 R0
d1   G7 G6 G5 G4 G3 G2 G1 G0
d2   B7 B6 B5 B4 B3 B2 B1 B0
d3   x  x  x  x  x  x  x  x

vld3.8 {d0-d2} [#0]
vst4.8 {d0-d3} [#0]
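void __attribute__ ((noinline))
neonRGBtoRGBA_gas(const unsigned char* src, unsigned char* dst,
                  int numPix)
{
    // Each iteration handles 8 pixels; leftover pixels are not converted.
    int blocks = numPix >> 3;
    if (blocks <= 0)
        return;

    asm volatile(
        // load alpha channel value
        "    vmov.u8 d3, #0xff\n"
        "1:\n"
        // load 8 RGB pixels with deinterleave (R to d0, G to d1, B to d2)
        "    vld3.8 {d0,d1,d2}, [%[src]]!\n"
        // preload next values
        "    pld [%[src], #40]\n"
        "    pld [%[src], #48]\n"
        "    pld [%[src], #56]\n"
        // subtract from loop counter
        "    subs %[n], %[n], #1\n"
        // optional: swap R and B channels
        //"    vswp d0, d2\n"
        // store as 8 pixels of 4 x 8-bit values
        "    vst4.8 {d0-d3}, [%[dst]]!\n"
        // loop while blocks remain
        "    bgt 1b\n"
        // operands are named so the code no longer relies on the
        // arguments happening to sit in r0-r2
        : [src] "+r" (src), [dst] "+r" (dst), [n] "+r" (blocks)
        :
        : "d0", "d1", "d2", "d3", "cc", "memory"
    );
}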
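The same load/store pattern can also be written with NEON intrinsics, which lets the compiler handle register allocation. This is a sketch under assumptions, not the author's routine: the name `rgb_to_rgba` is hypothetical, a scalar loop handles pixel counts that are not a multiple of 8, and on non-NEON targets only the scalar loop is compiled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Convert packed RGB24 pixels to RGB32 with the fourth byte set to 0xff. */
void rgb_to_rgba(const uint8_t *src, uint8_t *dst, size_t num_pixels)
{
    assert(src && dst);
    size_t i = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    const uint8x8_t alpha = vdup_n_u8(0xff);
    for (; i + 8 <= num_pixels; i += 8) {
        /* vld3: deinterleave 8 pixels into separate R, G and B vectors. */
        uint8x8x3_t rgb = vld3_u8(src + 3 * i);
        uint8x8x4_t rgba;
        rgba.val[0] = rgb.val[0];
        rgba.val[1] = rgb.val[1];
        rgba.val[2] = rgb.val[2];
        rgba.val[3] = alpha;
        /* vst4: interleave the four vectors back into 8 four-byte pixels. */
        vst4_u8(dst + 4 * i, rgba);
    }
#endif

    /* Scalar path: remaining pixels, or all pixels without NEON. */
    for (; i < num_pixels; i++) {
        dst[4 * i + 0] = src[3 * i + 0];
        dst[4 * i + 1] = src[3 * i + 1];
        dst[4 * i + 2] = src[3 * i + 2];
        dst[4 * i + 3] = 0xff;
    }
}
```

The scalar path doubles as a reference for checking the output of the assembly routine on small test frames.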
Frame copy from/to DMA RAM
• OpenCV (as used in the naïve implementation)
• Manual copy (loop over virtual contiguous memory)
• Memcpy from C library
• NEON accelerated copy
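A small timing harness makes the manual and memcpy variants comparable. The sketch below uses only the C standard library; the names `copy_manual`, `copy_memcpy`, and `time_copy_ms` are hypothetical, and `clock()` measures CPU time, which is only a rough stand-in for the wall-clock measurements a real benchmark would use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

typedef void (*copy_fn)(void *dst, const void *src, size_t n);

/* Manual copy: byte-wise loop over virtually contiguous memory.
 * Note that an optimizing compiler may turn this loop into a memcpy call. */
void copy_manual(void *dst, const void *src, size_t n)
{
    uint8_t *d = dst;
    const uint8_t *s = src;
    for (size_t i = 0; i < n; i++)
        d[i] = s[i];
}

/* Copy using the C library. */
void copy_memcpy(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Average time of one copy in milliseconds over `reps` repetitions. */
double time_copy_ms(copy_fn fn, void *dst, const void *src, size_t n, int reps)
{
    assert(fn && dst && src && reps > 0);
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        fn(dst, src, n);
    clock_t end = clock();
    return 1000.0 * (double)(end - start) / CLOCKS_PER_SEC / reps;
}
```

Passing a pointer into the mmap'ed DMA region as `dst` or `src` would reproduce the three transfer directions compared on this slide.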
[Figure: bar chart of execution time [ms] for frame copies Linux RAM → Linux RAM, Linux RAM → DMA RAM, and DMA RAM → Linux RAM, comparing OpenCV, Manual, Memcpy, and NEON copy. Data labels in extraction order: 724, 999, 642, 9, 55, 36, 7, 44, 22, 16, 46, 16]
Camera capture
• OpenCV
• Always BGR24
• Video4Linux
• Different formats
• Not a big improvement
[Figure: bar chart of frame delay [s] for V4L2 RGB24, V4L2 BGR24, V4L2 MJPEG, V4L2 YUYV, and OpenCV BGR24. Data labels in extraction order: 0.07, 0.11, 0.04, 0.06, 0.06]
Results
• Multiple configurations
• Combined the conversion and
copy (NEON accelerated)
• 1: Split convert and copy
• 2: OpenCV mixChannels
• 3: Combined mixChannels to external
• 4: No convert back + V4L capture
• 5: NEON copy
• 6: Combined NEON convert and NEON copy
[Figure: stacked bar chart of execution time per frame [s] for application configurations 1 to 6, split into Get frame, Convert and copy, Sobel calculation, and Copy back and convert. Data labels in extraction order: 0.73, 0.23, 0.78, 0.17, 0.15, 0.13]
Contributions
• Framework for combining USB camera with
accelerators in programmable logic
• Multiple format conversion routines
• NEON
• NEON copying routines
• Video4Linux frame capture
[Figure: processing pipeline with the stages Capture frame, Convert format, Copy to DMA RAM, Execute accelerator, Copy result back, Process result, Convert format]
Conclusion and Future work
• 32x improvement in frame rate (0.2 to 7.7 FPS)
• Still one ARM core unoccupied for processing data
after accelerator
• Make camera frame buffer available to DMA
• DMA buffer sharing
− Linux kernel 3.8
• Improve frame capture
• Takes more than half of the time
• Latency of ~4 frames
• Driver from manufacturer
• Consider other cameras