Multimed Tools Appl
DOI 10.1007/s11042-010-0501-7
Evaluation of data-parallel H.264 decoding approaches for strongly resource-restricted architectures
Florian H. Seitner · Michael Bleyer · Margrit Gelautz · Ralf M. Beuschel
© Springer Science+Business Media, LLC 2010
Abstract Decoding of an H.264 video stream is a computationally demanding multimedia application which poses serious challenges on current processor architectures. For processors with strongly limited computational resources, a natural way to tackle this problem is the use of multi-core systems. The contribution of this paper lies in a systematic overview and performance evaluation of parallel video decoding approaches. We focus on decoder splittings for the strongly resource-restricted environments inherent to mobile devices. For the evaluation, we introduce a high-level methodology which can estimate the runtime behaviour of multi-core decoding architectures. We use this methodology to investigate six methods for accomplishing data-parallel splitting of an H.264 decoder. These methods are compared against each other in terms of runtime complexity, core usage, inter-communication and bus transfers. We present benchmark results using different numbers of processor cores. Our results shall aid in finding the splitting strategy that is best suited for the targeted hardware architecture.
Keywords Video decoding · H.264/AVC · Multimedia · Multi-core · Embedded architectures
F. H. Seitner (B) · M. Bleyer · M. Gelautz · R. M. Beuschel
Institute for Software Technology and Interactive Systems, Vienna University of Technology, Favoritenstrasse 9-11/188/2, 1040 Vienna, Austria
e-mail: [email protected]

M. Bleyer, e-mail: [email protected]

M. Gelautz, e-mail: [email protected]

R. M. Beuschel, e-mail: [email protected]
1 Introduction
The H.264 video standard [9] is currently used in a wide range of video-related areas such as video content distribution and television broadcasting. Compared to preceding standards such as MPEG-2 and MPEG-4 SP/ASP, improved coding efficiency could be reached by introducing more advanced pixel processing algorithms (e.g. quarter-pixel motion estimation) as well as by the use of more sophisticated algorithms for predicting syntax elements from neighbouring macroblocks (e.g. context-adaptive VLC). These new coding tools result in significantly increased CPU and memory loads required for decoding a video stream. In environments of limited processing power such as embedded systems, the high computational demands pose a challenge for practical H.264 implementations [6]. Multi-core systems provide an elegant and power-efficient solution to overcome these performance limitations. They enable running complex multimedia applications such as the one described in [8] on embedded architectures.
However, the efficient distribution of the H.264 video algorithm among multiple processing units is a non-trivial task. For using the available processing resources efficiently, an equally balanced distribution of the decoder onto the hardware units must be found. The system designer has to consider data dependency issues as well as inter-communication and synchronization between the processing units. Furthermore, the resource limitations in an embedded environment such as low computational power, small-sized on-chip memories and low bus bandwidth require an efficient software design. A parallel decoder approach for mobile systems must be able to work under these resource restrictions.
The contributions of this paper are twofold. Our first and major contribution is an evaluation of different parallel decoding approaches. We evaluate these approaches in terms of runtime complexity, efficient core usage, data transfers and buffer sizes. This information is of prime importance when designing a resource-efficient multi-core decoder. Our results can be exploited for choosing the parallel approach which performs best under certain hardware restrictions. We provide a detailed overview of six different data-parallel splitting methods for H.264 and compare these approaches against each other. We investigate the impact of raising the number of cores on the resource requirements by processing different test sequences.
To accomplish our benchmark, we introduce a high-level simulation methodology for parallel decoder systems. This methodology represents the second contribution of this paper. The methodology, referred to as Partition Assessment Methodology, allows estimating the complex runtime behaviour of a parallel decoding architecture. The Partition Assessment Methodology models critical multi-core issues such as data-dependencies between cores, buffer limitations and concurrency. We overcome the problems of traditional estimation techniques, which are typically not suitable for multi-core environments and not flexible enough for exploring a large range of different decoder configurations.
In the following, we go into more detail on both contributions and discuss them in the context of prior work. First, we describe methods for splitting the H.264 decoder onto multiple processors. These methods form the foundation for the splitting approaches investigated in this paper. We then focus on runtime estimation techniques that we consider as being related to the proposed Partition Assessment Methodology.
1.1 Methods for parallel decoding
Various approaches for parallelizing the H.264 decoding process have been introduced in the literature. Van der Tol et al. [25] investigate methods for mapping the H.264 decoding process onto multiple cores. Functional splitting of a video decoder and the use of macroblock pipelining at thread-level are demonstrated in [3, 4, 10, 21]. Zhao and Liang [28] exploit frame parallelism in the Wavefront technique. A hierarchical approach working at group-of-picture and frame levels is demonstrated in [20]. Meenderinck et al. [13] investigate the scalability of H.264 for a data-parallel decoding approach operating on the macroblock-level and on multiple frames in parallel. Moreover, these authors introduce an efficient technique for H.264 frame-level parallelization, namely the 3D-Wave strategy. All of those papers share a focus on parallelization in terms of algorithmic scalability. Upper limits on the number of processors and frames processed in parallel are given. However, the memory restrictions of embedded environments make these approaches hardly usable for mobile and embedded architectures. More resource-efficient H.264 splitting approaches are introduced in [23, 24, 26]. The focus of these papers is on efficient decoder implementations for embedded architectures.
This paper focuses on evaluating the behaviour of different parallel decoding approaches in terms of runtime complexity, efficient core usage, data transfers and buffer sizes. This information is of prime importance when designing a resource-efficient multi-core decoder. The information can be exploited for choosing the parallel approach which performs best under certain hardware restrictions.
However, benchmarking a large variety of different decoder partitionings running on multiple cores poses two problems. Firstly, architectures with four and more cores and appropriate benchmarking tools are not yet commonly available. Secondly, implementing the different partitioning methods and adapting them to run on a specific architecture is a highly labour-intensive task and limits the number of approaches that can typically be evaluated. The estimation techniques described in the next section provide a way to gain insights into a video decoder's runtime behaviour without necessarily requiring a completely functional system implementation.
1.2 Runtime estimation techniques
Analysing an algorithm's definition (e.g., its source code) can provide a coarse estimate of the computational complexity of an algorithm [12, 17]. However, these techniques do not, in general, consider the impact of the input data on the program's execution behaviour (e.g., input-dependent recursions and branches). For analysing video decoding systems, where the runtime behaviour is typically strongly dependent on the input stream, these techniques are not suitable.
Profiling methods [1, 5, 7, 15, 18, 19, 27] overcome this problem. These methods either simulate or extend the program in a way that the program's execution behaviour can be observed during runtime. This allows measuring the complexity of decoding functions in the context of a specific input stream. The profiling procedure typically outputs measurements of consumed time or cycle counts of the investigated program running on a specific hardware.
However, profilings are typically available only for single-core architectures. For video decoding systems with strong variations in the runtime complexity, making predictions about the parallel decoding system's behaviour based on a single-core profiling is non-trivial. Algorithmic dependencies, differences in workload and resource limitations result in varying runtime constraints between the processing units. These constraints must be considered.
The simulator introduced in this work overcomes these shortcomings by extending the traditional profiling methods. Our simulator is capable of predicting the runtime behaviour of a virtual multi-core decoder based on information gained from traditional single-core profilers. In contrast to hardware-based development techniques such as HW/SW-Codesign methods [2, 11], the labour- and time-intensive task of implementing a fully functional hardware as well as the decoder software is avoided. This results in a high flexibility when exploring different hardware configurations and decoder partitionings. In this work, we use this method to compare different competing decoder partitionings and architecture configurations against each other to find the best-suited one.
1.3 Paper structure
The remainder of the paper is organized as follows. Section 2 focuses on the fundamental differences between functional and data-parallel splitting approaches. Additionally, it describes the H.264 decoding process and the algorithmic dependencies that counteract the parallelization. In Section 3, we first analyse a single-core profiling of an H.264 decoder. We then derive our high-level simulation methodology for parallel decoders from this analysis. Section 4 gives a detailed overview of the six evaluated parallelization approaches. We provide a structured analysis and describe our motivation for including them in our survey. Results are provided in Section 5. Final conclusions and an outlook on future work are given in Section 6.
2 Parallel H.264 decoding
This section first explains the two options for partitioning a video decoder. It then analyses the influence of the algorithmic structure of an H.264 video decoder on the partitioning process.
2.1 Functional and data-parallel splittings
The memory restrictions of most embedded video coding architectures require the use of small data packages and typically result in a macroblock-based decoding system. In functional-partitioned decoding systems, the decoding tasks such as parsing, motion compensation or deblocking are assigned to individual processing cores, typically one task per processing unit. Macroblocks are processed by one processor after the other. The multiple processors allow starting the next macroblock's decoding tasks before computation of the current macroblock has finished.
This splitting method has the advantage that each processing unit can be optimised for a certain task (e.g. by adding task-specific hardware extensions) and can use minimal-sized instruction caches. In contrast to data-parallel approaches, this strategy can also accelerate strongly sequential tasks. The disadvantages are an unequal workload balancing and high transfer rates for inter-communication.
As opposed to functional splitting methods, data-parallel systems do not distribute the functions, but the macroblocks, among multiple processing units. For efficient parallelization, the macroblocks' core assignment algorithm has to address the following issues:
– The data-dependencies between different cores must be minimized and data-locality must be exploited (i.e. support for caching strategies).
– The macroblock distribution onto the processing cores must achieve an equal workload balancing.
– Generic macroblock core assignment for different frame sizes must be possible.
Scalability in parallel systems requires minimal data-dependency between the cores. A compromise between small memory size and data-dependencies can be reached by grouping the macroblocks as described in [25]. For supporting caching strategies at a more global scale, the groups of macroblocks assigned to one core must be aligned closely together in each frame. By introducing a centralized and constant macroblock assignment for each frame, this global caching issue can be addressed efficiently. Additionally, a constant macroblock assignment allows the parsed macroblocks to be written directly to the input FIFOs of the corresponding reconstructing cores.
In this work, we focus solely on approaches based on data-parallelization. For a high core count, this parallelization paradigm offers better scalability and a more balanced workload than functional splitting approaches [25].
2.2 The H.264 decoder
Figure 1 illustrates the principle of the H.264 decoding process. The decoder can be divided into two fundamentally different parts: the parser and the reconstructor.
[Figure 1 block diagram: Encoded Bitstream → Stream Parsing/Entropy Decoding → IDCT, Spatial Prediction, Motion Compensation (fed by Reference Frames), Deblocking → Decoded Frame. The parser corresponds to functional splitting (not investigated in our study); the reconstructor corresponds to data-parallel processing (investigated in our study).]
Fig. 1 The H.264 decoding process. Macroblocks run through the decoder pipeline, one after the other
The parser includes context calculations for the context-adaptive part of the entropy decoding process, the low-level entropy decoding functions (binary-arithmetic or variable length coding) and the reconstruction of the macroblock's syntax elements such as motion vectors or residual information. This decoder part is highly sequential with short feedback loops at bit- and syntax element-levels. Furthermore, the conditional calculations of the entropy decoder's context information require short branch executions of the executing processor and efficient data access. This makes data-parallel processing hardly usable in these decoding stages, and functional splitting is preferable. Our work therefore focuses on the reconstructor module of the H.264 decoder, which provides a wide range of possibilities for exploiting data-parallelization. In our simulations, we assume that stalls cannot be caused by the parser. In other words, an idealized parser always has those macroblocks available that are currently needed in the reconstructor module.
In the reconstruction part, the H.264 decoder uses the parsed syntax elements to generate spatial (=intra) or temporal (=inter) predictions for each macroblock. The residual coefficients are inverse transformed (IDCT) and added to the macroblock's prediction. Finally, an adaptive deblocking filter is applied to the outer and inner edges of the 16 4 × 4 pixel sub-blocks of a macroblock in order to remove blocking artefacts.
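As a toy illustration of this reconstruction step, the following sketch (our own, not taken from the decoder under test) adds an inverse-transformed residual block to its prediction and clips the result to the valid 8-bit pixel range:

```python
# Illustrative sketch of H.264 reconstruction: the IDCT residual is added
# to the (intra or inter) prediction and the sum is clipped to [0, 255].

def reconstruct_block(prediction, residual):
    """Combine a prediction block and its IDCT residual, row by row."""
    return [
        [max(0, min(255, p + r)) for p, r in zip(pred_row, res_row)]
        for pred_row, res_row in zip(prediction, residual)
    ]

# A 1x2 example: 100 + 10 = 110, while 250 + 10 saturates at 255.
result = reconstruct_block([[100, 250]], [[10, 10]])
```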
2.3 Macroblock dependencies
Data-parallel splitting of the decoder's reconstruction module is challenging due to dependencies between spatially as well as temporally neighbouring macroblocks. These dependencies originate from three sources illustrated in Fig. 2 and are described as follows.
Firstly, in progressive frame coding, the intra-prediction uses unfiltered pixel information from up to four spatially neighbouring macroblocks to predict the current macroblock. These dependencies are depicted in Fig. 2a. In general, it is a good option to gather the current macroblock and its reference macroblocks on the same CPU to avoid expensive inter-processor communication for resolving this dependency. For an efficient parallelization, this dependency must be addressed carefully.
Secondly, the deblocking filter imposes additional spatial dependencies. For filtering the outer edges of the current macroblock, up to four pixel rows/columns from the upper and left neighbouring macroblocks are used as filter input. These macroblock dependencies are visualized in Fig. 2b. An efficient parallelization method will focus on avoiding that these dependencies have to be resolved across individual processors.

[Figure 2: three panels (a), (b) and (c), each marking the current macroblock; panel (c) additionally shows frames n-1 and n.]

Fig. 2 Macroblock dependencies in H.264 decoding. Arrows mean that the macroblock at the origin of the arrow needs to be processed before decoding the current macroblock. a Intra-prediction dependency. b Deblocking filter dependency. c Inter-prediction dependency
The third and final macroblock dependency arises from the inter-prediction. The inter-prediction reads reference data from macroblocks of previously decoded frames. Obviously, it is required that processing of these reference macroblocks has already been completed before they can be used for inter-prediction of the current macroblock. This results in the temporal dependency depicted in Fig. 2c. In fact, the current macroblock can depend on a rather large number of reference macroblocks. H.264 allows splitting of the current macroblock into small sub-blocks, for each of which a separate motion vector is computed. In P-slices, each inter-coded macroblock can contain up to 16 motion vectors and point to one reference frame. For bi-directionally predicted macroblocks in B-slices, a maximum of 32 motion vectors and two reference frames is possible.
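The spatial part of these dependencies can be sketched as a small helper function (our own illustration; the name and coordinate convention are assumptions, with macroblock indices counted from the top-left corner): given a macroblock position, it lists the in-frame neighbours that must already be decoded before intra-prediction of the current macroblock can start.

```python
def spatial_dependencies(x, y, mb_cols):
    """In-frame macroblocks that must precede macroblock (x, y).

    Intra-prediction may read from up to four spatial neighbours: left,
    top-left, top and top-right (the deblocking filter additionally needs
    the left and top neighbours, which are already in this set).
    Out-of-frame neighbours are dropped.
    """
    candidates = [
        (x - 1, y),      # left
        (x - 1, y - 1),  # top-left
        (x, y - 1),      # top
        (x + 1, y - 1),  # top-right
    ]
    return [(cx, cy) for cx, cy in candidates
            if 0 <= cx < mb_cols and cy >= 0]
```

For a 1,280-pixel-wide frame (80 macroblock columns), the top-left macroblock has no spatial dependencies, an interior macroblock has four, and a macroblock on the right frame border has only three, since its top-right neighbour lies outside the frame.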
3 A method for simulating video decoding architectures
In this section, we describe the simulation methodology used for this evaluation. First, we analyse a single-core decoder profile and describe the profiling information that we use as input for the high-level simulation of a parallel decoder. Second, we provide a detailed description of the methodology and the internals of the high-level simulation approach.
3.1 Analysis of a single-core decoder
Macroblock profiling  Traditional dynamic profiling provides runtime information for each decoding function. In our implementation, such profiles are derived from a single-core instruction-set (IS) simulator for the multimedia processor CHILI [22]. The profiler of the CHILI simulator provides the time of each function call and each function return that has occurred during the execution of the decoder (Fig. 3). Note that in this work we have used the CHILI profiler for obtaining this information. Nevertheless, this is no restriction of our approach, since profilers for other platforms are, in general, also able to provide this information.
The idea behind recording the time of function calls and returns is to determine the computation time that an individual macroblock spends in the decoding steps depicted in Fig. 1. Figure 3 visualizes this idea of mapping each function call to the decoding task of an individual macroblock. For example, we want to determine the computation time that is required for the intra-prediction of the third macroblock in the stream. We know that a function IntraPrediction implements the intra-prediction for this macroblock. Let us assume that in the decoding of a macroblock this function is called exactly once. The processing time of the third invocation of IntraPrediction will consequently represent the time that Macroblock 3 spends for the intra-prediction.
However, it should be noticed that this mapping is considerably more complex in practice, since functions are typically called an arbitrary number of times for the decoding of a single macroblock, depending on the macroblock's coding mode. We have therefore implemented an interpreter that performs the mapping of raw profiling data to meaningful function units.

[Figure 3: a profiler trace of Call/Return events (DecodeMacroblock, ParseMacroblock, IntraPrediction, IDCT, Deblock) from the IS simulator is mapped, per macroblock, to cycle counts for the major decoding tasks: Parse (A cycles), Intra (B cycles), IDCT (C cycles), Deblock (D cycles).]

Fig. 3 A dynamic profiler delivers the exact execution times for function calls and returns. We map this information to the major decoding functions of each macroblock and calculate the number of cycles spent in each decoding block. We refer to this macroblock-assigned profiling information as macroblock profiling
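The mapping step could be sketched as follows. This is our own simplified illustration, not the actual interpreter: the trace format, function names and task names are assumptions, and nesting beyond one level per task is ignored.

```python
# Fold a single-core profiler trace of (event, function, cycle) records into
# per-macroblock cycle counts for each decoding task. Repeated calls of the
# same function within one macroblock are accumulated.

TASK_OF = {"ParseMacroblock": "parse", "IntraPrediction": "intra",
           "IDCT": "idct", "Deblock": "deblock"}

def macroblock_profile(trace):
    """trace: list of ("call" | "return", function_name, cycle_count)."""
    profiles, current, stack = [], None, []
    for event, func, cycle in trace:
        if func == "DecodeMacroblock":
            if event == "call":
                current = {}            # start a new macroblock record
            else:
                profiles.append(current)
        elif func in TASK_OF and current is not None:
            if event == "call":
                stack.append(cycle)     # remember when the task started
            else:
                task = TASK_OF[func]
                current[task] = current.get(task, 0) + cycle - stack.pop()
    return profiles
```

A trace in which parsing runs from cycle 0 to 100 and the IDCT from cycle 100 to 150 inside one DecodeMacroblock call would yield a single profile `{"parse": 100, "idct": 50}`.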
Decoder software  The profiles are based on a commercial H.264/AVC main profile decoder for embedded architectures. This decoder has been optimised in terms of memory usage and support of DMA transfers. In addition, the regular pixel-based processing functions of the decoder (e.g., interpolation, prediction) have been assembly-optimised to make use of the SIMD processor commands. For mobile multimedia applications, using SIMD instructions typically improves the run-time performance in a power-efficient way [16].
Test sequences  We use four HD test sequences that are shown in Table 1. At a Peak Signal-to-Noise Ratio of approximately 40 dB, the four test sequences have bitrates between 12.6 and 50.8 MBit/s. The bitrate is an indicator of the amount of texture and motion occurring in a sequence. The Bus sequence with large, low-textured regions and moderate motion has the smallest bitrate of all test sequences (12.6 MBit/s). Shields and Stockholm are both moderately textured. The horizontal movement in the Shields sequence as well as the zooming operation of the camera in the Stockholm sequence result in a slightly higher motion activity than in the Bus sequence. The higher texture and motion activity of the Shields and Stockholm sequences lead to bitrates of 18.8 and 25.6 MBit/s, respectively. The background in the Parkrun sequence contains strong texture patterns and a strong horizontal motion with various temporal occlusions. This leads to the highest bitrate in our test set (50.8 MBit/s).
Table 1 Four test sequences at 720p resolution

            Bus                  Shields              Stockholm            Parkrun
Bitrate     12.6 MBit/s          18.8 MBit/s          25.6 MBit/s          50.8 MBit/s
Resolution  1,280 × 720 pixels   1,280 × 720 pixels   1,280 × 720 pixels   1,280 × 720 pixels
Texture     Moderately textured  Strongly textured    Moderately textured  Strongly textured
Motion      Moderate motion      Moderate motion      Continuous global    Strong motion
                                                      motion

Bitrates are given at 25 frames per second
Dynamic runtime behaviour  We have computed macroblock-based runtime measurements for the test sequences of Table 1. Figure 4 visualizes the macroblocks' decoding complexity for all four sequences. Here, the important point lies in the dynamic behaviour of macroblocks. It can be clearly seen that cycle counts are very different among individual macroblocks. As is also shown in the figure, this observation can still be made when considering the classes of I-, B- and P-macroblocks alone. The pixel information that each macroblock encodes seems to have large influence on its runtime behaviour.
Variations in the execution times of the decoding functions will result in a highly dynamic system when developing parallel decoders. Runtime prediction of a parallel decoding system is not straight-forward in the light of this dynamic behaviour. Moreover, multi-core decoder simulation is complicated by dependencies that exist between individual macroblocks. The methodology described in the following section considers these variations and dependencies when predicting the runtime of a multi-core decoding system.
[Figure 4: four histograms, one each for Bus, Shields, Stockholm and Parkrun. x-axis: Complexity [KCycles/MB], 0 to 150; y-axis: bin count, up to 3.5 × 10^4; separate series for I-MBs, P-MBs and B-MBs.]
Fig. 4 Dynamic variations in the execution times of individual macroblocks in the H.264 decoding process. Histogram bins plot the number of macroblocks having similar runtimes. The colours indicate the contributions of macroblocks from I-, P- and B-slices to the overall bin counts. Histograms are shown for 50 frames of the four sequences from Table 1. These sequences are IPB coded with the Group of Pictures (GOP) size being 12. It is observed that the runtimes of macroblocks vary considerably within a sequence. This observation is also made when considering I-, P- and B-macroblocks separately
3.2 Methodology
For a flexible description of the hardware as well as the decoding algorithm, we propose a high-level modelling approach. This approach is implemented by a simulator that we refer to as the Partition Assessment Simulator (PAS). Figure 5 illustrates the basic idea behind the PAS. To use the PAS, one has to give the following inputs. First, a description of a specific video decoding algorithm is required. Second, the user provides an abstract description of the hardware as well as a partitioning of decoder functions onto this hardware. Third, the user presents macroblock profiling results and coding information (e.g. coding type and motion vectors) of a specific stream to the PAS. The task of the PAS is then to simulate the runtime behaviour of the described distributed decoder running on this virtual architecture.
Modelling the hardware and the decoder software results in high flexibility when doing design space explorations. This includes the evaluation of unknown hardware configurations as well as decoder software partitionings. For example, let us assume that we want to test a new multi-core hardware system. In this case, the system designer just needs to provide an abstract description of the new hardware as well as an updated mapping of decoder tasks to the given processing cores. These small modifications are already sufficient to estimate the runtime behaviour of the decoder software on the new hardware.
Figure 6 illustrates how the PAS works in more detail. The decoding process for three macroblocks is shown in six simulation steps. Core 1 performs the first two decoding tasks (i.e. parsing and entropy decoding) and writes the results to FIFO 1. In this example, the buffer size of FIFO 1 is limited to one macroblock. Core 2 reads the results from FIFO 1 and applies the remaining decoding tasks. Internally, the PAS maintains a list for each processing unit and FIFO buffer. It stores the states and the filling levels of each processing unit and FIFO, respectively. The simulator sequentially processes one macroblock decoding task after the other. It uses the single-core macroblock profiles to determine the time required to finish
[Figure 5: the Partition Assessment Simulator (PAS) combines three inputs: a video stream (runtime profilings, coding information), a video algorithm (algorithmic structure) and simulated hardware (abstract hardware description). Its output is the runtime behaviour: complexity, memory requirements, transfer volume, etc.]
Fig. 5 The Partition Assessment Simulator (PAS). Our simulator combines information about a video codec (e.g., H.264), a virtual multi-core architecture and a video input stream in order to estimate the runtime behaviour of this specific video decoding architecture. Approximations of the decoder's performance (e.g., complexity, memory accesses, etc.) are accomplished without the need to fully implement the hardware or the decoder software
Fig. 6 Visualization of the internal simulation process in the PAS. The figure shows the execution states and buffer levels of two cores and one FIFO, respectively. Three macroblocks are processed on two individual cores. After partially decoding each macroblock, Core 1 writes the results to FIFO 1. For simplicity, the FIFO's maximum size is set to one macroblock. Core 2 reads the macroblocks from this buffer and computes the remaining decoding tasks. A detailed explanation is given in the text
a specific decoding task. The data-dependencies of a macroblock depend on the macroblock's coding mode. This coding mode is derived from the macroblock's coding information.

For each decoding task, the duration and the dependencies on other tasks and macroblocks are determined by the macroblock profiling and coding information. Firstly, the PAS uses the algorithm mapping to determine the processing unit executing a decoding task. Secondly, it evaluates when this processing unit can start to execute a task. In Fig. 6a, this is indicated by the white marker (S). The core's execution counter is increased by the task's duration. The black marker (E) indicates the end of a task. At this point, the core has finished the task execution and written its results to the output buffer FIFO 1. At this stage, the PAS does not know when the data is removed by another task and marks the state of FIFO 1 as occupied.
In Fig. 6b, Core 2 reads the macroblock data from the FIFO and frees the occupied memory in FIFO 1. During the decoding of a macroblock, functional dependencies between the decoding tasks and data dependencies between the individual macroblocks exist. We have presented macroblock dependencies for the H.264 codec in Section 2.3. These dependencies are determined by the algorithm definition and change according to a macroblock's specific coding (e.g. intra-prediction poses dependencies on the spatially neighbouring macroblocks). These dependencies obviously change the output of our simulator. We use the macroblocks' coding information for adapting the dependencies of a macroblock according to its coding. In Fig. 6b, a read stall due to data-dependencies is illustrated. Core 2 cannot start its decoding operations simultaneously with Core 1, but has to wait until the required data becomes available in the input buffer. The PAS uses the algorithm description for detecting this read stall. The start of the task execution is delayed automatically.
In Fig. 6c, the second macroblock is executed by Core 1 and written to FIFO 1. After Core 2 has finished its decoding operations for Macroblock 1, it reads Macroblock 2 and executes it (Fig. 6d).
Until now we have only considered the impact of computational complexity on our multi-core decoding system. Additionally, the PAS provides flexible interfaces to describe buffer size constraints. For each macroblock task, limitations in the buffer communication (e.g., write stalls due to insufficient buffer sizes) can be considered. The PAS maintains a set of user-defined rules on how the runtime behaviour of the decoder is influenced if buffer scarcity occurs. For example, the parsing task requires a certain amount of free memory for writing its results into an output buffer. We extract this implementation-dependent information from the macroblocks' coding information. We specify how much memory is required and how the decoder's behaviour is influenced by accessing these resources (e.g., "What happens in the case of insufficient buffer?" or "How does the memory access latency influence the decoder's runtime?"). The PAS provides a language for fast specification of buffer and memory resources.
Figure 6e visualizes a case where insufficient buffer space is available and a write stall occurs. Since FIFO 1 has a maximum size of 1, Core 1 cannot write the results of decoding Macroblock 3 immediately. It has to wait until Core 2 has read Macroblock 2 and freed the occupied memory in FIFO 1. In the following, Macroblock 2 is finished and Core 2 reads Macroblock 3 from FIFO 1 and decodes it. Once Macroblock 3 has been written to FIFO 1 (Fig. 6f), Core 1 continues with the next macroblock.
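The scheduling idea behind Fig. 6 can be sketched as a minimal discrete-event loop. This is a strong simplification of the actual PAS (it hardwires the two-core/one-FIFO pipeline and omits per-task dependency rules and the resource-specification language); all names and cycle counts are illustrative.

```python
# Two-core pipeline sketch: Core 1 parses macroblocks into a bounded FIFO,
# Core 2 reconstructs them. Start times are delayed by read stalls (FIFO
# empty) and write stalls (FIFO full), as in Fig. 6b and Fig. 6e.

def simulate_two_core(parse_cycles, recon_cycles, fifo_size=1):
    """Return the cycle at which Core 2 finishes each macroblock."""
    n = len(parse_cycles)
    read_time = [0] * n    # cycle at which Core 2 removes MB i from the FIFO
    recon_done = [0] * n   # cycle at which Core 2 finishes MB i
    t1 = t2 = 0            # current cycle of Core 1 / Core 2
    for i in range(n):
        compute_done = t1 + parse_cycles[i]
        # Write stall: the slot held by MB (i - fifo_size) must have been
        # read by Core 2 before MB i can be written into the FIFO.
        slot_free = 0 if i < fifo_size else read_time[i - fifo_size]
        write_done = max(compute_done, slot_free)
        t1 = write_done
        # Read stall: Core 2 must wait until MB i is available in the FIFO.
        read_time[i] = max(t2, write_done)
        recon_done[i] = read_time[i] + recon_cycles[i]
        t2 = recon_done[i]
    return recon_done
```

For example, with parse times [2, 2, 2] and reconstruction times [3, 3, 3], the pipeline finishes after 11 cycles, whereas a single core would need 15.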
4 Evaluated methods
The approaches considered in this evaluation represent the main groups of data-parallel approaches addressed in the literature. All of them work under the restrictions of typical embedded architectures. These restrictions include low system memory and buffer sizes on the hardware side as well as thread-less and application-dedicated runtime environments on the OS side.

Fig. 7 The Single-row splitting approach (2, 4 and 8 cores). The assignment of processors to macroblocks is shown
Our evaluation study compares the performance of six different approaches foraccomplishing data-parallel splitting of the decoder’s reconstructor module. Theseare (1) a Single-row approach, (2) a Multi-column approach, (3, 4) a blocking and anon-blocking version of a Slice-parallel method, (5) a dynamic version of the non-blocking Slice-parallel approach as well as (6) a Diagonal technique. Details on thebenchmarked approaches are provided in the following.
4.1 Single-row approach
The first splitting strategy investigated in our study distributes horizontal lines of macroblocks among the processors as shown in Fig. 7. More formally, let N be the number of processors. Processor i ∈ {0, …, N − 1} is then responsible for decoding the yth row of macroblocks if y mod N = i.
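The row-to-core mapping above can be sketched in a few lines (our own illustration; the function name is ours, and core indices are 0-based as in the formula):

```python
def single_row_core(y, num_cores):
    """Single-row (SR) splitting: row y is decoded by core (y mod N)."""
    return y % num_cores

# An 8-row frame split among two cores: rows alternate between core 0 and core 1.
assignment = [single_row_core(y, 2) for y in range(8)]
# assignment == [0, 1, 0, 1, 0, 1, 0, 1]
```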
To illustrate the Single-row (SR) approach, we give an example with two processors on an image divided into 8 × 8 macroblocks in Fig. 8. In this example as well as in the following ones of Figs. 11, 13, 14 and 16, we assume that it takes a constant value of 1 unit of time to process each macroblock. It is, however, important to notice that this is a coarse oversimplification. In real video streams, there are large variations in the processing times of individual macroblocks, which makes it difficult
Fig. 8 Example of the Single-row splitting approach used with two cores. Processed macroblocks are shown at different instances of time (t = 2, 3, 8, 10, 34). It takes a constant value of 1 unit of time to process a macroblock
to evaluate the quality of a parallelization approach. In Fig. 8, only processor 1 is able to decode macroblocks at time t = 2, since all other macroblocks are blocked as a consequence of the dependencies illustrated in Fig. 2. After the first two macroblocks of row 1 have been computed, the second core can start processing the first macroblock of the second row, since the dependencies for this macroblock are now resolved (t = 3). The next interesting event occurs at t = 8, when Core 1 has finished the computation of the first row. Macroblocks of the second row have already been computed, and therefore Core 1 can start decoding those macroblocks of row 3 that depend on their upper neighbours. At time t = 10 we obtain the same situation as at t = 2, where the first core unlocks the second one. Finally, the whole frame has been decoded at t = 34.
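The timeline above can be reproduced with a small list-scheduling sketch (our own illustration, not the PAS itself): each macroblock takes one time unit, depends on its left, top-left, top and top-right neighbours (cf. Fig. 2), and each core decodes its assigned macroblocks in raster order, stalling until a cross-core dependency is resolved.

```python
def simulate(width, height, cores, assign):
    """Finish time of the last macroblock: 1 time unit per macroblock,
    dependencies on the left, top-left, top and top-right neighbours,
    each core processing its macroblocks in raster order."""
    work = [[(x, y) for y in range(height) for x in range(width)
             if assign(x, y) == c] for c in range(cores)]
    pos = [0] * cores          # next work-list index per core
    free = [0] * cores         # time at which each core becomes idle
    finish = {}                # (x, y) -> completion time
    progressed = True
    while progressed:
        progressed = False
        for c in range(cores):
            while pos[c] < len(work[c]):
                x, y = work[c][pos[c]]
                deps = [(x - 1, y), (x - 1, y - 1), (x, y - 1), (x + 1, y - 1)]
                deps = [d for d in deps if 0 <= d[0] < width and d[1] >= 0]
                if any(d not in finish for d in deps):
                    break      # the core stalls until the dependency is decoded
                start = max([free[c]] + [finish[d] for d in deps])
                finish[(x, y)] = free[c] = start + 1
                pos[c] += 1
                progressed = True
    return max(finish.values())

# Single-row splitting on an 8 x 8 frame with two cores reproduces Fig. 8:
print(simulate(8, 8, 2, lambda x, y: y % 2))   # -> 34
```

The same function also reproduces the completion times of the later examples, e.g. t = 36 for the Multi-column splitting of Fig. 11 and t = 58 for the blocking Slice-parallel splitting of Fig. 13.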
The advantage of the Single-row approach lies in its simplicity. It is very easy to split the frame among the individual processors. There is only a small start delay after which all processors can work effectively. The potential downside of this approach is that many dependencies need to be resolved across processor assignment borders. This has played no role in our example, where we have assumed a constant processing time for each macroblock. However, this problem will be noticeable for real video streams that contain macroblocks of considerably different runtime characteristics. In fact, each macroblock processed by Core i depends on its upper neighbours, which are processed by a different Core i − 1. If processor i − 1 fails to deliver these macroblocks at the right time, this immediately produces stalls at Core i. This behaviour is shown in Fig. 9a. On the other hand, this strong coupling of CPUs can potentially lead to low buffer requirements.
4.2 Multi-column approach
The second splitting method of our study divides the frame into equally-sized columns as shown in Fig. 10. Let w denote the width of a multi-column, which is typically derived by dividing the number of macroblocks in a row by the number of processors. Processor i is then responsible for decoding a macroblock of the xth column if iw ≤ x < (i + 1)w. A similar method to partition the image has recently been proposed for the H.264 encoder in [24].
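The column-to-core mapping can be sketched as follows (our own illustration; when the frame width is not divisible by the core count, as in the 22-macroblock-wide CIF case discussed in Section 5, we assume the leftover columns fall to the last core):

```python
def multi_column_core(x, mb_width, num_cores):
    """Multi-column (MC) splitting: core i owns columns i*w <= x < (i+1)*w."""
    w = mb_width // num_cores             # multi-column width
    return min(x // w, num_cores - 1)     # leftover columns go to the last core

# CIF frames are 22 macroblocks wide, so no equal split exists for 4 cores:
widths = [sum(1 for x in range(22) if multi_column_core(x, 22, 4) == c)
          for c in range(4)]
print(widths)   # -> [5, 5, 5, 7]
```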
Figure 11 simulates the Multi-column (MC) approach using two processors. Core 1 thereby starts processing the first row of macroblocks until it hits the border to
Fig. 9 The number of inter-processor dependencies is crucial for the overall performance of the multi-core system. Rectangles represent macroblocks. A macroblock's width indicates the required processing time. Arrows between two macroblocks mean that processing of the macroblock which the arrow points to can only start after the other macroblock has been decoded. a A large number of inter-processor dependencies slows down the system. b Due to the low amount of inter-processor dependencies, different running times of individual macroblocks become averaged out. This should improve the overall performance
Fig. 10 The multi-column splitting approach (2, 4 and 8 cores)
the macroblocks assigned to processor 2 (t = 4). Since the dependency for the leftmost macroblock of Core 2 is now resolved, processor 2 can finish decoding its first macroblock at t = 5. We obtain a similar situation at t = 8. The dependencies of the leftmost macroblock of the second row have been resolved, and Core 2 can therefore continue its work. Decoding of the frame is finally completed at t = 36.
The basic idea behind the Multi-column approach is to obtain a looser coupling of processor dependencies. In fact, the processor assignment borders are significantly reduced in comparison to the Single-row approach. One processor has to wait for the results of another one only at the boundary of its multi-column. Within the multi-column, macroblock dependencies can be resolved on the same processor. This should lead to reduced inter-processor dependencies and could therefore improve the overall runtime behaviour of the multi-core system, as depicted in Fig. 9b.
4.3 Slice-parallel approach
As a third splitting method, we investigate a 90-degree rotated version of the Multi-column approach that divides the frame into even-sized rows. This method is depicted in Fig. 12. More formally, let h denote the height of a multi-row. A macroblock of the yth row is then assigned to Core i if ih ≤ y < (i + 1)h.
The runtime behaviour of the Slice-parallel (SP) approach is illustrated in Fig. 13. Here, Core 2 has to wait for a relatively long time (t = 26) until the dependencies for
Fig. 11 Example of the multi-column splitting approach (t = 4, 5, 8, 36)
Fig. 12 The slice-parallel splitting approach (2, 4 and 8 cores)
its first assigned macroblock are resolved. While the first processor can complete its work on the current frame at t = 32, it still takes 26 units of time until the second core finishes processing the remainder of the frame at t = 58. In the following, we refer to this approach as the blocking Slice-parallel technique.
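Under the 1-unit-per-macroblock assumption, these numbers follow directly from the dependency structure (a back-of-the-envelope check, using the dependencies of Fig. 2): the first macroblock of Core 2's band needs the top-right neighbour in the last row of Core 1's band, which is finished at t = (h − 1)·W + 2; afterwards Core 2 decodes its h·W macroblocks without further cross-core stalls.

```python
W, H, N = 8, 8, 2            # frame size in macroblocks, number of cores
h = H // N                   # rows per slice-parallel band

start_core2 = (h - 1) * W + 2    # top-right dependency in the last row of band 1
finish_core1 = h * W             # core 1 decodes its band without cross-core stalls
finish_core2 = start_core2 + h * W

print(start_core2, finish_core1, finish_core2)   # -> 26 32 58
```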
The incorporation of the Slice-parallel approach into our benchmark is also motivated by recent work [14] that has presented a non-blocking encoder version of this method. The authors encode their video streams so that slice borders coincide with horizontal lines in the frames. Since neither dependencies introduced by intra-prediction nor dependencies introduced by the deblocking filter occur across slice borders, the multi-rows can be processed completely independently of each other.
Obviously, this non-blocking SP approach (NBSP) requires having full control over the encoder, which will not be the case for many applications. This downside should be considered when interpreting the results for this method in our benchmark. For completeness, we also give an example of this approach in Fig. 14. Here, both cores can start processing their assigned macroblocks immediately (t = 1) and finish the decoding of the complete frame at t = 32.
4.4 Rotating slice-parallel approach
Content-dependent differences in the decoding complexity of the slices will result in an unequal workload of each core [23]. For example, static and low-textured
Fig. 13 Example of the Slice-parallel approach in the blocking version (t = 26, 32, 58)
Fig. 14 Example of the Slice-parallel approach in the non-blocking version (t = 1, 32)
frame regions will use coarse macroblock partitions and little residual information. This requires less processing power than for moving and strongly textured regions. If we constantly assign a slice with low complexity to the same core, a drift between this core and the cores working on more complex slices will occur in the long term in the NBSP approach. This results in low system usage and high buffer size requirements for compensating for the differences in the workload.
To investigate the impact of the sequence content on the NBSP approach's runtime behaviour, we introduce a modified version of this approach (NBSPr). It supports a rotating core assignment depending on the frame number. For example, in a two-core splitting, the first processor would process the upper image part in the first frame, the bottom part in the second frame, then again the upper part, and so on. In contrast to the NBSP approach with its static assignment, this rotating approach should, on average, achieve a more balanced system workload.
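The rotating assignment can be sketched by offsetting the band index with the frame number (our own illustration of the scheme described above):

```python
def nbspr_core(y, frame_idx, mb_height, num_cores):
    """NBSPr: the band-to-core assignment rotates with the frame number."""
    h = mb_height // num_cores    # rows per band
    return ((y // h) + frame_idx) % num_cores

# Two cores, 8 macroblock rows: the core decoding the top band alternates per frame.
print([nbspr_core(0, f, 8, 2) for f in range(4)])   # -> [0, 1, 0, 1]
```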
4.5 Diagonal approach
As a final approach, we include the splitting method illustrated in Fig. 15 in our evaluation. This processor assignment is obtained by dividing the first line of macroblocks into equally-sized columns. The assignments for the subsequent lines are then derived by left-shifting the macroblock assignments of the line above. This procedure leads to diagonal patterns.
Fig. 15 The diagonal splitting approach (2, 4 and 8 cores)
Fig. 16 Example of the diagonal approach (t = 4, 10, 12, 13, 16, 18, 20, 23, 24, 43)
Figure 16 gives an example of the diagonal (DG) approach using two processors. Here, the second core stalls until its dependencies become resolved by Core 1 at t = 4. The first core completes the computation of its first image partition at t = 10. Unfortunately, it cannot directly start processing the second partition, but has to wait for Core 2 to resolve dependencies until t = 12. The following images (t = {13, 16, 18, 20, 23, 24}) show situations where the first core has to wait for macroblocks decoded by processor 2. For legibility, we do not show subsequent states where one processor blocks the other one, but directly proceed to the final result derived at t = 43.
The inclusion of the Diagonal approach in our evaluation is motivated by the macroblock partitioning approaches proposed in [25]. The Diagonal splitting method is regarded as an approach that "respects" the dependency patterns spanned by the intra-prediction and the deblocking filter (dependencies are shown in Fig. 2). We illustrate the idea behind the Diagonal splitting method in Fig. 17. The figure compares the inter-processor dependencies introduced by the Diagonal and Multi-column splitting techniques. The Diagonal method thereby only shows dependencies on macroblocks from its left neighbouring processor, which is in contrast to the Multi-column method
Fig. 17 Processor dependencies in the diagonal and multi-column approaches. a In the Diagonal method, dependencies for Core 2 originate solely from macroblocks assigned to Core 1. Core 2 therefore never has to wait for Core 3. b In the Multi-column approach, macroblocks assigned to Core 2 are also dependent on processor 3, as indicated by the dashed arrow
Fig. 18 Frames per second achieved for a system running at 800 MHz. The frame rate achieved by each parallelization approach on average is shown. We considered 1, 2, 4 and 8 cores. The parallelization methods are: the Single-row method (SR), the Multi-column method (MC), the non-blocking versions of the Slice-parallel method (NBSP/NBSPr), the Diagonal method (DG) and the blocking version of the Slice-parallel method (SP). Additionally, the figure provides detailed runtime results for each of the four 720p sequences
that contains dependencies on macroblocks of both neighbouring cores. These reduced inter-processor dependencies could lead to an improved runtime behaviour of the multi-core system.
5 Results
5.1 Run-time complexity
Two major indicators for the efficiency of a multi-core decoding system are the decoder's runtime and the number of data-dependency stalls occurring during the decoding process. A low runtime indicates a high system decoding performance, while the number of stalls provides an estimate of how efficiently the system's computational resources are used.
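These two indicators can be made concrete as follows (a sketch with hypothetical cycle counts; one plausible formalization of the S and U metrics reported later in Table 2, whose exact definitions may differ):

```python
def stall_percentage(stall_cycles, total_cycles):
    """S: share of cycles lost to data-dependency stalls, in percent."""
    return 100.0 * stall_cycles / total_cycles

def average_usage(busy_cycles_per_core, wall_cycles):
    """U: mean share of wall-clock cycles each core spends decoding, in percent."""
    return 100.0 * sum(busy_cycles_per_core) / (len(busy_cycles_per_core) * wall_cycles)

# Hypothetical two-core run over 1000 cycles of wall-clock time:
print(stall_percentage(400, 2000))        # -> 20.0
print(average_usage([900, 700], 1000))    # -> 80.0
```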
Figure 18 provides runtime estimates for the decoder performance in frames per second for a system running at 800 MHz. Additionally, Fig. 19 shows the percentage of stall cycles occurring during the overall decoding process for the six investigated approaches. We provide the average core usage of the decoders using different approaches and core counts in Fig. 20. Additionally, Table 2 provides detailed performance numbers and shows the average performance results over all
Fig. 19 Stall cycles caused by the data dependencies between the cores. We plot the percentage of stall cycles in relation to the overall cycles spent for decoding a sequence. The figure shows the average number of stalls over all HD sequences and for the individual HD sequences considering 1, 2, 4 and 8 cores
Fig. 20 Average core usage for four test sequences. For the parallelization approaches, the system's average usage is shown for 1, 2, 4 and 8 cores
HD sequences considering 1, 2, 4 and 8 cores. It ranks the approaches according to their runtime performance and memory transfer rates.
In Fig. 18, it can be observed that the SR, MC and NBSP approaches perform considerably better than the DG and the blocking SP approaches. Despite using dependent partitions, the MC approach even outperforms the NBSP approach, which has no data dependencies between the cores.
We found that the constant core assignment to frame regions and the runtime differences between these regions result in an unbalanced workload distribution for the NBSP approach. According to our results, approaches working on a finer granularity level and alternating the cores over the frame more frequently (e.g. the SR approach) adapt better to local complexity peaks.
To compare the scalability of the parallel approaches for different resolutions, we have included five common CIF sequences in our evaluation. These are the Barcelona, Bus, Flowergarden, Foreman and Mobile sequences, which have been encoded using the same encoding parameters as the four HD sequences. Figure 22 shows the average performance results in terms of runtime, stalls, core usage and memory transfer rates for these sequences. At CIF resolution, the NBSP and SR approaches show similar runtime performance. With an increasing number of cores, the NBSP approach has the advantage of fewer dependencies between the splitting partitions, which results in a better runtime performance.
The impact of the rotating order in the NBSPr approach can be observed for all core counts and sequences. However, architectures with high core counts tend to benefit more strongly from this rotating order than systems with few cores. The good ranking of the NBSPr approach for both the CIF and the HD sequences indicates that rotating the slices increases the runtime performance independently of the resolution.
Evaluating only the encoder-independent approaches shows a similar scalability of the SR and MC approaches for HD decoding. For most of these sequences and core counts, the MC approach outperforms the SR approach. For the SR approach, the complexity differences between adjacent lines accumulate and result in a faster increase of the stall cycles for increasing core counts. For decoding sequences at CIF resolution, the SR approach seems to be more suitable. The simulations show that the performance of the MC approach with 8 cores decreases more strongly than for systems with fewer cores. At CIF resolution, with a frame width of 22 macroblocks, no partitioning into equally sized vertical partitions is possible for more than two cores. Additionally, the column width decreases with each additional core and results in more frequent inter-communication between the cores.
Table 2 Results of our evaluation

We evaluate four performance metrics for each test sequence and decoder splitting approach. These are the number of frames decoded in 1 s (FPS), the percentage of stall cycles in relation to the overall cycle count (S), the average core usage (U) given in percent and the data transfer (T) measured in MB/s. The table's values are visualized in Figs. 18–21. The subscripts denote ranks computed by comparison against the competing splitting approaches. Rank 1 means that the approach is the best-performing one according to the performance metric. The column Avg. rank is computed as the average of all subscripts within a row. To derive an overall performance ranking of splitting approaches, we sort the table according to Avg. rank
Fig. 21 Data transfer volume for intra and inter reference data and deblocking information, if one macroblock line can be cached in local core memory
The diagonal (DG) approach shows a relatively poor performance, especially for high core counts and small resolutions. For each additional core, the DG approach creates two additional partitions with dependencies on other cores' partitions. Adding a new core to the system therefore raises the number of dependencies faster than for the other approaches and results in a higher number of stalls. For CIF resolution and high core counts, a stronger decrease in performance than for HD resolution is observed.
For the SP approach, the coarse horizontal splitting and the dependencies between each partition and its upper partitions result in high stall counts and the worst runtime of all approaches. With increasing resolution, each core has to wait longer until the previous partition is finished. Hence, the efficiency of the SP approach decreases with increasing resolution.
5.2 Inter-communication
Memory transfers to and from the external DRAM and between the cores' local memories are expensive in terms of power consumption and transfer times. In a video decoding system, these transfers are strongly influenced by the core inter-communication and the loading of reference data and deblocking pixels. Depending on the size of the cores' local memories and how efficiently the parallelization approach exploits locality, macroblock information can be kept on a core and reused for the decoding of future macroblocks. In our work, we assume that exactly one
Fig. 22 Average frame rate, stall cycles and usage for a system running at 800 MHz and decoding sequences with CIF resolution. The sequences included in these calculations are Barcelona, Bus, Foreman, Flowergarden and Mobile. We calculated (a) the average frame rate, (b) the percentage of stall cycles, (c) the average usage and (d) the memory transfer rate for 1, 2, 4 and 8 cores
macroblock line of intra-prediction and deblocking data can be kept in the core memory of each core. The resulting transfer rates for HD and CIF resolution videos are shown in Figs. 21 and 22d, respectively.
A strong difference between the approaches can be observed for the HD as well as for the CIF sequences. The SR approach only exploits data locality between horizontally neighbouring macroblocks. The fine horizontal splitting into macroblock lines results in the highest transfer rate of all approaches and represents the worst case for horizontal splitting approaches in terms of memory transfers. Note that for two or more cores the transfer rate of the SR approach is constant. The alternating processing order of the macroblock lines requires the transfer of vertical reference data and deblocking information between the cores.
For the MC approach, a smaller amount of data has to be exchanged between the cores, and efficient data reuse is possible. This results in the lowest transfer rate of all encoder-independent approaches and even outperforms the blocking SP approach at HD as well as CIF resolution. Increasing the core count only results in a slight transfer increase for the MC and SP approaches. For the NBSP and NBSPr approaches, the data communication even decreases for high core counts, since the number of independent frame partitions increases. However, this results in blocking artefacts at slice borders and a less efficient prediction.
The transfer rates for the DG approach are slightly worse than for the MC and SP approaches for all core counts. These results can be observed for HD as well as CIF resolutions. The two additional partitions for each additional core result in new data dependencies to neighbouring partitions and a stronger increase of the data transfers when raising the number of cores.
6 Conclusions
In this study, we have evaluated six data-parallel approaches for decoding H.264 video streams in environments with strong resource restrictions. We compare the approaches against each other and investigate the impact of raising the core count on the decoder's runtime, core usage and transfer rates. For accomplishing this evaluation, we have developed a high-level simulator. It is capable of estimating the runtime behaviour of a virtual multi-core decoder based on single-core profilings.
The runtime of each parallelization approach is strongly influenced by the sizes and shapes of the frame partitions. Large and dependency-minimizing partitions cause less inter-communication between the cores and a lower runtime. Additionally, a static assignment of each core to constant frame regions makes an approach highly sensitive to local complexity peaks and can result in unequal workload balancing. Assigning each core to multiple, spatially separated frame regions (e.g. the SR approach) or rotating the assignment order between frames can compensate for these complexity variations more effectively. A rotating assignment has the advantage that it can reduce dependencies by maintaining large partitions while processing diverse frame regions on each core.
For approaches working on larger partitions, it is generally easier to exploit locality. This reduces transfer rates, since more spatially close neighbouring macroblocks are processed on the same core and efficient data reuse is possible. Reusing the data from horizontally neighbouring macroblocks is easily possible by keeping a single
macroblock's prediction and deblocking information in the local cache. Vertical caching requires that the partitions' width is smaller than the number of locally storable macroblocks. For approaches based on horizontal splitting, this results in the highest requirements on local memories.
In future work, we plan to investigate the efficiency of parallel coding approaches for different H.264 coding tools and GOP structures. Other video standards such as VC-1 and AVS will be included in further evaluations.
Acknowledgements This work has been supported by the Austrian Federal Ministry of Transport, Innovation, and Technology under the FIT-IT project VENDOR (Project nr. 812429). Michael Bleyer would like to acknowledge the Austrian Science Fund (FWF) for financial support under project P19797.
References
1. Ball T, Larus JR (1994) Optimally profiling and tracing programs. ACM Trans Program Lang Syst 16(4):1319–1360
2. Cesario WO, Lyonnard D, Nicolescu G, Paviot Y, Yoo S, Jerraya AA, Gauthier L, Diaz-Nava M (2002) Multiprocessor SoC platforms: a component-based design approach. IEEE Des Test Comput 19(6):52–63
3. Chen TW, Huang YW, Chen TC, Chen YH, Tsai CY, Chen LG (2005) Architecture design of H.264/AVC decoder with hybrid task pipelining for high definition videos. In: Proc. of the IEEE int. symposium on circuits and systems, pp 2931–2934
4. Chen YK, Tian X, Ge S, Girkar M (2004) Towards efficient multi-level threading of H.264 encoder on Intel hyper-threading architectures. In: Proc. of the 18th int. parallel and distributed processing symposium, vol 1, pp 63–72
5. Cmelik B, Keppel D (1994) Shade: a fast instruction-set simulator for execution profiling. In: Proc. of the ACM SIGMETRICS conference on measurement and modeling of computer systems, pp 128–137
6. Faichney J, Gonzalez R (2001) Video coding for mobile handheld conferencing. J Multimed Tools Appl 13(2):165–176
7. Graham SL, Kessler PB, McKusick MK (1982) gprof: a call graph execution profiler. In: Proc. of the SIGPLAN symposium on compiler construction, pp 120–126
8. Gulliver SR, Ghinea G, Patel M, Serif T (2007) A context-aware tour guide: user implications. Mobile Inform Syst 3(2):71–88
9. ITU-T, ISO/IEC (2005) Advanced video coding for generic audiovisual services (ITU Rec. H.264 | ISO/IEC 14496-10). ITU-T and ISO/IEC
10. Jeon J, Kim H, Boo G, Song J, Lee E, Park H (2000) Real-time MPEG-2 video codec system using multiple digital signal processors. J Multimed Tools Appl 11(2):197–214
11. Knudsen PV, Madsen J (1996) PACE: a dynamic programming algorithm for hardware/software partitioning. In: Proc. of the int. workshop on hardware–software co-design, pp 85–92
12. Malik S, Martonosi M, Li YTS (1997) Static timing analysis of embedded software. In: Proc. of the 34th ACM/IEEE design automation conference, pp 147–152
13. Meenderinck C, Azevedo A, Juurlink B, Alvarez M, Ramirez A (2008) Parallel scalability of video decoders. J Signal Process Syst 56:173–194
14. Moriyoshi T, Miura S (2008) Real-time H.264 encoder with deblocking filter parallelization. In: IEEE int. conference on consumer electronics, pp 63–64
15. Nachtergaele L, Catthoor F, Kapoor B, Janssens S, Moolenaar D (1996) Low power storage exploration for H.263 video decoder. In: Proc. of the IX workshop on VLSI signal processing, pp 115–124. doi:10.1109/VLSISP.1996.558310
16. Paver N, Khan M, Aldrich B (2006) Optimizing mobile multimedia using SIMD techniques. J Multimed Tools Appl 28(2):221–238
17. Puschner PP, Koza C (1989) Calculating the maximum execution time of real-time programs. J Real-Time Syst 1(2):159–176
18. Ravasi M, Mattavelli M (2003) High-level algorithmic complexity evaluation for system design. J Syst Archit 48(13–15):403–427
19. Ravasi M, Mattavelli M (2005) High abstraction level complexity analysis and memory architecture simulations for multimedia algorithms. IEEE Trans Circuits Syst Video Technol 15(5):673–684
20. Rodriguez A, Gonzalez A, Malumbres M (2006) Hierarchical parallelization of an H.264/AVC video encoder. In: Proc. of the int. symposium on parallel computing in electrical engineering, pp 363–368
21. Schöffmann K, Fauster M, Lampl O, Böszörményi L (2007) An evaluation of parallelization concepts for baseline-profile compliant H.264/AVC decoders. In: Proc. of the Euro-Par 2007, pp 782–791
22. Seitner F, Meser J, Schedelberger G, Wasserbauer A, Bleyer M, Gelautz M, Schutti M, Schreier R, Vaclavik P, Krottendorfer G, Truhlar G, Bauernfeind T, Beham P (2008) Design methodology for the SVENm multimedia engine. In: Proc. of the Austrochip 2008, poster presentation
23. Seitner FH, Schreier RM, Bleyer M, Gelautz M (2008) A high-level simulator for the H.264/AVC decoding process in multi-core systems. In: Proc. of the SPIE, multimedia on mobile devices, vol 6821, pp 5–16
24. Sun S, Wang D, Chen S (2007) A highly efficient parallel algorithm for H.264 encoder based on macro-block region partition. In: Proc. of the 3rd int. conference on high performance computing and communications, pp 577–585
25. van der Tol EB, Jaspers EG, Gelderblom RH (2003) Mapping of H.264 decoding on a multi-processor architecture. In: Proc. of the SPIE, vol 5022, pp 707–718
26. Wang SH, Peng WH, He Y, Lin GY, Lin CY, Chang SC, Wang CN, Chiang P (2003) A platform-based MPEG-4 advanced video coding (AVC) decoder with block-level pipelining. In: Proc. of the 2003 joint conference of the 4th int. conference on information, communications and signal processing and the 4th Pacific rim conference on multimedia, vol 1, pp 51–55
27. Witchel E, Rosenblum M (1996) Embra: fast and flexible machine simulation. In: Proc. of the ACM SIGMETRICS int. conference on measurement and modeling of computer systems, pp 68–79
28. Zhao Z, Liang P (2006) A highly efficient parallel algorithm for H.264 video encoder. In: Proc. of the 31st IEEE int. conference on acoustics, speech, and signal processing, vol 5, pp 489–492
Florian H. Seitner received his diploma degree (M.S.) in Computer Science from the Vienna University of Technology in 2006. He is currently employed as a project assistant at the Vienna University of Technology where he works on his Ph.D. degree. His scientific interests include video coding, parallel software design and video analysis.
Michael Bleyer received his M.S. and Ph.D. degrees in Computer Science from the Vienna University of Technology in 2002 and 2006, respectively. He is currently employed as a post-doc researcher at the Vienna University of Technology. His research interests include stereo matching, optical flow computation, combinatorial optimization and video coding.
Margrit Gelautz received her PhD degree in Computer Science from Graz University of Technology, Austria, in 1997. She worked on optical and radar image processing for remote sensing applications during a postdoctoral stay at Stanford University. In her current position as an associate professor at Vienna University of Technology, her research focuses on image and video analysis for multimedia applications. She is Women in Engineering (WIE) officer of the IEEE Austria Section.
Ralf M. Beuschel was born in Friedrichshafen, Germany, in 1974. He received the diploma degree (M.Eng.) in electrical engineering (systems engineering) from the University of Ulm in 2000 with a best thesis award donated by the VDE/VDI. From 2000 to 2006 he was a research fellow at the Microelectronics Department at the University of Ulm, doing research in the field of low-delay video coding algorithms and real-time DSP implementations. Afterwards he was with the Institute of Software Technology and Interactive Systems of the TU Vienna for two years. In early 2009 he joined DSP-Weuffen in Amtzell, Germany, as a Senior System Designer for DSP video applications. He is a member of the IEEE and the German TV and Cinema Technology Society (FKTG).