Decoder-Complexity-Aware Encoding of Motion Compensation for Multiple Heterogeneous Receivers

MOHSEN JAMALI LANGROODI and JOSEPH PETERS, Simon Fraser University
SHERVIN SHIRMOHAMMADI, University of Ottawa

For mobile multimedia systems, advances in battery technology have been much slower than those in memory, graphics, and processing power, making power consumption a major concern in mobile systems. The computational complexity of video codecs, which consists of CPU operations and memory accesses, is one of the main factors affecting power consumption. In this paper, we propose a method that achieves near-optimal video quality while respecting user-defined bounds on the complexity needed to decode a video. We specifically focus on the motion compensation process, including motion vector prediction and interpolation, because it is the single largest component of computation-based power consumption. We start by formulating a scenario with a single receiver as a rate-distortion optimization problem and we develop an efficient decoder-complexity-aware video encoding method to solve it. Then we extend our approach to handle multiple heterogeneous receivers, each with a different complexity requirement. We test our method experimentally using the H.264 standard for the single receiver scenario and the H.264 SVC extension for the multiple receiver scenario. Our experimental results show that our method can achieve up to 97% of the optimal solution value in the single receiver scenario, and an average of 97% of the optimal solution value in the multiple receiver scenario. Furthermore, our tests with actual power measurements show a power saving of up to 23% at the decoder when the complexity threshold is halved in the encoder.

Categories and Subject Descriptors: H.3.4 [Information Storage and Retrieval]: Systems and Software—Performance evaluation (efficiency and effectiveness); I.4.2 [Image Processing and Computer Vision]: Compression (Coding)—Approximate Methods

General Terms: Design, Algorithms, Experimentation, Measurement

Additional Key Words and Phrases: Decoder complexity modeling, H.264/AVC decoding, H.264/SVC decoding, motion compensation

ACM Reference Format:
Mohsen Jamali Langroodi, Joseph Peters, and Shervin Shirmohammadi. 2014. Decoder-Complexity-Aware Encoding of Motion Compensation for Multiple Heterogeneous Receivers. ACM Trans. Multimedia Comput. Commun. Appl. 0, 0, Article 0 (0), 22 pages.
DOI: http://dx.doi.org/10.1145/0000000.0000000

1. INTRODUCTION

Despite recent advances in mobile multimedia systems, power consumption is still a major concern. A crucial problem for hand-held devices such as netbooks and smartphones is sustaining long battery life given the large amount of energy required for video decoding and rendering.

This work was supported in part by the Natural Sciences and Engineering Research Council (NSERC) of Canada and in part by the British Columbia Innovation Council (BCIC).
Authors' addresses: M. Jamali Langroodi and J. Peters, School of Computing Science, Simon Fraser University, 250-13450 102nd Ave., Surrey, B. C., Canada, V3T 0A3; S. Shirmohammadi, School of Electrical Engineering and Computer Science, University of Ottawa, 800 King Edward Ave., Ottawa, Ontario, Canada, K1N 6N5.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected].
© 0 ACM 1551-6857/0/-ART0 $15.00
DOI: http://dx.doi.org/10.1145/0000000.0000000


Furthermore, as codecs become more advanced with higher computational complexity, their power consumption increases. This can be seen in the H.264 video coding standard [Joint Video Team 2010], which requires more power than its predecessor, and the HEVC specification [Bross et al. 2013; He et al. 2013], which increases the decoding complexity by more than 61% compared to H.264/AVC [Viitanen et al. 2012].

Among the various factors that affect power consumption, computational complexity, which consists of CPU operations and memory accesses, is one of the main ones [Lee and Kuo 2011; Semsarzadeh et al. 2012]. As such, a method for encoding video streams in a way that optimizes the trade-off between the video quality of a stream and the decoding complexity of a receiver would be a valuable contribution. Such a method would be useful for video on demand content providers, such as Netflix and YouTube, who could use it to enable longer playing times for their mobile customers.

Video conferencing and live broadcasting services also have mobile customers. However, these services have a different requirement than video on demand services: they must stream the same video simultaneously to multiple receivers with different bandwidth and complexity requirements. To address the different requirements in these settings, solutions such as Scalable Video Coding (SVC) [Schwarz et al. 2007; Wiegand et al. 2007] have been proposed and deployed. SVC creates one or more subset bit-streams, known as layers, each of which has a lower bit-rate than the original video stream, and SVC offers different types of scalability including temporal, spatial, and SNR/quality scalability. Each receiver can select the video layer(s) that it needs depending on its capabilities. For mobile receivers in a video conferencing or live broadcasting session, we need a method that can encode a video in a way that optimizes the trade-offs between video quality and multiple decoding complexity limits based on the video layer(s) to which the different receivers subscribe.

In this paper, we first address the video on demand scenario by developing a method to maximize the video quality of a single video stream while guaranteeing that the complexity needed to decode the video does not exceed a specific user-defined threshold. We formulate this single threshold scenario as a rate-distortion optimization problem and present an efficient decoder-complexity-aware video encoding method to solve it. Our experiments with the H.264/AVC video codec show that our method can achieve up to 97% of the optimal solution value. Our method for a single video stream can be used to complement existing video streaming techniques in video on demand applications, such as MPEG Dynamic Adaptive Streaming over HTTP (DASH), the Real-time Transport Protocol (RTP), or any other streaming method for video on demand. A preliminary version of this part of our work was presented in [Jamali Langroodi et al. 2013].

Then we extend our approach from the single threshold scenario to the multiple receiver scenario of video conferencing and live broadcasting, taking into account the different complexity restrictions of the mobile users. Our method for this scenario is a decoder-complexity-aware video encoding method that satisfies the decoding complexity requirements of multiple users while maximizing the overall video quality of all layers. Our experiments with H.264/SVC show that an average of 97% of the optimal solution value can be achieved.

We also measured the actual savings in power consumption achieved by our method in the multiple receiver scenario. Our results show power savings of up to 23% and an average of 13%. In our work, we focus on the motion compensation process, including motion vector prediction and interpolation, in which macroblocks coded with different inter prediction modes have different decoding complexities. The motion compensation process accounts for as much as 38% of the decoding complexity [Horowitz et al. 2003] and is the single largest component of computation-based power consumption. Our reported average power saving of 13% is more than one-third of the 38% portion that the motion compensation component contributes.

The rest of this paper is organized as follows. Related work and essential multimedia background are reviewed in the next two sections. Our system model and methodology are described in detail in Section 4. Section 5 presents our experimental results and performance evaluations. Finally, we conclude in Section 6.

2. RELATED WORK

Many researchers have focused on reducing the decoder side complexity and have proposed heuristics to facilitate receiver side computations. Wang et al. [2013] proposed a low complexity deblocking filter algorithm for multi-view video coding based on an investigation of view correlations in the motion skip coding mode. They claimed that if the boundary strength (BS) value, which is one of the criteria for smoothing the edges in the deblocking filter component, can be directly copied from the associated reference macroblock in a neighbouring view, then a significant amount of deblocking filter computation can be eliminated with negligible quality degradation.

Another deblocking filter heuristic was proposed by Choi and Ho [2008] to decrease the number of unnecessary BS decisions, leading to a computational complexity reduction for this component. The authors exploited the number of operations involved in each BS type in the deblocking filter process and also applied a newly designed filter, which is much simpler than the conventional filter, to each BS type. A fast prediction algorithm was also proposed to determine the BS values of successive lines of pixels using the BS values of the first line of pixels since they possess similar characteristics and amounts of blocking artifacts.

A fast smoothing decision algorithm was proposed by Jung and Ryoo [2013] to reduce the computational complexity of intra prediction hardware in an HEVC decoder hardware architecture. The proposed hardware is a shared operation unit that shares adders for computing common operations of the smoothing equations, eliminating computational redundancy and reducing the number of execution cycles of the component.

There have been numerous attempts to create comprehensive modelling schemes for decoders with the goal of creating a decoder complexity controller infrastructure. Ma et al. [2011] modelled the whole H.264/AVC video decoding process by determining the fundamental operation unit, called the Complexity Unit (CU), in each decoding module. Each module, based on its individual properties, has its own CU such as bit parsing, macroblock data structure initialization, and half-pel interpolation. Such CUs are determined empirically for a fixed implementation, and then the average number of cycles required by each CU over a frame or a group of pictures (GOP) is multiplied by the number of CUs reported in a decoder module to obtain the time complexity. To validate, the proposed complexity model was tested on both Intel and ARM hardware platforms to decode H.264/AVC bitstreams. A major difference between the approach that we propose in this paper and all of the work cited above is that our methods adapt to user-specified limits on computational complexity.

A complexity model of the motion compensation component of a video decoder based on the number of cache misses, the number of y-direction interpolation filters, the number of x-direction interpolation filters, and the number of motion vectors per macroblock was presented by Lee and Kuo [2010]. This work is similar to the model proposed by Semsarzadeh et al. [2012]; in both models, there is a weight associated with each number of complexity units calculated in a training phase. Firstly, the number of each type of CU is extracted in a training pool of video sequences during the encoding phase, and then each bitstream is decoded on the encoder side to find the best fitting weights for each CU to establish an estimation model. The approach in [Semsarzadeh et al. 2012] differs from the approach in [Lee and Kuo 2010] in that the former is platform independent.

A complexity analysis of the entropy decoder module was done by Lee and Kuo [2010] for both context-based adaptive variable length coding (CAVLC) and universal variable length coding (UVLC), which are supported in all profiles of the H.264/AVC codec. The proposed model is a function of the number of CAVLC executions, the number of trailing ones, the number of remaining non-zero quantized transformed coefficients, and the number of run executions. An H.264/AVC encoder integrated with the proposed complexity control scheme can generate a bit stream that is suitable for a receiver platform with a power/hardware constraint [Lee and Kuo 2011]. The main difference between our proposed method and the work in [Lee and Kuo 2011] is that we study the motion compensation component. In addition, in contrast to the models proposed by Lee and Kuo [2010; 2011], the model used in this paper, based on [Semsarzadeh et al. 2012], can be utilized in any hardware or software implementation.

Recently, an analytical energy consumption and decoding performance model similar to the model in [Semsarzadeh et al. 2012] was presented by Benmoussa et al. [2013]. The model achieves a balance between the abstract high level and detailed low level based on the video bit-rate and clock frequency. Parameters are extracted and evaluated to determine their relationships to decoding speed and thereby to energy consumption. The main difference with our work is that the focus in [Benmoussa et al. 2013] was on complexity modelling of the H.264/AVC standard, whereas we have developed a generic decoder-complexity-aware encoding method based on the complexity model in [Semsarzadeh et al. 2012].

A recent study that concentrated on power-aware video encoding while maximizing video quality under specific encoder constraints was reported by Lee et al. [2012]. The authors tackled the power consumption of surveillance cameras without power lines by proposing an optimal video coding configuration to enhance video quality based on the remaining charge in the battery. They investigated stochastic event characteristics and nonlinear discharges of the battery that change the behaviour of the power consumption. More precisely, whenever a suspicious event occurs within the camera's scope, making the sum of the absolute differences between the captured image and the background image larger than a predefined threshold value, the camera will capture the video with higher complexity encoding, leading to higher video quality. The proposed scheme also estimates the remaining event active duration to set the IDR (Instantaneous Decoding Refresh) period with the goal of increasing the video quality as much as possible given the battery energy constraints.

Da Fonseca and de Queiroz [2013] extended the conventional rate-distortion model to a new model for controlling power while optimizing the bit rate and visual quality of a video. The prediction stage was optimized for energy-constrained compression using a new framework called RDE (Rate-Distortion-Energy) optimization by adding a new energy dimension. This dimension is derived by determining the different effects that various encoding parameters have on the energy demand and encoder cost. The authors investigated this issue by measuring energy and encoder cost on comprehensive training data sets which cover all possible parameters. The main difference between the research in [da Fonseca and de Queiroz 2013; Lee et al. 2012] and ours is that we focus on the power consumption of the decoder instead of the encoder.

In a recent paper, Grois and Hadar [2014] focused on complexity-aware adaptive scalable video coding similar to our multiple receiver approach. The method in [Grois and Hadar 2014] uses a dynamic transition for the region-of-interest that adaptively changes encoding and decoding complexity using various preprocessing parameters such as standard deviation, kernel matrix size, and number of applied filters. Their method is not as flexible as our approach in adjusting the video complexity because it only considers the region-of-interest properties. Furthermore, the video encoder is forced to use the region-of-interest feature during the encoding phase to perform complexity-aware video coding, whereas there is no such limitation in our approach.

3. BACKGROUND

In this section, we explain some necessary background which is the same as the background for our preliminary study [Jamali Langroodi et al. 2013].

In general, multimedia data, including video, do not follow a uniform pattern. Level of activity, detail, motion speed, and direction are factors that impact the encoding process, hence each frame in the video is partitioned into smaller pieces called macroblocks. In the motion estimation process, using smaller blocks leads to more compression and higher detail. H.264 allows the encoder to do motion estimation on 16 × 16, 16 × 8, 8 × 16, and 8 × 8 blocks, and these blocks can be further partitioned into blocks as small as 4 × 4 for even better effectiveness.

The H.264/AVC standard defines half-pel and quarter-pel accuracy for luma samples [Richardson 2004], so fractional position accuracy can be used to find good matches. Macroblock size and sub-pel interpolation are the factors that make the bit stream more complicated to decode on the receiver side. The computational complexity is directly related to the amount of interpolation performed on the decoder side.

In [Semsarzadeh et al. 2012], we presented a generic complexity model based on an algorithmic analysis of the motion compensation process that considers the influence of mode decisions and interpolation on the computational complexity of the decoder. The interpolation of pixels is done using a specific Finite Impulse Response filter on neighbouring pixels as shown in [Semsarzadeh et al. 2012]. Essentially, the basic operations in motion compensation, such as addition, multiplication, shift, and memory access, contribute to complexity. In each filter operation, the number of basic operations involved depends on the type of interpolation and the size of the macroblock. The total computational complexity is therefore equal to the total numbers of basic operations multiplied by their corresponding weights. These weights are not pre-defined values; rather, they are trained with a set of pre-encoded bit streams with known computational complexities. Once the weights are initialized and tuned, the model is established and can be used to estimate the total computational complexity of motion compensation for sequences outside of the training set. For more information and details, we refer readers to [Semsarzadeh et al. 2012].
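To make the weighted-sum structure of the model concrete, the following minimal sketch estimates the motion-compensation complexity of one macroblock. The operation categories are the ones named above; the counts and weight values are hypothetical placeholders, since the real weights come from the training procedure of [Semsarzadeh et al. 2012].

```python
# Minimal sketch of the generic complexity model: the motion-compensation
# complexity of a macroblock is the weighted sum of its basic-operation counts.
# Operation names, counts, and weights below are hypothetical examples; real
# weights are obtained by training on pre-encoded bit streams.

BASIC_OPS = ("add", "mul", "shift", "mem_access")

def mc_complexity(op_counts, weights):
    """Estimated motion-compensation complexity of one macroblock.

    op_counts -- dict mapping basic-operation name to its count for the
                 macroblock's chosen mode/interpolation (the P_ijk values)
    weights   -- dict mapping basic-operation name to its trained weight W_k
    """
    return sum(op_counts.get(op, 0) * weights[op] for op in BASIC_OPS)

# Example with made-up numbers: a macroblock decoded with half-pel interpolation.
weights = {"add": 1.0, "mul": 3.2, "shift": 0.8, "mem_access": 4.5}
counts = {"add": 460, "mul": 96, "shift": 128, "mem_access": 310}
print(mc_complexity(counts, weights))
```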

In this paper, our goal is to control the computational complexity involved in motion compensation at the decoder. Thus, a generic complexity model is needed to monitor motion compensation and ensure that it does not exceed pre-defined complexity thresholds for both the single and multiple receiver scenarios. We have used the model in [Semsarzadeh et al. 2012] as our generic model, and our proposed solution, described in the next section, depends on it.

4. METHODOLOGY

In this section, we present our system architecture and define the optimization problems of complexity-aware encoding for both the single and multiple receiver scenarios.

Our model for the single receiver scenario is based on a client-server system as shown in Figure 1. The encoder is located on the server side. The decoder on the client side receives the video data and is able to send information about its resource requirements to the server. Client information is translated into complexity demands when it is received by the encoder. The encoder processes the video sequence to extract the encoding configuration and customize it according to the complexity bound specified by the client. Then, the video is encoded with the new encoding configuration which satisfies the input complexity bound, and the bit-stream is sent to the user(s).

Fig. 1. Our system model in the single receiver scenario

Fig. 2. Our system model in the multiple receiver scenario

For the multiple receiver scenario, the client-server system includes a wide range of receivers as shown in Figure 2. In this scenario, the solution includes the encoding configuration for each layer, and each receiver can select appropriate layers according to its bandwidth and power limitations.

In H.264/AVC, the default decision for the encoding process is to choose the result that yields the highest quality output image. However, the choices that the encoder makes during the motion estimation process might require more bits to deliver a relatively high quality benefit. Therefore, it cannot necessarily be claimed that the more complexity a mode has, the better bit-rate and quality it can provide for the bit-stream. The motion estimation process solves this issue by "rate-distortion" optimization of the aforementioned problem based on the following two metrics.

— Distortion: The deviation from the source material is usually measured as the mean squared error to maximize the peak signal-to-noise ratio (PSNR) video quality metric.

— Bit-rate: This is the bit cost for a particular decision outcome.

There is a trade-off among the bit-rate of the video ($R_{tot}$), the quality of the video ($D_{tot}$), and the computational complexity of decoding the motion compensation ($C_{tot}$). Motion vector precision (MVP) and mode decisions are the source coding parameters that have a direct influence on this trade-off. In our proposed optimization problem, the computational complexity of the decoding process acts as a constraint on conventional rate-distortion optimization. So, we formulate the problem as a function of the source coding parameters plus the computational complexity constraint. The formulation for H.264/AVC is

$$
\begin{aligned}
\text{minimize} \quad & D_{tot}(MVP, mode) \ \text{ given } \ R_{tot}(MVP, mode) \\
\text{subject to} \quad & C_{tot}(MVP, mode) \le \beta
\end{aligned}
\qquad (1)
$$

where $\beta$ is a complexity threshold received from the user's device.

A combination of MVP and intra or inter mode is called a state in this paper. Each macroblock is in one possible state that includes a specific precision and a particular mode after the motion compensation process in the encoder. We reformulate Problem (1) as the following optimization problem.

$$
\begin{aligned}
\text{minimize} \quad & \sum_{i=1}^{n} \sum_{j=1}^{m} RDC_{ij} \cdot X_{ij} \\
\text{subject to} \quad & \sum_{i,j} CC_{ij} \cdot X_{ij} \le \beta \\
& CC_{ij} = \sum_{k} P_{ijk} W_{k} \\
& \sum_{j} X_{ij} = 1; \quad \forall\, 1 \le i \le n \\
& X_{ij} \in \{0, 1\}; \quad 1 \le i \le n;\ 1 \le j \le m
\end{aligned}
\qquad (2)
$$

$RDC_{ij}$ and $CC_{ij}$ are the rate-distortion cost and computational complexity, respectively, when state $j$ is chosen for macroblock $i$. $W_k$ is the weight corresponding to the $k$th type of basic operation and $P_{ijk}$ is the number of type $k$ basic operations when macroblock $i$ is in state $j$. The numbers of macroblocks and states are $n$ and $m$ respectively.

Problem (2) is a multiple-choice knapsack problem. Each (macroblock, state) pair $(i, j)$ is an object that can be placed into the knapsack, and the total computational complexity of the objects in the knapsack cannot exceed the threshold $\beta$. Furthermore, exactly one state $j$ must be chosen for each macroblock $i$, so exactly one pair from each macroblock group $G_i = \{(i, j) \mid 1 \le j \le m\}$ is placed into the knapsack.

In the SVC extension of H.264, there can be multiple layers. There is a base layer (which we call layer 0) which is similar to the baseline profile of the H.264/AVC standard, and there can be one or more optional enhancement layers (layers 1, 2, . . .), each of which depends on the base layer and the enhancement layers below it. Thus, the layers must be decoded in order starting with the base layer, and the formulation becomes

$$
\begin{aligned}
\text{minimize} \quad & D_{tot}(MVP, mode) \ \text{ given } \ R_{tot}(MVP, mode) \\
\text{subject to} \quad & C_{0,tot}(MVP, mode) \le \beta_0 \\
& C_{0,tot}(MVP, mode) + C_{1,tot}(MVP, mode) \le \beta_1 \\
& \quad \vdots \\
& C_{0,tot}(MVP, mode) + C_{1,tot}(MVP, mode) + \cdots + C_{l,tot}(MVP, mode) \le \beta_l
\end{aligned}
\qquad (3)
$$

where $l$ is the maximum number of enhancement layers that any user device will use, $C_{h,tot}$ is the computational complexity of decoding the motion compensation in layer $h$, and $\beta_0, \beta_1, \ldots, \beta_l$ are complexity thresholds received from user devices that will use $0, 1, \ldots, l$ enhancement layers, respectively. The extension of Problem (2) to SVC is

$$
\begin{aligned}
\text{minimize} \quad & \sum_{h=0}^{l} \sum_{i=1}^{n_h} \sum_{j=1}^{m} RDC_{hij} \cdot X_{hij} \\
\text{subject to} \quad & \sum_{i,j} CC_{0ij} \cdot X_{0ij} \le \beta_0 \\
& \sum_{i,j} CC_{0ij} \cdot X_{0ij} + \sum_{i,j} CC_{1ij} \cdot X_{1ij} \le \beta_1 \\
& \quad \vdots \\
& \sum_{h,i,j} CC_{hij} \cdot X_{hij} \le \beta_l \\
& CC_{hij} = \sum_{k} P_{hijk} W_{k} \\
& \sum_{j} X_{hij} = 1; \quad \forall\, 0 \le h \le l;\ 1 \le i \le n_h \\
& X_{hij} \in \{0, 1\}; \quad 0 \le h \le l;\ 1 \le i \le n_h;\ 1 \le j \le m
\end{aligned}
\qquad (4)
$$

In (4), $n_h$ is the number of macroblocks in layer $h$, and the first subscript of other variables refers to layers; $CC_{hij}$ is the computational complexity when state $j$ is chosen for macroblock $i$ in layer $h$, and so on. As for Problem (2), each (macroblock, state) pair $(i, j)$ is an object that can be placed into the knapsack and exactly one pair from each macroblock group $G_i = \{(i, j) \mid 1 \le j \le m\}$ is placed into the knapsack. Similar to (2), the total computational complexity of the objects (from all layers) in the knapsack cannot exceed the threshold $\beta_l$. The main difference is that Problem (4) has $l + 1$ complexity constraints instead of one, so Problem (4) is a multi-dimensional multiple-choice knapsack problem.

Our approaches to solving Problems (2) and (4) are both based on dynamic programming with greedy heuristics. Our dynamic programming method has been designed for a single constraint and is used in both approaches; the greedy heuristics and the organizations of the two approaches are different. We start by describing our dynamic programming method in the context of H.264/AVC which has a single constraint. Then we use the method, together with (different) greedy heuristics, to solve Problems (2) and (4).

4.1. Dynamic programming method

The recurrence relations for a dynamic programming formulation of (2) are [Bean 1987]:

$$
\begin{aligned}
& f(0, 0) = 0; \quad f(0, b) = 0; \quad f(i, 0) = \infty \\
& f(i, b) = \min\Big\{ f(i, b-1),\ \min_{j}\big\{ f(i-1,\, b - CC_{ij}) + RDC_{ij} \ \big|\ CC_{ij} \le b \big\} \Big\} \\
& 1 \le i \le n; \quad 1 \le b \le \beta; \quad 1 \le j \le m
\end{aligned}
\qquad (5)
$$

A conventional dynamic programming approach to solving (5) uses an $(n+1) \times (\beta+1)$ table and results in a pseudo-polynomial time algorithm with $O(n\beta)$ space complexity and $O(mn\beta)$ time complexity. In our context, this algorithm is both CPU- and memory-bound because $\beta$ is very large (over one million nanoseconds), so it is only practical for solving small instances of the problem. To reduce the resource requirements, we apply two techniques. The first technique is to scale the $CC_{ij}$ values by a factor $S$ and then round to the nearest integer (denoted $\lfloor \cdot \rceil$) to reduce the problem size. The resulting optimization problem is

$$
\begin{aligned}
\text{minimize} \quad & \sum_{i=1}^{n} \sum_{j=1}^{m} RDC_{ij} \cdot X_{ij} \\
\text{subject to} \quad & \sum_{i,j} \Big\lfloor \frac{CC_{ij}}{S} \Big\rceil \cdot X_{ij} \le \Big\lfloor \frac{\beta}{S} \Big\rceil \\
& CC_{ij} = \sum_{k} P_{ijk} W_{k} \\
& \sum_{j} X_{ij} = 1; \quad \forall\, 1 \le i \le n \\
& X_{ij} \in \{0, 1\}; \quad 1 \le i \le n;\ 1 \le j \le m
\end{aligned}
\qquad (6)
$$

and the space and time complexities of the table-driven pseudo-polynomial algorithm are reduced to $O(n \lfloor \beta/S \rceil)$ and $O(mn \lfloor \beta/S \rceil)$ respectively. Note that this approximation technique is different from the usual scaling approaches for pseudo-polynomial time algorithms which scale the objective function. Instead, we are scaling the constraints. This is possible because the $CC_{ij}$ values are sparsely distributed in a large range, so the errors introduced by careful scaling are small.
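As an illustration, the sketch below implements the table-based recurrence (5) on the scaled complexities of (6). It is a minimal, single-constraint version with our own variable names; it omits the space-reduction technique described next and simply backtracks to recover one state per macroblock. Its running time is $O(mn\lfloor \beta/S \rceil)$, matching the bound above.

```python
import math

def solve_mckp_dp(rdc, cc, beta, scale):
    """Minimal sketch of the scaled DP for Problem (2)/(6).

    rdc[i][j] -- rate-distortion cost of state j for macroblock i
    cc[i][j]  -- computational complexity of state j for macroblock i
    beta      -- complexity threshold; scale -- scaling factor S
    Returns (total_rdc, state_per_macroblock) or None if no feasible choice exists.
    """
    n, m = len(rdc), len(rdc[0])
    cap = int(round(beta / scale))                              # scaled budget
    scc = [[int(round(c / scale)) for c in row] for row in cc]  # scaled CC_ij

    INF = math.inf
    # f[b] = minimum total RDC over the macroblocks processed so far using
    # scaled complexity at most b; the f(i, b-1) term of recurrence (5) is
    # implicit in this "at most b" definition.
    f = [0.0] * (cap + 1)
    choice = []                                  # choice[i][b] = state chosen for macroblock i
    for i in range(n):
        g = [INF] * (cap + 1)
        pick = [None] * (cap + 1)
        for b in range(cap + 1):
            for j in range(m):
                w = scc[i][j]
                if w <= b and f[b - w] + rdc[i][j] < g[b]:
                    g[b] = f[b - w] + rdc[i][j]
                    pick[b] = j
        choice.append(pick)
        f = g

    if f[cap] == INF:
        return None
    # Backtrack to recover one state per macroblock.
    states, b = [None] * n, cap
    for i in range(n - 1, -1, -1):
        j = choice[i][b]
        states[i] = j
        b -= scc[i][j]
    return f[cap], states
```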

The second technique that we apply to our dynamic programming algorithm is a space reduction technique from [Hirschberg 1975]. In this method, the dynamic programming table is modelled as a graph in which each node $(i, b)$ corresponds to table entry $(i, b)$, $0 \le i \le n$, $0 \le b \le \beta$, and the weighted edges are as follows: there is an edge $(i, b, j)$ connecting node $(i, b)$ and node $(i-1, b - CC_{ij})$ with weight $RDC_{ij}$ if $b \ge CC_{ij}$ for $1 \le i \le n$, $1 \le b \le \beta$, $1 \le j \le m$, and an edge $(i, b, 0)$ connecting node $(i, b)$ and node $(i, b-1)$ with weight 0 for $0 \le i \le n$, $1 \le b \le \beta$.

The shortest path between node $(0, 0)$ and node $(n, \beta)$ corresponds to an optimal solution. Such a path must pass through a node $(\frac{n}{2}, b)$ for some $1 \le b \le \beta$. The idea is to apply divide-and-conquer to recursively solve two sub-problems: find the shortest paths from node $(0, 0)$ to each node $(\frac{n}{2}, b)$, $1 \le b \le \beta$; find the shortest paths from each node $(\frac{n}{2}, b)$ to node $(n, \beta)$; and then combine the sub-problem solutions. This technique reduces the space complexity by a factor of $O(n)$ at the expense of an $O(\log_2 n)$ increase in the time complexity. Combining the two techniques results in space and time complexities $O(\lfloor \beta/S \rceil)$ and $O(mn \log_2 n \lfloor \beta/S \rceil)$ respectively.

4.2. Solution method for a single threshold

Our method to solve Problem (2) takes advantage of similarities among consecutive frames by considering solutions for previous frames as potential solutions for the current frame. Our method consists of three steps; first we search through all potential solutions to find the best initial solution for the current frame. The second and third steps iteratively improve the initial solution by local exchanges. We discuss the three steps in detail.

Finding a good initial solution is the most important step of our method. We consider all previous frames with computational complexity no more than $1.2\beta$ to be candidates for the initial solution. We have set the complexity threshold in this step to be higher than the actual threshold $\beta$ to increase the number of candidates. The constant 1.2 has been tuned experimentally to give a good trade-off between efficiency and accuracy. We choose the most recent of the candidate frames as the basis for our initial solution because the frames closest to the current frame are usually the most similar. If the chosen basis frame $E$ is the frame $F$ that immediately precedes the current frame, then the initial solution is $I = E = F$. Referring to (2), we set $X^{I}_{ij} = X^{E}_{ij}$, $1 \le i \le n$, $1 \le j \le m$. If $E$ and $F$ are different, then we integrate them into a single initial solution $I$ by resolving the differences between the states of their macroblocks. We initialize $I = E$ and then apply the following pseudo-code for each macroblock $i$, $1 \le i \le n$.

$$\text{if } \sum_{j=1}^{m} RDC_{ij} \cdot X^{F}_{ij} > \sum_{j=1}^{m} RDC_{ij} \cdot X^{E}_{ij} \ \text{ then } \ X^{I}_{ij} = X^{F}_{ij} \ \text{ for each } 1 \le j \le m$$

The idea behind these substitutions is to modify the basis solution $E$ to get an initial solution that is closer to frame $F$. However, we do not perform substitutions for macroblocks that would decrease the total RDC while potentially increasing the solution weight.
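A small sketch of this first step is given below, assuming that each frame's solution is stored as a list of chosen states and that the per-macroblock RDC values of the current frame are available; the function and variable names are ours.

```python
def build_initial_solution(basis_states, prev_states, rdc):
    """Sketch of step 1: merge the basis frame E with the preceding frame F.

    basis_states[i] -- state chosen for macroblock i in the basis solution E
    prev_states[i]  -- state chosen for macroblock i in the immediately
                       preceding frame F
    rdc[i][j]       -- rate-distortion cost of state j for macroblock i
                       in the current frame
    A macroblock adopts F's state only when that state has a higher RDC than
    E's state, mirroring the substitution rule above.
    """
    initial = list(basis_states)
    for i, (e_state, f_state) in enumerate(zip(basis_states, prev_states)):
        if rdc[i][f_state] > rdc[i][e_state]:
            initial[i] = f_state
    return initial
```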

The second step is to improve the initial solution. This step is quite similar to solving a dynamic programming problem with an initial solution. First we place the items $(i, j)$ in each macroblock group $G_i$ into a table sorted by RDC value. Then, we repeatedly exchange one of the items in the current solution with another one in the same macroblock group to decrease the sum of the RDCs according to the following policy. Let $v^c_i$ and $w^c_i$ be the solution value (sum of the RDCs) and weight (computational complexity) of a candidate item in $G_i$, and let $v^s_i$ and $w^s_i$ be the value and weight of the current solution in $G_i$. We define the Resource Requirement (RR) of the candidate item to be $RR_i = w^c_i - w^s_i$ and its Value Update to be $VU_i = \frac{v^c_i - v^s_i}{w^c_i - w^s_i}$.

Given two candidate items $(i, j_1)$ and $(i, j_2)$ in $G_i$ with RR values $RR_1, RR_2$ and VU values $VU_1, VU_2$, we use the following rules for the exchange:

• If $RR_1 \le 0$ or $RR_2 \le 0$, the item that minimizes the solution value the most without increasing the weight is chosen.
• If both $RR_1 > 0$ and $RR_2 > 0$, the item with the largest negative VU value (if one exists) is chosen.

The goal of these exchanges is to find the best exchange that does not increase the total weight of the current solution. If the first rule does not apply, then we choose the exchange that gives the largest decrease in RDC per unit of increase in computational complexity.
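The selection rule can be sketched as follows for a single macroblock group; names are ours, and the surrounding bookkeeping (sorted tables, feasibility updates) is omitted.

```python
def best_exchange_in_group(cur_state, rdc_i, cc_i):
    """Sketch of one exchange decision inside a macroblock group G_i.

    cur_state -- state currently selected for macroblock i
    rdc_i[j], cc_i[j] -- RDC and complexity of state j for this macroblock
    Returns the candidate state to swap in, or None when no exchange helps.
    """
    vs, ws = rdc_i[cur_state], cc_i[cur_state]
    rule1, rule2 = [], []                    # (metric, state) pairs
    for j, (vc, wc) in enumerate(zip(rdc_i, cc_i)):
        if j == cur_state:
            continue
        rr = wc - ws                         # resource requirement RR
        dv = vc - vs                         # change in total RDC
        if rr <= 0 and dv < 0:
            rule1.append((dv, j))            # shrinks value without adding weight
        elif rr > 0 and dv / rr < 0:
            rule2.append((dv / rr, j))       # negative value update VU
    if rule1:                                # rule 1 takes precedence
        return min(rule1)[1]
    if rule2:                                # largest RDC decrease per unit of
        return min(rule2)[1]                 # added complexity
    return None
```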

In this second step, we redefine feasibility from the $1.2\beta$ that we used in the first step to $\max(\sum_{i,j} CC_{ij} \cdot X_{ij}, \beta)$ so that the solution converges towards the target feasibility value of $\beta$. We continue exchanging items until no eligible item exists for exchange.

In the third step, we consider pairs of groups, looking for a pair of exchanges, one in each group, which when done together will decrease the total RDC value while maintaining feasibility, which in this step means that the total computational complexity does not exceed $\beta$.

Concerning the time complexity of the algorithm, we first sort all items in each group in time $O(nm \log_2 m)$. We then perform at most $n(m-1)$ exchanges to find the best solution and each exchange searches through all groups for a total of $O(n^2 m)$. The third step checks $O(n^2)$ possible pairs of exchanges. The total time complexity of our polynomial algorithm is $O(n^2 m + nm \log_2 m)$.

The dynamic programming method in Section 4.1 provides good solutions, but even with our modifications that permit the solution of large instances, the CPU and memory requirements are large. The greedy method that we have developed in this section is much more efficient, but it requires an initial solution that is reasonably close to the optimal solution, and errors propagate if it is used to find solutions for a sequence of consecutive frames. Our proposed method for controlling computation-based power consumption is a combination of the dynamic programming and greedy methods that attempts to minimize these shortcomings. We partition the sequence of frames into hybrid groups of pictures (HGOPs). The dynamic programming method is used to find an accurate solution for the first frame of each HGOP and to provide a good initial solution for the greedy solutions of the remaining frames. Our results in Section 5 show that this method provides a good trade-off between efficiency and accuracy.
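Putting the pieces together, the control flow of the hybrid method can be sketched as below. It reuses the solve_mckp_dp and build_initial_solution sketches from earlier in this section, simplifies the basis-frame search of step 1 to the HGOP's anchor solution, and abbreviates the exchange passes of steps 2 and 3.

```python
def encode_sequence_hybrid(frames, beta, scale, hgop_len):
    """Sketch of the hybrid DP + greedy control flow over HGOPs.

    frames   -- per-frame (rdc, cc) tables, as in the earlier sketches
    beta     -- decoder complexity threshold
    hgop_len -- number of frames per hybrid group of pictures (HGOP)
    """
    solutions = []
    for idx, (rdc, cc) in enumerate(frames):
        if idx % hgop_len == 0:
            # Accurate anchor solution for the first frame of the HGOP.
            _, states = solve_mckp_dp(rdc, cc, beta, scale)
        else:
            anchor = solutions[(idx // hgop_len) * hgop_len]   # simplified basis E
            prev = solutions[-1]                               # preceding frame F
            states = build_initial_solution(anchor, prev, rdc)
            # ... refine with the exchange passes of steps 2 and 3 (omitted)
        solutions.append(states)
    return solutions
```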

4.3. Solution method for multiple complexity thresholds

Our method for the SVC extension of H.264/AVC is also based on the dynamic programming method of Section 4.1. The primary goal is to minimize the total rate-distortion for all layers while satisfying multiple user complexity constraints. The dynamic programming method is designed for a single constraint, but we can use it recursively to handle the multiple complexity constraints in Problem (4).

The general idea of our approach is to solve a relaxation of Problem (4) to obtain an initial solution, which provides a lower bound on the optimal solution, and then converge the initial solution towards the optimal solution by reinforcing the original constraints.

The relaxed problem is obtained by removing all constraints except the last one to obtain Problem (7) below. The solution for (7) provides a lower bound on the optimal solution for Problem (4), but it is not necessarily a feasible solution for (4) because some of the omitted complexity constraints might not be satisfied.

$$
\begin{aligned}
\text{minimize} \quad & \sum_{h=0}^{l} \sum_{i=1}^{n_h} \sum_{j=1}^{m} RDC_{hij} \cdot X_{hij} \\
\text{subject to} \quad & \sum_{h,i,j} CC_{hij} \cdot X_{hij} \le \beta_l \\
& CC_{hij} = \sum_{k} P_{hijk} W_{k} \\
& \sum_{j} X_{hij} = 1; \quad \forall\, 0 \le h \le l;\ 1 \le i \le n_h \\
& X_{hij} \in \{0, 1\}; \quad 0 \le h \le l;\ 1 \le i \le n_h;\ 1 \le j \le m
\end{aligned}
\qquad (7)
$$

If the initial solution for (7) is a feasible solution for Problem (4), then it is the final solution. Otherwise, one or more constraints are not satisfied, and we converge the initial solution for (7) towards a solution for (4). First, we calculate a fair distribution of the available complexity budget $\beta_l$ among the macroblocks of all layers. In particular, each layer $h$ will be given a fraction of the complexity budget proportional to the ratio $\frac{n_h}{n_0}$ of the number of macroblocks in layer $h$ to the number of macroblocks in the base layer. Let $\alpha_f$ denote the amount of the budget that will be allocated to layer $f$. We start by solving for $\alpha_0$ using the following method with $f = 0$.

$$
\begin{aligned}
\alpha_f & \le \beta_f \\
\alpha_f + \left(\frac{n_{f+1}}{n_f}\right) \alpha_f & \le \beta_{f+1} \\
& \ \ \vdots \\
\alpha_f + \left(\frac{n_{f+1}}{n_f}\right) \alpha_f + \cdots + \left(\frac{n_l}{n_f}\right) \alpha_f & \le \beta_l
\end{aligned}
\qquad (8)
$$


Rearranging (8), we obtain

$$
\begin{aligned}
\alpha_f & \le \beta_f \\
\left(\frac{n_f + n_{f+1}}{n_f}\right) \alpha_f & \le \beta_{f+1} \\
& \ \ \vdots \\
\left(\frac{\sum_{h=f}^{l} n_h}{n_f}\right) \alpha_f & \le \beta_l
\end{aligned}
\qquad (9)
$$

The solution of (9) gives the value of $\alpha_f$:

$$
\alpha_f = \min\left( \beta_f,\ \frac{\beta_{f+1}\, n_f}{\sum_{h=f}^{f+1} n_h},\ \ldots,\ \frac{\beta_l\, n_f}{\sum_{h=f}^{l} n_h} \right)
\qquad (10)
$$
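Equation (10) translates directly into a short routine; the sketch below uses our own names, and the example numbers at the end are hypothetical.

```python
def layer_budget(f, n, beta):
    """Compute the complexity budget alpha_f of equation (10).

    f    -- index of the lowest layer not yet assigned a budget
    n    -- list of macroblock counts per layer, n[0..l]
    beta -- list of cumulative complexity thresholds, beta[0..l]
    """
    l = len(n) - 1
    candidates = []
    for h in range(f, l + 1):
        share = sum(n[f:h + 1])              # sum of n_f .. n_h
        candidates.append(beta[h] * n[f] / share)
    return min(candidates)                   # the h = f term equals beta_f itself

# Example with hypothetical numbers: three layers with CIF/4CIF/HD-like MB counts.
n = [396, 1584, 3600]
beta = [0.05, 0.4, 1.0]
print(layer_budget(0, n, beta))
```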

Next, we allocate budget $\alpha_0$ to the base layer (layer 0) and solve Problem (11) independently of the enhancement layers using the dynamic programming method.

$$
\begin{aligned}
\text{minimize} \quad & \sum_{i=1}^{n_f} \sum_{j=1}^{m} RDC_{fij} \cdot X_{fij} \\
\text{subject to} \quad & \sum_{i,j} CC_{fij} \cdot X_{fij} \le \alpha_f \\
& CC_{fij} = \sum_{k} P_{fijk} W_{k} \\
& \sum_{j} X_{fij} = 1; \quad 1 \le i \le n_f \\
& X_{fij} \in \{0, 1\}; \quad 1 \le i \le n_f;\ 1 \le j \le m
\end{aligned}
\qquad (11)
$$

After solving (11) with $f = 0$, the base layer and its budget are removed from (4) to obtain the following Problem (12), which we solve for the enhancement layers by repeating the method above with $f = 1, 2, 3, \ldots$ until a feasible solution for Problem (4) is found.

$$
\begin{aligned}
\text{minimize} \quad & \sum_{h=1}^{l} \sum_{i=1}^{n_h} \sum_{j=1}^{m} RDC_{hij} \cdot X_{hij} \\
\text{subject to} \quad & \sum_{i,j} CC_{1ij} \cdot X_{1ij} \le \beta_1 - \alpha_0 \\
& \sum_{i,j} CC_{1ij} \cdot X_{1ij} + \sum_{i,j} CC_{2ij} \cdot X_{2ij} \le \beta_2 - \alpha_0 \\
& \quad \vdots \\
& \sum_{h,i,j} CC_{hij} \cdot X_{hij} \le \beta_l - \alpha_0 \\
& CC_{hij} = \sum_{k} P_{hijk} W_{k} \\
& \sum_{j} X_{hij} = 1; \quad \forall\, 1 \le h \le l;\ 1 \le i \le n_h \\
& X_{hij} \in \{0, 1\}; \quad 1 \le h \le l;\ 1 \le i \le n_h;\ 1 \le j \le m
\end{aligned}
\qquad (12)
$$
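The overall control flow of the multi-threshold method can be sketched as follows, reusing solve_mckp_dp (Section 4.1 sketch) and layer_budget (above). The sketch simplifies the procedure described in this section: it peels off every layer once instead of re-checking feasibility of the reduced problem (12) after each allocation, and it assumes each subproblem is feasible.

```python
def solve_multilayer(rdc, cc, beta, scale):
    """Sketch of the multi-threshold method of Section 4.3.

    rdc[h][i][j], cc[h][i][j] -- per-layer RDC and complexity tables
    beta[h] -- cumulative threshold for a receiver that decodes layers 0..h
    """
    l = len(rdc) - 1
    n = [len(layer) for layer in rdc]

    # 1) Relaxation (7): pool all layers under the last constraint only.
    pooled_rdc = [row for layer in rdc for row in layer]
    pooled_cc = [row for layer in cc for row in layer]
    _, states = solve_mckp_dp(pooled_rdc, pooled_cc, beta[l], scale)

    # Check the omitted constraints: cumulative complexity of layers 0..h.
    per_layer, pos = [], 0
    for h in range(l + 1):
        layer_states = states[pos:pos + n[h]]
        per_layer.append(sum(cc[h][i][j] for i, j in enumerate(layer_states)))
        pos += n[h]
    if all(sum(per_layer[:h + 1]) <= beta[h] for h in range(l + 1)):
        return states                        # the lower bound is already feasible

    # 2) Otherwise, allocate budgets with (10) and solve each layer separately.
    solutions, used = [], 0.0
    for f in range(l + 1):
        alpha = layer_budget(0, n[f:], [b - used for b in beta[f:]])
        _, layer_sol = solve_mckp_dp(rdc[f], cc[f], alpha, scale)
        solutions.extend(layer_sol)
        used += alpha
    return solutions
```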

This algorithm has the same space complexity $O(\lfloor \beta/S \rceil)$ as the dynamic programming method described in Section 4.1. The worst-case time complexity occurs when the method has to be repeated for all enhancement layers. In this case, there are $l - 1$ instances of the single constraint Problem (11), and $l$ lower bound computations (7). As mentioned in Section 4.1, it takes $O(mn \log_2 n \lfloor \beta/S \rceil)$ time to solve Problem (11). Since $\beta \le \beta_l$ and $n \le \sum_{h=0}^{l} n_h$ in each iteration, and the number of states $m$ is usually small ($m = 12$ in our simulations in the next section), the running time of our approximation algorithm is $O\!\left(\lfloor \beta_l/S \rceil \sum_{h=0}^{l} n_h \log_2 \sum_{h=0}^{l} n_h\right)$.


5. SIMULATION RESULTS

We evaluated our decoder-complexity-aware encoding methods on a PC platform with an 8-core Intel 3.40 GHz Core i7 CPU and 8 GB of RAM, running the 64-bit Windows 7 Enterprise operating system. Four video sequences, each with 45 frames, were used for testing purposes: Parkrun, Stockholm, Shields, and Bluesky. The motion compensation component of H.264 consists of interpolation and mode decisions. The 12 possible states considered in our general testbed for each macroblock are five inter modes with interpolation, five inter modes without interpolation, and two intra modes which have zero complexity.

5.1. Simulation results for H.264/AVC

We tested our single stream method using the JM implementation of the H.264/AVC codec [Joint Video Team 2009]. The experiments were performed using an H.264 baseline profile, which has only I and P frames, and the configuration shown in Table I. It should be noted that the quantization parameter has been fixed to a constant value because changes to this parameter do not affect the performance of our algorithms.

Table I. AVC Encoder Configuration

Parameter       Setting        Parameter                     Setting
Level           4              Number of encoded frames      45
Frame rate      30             GOP size                      25
Intra period    25             Number of reference frames    1
Search Mode     Full search    Search range                  32
Size            HD             Quantization parameter        36

The simulation procedure consisted of two phases. In the first phase, we initialized the weights $W_k$ using pre-encoded bit streams in a training pool for the specific platform [Semsarzadeh et al. 2012]. The sequences were then fed into the JM reference software to determine RDC and CC values, and a lower bound for $\beta$. In the second phase, we applied our hybrid method to decrease the complexity to satisfy a user-defined threshold $\beta$. We used the dynamic programming method without the approximation and space reduction techniques to find the solution of equation (2) for each frame. This produced an optimal solution that we used as a benchmark to evaluate the performance of our hybrid method. We used the following three performance metrics to evaluate our methods.

— The optimality ratio (OR) measures the quality of a solution $X$ relative to a benchmark solution $X^B$:
$$OR = 1 - \frac{\sum_{i,j} RDC_{ij} \cdot X_{ij} - \sum_{i,j} RDC_{ij} \cdot X^{B}_{ij}}{\sum_{i,j} RDC_{ij} \cdot X^{B}_{ij}}$$
— The computation running time (CRT) is the average execution time of the algorithm per frame.
— The complexity error (CE) is the relative difference between the complexity of a solution $X$ and $\beta$:
$$CE = \frac{\sum_{i,j} CC_{ij} \cdot X_{ij} - \beta}{\beta}$$

We tested the dynamic programming method (with the approximation and space reduction techniques) for a wide range of scaling factors to determine the best one to use in the hybrid method. Figure 3 shows the trade-offs between complexity error (CE) and computation running time (CRT) for four video sequences. The CEs for the Parkrun, Stockholm, Shields, and Bluesky sequences are only 0.9%, 0.7%, 1%, and 1%, respectively, using scaling factor $S = 2^{13}$. Furthermore, with $S = 2^{13}$ the CRTs are trivial compared to the encoding time and the optimality ratios are close to 1, indicating that the solutions are close to optimal. Therefore, we used $S = 2^{13}$ when testing the hybrid method.

Fig. 3. Trade-offs between CE and CRT for the dynamic programming method. The dashed lines show CRTs and bars show CEs.
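The first and third metrics translate directly into code; a minimal sketch, assuming the same per-macroblock RDC/CC tables and state lists as in the earlier sketches.

```python
def optimality_ratio(rdc, states, bench_states):
    """OR of a solution relative to the benchmark solution."""
    cost = sum(rdc[i][j] for i, j in enumerate(states))
    bench = sum(rdc[i][j] for i, j in enumerate(bench_states))
    return 1.0 - (cost - bench) / bench

def complexity_error(cc, states, beta):
    """CE: relative difference between a solution's complexity and beta."""
    used = sum(cc[i][j] for i, j in enumerate(states))
    return (used - beta) / beta
```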

The reason that $S = 2^{15}$ is not chosen as the scaling factor is error cancellation that occurs at this scale. When the scaled computational complexity values are rounded to the nearest integer to obtain the $\lfloor CC_{ij}/S \rceil$ values in (6), the errors can be positive or negative, and they can cancel each other when combined into a CE value. When $S = 2^{15}$ was used, the total positive and total negative errors cancelled each other, which resulted in an artificially low total error, but the results that we obtained with $S = 2^{15}$ were not as good as the results for $S = 2^{13}$.

Figure 4 shows the trade-offs among OR, CRT, and the length of the HGOPs for the hybrid method for the four 45-frame sequences. We varied the length of the HGOPs between 2 and 30 and used $\beta = 2$ for all experiments. In the figure, $CRT_S$ is the average time per frame taken by the hybrid method to solve all 45 frames sequentially. $CRT_P$ refers to a parallel implementation that will be discussed later.

We can conclude from Figure 4 that the accuracy achieved by our hybrid method increases with decreasing HGOP length. When the HGOP length is 2, the OR (expressed as a percentage) is greater than 92% and it decreases gradually to the 82%-91% range as the HGOP length increases. The $CRT_S$ decreases gradually with increasing HGOP length and is between 0.34 and 3 seconds per frame for most HGOP lengths.

Even better real-time encoding times can be achieved for H.264 using fine- and coarse-grain parallelism [Rodriguez et al. 2006; Sankaraiah et al. 2011; Zrida et al. 2009]. Our method can be adapted to take advantage of additional hardware resources because it can exploit GOP parallelism to process the HGOPs independently in parallel. In Figure 4, $CRT_P$ is the maximum of the CRTs for the HGOPs, in contrast to $CRT_S$, which is the average of the CRTs. Both $CRT_P$ and OR improve when the HGOP length decreases, but more hardware is needed to achieve these gains.

Fig. 4. Trade-offs between CRT and OR for different HGOP lengths.

Figure 5 shows the impact of the user-defined threshold $\beta$ on the performance of our algorithms. We tested our hybrid method for all video sequences for HGOP size 2 and HGOP size 15 with $S = 2^{13}$ and $\beta$ ranging from 2 to 10. Values of $\beta > 10$ do not result in significant performance improvements for any of the sequences. Variations in $\beta$ have no impact on our dynamic programming method but they have a large influence on the initial solution chosen by the greedy method. The greedy method examines all previous frames with computational complexity less than $1.2\beta$ for initial solution candidacy. If $\beta$ is increased, then more of the recent frames will be examined. This increases the chance of choosing an initial solution with a high rate-distortion value and improves the convergence of our hybrid method to an optimal solution. The improvements when $\beta$ is increased from 2 to 10 for the Parkrun, Shields, Stockholm, and Bluesky sequences are 3%, 5.5%, 8.2%, and 7.2%, respectively, for HGOP size 2, and 6.7%, 12.3%, 16.5%, and 16.5%, respectively, for HGOP size 15.

Fig. 5. Impact of β on the performance of the hybrid algorithm. The dashed lines are for HGOP size 15 and the solid lines are for HGOP size 2.

5.2. Feasibility of real-time implementation

In a media streaming context, the term real-time means delivering frames at the same rate that the client decodes them. A parallel hardware architecture for the 0/1 knapsack problem was presented in [Nibbelink et al. 2007]. The proposed architecture uses $\Theta(n + p(C + W_{max}))$ memory and the running time is $\Theta(nC/p + n \log(n/p))$, where $n$ is the number of objects, $C$ is the knapsack capacity, $p$ is the number of processors for parallel execution, and $W_{max}$ is the maximum size of any object. In our application, $C = \lfloor \beta/S \rceil$, $C \gg \log n$, and $W_{max}$ is a relatively small value compared to $C$, so utilizing such a hardware implementation to solve knapsack problems in our application without any parallelism would result in $\Theta(n + C)$ memory usage and $\Theta(nC)$ running time. Therefore, if we can adapt the method in [Nibbelink et al. 2007] to our application, we will reduce the running time by a logarithmic factor of $\log n$ compared to the approach that we present in this paper, at the expense of an increase in the memory usage by an additive factor of $n$. With parallel processing, even more time can be saved at the expense of allocating more memory space and processing units to the algorithm. According to the experimental results of Nibbelink et al. [2007], which evaluate the running time of the algorithm based on knapsack capacity $C$ and the number of objects $n$, it takes approximately 50 seconds using 32 processors when $n = 250000$ and $C = 50000$. In our case, $n = 3600$ and $\beta$ varies from 1 second to 10 seconds, which translates to $C$ ranging from 12000 to 120000 in their formulation when our scaling factor $S$ is set to $2^{13}$. Since the running time of the algorithm proposed by Nibbelink et al. [2007] is proportional to the number of objects, it can be normalized to $n = 3600$ with a resulting time of 720 ms.
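The normalization in the last sentence follows from the linear dependence of the running time on the number of objects $n$:

$$t \approx 50\ \text{s} \times \frac{3600}{250000} = 0.72\ \text{s} = 720\ \text{ms}.$$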

A hardware implementation of dynamic programming for the multi-dimensional knapsack problem was presented by Berger and Galea [2013], who implemented dynamic programming by parallelizing the for loops in the algorithm on Graphics Processing Units (GPUs). Their experimental results show that a multi-dimensional knapsack problem with $n = 10000$ objects and knapsack capacity $\beta = 4989314$ can be solved in 900 ms. The results in [Berger and Galea 2013] cannot be directly compared to our scenario with $n = 3600$ and $\beta$ ranging from 12000 to 120000, but the application of their method to our scenario would certainly result in a running time of less than 900 ms.

In summary, the two implementations in [Nibbelink et al. 2007] and [Berger and Galea 2013] are promising hardware approaches that show the feasibility of implementing our method in real-time in the near future.


5.3. Simulation results for SVC extension

We tested our method for multiple complexity thresholds using the conventional SVC reference software JSVM 9.19.15 [Joint Video Team 2011] and the same four test sequences that we used in the previous section for H.264/AVC. We used three layers and the configuration shown in Table II for our experiments. The profile has frames of types I, B, and P, in contrast to the profile for H.264/AVC which had only types I and P. We tested several different temporal and spatial scalabilities. Since SNR scalability does not affect our experiments, we fixed the value of the quantization parameter to 36. We tested two GOP sizes (5 and 9) for layer 2; the size of the GOP for layer 2 determines the sizes for layers 0 and 1.

We used a lower bound on the optimal solution for Problem (4) (obtained by solving Problem (7)) as the benchmark for evaluating our heuristic algorithm. This lower-bound solution is not necessarily feasible for Problem (4), and the real optimal solution for Problem (4) might have a higher RDC. Thus, the actual performance of our algorithm might be better than the values reported here. The optimality ratio is calculated using the following formula.

\[
OR \;=\; 1 \;-\; \frac{\sum_{h=1}^{l}\sum_{i=1}^{n_h}\sum_{j=1}^{m} RDC_{hij}\cdot X_{hij} \;-\; \sum_{h=1}^{l}\sum_{i=1}^{n_h}\sum_{j=1}^{m} RDC_{hij}\cdot X^{B}_{hij}}{\sum_{h=1}^{l}\sum_{i=1}^{n_h}\sum_{j=1}^{m} RDC_{hij}\cdot X^{B}_{hij}}
\]
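For clarity, the same ratio can be computed directly from the per-block RDC values and the two selections. The sketch below uses flattened arrays and illustrative names; it is not part of the reference software.

```python
import numpy as np

def optimality_ratio(rdc, x_alg, x_bound):
    """Compute OR as defined above.  rdc, x_alg, and x_bound are arrays
    flattened over all (layer h, block i, mode j) triples; x_alg is the
    0/1 selection produced by our heuristic and x_bound is the lower-bound
    (relaxed) selection used as the benchmark."""
    cost_alg = float(np.sum(rdc * x_alg))
    cost_bound = float(np.sum(rdc * x_bound))
    return 1.0 - (cost_alg - cost_bound) / cost_bound
```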

Table II. SVC Encoder Configuration

Quantization parameter                           36
Search mode                                      Block search
Intra period                                     8
Fast search                                      OFF
Inter-layer prediction                           ON
Search range        Base layer (Layer 0)         16
                    Enhancement Layers 1 and 2   8
GOP size            Base layer (Layer 0)         2 / 3
                    Enhancement Layer 1          3 / 5
                    Enhancement Layer 2          5 / 9
Spatial resolution  Base layer (Layer 0)         CIF
                    Enhancement Layer 1          4CIF
                    Enhancement Layer 2          HD
Frame rate (fps)    Base layer (Layer 0)         15
                    Enhancement Layer 1          30
                    Enhancement Layer 2          60

(The two GOP-size values correspond to the two tested configurations, with layer-2 GOP sizes of 5 and 9, respectively.)

We conducted extensive tests of our method with different combinations of complexity thresholds (β0, β1, and β2) for the three layers for GOP sizes 5 and 9. Each of the graphs in Figures 6 and 7 shows 150 data points. The values chosen for each β range between 0 and the saturation point for the corresponding layer. The saturation point for a layer is the complexity consumption of the layer in the lower-bound solution. Allocating complexity to a layer beyond its saturation point can only result in negligible performance improvements.
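As an illustration of how these test points can be enumerated, the sketch below builds the Cartesian product of per-layer threshold values. The concrete lists are placeholders (only the β0 endpoints 0.02 and 0.1 and the value β1 = 0.6 appear in the text); they are chosen so that 5 × 5 × 6 = 150 combinations match the number of points per graph.

```python
from itertools import product

# Placeholder threshold values; in the experiments each beta lies between
# 0 and the saturation point of its layer.
betas0 = [0.02, 0.04, 0.06, 0.08, 0.10]       # five points per curve
betas1 = [0.2, 0.4, 0.6, 0.8, 1.0]            # five columns of curves
betas2 = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]       # one curve per value

test_points = list(product(betas0, betas1, betas2))
assert len(test_points) == 150                # 5 * 5 * 6 combinations per graph
```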

Each curve in Figures 6 and 7 corresponds to a value of β2 as indicated in the legend at the top of each graph. There are five columns of curves in each graph corresponding to different values of β1 according to the legend at the bottom of each graph. Each curve contains five points corresponding to different values of β0 ranging from 0.1 to 0.02, as indicated near the bottom of each graph above the legend for β1.

Fig. 6. Optimality Ratio for different β0, β1, and β2 for GOP size 5.

It is clear in the figures that decreasing β1 for a fixed value of β2 results in a decrease in OR values. In general, the values decrease from the range 99%-100% to 91%-100% for GOP size 5, and the effects are similar for GOP size 9. These decreases depend on the extent to which the lower-bound solution is infeasible. In contrast, decreasing β2 for a fixed value of β1 increases the relative amount of complexity available to layers 0 and 1, and this results in an improvement in OR values. Reductions in the value of β1 increase the variations in OR values. The worst cases for GOP size 5 occur when β1 = 0.6, for which the variations range from 96.5% to 99.9%, 93% to 99%, 91% to 95%, and 86.4% to 96.1% in the Parkrun, Stockholm, Shields, and Bluesky sequences, respectively. The patterns are similar for GOP size 9. The effects of changes to β0 are small because layer 0 is small relative to layers 1 and 2, but it is clear in the figures that the changes in OR are measurable. In summary, the OR achieved by our method is at least 84% in all tested cases and the average optimality ratio is at least 97%.

Fig. 7. Optimality Ratio for different β0, β1, and β2 for GOP size 9.

5.4. Power Analysis

We evaluated the actual power consumption of our method using a Qualcomm Snapdragon MSM8960 mobile device and the Trepn Profiler available from Qualcomm [Qualcomm 2014]. Our analysis consisted of comparing two different encoder settings. In the high efficiency (HE) setting, which is the default, there are no constraints on the decoder, and the output bit-stream is efficient in terms of bit-rate and visual quality. In the low complexity (LC) setting, we used our multiple stream method with complexity constraints to reduce the complexity of the MC component in each SVC layer to half of the value it had in the HE setting. Then, we used JSVM with the new encoding configuration from the solution produced by our multiple stream method to re-encode all of the test sequences.

Since SVC decoding is not implemented on mobile devices, we implemented the multiple stream scenario by simplifying the decoding procedure and performing each step manually. First, we used the JSVM encoding software to extract the video bit-streams for each layer from the main stream, and then each video stream was decoded in an Android-based FFmpeg video decoder [Bellard and Niedermayer 2012]. The Trepn Profiler was used to monitor the power consumption during the decoding process of each video sequence. We used three layers in our experiments with different temporal and spatial scalabilities and the encoder configuration shown in Table II over 30 frames of each test sequence.

During the measurement process, the operating system was also consuming power in the background and introducing noise into our measurements. To remove this noise from the results, we measured the baseline power consumption by monitoring the system when it was in an idle state, and then deducted it from the measurements of the decoding process.
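A minimal sketch of this baseline correction follows; the function name and the sample values are illustrative, and the raw samples are assumed to come from the profiler.

```python
def net_decoding_power_mw(samples_mw, idle_baseline_mw):
    """Subtract the idle-state baseline from raw power samples so that
    only the power attributable to the decoding process remains."""
    return [max(s - idle_baseline_mw, 0.0) for s in samples_mw]

# Example (made-up numbers): raw samples around 620 mW during decoding with
# a 150 mW idle baseline leave roughly 470 mW attributable to the decoder.
print(net_decoding_power_mw([620.0, 615.0, 640.0], 150.0))
```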

Fig. 8. Average decoder power consumption for the two types of input sequence, low complexity (LC) and high efficiency (HE), for four different test sequences.

Figure 8 shows the average power consumption measured over 15 seconds of decoding in the LC and HE settings for the four video sequences. Generally, more power was consumed in the HE setting than in the LC setting for each sequence. Furthermore, Figure 8 shows that power consumption was reduced in each layer of each test sequence. Power saving is defined as the power reduction in the LC setting compared to the HE setting. The average reduction in power consumption per layer shown in Figure 8 is 13%, with some reductions as high as 23%. For example, power was reduced in layer 1 of the Stockholm sequence from 500 mW to less than 400 mW. When evaluating the power savings achieved by our method, we should consider the fact that the motion compensation process in the HE setting contributes only around 38% of the total complexity of the decoder, and our method manipulates only this component to reduce the final power consumption. Thus, reducing the complexity of the MC component by 50% can potentially reduce the total power consumption by an average of around 19%. However, the behaviour of the other components that we do not manipulate will be affected by changes in the complexity of the MC component. In particular, a decrease in the complexity of the MC component can increase the power consumption of other components such as intra coding, and these increases are included in the measurements reported in Figure 8.
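To make the arithmetic behind these figures explicit, here is a short sketch using the approximate numbers from the discussion above.

```python
# Upper bound on the total power reduction attributable to MC alone:
mc_share = 0.38            # MC fraction of total decoder complexity (HE setting)
mc_reduction = 0.50        # reduction applied to MC complexity in the LC setting
potential_saving = mc_share * mc_reduction          # 0.19 -> about 19%

# Measured saving for one example layer (Stockholm, layer 1, approximate values):
p_he_mw, p_lc_mw = 500.0, 400.0
measured_saving = (p_he_mw - p_lc_mw) / p_he_mw     # 0.20 -> about 20%
```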

6. CONCLUSIONS

Mobile multimedia systems are becoming increasingly power-hungry while battery technology is not keeping pace. This increases the importance of power-aware video codecs. The computational complexity of video codecs, which consists of CPU operations and memory accesses, is one of the main factors affecting power consumption. The trade-off between the computational complexity of the decoding process and the rate-distortion of the output stream is one of the main features that can be tailored according to user preferences.

In this paper, we formulated the rate-distortion optimization problems and presented efficient methods for decoder-complexity-aware video encoding for both single and multiple receiver scenarios, and we used experiments with the H.264/AVC video codec and the SVC extension of H.264 to evaluate our methods.


Our formulation of the rate-distortion optimization problem for the single receiver scenario is a multiple-choice knapsack problem. Our hybrid method is a combination of dynamic programming, scaling techniques, and greedy heuristics which attempts to find an optimal balance between the execution time and the optimality ratio for a consecutive series of video frames. Our experiments with H.264/AVC show that our method achieves up to 97% of the optimal video quality while at the same time guaranteeing that the computational complexity needed to decode the video does not exceed a specific threshold defined by a user.

The generalization of our formulation to the multiple receiver scenario is a multi-dimensional multiple-choice knapsack problem in which the number of dimensions (complexity constraints) corresponds to the number of layers. We assume that each layer has its own computational complexity threshold and rate-distortion values, which are reflected in the constraints and objective function. Our method is a combination of dynamic programming and greedy heuristics which first relaxes the problem to obtain an initial solution and then converges the initial solution towards an optimal solution by reinforcing the original constraints. Our experiments with the SVC extension of H.264 show that our method achieves an average of more than 97% of the optimal video quality.

We also measured the actual savings in power consumption achieved by our method in the multiple receiver scenario. Our results show power savings of up to 23% and an average of 13%.

Our future work will mainly focus on expanding the current work to a heterogeneous environment with a large number of users in which each user can subscribe to a group with a particular complexity demand. The main challenge will be selecting the appropriate number of layers and a suitable complexity threshold for each group of users.

ACKNOWLEDGMENTS

We thank the referees for detailed constructive criticism which significantly improved this paper.

REFERENCES

J.C. Bean. 1987. Multiple choice knapsack functions. Technical Report. University of Michigan.
F. Bellard and M. Niedermayer. 2012. FFmpeg. Retrieved October 8, 2014 from http://www.ffmpeg.org
Y. Benmoussa, J. Boukhobza, E. Senn, and D. Benazzouz. 2013. Energy Consumption Modeling of H.264/AVC Video Decoding for GPP and DSP. In Euromicro Conf. on Digital System Design (DSD). IEEE, 890–897.
K.E. Berger and F. Galea. 2013. An Efficient Parallelization Strategy for Dynamic Programming on GPU. In 27th Int. Parallel and Distributed Processing Symposium Workshops and PhD Forum (IPDPSW). IEEE, 1797–1806.
B. Bross, W.-J. Han, J.-R. Ohm, G.J. Sullivan, Y.-K. Wang, and T. Wiegand. 2013. High Efficiency Video Coding (HEVC) Text Specification Draft 10 (for FDIS & Final Call). JCT-VC Doc. JCTVC-L1003 (Jan 2013).
J.A. Choi and Y.S. Ho. 2008. Deblocking Filter Algorithm with Low Complexity for H.264 Video Coding. In Proceedings of the 9th Pacific Rim Conference on Multimedia: Advances in Multimedia Information Processing (PCM '08). Springer-Verlag, Berlin, Heidelberg, 138–147.
T.A. da Fonseca and R.L. de Queiroz. 2013. Energy-constrained real-time H.264/AVC video coding. In Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1739–1743.
D. Grois and O. Hadar. 2014. Complexity-Aware Adaptive Pre-Processing Scheme for Region-of-Interest Spatial Scalable Video Coding. IEEE Trans. on Circuits and Systems for Video Technology 24, 6 (2014), 1025–1039.
Y. He, M. Kunstner, S. Gudumasu, E.-S. Ryu, Y. Ye, and X. Xiu. 2013. Power aware HEVC streaming for mobile. In Visual Communications and Image Processing (VCIP). IEEE, 1–5.
D.S. Hirschberg. 1975. A Linear Space Algorithm for Computing Maximal Common Subsequences. Commun. ACM 18, 6 (1975), 341–343.


M. Horowitz, A. Joch, F. Kossentini, and A. Hallapuro. 2003. H.264/AVC baseline profile decoder complexity analysis. IEEE Trans. on Circuits and Systems for Video Technology 13, 7 (2003), 704–716.
M. Jamali Langroodi, J. Peters, and S. Shirmohammadi. 2013. Complexity Aware Encoding of the Motion Compensation Process of the H.264/AVC Video Coding Standard. In Proceedings of the Network and Operating System Support on Digital Audio and Video Workshop (NOSSDAV '14). ACM, New York, NY, USA, Article 103, 6 pages.
Joint Video Team. 2009. H.264/AVC Reference Software Version JM 16.2. (Dec 2009). Retrieved October 8, 2014 from http://iphome.hhi.de/suehring/tml/
Joint Video Team. 2010. Advanced Video Coding for Generic Audiovisual Services of ISO/IEC MPEG & ITU-T VCEG. ITU-T Rec. H.264 and ISO/IEC 14496-10 Advanced Video Coding, Edition 5.0 (incl. SVC extension).
Joint Video Team. 2011. H.264/SVC Reference Software (JSVM 9.19) and Manual. Retrieved October 8, 2014 from CVS server at garcon.ient.rwth-aachen.de.
H. Jung and K. Ryoo. 2013. An Intra Prediction Hardware Architecture with Low Computational Complexity for HEVC Decoder. In Future Information Communication Technology and Applications. Springer, 549–557.
S.W. Lee and C.-C.J. Kuo. 2010. Complexity Modeling of Spatial and Temporal Compensations in H.264/AVC Decoding. IEEE Trans. on Circuits and Systems for Video Technology 20, 5 (May 2010), 706–720.
S.W. Lee and C.-C.J. Kuo. 2011. H.264/AVC Entropy Decoder Complexity Analysis and Its Applications. Journal of Visual Communication and Image Representation 22, 1 (Jan 2011), 61–72.
Y. Lee, J. Kim, and C.M. Kyung. 2012. Energy-Aware Video Encoding for Image Quality Improvement in Battery-Operated Surveillance Camera. IEEE Trans. on Very Large Scale Integration (VLSI) Systems 20, 2 (2012), 310–318.
Z. Ma, H. Hu, and Y. Wang. 2011. On Complexity Modeling of H.264/AVC Video Decoding and Its Application for Energy Efficient Decoding. IEEE Trans. on Multimedia 13, 6 (Dec 2011), 1240–1255.
K. Nibbelink, S. Rajopadhye, and R. McConnell. 2007. 0/1 Knapsack on Hardware: A Complete Solution. In Int. Conf. on Application-specific Systems, Architectures, and Processors. IEEE, 160–167.
Qualcomm. 2014. Trepn Profiler. Retrieved February 20, 2014 from https://developer.qualcomm.com/mobile-development/increase-app-performance/trepn-profiler
I.E. Richardson. 2004. H.264 and MPEG-4 Video Compression: Video Coding for Next-generation Multimedia. John Wiley & Sons, Inc.
A. Rodriguez, A. Gonzalez, and M.P. Malumbres. 2006. Hierarchical parallelization of an H.264/AVC video encoder. In 6th Int. Symposium on Parallel Computing in Electrical Engineering. IEEE, 363–368.
S. Sankaraiah, H.S. Lam, C. Eswaran, and J. Abdullah. 2011. GOP Level Parallelism on H.264 Video Encoder for Multicore Architecture. In Int. Conf. on Circuits, System and Simulation (IPCSIT), Vol. 7. IACSIT Press, 127–132.
H. Schwarz, D. Marpe, and T. Wiegand. 2007. Overview of the Scalable Video Coding Extension of the H.264/AVC Standard. IEEE Trans. on Circuits and Systems for Video Technology 17, 9 (Sept 2007), 1103–1120.
M. Semsarzadeh, M. Jamali Langroodi, M.R. Hashemi, and S. Shirmohammadi. 2012. Complexity Modeling of the Motion Compensation Process of the H.264/AVC Video Coding Standard. In Int. Conf. on Multimedia and Expo (ICME). IEEE, 925–930.
M. Viitanen, J. Vanne, T.D. Hamalainen, M. Gabbouj, and J. Lainema. 2012. Complexity analysis of next-generation HEVC decoder. In Int. Symposium on Circuits and Systems (ISCAS). IEEE, 882–885.
Y. Wang, W. Zhang, Z. Zhang, and P. An. 2013. A low complexity deblocking filtering for multiview video coding. IEEE Trans. on Consumer Electronics 59, 3 (Aug 2013), 666–671.
T. Wiegand, G. Sullivan, J. Reichel, H. Schwarz, and M. Wien. 2007. Joint Draft 10 of SVC Amendment. Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG Doc. JVT-W201 (Apr 2007).
H.K. Zrida, A. Jemai, A.C. Ammari, and M. Abid. 2009. High level H.264/AVC video encoder parallelization for multiprocessor implementation. In Design, Automation and Test in Europe Conference and Exhibition. IEEE, 940–945.
