
A Technical Overview of AV1

Jingning Han, Senior Member, IEEE, Bohan Li, Member, IEEE, Debargha Mukherjee, Senior Member, IEEE, Ching-Han Chiang, Cheng Chen, Hui Su, Sarah Parker, Urvang Joshi, Yue Chen, Yunqing Wang, Paul Wilkins, Yaowu Xu, Senior Member, IEEE, and James Bankoski

Abstract—The AV1 video compression format is developed by the Alliance for Open Media consortium. It achieves more than 30% reduction in bit-rate compared to its predecessor VP9 for the same decoded video quality. This paper provides a technical overview of the AV1 codec design that enables the compression performance gains with considerations for hardware feasibility.

Index Terms—AV1, Alliance for Open Media, video compression

I. INTRODUCTION

THE last decade has seen a steady and significant growth of web-based video applications including video-on-demand (VoD) services, live streaming, conferencing, and virtual reality [1]. Bandwidth and storage costs have driven the need for video compression techniques with better compression efficiency. VP9 [2] and HEVC [3], both debuted in 2013, achieved in the range of 50% higher compression performance than the prior codec H.264/AVC [4] and were quickly adopted by the industry.

As the demand for high performance video compression continued to grow, the Alliance for Open Media [5] was formed in 2015 as a consortium for the development of open, royalty-free technology for multimedia delivery. Its first video compression format AV1, released in 2018, enabled about 30% compression gains over its predecessor VP9. An open-source AV1 codec, libaom [6], has since been developed as a reference codec for various production use cases including VoD, video conferencing and light field, with encoder optimizations that utilize the AV1 coding tools for compression performance improvements while keeping the computational complexity in check. The AV1 format is already supported by many web platforms including Android, Chrome, Microsoft Edge, and Firefox, and multiple web-based video service providers, e.g., YouTube, Netflix, Vimeo, and Bitmovin, have begun rolling out AV1 streaming services at scale.

Web-based video applications have seen a rapid shift from conventional desktop computers to mobile devices and TVs in recent years. For example, it is quite common to see users watch YouTube and Facebook videos on mobile phones. Meanwhile, nearly all smart TVs released after 2015 have native apps to support movie playback from YouTube, Netflix, and Amazon. Therefore, a new generation video compression format needs to ensure it is decodable on these devices. However, to improve the compression efficiency, it is almost inevitable that a new codec will include coding techniques that are more computationally complex than its predecessors. With the slowdown in the growth of general CPU clock frequency and power constraints on mobile devices in particular, next generation video compression codecs are expected to rely heavily on dedicated hardware decoders. Therefore, during the AV1 development process, all the coding tools were carefully reviewed for hardware considerations (e.g., latency, silicon area, etc.), which resulted in a codec design well balanced between compression performance and hardware feasibility.

This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible. J. Han is with the WebM Codec team, Google LLC, Mountain View, CA, 94043 USA, e-mail: jingning@google.com.

This paper provides a technical overview of the AV1 codec. Prior literature highlights some major characteristics of the codec and reports preliminary performance results [7]-[9]. A description of the available coding tools in AV1 is provided in [8]. For syntax element definitions and decoder operation logic, the reader is referred to the AV1 specification [9]. This paper instead focuses on the design theories of the compression techniques and the considerations for hardware decoder feasibility, which together define the current state of the AV1 codec. For certain coding tools that primarily extend existing concepts in VP9 and hence demand substantial searches to realize the compression gains, it is imperative to complement them with proper encoder strategies that materialize the coding gains at a practical encoder complexity. We will further explore approaches to optimizing the trade-off between encoder complexity and coding performance therein. We note that the libaom AV1 encoder is being actively developed for better compression performance and higher encoding speed; we refer to the webpage [10] for updated performance statistics. The AV1 codec includes contributions from the entire AOMedia team [5] and the greater ecosystem around the globe. An incomplete contributor list can be found at [11].

The AV1 codec supports input video signals in the 4:0:0 (monochrome), 4:2:0, 4:2:2, and 4:4:4 formats. The allowed pixel representations are 8-, 10-, and 12-bit. The AV1 codec operates on pixel blocks. Each pixel block is processed in a predictive-transform coding scheme, where the prediction comes from either intra frame reference pixels, inter frame motion compensation, or some combination of the two. The residuals undergo a 2-D unitary transform to further remove the spatial correlations, and the transform coefficients are quantized. Both the prediction syntax elements and the quantized transform coefficient indexes are then entropy coded using arithmetic coding. There are 3 optional in-loop post-processing filter stages to enhance the quality of the reconstructed frame for reference by subsequent coded frames. A normative film grain synthesis unit is also available to improve the perceptual quality of the displayed frames.

We will start by considering frame level designs, before progressing on to look at coding block level operations and the entropy coding system applied to all syntax elements. Finally, we will discuss in-loop and out-of-loop filtering.

II. REFERENCE FRAME SYSTEM

A. Reference Frames

The AV1 codec allows a maximum of 8 frames in its decoded frame buffer. A frame being coded can choose any 7 frames from the decoded frame buffer as its reference frames. The bit-stream allows the encoder to explicitly assign each reference a unique reference frame index ranging from 1 to 7. Reference frame indexes 1-4 are designated for frames that precede the current frame in natural display order, whilst indexes 5-7 are for reference frames coming after the current one. For compound inter prediction, two references can be combined to form the prediction (see Section IV-C4). If both reference frames either precede or follow the current frame, this is considered uni-directional compound prediction, in contrast to bi-directional compound prediction where there is one previous and one future reference frame.

In estimation theory, it is commonly known that extrapolation (uni-directional compound) is usually less accurate than interpolation (bi-directional compound) prediction [12]. The allowed uni-directional reference frame combinations are hence limited to only 4 possible pairs, i.e., (1, 2), (1, 3), (1, 4), and (5, 7), whereas all 12 combinations in the bi-directional case are supported. This limitation reduces the total number of compound reference frame combinations from 21 to 16.

When the coding of a frame is complete, the encoder can decide which reference frame in the decoded frame buffer to replace, and explicitly signals this in the bit-stream. The mechanism also allows one to bypass updating the decoded frame buffer. This is particularly useful for high motion videos where certain frames are less relevant to their neighboring frames.

B. Alternate Reference Frame

The alternate reference frame (ARF) is a frame that is coded and stored in the decoded frame buffer without being displayed. It serves as a reference frame for subsequent frames to be processed. To transmit a frame for display, the AV1 codec can either code a new frame or directly use a frame in the decoded frame buffer – this is called "show existing frame". An ARF that is later directly displayed can be effectively used to code a future frame in a pyramid coding structure [13].

Moreover, the encoder has the option to synthesize a frame that can potentially reduce the collective prediction errors among several display frames. One example that we use in the libaom encoder is to apply temporal filtering along the motion trajectories of consecutive original frames to build an ARF, which retains the common information [14] with the acquisition noise on each individual frame largely removed. The encoder typically uses a relatively low quantization step size to code the common information (i.e., the ARF) to optimize the overall rate-distortion performance [15], [16]. A potential downside is that this results in an additional frame for decoders to process, which could stretch throughput capacity on some hardware, especially at high resolutions and frame rates such as 4K 60 fps and above. To balance compression performance and hardware decoder throughput, the frequency of the synthesized ARFs is typically limited to once per group of pictures (GOP). The minimum distance between two synthesized ARFs is also limited according to the frame resolution and playback rates specified in the level definition (see Section VII).

C. Frame Scaling

The AV1 codec supports the option to scale a source frame to a lower resolution for compression, and re-scale the reconstructed frame to the original frame resolution. This design is particularly useful when a few frames are overly complex to compress and hence cannot fit in the target streaming bandwidth range. The down-scaling factor is constrained to be within the range of 8/16 to 15/16. The reconstructed frame is first linearly upscaled to the original size, followed by a loop restoration filter as part of the post-processing stage. Both the linear upscaling filter and the loop restoration filter operations are normatively defined. We will discuss this in more detail in Section VI-D. To maintain a cost-effective hardware implementation, where no additional expense on line buffers is required beyond the size for regular frame decoding, the re-scaling process is limited to the horizontal direction. The up-scaled and filtered version of the decoded frame will be available as a reference frame for coding subsequent frames.
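As an illustration of the constraint above, the following sketch derives a coded frame width for a given horizontal scaling factor. The rounding rule here is an assumption for readability, not the normative computation defined in the AV1 specification.

```python
def downscaled_width(orig_width: int, numerator: int) -> int:
    """Illustrative horizontal down-scaling for AV1 frame scaling.

    AV1 constrains the scaling factor to numerator/16 with numerator
    in [8, 15], and scaling applies only horizontally. The
    round-to-nearest below is an assumption for illustration.
    """
    assert 8 <= numerator <= 15, "factor must lie in [8/16, 15/16]"
    return (orig_width * numerator + 8) // 16

# Example: a 1920-wide frame scaled by 12/16 would be coded at 1440
# samples wide, then normatively upscaled back to 1920 and filtered.
print(downscaled_width(1920, 12))  # 1440
```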

III. SUPERBLOCK AND TILE

A. Superblock

A superblock is the largest coding block the AV1 codec can process. The superblock size can be either 128 × 128 luma samples or 64 × 64 luma samples, which is decided by the sequence header. A superblock can be further partitioned into smaller coding blocks, each with its own prediction and transform modes. The coding of a superblock depends only on its above and left neighboring superblocks.

B. Tile

A tile is a rectangular array of superblocks whose spatial referencing, including intra prediction reference and the probability model update, is limited to be within the tile boundary. As a result, the tiles within a frame can be independently coded, which facilitates simple and effective multi-threading for both encoder and decoder implementations. The minimum tile size is 1 superblock. The maximum tile width corresponds to 4096 luma samples and the maximum tile size corresponds to 4096 × 2304 luma samples.

AV1 supports two ways to specify the tile size for each frame. The uniform tile size option follows the VP9 tile design and assumes all the tiles within a frame are of the same dimension, except those sitting at the bottom or right frame boundary. It allows one to signal the number of tiles vertically and horizontally in the bit-stream and derives the tile dimensions based on the frame size. A second option, the non-uniform tile size, assumes a lattice form of tiles. The spacing is non-uniform in both the vertical and horizontal directions, and the tile dimensions must be specified in the bit-stream in units of superblocks. This option recognizes the fact that the computational complexity differs across superblocks within a frame, due to variations in the video signal statistics. The non-uniform tile size option allows one to use smaller tile sizes for regions that require higher computational complexity, thereby balancing the workload among threads. This is particularly useful when one has ample computing resources in terms of threads and needs to minimize the frame coding latency. An example is provided in Figure 1 to demonstrate the two tile options.

Fig. 1: An illustration of the uniform and non-uniform tile sizes. The uniform tile size option uses the same tile dimension across the frame. The non-uniform tile size option requires a series of width and height values to determine the lattice.

The uniform/non-uniform tile size options and the tile sizes are decided on a frame-by-frame basis. It is noteworthy that the post-processing filters are applied across the tile boundaries to avoid potential coding artifacts (e.g., blocking artifacts) along the tile edges.

IV. CODING BLOCK OPERATIONS

A. Coding Block Partitioning

A superblock can be recursively partitioned into smaller block sizes for coding. VP9 uses a 4-way block partition tree that splits an N × N block into either N × N, N/2 × N, N × N/2, or N/2 × N/2 blocks. Only the square block sizes can be further partitioned. The superblock size is 64 × 64 luma samples. The minimum coding block size is 8 × 8, within which each 4 × 4 sub-block can potentially have different motion vectors towards the same reference frame(s) in inter block mode, or different prediction directions in intra block mode.

AV1 inherits the recursive block partitioning design. To reduce the overhead cost of prediction mode coding for video signals that are highly correlated, a situation typically seen in 4K videos, AV1 increases the maximum coding block size to 128 × 128 luma samples. The allowed partition options at each block level are extended to 10 possibilities as shown in Figure 2, which include N × N/4 and N/4 × N blocks. To improve the prediction quality for complex videos, the minimum coding block size is reduced to 4 × 4 luma samples. While such extensions provide more coding flexibility, they have implications for hardware decoders. Certain block size dependent constraints are specifically designed to circumvent such complications.

Fig. 2: The recursive block partition tree in AV1.

1) Block Size Dependent Constraints: The core computing unit in a hardware decoder is typically designed around a superblock. Increasing the superblock size from 64 × 64 to 128 × 128 would require about 4 times the silicon area for the core computing unit. To resolve this issue, we constrain the decoding operations to be conducted in 64 × 64 units even for larger block sizes. For example, to decode a 128 × 128 block in YUV420 format, one needs to decode the luma and chroma components corresponding to the first 64 × 64 block, followed by those corresponding to the next 64 × 64 block, etc., in contrast to processing the luma component for the entire 128 × 128 block followed by the chroma components. This constraint effectively re-arranges the entropy coding order for the luma and chroma components, and has no penalty on the compression performance. It allows a hardware decoder to process a 128 × 128 block as a series of 64 × 64 blocks, and hence retain the same silicon area.

At the other end of the spectrum, the use of 4 × 4 coding blocks increases the worst-case latency in YUV420 format, which occurs when all the coding blocks are 4 × 4 luma samples and are coded using intra prediction modes. To rebuild an intra coded block, one needs to wait for its above and left neighbors to be fully reconstructed, because of the spatial pixel referencing. In VP9, the 4 × 4 luma blocks within an 8 × 8 block are all coded in either inter or intra mode. If in intra mode, the collocated 4 × 4 chroma components will use an intra prediction mode followed by a 4 × 4 transform. An unconstrained 4 × 4 coding block size would require each 2 × 2 chroma sample unit to go through prediction and transform coding, which creates a dependency in the chroma component decoding. Note that inter modes do not have such spatial dependency issues.

AV1 adopts a constrained chroma component coding for 4 × 4 blocks in YUV420 format to resolve this latency issue. If all the luma blocks within an 8 × 8 block are coded in inter mode, the chroma component will be predicted in 2 × 2 units using the motion information derived from the corresponding luma block. If any luma block is coded in an intra mode, the chroma component will follow the bottom-right 4 × 4 luma block's coding mode and conduct the prediction in 4 × 4 units. The prediction residuals of the chroma components then go through a 4 × 4 transform.
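The decision logic above can be summarized with a short sketch. The function name and return structure below are illustrative stand-ins, not libaom APIs; the sketch simply encodes the two cases described in the text.

```python
def chroma_coding_plan(luma_modes):
    """Sketch of AV1's constrained chroma coding for an 8x8 luma area
    in YUV420. `luma_modes` holds the modes of the four 4x4 luma
    blocks in raster order; names and the return format are
    illustrative, not from the specification."""
    if all(m == "inter" for m in luma_modes):
        # All-inter case: chroma is predicted in 2x2 units, each unit
        # reusing the motion of its corresponding 4x4 luma block.
        return {"pred_unit": (2, 2), "mode_source": "per-luma-block motion"}
    # Any intra luma block: chroma follows the bottom-right 4x4 luma
    # block's mode and predicts (and transforms) in 4x4 units.
    return {"pred_unit": (4, 4), "mode_source": "bottom-right 4x4 luma mode"}

print(chroma_coding_plan(["inter"] * 4))
print(chroma_coding_plan(["inter", "intra", "inter", "inter"]))
```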

These block size dependent constraints enable the extension of the coding block partition system with limited impact on hardware feasibility. However, an extensive rate-distortion optimization search is required to translate this increased flexibility into compression gains.

Fig. 3: The first pass of the two-stage block partition search goes through square blocks only. The recursive partition point is denoted by R.

2) Two-Stage Block Partitioning Search: Observing that the key flexibility in variable coding block size is provided by the recursive partition that goes through the square coding blocks, we devise a two-stage partition search approach. The first pass starts from the largest coding block size and goes through square partitions only. The recursion search tree is illustrated in Figure 3. For each coding block, the rate-distortion search is limited, e.g., only using the largest transform block and the 2-D DCT kernel. Its partition decisions are then analyzed to determine the most likely operating range, in which the second block partition search conducts an extensive rate-distortion optimization search over all 10 possible partitions.

An example of the first-pass decision tree is depicted in Figure 4, where the second pass conducts a full partition search at the same block level decided by the first pass. The encoder bypasses the partition search at the 64 × 64 block level, and goes directly to the four 32 × 32 blocks. For the first 32 × 32 block A, it will check all the possible partitions at its level, i.e., 32 × 16, 16 × 32, 32 × 8, 16 × 16, etc. Note that it will not further check recursive partitions going down from 16 × 16. The partition with the lowest rate-distortion cost will be picked as the final partition decision for block A.

The second 32 × 32 block B will bypass the 32 × 32 level partition check. Instead the encoder will check each 16 × 16 block level partition, e.g., 16 × 16, 16 × 8, 8 × 16, 8 × 8, etc. It will not continue down the path of any 8 × 8 blocks. Similarly, block C bypasses the 32 × 32 level check. Three of the 16 × 16 blocks in block C go through a 16 × 16 level partition search, while the top-right 16 × 16 block bypasses the 16 × 16 level search and proceeds to an 8 × 8 level partition search.

Changing the allowed block size search range drawn from the first-pass partition results gives different trade-offs between compression performance and encoding speed; a skeleton of the search is sketched below. We refer to [17] for more experimental results.
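The following skeleton illustrates the two-stage idea under simplified assumptions: stage 1 recurses through square partitions with a cheap rate-distortion proxy, and stage 2 (indicated but not expanded) would run the full 10-way search only at the levels stage 1 selected. The cost callables and the stopping size are illustrative, not libaom internals.

```python
import random

def two_stage_partition_search(size, rd_fast, rd_full):
    """Skeleton of the two-stage partition search described above.
    `rd_fast`/`rd_full` are illustrative callables mapping a block
    size to a rate-distortion cost."""
    MIN_SIZE = 8  # assumed stopping size for this sketch

    def square_pass(sz):
        # Stage 1: square-only recursion with a restricted RD search.
        no_split = rd_fast(sz)
        if sz <= MIN_SIZE:
            return {"size": sz, "cost": no_split, "split": False}
        children = [square_pass(sz // 2) for _ in range(4)]
        split_cost = sum(c["cost"] for c in children)
        if split_cost < no_split:
            return {"size": sz, "cost": split_cost, "split": True,
                    "children": children}
        return {"size": sz, "cost": no_split, "split": False}

    plan = square_pass(size)
    # Stage 2 (not shown): for each leaf in `plan`, evaluate all 10
    # partition shapes at that level with rd_full and keep the best.
    return plan

plan = two_stage_partition_search(
    64, rd_fast=lambda s: random.random() * s,
    rd_full=lambda s: random.random() * s)
print(plan["size"], plan["split"])
```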

We will next discuss the compression techniques available at a coding block level within a partition.

Fig. 4: The second pass of the two-stage block partition search conducts an extensive rate-distortion search within the block size range provided by the first pass results.

Fig. 5: Directional intra prediction modes. The original 8 directions in VP9 are used as a base. Each allows a supplementary signal to tune the prediction angle in units of 3°.

B. Intra Frame Prediction

For a coding block in intra mode, the prediction mode for the luma component and the prediction mode for both chroma components are signaled separately in the bitstream. The luma prediction mode is entropy coded using a probability model based on its above and left coding blocks' prediction context. The entropy coding of the chroma prediction mode is conditioned on the state of the luma prediction mode. The intra prediction operates in units of transform blocks (as introduced in Section IV-E) and uses previously decoded boundary pixels as a reference.

1) Directional Intra Prediction: AV1 extends the directional intra prediction options in VP9 to support higher granularity. The original 8 directional modes in VP9 are used as a base in AV1, with a supplementary signal to fine tune the prediction angle. This comprises up to 3 steps clockwise or counter-clockwise, each of 3°, as shown in Figure 5. A 2-tap bilinear filter is used to interpolate the reference pixels when a prediction points to a sub-pixel position. For coding block sizes less than 8 × 8, only the 8 base directional modes are allowed, since the small number of pixels to be predicted does not justify the overhead cost of the additional granularity.
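As a minimal sketch of this signaling granularity, the mapping from a base directional mode and a delta index to a prediction angle can be written as follows. The base-angle table lists the nominal directions commonly associated with the 8 base modes, given in degrees purely for illustration; the specification operates in its own internal angle units.

```python
# Nominal base directions of the 8 directional modes (degrees,
# illustrative; not the specification's internal representation).
BASE_ANGLES = [45, 67, 90, 113, 135, 157, 180, 203]

def prediction_angle(base_mode: int, delta: int) -> int:
    """base_mode indexes the 8 base directional modes; delta is an
    integer in [-3, 3], each step worth 3 degrees."""
    assert -3 <= delta <= 3
    return BASE_ANGLES[base_mode] + 3 * delta

print(prediction_angle(2, -3))  # 90 - 9 = 81 degrees
```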


Fig. 6: An illustration of the distance weighted smooth intra prediction. The dark green pixels are the reference and the light blue ones are the prediction. The variables x and y are the distances from the left and top boundaries, respectively.

2) Non-directional Smooth Intra Prediction: VP9 has 2 non-directional smooth intra prediction modes: DC_PRED and TM_PRED. AV1 adds 3 new smooth prediction modes that estimate pixels using a distance weighted linear combination, namely SMOOTH_V_PRED, SMOOTH_H_PRED, and SMOOTH_PRED. They use the bottom-left (BL) and top-right (TR) reference pixels to fill the right-most column and bottom row, thereby forming a closed-loop boundary condition for interpolation. We use the notation in Figure 6 to demonstrate their computation procedures:

• SMOOTH_H_PRED: P_H = w(x)L + (1 − w(x))TR;
• SMOOTH_V_PRED: P_V = w(y)T + (1 − w(y))BL;
• SMOOTH_PRED: P = (P_H + P_V)/2,

where w(x) represents the preset weight based on the distance x from the boundary (and w(y) likewise for the distance y).
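A sketch of the three smooth predictors, directly following the formulas above, is given below. The decaying weight vector is an illustrative placeholder for the preset w(x), w(y) tables, and the use of the last top/left boundary pixels as TR and BL stand-ins is an assumption for self-containment.

```python
import numpy as np

def smooth_pred(top, left, weights, mode="SMOOTH_PRED"):
    """Distance-weighted smooth intra prediction sketch. `top`/`left`
    are reconstructed boundary pixel rows/columns; `weights` stands in
    for the preset w(.) table (values here are assumptions)."""
    n = len(top)
    TR, BL = top[-1], left[-1]  # stand-ins for the TR/BL references
    pred = np.zeros((n, n))
    for y in range(n):
        for x in range(n):
            ph = weights[x] * left[y] + (1 - weights[x]) * TR
            pv = weights[y] * top[x] + (1 - weights[y]) * BL
            pred[y, x] = {"SMOOTH_H_PRED": ph,
                          "SMOOTH_V_PRED": pv,
                          "SMOOTH_PRED": (ph + pv) / 2}[mode]
    return pred

n = 4
w = np.linspace(1.0, 0.25, n)  # decaying weights, illustrative
print(smooth_pred(np.full(n, 100.0), np.full(n, 50.0), w).round(1))
```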

AV1 replaces the TM_PRED mode, which operates as

P = T + L − TL,

with a PAETH_PRED mode that follows:

P = argmin_{x ∈ {T, L, TL}} |x − (T + L − TL)|.

The non-linearity in the PAETH_PRED mode allows the prediction to steer the referencing angle to align with the direction that exhibits the highest correlation.
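The per-pixel rule is compact enough to state directly in code; this follows the formula above (the same selection rule as the classic Paeth predictor used in PNG filtering):

```python
def paeth_pred(top: int, left: int, top_left: int) -> int:
    """PAETH_PRED for a single pixel: pick the reference (T, L, or TL)
    closest to the gradient estimate T + L - TL."""
    base = top + left - top_left
    return min((top, left, top_left), key=lambda ref: abs(base - ref))

# Example: when the left neighbor matches the top-left (L ~ TL), the
# estimate tracks T, so the predictor follows the top reference.
print(paeth_pred(top=200, left=60, top_left=62))  # -> 200
```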

3) Recursive Intra Prediction: To capture the decaying spatial correlation with reference pixels, a set of linear filters is designed for the luma component that predicts a 4 × 2 pixel patch using the 7 adjacent neighbors, e.g., p0-p6 for the blue patch in Figure 7. The predicted pixels serve as reference for the next patch. A total of 5 different sets of linear predictors are defined in the specification, each representing a different decaying pattern of the spatial correlation.

4) Chroma from Luma Prediction: Chroma from luma (CfL) prediction models chroma pixels as a linear function of corresponding reconstructed luma pixels. As depicted in Figure 8, the predicted chroma pixels are obtained by adding the DC prediction of the chroma block to a scaled AC contribution, which is the result of multiplying the AC component of the downsampled luma block by a scaling factor explicitly signaled in the bitstream [18].

Fig. 7: Recursive-filter-based intra predictor. Reference pixels p0-p6 are used to linearly predict the 4 × 2 patch in blue. The predicted pixels will be used as reference for the next 4 × 2 patch in the current block.

Fig. 8: Outline of the operations required to build the CfL prediction [18].

5) Intra Block Copy: AV1 allows intra-frame motion compensated prediction, which uses previously coded pixels within the same frame, namely Intra Block Copy (IntraBC). A motion vector at full-pixel resolution is used to locate the reference block. This may imply a half-pixel accuracy motion displacement for the chroma components, in which case a bilinear filter is used to conduct the sub-pixel interpolation. The IntraBC mode is only available for intra coding frames, and can be turned on and off via the frame header.

Typical hardware decoders pipeline the pixel reconstruction and the post-processing filter stages, such that the post-processing filters are applied to the decoded superblocks while later superblocks in the same frame are being decoded. Hence an IntraBC reference block is retrieved from the pixels after the post-processing filters. In contrast, a typical encoder would process all the coding blocks within a frame, then decide the post-processing filter parameters that minimize the reconstruction error. Therefore, the IntraBC mode would most likely access the coded pixels prior to the post-processing filters during rate-distortion optimization. Such a discrepancy hinders the efficiency of the IntraBC mode. To circumvent this issue, all the post-processing filters are disabled if the IntraBC mode is allowed in an intra-only coded frame.

In practice, the IntraBC mode is most likely useful for images that contain a substantial amount of text content, or similar repeated patterns, in which setting the post-processing filters are less effective. For natural images, where pixels largely form an auto-regressive model, the encoder needs to be cautious regarding the use of the IntraBC mode, as the absence of post-processing filters may trigger visual artifacts at coarse quantization.

Fig. 9: Translational motion compensated prediction.

6) Color Palette: In this mode, a palette of 2 to 8 base colors (i.e., pixel values) is built for each luma/chroma plane, where each pixel is assigned a color index. The number of base colors is an encoder decision that determines the trade-off between fidelity and compactness. The base colors are predictively coded in the bit-stream using those of neighboring blocks as reference. The color indexes are coded pixel-by-pixel using a probability model conditioned on previously coded color indexes. The luma and chroma channels can decide independently whether to use the palette mode. This mode is particularly suitable for pixel blocks that contain limited pixel variations.

C. Inter Frame Prediction

AV1 supports a rich set of tools to exploit the temporal correlation in video signals. These include adaptive filtering in translational motion compensation, affine motion compensation, and highly flexible compound prediction modes.

1) Translational Motion Compensation: A coding block uses a motion vector to find its prediction in a reference frame. It first maps its current position, e.g., the top-left pixel position (x0, y0) in Figure 9, into the reference frame, which is then displaced by the motion vector to the target reference block whose top-left pixel is located at (x1, y1).

AV1 allows 1/8 pixel motion vector accuracy. A sub-pixel is generated through separable interpolation filters. A typical procedure is shown in Figure 10, where one first computes the horizontal interpolation through all the related rows. A second, vertical filter is applied to the resulting intermediate pixels to produce the final sub-pixel. Clearly the intermediate pixels (orange) can be reused to produce multiple final sub-pixels (green).
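A sketch of this two-stage separable interpolation follows. It uses the 6-tap SMOOTH half-pel taps listed later in this subsection, normalized to sum to 1; the loop structure shows how the intermediate horizontal results are shared by the vertical stage.

```python
import numpy as np

def subpel_interp(ref, horz_taps, vert_taps):
    """Two-stage separable sub-pixel interpolation sketch. `ref` is
    the reference region; horizontal filtering runs first over all
    needed rows, and the intermediate pixels are reused by the
    vertical stage. Taps are assumed normalized to sum to 1."""
    t = len(horz_taps)
    h, w = ref.shape
    # Stage 1: horizontal filtering of every row.
    horz = np.empty((h, w - t + 1))
    for y in range(h):
        for x in range(w - t + 1):
            horz[y, x] = np.dot(ref[y, x:x + t], horz_taps)
    # Stage 2: vertical filtering of the intermediate array.
    out = np.empty((h - t + 1, horz.shape[1]))
    for y in range(h - t + 1):
        for x in range(horz.shape[1]):
            out[y, x] = np.dot(horz[y:y + t, x], vert_taps)
    return out

half_pel = np.array([-2, 14, 52, 52, 14, -2]) / 128.0  # SMOOTH, from the text
ref = np.random.default_rng(0).integers(0, 255, (13, 13)).astype(float)
print(subpel_interp(ref, half_pel, half_pel).shape)  # (8, 8)
```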

Common block-based encoder motion estimation is conducted via measurements of the sum of absolute differences (SAD) or the sum of squared errors (SSE) [19]-[21], which tend to favor a reference block that resembles the DC and lower AC frequency components well, whereas the high frequency components are less reliably predicted.

Fig. 10: Sub-pixel generation through separable interpolation filters.

An interpolation filter with a high cutoff frequency allows more high frequency components from the reference region to form the prediction, and is suitable for cases where the high frequency components between the reference and the current block are highly correlated. Conversely, an interpolation filter with a low cutoff frequency largely removes high frequency components that are less relevant to the current block.

An adaptive interpolation filter scheme is used in VP9, where an inter-coded block can choose from three 8-tap interpolation filters that correspond to different cutoff frequencies of a Hamming window in the frequency domain. The selected interpolation filter is applied in both the vertical and horizontal directions. AV1 inherits this interpolation filter selection design and extends it to support independent filter selection for the vertical and horizontal directions, respectively. This exploits the potential statistical discrepancy between the vertical and horizontal directions for improved prediction quality. Each direction can choose from 3 finite impulse response (FIR) filters, namely SMOOTH, REGULAR, and SHARP, in ascending order of cutoff frequency. A heat map of the correlations between the prediction and the source signals in the transform domain is shown in Figure 11, where the prediction and source block pairs are grouped according to their optimal 2-D interpolation filters. It is evident that the signal statistics differ in the vertical and horizontal directions, and an independent filter selection in each direction captures such discrepancy well.

To reduce the decoder complexity, the SMOOTH and REGULAR filters adopt a 6-tap FIR design, which appears to be sufficient for a smooth and flat baseband. The SHARP filter continues to use an 8-tap FIR design to mitigate the ripple effect near the cutoff frequency. The filter coefficients that correspond to half-pixel interpolation are

SMOOTH [−2, 14, 52, 52, 14,−2]

REGULAR [2,−14, 76, 76,−14, 2]

SHARP [−4, 12, −24, 80, 80, −24, 12, −4],

whose frequency responses are shown in Figure 12.

Fig. 11: A heat map of the correlations between the prediction and the source signals in the transform domain. The motion estimation here is in units of 16 × 16 blocks. The prediction and source blocks are grouped based on their optimal interpolation filters. The test clip is old_town_cross 480p. It is evident that the groups using the SHARP filter tend to have higher correlation in high frequency components along the corresponding direction.

Fig. 12: The frequency responses of the 3 interpolation filters at the half-pixel position.

To further reduce the worst-case complexity, i.e., when all coding blocks are 4 × 4 luma samples, there are two additional 4-tap filters that are used when the coding block has a dimension of 4 or less. The filter coefficients for half-pixel interpolation are

SMOOTH [12, 52, 52, 12]

REGULAR [−12, 76, 76,−12].

The SHARP filter option is not applicable here due to the short filter length.

2) Affine Model Parameters: Besides conventional translational motion compensation, AV1 also supports an affine transformation model that projects a current pixel at (x, y) to a prediction pixel at (x′, y′) in a reference frame through

$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}. \qquad (1)$$

The tuple (h13, h23) corresponds to a conventional motion vector as used in the translational model. Parameters h11 and h22 control the scaling factors in the vertical and horizontal axes, and in conjunction with the pair h12 and h21 determine the rotation angle.

A global affine model is associated with each reference frame, where each of the four non-translational parameters has 12-bit precision and the translational motion vector is coded with 15-bit precision. A coding block can choose to use it directly given its reference frame index. The global affine model captures frame level scaling and rotation, and hence primarily targets rigid motion over the entire frame. In addition, a local affine model at the coding block level is desirable to adaptively track the non-translational motion activities that vary across the frame. However, the overhead cost of sending the affine model parameters per coding block introduces additional side-information [22]. As a result, various research efforts focus on the estimation of the affine model parameters without extra overhead [23], [24]. A local affine parameter estimation scheme based on the regular translational motion vectors from spatially neighboring blocks has been developed for AV1.

The translational motion vector (h13, h23) in the local affine model is explicitly transmitted in the bit-stream. To estimate the other four parameters, the codec hypothesizes that the local scaling and rotation factors can be inferred from the pattern of the spatial neighbors' motion activities. It scans through a block's nearest neighbors, in the order of top neighbors, left neighbors, top-left neighbor, and top-right neighbor (if available), and finds the ones whose motion vectors point to the same reference frame. A maximum of 8 candidate reference blocks are allowed, and the scan terminates once that limit is reached. For each selected reference block, its center point is first offset by the center location of the current block to create an original sample position. The motion vector difference between the two blocks is then added to form the destination sample position after the affine transformation. A least squares regression is conducted over all the available original and destination sample position pairs to calculate the affine model parameters.

We use Figure 13 as an example to demonstrate the affine parameter estimation process. The nearest neighbor blocks are marked by their scan order. For block k, its center position is denoted by (x_k, y_k) and its motion vector by mv_k; the current block corresponds to k = 0. Assume in this case that blocks 1, 2, 5, and 7 share the same reference frame as the current block and are selected as the reference blocks. An original sample position is formed as

(a_k, b_k) = (x_k, y_k) − (x_0, y_0),   (2)

where k ∈ {1, 2, 5, 7}. The corresponding destination sample position is obtained by further adding the motion vector difference:

(a′_k, b′_k) = (a_k, b_k) + (mv_k.x, mv_k.y) − (mv_0.x, mv_0.y).   (3)

Fig. 13: An illustration of the local affine parameter estimation.

To formulate the least squares regression, we denote the sample data as

$$P = \begin{bmatrix} a_1 & b_1 \\ a_2 & b_2 \\ a_5 & b_5 \\ a_7 & b_7 \end{bmatrix}, \quad q = \begin{bmatrix} a'_1 \\ a'_2 \\ a'_5 \\ a'_7 \end{bmatrix}, \quad \text{and} \quad r = \begin{bmatrix} b'_1 \\ b'_2 \\ b'_5 \\ b'_7 \end{bmatrix}. \qquad (4)$$

The least squares regression gives the affine parameters in (1) as

$$\begin{bmatrix} h_{11} \\ h_{12} \end{bmatrix} = (P^T P)^{-1} P^T q, \quad \text{and} \quad \begin{bmatrix} h_{21} \\ h_{22} \end{bmatrix} = (P^T P)^{-1} P^T r. \qquad (5)$$

In practice, one needs to ensure that a spatial neighboring block is relevant to the current block. Hence we discard a reference block if any component of the motion vector difference is above 8 pixels in absolute value. Furthermore, if the number of available reference blocks is less than 2, the least squares regression problem is ill posed, and the local affine model is disabled.
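Collecting Eqs. (2)-(5) and the validity checks just described, a floating-point sketch of the local affine parameter estimation might look as follows; libaom's actual implementation is fixed point, and this is not its API.

```python
import numpy as np

def estimate_affine(neighbors, center0, mv0):
    """Least squares estimation of the non-translational affine
    parameters from spatial neighbors, per Eqs. (2)-(5).
    `neighbors` is a list of (center_xy, mv_xy) pairs for reference
    blocks sharing the current block's reference frame; `center0`,
    `mv0` describe the current block. Returns (h11, h12, h21, h22)
    or None when the problem is ill posed."""
    P, q, r = [], [], []
    for (xk, yk), (mvx, mvy) in neighbors:
        dmv = (mvx - mv0[0], mvy - mv0[1])
        if max(abs(dmv[0]), abs(dmv[1])) > 8:  # discard irrelevant blocks
            continue
        ak, bk = xk - center0[0], yk - center0[1]   # Eq. (2)
        P.append([ak, bk])
        q.append(ak + dmv[0])                       # Eq. (3), x component
        r.append(bk + dmv[1])                       # Eq. (3), y component
    if len(P) < 2:
        return None  # ill posed: local affine model disabled
    P = np.asarray(P, dtype=float)
    h11, h12 = np.linalg.lstsq(P, np.asarray(q, float), rcond=None)[0]
    h21, h22 = np.linalg.lstsq(P, np.asarray(r, float), rcond=None)[0]
    return h11, h12, h21, h22

# Example: neighbor motion consistent with a mild zoom around the block.
nbrs = [((-8, -8), (1.2, 1.2)), ((8, -8), (-1.2, 1.2)),
        ((-8, 8), (1.2, -1.2)), ((8, 8), (-1.2, -1.2))]
print(estimate_affine(nbrs, center0=(0, 0), mv0=(0, 0)))
```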

3) Affine Motion Compensation: With the affine model established, we next discuss techniques in AV1 for efficient prediction construction [25]. The affine model is allowed for block sizes of 8 × 8 and above. A prediction block is decomposed into 8 × 8 units. The center pixel of each 8 × 8 prediction unit is first determined by the translational motion vector (h13, h23), as shown in Figure 14. The rest of the pixels at position (x, y) in the green square in Figure 14 are scaled and rotated around the center pixel at (x1, y1) to form the affine projection (x′, y′) in the dashed line, following

$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} h_{11} & h_{12} \\ h_{21} & h_{22} \end{bmatrix} \begin{bmatrix} x - x_1 \\ y - y_1 \end{bmatrix} + \begin{bmatrix} x_1 \\ y_1 \end{bmatrix}. \qquad (6)$$

The affine projection allows 1/64 pixel precision. A set of 8-tap FIR filters (6-tap in certain corner cases) is designed to construct the sub-pixel interpolations. A conventional translational model has a uniform sub-pixel offset across the entire block, which allows one to effectively "reuse" most intermediate outcomes to reduce the overall computation. Typically, as introduced in Section IV-C1, to interpolate an 8 × 8 block, a horizontal filter is first applied to generate an intermediate 15 × 8 array from a 15 × 15 reference region. A second vertical filter is then applied to the intermediate 15 × 8 array to produce the final 8 × 8 prediction block. Hence a translational model requires (15 × 8) × 8 multiplications for the horizontal filter stage, and (8 × 8) × 8 multiplications for the vertical filter stage, 1472 multiplications in total.

Fig. 14: Building the affine prediction.

Unlike the translational model, it is reasonable to assume that each pixel in an affine model has a different sub-pixel offset, due to the rotation and scaling effects. Directly computing each pixel would require 64 × 8 × 8 = 4096 multiplications. Observe, however, that the rotation and scaling matrix in (6) can be decomposed into two shear matrices:

$$\begin{bmatrix} h_{11} & h_{12} \\ h_{21} & h_{22} \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ \gamma & 1 + \delta \end{bmatrix} \begin{bmatrix} 1 + \alpha & \beta \\ 0 & 1 \end{bmatrix}, \qquad (7)$$

where the first term on the right side corresponds to a vertical interpolation and the second term corresponds to a horizontal interpolation. This translates building an affine reference block into a two-stage interpolation operation. A 15 × 8 intermediate array is first obtained through horizontal filtering over a 15 × 15 reference region, where the horizontal offsets are computed as

horz_offset = (1 + α)(x − x_1) + β(y − y_1).   (8)

The intermediate array then undergoes vertical filtering to interpolate the vertical offsets,

vert_offset = γ(x − x_1) + (1 + δ)(y − y_1),   (9)

and generates the 8 × 8 prediction block. It thus requires a total of 1472 multiplications, the same as the translational case. However, it is noteworthy that the actual computational cost of the affine model is still higher, since the filter coefficients change at each pixel, whereas the translational model uses a uniform filter in the horizontal and vertical stages, respectively.

To improve the cache performance, AV1 requires the horizontal offset in (8) to be within 1 pixel of (x − x_1) and the vertical offset in (9) to be within 1 pixel of (y − y_1), which constrains the reference region within a 15 × 15 pixel array. Consider the first stage that generates a 15 × 8 intermediate pixel array. The displacements from its center are (x − x_1) ∈ [−4, 4) and (y − y_1) ∈ [−7, 8). Hence we have the constraint on the maximum horizontal offset as

max α(x − x_1) + β(y − y_1) = 4|α| + 7|β| < 1.   (10)

Similarly, (x − x_1) ∈ [−4, 4) and (y − y_1) ∈ [−4, 4) in the second stage, which leads to

4|γ| + 4|δ| < 1.   (11)

A valid affine model in AV1 needs to satisfy both conditions (10) and (11).
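A small sketch makes the decomposition and the validity checks concrete. Expanding Eq. (7) gives h11 = 1 + α, h12 = β, h21 = γ(1 + α), and h22 = γβ + 1 + δ, from which the shear parameters and the constraints (10)-(11) follow; floating point is used here for clarity, whereas the codec works in fixed point.

```python
def shear_decompose(h11, h12, h21, h22):
    """Decompose the 2x2 affine matrix into the two shear matrices of
    Eq. (7) and check the AV1 validity constraints (10) and (11)."""
    alpha = h11 - 1.0
    beta = h12
    gamma = h21 / h11
    delta = h22 - gamma * beta - 1.0
    valid = (4 * abs(alpha) + 7 * abs(beta) < 1.0 and   # Eq. (10)
             4 * abs(gamma) + 4 * abs(delta) < 1.0)     # Eq. (11)
    return (alpha, beta, gamma, delta), valid

# A 2% zoom with a slight rotation stays within the offset bounds.
params, ok = shear_decompose(1.02, 0.01, -0.01, 1.02)
print(params, ok)
```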

4) Compound Predictions: The motion compensated predictions from two reference frames (see supported reference frame pairs in Section II-A) can be linearly combined through various compound modes. The compound prediction is formulated by

P(x, y) = m(x, y) R_1(x, y) + (64 − m(x, y)) R_2(x, y),

where the weight m(x, y) is scaled by 64 for integer computation, and R_1(x, y) and R_2(x, y) represent the pixels at position (x, y) in the two reference blocks. P(x, y) is scaled down by 1/64 to form the final prediction.

Distance Weighted Predictor: Let d1 and d2 denote the temporal distances between the current frame and its two reference frames, respectively. The weight m(x, y) is determined by the relative values of d1 and d2. Assuming that d1 ≤ d2, the weight coefficient is defined by

$$m(x, y) = \begin{cases} 36, & d_2 < 1.5\, d_1 \\ 44, & d_2 < 2.5\, d_1 \\ 48, & d_2 < 3.5\, d_1 \\ 52, & \text{otherwise.} \end{cases} \qquad (12)$$

The distribution is symmetric for the case d1 ≥ d2.

Average Predictor: A special case of the distance weighted predictor, where the two references are equally weighted, i.e., m(x, y) = 32.

Difference Weighted Predictor: The weighting coefficient is computed per pixel based on the difference between the two reference pixels. A binary sign is sent per coding block to decide which reference block prevails when the pixel difference is above a certain threshold:

$$m(x, y) = \begin{cases} 38 + \frac{|R_1(x,y) - R_2(x,y)|}{16}, & \text{sign} = 0 \\ 64 - \left(38 + \frac{|R_1(x,y) - R_2(x,y)|}{16}\right), & \text{sign} = 1. \end{cases} \qquad (13)$$

Note that m(x, y) is further clamped to [0, 64].

Wedge Mode: A set of 16 coefficient arrays is preset for each eligible block size. They effectively split the coding block into two sections along various oblique angles. m(x, y) is mostly set to 64 in one section and 0 in the other, except near the transition edge, where there is a gradual change from 64 to 0, with 32 at the actual edge.
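The weight computations of the distance and difference weighted predictors, together with the common blending formula, can be sketched as follows (floating point for clarity; the codec uses integer arithmetic):

```python
import numpy as np

def distance_weight(d1, d2):
    """Per-block weight from Eq. (12); symmetric for d1 >= d2."""
    if d1 > d2:
        return 64 - distance_weight(d2, d1)
    for w, thresh in ((36, 1.5), (44, 2.5), (48, 3.5)):
        if d2 < thresh * d1:
            return w
    return 52

def difference_weight(r1, r2, sign=0):
    """Per-pixel weights from Eq. (13), clamped to [0, 64]."""
    m = 38 + np.abs(r1 - r2) / 16
    if sign:
        m = 64 - m
    return np.clip(m, 0, 64)

def compound_blend(r1, r2, m):
    """P = (m*R1 + (64-m)*R2) / 64, per the compound formula above."""
    return (m * r1 + (64 - m) * r2) / 64

r1 = np.full((4, 4), 120.0)
r2 = np.full((4, 4), 100.0)
print(compound_blend(r1, r2, distance_weight(2, 5)))      # distance mode
print(compound_blend(r1, r2, difference_weight(r1, r2)))  # difference mode
```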

Fig. 15: Illustration of the compound prediction modes. The distance weighted predictor uniformly combines the two reference blocks. The difference weighted predictor combines the pixels when their values are close, and picks one reference when the difference is large. The wedge predictor uses one of the preset masks to split the block into two sections, each filled with one reference block's pixels.

We use Figure 15 to demonstrate the compound options and their effects. The numerous compound modes add substantial encoding complexity in order to realize their potential coding gains. A particular hotspot lies in the motion estimation process, because each reference block is associated with its own motion vector. Simultaneously optimizing both motion vectors for a given compound mode makes the search space grow exponentially. To solve this problem, we modify the mechanism proposed in [26]. The libaom AV1 encoder conducts single reference frame motion estimation first over all the available reference frames for a coding block. If certain reference frames render substantially higher prediction error than others, all the compound modes that involve those frames will be ignored. To conduct the motion search for a compound mode, an iterative joint search is employed as follows:

1) Use the motion vectors provided by the single reference frame motion search as the initial points, denoted by mv1 and mv2.

2) Fix mv1 and update mv2 by

min_{mv2} d(B, P(mv1, mv2)),   (14)

where B refers to the original pixel block and P is the compound prediction associated with motion vectors mv1 and mv2. Since mv2 starts with a single reference motion search result, the search region in this step is limited to be within 4 full pixels.

3) Fix mv2 and update mv1 by

min_{mv1} d(B, P(mv1, mv2)).   (15)

4) Repeat steps 2 and 3 until either d(B, P(mv1, mv2)) stops decreasing, or up to 4 times.

Clearly this approach significantly reduces the number of motion vector search points for a compound mode.
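A sketch of this iterative joint search is given below. The prediction and distortion callables are caller-supplied stand-ins for the encoder's motion compensation and SAD/SSE code; the names are illustrative, not libaom APIs.

```python
import numpy as np

def joint_motion_search(block, pred_fn, sad, mv1, mv2,
                        max_iters=4, radius=4):
    """Iterative joint search per steps 1-4 above. `pred_fn(a, b)`
    returns the compound prediction for motion vectors (a, b), and
    `sad` is the distortion d(., .). Each refinement is a full-pixel
    local search within +/- `radius` pixels, per step 2."""
    def refine(fixed, moving, fix_first):
        best_mv, best_cost = moving, None
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                cand = (moving[0] + dx, moving[1] + dy)
                a, b = (fixed, cand) if fix_first else (cand, fixed)
                cost = sad(block, pred_fn(a, b))
                if best_cost is None or cost < best_cost:
                    best_mv, best_cost = cand, cost
        return best_mv, best_cost

    prev_cost = sad(block, pred_fn(mv1, mv2))
    for _ in range(max_iters):
        mv2, _ = refine(mv1, mv2, fix_first=True)      # step 2: fix mv1
        mv1, cost = refine(mv2, mv1, fix_first=False)  # step 3: fix mv2
        if cost >= prev_cost:  # step 4: stop when no longer improving
            break
        prev_cost = cost
    return mv1, mv2

# Toy usage with stand-in prediction/distortion functions.
blk = np.zeros((8, 8))
pred = lambda a, b: np.full((8, 8), 0.1 * (abs(a[0]) + abs(b[0])))
sad = lambda x, y: np.abs(x - y).sum()
print(joint_motion_search(blk, pred, sad, (3, 0), (-2, 1)))
```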

Other prediction modes supported by AV1 that blend multiple reference blocks include overlapped block motion compensation and a combined inter-intra prediction mode, both of which operate on a single reference frame and allow only one motion vector.

Overlapped Block Motion Compensation: The overlapped block motion compensation mode modifies the original design in [27] to account for variable block sizes [28]. It exploits the immediate spatial neighbors' motion information to improve the prediction quality for pixels near its top and left boundaries, where the true motion trajectory correlates with the motion vectors on both sides.

It first scans through the immediate neighbors above and finds up to 4 reference blocks that have the same reference frame as the current block. An example is shown in Figure 16(a), where the blocks are marked according to their scan order. The motion vector of each selected reference block is employed to generate a motion compensated block that extends from the top boundary towards the center of the current block. Its width is the same as the reference block's width and its height is half of the current block's height, as shown in Figure 16(a). An intermediate blending result is formed as

P_int(x, y) = m(x, y) R_1(x, y) + (64 − m(x, y)) R_above(x, y),   (16)

where R_1(x, y) is the original motion compensated pixel at position (x, y) using the current block's motion vector mv0, and R_above(x, y) is the pixel from the overlapped reference block. The weight m(x, y) follows a raised cosine function:

m(x, y) = 64 ((1/2) sin(π(y + 1/2)/H) + 1/2),   (17)

where y = 0, 1, ..., H/2 − 1 is the row index and H is the current block height. The weight distribution for H = 16 is shown in Figure 17.

The scheme next processes the immediate left neighbors to extract the available motion vectors and build overlapped reference blocks extending from the left boundary towards the center, as shown in Figure 16(b). The final prediction is calculated by

P(x, y) = m(x, y) P_int(x, y) + (64 − m(x, y)) R_left(x, y),   (18)

where R_left(x, y) is the pixel from the left-side overlapped reference block. The weight m(x, y) is a raised cosine function of the column index x:

m(x, y) = 64 ((1/2) sin(π(x + 1/2)/W) + 1/2),   (19)

where x = 0, 1, ..., W/2 − 1 and W is the current block width.
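The raised cosine weights of Eqs. (17) and (19) are easy to reproduce; the sketch below computes the per-row (or per-column) weights for the blended half of a block.

```python
import numpy as np

def obmc_weights(H):
    """Raised cosine OBMC weights from Eqs. (17)/(19): one weight per
    row (or column) index over the blended half of the block."""
    idx = np.arange(H // 2)
    return 64 * (0.5 * np.sin(np.pi / H * (idx + 0.5)) + 0.5)

# For H = 16, the weight on the current block's own prediction rises
# from ~35 at the block edge toward ~64 at the block center, matching
# the distribution shown in Figure 17.
print(obmc_weights(16).round(1))
```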

Compound Inter-Intra Predictor: This mode combines an intra prediction block and a translational inter prediction block. The intra prediction is selected among the DC, vertical, horizontal, and smooth modes (see Section IV-B2). The combination can be achieved through either a wedge mask similar to the compound inter case above, or a preset coefficient set that gradually reduces the intra prediction weight along its prediction direction. Examples of the preset coefficients for each intra mode are shown in Figure 18.

Fig. 16: Overlapped block motion compensation using top and left neighboring blocks' motion information, shown in (a) and (b) respectively.

Fig. 17: Normalized weights for OBMC with H = 16 or W = 16.

As discussed above, AV1 supports a large variety of compound prediction tools. Exercising each mode in the rate-distortion optimization framework fully realizes their potential, at the cost of a significant complexity load for the encoder. The libaom encoder provides various options to trade quality performance for encoder speed, for example, bypassing a compound mode if one of the reference frames it uses has a substantially higher prediction error than the others. However, the efficient selection of the appropriate compound coding modes without extensive rate-distortion optimization searches remains a challenge.


Fig. 18: Normalized weight masks of compound inter-intra prediction for 8 × 8 blocks.

Fig. 19: Spatial reference motion vector search pattern. The index ahead of each operation represents the processing order. TL stands for the top-left 8 × 8 block. TR stands for the top-right 8 × 8 block.

D. Dynamic Motion Vector Referencing Scheme

Motion vector coding accounts for a sizable portion of the overall bit-rate. Modern video codecs typically adopt predictive coding for motion vectors and code the difference using entropy coding [29], [30]. The prediction accuracy has a large impact on the coding efficiency. AV1 employs a dynamic motion vector referencing scheme that obtains candidate motion vectors from spatial and temporal neighbors and ranks them for efficient entropy coding.

1) Spatial Motion Vector Reference: A coding block searches its spatial neighbors in units of 8 × 8 luma samples to find the ones that have the same reference frame index as the current block. For compound inter prediction modes, this means the same reference frame pair. The search region contains three 8 × 8 block rows above the current block and three 8 × 8 block columns to the left. The process is shown in Figure 19, where the search order is given by the index: it starts from the nearest row and column, and interleaves the outer rows and columns. The top-right 8 × 8 block is included if available. The first 8 distinct motion vectors encountered are recorded, along with a frequency count and whether they appear in the nearest row or column. They are then ranked as discussed in Section IV-D4.
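A loose sketch of this candidate collection is shown below. The traversal only approximates the interleaved order of Figure 19, omits the top-right block, and uses illustrative data structures; it is not the normative scan.

```python
def collect_spatial_mv_refs(get_unit, rows, cols):
    """Sketch of spatial candidate collection. `get_unit(r, c)` returns
    (mv, same_ref) for an 8x8 unit at a row/col offset from the current
    block, where same_ref says whether the unit uses the current
    block's reference frame, or None outside the frame."""
    candidates = {}  # mv -> [count, seen_in_nearest_row_or_col]
    # Nearest row and column first, then the outer ones interleaved.
    scan = [("row", -1), ("col", -1), ("row", -2), ("col", -2),
            ("row", -3), ("col", -3)]
    for kind, offset in scan:
        units = ([(offset, c) for c in range(cols)] if kind == "row"
                 else [(r, offset) for r in range(rows)])
        for r, c in units:
            info = get_unit(r, c)
            if info is None:
                continue
            mv, same_ref = info
            if not same_ref:
                continue
            if mv not in candidates and len(candidates) == 8:
                continue  # record at most 8 distinct motion vectors
            entry = candidates.setdefault(mv, [0, False])
            entry[0] += 1
            entry[1] |= offset == -1
    return candidates

# Toy frame where every neighbor shares the reference and uses mv (1, 0):
refs = collect_spatial_mv_refs(lambda r, c: ((1, 0), True), rows=4, cols=4)
print(refs)  # {(1, 0): [24, True]}
```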

Note that the minimum coding block size in AV1 is 4 × 4. Hence an 8 × 8 unit has up to 4 different motion vectors and reference frame indexes to search through. This would require a hardware decoder to store all the motion information at 4 × 4 unit precision for the three 8 × 8 block rows above.


Fig. 20: The line buffer, shown in the orange color, stores the coding information associated with an entire row of a frame. The dashed line shows superblocks. The information in the line buffer will be used as above context by later coding blocks across superblock boundaries. The line buffer is updated as new coding blocks (in blue color) are processed. In contrast, the green color shows coding information to be used as left context by later blocks, the length of which corresponds to the size of a superblock.

Hardware decoders typically use a line buffer concept, which is a dedicated buffer in static random access memory (SRAM), a fast and expensive unit. The line buffer maintains coding information corresponding to an entire row of a frame, which will be used as context information for later coding blocks. An example of the line buffer concept is shown in Figure 20. The line buffer size is designed for the worst case, which corresponds to the maximum frame width allowed by the specification. To make the line buffer size economically feasible, AV1 adopts a design that only accesses 4 × 4 block motion information in the immediate above row (the green region in Figure 19). For the rest of the rows, the codec only uses the motion information in 8 × 8 units. If an 8 × 8 block is coded using 4 × 4 blocks, the bottom-right 4 × 4 block's information will be used to represent the entire 8 × 8 block, as shown in Figure 19. This halves the amount of space needed for motion data in the line buffer.

The storage of the coding context to the left, on the other hand, depends on the superblock size and is agnostic to the frame size. It has far less impact on the SRAM space. However, we keep its design symmetric to the above context to avoid the motion vector ranking system described in Section IV-D4 favoring either side.

2) Motion Field Motion Vector Reference: Common practice extracts the temporal motion vector by referring to the collocated blocks in the reference frames [30]. Its efficacy, however, is largely limited to capturing motion trajectories at low velocities. As illustrated in Figure 21, when the motion velocity is high, the collocated block might be irrelevant to the current block. To reliably track the motion trajectory for efficient motion vector prediction, AV1 uses a motion field approach [31].



Fig. 21: An illustration of the temporal motion vector referencing under different motion velocities.

A motion field is created for each reference frame ahead of processing the current frame. First we build motion trajectories between the current frame and the previously coded frames by exploiting motion vectors from previously coded frames through either linear interpolation or extrapolation. The motion trajectories are associated with 8 × 8 blocks in the current frame. Next, the motion field between the current frame and a given reference frame can be formed by extending the motion trajectories from the current frame towards the reference frame.

Interpolation: The motion vector pointing from a reference frame to a prior frame crosses the current frame. An example is shown in Figure 22. The frames are drawn in display order. The motion vector ref_mv at block (ref_blk_row, ref_blk_col) in the reference frame (shown in orange) goes through the current frame. The distance that ref_mv spans is denoted by d1. The distance between the current frame and the reference frame where ref_mv originates is denoted by d3. The intersection is located at block position:

blk_row = ref_blk_row + ref_mv.row · d3/d1,   (20)
blk_col = ref_blk_col + ref_mv.col · d3/d1.   (21)

The motion field motion vector that extends from block (blk_row, blk_col) in the current frame towards a reference frame along the motion trajectory, e.g., mf_mv in blue color, is calculated as

mf_mv.row = −ref_mv.row · d2/d1,   (22)
mf_mv.col = −ref_mv.col · d2/d1,   (23)

where d2 is the distance between the current frame and the target reference frame that the motion field is built for.

Fig. 22: Building motion trajectory through motion vector interpolation.

Extrapolation: The motion vector from a reference frame does not cross the current frame. An example is shown in Figure 23. The motion vector ref_mv (in orange) points from reference frame 1 to a prior frame. It is extended towards the current frame, and they meet at block position:

blk_row = ref_blk_row − ref_mv.row · d3/d1,   (24)
blk_col = ref_blk_col − ref_mv.col · d3/d1.   (25)

Fig. 23: Building motion trajectory through motion vector extrapolation.

Its motion field motion vector towards reference frame 2, mf_mv (in blue), is given by

mf_mv.row = −ref_mv.row · d2/d1,   (26)
mf_mv.col = −ref_mv.col · d2/d1,   (27)

where d2 is the distance between the current frame and reference frame 2 in Figure 23. Note that the signs in both (23) and (27) depend on whether the two reference frames are on the same side of the current frame.

Typically, interpolation provides better estimation accuracy than extrapolation. Therefore, when a block has possible motion trajectories originating from both, the extrapolated one will be discarded. A coding block uses the motion field of all its 8 × 8 sub-blocks as its temporal motion vector reference.

3) Hardware Constraints: The motion information, including the motion vector and the reference frame index, needs to be stored for later frames to build their motion fields. To reduce the memory footprint, the motion information is stored in units of 8 × 8 blocks. If a coding block uses compound modes, only the first motion vector is saved. In hardware decoders, the reference frame motion information is commonly stored in dynamic random access memory (DRAM), a relatively cheap and slow unit compared to SRAM. It needs, however, to be transferred to SRAM for computing purposes. The bus between DRAM and SRAM is typically 32 bits wide. To facilitate efficient data transfer, a number of data format constraints are employed. We limit the codec to use motion information from up to 4 reference frames (out of the 7 available frames) to build the motion field. Therefore only 2 bits are needed for the reference frame index. Furthermore, a motion vector with any component magnitude above 2^12 will be discarded. As a result, the motion vector and reference frame index together can be represented by a 32-bit unit.

As mentioned in Section IV-A1, hardware decoders process frames in 64 × 64 block units, which makes the hardware cost invariant to the frame size. In contrast, the above motion field construction can potentially involve any motion vector in the reference frame to build the motion field for a 64 × 64 block, which makes the hardware cost grow as the frame resolution scales up.

To solve this problem, we constrain the maximum displacement between (ref_blk_row, ref_blk_col) and (blk_row, blk_col) during the motion vector projection. Let (base_row, base_col) denote the top-left block position of the 64 × 64 block that contains (ref_blk_row, ref_blk_col):

base_row = (ref_blk_row >> 3) << 3,   (28)
base_col = (ref_blk_col >> 3) << 3.   (29)

The maximum displacement constraints are:

blk_row ∈ [base_row, base_row + 8),   (30)
blk_col ∈ [base_col − 8, base_col + 16).   (31)

Note that all the indexes here are in 8 × 8 luma sample block units. Any projection in (21) or (25) that goes beyond this limit will be discarded. This design localizes the region in the reference frame used to produce the motion field for a 64 × 64 pixel block to a 64 × (64 + 2 × 64) block, as shown in Figure 24. It allows the codec to load the necessary reference motion vectors per 64 × 64 block from DRAM to SRAM, and to process the linear projection ahead of decoding each 64 × 64 block. Note that we allow the width value to be larger than the height, since the shaded portion of the reference motion vector array can be readily re-used for decoding the next 64 × 64 block.

4) Dynamic Motion Vector Reference List: Having established the spatial and temporal reference motion vectors, we next discuss the scheme that uses them for efficient motion vector coding.

Fig. 24: The constrained projection localizes the referencing region needed to produce the motion field for a 64 × 64 block. The collocated block in the reference frame is at the same location as the processing block in the current frame. The blue region is the extended block whose motion vectors are used to estimate the motion field for the current 64 × 64 block.

The spatial and temporal reference motion vectors are classified into two categories based on where they appear: the nearest spatial neighbors and the rest. Statistically, the motion vectors from the immediate above, left, and top-right blocks tend to have higher correlation with the current block than the rest, and hence are considered with higher priority. Within each category, the motion vectors are ranked in descending order of their appearance counts within the spatial and temporal search range. A motion vector candidate with a higher appearance count is considered more "popular" in the local region, i.e., to have a higher prior probability. The two categories are concatenated to form a ranked list.
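A compact way to view the ranking is as two buckets sorted by appearance count and then concatenated. The sketch below assumes the motion vectors are given as hashable (row, col) tuples already grouped by category; tie breaking and the codec's exact bookkeeping are omitted.

```python
from collections import Counter

def rank_mv_candidates(nearest_mvs, outer_mvs, max_candidates=4):
    """Two-category ranking: candidates from the immediate row/column
    neighbors are prioritized over the rest, and each category is
    sorted by its appearance count."""
    ranked = []
    for group in (nearest_mvs, outer_mvs):
        for mv, _count in Counter(group).most_common():
            if mv not in ranked:          # keep first (higher) placement
                ranked.append(mv)
    return ranked[:max_candidates]
```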

The first 4 motion vectors in this ranked list are used as candidate motion vector predictors. The encoder picks the one closest to the desired motion vector and sends its index to the decoder. It is not uncommon for coding blocks to have fewer than 4 candidate motion vectors, due to either the high flexibility in the reference frame selection or highly consistent motion activity in the local region. In such cases, the candidate motion vector list will be shorter than 4, which allows the codec to save bits spent on identifying the selected index. The dynamic candidate motion vector list is in contrast to the design in VP9, where one always constructs 2 candidate motion vectors; if not enough candidates are found, the VP9 codec fills the list with zero vectors. AV1 also supports a special inter mode that makes the inter predictor use the frame level affine model, as discussed in Section IV-C2.

The motion vector difference is entropy coded. Since a significant portion of the coding blocks find a zero motion vector difference, the probability model is designed to account for such bias. AV1 allows a coding block to use 1 bit to indicate whether to directly use the selected motion vector predictor as its final motion vector, or to additionally code the difference. The probability model for this entropy coded bit is conditioned on two factors: whether its spatial neighbors have a non-zero motion vector difference and whether a sufficient number of motion vector predictors are found. For compound modes, where two motion vectors need to be specified, this extends to 4 cases that cover whether either motion vector, both, or neither has a zero motion vector difference. The non-zero difference motion vector coding is consistent in all cases.



Fig. 25: The transform block partition for square and rectangular inter blocks. R denotes the recursive partition point. Each coding block allows a maximum 2-level recursive partition.

E. Transform Coding

Transform coding is applied to the prediction residual to remove the potential spatial correlations. VP9 uses a univariate transform block size design, where all the transform blocks within a coding block share the same transform size. Four square transform sizes are supported by VP9: 4 × 4, 8 × 8, 16 × 16, and 32 × 32. A set of separable 2-D transform types, constructed by combinations of 1-D discrete cosine transform (DCT) and asymmetric discrete sine transform (ADST) kernels [32], [33], are selected based on the prediction mode. AV1 inherits the transform coding scheme in VP9 and extends its flexibility in terms of both the transform block sizes and the kernels.

1) Transform Block Size: AV1 extends the maximum transform block size to 64 × 64. The minimum transform block size remains 4 × 4. In addition, rectangular transform block sizes at N × N/2, N/2 × N, N × N/4, and N/4 × N are supported to complement the rectangular coding block sizes in Section IV-A.

A recursive transform block partition approach is adopted in AV1 for all the inter coded blocks to capture localized stationary regions for transform coding efficiency. The initial transform block size matches the coding block size, unless the coding block size is above 64 × 64, in which case the 64 × 64 transform block size is used. For the luma component, up to 2 levels of transform block partitioning are allowed. The recursive partition rules for N × N, N × N/2, and N × N/4 coding blocks are shown in Figure 25.

The intra coded block inherits the univariate transform block size approach. Similar to the inter block case, the maximum transform block size matches the coding block size, and can go up to 2 levels down for the luma component. The available options for square and rectangular coding block sizes are shown in Figure 26.

The chroma components tend to have much less variation in their statistics. Therefore the transform block is set to use the largest available size.

Fig. 26: The transform block size options for square and rectangular intra blocks.

2) Transform Kernels: Unlike VP9, where each coding block has only one transform type, AV1 allows each transform block to choose its own transform kernel independently. The 2-D separable transform kernels are extended to combinations of four 1-D kernels: DCT, ADST, flipped ADST (FLIPADST), and identity transform (IDTX), resulting in a total of 16 2-D transform kernels. The FLIPADST is a reverse of the ADST kernel. The kernels are selected based on statistics and to accommodate various boundary conditions. The DCT kernel is widely used in signal compression and is known to approximate the optimal linear transform, the Karhunen-Loeve transform (KLT), for consistently correlated data. The ADST, on the other hand, approximates the KLT where one-sided smoothness is assumed, and is therefore naturally suitable for coding some intra prediction residuals. Similarly, the FLIPADST captures one-sided smoothness from the opposite end. The IDTX is further included to accommodate situations where sharp transitions are contained in the block and neither DCT nor ADST is effective. Also, the IDTX, combined with other 1-D transforms, provides the 1-D transforms themselves, thereby allowing for better compression of horizontal and vertical patterns in the residual [34]. The waveforms corresponding to the four 1-D transform kernels are presented in Figure 27 for dimension N = 8.
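The separable construction means a 2-D kernel is simply a 1-D transform of the columns followed by a 1-D transform of the rows; pairing a sinusoidal kernel with IDTX then yields a pure 1-D transform. A small numpy sketch under these assumptions (orthonormal DCT-II, identity for IDTX; FLIPADST would reverse the ADST basis):

```python
import numpy as np

def dct_1d_matrix(n):
    """Orthonormal DCT-II basis: row k samples cos((2t + 1) k pi / (2n))."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    m = np.cos((2 * t + 1) * k * np.pi / (2 * n))
    m[0, :] *= np.sqrt(1.0 / n)
    m[1:, :] *= np.sqrt(2.0 / n)
    return m

def separable_2d_transform(residual, col_kernel, row_kernel):
    """Apply one 1-D kernel to the columns and another to the rows.
    With the identity (IDTX) as one kernel, this degenerates to a pure
    1-D transform, enabling the horizontal/vertical modes in the text."""
    return col_kernel @ residual @ row_kernel.T

res = np.random.randn(8, 8)
dct, idtx = dct_1d_matrix(8), np.eye(8)
coeffs = separable_2d_transform(res, dct, idtx)  # 1-D vertical DCT only
```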

Even with modern single instruction multiple data (SIMD) architectures, the inverse transform accounts for a significant portion of the decoder computational cost. The butterfly structure [35] allows a substantial reduction in multiplication operations over plain matrix multiplication, i.e., a reduction from O(N^2) to O(N log N), where N is the transform dimension. Hence it is highly desirable for large transform block sizes. Note that since the original ADST derived in [33] cannot be decomposed for the butterfly structure, a variant of it, as introduced in [36] and also shown in Figure 27, is adopted by AV1 for transform block sizes of 8 × 8 and above.

When the transform block size is large, the boundary effects are less pronounced, in which setting the transform coding gains of all sinusoidal transforms largely converge [33]. Therefore only the DCT and IDTX are employed for transform blocks at dimension 32 × 32 and above.

Fig. 27: Transform kernels of DCT, ADST, FLIPADST, and IDTX for dimension N = 8. The discrete basis values are displayed as red circles, with blue lines indicating the associated sinusoidal function. The bases of DCT and ADST (a variant with a fast butterfly structured implementation) take the form of cos((2n + 1)kπ / (2N)) and sin((2n + 1)(2k + 1)π / (4N)), respectively, where n and k denote the time index and the frequency index, taking values from {0, 1, ..., N − 1}. FLIPADST utilizes the reversed ADST bases, and IDTX denotes the identity transformation.

3) Encoder Optimization: The extension to transform block partitioning and the additional kernels introduce additional search routes and comparisons at the encoder. The libaom encoder leverages the trade-off between encoder complexity and compression efficiency, and provides various related tools for applications with different practical constraints. For example, the encoder can choose to separate the search of transform block size and transform type, wherein a fixed transform type is used in the partition search, followed by refinement of the transform type after the block size determination.

Moreover, information theory based methods, as well as machine learning based models, have been developed (e.g., early pruning of the partition process, precluding certain transform kernels for the block, etc.) to provide better speed optimization. One such example in the libaom encoder focuses on the selection of transform kernels without the need for the costly entropy coding process. The prediction residuals are first transformed with a candidate transform kernel, followed by simple quantization of the transform coefficients. Instead of calculating the rate cost associated with the kernel using the methods presented in Section V-C, the encoder estimates the rate of the transform coefficient x_k at location k based on the assumption that it follows the Laplace distribution:

f_k(x_k) = (1 / (2 b_k)) exp(−|x_k| / b_k),   (32)

where f_k is the probability density function and b_k denotes the Laplace distribution parameter for transform coefficient location k.

Under the assumption of high definition quantization, using a uniform quantizer with quantization step size Δ, the probability associated with quantization level l_k, P_k(l_k), translates to:

P_k(l_k) = f_k(l_k Δ) Δ.   (33)

The arithmetic coding algorithm asymptotically needs −log2(P) bits for a symbol with probability P. Therefore, for l_k ≠ 0, the associated rate is r_k = −log2(2 P_k(|l_k|)) + 1 = −log2(P_k(|l_k|)), where the factor 2 relates to the two cases l_k > 0 and l_k < 0, and the added 1 bit is used to signal the sign. Similarly, for l_k = 0, since no sign bit is needed, r_k = −log2(P_k(0)). With (32) and (33), it can be shown that:

r_k = log2(2 b_k) + log2(e) |l_k| Δ / b_k − log2(Δ),   (34)

where e denotes the natural logarithm base.

Note that, as also shown in Section V-C, each transform coefficient is coded conditioned on its neighboring coefficients. Therefore, the Laplace distribution parameter b_k should be estimated adaptively for each block. In order to remove the dependency on the other coefficients to achieve acceleration, one could use b_k = |l_k| Δ as an estimate for b_k when l_k ≠ 0, resulting in

r_k = log2(|l_k|) + log2(e),   (35)

which depends only on the quantization level itself. In the libaom encoder, the result in (35) is used with a small bias to account for the potential discrepancy between the neighboring coefficients and the current coefficient. Moreover, for l_k = 0, a constant b_k is used. The estimated rate-distortion (R-D) cost of the transform coefficients is then J = D + λ Σ_k r_k, where D is the sum of quantization errors calculated in the transform domain, and λ is the R-D optimization parameter.

Fig. 28: The quantization parameter and quantization step size maps for DC and AC coefficients.

The estimated R-D costs of each candidate transform kernel are compared to provide a much narrower collection of candidates, whose accurate R-D costs are then calculated to find the final winner.
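A sketch of this pruning flow follows, with a placeholder forward-transform callable and illustrative bias constants (the actual tuned constants in libaom differ):

```python
import numpy as np

def fast_tx_rd_cost(residual, fwd_tx, delta, lam, bias=0.5, zero_rate=0.5):
    """Estimate J = D + lambda * sum(r_k) for one candidate kernel using
    the closed-form rate of (35). fwd_tx applies the candidate 2-D
    transform; bias and zero_rate stand in for tuned constants."""
    coeffs = fwd_tx(residual)
    levels = np.round(coeffs / delta)
    dist = np.sum((coeffs - levels * delta) ** 2)      # transform-domain SSE
    nz = np.abs(levels[levels != 0])
    rate = np.sum(np.log2(nz) + np.log2(np.e) + bias)  # per (35), plus bias
    rate += zero_rate * np.sum(levels == 0)            # constant b_k for zeros
    return dist + lam * rate
```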

F. Quantization

The transform coefficients are quantized and the quantization indexes are entropy coded. The quantization parameter (QP) in AV1 ranges between 0 and 255. At a given QP, the quantization step size for the DC coefficient is smaller than that for the AC coefficients. The mapping from QP to quantization step size for both DC and AC coefficients is drawn in Figure 28. The lossless coding mode is achieved when QP is 0.

AV1 assigns a base QP for a coding frame, denoted by QP_base. The QP values for the DC and AC coefficients in both luma and chroma components are shown in Table I. The ΔQP_{p,b} are offset values transmitted in the frame header, where p ∈ {Y, U, V} denotes the plane and b ∈ {DC, AC} denotes the DC or AC transform coefficients.

Recognizing that the coding blocks within a frame may have different rate-distortion trade-offs, AV1 further allows QP offsets at both the superblock and coding block levels. The resolution of the superblock level QP offset is decided by the frame header; the available options are 1, 2, 4, and 8. The coding block level QP offset can be achieved through segmentation. AV1 allows a frame to classify its coding blocks into up to 8 segments, each with its own QP offset decided by the frame header. A coding block decides and sends its segment index to the decoder.

Therefore, the effective QP for AC coefficients in a coding block, QP_cb, is given by

QP_cb = clip(QP_frame + ΔQP_sb + ΔQP_seg, 1, 255),   (36)

TABLE I: Frame level QP values (QP_frame) for Y/U/V planes.

Plane   AC                      DC
Y       QP_base                 QP_base + ΔQP_{Y,DC}
U       QP_base + ΔQP_{U,AC}    QP_base + ΔQP_{U,DC}
V       QP_base + ΔQP_{V,AC}    QP_base + ΔQP_{V,DC}

where ΔQP_sb and ΔQP_seg are the QP offsets from the superblock and the segment, respectively. The clip function ensures the value stays within a valid range. The QP is not allowed to change from a non-zero value to zero, since zero is reserved for lossless coding.

The decoder rebuilds the quantized samples using a uniform quantizer. Given the quantization step size Δ and the quantization index k, the reconstructed sample is kΔ.

V. ENTROPY CODING SYSTEM

AV1 employs an M-ary symbol arithmetic coding method to compress the syntax elements, where the integer M ∈ [2, 14]. The probability model is updated per symbol coded.

A. Probability Model

Consider an M-ary random variable whose probability mass function (PMF) at time stamp n is defined as

P_n = [p_1(n), p_2(n), ..., p_M(n)]^T,   (37)

and the cumulative distribution function (CDF) given by

C_n = [c_1(n), c_2(n), ..., c_{M−1}(n), 1]^T,   (38)

where c_k(n) = Σ_{i=1}^{k} p_i(n). When the symbol is coded, a new outcome k ∈ {1, 2, ..., M} is observed. The probability model is then updated as

P_n = P_{n−1} (1 − α) + α e_k,   (39)

where e_k is an indicator vector whose k-th element is 1 and the rest are 0, and α is the update rate. At the element level, we have

p_m(n) = p_m(n − 1) · (1 − α) + α,   if m = k;
p_m(n) = p_m(n − 1) · (1 − α),       otherwise.   (40)

To update the CDF, we first consider c_m(n) where m < k:

c_m(n) = Σ_{i=1}^{m} p_i(n) = Σ_{i=1}^{m} p_i(n − 1) · (1 − α) = c_m(n − 1) · (1 − α).


For the m ≥ k cases, we have

1 − c_m(n) = Σ_{i=m+1}^{M} p_i(n) = Σ_{i=m+1}^{M} p_i(n − 1) · (1 − α) = (1 − c_m(n − 1)) · (1 − α),

where the second equation follows (40) and m + 1 > k. Rearranging the terms, we have

c_m(n) = c_m(n − 1) + α · (1 − c_m(n − 1)).   (41)

In summary, the CDF is updated as

c_m(n) = c_m(n − 1) · (1 − α),                  if m < k;
c_m(n) = c_m(n − 1) + α · (1 − c_m(n − 1)),     if m ≥ k.   (42)

AV1 stores the M-ary symbol probabilities in the form of CDFs. The elements in (38) are scaled by 2^15 for integer precision. The arithmetic coding directly uses the CDFs to compress symbols [37].

The probability update rate associated with a symbol adapts based on the count of this symbol's appearance within a frame:

α = 1 / 2^{3 + (count > 15) + (count > 32) + min(log2(M), 2)},   (43)

which allows a higher adaptation rate at the beginning of each frame. The probability models are inherited from one of the reference frames, whose index is signaled in the bit-stream.
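The update in (42) together with the rate in (43) can be sketched in a few lines. The floating-point form below is for clarity only; libaom keeps the CDFs as 15-bit integers.

```python
import math

def update_cdf(cdf, k, count, M):
    """Update the CDF entries c_1..c_{M-1} per (42), with the adaptive
    rate of (43). cdf[m-1] holds c_m; k is the observed outcome
    (1-based); c_M = 1 is implicit and never stored."""
    rate = 3 + (count > 15) + (count > 32) + min(int(math.log2(M)), 2)
    alpha = 1.0 / (1 << rate)
    for m in range(1, M):
        if m < k:
            cdf[m - 1] *= (1.0 - alpha)
        else:
            cdf[m - 1] += alpha * (1.0 - cdf[m - 1])
    return cdf
```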

B. Arithmetic Coding

The M-ary symbol arithmetic coding largely follows [37], with all the floating-point data scaled by 2^15 and represented by 15-bit unsigned integers. We re-iterate the decoding process using integer representations here and discuss our design modifications that improve the throughput capacity of hardware decoders. Let R denote the arithmetic coder's current interval length, and Value denote the code string value. The original decoding process is depicted in Algorithm 1.

Algorithm 1 The original arithmetic decoder operations.

low ← R
for k = 1; Value < low; k = k + 1 do
    up ← low
    f ← 2^15 − c_k
    low ← (R × f) >> 15
end for
R ← up − low
Value ← Value − low

Note that the R × f term in the for-loop is a product of two 15-bit integers and requires 29 bits. To improve hardware throughput, it is desirable to limit this to 16 bits; however, reducing the CDF model precision would lead to less accurate probability model estimation and hurt the compression performance. Hence AV1 adopts a dual model approach, where the probability model CDF is updated and maintained at 15-bit precision, but when it is used for entropy coding, only the most significant 9 bits are fed into the arithmetic coder, as shown in Figure 29. In addition, the interval length R is scaled down by 1/256 prior to the multiplication. The modified decoding process is shown in Algorithm 2, where the product (R >> 8) × f fits into 16 bits.

Fig. 29: The probability model is updated and maintained at 15-bit precision, whilst only the most significant 9 bits are used by the arithmetic coder.


Algorithm 2 The modified arithmetic decoder operations.

low ← R
for k = 1; Value < low; k = k + 1 do
    up ← low
    f ← 2^9 − (c_k >> 6)
    low ← ((R >> 8) × f) >> 1
end for
R ← up − low
Value ← Value − low

C. Level Map Transform Coefficient Coding System

The transform coefficient entropy coding system is an intricate and performance critical component in video codecs. We discuss the AV1 design, which decomposes the coding into a series of symbol codings.

1) Scan Order: A 2-D quantized transform coefficient matrix is first mapped into a 1-D array for sequential processing. The scan order depends on the transform kernel (see Section IV-E2). A column scan is used for the 1-D vertical transform and a row scan is used for the 1-D horizontal transform. In both settings, we consider that the use of a 1-D transform indicates strong correlation along the selected direction and weak correlation along the perpendicular direction. A zig-zag scan is used for both the 2-D transforms and the identity matrix (IDTX), as shown in Figure 30.

Fig. 30: The scan order is decided by the transform kernel. An example is drawn for 4 × 4 transform blocks. The index represents the scan order. Left: zig-zag scan for 2-D transform blocks. Middle: column scan for the 1-D vertical transform. Right: row scan for the 1-D horizontal transform.

2) Symbols and Contexts: The index of the last non-zero coefficient in the scan order is coded first. The coefficients are then processed in reverse scan order. The range of a quantized transform coefficient is [−2^15, 2^15). In practice, the majority of quantized transform coefficients are concentrated close to the origin. Hence AV1 decomposes a quantized transform coefficient into 4 symbols:

• Sign bit: When it is 1, the transform coefficient is negative; otherwise it is positive.
• Base range (BR): The symbol contains 4 possible outcomes {0, 1, 2, >2}, which are the absolute values of the quantized transform coefficient. An exception is the last non-zero coefficient, for which BR ∈ {1, 2, >2}, since 0 has been ruled out.
• Low range (LR): It contains 4 possible outcomes {0, 1, 2, >2} that correspond to the residual value over the previous symbols' upper limit.
• High range (HR): The symbol has a range of [0, 2^15) and corresponds to the residual value over the previous symbols' upper limit.

To code a quantized transform coefficient V, one first processes its absolute value. As shown in Figure 31, if |V| ∈ [0, 2], the BR symbol is sufficient to signal it and the coding of |V| is terminated. Otherwise the outcome of the BR symbol will be ">2", in which case an LR symbol is used to signal |V|. If |V| ∈ [3, 5], this LR symbol will be able to cover its value and complete the coding. If not, a second LR is used to further code |V|. This is repeated up to 4 times, which effectively covers the range [3, 14]. If |V| > 14, an additional HR symbol is coded to signal (|V| − 14).
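The decomposition of Figure 31 can be sketched as follows; the last-coefficient exception, context modeling, and the specification's exact HR offset convention are simplified here.

```python
def coeff_symbols(v):
    """Decompose one quantized coefficient v into BR/LR/HR/sign symbols,
    following the scheme of Figure 31."""
    syms, a = [], abs(v)
    syms.append(("BR", min(a, 3)))       # {0,1,2,>2}; 3 denotes ">2"
    if a > 2:
        rem = a - 3
        for _ in range(4):               # up to 4 LR symbols cover [3, 14]
            lr = min(rem, 3)             # {0,1,2,>2}
            syms.append(("LR", lr))
            if lr < 3:
                break
            rem -= 3
        if a > 14:
            syms.append(("HR", a - 14))  # residual, coded with Exp-Golomb
    if a:
        syms.append(("SIGN", int(v < 0)))
    return syms
```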

The probability model of symbol BR is conditioned on the previously coded coefficients in the same transform block. Since a transform coefficient can have correlations with multiple neighboring samples [38], we extend the reference samples from the two spatially nearest neighbors in VP9 to a region that depends on the transform kernel, as shown in Figure 32. For 1-D transform kernels, it uses 3 coefficients after the current sample along the transform direction. For 2-D transform kernels, up to 5 neighboring coefficients in the immediate right-bottom region are used. In both cases, the absolute values of the reference coefficients are added, and the sum is used as the context for the probability model of BR.

Similarly, the probability model of symbol LR is designed as shown in Figure 33, where the reference region for 2-D transform kernels is reduced to the nearest 3 coefficients. The symbol HR is coded using an Exp-Golomb code [39].

The sign bit is only needed for non-zero quantized transform coefficients. Since the sign bits of AC coefficients are largely uncorrelated, they are coded in raw bits. To improve hardware throughput, all the sign bits of AC coefficients within a transform block are packed together for transmission in the bit-stream, which allows a chunk of data to bypass the entropy coding route in hardware decoders. The sign bit of the DC coefficient, on the other hand, is entropy coded using a probability model conditioned on the sign bits of the DC coefficients in the above and left transform blocks.


Fig. 31: The absolute value of a quantized transform coefficient V is decomposed into BR, LR, and HR symbols.


Fig. 32: Reference region for symbol BR. Left: A coefficient (in orange) in a 2-D transform block uses 5 previously processed coefficients (in green) to build the context for its conditional probability model. Middle and Right: A coefficient (in orange) in a 1-D transform block uses 3 previously processed coefficients (in green) along the transform direction to build the context for its conditional probability model.


Fig. 33: Reference region for symbol LR. Left: A coefficient (in orange) in a 2-D transform block uses 3 previously processed coefficients (in green) to build the context for its conditional probability model. Middle and Right: A coefficient (in orange) in a 1-D transform block uses 3 previously processed coefficients (in green) along the transform direction to build the context for its conditional probability model.


VI. POST-PROCESSING FILTERS

AV1 allows 3 optional in-loop filter stages: a deblocking filter, a constrained directional enhancement filter (CDEF), and a loop restoration filter, as illustrated in Figure 34. The filtered output frame is used as a reference frame for later frames. A normative film grain synthesis stage can be optionally applied prior to display. Unlike the in-loop filter stages, the results of the film grain synthesis stage do not influence the prediction for subsequent frames. It is hence referred to as an out-of-loop filter.

A. Deblocking Filter

The deblocking filter is applied across the transform block boundaries to remove blocky artifacts caused by the quantization error. The logic for the vertical and horizontal edges is fairly similar. We use the vertical edge case to present the design principles.


Fig. 34: AV1 allows 3 optional in-loop filter stages, including a deblocking filter, a constrained directional enhancement filter, and a loop restoration filter. A normative film grain synthesis stage is supported for the displayed picture.


Fig. 35: The filter length is decided by the minimum transform block sizes on both sides.


1) Filter Length: AV1 supports 4-tap, 8-tap, and 14-tap FIR filters for the luma component, and 4-tap and 6-tap FIR filters for the chroma components. All the filter coefficients are preset in the codec. The filter length is decided by the minimum transform block sizes on the two sides of the boundary. For example, in Figure 35 the length of filter 1 is given by min(tx_width1, tx_width2), whereas the length of filter 2 is given by min(tx_width1, tx_width3). If the transform block dimension is 16 or above on both sides, the filter length is set to 14.

Note that this selected filter length is the maximum filter length allowed for a given transform block boundary. The final filter further depends on a flatness metric, discussed next.

2) Boundary Conditions: The FIR filters used by the deblocking stage are low-pass filters. To avoid blurring an actual edge in the original image, an edge detection is conducted to disable the deblocking filter at transitions that contain a high variance signal. We use the notations shown in Figure 36, where the dashed line shows the transform block boundary and p0-p6 and q0-q6 are the pixels on the two sides. We consider the transition along the line p6 to q6 high variance, and hence disable the deblocking filter, if any of the following conditions is true:

• |p1 − p0| > T0
• |q1 − q0| > T0
• 2|p0 − q0| + |p1 − q1| / 2 > T1

Fig. 36: Pixels at a transform block boundary. The dashed line shows the transform block boundary. p0-p6 and q0-q6 are the pixels on the two sides.

If the filter length is 8 or 14, two additional samples are checked to determine if the transition contains a high variance signal:

• |p3 − p2| > T0
• |q3 − q2| > T0

The thresholds T0 and T1 can be decided on a superblock by superblock basis. A higher threshold allows more transform block boundaries to be filtered. In AV1 these thresholds can be independently set in the bit-stream for the vertical and horizontal edges in the luma component and for each chroma plane.

To avoid ringing artifacts, AV1 further requires that a long filter is only used when both sides are "flat". For the 8-tap filter, this requires |q_k − q0| ≤ 1 and |p_k − p0| ≤ 1, where k ∈ {1, 2, 3}. For the 14-tap filter, the condition extends to k ∈ {1, 2, ..., 6}. If any flatness condition is false, the codec reverts to a shorter filter for that boundary.

B. Constrained Directional Enhancement Filter

The constrained directional enhancement filter (CDEF) allows the codec to apply a non-linear deringing filter along certain (potentially oblique) directions [40]. It operates in 8 × 8 units. There are 8 preset directions available, as drawn in Figure 37. The decoder uses the reconstructed pixels to select the prevalent direction index by minimizing

E_d^2 = Σ_k Σ_{p∈P_{d,k}} (x_p − μ_{d,k})^2,   (44)

where x_p is the value of pixel p, P_{d,k} are the pixels in line k following direction d, and μ_{d,k} is the mean value of P_{d,k}:

μ_{d,k} = (1 / |P_{d,k}|) Σ_{p∈P_{d,k}} x_p.   (45)

A primary filter is applied along the selected direction, whilst a secondary filter is applied along the direction oriented 45° off the primary direction. The filter operation for pixel p(x, y) is formulated as

p'(x, y) = p(x, y) + Σ_{m,n} w^p_{d,m,n} f(p(m, n) − p(x, y), S_p, D) + Σ_{m,n} w^s_{d,m,n} f(p(m, n) − p(x, y), S_s, D),

where w^p_{d,m,n} and w^s_{d,m,n} are the filter coefficients associated with the primary and secondary filters, respectively, as shown in Figures 38 and 39.


Fig. 37: The 8 preset directions in CDEF [40]. All the pixels in a line following direction d ∈ {0, ..., 7} in an 8 × 8 block are marked by a line number.

Fig. 38: The primary filter along direction d ∈ {0, ..., 7}, where a = 2 and b = 4 are for even strength indexes, and a = b = 3 are for odd strength indexes [40].

S_p and S_s are the strength indexes for the primary and secondary filters, and D is the damping factor. The f() is a piece-wise linear function,

f(diff, S, D) = min(diff, max(0, S − ⌊diff / 2^{D−⌊log2 S⌋}⌋)),   if diff > 0;
f(diff, S, D) = max(diff, min(0, −S − ⌈diff / 2^{D−⌈log2 S⌉}⌉)),  otherwise,

which rules out reference pixels whose values are far away from p(x, y). Note that the reference pixels p(m, n) are the reconstructed pixels after the deblocking filter is applied, but before the application of the CDEF filter.
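The constraint can be written compactly in sign-magnitude form. The sketch below follows the libaom-style formulation, which matches the piece-wise linear behavior described above: the correction keeps the sign of diff, never exceeds |diff|, and decays to zero once |diff| outgrows the strength.

```python
import math

def cdef_constrain(diff, strength, damping):
    """Sign-magnitude form of the CDEF constraint f()."""
    if strength == 0:
        return 0
    shift = max(0, damping - int(math.floor(math.log2(strength))))
    mag = min(abs(diff), max(0, strength - (abs(diff) >> shift)))
    return mag if diff > 0 else -mag
```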

Up to 8 groups of filter parameters, which include the primary and secondary filter strength indexes of the luma and chroma components, are signaled in the frame header. Each 64 × 64 block selects one group from the presets to control its filter operations.

C. Loop Restoration Filter

The loop restoration filter is applied to units of either 64 × 64, 128 × 128, or 256 × 256 pixel blocks, named loop restoration units (LRU). Each unit can independently select either to bypass filtering, to use a Wiener filter, or to use a self-guided filter [41]. It is applied to the reconstructed pixels after any prior post-filtering stages.

Fig. 39: The secondary filter is applied along the direction 45° off the corresponding primary direction d [40].


Fig. 40: The bit precision for the Wiener filter parameters.


1) Wiener Filter: A 7 × 7 separable Wiener filter is applied through the LRU. The filter parameters for the vertical and horizontal filters are decided by the encoder and signaled in the bit-stream. Due to symmetry and normalization constraints, only 3 coefficients need to be sent for each filter. Also note that the Wiener filters are expected to have a higher weight magnitude towards the origin, so the codec reduces the number of bits spent on the higher tap coefficients, as shown in Figure 40.

2) Self-Guided Filter: The scheme applies simple filters to the reconstructed pixels, X, to generate two denoised versions, X1 and X2, which largely preserve the edge transitions. Their differences from the reconstructed pixels, (X1 − X) and (X2 − X), are used to span a sub-space, onto which we project the difference between the source pixels and the reconstructed pixels, (Xs − X), as shown in Figure 41. The least-squares regression parameters obtained by the encoder are signaled to the decoder, where they are used to build a linear approximation of (Xs − X) from the known bases (X1 − X) and (X2 − X).

In particular, a radius r and a noise variance e are used to generate the denoised versions of the LRU as follows:

1) Obtain the mean μ and variance σ^2 of the pixels in a (2r + 1) × (2r + 1) window around every pixel x.
2) Compute the denoised pixel as

x̂ = (σ^2 / (σ^2 + e)) x + (e / (σ^2 + e)) μ.   (46)


Fig. 41: Project the gap between the source pixels Xs and the reconstructed pixels X onto a sub-space spanned by simple denoising results, X1 − X and X2 − X. The parameters in red are the ones configurable through bit-stream syntax.

The pair (r, e) effectively controls the denoising filter strength. Two sets of denoised pixels, denoted in vector form X1 and X2, are generated using (r1, e1) and (r2, e2), which are selected by the encoder and signaled in the bit-stream. Let X denote the vector formed by the reconstructed pixels and Xs the vector of source pixels. The self-guided filter is formulated as

X_r = X + α(X1 − X) + β(X2 − X).   (47)

The parameters (α, β) are obtained by the encoder using least-squares regression:

[α, β]^T = (A^T A)^{−1} A^T b,   (48)

where A = [X1 − X, X2 − X] and b = Xs − X. The parameters (α, β) are sent to the decoder to formulate (47).
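Since (48) is an ordinary least-squares fit over two basis vectors, the encoder-side computation and the decoder-side reconstruction (47) can be sketched directly with numpy:

```python
import numpy as np

def self_guided_params(x, x1, x2, xs):
    """Encoder side of (48): fit the restoration gap (xs - x) on the
    bases (x1 - x) and (x2 - x). Inputs are flattened pixel vectors of
    one loop restoration unit."""
    A = np.stack([x1 - x, x2 - x], axis=1)
    b = xs - x
    (alpha, beta), *_ = np.linalg.lstsq(A, b, rcond=None)
    return alpha, beta

def self_guided_apply(x, x1, x2, alpha, beta):
    """Decoder side, equation (47)."""
    return x + alpha * (x1 - x) + beta * (x2 - x)
```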

D. Frame Super-Resolution

When the source input is down-scaled from the original video signal, frame super-resolution is natively supported as part of the post-processing filtering that converts the reconstructed frame to the original dimension. As shown in Figure 42, the frame super-resolution consists of an up-sampling stage and a loop restoration filter [42].

The up-sampling stage is applied to the reconstructed pixels after the CDEF filter. As mentioned in Section II-C, the down-sampling and up-sampling operations only apply to the horizontal direction. The up-sampling process for a row of pixels in a frame is shown in Figure 43. Let B denote the analog frame width. The down-sampled frame contains D pixels in a row, and the up-scaled frame contains W pixels in a row. Their sampling positions are denoted by P_k and Q_m, respectively, where k ∈ {0, 1, ..., D − 1} and m ∈ {0, 1, ..., W − 1}.

The offset from P0 to Q0 is given by

Q0 − P0 = B/(2W) − B/(2D) = B(D − W)/(2WD).

The spacing between Q_m and Q_{m+1} is given by

Q_{m+1} − Q_m = B/W.


Fig. 42: The frame super-resolution up-samples the reconstructed frame to the original dimension. It comprises a linear up-sampling and a loop restoration filter.


Fig. 43: Frame super-resolution sampling positions. The analog frame width is denoted by B. The down-sampled frame contains D pixels in a row, which are used to interpolate W pixels for a row in the up-scaled frame.

To map Q_m into sub-pixel positions in the down-sampled pixel row, we normalize the relative distance by B/D, which corresponds to one full-pixel offset in the down-sampled frame. Therefore the initial offset for Q0 is (D − W)/(2W). The offset for a subsequent Q_m is given by (D − W)/(2W) + mΔQ, where ΔQ = D/W. In practice, these offsets are calculated at 1/16384 pixel precision. They are rounded to the nearest 1/16-pixel position for the interpolation filter. An 8-tap FIR filter is used to generate the sub-pixel interpolation.

Note that the rounding error

e = round(ΔQ) − ΔQ   (49)

builds up in the offset for Q_m, i.e., (D − W)/(2W) + m(ΔQ + e), as m increases from 0 to W − 1. Here the function round() maps a variable to the nearest sample at 1/16384 resolution. This would make the left-most pixel in a row have the minimum rounding error in the offset calculation, whereas the right-most pixel has the maximum rounding error. To resolve such spatial bias, the initial offset for Q0 is further adjusted by −eW/2, which makes the left- and right-most pixels have rounding errors of equal magnitude, and the middle pixel Q_{W/2} close to zero rounding error. In summary, the initial offset for Q0 is given by

Q0_offset = (D − W)/(2W) − eW/2.   (50)

The offset for a subsequent Q_m is

Qm_offset = (D − W)/(2W) − eW/2 + m · round(ΔQ).   (51)

The loop restoration filter in Section VI-C is then applied to the up-sampled frame to further recover the high frequency components. It is experimentally shown in [42] that the loop restoration filter, whose parameters are optimized by the encoder, can substantially improve the objective quality of the up-sampled frame.


Fig. 44: The reference region (in blue) used by the AR model to generate the grain at the current sample (in orange). The reference region includes a (2L + 1) × L block above and an L × 1 block to the left. The total number of reference samples is 2L(L + 1).


E. Film Grain Synthesis

Film grain is widely present in creative content, such as movie and TV materials. Due to its random nature, film grain is very difficult to compress using conventional coding tools that exploit signal correlations. AV1 provides a film grain synthesis option that builds a synthetic grain and adds it to the decoded picture prior to display. This allows one to remove the film grain from the source video signal prior to compression. A set of model parameters is sent to the decoder to create a synthetic grain that mimics the original film grain.

AV1 adopts an auto-regressive (AR) model to build the grain signal [43]. The grain samples are generated in raster scan order. A grain sample in the luma plane is generated using a (2L + 1) × L block above and an L × 1 block to the left, as shown in Figure 44, which involves 2L(L + 1) reference samples, where L ∈ {0, 1, 2, 3}. The AR model is given by

G(x, y) = Σ_{(m,n)∈S_ref} a_{m,n} G(x − m, y − n) + z,   (52)

where S_ref is the reference region and z is a pseudo-random variable drawn from a zero-mean unit-variance Gaussian distribution. The grain samples for the chroma components are generated similarly to (52), with one additional input from the collocated grain sample in the luma plane. The model parameters associated with each plane are transmitted through the bit-stream to formulate the desired grain patterns.
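A sketch of the raster-scan AR synthesis in (52) follows; the coefficient layout and padding convention here are illustrative, with coeffs mapping causal offsets (dy, dx) in the Figure 44 region to a_{m,n}.

```python
import numpy as np

def grain_template(coeffs, L, size=64, seed=0):
    """Generate a grain template per (52) in raster scan order.
    coeffs: dict {(dy, dx): a} over the (2L+1) x L block above and the
    L x 1 block to the left of the current sample."""
    rng = np.random.default_rng(seed)
    g = np.zeros((size + L, size + 2 * L))   # padding for the causal window
    for y in range(L, size + L):
        for x in range(L, size + L):
            acc = rng.standard_normal()      # z ~ N(0, 1)
            for (dy, dx), a in coeffs.items():
                acc += a * g[y - dy, x - dx]
            g[y, x] = acc
    return g[L:, L:size + L]
```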

The AR process is used to generate a template of grain samples corresponding to a 64 × 64 pixel block. Patches whose dimensions correspond to a 32 × 32 pixel block are drawn at pseudo-random positions within this template and applied to the reconstructed video signal.

The final luma pixel at position (x, y) is given by

P'(x, y) = P(x, y) + f(P(x, y)) G(x, y),   (53)

where P(x, y) is the decoded pixel value and f(P(x, y)) scales the grain sample according to the collocated pixel intensity. The f() is a piece-wise linear function and is configured by the parameters sent through the bit-stream. The grain samples applied to the chroma components are scaled based on the chroma pixel value as well as the collocated luma pixel values. A chroma pixel is given by

P'_u(x, y) = P_u(x, y) + f(t) G_u(x, y),
t = b_u P_u(x, y) + d_u P̄(x, y) + h_u,

where P̄(x, y) denotes the average of the collocated luma pixels. The parameters b_u, d_u, and h_u are signaled in the bit-stream for each chroma plane.

The film grain synthesis model parameters are decided on a frame by frame basis and signaled in the frame header. AV1 also allows a frame to re-use the previous frame's model parameter set and bypass sending a new set in the frame header.

VII. PROFILE AND LEVEL DEFINITION

AV1 defines profiles and levels to specify the decoder capability. Three profiles define support for various bit-depth and chroma sampling formats, namely Main, High, and Professional. The capability required for each profile is presented in Table II.

TABLE II: Capability Comparisons of AV1 Profiles

Profile        Bit-depth         Chroma sampling
               8    10    12     4:0:0   4:2:0   4:2:2   4:4:4
Main           X    X            X       X
High           X    X            X       X               X
Professional   X    X     X      X       X       X       X

Levels are defined to specify the upper limit of decoder performance in terms of frame rate, resolution, and other performance characteristics, as presented in Table III. Note that some levels are not shown because they have not been formally defined yet (e.g., level 7 and above, level 2.2, etc.). Example frame rates and resolutions for each level are included for reference. For further details and updated definitions, please refer to the AV1 specification [9].

VIII. PERFORMANCE EVALUATION

We compared the peak compression performance of libvpx VP9 [44] and libaom AV1 [6]. The experiments used hash version 1e892e63 of the libvpx VP9 source code [44] and hash version fa815c62 of the libaom AV1 source code [6].

Both codecs used the default 2-pass encoding mode and variable bit-rate control, and ran at the highest compression performance mode, i.e., --cpu-used=0. To achieve peak compression performance, both the VP9 and AV1 encoders allowed adaptive GOP sizes, where the decisions were made based on the first pass encoding statistics. The quantization parameter offsets between different frames within a GOP were also adaptively optimized based on the first pass coding statistics. The test sets included video resolutions ranging from 480p to 1080p. All the clips were coded using their first 150 frames.


TABLE III: AV1 Level Definitions

Level  MaxPicSize   MaxHSize   MaxVSize   MaxDisplayRate   MaxDecodeRate   MaxHeaderRate
       (samples)    (samples)  (samples)  (samples/sec)    (samples/sec)   (/sec)
2      147,456      2,048      1,152      4,423,680        5,529,600       150
2.1    278,784      2,816      1,584      8,363,520        10,454,400      150
3      665,856      4,352      2,448      19,975,680       24,969,600      150
3.1    1,065,024    5,504      3,096      31,950,720       39,938,400      150
4      2,359,296    6,144      3,456      70,778,880       77,856,768      300
4.1    2,359,296    6,144      3,456      141,557,760      155,713,536     300
5      8,912,896    8,192      4,352      267,386,880      273,715,200     300
5.1    8,912,896    8,192      4,352      534,773,760      547,430,400     300
5.2    8,912,896    8,192      4,352      1,069,547,520    1,094,860,800   300
5.3    8,912,896    8,192      4,352      1,069,547,520    1,176,502,272   300
6      35,651,584   16,384     8,704      1,069,547,520    1,176,502,272   300
6.1    35,651,584   16,384     8,704      2,139,095,040    2,189,721,600   300
6.2    35,651,584   16,384     8,704      4,278,190,080    4,379,443,200   300
6.3    35,651,584   16,384     8,704      4,278,190,080    4,706,009,088   300

Level  MainMbps  HighMbps  MainCR  HighCR  MaxTiles  MaxTileCols  Example
2      1.5       -         2       -       8         4            426x240@30fps
2.1    3         -         2       -       8         4            640x360@30fps
3      6         -         2       -       16        6            854x480@30fps
3.1    10        -         2       -       16        6            1280x720@30fps
4      12        30        4       4       32        8            1920x1080@30fps
4.1    20        50        4       4       32        8            1920x1080@60fps
5      30        100       6       4       64        8            3840x2160@30fps
5.1    40        160       8       4       64        8            3840x2160@60fps
5.2    60        240       8       4       64        8            3840x2160@120fps
5.3    60        240       8       4       64        8            3840x2160@120fps
6      60        240       8       4       128       16           7680x4320@30fps
6.1    100       480       8       4       128       16           7680x4320@60fps
6.2    160       800       8       4       128       16           7680x4320@120fps
6.3    160       800       8       4       128       16           7680x4320@120fps

The BD-rate reductions in average PSNR, overall PSNR, and SSIM are shown in Tables IV and V.

Note that the results are intended for reference only. Different encoder implementations might have different performance results. An extensive codec performance evaluation under various encoder constraints is beyond the scope of this paper. Readers are referred to [8] for more comparison results under encoder constraints.

IX. CONCLUSION

This paper provides a technical overview of the AV1 codec. It outlines the design theories of the compression techniques and the considerations for hardware feasibility, which together define the current state of AV1.

REFERENCES

[1] "Cisco annual internet report (2018-2023) white paper." [Online]. Available: https://www.cisco.com/c/en/us/solutions/collateral/executive-perspectives/annual-internet-report/white-paper-c11-741490.html

[2] D. Mukherjee, J. Han, J. Bankoski, R. Bultje, A. Grange, J. Koleszar, P. Wilkins, and Y. Xu, "A technical overview of VP9, the latest open-source video codec," SMPTE Motion Imaging Journal, vol. 124, no. 1, pp. 44-54, 2015.

[3] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, "Overview of the high efficiency video coding (HEVC) standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649-1668, 2012.

[4] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560-576, 2003.

[5] "Alliance for Open Media." [Online]. Available: https://aomedia.org/

[6] "Libaom AV1 repository." [Online]. Available: https://aomedia.googlesource.com/aom/

[7] Y. Chen, D. Mukherjee, J. Han, A. Grange, Y. Xu, Z. Liu, S. Parker, C. Chen, H. Su, U. Joshi et al., "An overview of core coding tools in the AV1 video codec," in 2018 Picture Coding Symposium (PCS). IEEE, 2018, pp. 41-45.

[8] Y. Chen, D. Mukherjee, J. Han, A. Grange, Y. Xu, S. Parker, C. Chen, H. Su, U. Joshi, C.-H. Chiang et al., "An overview of coding tools in AV1: the first video codec from the Alliance for Open Media," APSIPA Transactions on Signal and Information Processing, vol. 9, 2020.

[9] P. de Rivaz and J. Haughton, "AV1 bitstream & decoding process specification." [Online]. Available: https://aomediacodec.github.io/av1-spec/av1-spec.pdf

[10] "Libaom performance tracker." [Online]. Available: https://datastudio.google.com/reporting/a84c7736-99c3-4ff5-a9df-92deae923294

[11] "Libaom contributor list." [Online]. Available: https://aomedia.googlesource.com/aom/+/refs/heads/master/AUTHORS

[12] J. Friedman, T. Hastie, and R. Tibshirani, The Elements of Statistical Learning. Springer Series in Statistics, New York, 2001, vol. 1, no. 10.

[13] H. Schwarz, D. Marpe, and T. Wiegand, "Analysis of hierarchical B pictures and MCTF," in 2006 IEEE International Conference on Multimedia and Expo. IEEE, 2006, pp. 1929-1932.

[14] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2012.


TABLE IV: Peak compression performance comparison - mid resolution.

Clip                             Avg PSNR   Ovr PSNR   SSIM
aspen 480p                       -24.429    -36.740    -28.111
blue sky 480                     -38.814    -41.394    -45.422
BQMall 832x480                   -34.627    -35.512    -37.650
PartyScene 832x480               -31.767    -32.172    -33.822
city 4cif                        -33.399    -33.466    -33.619
controlled burn 480p             -34.615    -34.872    -33.145
crew 4cif                        -24.740    -26.766    -25.670
ice 4cif                         -27.581    -27.940    -33.115
crowd run 480                    -25.263    -27.103    -32.555
into tree 480p                   -31.687    -32.839    -30.191
old town cross 480p              -32.023    -32.168    -31.417
shields 640x360                  -29.222    -28.438    -26.899
snow mnt 480p                    -33.929    -34.350    -36.672
soccer 4cif                      -34.853    -31.280    -36.402
speed bag 480p                   -39.336    -40.356    -45.057
station2 480                     -59.016    -59.000    -58.361
tears of steel1 480p             -27.333    -28.389    -30.022
touchdown pass 480p              -25.580    -26.267    -23.294
Flowervase 832x480               -33.703    -35.879    -37.248
ducks take off 480p              -39.624    -39.329    -40.586
Mobisode2 832x480                -44.239    -43.041    -45.296
Netflix Narrator 850x480         -36.380    -37.432    -32.555
netflix driving                  -29.741    -30.988    -29.982
TrafficFlow 854x480              -35.742    -35.902    -32.584
BalloonFestival 854x480          -35.231    -37.152    -33.646
ShowGirl2Teaser 854x480          -25.258    -29.020    -27.089
sintel shot 854x480              -37.219    -29.721    -33.997
BasketballDrillText 832x480      -31.614    -32.340    -30.280
BasketballDrill 832x480          -32.042    -32.848    -33.697
RaceHorses 832x480               -29.273    -30.212    -33.308
park joy 480p                    -28.463    -32.992    -35.224
Keiba 832x480                    -30.908    -30.970    -33.800
harbour 4cif                     -28.970    -29.052    -30.100
west wind easy 480p              -23.390    -24.918    -20.922
rush field cuts 480p             -30.938    -33.153    -28.001
netflix barscene                 -37.174    -39.343    -34.484
Campfire 854x480                 -31.927    -34.238    -43.445
CatRobot 854x480                 -38.715    -40.088    -38.874
DaylightRoad2 854x480            -29.814    -30.708    -29.196
Tango 854x480                    -33.255    -34.784    -33.706
Market3Clip4000r2 854x480        -34.664    -34.813    -34.627
red kayak 480p                   -15.343    -20.011    -20.094
netflix aerial                   -42.533    -45.936    -46.874
netflix foodmarket               -29.709    -30.156    -29.232
netflix ritualdance              -26.329    -28.999    -33.673
netflix rollercoaster            -34.911    -35.083    -34.093
netflix squareandtimelapse       -31.716    -32.983    -31.106
netflix tunnelflag               -36.557    -37.048    -45.907
Drums 854x480                    -37.132    -37.785    -37.800
ToddlerFountain 854x480          -22.928    -24.232    -27.958
OVERALL                          -32.473    -33.604    -34.016

[15] J. Han, T. Kopp, and Y. Xu, “An estimation-theoretic approach to video denoising,” in 2015 IEEE International Conference on Image Processing (ICIP). IEEE, 2015, pp. 4273–4277.

TABLE V: Peak compression performance comparison (BDRATE, %) - high resolution.

Clip                             Avg PSNR   Ovr PSNR   SSIM
basketballdrive 1080p            -32.958    -34.903    -37.227
cactus 1080p                     -36.183    -36.777    -34.813
touchdown pass 1080p             -30.706    -31.097    -29.173
tractor 1080p                    -35.557    -35.767    -39.988
Drums 1280x720                   -35.994    -36.546    -36.105
ToddlerFountain 1280x720         -23.432    -24.774    -28.332
crowd run 1080p                  -27.888    -28.668    -33.347
Netflix Crosswalk 2048x1080      -35.853    -38.410    -37.095
Netflix FoodMarket 2048x1080     -29.398    -29.569    -30.209
Netflix SquareTime 2048x1080     -34.693    -35.876    -32.739
city 720p                        -36.892    -37.002    -36.948
controlled burn 1080p            -39.378    -37.935    -40.288
johnny 720p                      -39.501    -39.608    -40.175
night 720p                       -35.462    -36.719    -33.957
sunflower 720p                   -41.300    -42.133    -43.926
vidyo4 720p                      -37.519    -38.220    -35.236
TrafficFlow 1280x720             -34.222    -34.160    -33.376
ducks take off 1080p             -39.005    -39.671    -42.798
BalloonFestival 1280x720         -36.463    -38.031    -38.083
ShowGirl2Teaser 1280x720         -26.728    -29.301    -29.140
Netflix Dancers 1280x720         -36.058    -36.343    -34.236
aspen 1080p                      -40.357    -41.650    -44.337
crew 720p                        -25.734    -27.666    -28.190
CSGO 1080p                       -36.230    -35.010    -35.787
dinner 1080p                     -37.161    -38.932    -39.467
ped 1080p                        -30.589    -30.714    -35.628
Campfire 1280x720                -33.337    -35.992    -44.824
CatRobot 1280x720                -39.963    -41.089    -39.616
factory 1080p                    -33.753    -32.999    -38.161
DaylightRoad2 1280x720           -29.620    -30.246    -28.915
RollerCoaster 1280x720           -31.609    -32.200    -29.301
Tango 1280x720                   -34.008    -35.656    -32.729
Market3Clip4000r2 1280x720       -36.550    -36.667    -38.170
Netflix FoodMarket2 1280x720     -31.598    -31.698    -32.339
Netflix Aerial 2048x1080         -35.110    -35.042    -36.233
Netflix Boat 2048x1080           -36.368    -38.050    -32.492
Netflix DrivingPOV 2048x1080     -28.247    -27.852    -29.423
Netflix PierSeaside 2048x1080    -43.643    -43.595    -39.784
Netflix TunnelFlag 2048x1080     -39.544    -40.504    -42.802
in to tree 1080p                 -29.264    -29.494    -29.244
kristenandsara 720p              -36.422    -37.100    -35.449
parkjoy 1080p                    -31.282    -34.829    -45.045
old town cross 720p              -35.365    -35.590    -35.063
red kayak 1080p                  -17.367    -20.445    -19.396
riverbed 1080p                   -20.188    -24.161    -22.801
rush field cuts 1080p            -33.088    -34.595    -32.759
rush hour 1080p                  -24.342    -24.297    -29.184
shields 720p                     -35.174    -35.096    -36.111
station2 1080p                   -56.422    -56.907    -57.702
tennis 1080p                     -29.062    -30.400    -33.595
OVERALL                          -33.932    -34.800    -35.435

[16] C. Chen, J. Han, and Y. Xu, “Video denoising for the hierarchical coding structure in video coding,” in 2020 Data Compression Conference (DCC). IEEE, 2020.


[17] C.-H. Chiang, J. Han, and Y. Xu, “A multi-pass coding mode search framework for AV1 encoder optimization,” in 2019 Data Compression Conference (DCC). IEEE, 2019, pp. 458–467.

[18] L. Trudeau, N. Egge, and D. Barr, “Predicting chroma from luma in AV1,” in 2018 Data Compression Conference (DCC). IEEE, 2018, pp. 374–382.

[19] M. Jakubowski and G. Pastuszak, “Block-based motion estimation algorithms - a survey,” Opto-Electronics Review, vol. 21, no. 1, pp. 86–102, 2013.

[20] I. Patras, E. A. Hendriks, and R. L. Lagendijk, “Probabilistic confidence measures for block matching motion estimation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 17, no. 8, pp. 988–995, 2007.

[21] J. Vanne, E. Aho, T. D. Hamalainen, and K. Kuusilinna, “A high-performance sum of absolute difference implementation for motion estimation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 16, no. 7, pp. 876–883, 2006.

[22] H. Jozawa, K. Kamikura, A. Sagata, H. Kotera, and H. Watanabe, “Two-stage motion compensation using adaptive global MC and local affine MC,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 7, no. 1, pp. 75–85, 1997.

[23] H.-K. Cheung and W.-C. Siu, “Local affine motion prediction for H.264 without extra overhead,” in Proceedings of 2010 IEEE International Symposium on Circuits and Systems. IEEE, 2010, pp. 1555–1558.

[24] R. C. Kordasiewicz, M. D. Gallant, and S. Shirani, “Affine motion prediction based on translational motion vectors,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 17, no. 10, pp. 1388–1394, 2007.

[25] S. Parker, Y. Chen, D. Barker, P. De Rivaz, and D. Mukherjee, “Global and locally adaptive warped motion compensation in video compression,” in 2017 IEEE International Conference on Image Processing (ICIP). IEEE, 2017, pp. 275–279.

[26] S.-W. Wu and A. Gersho, “Joint estimation of forward and backward motion vectors for interpolative prediction of video,” IEEE Transactions on Image Processing, vol. 3, no. 5, pp. 684–687, 1994.

[27] M. T. Orchard and G. J. Sullivan, “Overlapped block motion compensation: An estimation-theoretic approach,” IEEE Transactions on Image Processing, vol. 3, no. 5, pp. 693–699, 1994.

[28] Y. Chen and D. Mukherjee, “Variable block-size overlapped block motion compensation in the next generation open-source video codec,” in 2017 IEEE International Conference on Image Processing (ICIP). IEEE, 2017, pp. 938–942.

[29] G. Laroche, J. Jung, and B. Pesquet-Popescu, “RD optimized coding for motion vector predictor selection,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 18, no. 9, pp. 1247–1257, 2008.

[30] J.-L. Lin, Y.-W. Chen, Y.-W. Huang, and S.-M. Lei, “Motion vector coding in the HEVC standard,” IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 6, pp. 957–968, 2013.

[31] J. Han, J. Feng, Y. Teng, Y. Xu, and J. Bankoski, “A motion vector entropy coding scheme based on motion field referencing for video compression,” in 2018 25th IEEE International Conference on Image Processing (ICIP). IEEE, 2018, pp. 3618–3622.

[32] N. Ahmed, T. Natarajan, and K. R. Rao, “Discrete cosine transform,” IEEE Transactions on Computers, vol. 100, no. 1, pp. 90–93, 1974.

[33] J. Han, A. Saxena, V. Melkote, and K. Rose, “Jointly optimized spatial prediction and block transform for video and image coding,” IEEE Transactions on Image Processing, vol. 21, no. 4, pp. 1874–1884, 2011.

[34] F. Kamisli and J. S. Lim, “1-D transforms for the motion compensation residual,” IEEE Transactions on Image Processing, vol. 20, no. 4, pp. 1036–1046, 2010.

[35] W.-H. Chen, C. Smith, and S. Fralick, “A fast computational algorithm for the discrete cosine transform,” IEEE Transactions on Communications, vol. 25, no. 9, pp. 1004–1009, 1977.

[36] J. Han, Y. Xu, and D. Mukherjee, “A butterfly structured design of the hybrid transform coding scheme,” in 2013 Picture Coding Symposium (PCS). IEEE, 2013, pp. 17–20.

[37] I. H. Witten, R. M. Neal, and J. G. Cleary, “Arithmetic coding for data compression,” Communications of the ACM, vol. 30, no. 6, pp. 520–540, 1987.

[38] J. Han, C.-H. Chiang, and Y. Xu, “A level-map approach to transform coefficient coding,” in 2017 IEEE International Conference on Image Processing (ICIP). IEEE, 2017, pp. 3245–3249.

[39] S. W. Golomb, “Run-length encodings,” IEEE Transactions on Information Theory, vol. 12, no. 3, pp. 399–401, 1966.

[40] S. Midtskogen and J.-M. Valin, “The AV1 constrained directional enhancement filter (CDEF),” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 1193–1197.

[41] D. Mukherjee, S. Li, Y. Chen, A. Anis, S. Parker, and J. Bankoski, “A switchable loop-restoration with side-information framework for the emerging AV1 video codec,” in 2017 IEEE International Conference on Image Processing (ICIP). IEEE, 2017, pp. 265–269.

[42] U. Joshi, D. Mukherjee, Y. Chen, S. Parker, and A. Grange, “In-loop frame super-resolution in AV1,” in 2019 Picture Coding Symposium (PCS). IEEE, 2019.

[43] A. Norkin and N. Birkbeck, “Film grain synthesis for AV1 video codec,” in 2018 Data Compression Conference (DCC). IEEE, 2018, pp. 3–12.

[44] “Libvpx VP9 repository.” [Online]. Available: https://chromium.googlesource.com/webm/libvpx