
Thesis submitted in fulfilment of the requirements for the award of the degree of Doctor of Engineering Sciences (Doctor in de ingenieurswetenschappen) from Vrije Universiteit Brussel and Doctor of Engineering (Doctor in de ingenieurswetenschappen) from Universiteit Gent

BINARY SOURCE CODING WITH SIDE INFORMATION

ANDREI TUDOR SECHELEA February 2017

Supervisors: Prof. dr. ir. Adrian Munteanu and Prof. dr. ir. Aleksandra Pizurica

Faculty of Engineering, Department of Electronics and Informatics


Examining Committee

Prof. Dr. ir. Adrian Munteanu – Vrije Universiteit Brussel – Promoter

Prof. Dr. ir. Aleksandra Pizurica – Universiteit Gent – Promoter

Prof. Dr. ir. Yves Rolain – Vrije Universiteit Brussel – Committee chair (VUB)

Prof. Dr. ir. Rik Van de Walle – Universiteit Gent – Committee chair (UGent)

Prof. Dr. ir. Rik Pintelon – Vrije Universiteit Brussel – Committee vice-chair

Prof. Dr. ir. Nikos Deligiannis – Vrije Universiteit Brussel – Committee secretary

Prof. Dr. ir. Jan Cornelis – Vrije Universiteit Brussel – Member

Prof. Dr. ir. Heidi Steendam – Universiteit Gent – Member

Prof. Dr. ir. Samuel Cheng – University of Oklahoma – Member

Dr. ir. Jurgen Slowack – BARCO – Member


“The journey not the arrival matters.”


Table of contents

Acknowledgments
Summary
Acronyms

1 Introduction
  1.1 Introduction
  1.2 Point-to-point Communications
    1.2.1 Channel Coding Theorem
    1.2.2 Lossless Source Coding Theorem
    1.2.3 Lossy Source Coding Theorem
    1.2.4 Source-Channel Separation Theorem
    1.2.5 Optimality of Transmission
  1.3 Distributed Compression
    1.3.1 Distributed Lossless Compression
    1.3.2 Distributed Lossy Compression
  1.4 Motivation
    1.4.1 Asymmetric vs Symmetric Correlation Models in Distributed Video Coding
    1.4.2 Rate-Distortion Function for Binary Wyner-Ziv Coding
  1.5 Outline and Major Contributions

2 Rate Distortion Theory
  2.1 Preliminaries
  2.2 Entropy and Information
  2.3 Sources and Channels
  2.4 The Rate-Distortion Function
  2.5 Blahut-Arimoto Algorithm
  2.6 Distributed Lossless Compression – Slepian-Wolf Theorem
    2.6.1 Distributed Lossless Source Coding
    2.6.2 Lossless Source Coding with Coded Side Information
  2.7 Lossy Compression with Side Information – Wyner-Ziv Theorem
    2.7.1 Maximum distortion
    2.7.2 Example – Doubly symmetric binary source
  2.8 Rate loss in Wyner-Ziv coding
  2.9 Achievability of Slepian-Wolf and Wyner-Ziv coding
    2.9.1 The Asymptotic Equipartition Property
    2.9.2 Consequences of the AEP: Data Compression
    2.9.3 Jointly Typical Sequences
    2.9.4 Achievability of Slepian-Wolf Coding
    2.9.5 Achievability of Wyner-Ziv Coding
  2.10 Practical Code Constructions for the Binary Wyner-Ziv Problem
    2.10.1 Binary Block Codes for Binary Slepian-Wolf and Wyner-Ziv Coding
    2.10.2 Practical Code Constructions
  2.11 Conclusions

3 Distributed Video Coding Overview
  3.1 Distributed Video Coding
    3.1.1 Brief Overview of Classical Video Compression
    3.1.2 DVC Codec Architecture
  3.2 Feedback Channels in DVC
  3.3 Side-information Generation Techniques
    3.3.1 Motion-Compensated Interpolation
    3.3.2 Hash-Based Motion Estimation
  3.4 Correlation Channel Modeling
  3.5 DVC Systems for Capsule Endoscopy and 1K-pixel Video Applications
    3.5.1 Codec Description – Capsule Endoscopy
    3.5.2 Codec Description – 1K-pixel Camera
    3.5.3 Predictive vs DVC Coding
  3.6 Conclusions

4 Binary Rate-distortion with Encoder-Decoder Side Information
  4.1 Introduction
  4.2 Problem Definition – System Overview
  4.3 The Rate-Distortion Function – the Uniform Source Case
  4.4 The Rate-Distortion Function – General Case
  4.5 Conclusions

5 Binary Rate-distortion with Decoder Side Information: WZ Coding
  5.1 Introduction
  5.2 Problem Definition – System Overview
  5.3 The Rate-Distortion Function – the Uniform Source Case
    5.3.1 Expression of the Rate
    5.3.2 Expression of the Distortion
    5.3.3 Symmetry Considerations
    5.3.4 Existence of a Unique Solution
    5.3.5 A Numerical Algorithm
  5.4 The Rate-Distortion Function – the General Case
    5.4.1 Expression of the Rate
    5.4.2 Derivation of the Distortion
    5.4.3 Symmetry Observations
    5.4.4 Possible Values for the Distortion Function
    5.4.5 Rate-Distortion Bound – Numerical Algorithm
  5.5 From BSC to the Z-channel – Examples
  5.6 Tightness of the Bound
    5.6.1 Comparison with the Blahut-Arimoto Algorithm
    5.6.2 Rate-distortion Bound for Ternary Auxiliary Variable U
  5.7 An Analytical Approximation of the Rate-Distortion Function
  5.8 Conclusions

6 Rate Loss
  6.1 Introduction
  6.2 Rate-loss for Binary Wyner-Ziv (WZ) Coding
    6.2.1 Upper Bound on the Rate-Loss in the WZ Problem
    6.2.2 No-rate-loss Cases
  6.3 No Rate-loss: the Z-channel Case
    6.3.1 Source Coding with Encoder-Decoder Side Information
    6.3.2 Source Coding with Decoder Side Information
    6.3.3 No Rate-loss Proof
  6.4 Encoding Rate
  6.5 Rate Loss: the General Case
    6.5.1 Remarks
  6.6 Conclusions

7 Epilogue
  7.1 General Conclusions
  7.2 Future Work

List of publications

References


Acknowledgments

Pursuing a PhD is a valuable experience in itself, unique for each individual and not quantifiable in templates. I would like to warmly thank the people who defined the coordinates of my own journey.

First of all, I would like to thank my promoters Prof. Adrian Munteanu and Prof. Aleksandra Pizurica for giving me the opportunity of following a joint PhD program between Vrije Universiteit Brussel (VUB) and Universiteit Gent (UGent). I am grateful for their continuous guidance and support throughout these years.

I would like to thank Prof. Nikos Deligiannis and Prof. Samuel Cheng for the close collaboration we had and for making me understand the value of my work.

I am grateful to the members of my examination committee: Prof. Yves Rolain, Prof. Rik Van de Walle, Prof. Rik Pintelon, Prof. Jan Cornelis, Prof. Nikos Deligiannis, Prof. Heidi Steendam, Prof. Samuel Cheng and Dr. Jurgen Slowack, for spending valuable time in reading and amending this dissertation.

I would like to thank all my friends in the ETRO department research group for creating the amazing environment we had within, but also outside, working perimeters. It was a pleasure to meet each of you, and I hope to keep our ties just as strong now that our roads part.

I wish to thank my family for their unconditional support.

Finally, I would like to acknowledge the financial support offered by the Agency for Innovation by Science and Technology (IWT).

Andrei Sechelea
Brussels, January 2017


Summary

The goal of an information transmission system is to get a message passed from source to destination. The realization that information can be consistently represented in binary form, i.e., using only 0s and 1s, led to mathematical formulations describing the characteristics of such a transmission system. Their design implies finding an appropriate trade-off between the cost of transmission and the quality of the received message. Point-to-point communications assume a setup formed from one sender, which encodes the message to be sent, and one receiver, which decodes the message. In practical applications, however, it is often the case that multiple sources must send information to a unique destination.

Slepian-Wolf (SW) coding, which is concerned with the separate lossless compression of correlated sources with joint decoding, forms the basis of distributed source coding (DSC) and can be used to exploit the correlation among quantized sources in lossy DSC problems such as Wyner-Ziv (WZ) coding and multiterminal (MT) source coding. The SW theory states that the coding performance of a system performing independent encoding and joint decoding of two correlated sources is the same as in the case where the two sources are jointly encoded and decoded. A practical development of the above information-theoretic facts is the distributed video coding (DVC) framework, which allows the computational complexity to be moved from the encoder, which usually faces computational constraints, to the decoder. This recent field of research has attracted increased interest for a wide range of applications, such as video surveillance, real-time streaming from multiple cameras, and immersive communications in general.

The ultimate goal of this dissertation is to provide a complete analysis of the binary Wyner-Ziv coding problem. As a preamble, the introductory chapters of the manuscript situate the topic of research within the generic information theory spectrum and introduce the fundamentals and concepts required for the understanding of the main contributions. We begin by making a clear distinction between point-to-point and multi-terminal coding scenarios; as such, the existing theoretical analyses regarding the generic SW and WZ setups are detailed in Chapter 2. Following the path of our research all the way back to its initial justifying argument, i.e., asymmetric correlation models in DVC, Chapter 3 explores the basics of predictive video coding in comparison to the classical DVC architectures. We capture the essential difference between the two paradigms, namely, the presence/absence of side information at the encoder, and justify the importance of our subsequent derivations. The following three chapters are dedicated to a detailed exposition of the contributions of the thesis. Chapter 4 presents the analytical derivation of the rate-distortion function for binary source coding with encoder-decoder side information. This corresponds to predictive coding, and its derivation is of high importance, as it represents the absolute lower bound for WZ coding. Chapter 5 presents the actual derivation of the proposed rate-distortion bound for the binary WZ coding setup. The input of the problem assumes a generic binary source, which may be non-uniform; the side information is obtained through a generic binary correlation channel which is binary asymmetric (BAC); and the Hamming distance is the distortion metric. This problem is proven not to admit an analytical solution, and as a consequence, the proposed solution is numerical. We describe strategies to achieve all rate-distortion points on the proposed bound, by identifying the correct minimizing parameters and reconstruction functions. By comparing our bound to the output of the Blahut-Arimoto algorithm, we conjecture our bound to be tight. Moreover, in order to mitigate the numerical nature of the solution, we propose an analytical approximation which can be used with a negligible estimation error from the real bound, typically on the order of $10^{-3}$ bits per sample (bps).

Chapter 6 uses the previously established bounds to assess the rate-loss of WZ coding when compared to predictive coding. A first step is to identify the input distributions leading to the highest encoding rates. Subsequently, the variation of the rate-loss is described, underlining the extreme values, i.e., maximum rate-loss and no rate-loss. The maximum rate needed to encode, as well as the maximum rate-loss of Wyner-Ziv coding relative to predictive coding, correspond to uniform sources and symmetric correlations. Importantly, we show that the upper bound on the rate-loss established in the literature is not tight and the maximum loss is actually significantly lower, i.e., 0.076 bps vs 0.22 bps. Finally, we prove that the only binary correlation channel that incurs no rate-loss for Wyner-Ziv coding compared to predictive coding is the Z-channel. The no-rate-loss property of binary WZ coding under Z-channel correlations is a surprising result and also fundamental from the research perspective, as it is only the third known instance of a WZ system exhibiting this property.

Chapter 7 draws the conclusions of this work and enumerates possible research directions that would complement and extend the present manuscript.

Summing up, the thesis proposes an in-depth theoretical analysis of the problem of lossy compression of binary sources in the presence of correlated side information. Two cases are considered: side information available to both encoder and decoder (predictive coding), or side information available only to the decoder (WZ coding). The derivation of the corresponding rate-distortion bounds and the description of the resulting WZ-incurred rate-loss resulted in the publication of two ISI-ranked journal papers and four peer-reviewed conference papers.


Samenvatting (Dutch Summary)

The goal of an information transmission system is to convey a message from a source to a destination. The idea that information can be systematically represented in binary form, i.e., using 0s and 1s, led to the mathematical description of the characteristics of such transmission systems. The design of these systems entails finding a suitable balance between the transmission cost and the quality of the received information. Point-to-point communications assume a scenario in which one sender encodes the message to be transmitted and one receiver decodes the message. In practical applications, however, it is often the case that multiple sources transmit their information to a single destination.

Slepian-Wolf (SW) coding deals with the separate lossless compression of correlated sources with joint decoding. This technique forms the basis of distributed source coding (DSC) and can be used to exploit the correlation that exists between correlated quantized information sources in lossy DSC problems such as Wyner-Ziv (WZ) coding and multiterminal (MT) coding. The SW theory states that the performance of a system performing separate encoding and joint decoding of two correlated sources is exactly the same as in the scenario where the sources are jointly encoded and decoded. A practical development of these information-theoretic laws is the domain of distributed video coding (DVC). In this family of video coding techniques, the computational complexity is moved from the encoder to the decoder. This recent research domain is popular in a broad spectrum of applications such as video surveillance, real-time streaming from multiple cameras, and immersive communication in general.

The ultimate goal of this dissertation is to provide a complete analysis of the binary Wyner-Ziv coding problem. The introductory chapters of this manuscript situate the research topic within the general spectrum of information theory and introduce the fundamental concepts needed to understand the main contributions. We start by making a clear distinction between point-to-point and multiterminal coding scenarios. Existing theoretical analyses of generic SW and WZ systems are described in detail in Chapter 2. Following the path of our research back to its initial starting point, i.e., asymmetric correlation models in DVC, Chapter 3 explores the basics of predictive video coding in comparison with the classical DVC architectures. We discuss the essential differences between the two paradigms, namely the presence/absence of side information at the encoder, and justify the importance of our derivations. The next three chapters are devoted to a detailed exposition of the contributions of this thesis. Chapter 4 describes the analytical derivation of the rate-distortion function for binary source coding with encoder-decoder side information. This corresponds to predictive coding, and the derivation is of great importance since it forms the absolute lower bound for WZ coding. Chapter 5 sets out the actual derivation of the proposed rate-distortion bound for the binary WZ coding scenario. The input of the problem assumes a generic binary information source, which may be non-uniform; the side information is known to come from a binary correlation channel, the binary asymmetric channel (BAC); and the distortion metric is the Hamming distance. It is shown that no analytical solution exists for this problem, and consequently the proposed solution is a numerical one. We describe the strategies for reaching all rate-distortion points on the proposed bound by identifying the correct minimizing parameters and reconstruction functions. By comparing our bound with the output of the Blahut-Arimoto algorithm, we conjecture that our bound is tight. Moreover, to mitigate the numerical nature of the solution, we propose an analytical approximation that can be used with a negligible estimation error from the true bound, on the order of $10^{-3}$ bits per sample (bps).

Chapter 6 uses the previously established bounds to assess the rate loss of WZ coding compared with predictive coding. A first step is to identify the input distributions that lead to the highest coding rates. Subsequently, the variation of the rate loss is described in the extreme cases, i.e., maximum rate loss and no rate loss. Both the maximum coding rate and the maximum rate loss of WZ coding relative to predictive coding arise for uniform sources and symmetric correlations. As an important contribution, we show that the upper bound on the rate loss described in the literature is not tight and that the maximum loss is actually significantly lower, i.e., 0.076 bps vs 0.22 bps. Finally, we prove that the only binary correlation channel that incurs no rate loss for WZ coding compared with predictive coding is the Z-channel. The no-rate-loss property of binary WZ coding under Z-channel correlation is a surprising result and also a fundamental one from a research perspective, since it is only the third known instance of a WZ system exhibiting this property.

Chapter 7 draws conclusions from this work and lists possible research directions that could complement and extend the present manuscript.

In summary, this thesis proposes a detailed theoretical analysis of the problem of lossy compression of binary sources in the presence of correlated side information. Two cases are considered: side information available to both the encoder and the decoder (predictive coding), or side information available only to the decoder (WZ coding). The derivation of the corresponding rate-distortion bounds and the description of the incurred WZ rate loss have led to two ISI-ranked journal papers and four peer-reviewed conference papers.


Acronyms

AEP Asymptotic Equipartition Property

AVC Advanced Video Coding

BAC Binary Asymmetric Channel

BEC Binary Erasure Channel

BSC Binary Symmetric Channel

bps bits per symbol

DMC Discrete Memoryless Channel

DMS Discrete Memoryless Source

DSC Distributed Source Coding

DVC Distributed Video Coding

GOP Group of Pictures

HEVC High Efficiency Video Coding

i.i.d. independent, identically distributed

LDGM Low-Density Generator Matrix

LDPC Low-Density Parity-Check

MCI Motion Compensated Interpolation

RD Rate-Distortion

SW Slepian-Wolf

WZ Wyner-Ziv


Chapter 1

Introduction

1.1 Introduction

The fact that any source of information, be it text, sound, image or video, can be consistently represented in a binary form, i.e., as a string of 0's and 1's, was only one of Claude Shannon's ground-breaking ideas. This unified way of representing information, which is in fact data agnostic, allowed the formulation of a general theory of communications. Shannon's 1948 paper [80] introduced a universal formal model for a communication system (see Fig. 1.1) and addressed two fundamental problems that follow from it: what are the limits of data representation and of data transmission. Information theory as a distinct research field was born.

The applications of information theory principles in everyday life are countless. Advances in hardware technologies allowed extremely complex schemes to be implemented on a single chip at affordable costs. As such, computers, networks, satellites, wireless communications, optical communications, images, videos, data storage, the Internet, and many others, all of which rely on data coding and data transmission, are not only ubiquitous, but also reshape human day-to-day life.

In spite of all the decades of research, the information theory field still poses open questions and is an active academic domain. The topic of this thesis is rooted in an information theory problem, but stems from a practical video coding application.

Figure 1.1: Shannon’s “Schematic diagram of a general communication system”.


Figure 1.2: Point-to-point communication system model.

In what follows we introduce the general communication notions required to position our work within the information-theoretical spectrum, and motivate our research by invoking the need for a theoretical performance bound in a video coding scenario.

1.2 Point-to-point Communications

The goal of a transmission system is to provide an optimal trade-off between the cost of transmission and the distortion accepted for the reconstructed source. Shannon [80, 81] set the theoretical grounds for point-to-point communication, by introducing a simplified model for a communication system and characterizing compression and transmission rate bounds. The architecture of the system is presented in Fig. 1.2: a source $S$ communicates $k$-symbol sequences $S^k$ to a receiver over a channel which is considered to introduce noise. To this end, in the generic block-coding scheme introduced by Shannon, the source sequence is mapped to an $n$-symbol encoded sequence $X^n = \mathrm{Encode}(S^k)$ and transmitted over the noisy channel. The sequence $Y^n$ received by the decoder is mapped to a reconstructed source $\hat{S}^k = \mathrm{Decode}(Y^n)$. The analysis of this system assumes discrete memoryless models for both the source and the noisy channel.

This model of point-to-point communications allowed Shannon to formulate four fundamental theorems which stand at the base of modern-day communication systems. The theorems are presented in what follows.

1.2.1 Channel Coding Theorem

Assume a discrete memoryless channel with input $X$ and output $Y$, and the conditional probability $p(y|x)$ characterizing the transmission noise, i.e., the probability of receiving symbol $y$ when $x$ was transmitted. The decoder must find an approximation $\hat{S}$ of the source $S$ such that the error probability $\Pr(\hat{S} \neq S)$ is less than an imposed value $P_e$. The parameters that allow one to control the performance of the system are the number of source bits $k$, the code word length $n$, and the probability of error $P_e$.

In general, the problem is posed as finding a trade-off between the above-mentioned parameters, and is not tractable; however, Shannon reformulated it by defining the channel capacity $C$ to be the maximum communication rate $R = k/n$ in bits per channel use such that the probability of error $P_e$ can be made arbitrarily small as the code word length $n$ increases. In other words, Shannon's theorem states that it is possible to communicate discrete data through any noisy channel, nearly error-free, up to a computable maximum rate – equal to the channel capacity.

The formal definition of the channel capacity $C$ is given in terms of the mutual information $I(X; Y)$ (see Section 2.2 for the definition of the mutual information) between the input $X$ and the output $Y$ of the channel [12, 25]:

$$C = \max_{p(x)} I(X; Y) \quad \text{bits/transmission}$$
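To make the capacity formula concrete, here is a minimal numerical sketch (added for illustration; the helper names are mine, and the closed form $C = 1 - h(p)$ for the binary symmetric channel of Section 2.3, attained by a uniform input distribution, is the standard result):

```python
import math

def h(p: float) -> float:
    """Binary entropy function in bits; h(0) = h(1) = 0 by convention."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p: float) -> float:
    """Capacity of a binary symmetric channel with crossover probability p.

    The maximizing input distribution is uniform, giving C = 1 - h(p).
    """
    return 1.0 - h(p)

# A crossover probability of 0.11 leaves roughly half a bit per channel use.
print(f"C(BSC, p=0.11) = {bsc_capacity(0.11):.3f} bits/transmission")
```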

1.2.2 Lossless Source Coding Theorem

Consider a sender who wants to communicate data from a source in a lossless manner to a receiver, over a noiseless binary channel, by using the minimum possible number of bits. The source $S$ is discrete and memoryless, so the source samples are independent and identically distributed (i.i.d.). The sender will encode $S^k$ at a rate of $R = k/n$ bits per source symbol into an $n$-bit codeword $X^n$ that will be sent over the noiseless channel. The decoder must find the source reconstruction $\hat{S}^k = S^k$ based on the received $Y^n = X^n$. Shannon formulated this problem as finding the minimum rate $R_{\min}$ that allows lossless reconstruction for an arbitrarily large codeword size $n$, and proved that this minimum rate is given by the entropy (see Section 2.2 for the definition of the entropy) of the source $S$ [12, 25]:

$$R_{\min} = H(S) \quad \text{bits/symbol}$$
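As a quick worked instance (the numerical values are chosen here for illustration, not taken from the thesis): a Bernoulli source with $p(S = 1) = 0.1$ has $H(S) = -0.1 \log_2 0.1 - 0.9 \log_2 0.9 \approx 0.469$ bits/symbol, so roughly 469 bits suffice, asymptotically, to losslessly represent 1000 source symbols.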

1.2.3 Lossy Source Coding Theorem

Let us assume a per-symbol distortion measure $d(s, \hat{s})$ between the original source symbol $s$ and the reconstructed source symbol $\hat{s}$. Now, assume that the source $S$ has to be sent over a noiseless binary channel such that the receiver should estimate it within an imposed distortion bound given by the above-introduced distortion measure, instead of losslessly reconstructing the original source. The optimal trade-off between the rate $R = k/n$ and the imposed distortion $D = (1/k)\sum_{i=1}^{k} E[d(S_i, \hat{S}_i)]$ is given by the rate-distortion function, which is defined in terms of the mutual information $I(S; \hat{S})$ between source $S$ and reconstruction $\hat{S}$ as [12, 25]:

$$R(D) = \min_{p(\hat{s}|s)\,:\,E[d(S,\hat{S})] \le D} I(S; \hat{S}) \quad \text{bits/symbol}$$
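A classical closed-form instance of this definition, quoted here for illustration (it is the standard textbook result for a binary source under Hamming distortion, not a derivation from this thesis): for a Bernoulli($p$) source,

$$R(D) = \begin{cases} h(p) - h(D), & 0 \le D \le \min(p, 1-p) \\ 0, & D > \min(p, 1-p) \end{cases}$$

where $h(\cdot)$ denotes the binary entropy function; at $D = 0$ it recovers the lossless rate $H(S) = h(p)$.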


1.2.4 Source-Channel Separation Theorem

Consider again the general model of a point-to-point communication system presented in Fig. 1.2. Let $C$ be the capacity of the Discrete Memoryless Channel (DMC) and let $R(D)$ be the rate-distortion function associated with the Discrete Memoryless Source (DMS) $S$. What are the conditions for transmitting the DMS over the DMC within an imposed distortion bound?

Assume for simplicity that $k = n$. Shannon [81] showed that the necessary and sufficient condition for transmission within the imposed distortion bound is that $R(D) < C$. As such, performing separate source coding and channel coding is a valid strategy for achieving optimal transmission.

Fundamentally, Shannon's separation theorem states that the optimal trade-off can be achieved by splitting the coding procedure in two distinct parts: first, the source is compressed to remove redundancy, while satisfying an imposed fidelity constraint; then, the coded version of the source is transmitted across the communication channel by using codes with asymptotically vanishing error probabilities. For the compression of the source, no a priori knowledge about the channel is required; conversely, the properties of the source are not relevant for the channel code design.

1.2.5 Optimality of Transmission

Shannon's theorems guarantee that optimal transmission is possible as long as the codeword length can be arbitrarily large. In practice, however, no channel code can be expected to have absolutely no errors as long as the codeword length is limited. To circumvent this, it is possible to design channel codes that take into account the source characteristics, and, reciprocally, source codes that account for the statistics of the channel. Those constructions are referred to as joint source-channel codes. It is not within the scope of this presentation to go into any details about joint source-channel constructions, but they can be equally optimal, as the source-channel separation theorem does not claim uniqueness.

As a matter of fact, uncoded transmission can turn out to be optimal for particular examples such as the doubly binary symmetric case [44]. In general, if the source distribution $p(x)$ is such that it attains the maximum mutual information, i.e., the channel capacity, and, at the same time, the channel transition probability $p(\hat{s}|s)$ attains the rate-distortion function, then $R(D) = C$ and transmission is optimal [45].

1.3 Distributed Compression

In practical applications it is often the case that multiple sources send information to a unique decoder. Consider as an example a network of temperature sensors deployed over a small geographical area. The sensors capture the temperature at their respective positions and send the values to a common base station. Each sensor can perform compression individually, but this might not be optimal in terms of overall data expenditure. Since the sensors are located close to one another, the temperature values are correlated, and point-to-point compression can be replaced with a distributed coding scheme.

Figure 1.3: Distributed compression system.

1.3.1 Distributed Lossless Compression

In order to illustrate this paradigm, consider the distributed lossless compression system presented in Fig. 1.3. Sources $X_1$ and $X_2$ are correlated discrete memoryless sources distributed according to the joint probability $(X_1, X_2) \sim p(x_1, x_2)$. The sequences $X_1^n$ and $X_2^n$ are compressed separately into an $nR_1$-bit codeword $U_1$ and an $nR_2$-bit codeword $U_2$, respectively. A common receiver must recover the source sequences from the codeword pair $(U_1, U_2)$. The obvious question that can be posed regarding this setup is to determine the minimum sum of the two rates, $R_{sum} = R_1 + R_2$, such that both sources can be recovered losslessly.

If we consider that each source uses a point-to-point coding scheme, then we already know from Shannon's lossless source coding theorem that $R_{min,1} = H(X_1)$ and $R_{min,2} = H(X_2)$. As such, the resulting sum-rate is $R_{sum} = H(X_1) + H(X_2)$. This scenario makes the implicit assumption that the sources are not correlated. However, if the sources are jointly encoded, the joint source $(X_1, X_2)$ can be encoded, by the lossless coding theorem, at a rate of $R^*_{sum} = H(X_1, X_2)$, which is always smaller than the sum of the individual entropies for correlated sources.

Slepian and Wolf [83] showed that this value $R^*_{sum} = H(X_1, X_2)$ is the minimum sum-rate for distributed compression, just as for centralized compression. In other words, this rate is sufficient in theory to encode two correlated sources that do not communicate with each other.
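A minimal numerical sketch of this saving (an editor-added illustration; the doubly symmetric binary source with crossover $q = 0.1$ is an assumed example, not a case worked out at this point in the thesis):

```python
import math

def h(p: float) -> float:
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Doubly symmetric binary source: X1 ~ Bernoulli(0.5), X2 = X1 xor N, N ~ Bernoulli(q).
q = 0.1
joint = {(x1, x1 ^ n): 0.5 * (q if n else 1 - q) for x1 in (0, 1) for n in (0, 1)}

H_joint = -sum(p * math.log2(p) for p in joint.values())   # H(X1, X2) = 1 + h(q)
H_separate = 2.0                                           # H(X1) + H(X2); both marginals are uniform

print(f"separate encoding: {H_separate:.3f} bits per sample pair")
print(f"Slepian-Wolf sum-rate: {H_joint:.3f} = 1 + h(q) = {1 + h(q):.3f}")
```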

1.3.2 Distributed Lossy Compression

Figure 1.4: Distributed lossy compression system.

We now consider the generic distributed lossy compression system presented in Fig. 1.4. As before, $X_1$ and $X_2$ are a pair of correlated DMSs. Let $d_i(x_i, \hat{x}_i)$, $i \in \{1, 2\}$, be two distortion measures. Both sources have to be transmitted over a noiseless channel to a receiver, and the reconstructions have to be within some given distortion bounds $D_i$, $i \in \{1, 2\}$. In this scenario, the sum-rate as a function of the two distortions is delimited by two theoretical bounds, known as the Berger-Tung inner and outer bounds [13, 14, 87]. The formulation of these bounds is a complex problem and is not within the scope of this thesis. Nevertheless, this generic problem includes two particular instantiations which represent the theoretical foundation of the work presented in the thesis.

The first particular case is the distributed lossless case we have mentioned before, i.e., the Slepian-Wolf case. The second particular case is known in the literature as the Wyner-Ziv [103] coding scheme. When there is no rate limitation for the second source $X_2$, it is readily available at the decoder; note that the second distortion constraint is irrelevant, since for rates above $H(X_2)$, the source $X_2$ can be perfectly reconstructed at the decoder. As seen in Fig. 1.5, in Wyner-Ziv coding the source $X$ must be reconstructed within a distortion bound in the presence (only at the decoder) of a correlated source $Y$, named side information.

1.4 Motivation

1.4.1 Asymmetric vs Symmetric Correlation Models in Distributed Video Coding

A well-known application of the principles of WZ coding is distributed video coding (DVC) [2, 8, 35, 36, 46, 65, 69, 89, 91]. By exploiting the source statistics at the decoder, the encoder can be designed to be very simple, as all the computational burden is shifted to the decoder. As such, DVC finds applications in energy- and computation-constrained visual sensors, e.g., wireless capsule endoscopy [34, 35], or 1K-pixel visual sensors [93].

Figure 1.5: Wyner-Ziv compression system.

Distributed Video Coding – the Basics

For a distributed video coding system [2, 8, 35, 36, 46, 89, 91], considering a single-view scenario, the video sequence is split into two correlated streams. The first one is conventionally encoded using an intra codec such as H.264/AVC [100] and used at the decoder to produce side information by means of motion-compensated prediction. The second stream is encoded using WZ principles [8], and decoded with the help of the side information. However, in practical DVC systems, the encoder does not have exact knowledge of the side information. The dependency between WZ frames and the side information is usually modeled as an additive noise channel [20]. Since the correlation statistics determine the compression performance of the codec, it is desired to have an accurate modeling of the correlation channel.

Correlation Channel Models – the Basics

In Distributed Source Coding [23, 101], the correlation is often modeled as an additive noise channel, $X = Y + N$, where the side information $Y$ is seen as the input, the source $X$ is the output, and the noise $N$ is independent of the input (i.e., independent of the side information). In DVC, most modeling approaches [20, 40, 46, 53, 99] modeled the correlation noise as independent, having a Laplacian distribution with mean $\mu$ and standard deviation $\sigma$, i.e., $N \sim \mathcal{L}(\mu, \sigma)$. The work in [28, 30, 32] proposed a different correlation channel modelling concept, where the noise is considered to have a Laplacian distribution with a standard deviation that depends on the realizations of the side information: $N \sim \mathcal{L}(\mu(y), \sigma(y))$. The authors of [28, 30, 32] coined the terms Side-Information-Independent (SII) and Side-Information-Dependent (SID) for the former and latter correlation channel models, respectively.

The SID model was shown to outperform the SII in DVC, where the underlying problem is estimating the variance of the Laplacian noise [28, 32]: for the same distortion level, it is always possible to find an SID model which will describe the correlation channel at a lower rate than the SII model. Unlike the SII model, for the SID model the noise variance depends on the realisations of the side information. As stated in [28], following the terminology of channel symmetry in [25], the SII model is seen as a K-ary-input continuous-output symmetric Laplacian channel. Conversely, the SID model is equivalent to a K-ary-input continuous-output asymmetric Laplacian channel. It was shown in [28] that if both the source and the SI have a binary alphabet, the correlation models reduce to a Binary Symmetric Channel (BSC) in the SII case and a Binary Asymmetric Channel (BAC) in the SID case.

Motivation and Expected Outcome

Our research started out motivated by systematic rate-distortion gains observed when applying asymmetric correlation models in Wyner-Ziv video coding. The observations needed a theoretical corroboration, and the basic problem of binary Wyner-Ziv coding turned out to have a known solution only for the symmetric correlation case, while the case of asymmetric correlation was still an open problem.

The goal of our analysis is therefore to provide the mathematical means of characterizing the fundamental limits of the SID correlation, i.e., asymmetric correlation in the binary case, and compare them to the SII correlation, i.e., symmetric correlation in the binary case. We will study the influence of the asymmetry of the correlation channel on the rate-distortion performance of a binary Wyner-Ziv coding system and, by generalization, provide an explanation as to why asymmetric correlation models outperform symmetric correlation models in practical Wyner-Ziv video coding applications.

1.4.2 Rate-Distortion Function for Binary Wyner-Ziv Coding

Berger gave in [12] the general expression for the predictive rate-distortion function for binary coding with side information available both at the encoder and at the decoder. For the case where the side information is only available at the decoder, Wyner and Ziv proved in [103] the rate-distortion bound under the assumption that the source is binary uniform (that is, $p(X = 0) = p(X = 1) = 0.5$) and the correlation is given by a BSC [103].
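For orientation, these two classical results can be stated compactly; the formulas below are the standard textbook forms for a uniform binary source with BSC($p$) correlation and Hamming distortion, quoted here as a reference point rather than reproduced from this thesis ($h(\cdot)$ is the binary entropy function and $p \star D = p(1-D) + (1-p)D$):

$$R_{X|Y}(D) = h(p) - h(D), \qquad 0 \le D \le p,$$
$$R_{WZ}(D) = \operatorname{l.c.e.}\bigl\{\, h(p \star D) - h(D),\ (D, R) = (p, 0) \,\bigr\},$$

where l.c.e. denotes the lower convex envelope, taken over $0 \le D \le p$.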

The characterization of the rate-distortion function for both predictive and WZ coding under the generic non-uniform binary source and binary asymmetric correlation channel setup has not yet been derived, and it is known that the latter might not admit an analytical solution [25, 45]. Only a numerical computation of the rate-distortion function using the Blahut-Arimoto algorithm [16] has been provided in [22].

Motivation and Expected Outcome

A mathematically founded derivation of the rate-distortion bound is important in order to gain a comprehensive insight into the behavior of the rate-distortion function when the statistics of the source or the correlation channel vary. Having a rate-distortion characterization of binary Wyner-Ziv systems is a necessary step in producing a performance comparison between the asymmetric and the symmetric correlations.

The goal of our analysis is to provide a full description of the rate-distortion function in the binary Wyner-Ziv coding system, for any given set of input parameters, i.e., any source and any correlation channel. This is a significant problem on its own, as seen from the theory of source coding. But its importance goes beyond the source-coding aspect alone. Solving this enables us to provide a comparative assessment of the compression performance between predictive and Wyner-Ziv coding in terms of rate loss. Moreover, it allows establishing a reference for the evaluation of source and channel codes used in practical schemes – given the theoretical bound, how close can a practical construction approach it?

1.5 Outline and Major Contributions

This thesis addresses the problem of binary source coding with side information. As a preamble, we have considered two particular cases for which the rate-distortion characteristics were derived: the uniform source case, analysed extensively in our papers [75, 77, 78], and the Z-channel correlation case, whose no-rate-loss property was presented in [33]. However, the ultimate goal of this dissertation is to provide a complete analysis of the binary Wyner-Ziv coding problem: the generic non-uniform source and asymmetric correlation initial setup. The aforementioned analysis has been presented in [76].

The remainder of the manuscript is organised as described below.

Chapter 2 is an introduction to the problem of lossy compression; it introduces the information-theory-related concepts that lead to the formulation of the main researched topic, namely the binary Wyner-Ziv coding problem. The reader will be acquainted with basic notions such as entropy and information, sources and channels, source coding and channel coding, as well as distributed source coding.

Chapter 3 follows the path of the research topic proposed by our work, i.e., binary source coding with side information, all the way back to its initial triggering framework: Distributed Video Coding (DVC) and asymmetric correlation channel models. As such, it explores the basics of classical video coding, as well as of the classical DVC architectures. This allows us to underline the fundamental difference between the two paradigms, namely, the presence/absence of side information at the encoder, and reveal the premises of our subsequent derivations.

Chapter 4 presents the first contribution of this dissertation, namely the analytical derivation of the rate-distortion function for binary source coding with encoder-decoder side information. This is the predictive coding scenario, in which the encoder has access to the same information as the decoder. It is of high importance, as it represents an absolute lower bound for the Wyner-Ziv coding problem.

Chapter 5 focuses on the derivation of the proposed rate-distortion bound for the binary Wyner-Ziv coding problem. By considering the all-binary case, we express our rate-distortion bound as the solution of a constrained minimization process. However, since this is proven not to admit an analytical solution, we propose a numerical algorithm to describe the bound and, by comparing it to the result of the Blahut-Arimoto algorithm, we conjecture our bound to be tight, i.e., the actual rate-distortion function. Moreover, we also propose an analytical approximation of the solution which has a negligible estimation error when compared to the result of the Blahut-Arimoto algorithm.

Chapter 6 uses the results from the previous two chapters in order to assess the rate-loss of Wyner-Ziv coding when compared to predictive coding, in the generic binary source and binary asymmetric correlation setup. A remarkable finding is that this rate-loss vanishes in the case of the Z-channel correlation, and this is one of the only three known cases for which Wyner-Ziv coding suffers no rate-loss when compared to predictive coding.

Chapter 7 draws the conclusions of this work and enumerates possible research directions that would complement and extend the present manuscript.


Chapter 2

Rate Distortion Theory

2.1 Preliminaries

The main goal of this chapter is to gradually introduce the basic information theory concepts, terminology and notations which will be used in the remainder of the thesis. It is organised as a compendium with a formal mathematical approach, but at the same time it illustrates rather abstract concepts, such as entropy or channel capacity, with relevant examples. It introduces the rate-distortion function of an information source with respect to a distortion measure, and its significance in the context of Distributed Source Coding (DSC), with an emphasis on Wyner-Ziv [103] coding.

The structure of the chapter is as follows: Sections 2.2 through 2.4 investigate the rate-distortion function, while Section 2.5 presents a numerical algorithm which can compute the rate-distortion function, i.e., the Blahut-Arimoto [7, 16] algorithm. Sections 2.6 and 2.7 deal with the DSC problem, and elaborate on the fundamental theorems of Slepian-Wolf [83] and Wyner-Ziv [103]. Section 2.8 describes the rate performance expectations in lossy source coding with side information. Sections 2.9 and 2.10 address the achievability of the Slepian-Wolf and Wyner-Ziv theoretical bounds in practice, while Section 2.11 draws the conclusions of this chapter.

2.2 Entropy and Information

We consider a finite set $\mathcal{X} = \{x_1, x_2, \ldots, x_M\}$.

Definition. $\mathcal{X}$ is called an $M$-letter alphabet, and $x_j$ are letters of the alphabet, $j \in \{1, \ldots, M\}$.

Definition. A probability distribution is a function $p(\cdot)$ mapping $\mathcal{X}$ to the interval $[0, 1]$ such that $\sum_{j=1}^{M} p(x_j) = 1$.


Definition. Any function defined on the ensemble $(\mathcal{X}, p)$ is called a discrete random variable.

Unless specified otherwise, we will only consider discrete random variables.

Definition. If the codomain of the random variable is the set of real numbers $\mathbb{R}$, then this is called a real random variable.

We will consider the real random variable $X$ to take value $x_j$ with probability $p(x_j)$.

Definition. The expected value of a real random variable is defined as:

$$E[X] = \sum_{j=1}^{M} x_j\, p(x_j). \qquad (2.1)$$

An essential quantity in information theory is the self-information, which is defined as follows:

Definition. The self-information of a random variable $X$ is $i(x_j) = -\log p(x_j)$.

This is a measure of the information one receives when observing that the random variable $X$ was equal to $x_j$.

All the logarithms will be considered to the base 2, unless otherwise specified. The corresponding unit of information is called a bit.

Definition. The entropy of a real random variable is the expected value of the self-information, and is denoted as follows:

$$H(X) = -\sum_{j=1}^{M} p(x_j) \log p(x_j). \qquad (2.2)$$

As an extension to the notion of self-information, the entropy measures how much information one receives on average when observing that the random variable $X$ was equal to $x_j$. The generally accepted convention is that $0 \cdot \log 0 = 0$, which results naturally from the limit of the function $x \log x$ when $x \to 0$. The entropy is a non-negative function, $H(X) \ge 0$, which results directly by observing that $p(x_j) \in [0, 1]$ implies $-\log p(x_j) \ge 0$. Fig. 2.1 shows the basic case of the binary entropy function, corresponding to a binary random variable $X$ which takes value 1 with probability $p$ and 0 with probability $1 - p$. The function is concave, with a maximum value of 1 bit when $p = \frac{1}{2}$, corresponding to maximum uncertainty, and a minimum of 0 when $p = 0$ or $p = 1$, i.e., when there is no uncertainty.

Figure 2.1: The binary entropy function.

Consider $X$ and $Y$ two random variables with alphabets $\mathcal{X} = \{x_1, x_2, \ldots, x_M\}$ and $\mathcal{Y} = \{y_1, y_2, \ldots, y_N\}$. The definition of the entropy can be extended to a pair of random variables with a joint distribution $p(x_j, y_k)$ as follows:

Definition. The joint entropy of two random variables is:

$$H(X, Y) = -\sum_{j,k} p(x_j, y_k) \log p(x_j, y_k). \qquad (2.3)$$

It measures the average information one receives when being told that $X = x_j$ and $Y = y_k$.

Definition. The conditional self-information $i(x_j|y_k) = -\log p(x_j|y_k)$ measures the information received by being told $X = x_j$ when already knowing that $Y = y_k$.

Similarly, $i(y_k|x_j) = -\log p(y_k|x_j)$ measures the information received by being told $Y = y_k$ when already knowing that $X = x_j$. The conditional self-information is not symmetric: $i(x_j|y_k) \neq i(y_k|x_j)$.

By averaging the conditional self-information we obtain the conditional entropy:

Definition. The conditional entropy is defined to be:

$$H(X|Y) = -\sum_{j,k} p(x_j, y_k) \log p(x_j|y_k) \qquad (2.4)$$

and

$$H(Y|X) = -\sum_{j,k} p(x_j, y_k) \log p(y_k|x_j) \qquad (2.5)$$

A fundamental result in information theory is the following:

Theorem 1. Conditioning reduces entropy:

$$H(X|Y) \le H(X). \qquad (2.6)$$


Definition. The reduction in the uncertainty of random variable $X$ provided by the knowledge of $Y$ is called mutual information, and is equal to the reduction in the uncertainty of random variable $Y$ provided by the knowledge of $X$.

The mutual information I(X; Y ) can be written as:

I(X; Y ) = H(X) − H(X|Y ) = H(Y ) − H(Y |X) (2.7)

Theorem 2. Some basic properties of the mutual information are:

I(X; Y ) = I(Y ; X) (2.8)

I(X; Y ) ≥ 0 (2.9)

I(X; X) = H(X) (2.10)

2.3 Sources and Channels

Let t = (t1, t2, ..., tn) be a vector of discrete time moments and X a random variabletaking values in X = {x1, x2, ..., xN}. Then Xt = (Xt1 , Xt2 , ..., Xtn

) denotes thevalues source X takes at time indices t.

Definition. The random sequence {Xt} is called a discrete information source, andX is called the source alphabet.

Let x = (x1, x2, ..., xn), with xi ∈ X , i = 1, ..., n.

Definition. Pt(x) is defined to be the probability that Xt = x.

Let t + ΔT = (t1 + ΔT, t2 + ΔT, ..., tn + ΔT ) be a new time vector.

Definition. If Pt+ΔT (x) = Pt(x) for all integer values ΔT , then the source is saidto be stationary.

The simplest type of information source is the Discrete Memoryless Source(DMS).

Definition. A DMS is a stationary source that satisfies the additional constraintthat P (x) =

∏nt=1 P (xt), for all sources x.

In other words, the symbols xi generated by the DMS are independent andidentically distributed (i.i.d.).

We have seen that the source information is encoded and transmitted to the enduser through a channel, as in Fig 1.2. We are interested in a mathematical model ofa channel, which usually consists of a transition probability that specifies an outputdistribution for each possible input sequence. In information theory, it is commonto consider memoryless channels in which the output probability distribution onlydepends on the current channel input.

14

Page 37: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

2. Rate Distortion Theory

Figure 2.2: The binary symmetric channel.

Definition. A discrete memoryless channel (DMC) has discrete input X ={x1, ..., xM} and output Y = {y1, ..., yN} alphabets and is defined by specifying forevery pair (xj , yk) the conditional probability p(yk|xj) that the symbol yk is the outputof the channel when xj was the input. The channel is memoryless if those conditionalprobabilities are independent.

The transition matrix that characterizes the channel is a M × N matrix withthe conditional probability p(yk|xj) as an entry on the jth row and the kth column,1 ≤ j ≤ M and 1 ≤ k ≤ N .

The simplest examples are in the binary case, where:

• a Bernoulli(π) source outputs 1 with probability π and 0 with probability1 − π at all times;

• a binary symmetric channel Binary Symmetric Channel (BSC) with crossoverprobability p is represented in Fig. 2.2 and has p(Y = 1|X = 0) = p(Y =0|X = 1) = p, as summarized by the transition matrix:

p(Y |X) =

[(1 − p) p

p (1 − p)

]

. (2.11)

For a stationary source, we denote the probability that a random vector X =(Xt+1, Xt+2, ..., Xt+n) is equal to a particular source word x = (x1, x2, ..., xn) byp(X = x) = p(x). The entropy of the source is H(X) = −

∑x p(x) log p(x). It is of

interest to know how the source entropy varies as the length of the source wordsincreases. This quantity is given by the entropy rate of the source, which is:

Definition. The entropy rate of a source is

H = limn→∞

n−1H(X) = limn→∞

n−1H(X1, ..., Xn) (2.12)

15

Page 38: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 2

Theorem 3. Shannon’s Lossless Source Coding Theorem

A rate equal to the entropy rate H of a stationary source X is sufficient onthe average to describe the source. H(X) is the optimal compression rate for thesource [12,25]:

R∗ = H(X)

Definition. The maximum rate at which information can be transmitted over achannel is called channel capacity, and is defined to be

C = maxp(x)

I(X; Y ). (2.13)

Theorem 4. Source-Channel Coding Theorem

Given a Discrete Memoryless Channel (DMC) with capacity C and a DMS withentropy rate H, then the output of the source can be encoded and transmitted overthe channel with arbitrarily low probability of errors if H ≤ C [12,25].

In what follows we present two examples of channels and their correspondinginformation capacities.

Example 2.3.1 (Binary symmetric channel) Consider the BSC presentedin Fig. 2.2, with crossover probability p ≤ 0.5. The channel has binary input X

and output Y , and an input symbol may be flipped with probability p. This isequivalent to considering that the BSC introduces an additive noise modeled by abinary random variable Z ∼ Bernoulli(p), considered independent from the inputX, i.e., Y = X ⊕ Z. The capacity is:

C = maxp(x)

I(X; Y )

= maxp(x)

(H(Y ) − H(Y |X))

= maxp(x)

(H(Y ) − H(X ⊕ Z|X))

= maxp(x)

(H(Y ) − H(Z|X))

(a)= max

p(x)H(Y ) − H(Z)

= 1 − H(p),

where (a) follows from the independence of the source X and noise Z. Thecapacity is reached for a symmetric source, X ∼ Bernoulli(0.5).

Example 2.3.2 (Binary erasure channel) Consider the Binary ErasureChannel (BEC) with erasure probability p, presented in Fig. 2.3. The source X

is binary, and each of its symbols may be erased, i.e., mapped to a distinct symbol

16

Page 39: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

2. Rate Distortion Theory

Figure 2.3: The binary erasure channel.

ε, with a probability p. The capacity of this channel is:

C = maxp(x)

I(X; Y )

= maxp(x)

(H(X) − H(X|Y ))

(a)= max

p(x)(H(X) − pH(X))

= 1 − p,

where (a) follows since H(X|Y = y) = 0 if y ∈ {0, 1}, and H(X|Y = ε) = H(X)as Y = ε does not give any information about X. As in the case of the BSC, thecapacity is reached by a uniform source X ∼ Bernoulli(0.5).

2.4 The Rate-Distortion Function

Assume a source X produces a i.i.d. sequence {X1, X2, ..., Xn} according to theprobability p(x), with x ∈ X . The decoded source X outputs the sequence{X1, X2, ..., Xn} with values in X .

Definition. A distortion measure is a function d : X × X → R+; given x ∈ X andx ∈ X , d(x, x) represents the cost of decoding input symbol x to be the output symbolx.

Definition. The distortion between words x = (x1, x2, ..., xn) and x = (x1, x2, ..., xn)is given by:

d(x, x) =1n

n∑

i=1

d(xi, xi) (2.14)

An example of a common distortion function is the Hamming distortion, given

17

Page 40: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 2

by:

d(xi, xi) =

{0, if xi = xi

1, if x 6= xi

(2.15)

which is actually also equal to the probability of error, since

E[d(X, X)] = 0 ∙ pr(X = X) + 1 ∙ pr(X 6= X)

= pr(X 6= X)

Definition. The Rate-Distortion (RD) function is the effective rate R at whicha source can produce information such that the reconstructed output reproduces thesource with a total distortion smaller or equal to D.

Let p(x) be given. Then, the joint distribution p(x, x) = p(x)p(x|x) dependson the choice of the conditional probability p(x|x). Every such choice will have anassociated expected distortion:

dp(x|x) =∑

j,k

p(xj)p(xk|xj)d(xj , xk), (2.16)

where 1 ≤ j ≤ M , and 1 ≤ k ≤ N , with M and N being the cardinality of theinput and, respectively, the output alphabet. The choice of p(x|x) also determinesan average mutual information:

Ip(x|x)(X; X) =∑

j,k

p(xj)p(xk|xj) logp(xj)p(xk|xj)p(xj)p(xk)

=∑

j,k

p(xj)p(xk|xj) logp(xk|xj)p(xk)

(2.17)

For every fixed D, the rate distortion function can be expressed using (2.16) and(2.17) as:

R(D) = minp(x|x):dp(x|x)≤D

Ip(x|x)(X; X) (2.18)

Theorem 5. Shannon’s Lossy Source Coding Theorem The rate-distortionfunction for a DMS X with distortion measure d(x, x) is given by (2.18) [12,25].

In other words, the rate-distortion function is the minimum mutual informationI(X; X) between source and reconstruction, chosen over all possible choices of p(x|x)such that the distortion constraint is satisfied.

18

Page 41: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

2. Rate Distortion Theory

Properties of the Rate-Distortion Function

Theorem 6. The R(D) function is defined between the following limit values:

• If R = H(X) then Dmin = 0.

• If R = 0 then the least average distortion is Dmax = mink

∑j p(xj)d(xj , xk)

[12].

Proof. If R = H(X) then the source is fully described and lossless reconstruction ispossible. So X = X and d(X, X) = 0.

If R = 0 the sources are independent, and I(X, X) = 0, and p(xk|xj) = p(xk). Byreplacing in (2.16) and looking for the minimum average distortion, we can obtain itby setting p(xk) = 1 for the value of k that minimizes the quantity

∑j p(xj)d(xj , xk).

Theorem 7. R(D) is a convex function [12]

Proof. We have to show that ∀(D1, D2) distortion levels and λ ∈ [0, 1], the followinginequality holds:

R(λD1 + (1 − λ)D2) ≤ λR(D1) + (1 − λ)R(D2) (2.19)

Let p1(xk|xj) achieve the pair (D1, R(D1)) and p2(xk|xj) achieve the pair (D2, R(D2)).According to (2.16) we have:

λD1 + (1 − λ)D2 =∑

j,k

p(xj) ∙ (λp1(xk|xj) + (1 − λ)p2(xk|xj)) ∙ d(xj , xk)

=∑

j,k

p(xj) ∙ p∗ ∙ d(xj , xk),

where p∗ = λp1(xk|xj) + (1 − λ)p2(xk|xj).From (2.18) and (2.17) we can write the following:

R(λD1 + (1 − λ)D2) ≤∑

j,k

p(xj)p∗ log

p∗

p∗(xk), (2.20)

where p∗(xk) =∑

j p(xj)p∗.

19

Page 42: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 2

The right term in the inequality above becomes:

j,k

p(xj)p∗ log

p∗

p∗(xk)=∑

j,k

p(xj) ∙ (λp1(xk|xj) + (1 − λ)p2(xk|xj)) ∙ logp∗

p∗(xk)

= λ∑

j,k

p(xj)p1(xk|xj) logp∗

p∗(xk)+

(1 − λ)∑

j,k

p(xj)p2(xk|xj) logp∗

p∗(xk)

= λ∑

j,k

p(xj)p1(xk|xj) log

[p1(xk|xj)p∗1(xk)

p∗1(xk)p1(xk|xj)

p∗

p∗(xk)

]

+

(1 − λ)∑

j,k

p(xj)p2(xk|xj) log

[p2(xk|xj)p∗2(xk)

p∗2(xk)p2(xk|xj)

p∗

p∗(xk)

]

= λ∑

j,k

p(xj)p1(xk|xj)

[

logp1(xk|xj)p∗1(xk)

+ logp∗1(xk)

p1(xk|xj)p∗

p∗(xk)

]

+

(1 − λ)∑

j,k

p(xj)p2

[

logp2(xk|xj)p∗2(xk)

+ logp∗2(xk)

p2(xk|xj)p∗

p∗(xk)

]

,

where p∗1(xk) =∑

j p(xj)p1(xk|xj) and p∗2(xk) =∑

j p(xj)p2(xk|xj). Using theinequality log x ≤ x − 1, the identity in (2.17) and the fact that p1(xk|xj) andp2(xk|xj) achieve R(D1) and R(D2), respectively, we have:

j,k

p(xj)p∗ log

p∗

p∗(xk)≤ λ

[

R(D1) +∑

j,k

p(xj)p1(xk|xj)

(p∗1(xk)

p1(xk|xj)p∗

p∗(xk)− 1

)]

+

(1 − λ)

[

R(D2) +∑

j,k

p(xj)p2(xk|xj)

(p∗2(xk)

p2(xk|xj)p∗

p∗(xk)− 1

)]

= λ

[

R(D1) +∑

k

p∗1(xk)p∗(xk)

j

p(xj)p∗ −

j,k

p(xj)p1(xk|xj)

]

+

(1 − λ)

[

R(D2) +∑

k

p∗2(xk)p∗(xk)

j

p(xj)p∗ −

j,k

p(xj)p2(xk|xj)

]

= λ

[

R(D1) +∑

k

p∗1(xk) − 1

]

+ (1 − λ)

[

R(D2) +∑

k

p∗2(xk) − 1

]

= λR(D1) + (1 − λ)R(D2),

where the last equality holds because∑

k p∗1(xk) = 1 and∑

k p∗2(xk) = 1. Combiningthe above with (2.20) concludes the proof.

Summing up, we can state that R(D) is a convex, continuous, monotonic

20

Page 43: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

2. Rate Distortion Theory

decreasing function on the interval D ∈ [0, Dmax].

Computation of the Rate-Distortion Function

We only consider the case when the input alphabet matches the output alphabetin cardinality, i.e., |X | = |X |. As expressed in (2.18), finding the rate-distortionfunction is an optimization problem: minimize the rate given by (2.17) such that theexpected distortion satisfies (2.16). However, we must add the additional constraintthat:

k

p(xk|xj) = 1 (2.21)

We can therefore formulate a Lagrangian multiplier problem, and the Lagrangianfunction is expressed using (2.17), (2.16) and (2.21):

J(p(xk|xj)) =∑

j,k

p(xj)p(xk|xj) logp(xk|xj)p(xk)

− s∑

j,k

p(xj)p(xk|xj)d(xj , xk) −∑

j

μj

k

p(xk|xj),(2.22)

where μj and s are Lagrange multipliers. As will be proven next, the value of themultiplier s controls the slope of the rate-distortion function. The solution of theoptimization problem can be determined by solving dJ

dp(xk|xj)= 0. (2.22) can be

conveniently rewritten by setting log λj = μj

p(xj)as:

J(p(xk|xj)) =∑

j

p(xj)∑

k

p(xk|xj)

[

logp(xk|xj)λjp(xk)

− s ∙ d(xj , xk)

]

. (2.23)

Taking the derivative yields:

dJ

dp(xk|xj)= p(xj)

[

logp(xk|xj)λjp(xk)

− s ∙ d(xj , xk)

]

= 0, (2.24)

which, in turn, gives:p(xk|xj) = λjp(xk) ∙ 2s∙d(xj ,xk) (2.25)

By using (2.21), the summation over k of the above equation gives:

λj =

(∑

k

p(xk) ∙ 2s∙d(xj ,xk)

)−1

(2.26)

21

Page 44: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 2

As such, we can rewrite (2.25) as:

p(xk|xj) =p(xk) ∙ 2s∙d(xj ,xk)

∑i p(xi) ∙ 2s∙d(xj ,xi)

(2.27)

This expresses p(x|x) as a function of the marginal probability p(x). With 1 ≤k, j ≤ M , (2.27) gives M2 probability values, thereby fully describing p(x|x). Thepossibility exists that the resulting values might not be lying in the interval [0 , 1]. [12]proves that in this case the problem can be reduced to a solvable formulation bychoosing an appropriate subset of the output alphabet.

In order to find probabilities p(x) we need to process the equality in (2.27) bymultiplying both sides with p(xj), dividing both sides by P (xk) and summing overj. By using the fact that

∑x p(x|y) = 1 and the identity in (2.26), we obtain:

j

p(xk|xj)p(xj)p(xk)

=∑

j

p(xj) ∙ 2s∙d(xj ,xk)

∑i p(xi) ∙ 2s∙d(xj ,xi)

j

p(xj |xk) =∑

j

p(xj) ∙ 2s∙d(xj ,xk) ∙

(∑

i

p(xi) ∙ 2s∙d(xj ,xi)

)−1

1 =∑

j

p(xj) ∙ 2s∙d(xj ,xk) ∙ λj , (2.28)

∀k for which p(xk) 6= 0.By replacing the probability from (2.25) in (2.16) and (2.17), we can express the

rate and distortion functions in an alternative compact form:

R = sD +∑

j

p(xj) log λj (2.29)

D =∑

j,k

λjp(xj)p(xk) ∙ 2s∙d(xj ,xk)d(xj , xk) (2.30)

In this formulation, the parameter s is the slope of the rate-distortion function, andit can be used as a parameter to generate all the (R, D) points. The slope of theR(D) is negative and continuous.

Example: Rate-distortion for a binary source with Hamming distortion

Let us consider the following rate-distortion example: given X a Bernoulli(π) sourceand the Hamming distortion, find the corresponding rate-distortion function. Wewill find the solution following two distinct approaches: one analytical proof, byderiving the probabilities in equation (2.27), and another more intuitive solution,given by [25].

22

Page 45: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

2. Rate Distortion Theory

Figure 2.4: The reverse binary symmetric channel.

Solution 1 X is binary, X = {0, 1}, with p(X = 1) = π and p(X = 0) = 1 − π.

Then (2.28) gives:

{λ0 ∙ (1 − π) + λ1 ∙ π ∙ 2s = 1, for k=0

λ0 ∙ (1 − π) ∙ 2s + λ1 ∙ π = 1, for k=1

Solving for (λ0, λ1) gives: {λ0 = 1

(1−π)(1+2s)

λ1 = 1π(1+2s)

(2.31)

Replacing these values in (2.30) and (2.29) gives the following expressions for thedistortion and rate:

D = λ0(1 − π) ∙ p(x = 1) ∙ 2s + λ1π ∙ p(x = 0) ∙ 2s

⇔ D =2s

1 + 2s∙

(

p(x = 0) + p(x = 1)

)

⇔ D =2s

1 + 2s(2.32)

23

Page 46: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 2

and

R = s ∙ D + (1 − π) log λ0 + π log λ1

⇔ R = log 2s ∙2s

1 + 2s− (1 − π) log(1 − π) − (1 − π) log(1 + 2s)−

− π log π − π log(1 + 2s)

⇔ R =

[

− π log π − (1 − π) log(1 − π)

]

[

log(1 + 2s) − log 2s ∙2s

1 + 2s

]

⇔ R = H(π) −

[

−1 + 2s

1 + 2slog

11 + 2s

−2s

1 + 2s∙ log 2s

]

⇔ R = H(π) −

[

−1

1 + 2slog

11 + 2s

−2s

1 + 2s∙ log

2s

1 + 2s

]

⇔ R = H(π) − H(D) (2.33)

If we are interested in determining the probabilities achieving this rate-distortionbound, we can use the solution from (2.31) in (2.26) and, denoting p(X = 1) = μ

and p(X = 0) = 1 − μ, as can be seen in Fig. 2.4 we obtain:

1λ0

= (1 − μ) ∙ 1 + μ ∙ 2s

⇔ (1 − π)(1 + 2s) = 1 − μ ∙ (1 − 2s)

⇔ μ =1

1+2s − (1 − π)1−2s

1+2s

⇔ μ =π − D

1 − 2D(2.34)

Solution 2 We denote the output by X, and the modulo 2 addition (which corre-sponds in the binary case to the Hamming distance) by ⊕, i.e., X⊕X = 0 ⇔ X = X.The quantity in (2.18) can be bounded as follows:

I(X; X) = H(X) − H(X|X)

= H(π) − H(X ⊕ X|X)

≥ H(π) − H(X ⊕ X) (2.35)

≥ H(π) − H(D) (2.36)

where (2.35) follows since conditioning reduces entropy and (2.36) follows since the

24

Page 47: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

2. Rate Distortion Theory

expected value of the Hamming distance is the distortion. Thus, we have that:

R(D) = min I(X|X)

⇔R(D) ≥ H(π) − H(D)

We show that the above inequality can actually become an equality by finding ajoint distribution for which the distortion constraint is met and I(X; X) = R(D).Knowing that I(X; X) = H(X) − H(X|X), we have to find a setup for whichH(X|X) = H(D). This can be achieved by setting the inverse transition matrix tobe:

p(x|x) =[ 1 − D D

D 1 − D

](2.37)

We refer again to Fig. 2.4 for an illustration of the inverse channel. The expecteddistortion will be D.

By denoting p(X = 1) = μ and p(X = 0) = 1 − μ, we can compute p(X = 1):

π = μ(1 − D) + (1 − μ)D

⇔ μ =π − D

1 − 2D

The solutions of the two methods are naturally the same, namely:

R(D) =

{H(π) − H(D), 0 ≤ D ≤ min{π, 1 − π}

0, D > min{π, 1 − π}

The function is presented in Fig. 2.5.

2.5 Blahut-Arimoto Algorithm

Equations (2.27) and (2.28) can be used to check if a given p(x) is a solution tothe rate-distortion minimization problem. Nevertheless, solving the system for theoptimal solution is not always tractable analytically. An alternative solution wasproposed as an instance of a more general algorithm: finding the minimum distancebetween two convex sets.

The rate distortion function in (2.18) can be expressed in the form of a doubleminimization:

R(D) = minp∗(x)

minp∗(x|x):

∑p(x)p∗(x|x)d(x,x)≤D

j,k

p(xj)p(xk|xj) logp(xk|xj)p(xk)

(2.38)

Then, the Blahut-Arimoto [7, 16] algorithm numerically computes the rate-

25

Page 48: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 2

Figure 2.5: The rate-distortion function for a binary uniform source with Hammingdistortion.

distortion bound (see Algorithm 1).

Algorithm 1: Blahut-Arimoto Algorithm for Rate-Distortion Computation

Input: p(x), random initializations for: the slope of the R(D) - s∗ and the1

output distribution p∗(x)Output: Optimal p(x|x) and p(x)2

1: Obtain the minimizing p∗(x|x) given s∗ and p∗(x) from (2.27)2: Obtain the minimizing p∗(x) given s∗ and p∗(x|x): p∗(x) =

∑j p(xj)p∗(xk|xj)

3: Repeat until the quantity R(D) in (2.38) converges to a minimum value

For every slope value s∗ of the R(D), the Blahut-Arimoto algorithm was provento converge to the optimal solution in [26]. Appropriate choice of the slope s∗ valuesexhaustively sweeps the R(D) curve.

2.6 Distributed Lossless Compression – Slepian-

Wolf Theorem

So far we have considered the coding of a single source of information X, and wehave established that in order to reconstruct it, the encoding rate R has to be biggerthan H(X). Suppose that we have two sources of information X and Y . What wouldbe the minimum rate required to encode both sources? Clearly, if the sources areindependent of each other, then the rate will have to be bigger than the sum of

26

Page 49: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

2. Rate Distortion Theory

Figure 2.6: Distributed compression system.

the individual entropies, R > H(X) + H(Y ). If the sources are correlated, thenjoint encoding requires a rate of at least H(X, Y ). What is then the minimum raterequired to separately encode and jointly decode sources X and Y ? The nonintuitiveanswer was given in a fundamental paper by Slepian and Wolf [83], which have shownthat a total rate of R = H(X, Y ) is sufficient for lossless reconstruction, even in thecase of separate encoding of correlated sources.

An exposition of the proof of this theorem, as well as of the general Wyner-Ziv theorem, are not within the scope of this dissertation. Instead, the interestedreader is encouraged to consult the books of Cover and Thomas [25], El Gamal andKim [38], or the respective papers [83] and [103].

2.6.1 Distributed Lossless Source Coding

Let X ∈ X and Y ∈ Y be two correlated sources described by (X, Y ) ∼ p(x, y), thejoint probability mass function over X × Y . The sources are separately encoded atthe rate pair (RX , RY ). This system is presented in Fig. 2.6.

Definition. A (2nRX , 2nRY , n) distributed lossless source code for the joint source(X, Y ) consists of:

• two encoding functions: u1 : Xn → {1, 2, ..., 2nRX} and u2 : Yn →{1, 2, ..., 2nRY }

• a decoder: to each index pair (u1, u2) ∈ {1, 2, ..., 2nRX} × {1, 2, ..., 2nRY },assign an estimated (xn, yn) ∈ X × Y

Definition. The probability of error for a distributed lossless source code is Pne =

p

(

(Xn, Y n) 6= (Xn, Y n)

)

.

Definition. A rate pair (RX , RY ) is achievable for distributed lossless sourcecoding if there exists a sequence of (2nRX , 2nRY , n) codes with limn→∞ Pn

e = 0.The achievable rate region is the closure of the set of achievable rate pairs.

We have already given the intuitive limits for the optimal rate region. The innerand outer bounds are presented in Fig. 2.7. The shaded area corresponds either to

27

Page 50: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 2

Figure 2.7: Inner and outer bounds on the optimal rate region.

non-optimal rate pairs (inner bound) or rate pairs insufficient for lossless recovery(outer bound). A rate pair (RX , RY ) is always achievable for RX > H(X) andRY > H(Y ), which gives the inner bound in Fig. 2.7. In the case where the sourcescan communicate, a rate pair (RX , RY ) is achievable if the sum of the rates issuch that RX + RY ≥ H(X, Y ). Moreover, since H(X, Y ) = H(X) + H(Y |X) =H(Y ) + H(X|Y ), individual rates (RX , RY ) must be such that RX ≥ H(X|Y ) andRY ≥ H(Y |X). This is represented by the outer bound in Fig. 2.7.

Theorem 8. Slepian-Wolf [83]For distributed lossless source coding, the optimal rate region is given by rate

pairs (RX , RY ) such that:

RX ≥ H(X|Y )

RY ≥ H(Y |X)

RX + RY ≥ H(X, Y )

Essentially, the Slepian-Wolf theorem states that the outer bound in Fig. 2.7 istight, i.e., all sum-rates are achievable even in the case when the two sources X and Y

do not communicate with each other. Therefore, the sum-rate RX +RY = H(X, Y )is sufficient to separately encode and jointly decode correlated sources X and Y .

2.6.2 Lossless Source Coding with Coded Side Information

Consider the distributed lossless source coding instance where the two correlatedvariables (X, Y ) are encoded separately, but only one of them X is to be losslesslyrecovered. If the source Y is described using RY bits, we want to know what is therequired rate RX for the recovery of X.

28

Page 51: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

2. Rate Distortion Theory

Figure 2.8: Lossy source coding with side information.

If we do not spend any rate on describing Y and RY = 0, then RX > H(X).If RY > H(Y ), then Y can be perfectly reconstructed, so by the Slepian-Wolftheorem, RX ≥ H(X|Y ) suffices to find X. We remind that, in general, a rate equalto I(Y ; U) can be used to describe Y with a certain fidelity. Assuming RY = I(Y ; U),then U will be available at the decoder, and the rate required to find X should beat least H(X|U). In accordance with this intuition, the following theorem gives theachievable rate pairs (RX , RY ) for lossless recovery of X:

Theorem 9. If X is encoded at rate RX and Y is encoded at rate RY , X can berecovered with arbitrary low error if and only if:

{RX ≥ H(X|U)

RY ≥ I(Y ; U)

for an auxiliary random variable U ∈ U such that X – Y – U form a Markov chainand |U| ≤ |Y| + 2 [25].

We make the observation that [74] and [49] independently prove that an auxiliaryvariable with cardinality |U| = 2 is sufficient to achieve the above defined rate region.

2.7 Lossy Compression with Side Information -

Wyner–Ziv Theorem

Consider now that variable Y is made available to the decoder losslessly, i.e., RY >

H(Y ). For lossless recovery of correlated source X, knowing (X, Y ) ∼ p(x, y), we arein the case of the Slepian-Wolf theorem, and RX ≥ H(X|Y ). We know that R(D)(2.18) is the rate required to describe X within a distortion level D. How does thepresence of correlated source Y at the encoder and/or decoder reduce this rate?The system we refer to is presented in Fig. 2.8, where the dotted lines represent thepossibility of making source Y available to either encoder, decoder or both.

Definition. The side information is an auxiliary source of information availableto the encoder and/or decoder.

29

Page 52: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 2

Definition. The rate-distortion function with side information is defined tobe the minimum rate required to achieve any distortion D, in the presence of sideinformation.

The case when the side information is present only at the encoder and not atthe decoder does not present interest for the source coding problem, since with anuninformed decoder the problem reduces to the point-to-point case, and the ratedistortion function is given by (2.18).

When the side information is available to both the encoder and the decoder, therate distortion function is the conditional version of the rate distortion with no sideinformation.

Theorem 10. Rate Distortion with Encoder-Decoder Side Information [12]If the side information Y is available both at the encoder and the decoder, the

rate-distortion function for X is:

RX|Y (D) = minp(x|x,y):Ed(X,X)≤D

I(X; X|Y ) (2.39)

The case when the side information is only available at the decoder was studiedby Wyner and Ziv in their fundamental paper from 1976 [103]. The following theoremgives the rate distortion function for this specific case.

Theorem 11. Wyner-Ziv Theorem [103]

If the side information Y is available only to the decoder, the rate-distortionfunction for X is:

RWZ(D) = minp(u|x), p(x|u,y)

(

I(X; U) − I(Y ; U)

)

(2.40)

where the minimum is over all conditional probabilities p(u|x) with random variableU such that |U| ≤ |X | + 1, Y − X − U and X − (U, Y ) − X form Markov chains,and over all reconstruction functions p(x|u, y) such that E(d(X, X)) ≤ D.

The distortion can be computed as:

d(X, X) =∑

x,y,u,x

p(x, y)p(u|x)p(x|u, y)d(x, x) (2.41)

Theorem 12. The Wyner-Ziv rate-distortion function defined in (2.40) is a non-negative non-increasing convex function of D [12].

Proof. The non-negativity of the rate-distortion function in (2.40) follows by rewrit-

30

Page 53: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

2. Rate Distortion Theory

ing the difference of the mutual information quantities as:

I(X; U) − I(Y ; U) = H(U) − H(U |X) − H(U) + H(U |Y )

= H(U |Y ) − H(U |X)

= H(U |Y ) − H(U |X, Y )

= I(X; U |Y ) (2.42)

where we have used the equality H(U |X) = H(U |X, Y ), owed to the MarkovianityY − X − U . Since the information is a non-negative quantity, the rate-distortionfunction in (2.40) is non-negative.

The monotonicity of RWZ(D) is a direct consequence of the the distortion condi-tion Ed(X, X) ≤ D. More specifically, consider an operational point characterized bythe pair (RWZ(D∗), D∗). Then for all distortion levels D∗∗ > D∗ the correspondingrate value RWZ(D∗∗) has to be such that RWZ(D∗∗) < RWZ(D∗), otherwise thepair (RWZ(D∗), D∗) satisfies the distortion constraint for a lower rate. So RWZ(D)is non-increasing.

Since the rate-distortion function for a single source is a convex function, weexpect its extension to the Wyner-Ziv case to be convex as well. In order to provethis, consider two distortion levels D1 and D2, and the pairs (U1, X1) and (U2, X2)achieving the minima RWZ(D1) and RWZ(D2) in (2.40). Then let W be a randomvariable such that:

W =

{U1 with probability λ

U2 with probability (1 − λ)

where λ ∈ [0, 1]. Then, we can write:

XW =

{X1 with probability λ

X2 with probability (1 − λ)

and the corresponded expected distortion will be:

DW = Ed(X, XW )

= λEd(X, X1) + (1 − λ)Ed(X, X2)

= λD1 + (1 − λ)D2

31

Page 54: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 2

RWZ(DW ) can be written as:

RWZ(DW ) = I(W ; X) − I(W ; Y )

= H(X) − H(X|W ) − H(Y ) + H(Y |W )

= H(X) − λH(X|U1) − (1 − λ)H(X|U2)−

H(Y ) + λH(Y |U1) − (1 − λ)H(Y |U2)

= λ

(

H(X) − λH(X|U1) − H(Y ) + λH(Y |U1)

)

+

(1 − λ)

(

H(X) − λH(X|U2) − H(Y ) + λH(Y |U2)

)

= λ

(

I(U1; X) − I(U1; Y )

)

+ (1 − λ)

(

I(U2; X) − I(U2; Y )

)

Therefore, we can write:

RWZ(DW ) = minU :Ed≤D

(

I(U ; X) − I(U ; Y )

)

≤ I(W ; X) − I(W ; Y )

= λ

(

I(U1; X) − I(U1; Y )

)

+ (1 − λ)

(

I(U2; X) − I(U2; Y )

)

= λRWZ(D1) + (1 − λ)RWZ(D2)

Therefore RWZ(D) is a convex function.

2.7.1 Maximum distortion

The maximum distortion is a property of the Wyner-Ziv system; it is defined to bethe minimum distortion between the source and the reconstructed source that canbe achieved when no rate is spent – see also Theorem 6 in Section 2.4.

As an example, let X be binary, with pr(X = 1) = π, and π < 0.5. If no rateis spent (R = 0), then there are two choices for the reconstructed source: eitherX = Y – the side information, or X = 0. The expected distortion in the first case isDmax1 = Ed(X, Y ). In the second case, the expected distortion is Dmax2 = Ed(X, 0),and this is equal to pr(X = 1), so Dmax2 = π.

It is of course desired to have reconstruction with as low a distortion aspossible, even for the case when R = 0. This is why the maximum possibledistortion of a Wyner-Ziv system is defined to be Dmax = min{Dmax1, Dmax2} =min{π,Ed(X, Y )}.

32

Page 55: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

2. Rate Distortion Theory

2.7.2 Example - Doubly symmetric binary source

Consider X = Y = U = {0, 1}, the source X to be uniform, i.e., X ∼ Bernoulli(π =0.5), and the correlation between X and Y to be described by the following transitionmatrix:

p(y|x) =

[(1 − p) p

p (1 − p)

]

. (2.43)

The channel corresponding to the transition matrix in (2.43) is called binarysymmetric (BSC), where p ∈ [0, 0.5] is the crossover probability. The distortionmetric is the Hamming distance, so d(x, x) = x⊕ x, where ′⊕′ denotes the modulo-2sum.

When there is no side information available, the rate distortion function is givenby (2.18). I(X, X) = H(X) − H(X|X) = H(π) − H(D) and the rate distortionfunction is equal to:

R(D) =

{1 − H(D), if 0 ≤ D ≤ 0.5

0, if D ≥ 0.5

When side information Y is available both at the encoder and the decoder, therate-distortion function in (2.39) becomes:

RX|Y (D) =

{H(p) − H(D), if 0 ≤ D ≤ p

0, if D ≥ p

When the side information is only available at the decoder, the rate-distortionfunction was proved in [103] to be RWZ(D) = l.c.e. (g(D)), and:

g(D) =

{H(p ∗ D) − H(D), if 0 ≤ D ≤ p

0, if D ≥ p(2.44)

where l.c.e. stands for lower convex envelope and ′∗′ is the binary convolutionoperator: x ∗ y , x(1 − y) + y(1 − x). All rate-distortion functions are presented inFig. 2.9, for a value of p = 0.25. The lower convex envelope of g(D) includes theline segment that is tangent to g(D) and goes to the R = 0 and D = p point. Thetangency point has the coordinates R = g(D∗) and D = D∗, where D∗ satisfies:

g′(D∗) =g(D∗)D∗ − p

(2.45)

As expected, when no side information is available, the rate-distortion perfor-mance is clearly worse, and RWZ(D) < R(D) for all distortions 0 ≤ D ≤ p .We also make the observation that RX|Y (D) ≤ RWZ(D) for all distortion values

33

Page 56: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 2

Figure 2.9: Rate distortion functions for the doubly binary symmetric case : R(D)is the classical rate-distortion function (i.e., with no side information), RX|Y (D) isthe rate-distortion function with encoder-decoder side-information, while RWZ(D) isthe rate-distortion corresponding to the Wyner-Ziv case. g(D) is the function definedby (2.44) and D∗ is the tangency point defined by (2.45)

D ≥ 0. Knowledge of the side information at the encoder does affect the requiredcode rate, and in the case of the informed encoder, for the same expected distortion,the required rate will in general be smaller than in the Wyner-Ziv (WZ) case.

2.8 Rate loss in Wyner-Ziv coding

It should be a natural observation (see the example presented in Fig. 2.9) the factthat the following inequalities hold:

RX|Y (D) ≤ RWZ(D) ≤ R(D).

where R(D) is the classical rate-distortion function (i.e., with no side information),RX|Y (D) is the rate-distortion function with encoder-decoder side-information,while RWZ(D) is the rate-distortion corresponding to the Wyner-Ziv case, i.e., side-information available only to the decoder.

The difference between the extreme terms, i.e., R(D)−RX|Y (D), is bounded bylog |X |, obtained when X = Y and D = 0; for continuous sources, this differencemight therefore be unbounded. However, the difference between the middle and theleft-most terms, i.e., RWZ(D) − RX|Y (D), is more difficult to bound or interpret.The difference between the rate-distortion functions corresponding to Wyner-Ziv

34

Page 57: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

2. Rate Distortion Theory

coding and predictive coding, respectively, is the rate loss associated to Wyner-Zivcoding. We know from the Slepian-Wolf (SW) theory that the difference is zero forthe lossless case, i.e., when D = 0. The difference is also zero for D = Dmax, ascan be seen in Fig. 2.9 for D = 0.25. For values of the distortion D ∈ (0, Dmax)Wyner and Ziv observed that most of the times there will be a penalty in rate fornot having access to the side information at the encoder.

Zamir [105] observed that, mathematically, the difference between the two termsis given by the Markov chain constraint in Theorem 11; substituting U by X informulation (2.42) of the WZ rate would lead to the equality RWZ(D) = RX|Y (D).Furthermore, [105] gives an upper bound on the maximum rate loss that WZ codingcan incur when compared to the predictive coding case. This upper bound is actuallycomputed to be 0.22 bits for a binary source and 0.5 bits for continuous sources.However, the upper bound is not tight, and it’s mere purpose is to show that therate-loss is bounded.

2.9 Achievability of Slepian-Wolf and Wyner-Ziv

coding

Most of the results in information theory can be intuitively explained by using afundamental theorem introduced by Shannon in his landmark paper [80], namelythe asymptotic equipartition property. It is a direct consequence of the weak lawof large numbers, which states that for independent, identically distributed (i.i.d.)random variables, the average value converges in probability to the expected valuefor a large sample size:

1n

n∑

i=1

Xi → E(X)

2.9.1 The Asymptotic Equipartition Property

The Asymptotic Equipartition Property (AEP) can be formally stated by thefollowing theorem [25]:

Theorem 13. If X1, X2, ... are i.i.d. ∼ p(x), then 1n log p(X1, X2, ..., Xn) → H(X)

in probability [25].

Proof. The proof is straightforward by noticing that E(log(p(X)) = H(X) andapplying the weak law of large numbers.

Based on the AEP, we can divide the set of all n-length sequences into twodisjunct sets: the typical set, for which the true entropy is close to the real entropy,and the nontypical set, for which the true entropy diverges from the real entropy.

35

Page 58: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 2

Definition. The typical set A(n)ε contains sequences (x1, x2, ..., xn) ∈ X n with the

property:2−n(H(X)+ε) ≤ p(x1, x2, ..., xn) ≤ 2−n(H(X)−ε),

where ε is arbitrarily small.

Based on the AEP, one can show that the typical set A(n)ε has the following

properties:

Theorem 14. [25]:

1. If (x1, x2, ..., xn) ∈ A(n)ε , then H(X)−ε ≤ − 1

n log p(x1, x2, ..., xn) ≤ H(X)+ε.

2. Pr{A(n)ε } > 1 − ε, for large enough n

3. |A(n)ε | ≤ 2n(H(X)+ε)

4. |A(n)ε | ≥ (1 − ε)2n(H(X)−ε), for large enough n

The proof of the theorem is given in [25]. Based on its results, we make thefollowing observations:

• the average information content of a typical sequence is close to the entropyof the source

• the probability of generating a typical sequence is close to 1 for large enoughsequence sizes

• the number of sequences in the typical set is almost equal to 2nH(X)

• all the elements in the typical set are asymptotically equally probable, withprobability 2−nH(X) for large enough sequence sizes

2.9.2 Consequences of the AEP: Data Compression

The AEP offers theoretical background for improving the efficiency of data compres-sion. We will divide the set of length n sequences into the typical set A(n)

ε and itscomplement. Since there are ∼ 2n(H(X)+ε) sequences, indexing them will require atmost dn(H(X) + ε)e bits, where d∙e is the upper integer operator. These sequenceswill be prefixed with a 0, adding up to a total length of dn(H(X) + ε)e + 1 bits pertypical sequence. Similarly, the rest of the sequences will be coded using at mostdn log |X |e bits, plus one extra bit for prefixing each sequence with a 1. The graphicaldepiction of the described procedure is summarized in Fig. 2.10.

The above method provides a bijective correspondence between sequences andtheir corresponding codes, and is therefore uniquely decodable. Moreover, the typicalsequences have a shorter code-length assigned ≈ nH(X) and it is provable [25] that,

36

Page 59: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

2. Rate Distortion Theory

Figure 2.10: Data compression using typical sets [25], where d∙e is the upper integeroperator

when n is sufficiently large, the average code length of the proposed constructionsatisfies:

1nE(l(Xn)) ≤ H(X) + ε,

where l(Xn) is the length of the codeword associated to Xn.

2.9.3 Jointly Typical Sequences

Definition. The set A(n)ε of jointly typical sequences (xn, yn) with respect to p(x, y)

includes all sequences such that:

A(n)ε = {(xn, yn) ∈ X × Y :| −

1n

log p(xn) − H(X)| < ε,

| −1n

log p(yn) − H(Y )| < ε,

| −1n

log p(xn, yn) − H(X, Y )| < ε}

There are ≈ 2nH(X;Y ) jointly typical sequences, so not all combinations oftypical pairs Xn(≈ 2nH(X)) and typical pairs Y n(≈ 2nH(Y )) will result in a jointlytypical pair. The probability that a randomly chosen pair is jointly typical is about2−nI(X;Y ) [25].

37

Page 60: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 2

Figure 2.11: Random binning for Slepian-Wolf coding [27].

2.9.4 Achievability of Slepian-Wolf Coding

Consider a asymmetric SW working rate-point, where the rates are RY = H(Y )and RX = H(X|Y ). Following the AEP, Y n can be reconstructed with arbitrarilylow error probability by using nH(Y ) bits. The sequence will be known at the jointdecoder and used as side-information. However, the encoder of the source X does nothave prior knowledge of the sequence Y n. As such, the achievability of the theoremwas proved by Slepian and Wolf by using the concept of random binning.

During code construction, all possible sequences Xn are randomly distributedinto 2nRX bins, as shown in Fig. 2.11. The mapping is known by both the X-sourceencoder and the joint decoder. All the X-source encoder does is to send the jointdecoder the index of the bin to which the source sequence belongs, i.e., a number inthe set {1, 2, ..., 2nRX}. The decoder will thereafter look in the designated bin andsearch for a sequence Xn which is jointly typical to the available side informationY n. If there is only one jointly typical sequence Xn, it is chosen as the decodedsequence. Otherwise, an error is declared. The probability of error corresponding tothis coding strategy can be proven to be negligible as n → ∞ [25, 83].

2.9.5 Achievability of Wyner-Ziv Coding

We remind the reader the details of WZ problem: as seen in Fig. 2.12 source X needsto be encoded and subsequently decoded in the presence of side information Y ,under an imposed distortion condition. The best rate-distortion points can be foundby minimizing the conditional mutual information I(X; U |Y ) over all conditionalprobabilities p(u|x) with random variable U satisfying |U| ≤ |X | + 1, Y − X − U

and X − (U, Y ) − X forming Markov chains, and over all reconstruction functionsf(y, u) = p(x|u, y) such that E(d(X, X)) ≤ D.

38

Page 61: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

2. Rate Distortion Theory

Figure 2.12: Wyner-Ziv coding system.

The proof of the achievability of the WZ theorem involves the following steps[25,103]:

• fix the conditional probability p(u|x) and compute p(u) =∑

x p(x)p(u|x); alsofix the reconstruction function x = f(y, u) = p(x|u, y);

• Codebook generation: generate source codewords Un(c), c ∈ {1, 2, ..., 2nR1},where R1 = I(X; U) + ε; using a uniform distribution, randomly assign theindices c to 2nR2 bins, where R2 = I(X; U |Y ) + 5ε; let i ∈ {1, 2, ..., 2nR2} bethe bin index. Asymptotically, there will be ≈ 2n(R1−R2) = 2nI(Y ;U) indices c

in each bin.

• Encoding: Given source sequence Xn, the encoder looks for a Un(c) codewordsuch that (Xn, Un(c)) are jointly typical. If there is no match, the encoder setsc = 1. If there are multiple matches, the encoder chooses the lowest c. Theencoder will send to the decoder the index i of the bin to which the selectedcodeword belongs.

• Decoding: The decoder receives the transmitted index i. Then it will lookfor a codeword Un(c) belonging to the bin i such that (Y n, Un(c)) are jointlytypical. If there is a unique such codeword, the decoder outputs the solutionXn = f(Y n, Un(c)). If the decoder cannot find a proper matching codeword, orthere are more than one candidates, the output is an arbitrary reconstructionXn = random(Xn).

It can be proven [25, 103] that, with high probability, the decoder will producean output sequence Xn which has a distance smaller than nD from the originalsequence Xn, where D is given by D =

∑x,y,u p(x, y)p(u|x)d(x, f(y, u)), and d(∙, ∙)

is the desired distortion metric.

39

Page 62: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 2

Figure 2.13: Nested linear binary block codes for practical Wyner-Ziv coding.

2.10 Practical Code Constructions for the Binary

Wyner-Ziv Problem

The theoretical limits set by Slepian and Wolf (lossless) and Wyner and Ziv (lossy)are proven to be asymptotically achievable by means of appropriate code construc-tions. Unfortunately, those constructions mostly rely on arguments involving infinitecodeword lengths. In practice however, such codeword lengths are unfeasible, andthe result is that the performances of practical systems fall short of reaching theabsolute bounds. One of the aspects addressed by this chapter is how to approachthose information-theoretic bounds in practice for the binary WZ coding problem.

2.10.1 Binary Block Codes for Binary Slepian-Wolf and

Wyner-Ziv Coding

Wyner’s scheme [101] uses a linear (n, k) binary block code for lossless SW coding.This linear code generates 2n−k distinct syndromes, each indexing a bin of 2k binarywords of length n; the 2k words in each bin must be chosen such that they preserveminimum distance properties – see Fig. 2.11. In order to achieve compression,the original n bits are mapped to the corresponding (n − k) syndrome bits. Thecompression ratio is n : (n − k).

For lossy WZ coding, Zamir et al. [106] generalize Wyner’s scheme by usingnested linear block codes. Consider a linear (n, k2) binary block code used topartition the space of all binary words of length n into 2n−k2 bins of 2k2 elementseach. Out of all 2n−k2 bins only 2k1−k2 are to be used, with k1 > k2. The remainingelements in the not-used 2n−k2−2k1−k2 are quantized to the closest binary word from

40

Page 63: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

2. Rate Distortion Theory

the allowed 2k1−k2 ×2k2 = 2k1 ones – see Fig. 2.13. This ”quantization” is equivalentto a linear (n, k1) binary block source code. In order to achieve compression, theoriginal n bits are mapped to one of the k1 − k2 bin indices. The compression ratiois n : (k1 − k2).

2.10.2 Practical Code Constructions

The theoretical proofs of the achievability of the SW and WZ bounds can beapproached using random generated codes and the random binning argument [25]

In practice, all good-performing channel codes can be efficiently used to targetthe asymmetric SW scenario, i.e., one source compressed to its entropy and usedas side-information at the decoder to achieve lossless/near-lossless reconstruction.Two different approaches which exploit linear channel codes were proposed in theliterature: syndrome-based [60, 71, 92], characterized by shorter codewords and lesscomplex, and parity-based [104], which are preferred in noisy transmission scenarios.The use of these schemes leads to performances close to the theoretical bounds [43].

In the WZ case, practical code design is a more complex problem. Formally,practical WZ coding combines a quantization step with SW coding of the quantiza-tion indices. It can be seen as a joint source-channel coding problem, where a goodsource code with high source compression properties has to be matched with a goodchannel code acting as a lossless SW code.

Zamir et al. showed in [106] that nested linear codes achieve asymptotically therate-distortion limit. For the binary case, one of the very few practical implementa-tions ever realized for this problem was proposed by Liveris et. al. in [61]: a nestedconvolutional/turbo code construction which was shown to come within 0.09 bpsfrom the theoretical limit. Superposition coding, where two codes act independently,rather than in a nested framework, was proposed for WZ coding in [79]. It was shownthat it can theoretically asymptotically achieve the rate-distortion bound just likethe nested construction. However, practical implementations are difficult to realize,since it is difficult to match the two distinct codes (the source code and the channelcode). Currently, the best source codes available are trellis codes [62], while the mostflexible channel codes are turbo [15] and Low-Density Parity-Check (LDPC) [42]codes. Unfortunately, due to their inner mechanisms, combining the two in a jointscheme is practically unfeasible, and practical implementations of the superpositionapproach remains a hard problem.

More recently, Wainwright et. al proposed [95, 96] a Low-Density GeneratorMatrix (LDGM) coding approach to source compression. This can be extended to aLDGM-LDPC compound construction which was shown to asymptotically achievethe rate-distortion bound for the binary WZ problem with uniform source input [64].The LDGM codes are particularly interesting, as they have been shown to achievethe rate-distortion bound for the lossy source coding problem [94]; however, practical

41

Page 64: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 2

constructions using these codes are yet to be proposed.Furthermore, the most recent proposal for achieving the rate-distortion bound

in WZ coding problems is using polar codes [6]. They were shown to be optimalfor the binary WZ problem [57,58]. In spite of the promising theoretical propertiesand low complexity, the performance of polar codes for practical block lengths issignificantly worse than other channel codes for the same length [54].

2.11 Conclusions

In this chapter we have defined entropy and information, and we have shown theirbasic properties. In the interest of keeping the presentation on a simple and clearlevel, we have chosen to limit our attention to the simple cases of discrete memorylesssources and channels. This allows nevertheless to point out the significance ofShannon’s coding theorems and to introduce the rate-distortion function of a DMSwith respect to a given distortion measure.

It is interesting to observe the fact that the source coding problem and thechannel coding problem are related by the quantity that defines their respectiveperformances; while source coding tries to minimize the mutual information I(X; X)over all choices of the output given the input p(x|x), channel coding tries to maximizethe same mutual information over input channel probabilities p(x). In [25] it waspointed out that the two problems are information-theoretic duals of each other.This duality is exploited in the formulation of the numerical optimization algorithmswhich compute the channel capacity or the rate-distortion function, such as [16].

This duality can be extended to the cases of source and channel coding in thepresence of side information at the encoder and/or decoder. In [72] the authors showthe functional duality between the two problems; namely, given an optimal sourceand, respectively, channel coding scheme, the optimal encoder for one problem isfunctionally identical to the optimal decoder of its dual. In this context, all theresults we will present for the source coding with side information case can beextended to their respective counterparts, namely the channel coding with sideinformation functional dual.

42

Page 65: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 3

Distributed Video Coding

Overview

3.1 Distributed Video Coding

With the advent of low-power wireless communications, a substantial researcheffort has been put into practical realizations of Distributed Video Coding (DVC)(also called Wyner-Ziv video coding) in order to provide low-cost encoding, highcompression performance and resilience against transmission errors. DVC systemstarget low-power devices, being specifically designed to trade off high performancefor low complexity of the encoder. Essentially, DVC’s low encoding complexitydecreases the overall power consumption compared to predictive coding systems.However, the resulting drawbacks are an increased complexity for the decoder -still manageable for low scale real-time applications, but also a somewhat lowercompression performance, as until now DVC did not manage to achieve the samecompression efficiency as the state-of-the-art predictive codecs.

In what follows we will make a short introduction to the basic principles of videocoding, with exemplifications on the newest and most efficient predictive codecs,namely H.264/MPEG-4 Advanced Video Coding (AVC) and H.265 High EfficiencyVideo Coding (HEVC) [85, 100]. Then we describe the architecture of a typicalDVC system, with the aim of offering the prerequisite background for a fair analysisand comparison between classical predictive video coding and DVC. We will lookat DVC systems, side-information generation techniques and correlation channelestimation methods. Finally, we will present a couple of practical scenarios whereDVC is used due to the computational limitations imposed on the coding devices:capsule endoscopy and 1K pixel video surveillance.

43

Page 66: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 3

Figure 3.1: Block transform image coding system.

3.1.1 Brief Overview of Classical Video Compression

Uncompressed video information, especially using today’s High Definition (HD) and4K video technologies, imply the need for extremely large data rates. As such,compression is imperative and good compression methods are at the core of modernvideo technologies [85, 100]. Compression of visual data, be it images or video, isusually lossy, exploiting the human visual system characteristics [97,98] in order toachieve better quality at lower rates. It involves removing duplicate or redundantinformation from the input data by means of prediction.

Image compression is usually performed through block-based transform coding.The original image is divided into (non-overlapping) blocks, and a mathematicaltransform is applied in order to obtain a sparse set of coefficients. These coefficientsare quantized using scalar quantizers and then encoded into a binary stream usingvariable length entropy coding techniques. The basic blocks of such a compressionsystem can be seen in Fig. 3.1.

A video sequence consists of a group of images, called frames, which aredisplayed one after the other. A naive approach to coding such a sequence wouldbe to independently encode every image using a stand-alone image coding system.However, since consecutive frames are often very similar, a higher compression ratiocan be obtained by using predictive coding. A pixel will not be coded directlyanymore, but its value will be predicted from the values of adjacent pixels in thesame frame or in a previously encoded frame.

Consecutive frames will typically contain the same image blocks, but they mightbe located at different spatial coordinates due to motion. An efficient way to improvethe predictability of an image block is to be able to accurately estimate existingmotion between frames (motion estimation-ME) and then forming the predictionthat compensates for the motion (motion compensation-MC). Block-based ME and

44

Page 67: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

3. Distributed Video Coding Overview

Figure 3.2: Typical Group of Pictures (GOP).

MC are currently performed by splitting the frame into n×n-pixel blocks – variablesize block matching: for each block the goal is to find the best matching block ina previously coded reference frame. The relative motion between the original blockand the best-matching one is called a motion vector -MV.

There are commonly three types of frames used in encoding video sequences:

• intra-coded frames (I-frames) are coded independently from any other frame;

• predictedly-coded frames (P-frames) are coded based on one previously codedframe;

• bi-directional predicted frames (B-frames) are coded using both previous andfuture coded frames.

The succession of frames between two consecutive I-frames is called a Group ofPictures (GOP). An example of such coding dependencies as well as the corre-sponding encoding order in a GOP is presented in Fig. 3.2.

A block-based video codec will therefore code a number of frames independently,in intra-mode, to create references for other frames and provide error resiliencethrough periodic refreshing. The first frame in a sequence is always an I-frame. Allnon I-frames are split into blocks of pre-defined sizes and can be intra coded (incase the prediction error is too large) or inter coded, i.e., predicted from previouslycoded reference frames, using block-based motion estimation and compensation. Theresulting motion vector specifies the displacement between the current block and itsbest matching counterpart. By using the motion vector at the encoder, a predictionerror block is then generated; the essential difference with basic image processingis that the blocks are not directly encoded, but first predicted, and then the errorbetween the original and the prediction is encoded. The prediction error block will

45

Page 68: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 3

Figure 3.3: Classical video encoder architecture.

be processed similarly to the flow presented in Fig. 3.1.: it is DCT transformed, theDCT coefficients are quantized and entropy coded.

Summing up, the structure of a classical video encoder is presented in Fig. 3.3:

• I-frames are encoded with switches S1, S2 and S3 in Fig. 3.3 open. The blocksof such a frame are coded independently, i.e., transformed, quantized andentropy-encoded. In order to establish the reference for the encoding of thenext frames, the quantized bits are used to reproduce at the encoder the exactsame result the decoder will have, after performing the inverse operations, i.e.inverse quantization and inverse transform. Such an encoder is called closed-loop video encoder.

• All other frames are inter-encoded, which corresponds to switches S1, S2 andS3 in Fig. 3.3 being closed. Using the reference frames, the best match isfound for each block in the current frame, and the the computed predictionerror together with the corresponding motion vectors are entropy encoded.

Many improvements have been proposed over the presented basic scheme, andthe performance of video coding systems has evolved tremendously over the pastdecades. For more details about video coding standards and their performances, thereader can consult the specifications of the standards [1, 47, 55, 88]. In what followswe will give a brief description of the main characteristics of the two standards whichare mostly relevant to video coding applications: H.264/AVC and HEVC [70].

46

Page 69: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

3. Distributed Video Coding Overview

Figure 3.4: Partitioning of a CTU in PUs/TUs in HEVC.

H.264/MPEG-4 AVC vs HEVC

In terms of coding performance, the current pace set by standardisation authoritiesis to halve the bit rate required for the same quality of the decoded imageswhen compared to previous reference standards, without significant computationalovercharge. This was the case with H.264/AVC [55] when it was released back in2003, and this is the case with HEVC [1], which emerged as the new state-of-the-artvideo coding standard in 2013.

Both standards were designed with a focus on flexibility and coding efficiency,and we will make a comparative overview of the key elements that enable their highcoding performance, based on the analysis presented in [70].

Image Partitioning: In H.264/AVC every frame in the video stream is par-titioned into macro-blocks of size 16 × 16 pixels, and each macro-block can be inturn subdivided into smaller blocks down to a size of 4 × 4. Having to cope with aconsiderable increase in video resolutions, HEVC had to adapt and support largerencoding block sizes than H.264/AVC. The basic encoding block in HEVC is calledcoding tree unit (CTU) and can be recursively split into smaller coding units (CUs),which in turn can be split into small prediction units (PUs) and transform units(TU). Each CTU is a square picture area and can have a maximum size of 64 × 64.For coding purposes, each CTU can be quarter-size-divided into smaller CUs, downto a size of 8 × 8. The purpose of this recursive splitting is to allow flexibility forprediction and transform operations. Each CU can be further split into smaller units,PUs, which form the basis for prediction. They can be as small as 4 × 4 pixels, andcan also be asymmetric (non-square). A transform unit(TU) is the basic unit for thetransform and quantization processes. The size and the shape of the TU depend onthe size of the PU. The size of square-shape TUs range from as small as 4 × 4 to aslarge as 32 × 32. Non-square TUs can have sizes of 32 × 8, 8 × 32, 16 × 4, or 4 × 16.

47

Page 70: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 3

Figure 3.4 illustrates an example of partitioning a CTU into PUs and TUs.By ensuring a higher flexibility for the prediction and transform units, HEVC

will have higher accuracy and better overall performances, at the cost of a slightlymore complicated architecture.

Intra frame Coding: Both H.264/AVC and HEVC take advantage of spa-tial correlations within a frame and use block-based intra prediction. WhereasH.264/AVC had nine possible prediction modes, HEVC has increased this numberto 35. Moreover, the prediction can be done at different block sizes (depending onthe size of the PU – 4 × 4 up to 64 × 64).

Inter prediction: Inter prediction takes into account of the similarities betweeneach picture and its temporal neighbors and exploits these similarities. H.264/AVChad brought quarter-sample accurate motion estimation and compensation by usingbilinear interpolation (six-tap filters). In case a motion vector points to an integer-sample position, the prediction signal is given by the respective sample of thereference picture; otherwise the corresponding sample is obtained using interpolationto generate non-integer positions. HEVC improved on this, as in order to obtainthe non-integer pixel samples, separable one-dimensional eight-tap and seven-tapinterpolation filters are applied horizontally and vertically.

In H.264/AVC, motion vectors are encoded by estimating a predicted motionvector and encoding the difference between the desired MV and the predicted one.The syntax allows MVs to point over picture boundaries as MVs may point outsidethe image area. If this is the case, the reference frame is extrapolated beyond theimage boundaries by repeating the edge samples before interpolation. The MVcomponents are differentially coded using either median or directional predictionfrom neighbouring blocks. In HEVC a new prediction technique is introduced,called motion merge [56]. Motion merge mode implies creating a list of previouslycoded neighboring PUs (called candidates) for the PU being currently encoded. Thecandidates are either spatially or temporally close to the current PU. The encoderwill determine and signal which candidate from this motion merge list will be used,and the motion information for the current PU is copied from the selected candidate.

Transform and Quantization: H.264/AVC applies a DCT-like integer trans-form on the prediction residual. HEVC includes transforms that can be applied toblocks of sizes ranging from 4×4 to 32×32 pixels. HEVC also includes transforms onrectangular (non-square) blocks, where the row and column transforms have differentsizes. The integer transforms used in HEVC are also better approximations of theDCT than the transforms used in H.264/AVC [41,50].

Entropy Coding: After transformation, entropy coding is applied to code allthe syntax elements and quantized transform coefficients. In H.264/AVC, context-adaptive variable-length coding (CAVLC) is the base entropy coder, and context-adaptive binary arithmetic coding (CABAC) is optionally used in the main andhigh profiles. CABAC can provide better coding efficiency than CAVLC given its

48

Page 71: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

3. Distributed Video Coding Overview

Figure 3.5: Wyner-Ziv coding problem.

arithmetic coding engine and more sophisticated context modeling. While CABACimproves the coding efficiency, it increases coding complexity. HEVC specifies onlyone entropy coding method, namely CABAC, rather than two as in H.264/AVC. Thealgorithm is fundamentally the same, but minor improvements have been proposed[85] which decrease complexity and improve efficiency of the coding process.

3.1.2 DVC Codec Architecture

The Wyner-Ziv theory states that when the side information is only available tothe decoder, and not available to the encoder, there is a performance loss in therate-distortion sense; therefore, it is expected to see that practical implementationsof Wyner-Ziv coding systems will have lower performance than the classical pre-dictive coding architectures. The Wyner-Ziv theorem was derived assuming perfectknowledge of the statistical correlation between the source and the side-information.However, in a practical distributed coding scheme, this dependency model is notknown at the decoder, since the source and the side-information exist at differentends of the coding system, and do not communicate with each other.

Figure 3.5 represents in a schematic manner the practical view of the Wyner-Zivcoding problem. The Wyner-Ziv coding process involves a quantization step followedby Slepian-Wolf coding of the resulting quantization indices. The correlation depen-dency between source X and side-information Y is modeled by a virtual correlationchannel, described by the conditional probability density function pX|Y (x|y). Theaccuracy of the correlation channel estimate is a key factor in the performance ofa Wyner-Ziv coding system; as such, it is desired to have adequate models for thevirtual correlation.

The conceptual workflow of a DVC system assumes the video sequence to be splitin two correlated sources of information. The first one is encoded using conventionalintra-frame coding, and subsequently used to derive side information at the decoderside. The side information is used to decode the second source, which is compressedusing Wyner-Ziv coding principles.

49

Page 72: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 3

Figure 3.6: Block diagram of the Stanford transform domain DVC architecture.

Since DVC does not rely on joint encoding of frames, the reference frame onlyneeds to be present at the decoder. Therefore, as opposed to the conventional videocoding paradigm, in DVC the prediction loop is only present at the decoder, andthe DVC encoder does not have access to the reference frames. Temporal correlationcan only be exploited by performing joint decoding of the current and the referenceframes, and the DVC decoder generates locally a estimated current frame based onthe already decoded ones. This estimated frame is the side-information.

The first practical implementation of such a DVC video coding solution wereproposed independently and almost at the same time by a research group at theBerkeley university [73] and a research group at the Stanford University [2].

As an exemplification of the DVC architecture, we will look more closely atthe Stanford architecture. At first it was operating in the the pixel domain [2],and later on it was extended to the transform domain [4]. The system used frame-based coding, with decoder-driven rate control, based on feedback information. Theblock-diagram of the Stanford transform-domain DVC [46] is presented in Fig. 3.6.Its operation can be summarized as follows:

Stanford DVC Encoder

• Frame classification: The video sequence is divided in two distinct types offrames: key frames, which are intra coded using a state-of-the-art conventionalintra-frame video codec, and WZ frames. The key frames occur periodically,determining the size of the GOP.

• Transform: The DCT is applied to WZ frames, and the coefficients are

50

Page 73: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

3. Distributed Video Coding Overview

grouped into coefficient bands. Each transform coefficient band is then encodedindependently.

• Quantization: Each DCT coefficient band is quantized using a uniform scalarquantizer with 2L levels. The quantized symbols are converted to fixed-lengthbinary codewords each of length L. Then, the quantized values are arranged ina bit-plane structure, from the most significant bit-plane to the last significantbit-plane. Thus are formed L bit-plane vectors. Each bit-plane vector is sentto a Slepian-Wolf (SW) encoder.

• Slepian-Wolf encoding: SW encoding is performed using a high-performancechannel code, e.g., Turbo codes. Coding starts with the most significant bitplane, and the parity information is stored in the bit-plane buffer, and sent tothe decoder upon request. Any request for additional parity bits is done viathe feedback channel.

Stanford DVC Decoder

• Side-Information generation: The decoder uses motion estimation andcompensation based on already decoded frames in order to create side-information reference frames for every WZ frame.

• Correlation channel estimation The DCT coefficients of the differencebetween the WZ frame and the reference SI frame are modelled to follow aLaplacian distribution, with a variance that can be estimated in an a-priorioff-line training phase.

• Slepian-Wolf decoding: Once the SI DCT coefficients and the statistics forthe residual DCT coefficient band are known, the channel decoder correctsthe errors in the SI. The decoder may request additional parity bits via thefeedback channel until some predefined stopping criterion is satisfied.

• Reconstruction: The decoded bit planes are grouped together to form thedecoded quantized WZ frame in the transform domain. In order to finish thedecoding process, an inverse DCT transform is performed and the WZ framesare properly sequenced.

In the late 2000’s there had been an increase in the research efforts towards moreefficient DVC systems. By trading complexity for higher performances, the Stanfordarchitecture has been modified or tuned in order to improve compression perfor-mance and lower coding delay. The interested reader is referred to [8,28,36,40,89,90].

51

Page 74: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 3

3.2 Feedback Channels in DVC

Current Wyner-Ziv video codecs employ high-performance channel codes for theSW coding step, and only syndrome/parity bits are sent to the decoder. As such,rate-control becomes a major challenge in practical DVC. Since the quality of theside-information is not known a-priori at the encoder, how much syndrome/parityinformation is required for successful channel decoding?

Most of the DVC systems, e.g. [2, 46], rely on a communication link betweenencoder and decoder, called feedback channel. In such a system rate control isdecoder-driven. The encoder sends chunks of data and should decoding prove tobe unsuccessful, the decoder can request more syndrome/parity information andretry the decoding procedure. The presence of a feedback cannel ensures successfuldecoding and also minimum rate expenditure.

However, feedback-channel-based rate control architectures for DVC are incom-patible with unidirectional application scenarios, and might also incur excessivedelays. The alternative feedback-free systems, e.g. [73] lack competitive performance,or exhibit high decoding complexity [21]. The encoder needs to coarsely approximatethe side-information, and care must be taken to ensure maintaining low-complexitycharacteristics at the encoder. In such a system rate control is encoder-driven,and the rate estimation is a complex problem: underestimation leads to poorreconstruction, while overestimation results in rate waste.

3.3 Side-information Generation Techniques

An essential factor affecting compression performance in Wyner-Ziv video coding isthe quality of the side-information. A better approximation translates into a highercorrelation between the source and the side-information, and implicitly a lower bitrate for the Slepian-Wolf encoder.

3.3.1 Motion-Compensated Interpolation

Traditional video codecs exploit the readily available reference frames buffered at theencoder and perform prediction by means of motion estimation and compensation.The same principle applies at the decoder of a DVC system, where Motion Com-pensated Interpolation (MCI) is used based on the frames buffered at the decoderin order to generate the predicted side-information.

Initial pixel-domain proposals generated side-information at the decoder byaveraging the reference frames [5], which has the advantage of being computationallyefficient. On the down side, wherever there is motion in a video sequence, averagingfalls short of producing a good side-information, and the prediction is in general oflow quality (ghosting artifacts, especially at low bit rates). In order to boost theaccuracy of the prediction and the overall rate performance of the system, recent

52

Page 75: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

3. Distributed Video Coding Overview

Figure 3.7: MCI SI generation framework.

state-of-the-art architectures generate the side-information using MCI. Traditionalmotion estimation and compensation methods used at the encoder in predictivevideo coding are not suitable for frame interpolation, as they aim to find the bestprediction for the current frame in a rate-distortion sense. For frame interpolation,when the entire current frame needs to be constructed, a suitable criterion is to esti-mate the true motion, and based on that to perform motion compensation betweentemporally adjacent frames. The state of the art MCI framework [9, 10, 18, 29] ispresented in Fig. 3.7. Its processing steps are as follows:

• Both reference frames are low-pass filtered (averaging) in order to increasemotion vector reliability.

• Subsequently, the motion between the key frames is estimated using a blockmatching algorithm. For each block in the next reference frame, the best matchis found in the previous reference frame, within a given search window, andthe resulting motion vector is intercepted in the current frame (see Fig. 3.8a).The motion vectors serve as candidates for every non-overlapping block in thecurrent frame, and from the available candidate vectors, the motion vectorthat intercepts the interpolated frame closer to the center of block underconsideration is selected. At the end of the Forward Motion Estimation phaseeach block as an associated motion vector, as seen in Fig. 3.8a.

• The Bidirectional Motion Estimation scheme refines the motion vectors previ-ously obtained by selecting linear trajectories between the two reference framespassing at the center of the blocks in the interpolated frame. Every selectedmotion vector will be split in two new equal and symmetric motion vectors,thus generating a bidirectional motion field, as illustrated in Fig 3.8b.

• In order to remove possible false motion vectors, a spatial smoothing algorithmis employed. If a motion vector is diverging from the estimated motion field, itwill be removed by the means of a weighted median vector filter as describedin [9].

53

Page 76: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 3

(a) Forward motion estimation

(b) Bidirectional motion estimation

(c) Overlapping block structure

Figure 3.8: MCI steps illustrations.

54

Page 77: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

3. Distributed Video Coding Overview

• Once the final motion vector field is derived, the side information frame isobtained by performing bidirectional overlapped block motion compensation(OBMC) [29]. By allowing blocks to overlap, the same pixel will belong tomultiple blocks (see Fig. 3.8c), and will have multiple possible motion vectorsassigned to it. Instead of using a single symmetric motion vector per block,the OBMC prediction uses motion information from overlapping neighboringregions. As detailed in [28], two overlapped block motion-compensated framesare derived based on the forward, and respectively the backward motion fields,and the final proposed side-information frame is the mean of the two. OBMCis implemented as a windowed averaging, where every pixel is predicted as aweighted average of the candidate predictor pixels.

In order to illustrate the result of side-information generation mechanisms, wepresent two examples in Fig. 3.9. The original sequences are 128 × 128 pixels(courtesy of Mr. J. Hanca) and the GOP of the encoding Wyner-Ziv (WZ) framestructure is 8. For each of the two sequences, we present the two intra-coded keyframes that serve as reference for the side-information generation algorithms:

• Fig. 3.9a and 3.9c for the first sequence;

• Fig. 3.9f and 3.9h for the second sequence.

Between them we can see the original frame to be predicted – 3.9b for the firstsequence and 3.9g for the second sequence. In Fig. 3.9d and 3.9i we can see thegenerated side-information frame using MCI, while Fig. 3.9e and 3.9j denoted withWA are side-information frames generated with simple frame averaging (weightedaverage between two pixel values based on the distance). Since the GOP is big andthe two key frames differ, the quality of the side-information is low in dynamic areasdue to motion in the sequences. Moreover, the naive weighing method suffers fromobvious ghosting artifacts.

3.3.2 Hash-Based Motion Estimation

A fundamental problem of the MCI-based side-information generation methodsis that motion estimation is performed without access to the original frames. Inpredictive coding the encoder has access to the original reference frames, and thisessential difference enables high performance prediction algorithms as well as ratesavings – by exploiting correlations over many frames, i.e., very large GOP sizes.DVC architectures perform motion estimation at the decoder based on reconstructedframes, and the accumulating errors lead to a severe decrease in performance withthe increase of the GOP size. Furthermore, the prediction quality of the MCI-basedmethods in scenes which are highly dynamic is reduced due to the complex motionpatterns.

55

Page 78: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 3

(a) Key Frame 1 (b) Original (c) Key Frame 2

(d) Side-information MCI (e) Side-information WA

(f) Key Frame 1 (g) Original (h) Key Frame 2

(i) Side-information MCI (j) Side-information WA

Figure 3.9: Side-information frame examples – courtesy of Mr. J. Hanca

56

Page 79: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

3. Distributed Video Coding Overview

To overcome this significant drawback, enhanced DVC schemes [3, 11, 63] wereproposed which transmit auxiliary information from the encoder, with the aim ofhelping the prediction process at the decoder. This auxiliary information is calledhash information. Usually, the hash is a low-quality sub-sampled version of theoriginal WZ frame [3, 63] or quantized low frequency DCT coefficients [37]. It issent when the MCI is foreseen to fail, and it increases the prediction quality atthe expense of extra rate. For a detailed overview of hash-based side-informationgeneration, please refer to [19,28].

3.4 Correlation Channel Modeling

In Wyner-Ziv coding, the correlation dependency between the source and the side-information determines the compression rate. In the Distributed Source Coding(DSC) theory and in the theoretical analysis presented in the previous chapter,the correlation statistics are assumed to be known a-priori, therefore allowing for aprecise computation of the rate-distortion characteristic. Unfortunately, in practicalDVC systems, the current frame is known only by the encoder, and the generatedside-information frame is known only by the decoder. Since the encoder and decodercannot communicate, the correlation between the two frames can never be perfectlycomputed in DVC systems. This is why the correlation needs to be accuratelyestimated, as the correlation model holds an essential role in the decoding process.

SID versus SII Models - the Mathematics

Early works assumed the noise introduced by the correlation channel to be additive,spatially stationary Laplacian and independent of the channel input values (i.e.,side-information independent SII) [20, 46]. More recently, an alternative model wasintroduced [28, 86] in which the noise characteristics are not uniform, but dependon the input signal (i.e., side-information dependent SID). Experimental as well astheoretical results [28,86] have validated the constant gains obtained by SID modelswhen compared to their SII counterparts.

Starting from the above-mentioned models, correlation channel estimation meth-ods can be used to determine the noise statistics at the decoder. The noisecharacteristics are used as input for Slepian-Wolf decoding of Wyner-Ziv framesand are essential for efficient bit rate expenditure.

A mathematical interpretation of the distinction between the SII and SID modelscan be given by assuming the noise to be additive, and writing X = Y + N , whereX is the source, Y is the side-information and N the noise. We follow the expositionin [28] where Y is the input of the correlation channel – hence the name of themodel, i.e., input dependent – and X is the output. The characteristics of the noise

57

Page 80: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 3

(a) SII - Input Independent (b) SID - Input Dependent

Figure 3.10: Correlation channel diagrams.

can be expressed by the conditional pdf of the output given the input:

fX|Y (x|y) = fX−Y |Y (x − y|y) = fN |Y (n|y)

The SII assumption, whereby the noise is independent of the side information, allowsfor the following simplification:

fX|Y (x|y) = fN |Y (n|y) = fN (n)

In the case when the noise is a stationary process, the difference between thetwo models can be depicted as in Fig. 3.10.

SID versus SII Models - Empirical Evidence

In order to illustrate the result of applying the two different modelling strategies,i.e., SII and SID, in practical DVC, we will present the example of the 71st frameof the Foreman 30Hz CIF test sequence [27]. The correlation was assumed to bezero-mean Laplacian, with constant standard deviation σ for SII model, and σ(y)for the SID model or; the SID model of the source given the side-information canbe then written as:

fX|Y (x|y) =1

σ(y)√

2exp(

√2|x − y|σ(y)

) (3.1)

As will be seen, the SID channel model finds its motivation in the empiricalconditional probability mass function of the video data. Figure 3.11a presentsthe measured correlation in the pixel domain between the original frame and itsgenerated side-information. As such, random variable Y corresponds to the samplevalues of the side-information, and the correlation is the conditional probabilitymass function as resulting from empirically measuring pX,Y (x, y) and then applying:pX|Y (x|y) = pX,Y (x, y)/

∑x pX,Y (x, y), ∀y. The SII model considers the standard

deviation of the noise to be independent of y for the analyzed frame, and the resulting

58

Page 81: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

3. Distributed Video Coding Overview

(a) Experimentally measured correlation

(b) SII spatially stationary model

(c) SID spatially stationary model

Figure 3.11: The correlation channel fX|Y (x|y) for frame 71 of Foreman 30Hz CIF.[27]

59

Page 82: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 3

correlation is presented in Fig. 3.11b. In contrast, the SID model assumes the noiseto be dependent of the side-information y, and the resulting correlation model canbe seen in Fig. 3.11c.

It is straightforward to see that the SID model is a better approximation of theempirical correlation than the SII model. As such, using the former is expected toyield systematic gains in the Wyner-Ziv compression system. This rate gain hasbeen mathematically proven and experimentally corroborated in [27,28]

SID versus SII Models - the Binary Case

We wish to give a graphical representation of the difference between the SII andthe SID models, assuming stationary noise. When the side-information Y has analphabet size of K and the input source X is continuous, the projections of the SIIand SID correlation channel probability density functions onto the (X, Y ) plane canbe visualized in Fig. 3.12.

In 3.12a the noise has a constant variance corresponding to the SII model; thedependency in (3.1) which describes input-dependent variance of the noise can beseen in 3.12b, and this corresponds the SID model. As shown in [27], following theterminology in [25], the SII model is a K-ary input, continuous output symmetricLaplacian channel, while the SID model is equivalent to a K-ary input, continuousoutput asymmetric Laplacian channel.

We make the observation that in case both X and Y are binary, the SII and SIDmodels correspond to the Binary Symmetric Channel (BSC) and Binary AsymmetricChannel (BAC) models, respectively. The BSC and the BAC have the followingassociated transition matrices, respectively:

p(y|x) =

[(1 − p) p

p (1 − p)

]

and p(y|x) =

[(1 − a) a

b (1 − b)

]

.

Here, p, a, b ∈ [0, 1] are probabilities and a 6= b. In the asymmetric case, theprobability of observing 1 as output when 0 was the input is not equal to theprobability of observing 0 as output when 1 was the input.

This symmetric/asymmetric dichotomy has not been explored in the literature,and a theoretical justification for the fact that asymmetric models outperformsymmetric ones was the initial goal of our research. The basic instance of thisproblem is the binary case, i.e., binary source of information with binary correlationchannel, and then the conclusions can be generalized.

60

Page 83: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

3. Distributed Video Coding Overview

(a) σSII = σ, ∀yk (b) σSID = σ(yk), ∀yk

Figure 3.12: Graphical representation, that is, 2D projection of the 3D histogram,of the (a) SII and (b) SID correlation channel model [28].

3.5 DVC systems for Capsule Endoscopy and 1K

pixel video applications

The problem of source coding in the presence of side-information available only atthe decoder has found application in many practical scenarios where the encodingdevices are limited in terms of computational power and energy consumption. Themost promising application scenarios [66] include wireless low-power surveillance,visual sensor networks or mobile video cameras. The DVC paradigm can be suc-cessfully applied in such scenarios, as it allows a complexity shift from the encoder(classical video coding case) to the decoder, which can be a resourceful base-station.DVC schemas have additional built-in benefits such as error resilience, given by theuse of powerful channel codes, as well as scalability, when adopting layered WZconstructions [23].

We consider two scenarios in which DVC is required due to the limited compu-tational capabilities of the devices acquiring and coding video data. In what follows,we describe their respective coding architectures.

3.5.1 Codec Description - Capsule Endoscopy

Recent advances in the miniaturization of sensing devices made wireless capsuleendoscopy a viable alternative to the classical and more invasive means of gastroin-testinal diagnose [82]. With the size of a large pill, a wireless capsule endoscopecomprises a light source, an integrated chip video camera, a radio telemetrytransmitter and a limited lifespan battery. Current capsule endoscopic video systemsoperate at modest frame resolutions, e.g., 256×256 pixels, and low frame-rates, e.g.,

61

Page 84: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 3

2−5 Hz, on a battery life time of approximately 8 hours. Given the small-scale natureof the recording device, the computational complexity of the encoding algorithmshas to be severely constrained. At the same time, since the recorded video is usedfor medical diagnosis, the quality of the decoded video is of extreme importance.

In brief, at the encoder, the video frame sequence is separated into key frames,which are entropy coded, and WZ frames, which are encoded using Wyner-Zivprinciples. The key frames are encoded using the H.264/AVC [100] intra-frame codec.The WZ frames are encoded in two stages: for every Wyner-Ziv frame, the encoderfirst generates and codes a hash - auxiliary downscaled information which assiststhe decoding process; then, the WZ frame undergoes a discrete cosine transform(DCT), quantization using a set of predefined quantization matrices [8] and is codedusing LDPC Accumulate (LDPCA) [92], which is a capacity achieving channel code.At the decoder, the hash bit-stream is reconstructed and used in generating theside information. The side-information generation algorithm performs overlappedblock motion estimation (OBME) [31] using the available hash information and thehierarchical structure of reconstructed Wyner-Ziv and/or key frame as references.Finally, the derived side information is used as input for the iterative LDPCAdecoder. The codec is presented in more detail in [35].

3.5.2 Codec Description - 1-K pixel camera

Inexpensive low resolution sensors can be efficiently used in applications such assecurity and surveillance, where accurate occupancy maps can be generated using anetwork of sensors acquiring video data of 64 × 48 pixels [48], or even 30× 30 pixels[39]. Even though the resolution is very low, the limited computational power andenergy supply poses severe constraints on the complexity of the encoder. As such,efficient coding and transmission methods are of high importance in this scenario aswell.

The characteristics of the encoding system have to be tailored to the videosensors: 30 × 30 video resolution, low complexity and low memory, while offeringreal-time execution at a frame rate of 25 frames per second. The encoding processstarts with the separation of the input video sequence into key frames and WZframes. The key frames are encoded using a simplified version of H.264/AVC intra,with only one block size, i.e., 4 × 4, and no mode decision. The intra codec wasdescribed in [17].

The WZ codec also uses only 4 × 4 blocks. Given the low resolution, a skipmode is employed if the difference between current block and the block located atthe same position in the previous frame is below a given threshold. Non-skip blocksare transformed using an integer DCT, then quantized using predefined quantizationmatrices [52] and encoded using a 132 bit LDPCA; we note that even if the codewordlength is small, the encoding delay is minimal and the compression system remains

62

Page 85: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

3. Distributed Video Coding Overview

efficient. At the decoder, the key frames are intra decoded and serve as referencefor motion estimation and side-information generation. The preferred method forside-information generation is OBME [31]. The decoder can use a feedback channelfor rate control, by asking bits until successful decoding. If feedback is absent, astoping criterion is used to determine whether the decoding attempt was successfulor not. The codec is presented in more detail in [52].

3.5.3 Predictive vs DVC coding

In [59], the authors observe that, although in theory WZ coding can gain over 6 dBover conventional coding without motion search, in practice it falls 6 dB behind thebest motion-compensated prediction inter-frame codecs. They propose a comparisonbetween distributed and predictive coding by employing a model which separatesthe losses into three categories: system loss (due to the lack of side information atthe encoder), source coding loss (due to inefficient channel codes and quantizationschemes) and video coding loss (due to loss of reference information). Their analysistargets the last category, relying on the impact of the subpixel and multi-referencemotion search methods in the generation of the side information.

It is our intention to give a realistic comparison between predictive and dis-tributed video coding in practical scenarios, i.e., video sensors, where the use ofpowerful codecs is not possible due to resource scarcity and power consumptionlimitations. Following our recent analysis of binary WZ coding in Chapter 6, wepresent a evaluation of the encoding performances of our state-of-the-art DVCencoding systems [35, 51, 52] when compared predictive coding for two practicalapplications involving video sensors: 256 × 256 pixels capsule endoscopy data and30 × 30 low resolution video. The test sequences are application specific, and areacquired by the respective sensors, as illustrated in Fig. 3.13. The endoscopic videomaterial was obtained from clinical examinations, the acquisition rate being 2 fpsand the frame resolution 256 × 256 pixels [35]. For the 1K-pixel sensors, the testsequences were 30 second videos recorded at a frame rate of 25 Hz [52]. Only group-of-pictures (GOPs) of size 2 and 4 were considered. The quantization parameterpairs for the key frames and WZ frames were chosen such that the Peak Signal-Noise Ratio (PSNR) differences between the intra and WZ frames is less than 1 dB,therefore retaining almost constant quality of the decoded frames.

The proposed DVC architectures can be deployed on the actual sensors, whilethe conventional H.264/AVC inter-frame codec is too computationally demanding.As such, all experiments were performed offline. For the capsule endoscopy case, wecompare the best possible settings of our DVC codec, i.e. exploiting a feedbackchannel and using the hash-based OBME to generate side information, againstH.264/AVC intra (only intra-coded frames – GOP: III...) and H.264/AVC no motion(one intra-coded frame followed by predictive coded frames – GOP: IPP...P). The

63

Page 86: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 3

(a) A frame from a capsule endoscopy sequence.

(b) Frames from 30 × 30 sequences

Figure 3.13: Typical frames from the acquired sequences.

64

Page 87: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

3. Distributed Video Coding Overview

Rate [kbps]0 50 100 150 200 250

PS

NR

[dB

]

34

36

38

40

42

44

H.264 No MotionDVCH.264 Intra

(a) GOP 2

Rate [kbps]0 50 100 150 200 250

PS

NR

[dB

]

34

36

38

40

42

44

H.264 No MotionDVCH.264 Intra

(b) GOP 4

Figure 3.14: RD performances for capsule endoscopy sequences.

65

Page 88: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 3

Rate [kbps]10 20 30 40 50 60 70

PS

NR

[dB

]

30

32

34

36

38

40

42

44

H.264 IntraH.264DVC fbckDVC no fbck

(a) GOP 2

Rate [kbps]0 10 20 30 40 50 60 70

PS

NR

[dB

]

30

32

34

36

38

40

42

44

H.264 IntraH.264DVC fbackDVC no fback

(b) GOP 4

Figure 3.15: RD performances on 1K-pixel sequences.

66

Page 89: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

3. Distributed Video Coding Overview

results can be seen in Fig. 3.14. In the 1K-pixel case we compare our DVC codec,both with and without feedback, against H.264/AVC Main Profile and H.264/AVCintra. Here, we have chosen the full H.264/AVC setup instead of the no-motionsetup, since the DVC codec performs particularly well on very low resolutions. Theresults are given in Fig. 3.15.

For the capsule endoscopy sequences, the trade-off between encoding complexityand compression performance is clearly visible. In the case of a GOP size of 2, DVCkeeps up to the level of the no-motion H.264/AVC, especially in the medium qualityrange. On the other hand, for a GOP of 4, DVC is constantly performing worse thanH.264/AVC intra. This is to be expected, as the temporal correlations are limitedand the quality of the side information deteriorates.

For the 1K-pixel sequences, with lower resolution and higher frame rate, the skipmode can be efficiently used by the encoder, and the performances of the codec areclose to the powerful H.264/AVC codec. It should be noted that, since the distributedcodec runs on the sensor, it employs a less efficient intra codec for the key framesand uses a very short codeword length for the channel code. Despite of this, for bothGOP sizes of 2 and 4, the DVC performance is highly competitive, and the feedbackversion even outperforms H.264/AVC for medium quality reconstructions.

3.6 Conclusions

Although, in essence, this thesis deals with an information-theoretic problem withapplications in source coding, our work is motivated by a practical problem.Specifically, the practical motivation of our work lies in the finding that assumingasymmetric channel models in distributed video coding leads to compression per-formance improvements compared to symmetric channel models. The specific casewhen both source and side-information have binary alphabets leads to the binaryWyner-Ziv coding problem with asymmetric correlation. In the next chapters wewill derive rate-distortion bounds for this problem that will enable us to confirm(i) the finding that the highest rate required to encode a source corresponds to thedoubly symmetric case presented in [103], and (ii) that an asymmetric correlationinducing the same average distortion leads to a lower encoding rate.

In this chapter we have explored the fundaments of classical video coding, aswell as of DVC. This allowed us to underline the fundamental difference betweenthe two paradigms, namely the presence/absence of the side-information at theencoder. Classical predictive coding systems are known to outperform DVC systemsfor a majority of the relevant use-cases. An evaluation of the possible maximumloss in rate incurred by the absence of side-information at the encoder stands as arelevant problem.

Therefore, this chapter also presents the assessment of the difference between

67

Page 90: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 3

Wyner-Ziv and predictive coding in two practical coding scenarios where thedistributed coding paradigm is particularly relevant due to complexity restrictionsat the encoder: wireless capsule endoscopy and 1-K pixel video surveillance. It hasbeen shown that, baring restrictions on the encoder computational power, Wyner-Ziv coding can come reasonably close in terms of rate performances to the wellestablished classical predictive video codecs.

Regarding the following chapters, which deal exclusively with the binary Wyner-Ziv problem, we would like to make the following important mention: a practicalWyner-Ziv video coding system is not a simple binary system like the one wewill consider, and the bit rate allocation process cannot be driven by finding thebinary Wyner-Ziv coding rate-distortion function. The binary case can be used fora qualitative assessment of the Wyner-Ziv coding systems, in general. The analysisof the binary case will provide a justification for the systematic gains of asymmetricmodels over symmetric models and will give a measure of the rate loss in the Wyner-Ziv coding case, when compared to predictive coding.

Our goal is not to improve the coding performances of practical DVC systems,but to explain observations that were made when using practical DVC systems.

68

Page 91: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 4

Binary Rate-distortion with

Encoder-Decoder Side Information

4.1 Introduction

In this chapter we provide the derivation of the rate-distortion function for binarysource coding with encoder-decoder side information, which corresponds to classicalpredictive coding. We have opted for a gradual exposition of the problem, namely,first we will present the case of the uniform source, which will serve as a lighterpreamble of the general case; the latter is more complex, but the layout of the proofswill be based and supported through analogies to the uniform source particularcase. The derivations follow closely the proofs in the published articles, namely, theuniform source case has been presented in [75, 77, 78], while the generalization ispart of [76].

The chapter is structured as follows. Section 4.2 introduces the setup of thesystem we refer to, namely the binary source coding with side information givenby a Binary Asymmetric Channel (BAC) and made available to both the encoderand the decoder. Section 4.3 presents the derivation of the rate-distortion functionfor the uniform source case as a particular case of Section 4.4. which derives therate-distortion bound for a generic binary source. Section 4.5 draws the conclusionsof this chapter.

4.2 Problem definition - System overview

We consider the following setup: let (X, Y ) ∈ X × Y be correlated binary randomvariables, such that the source X ∼ Bernoulli(π) is to be encoded using the sourceY as side information. The correlation between the two sources is described by a

69

Page 92: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 4

Figure 4.1: Setup for source coding with side information available both at encoderand decoder.

binary asymmetric channel:

p(y|x) =

[(1 − a) a

b (1 − b)

]

, (4.1)

(a, b) ∈ [0, 1]2. The reconstructed source is the binary variable X ∈ X , and thedistortion metric considered is the Hamming distance: if (x, x) ∈ X × X thend(x, x) = 0 if x = x, and d(x, x) = 1 if x 6= x. Our goal is to describe the ratedistortion characteristics of this system when the side information Y is available atboth the encoder and decoder, corresponding to the conventional predictive case.The system we refer to is presented in Fig. 4.1.

When the side information is available both at the encoder and the decoder, therate-distortion function is given by [12]:

RX|Y (D) = infp(x|x,y): E[d(x,x)]≤D

I(X; X|Y ), (4.2)

where E[∙] denotes the expectation operator, d(∙, ∙) is the distortion metric and D

is the expected distortion.

4.3 The Rate-Distortion Function - the Uniform

Source Case

Let us consider the source to be binary uniform, i.e., X ∼ Bernoulli(0.5). Thequantity to be minimized in (4.2) can be written as

I(X; X|Y ) = H(X|Y ) − H(X|X, Y )

= H(X|Y ) − H(X ⊕ X|X, Y )

≥ H(X|Y ) − H(D|Y ), (4.3)

70

Page 93: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

4. Binary Rate-distortion with Encoder-Decoder Side Information

where H(∙) denotes the binary entropy function and D = X ⊕ X. With the X − Y

correlation channel given by (4.1), the maximum distortion to be considered is theaverage crossover of the side information channel, i.e., if X = Y : Dmax = a+b

2 . Weconsider the inverse channel Y − X characterized by the transition matrix:

p(x|y) =

[1 − a∗ a∗

b∗ 1 − b∗

]

=

(1−a)1−a+b

b1−a+b

aa+1−b

(1−b)a+1−b

. (4.4)

We have p(Y = 0) = 1−a+b2 and p(Y = 1) = a+1−b

2 , and the maximum distortioncan be written as

Dmax = p(Y = 0) ∙ a∗ + p(Y = 1) ∙ b∗ =a + b

2.

Without loss of generality, let a ≤ b in (4.1). This implies a∗ ≥ b∗ in (4.4), as willbe shown next. Since we are in the uniform source case, π = 1/2 and Dmax = a+b

2 ,with a+b

2 ≤ 1/2, or (a + b) ≤ 1. Then a∗ ≥ b∗ can be proven to be true whenever(a + b) ≤ 1, by writing the following:

b

1 − a + b≥

a

a + 1 − b⇔ ab + b − b2 ≥ a − a2 + ab

⇔ (b − a) ≥ (b2 − a2)

⇔ 1 ≥ a + b

We will now express the expected distortion in function of the realisations of theside-information Y . Given that the side information is known at the encoder, wecan always describe the overall distortion as being

d = E[D] = p(Y = 0) ∙ dY =0 + p(Y = 1) ∙ dY =1, (4.5)

where dY =0 = E[d(X, X)|Y = 0] ≤ a∗ is the distortion corresponding to allrealizations of Y = 0 and dY =1 = E[d(X, X)|Y = 1] ≤ b∗ is the distortioncorresponding to all realizations of Y = 1.

Maximizing the term H(D|Y ) in (4.3) with the above mentioned constraintsgives the minimum for I(X; X|Y ). This is an optimization problem that can be

71

Page 94: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 4

expressed using the Karush-Kuhn-Tucker conditions:

maximize p(Y = 0) ∙ H(dY =0) + p(Y = 1) ∙ H(dY =1)

subject to

p(Y = 0) ∙ dY =0 + p(Y = 1) ∙ dY =1 = d

dY =0 ≤ a∗

dY =1 ≤ b∗

(4.6)

We formulate the Lagrangian optimization problem with the above inequalityconstraints, and the Lagrangian function is:

J = p(Y = 0) ∙ H(dY =0) + p(Y = 1) ∙ H(dY =1)+

λ(p(Y = 0) ∙ dY =0 + p(Y = 1) ∙ dY =1 − d)+

λ0(dY =0 − a∗) + λ1(dY =1 − b∗)

(4.7)

Taking the derivative with respect to the unknown variables (dY =0, dY =1) gives:

{dY =0 = dY =1 = d, if d ≤ b∗

dY =0 = d− a2

p(Y =0) , dY =1 = b∗, if d > b∗(4.8)

This yields the following form for the predictive rate distortion function:

RX|Y (d) =

p(Y = 0) ∙ [H(a∗) − H(d)] + p(Y = 1) ∙ [H(b∗) − H(d)], if d ≤ b∗

p(Y = 0) ∙ [H(a∗) − H( d− a2

p(Y =0) )], if b∗ ≤ d ≤ Dmax

0, if d ≥ Dmax

(4.9)This can be achieved by considering an auxiliary variable U given by a binarysymmetric channel with input X and output U , with crossover probability p0 asfollows:

p0 =

{d, if d ≤ b∗

d− a2

p(Y =0) , if d > b∗(4.10)

The corresponding reconstruction function is:

if d ≤ b∗ then X = U

if d > b∗ then

{X = Y if Y = 1

X = U if Y = 0

(4.11)

72

Page 95: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

4. Binary Rate-distortion with Encoder-Decoder Side Information

4.4 The Rate-Distortion Function - General Case

We remove the constraint on the uniformity of the source. Let X ∼ Bernoulli(π),with π ∈ [0, 0.5]. In order to establish the rate-distortion function when thecorrelation between the source and the side information is described by a BAC[see (4.1)], we introduce the maximum distortion acceptable for the reconstructionof source X – see also Section 2.7.1.

Definition. The maximum distortion (maximum expected Hamming distance) onsource X can be expressed as

Dmax = min{π, (1 − π)a + πb}, (4.12)

Davg , (1 − π)a + πb

where the second term Davg represents the average crossover of the correlationchannel X − Y given by (4.1).

Note that in the uniform source case, the maximum distortion is always givenby the average crossover probability of the correlation channel.

Alternatively, by considering as previously the inverse channel Y − X:

p(x|y) =

[(1 − a∗) a∗

b∗ (1 − b∗)

]

, (4.13)

where a∗ = πb(1−π)(1−a)+πb and b∗ = (1−π)a

(1−π)a+π(1−b) , the average crossover of thecorrelation channel can be expressed in function of p(y) as follows:

Davg = p(Y = 0) ∙ a∗ + p(Y = 1) ∙ b∗. (4.14)

Then, our result is formulated in the following theorem.

Theorem 15. The rate-distortion function of binary source coding in the presenceof encoder-decoder side information with the Hamming-distance as distortion metricand the correlation expressed by the binary asymmetric channel is given by (4.15).

Proof. The proof follows the same steps as in the uniform source case. The quantity

RX|Y (D) =

p(Y = 0) ∙ [H(a∗) − H(D)] + p(Y = 1) ∙ [H(b∗) − H(D)], if D ≤ min(a∗, b∗)

p(Y = 0) ∙[H(a∗) − H

(D−(1−π)a

p(Y =0)

)], if b∗ ≤ a∗ ≤ D ≤ Dmax

p(Y = 1) ∙[H(b∗) − H

(D−πb

p(Y =1)

)], if a∗ ≤ b∗ ≤ D ≤ Dmax

0, if D ≥ Dmax

(4.15)

73

Page 96: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 4

to be minimized in (4.2) can be written as

I(X; X|Y ) = H(X|Y ) − H(X|X, Y )

= H(X|Y ) − H(X ⊕ X|X, Y )

≥ H(X|Y ) − H(X ⊕ X|Y ), (4.16)

where H(∙) denotes the binary entropy function, i.e., H(p) = −p log p − (1 −p) log (1 − p).

The knowledge of the side information at the encoder allows the definition ofthe distortions for specific realizations of Y , namely, dY =i is the distortion whenY = i, i ∈ {0, 1}. The expected distortion can be expressed as follows:

D = p(Y = 0) ∙ dY =0 + p(Y = 1) ∙ dY =1 (4.17)

with dY =0 ≤ a∗ and dY =1 ≤ b∗. We will use this, because the right term in inequality(4.16) can be written as:

H(X|Y ) − H(D|Y ) =∑

y

p(y)∑

x

[− p(x|y) ∙ log p(x|y) + p(D|y) ∙ log p(D|y)

]

=∑

y

p(y) ∙ (H(X|y) − H(D|y)) (4.18)

The first entropy term in (4.18) is a constant since the conditional distribution p(x|y)is fixed. The goal is to maximize the second entropy term, namely, H(D|Y ). Thiscan be formulated as a constrained maximization problem, as follows:

maxdY =0,dY =1

p(Y = 0) ∙ H(dY =0) + p(Y = 1) ∙ H(dY =1)

subject to

{p(Y = 0) ∙ dY =0 + p(Y = 1) ∙ dY =1 ≤ D

dY =0 ≤ a∗ and dY =1 ≤ b∗(4.19)

In order to solve the above problem, we need to establish min{a∗, b∗}, that is, theminimum between the crossovers of the channel Y −X given in (4.13). We find that:

{b∗ ≤ a∗ if (1 − π)2a(1 − a) ≤ π2b(1 − b)

b∗ > a∗ if (1 − π)2a(1 − a) > π2b(1 − b).(4.20)

74

Page 97: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

4. Binary Rate-distortion with Encoder-Decoder Side Information

Hence, by solving for π in the two cases of (4.20) we obtain:

min(a∗, b∗) =

a∗, if π < 1

1+√

b(1−b)a(1−a)

b∗, if π ≥ 1

1+√

b(1−b)a(1−a)

.

Furthermore, the maximum distortion defined in (4.12) is written as

Dmax =

{π, if π < a

a+1−b

(1 − π)a + πb, if π ≥ aa+1−b .

The problem in (4.19) can be solved by Lagrange optimization using the KKTconditions. To the problem in (4.19) we associate the following Lagrangian function:

J = p(Y = 0) ∙ H(dY=0) + p(Y = 1) ∙ H(dY=1)
  + λ1 (p(Y = 0) ∙ dY=0 + p(Y = 1) ∙ dY=1 − D) + λ2 (dY=0 − a∗) + λ3 (dY=1 − b∗)   (4.21)

which is to be minimized with respect to (dY=0, dY=1), and parameters λ1, λ2, and λ3 are non-negative. To solve this dual problem, i.e., the minimization of (4.21), we distinguish the following cases:

• setting λ2 = λ3 = 0, then

  ∂J/∂dY=0 = p(Y = 0) ∙ log((1 − dY=0)/dY=0) + λ1 ∙ p(Y = 0) = 0
  ∂J/∂dY=1 = p(Y = 1) ∙ log((1 − dY=1)/dY=1) + λ1 ∙ p(Y = 1) = 0,

  which lead to the solution dY=0 = dY=1 = D, such that dY=0 ≤ a∗ and dY=1 ≤ b∗;

• setting λ2 = 0 and dY=1 = b∗, the solution becomes dY=0 = (D − (1 − π)a)/p(Y = 0) and, of course, dY=1 = b∗;

• setting λ3 = 0 and dY=0 = a∗ leads to the solution dY=1 = (D − πb)/p(Y = 1), for a∗ ≤ dY=1 ≤ b∗ and, of course, dY=0 = a∗;

• finally, setting dY=0 = a∗ and dY=1 = b∗ leads to the solution D = (1 − π)a + πb.

By replacing the resulting values of (dY=0, dY=1) in (4.18), it can be readily shown that the four cases above correspond to the cases in the rate-distortion function as given in (4.15).
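
For illustration only, the closed form (4.15) evaluates directly; the following Python sketch is our own (function and variable names are arbitrary, not from the original development):

import math

def H(x):
    # binary entropy function, in bits
    return 0.0 if x <= 0.0 or x >= 1.0 else -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def rate_ed_si(pi, a, b, D):
    """Sketch of R_X|Y(D) in (4.15); assumes 0 < pi < 1 and a valid BAC (a, b)."""
    py0 = (1 - pi) * (1 - a) + pi * b          # p(Y = 0)
    py1 = (1 - pi) * a + pi * (1 - b)          # p(Y = 1)
    a_star = pi * b / py0                      # crossover a* of the inverse channel (4.13)
    b_star = (1 - pi) * a / py1                # crossover b*
    d_max = min(pi, (1 - pi) * a + pi * b)     # maximum distortion (4.12)
    if D >= d_max:
        return 0.0
    if D <= min(a_star, b_star):
        return py0 * (H(a_star) - H(D)) + py1 * (H(b_star) - H(D))
    if b_star <= a_star:                       # d_{Y=1} saturates at b*; note p(Y=1)*b* = (1-pi)*a
        return py0 * (H(a_star) - H((D - (1 - pi) * a) / py0))
    return py1 * (H(b_star) - H((D - pi * b) / py1))   # d_{Y=0} saturates at a*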


We make the following remarks concerning the optimal decoding strategy that leads to the derived rate-distortion function. Referring to Fig. 4.1, the decoder must guess the source X by observing the side information Y and the auxiliary variable U. Without loss of generality, let b∗ ≤ a∗. In case the expected distortion level is D ≤ b∗, the reconstruction function ignores the side information. When the expected distortion becomes D > b∗, we observe two cases. Firstly, when Y = 1, the source reconstruction should be X̂ = Y = 1. In this case, no rate needs to be spent, corresponding to a constant distortion that is equal to p(Y = 1) ∙ b∗ = (1 − π) ∙ a. Secondly, when Y = 0, the decoder ignores the side information and sets X̂ = U.

4.5 Conclusions

In this chapter we presented the derivation of the rate-distortion bound for lossy source coding with encoder-decoder side information in the case when:

• the source is binary and may be non-uniform

• the correlation between the source and the side-information is given by a BAC.

An encouraging fact is that this problem admits an analytic solution. The importance of this result is that the rate-distortion bound for the encoder-decoder side information scenario represents an absolute lower bound for the Wyner-Ziv (WZ) case, i.e., when the side information is only available at the decoder. We will use this bound as a reference for the rate-loss evaluation of Wyner-Ziv relative to predictive coding proposed in Chapter 6.


Chapter 5

Binary Rate-distortion with Decoder Side Information: WZ Coding

5.1 Introduction

In this chapter we provide the derivation of a bound for the rate-distortion function for binary source coding with decoder side information. The system we refer to is presented in Fig. 5.1. Our initial goal was to derive an analytical rate-distortion bound for the binary Wyner-Ziv (WZ) problem under the most generic input conditions: arbitrary binary source and arbitrary binary correlation channel. As such, the source was assumed to be X ∼ Bernoulli(π), the correlation to be binary asymmetric, and the distortion measure to be the Hamming distance. In order to approach the solution from a mathematically rigorous analytical path, we assume the auxiliary variable U to be binary, even though from the Wyner-Ziv paper [103] we know that the maximum cardinality which can achieve the rate-distortion bound is |U| ≤ |X| + 1, i.e., |U| = 3. Nevertheless, the mathematical derivations in the case of the ternary auxiliary variable quickly become tedious, so |U| = 2 was a simplifying choice.

Figure 5.1: Setup for source coding with side information available at the decoder (Wyner-Ziv).


Somewhat to our disappointment, even the simplified version of the binary WZ problem, assuming a binary auxiliary variable, turned out not to have a closed-form solution. The only way in which we could make use of our mathematical analysis was to propose a numerical algorithm which takes the proposed formulations into account and numerically produces the solution to the rate-distortion optimization problem.

At all times, we used a more generic numerical optimization algorithm, namely the Blahut-Arimoto algorithm [16], as implemented in [22], to verify the plausibility of our results. Contrary to the Blahut-Arimoto algorithm, which performs successive minimizations of specific cost functions in order to obtain the overall optimal probabilities, our method relies on analytical derivations: we analytically derive the expressions for rate and distortion, and use them to restrict the size of our search space.

In order to mitigate the numerical nature of the derived solution, we have also proposed an alternative analytical approximation which comes within 10⁻³ bps of the rate-distortion function obtained by the Blahut-Arimoto algorithm in [22].

The results of the numerical simulations showed a most interesting fact, namely that our proposed numerically-derived bound was identical to the rate-distortion function outputted by the Blahut-Arimoto algorithm, within the numerical precision of our algorithms. In spite of not being able to mathematically prove this fact, we have conjectured that a binary auxiliary variable is sufficient to achieve all points of the rate-distortion function for the binary WZ source coding problem.

As in the predictive coding case in Chapter 4, we will present the derivations of the rate-distortion bound for the simplifying scenario of the uniform source, and generalize the proof for any source distribution later on. The derivations follow closely the proofs in the published articles, namely, the uniform source case has been presented as part of [75], while the generalization is part of [76].

As for the structure of the chapter, Section 5.2 introduces the system we are referring to, namely binary source coding with side information at the decoder, when the side information is given by a Binary Asymmetric Channel (BAC). Section 5.3 presents the analytical derivations and the numerical solution for the particular case of the uniform source, while Section 5.4 generalizes the above for the general case of the non-uniform source. Section 5.5 illustrates our algorithm with a couple of relevant examples. Section 5.6 makes some considerations regarding the tightness of our bound and the optimality of choosing a binary auxiliary random variable. Section 5.7 proposes an alternative analytical bound which has negligible approximation error when compared to the bound given by the Blahut-Arimoto algorithm. Finally, Section 5.8 draws the conclusions of this chapter.


5.2 Problem definition - System overview

As in the case of the predictive coding setup, we consider (X, Y) ∈ X × Y to be correlated binary random variables, such that the source X ∼ Bernoulli(π) is to be encoded using the source Y as side information. Y is available only at the decoder, and not to the encoder. The correlation between the two sources is described by a binary asymmetric channel:

p(y|x) = [ 1 − a     a
           b         1 − b ],   (5.1)

with (a, b) ∈ [0, 1]². The reconstructed source is the binary variable X̂ ∈ X̂, and the distortion metric considered is the Hamming distance: if (x, x̂) ∈ X × X̂ then d(x, x̂) = 0 if x = x̂, and d(x, x̂) = 1 if x ≠ x̂. The system we refer to is presented in Fig. 5.1.

When the side information is available only at the decoder, the rate-distortion function is given by

RWZ(d) = inf over p(u|x) p(x̂|u, y) with E[d(X, X̂)] ≤ d of  I(X; U|Y),   (5.2)

where X̂ = f(U, Y) and U is an auxiliary random variable satisfying the Markov chains U − X − Y and X̂ − (U, Y) − X, such that E[d(X, X̂)] ≤ d. Let U ∈ U be binary, i.e., the outcome of a binary channel with input X and the transition matrix given by:

p(u|x) = [ 1 − p     p
           q         1 − q ],   (5.3)

with (p, q) ∈ [0, 1]².

5.3 The Rate-Distortion Function - the Uniform Source Case

Let us consider the case of the uniform source, X ∼ Bernoulli(0.5). We aim to express the rate and the distortion as functions of the unknown transition probabilities (p, q) in (5.3), and formulate a minimization problem in order to propose a rate-distortion bound and an achievability strategy.


5.3.1 Expression of the Rate

The quantity to be minimized in (5.2) is written as

I(X; U|Y) = H(U|Y) − H(U|X). (5.4)

Given the Markovianity U − X − Y, the channel between Y and U can be expressed as the concatenation of the channels Y − X and X − U. The first channel is characterized by the transition matrix given in (4.4), namely:

p(x|y) = [ (1 − a)/(1 − a + b)     b/(1 − a + b)
           a/(a + 1 − b)           (1 − b)/(a + 1 − b) ].   (5.5)

Knowing (5.5) and (5.3), their multiplication gives the transition matrix of the equivalent Y − U channel, namely:

p(u|y) = [ ((1 − a)(1 − p) + bq)/(1 − a + b)     ((1 − a)p + b(1 − q))/(1 − a + b)
           (a(1 − p) + (1 − b)q)/(a + 1 − b)     (ap + (1 − b)(1 − q))/(a + 1 − b) ].   (5.6)

Therefore, knowing (5.3) and (5.6), the expression in (5.4) lets us define R∗WZ(p, q) to be:

I(X; U|Y) ≜ R∗WZ(p, q)
          = H(U|Y) − H(U|X)
          = ((1 − a + b)/2) ∙ H( ((1 − a)p + b(1 − q)) / (1 − a + b) )
          + ((1 + a − b)/2) ∙ H( (a(1 − p) + (1 − b)q) / (1 + a − b) )
          − (1/2) ∙ [H(p) + H(q)].   (5.7)

This is the expression of the rate as a function of the probabilities (p, q) in the uniform source case.
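
As a quick illustration, (5.7) maps directly onto a short Python function; the following is our own sketch (names are arbitrary, not from the original development):

import math

def H(x):
    # binary entropy function, in bits
    return 0.0 if x <= 0.0 or x >= 1.0 else -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def rate_wz_uniform(p, q, a, b):
    """Sketch of R*_WZ(p, q) in (5.7) for the uniform source."""
    t0 = ((1 - a) * p + b * (1 - q)) / (1 - a + b)   # crossover of p(u|y) for Y = 0, see (5.6)
    t1 = (a * (1 - p) + (1 - b) * q) / (1 + a - b)   # crossover of p(u|y) for Y = 1
    return (0.5 * (1 - a + b) * H(t0)
            + 0.5 * (1 + a - b) * H(t1)
            - 0.5 * (H(p) + H(q)))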

5.3.2 Expression of the Distortion

The distortion expression can be determined by observing that for a fixed pair (u, y), x is given by the conditional distribution p(x|u, y). Since the decoder can only make a deterministic decision given u and y, the best choice is to output the maximum likelihood estimate of x:

x̂ = f(u, y) = arg max over x of p(x|u, y). (5.8)


We can write the error rate conditioned on (u, y) as [1 − p(f(u, y)|u, y)], so the average distortion is given by

d = Σ over (u, y) of (1 − p(f(u, y)|u, y)) p(u, y). (5.9)

After some basic simplifications, (5.9) can be reduced to:

d(p, q) = Σ over (u, y) of min( p(X = 0, y, u), p(X = 1, y, u) ). (5.10)

Knowing that, due to the Markov property U − X − Y ,

p(x, y, u) = p(x, y)p(u|x) = p(x)p(y|x)p(u|x) (5.11)

we can write p(x, y, u) as follows:

p(X = 0, y, u) = (1/2) ∙ [ (1 − a)(1 − p)     (1 − a)p
                           a(1 − p)           ap ],   (5.12)

and

p(X = 1, y, u) = (1/2) ∙ [ bq             b(1 − q)
                           (1 − b)q       (1 − b)(1 − q) ].   (5.13)

The distortion function will be the summation of 4 terms obtained by comparing element by element the two probability matrices (5.12) and (5.13).
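
The element-by-element comparison translates into a few lines of Python; again a sketch of ours, with hypothetical names:

def distortion_wz_uniform(p, q, a, b):
    """Sketch of d(p, q) in (5.10), built from the matrices (5.12)-(5.13)."""
    # 2 * p(X = 0, y, u) and 2 * p(X = 1, y, u); rows indexed by y, columns by u
    P0 = [[(1 - a) * (1 - p), (1 - a) * p],
          [a * (1 - p),       a * p]]
    P1 = [[b * q,             b * (1 - q)],
          [(1 - b) * q,       (1 - b) * (1 - q)]]
    return 0.5 * sum(min(P0[y][u], P1[y][u]) for y in (0, 1) for u in (0, 1))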

5.3.3 Symmetry considerations

At this point, it is useful to make the following observations regarding the symmetries of the R∗WZ(p, q) function in (5.7) and the distortion function d(p, q) given by (5.10) with respect to the crossover probabilities of the correlation channel p(y|x) in (5.1) and the crossover probabilities of p(u|x) given in (5.3):

• If in (5.1) we substitute the pair (a, b) by (1 − a, 1 − b), the functions R∗WZ(p, q) and d(p, q) do not change.

• If in (5.3) we substitute the pair (p, q) by (1 − p, 1 − q), the functions R∗WZ(p, q) and d(p, q) do not change.

Essentially, the substitutions are equivalent to a label swap at the output of the respective channels and do not affect the rate-distortion function. The consequence is that the domains of interest for the pairs (a, b) in (5.1) and (p, q) in (5.3) can be reduced from [0, 1]² to 0 ≤ a + b ≤ 1 and 0 ≤ p + q ≤ 1.


Using symmetry constraints we can give an explicit form to the distortion function by comparing the terms in (5.12) and (5.13); as such, (5.10) becomes:

D(p, q) = ( bq + min((1 − a)p, b(1 − q)) + min(a(1 − p), (1 − b)q) + ap ) / 2.   (5.14)

There can be four possible expressions for the above equation, each corresponding to a different reconstruction strategy at the decoder.

Without loss of generality, let a < b, fixed by the initial setup. The crossover probabilities p and q can vary. The possible expressions for (5.14) are the following:

• (1 − a)p < b(1 − q) and a(1 − p) > (1 − b)q is equivalent to having the reconstruction function X̂ = U and a distortion value of D1 = (p + q)/2;

• (1 − a)p < b(1 − q) and a(1 − p) < (1 − b)q is equivalent to having the reconstruction function X̂ = Y ∨ U (where ∨ is the binary OR operator) and a distortion value of D2 = (bq + (1 − a)p + a)/2;

• (1 − a)p > b(1 − q) and a(1 − p) > (1 − b)q is equivalent to having the reconstruction function X̂ = Y ∧ U (where ∧ is the binary AND operator) and a distortion value of D3 = ((1 − b)q + ap + b)/2;

• (1 − a)p > b(1 − q) and a(1 − p) < (1 − b)q is equivalent to having the reconstruction function X̂ = Y and a distortion value of D4 = (a + b)/2.

Summing up, we can write the distortion function as follows:

D(p, q) =
  D1 = (p + q)/2,  if X̂ = U
  D2 = ((1 − a)p + bq + a)/2,  if X̂ = Y ∨ U
  D3 = ((1 − b)q + ap + b)/2,  if X̂ = Y ∧ U
(5.15)

The distortion value in the 4th case, D4 = (a + b)/2, is a constant equal to the average crossover probability of the side-information correlation channel given in (5.1).

5.3.4 Existence of a unique solution

In this section we will formulate the optimization problem resulting from the mathematical derivation of the rate and distortion functions. As we will see, the system is formed from transcendental equations which do not admit an analytical solution. However, at a closer look, we are still able to prove that the system always admits a solution which is unique. The derivations were presented in [78].

Given R∗WZ(p, q) as in (5.7) and the expressions of the distortion function in (5.15), finding the lower bound can be formulated as a minimization problem with constraints:

for each d ∈ [0, (a + b)/2]:  minimize R∗WZ(p, q)
subject to 0 ≤ D(p, q) ≤ d ≤ (a + b)/2.

We will only consider the first case, i.e., D(p, q) = (p + q)/2, as the other alternatives follow identical reasoning, up to a constant term. We formulate the Lagrangian function associated with the minimization problem above:

J(p, q) = R∗WZ(p, q) + λ (d − (p + q)/2)

and we set the partial derivatives with respect to (p, q) to zero:

∂J(p, q)/∂p = 0  and  ∂J(p, q)/∂q = 0.

This gives the following:

(1 − a) log[ ((1 − a)(1 − p) + bq) / ((1 − a)p + b(1 − q)) ] − a log[ (ap + (1 − b)(1 − q)) / (a(1 − p) + (1 − b)q) ] = log((1 − p)/p) − λ

−b log[ ((1 − a)(1 − p) + bq) / ((1 − a)p + b(1 − q)) ] + (1 − b) log[ (ap + (1 − b)(1 − q)) / (a(1 − p) + (1 − b)q) ] = log((1 − q)/q) − λ
(5.16)

where the logarithms are in base 2. Eliminating λ from (5.16) and raising 2 to the power given by each side yields the following identity:

( ((1 − a)(1 − p) + bq) / ((1 − a)p + b(1 − q)) )^(1 − a + b) ∙ ( (a(1 − p) + (1 − b)q) / (ap + (1 − b)(1 − q)) )^(1 + a − b) = ((1 − p)/p) ∙ (q/(1 − q)).   (5.17)

For a known distortion level d = D(p, q), there is an additional linear dependence between p and q. In our case, by replacing q = 2d − p, the above becomes a transcendental equation in p, i.e., an identity involving both polynomial functions and functions that cannot be expressed as polynomials. It does not admit a closed-form solution. Nevertheless, we make the following claim:


Proposition 1. For every distortion level 0 ≤ d ≤ (a + b)/2, by imposing the distortion constraint in (5.15), the equation in (5.17) always admits a solution which is unique.

Proof. Considering 2d = p + q, we introduce the following notation for the terms in (5.17):

f1(p) = ((1 − a)(1 − p) + b(2d − p)) / ((1 − a)p + b(1 − 2d + p))
f2(p) = (a(1 − p) + (1 − b)(2d − p)) / (ap + (1 − b)(1 − 2d + p))
g1(p) = (1 − p)/p
g2(p) = (2d − p)/(1 − 2d + p)

For 0 < a + b < 1 in (5.1) we differentiate and obtain the following:

∂f1(p)/∂p = −(1 − a + b)² / [(1 − a)p + b(1 − 2d + p)]² < 0
∂f2(p)/∂p = −(1 + a − b)² / [ap + (1 − b)(1 − 2d + p)]² < 0
(5.18)

At the same time, we can write:

∂g1(p)/∂p = −1/p² < 0
∂g2(p)/∂p = −1/(1 − 2d + p)² < 0,
(5.19)

Given that ∂f^k(x)/∂x = k f^(k−1)(x) ∙ ∂f(x)/∂x and considering the negativity of the terms in (5.18) and (5.19), we have that both sides of the equality in (5.17) are strictly decreasing functions in p. Moreover, we observe that:

lim as p→0 of f1(p) = C1  and  lim as p→1 of f1(p) = C2
lim as p→0 of f2(p) = C3  and  lim as p→1 of f2(p) = C4
lim as p→0 of g1(p) ∙ g2(p) = +∞  and  lim as p→1 of g1(p) ∙ g2(p) = 0
(5.20)

where the C∗ are positive constants. Since g1(p) ∙ g2(p) is continuous, from (5.20) it follows that it is surjective on (0, +∞). As the left-hand side of (5.17) is also continuous, its values for p ∈ {0, 1} are in the open interval (0, +∞), and both sides of the equality are strictly decreasing, it follows that the two functions must intersect once and only once.

Essentially, the purpose of this analysis is twofold. First, it shows the nature of the equations that form our minimization problem, and justifies the use of a numerical algorithm to determine the solutions. Secondly, it shows that the problem is well posed and that, for correlation pairs (a, b) in (5.1) such that 0 < a + b < 1, the solution to the minimization problem always exists and is unique. This implies that the proposed bound will be achievable for any distortion d.

5.3.5 A Numerical Algorithm

In order to establish a characterization of the bound, we need to show how a certain distortion level can be achieved, i.e., what reconstruction function should be used. For example, if the required distortion level is d = 0, this can only be obtained using X̂ = U, since by analysing the terms in (5.15), we observe that min(D2) = a/2, min(D3) = b/2, and D4 is a constant.

As such, we consider all possible values for the expected distortion d, and we must find what pairs (p, q) can achieve it and the corresponding reconstruction strategy. As d grows from zero to Dmax, we can derive the following conclusions:

• If 0 ≤ d < a/(2∙(1 − b)), we have only one reconstruction possibility, denoted R I, namely X̂ = U, and d = D1.

• If a/(2∙(1 − b)) ≤ d ≤ b/(2∙(1 − a)), we have two possible reconstructions, R I and R II (i.e., X̂ = U ∨ Y), and d can have forms D1 or D2.

• If b/(2∙(1 − a)) ≤ d ≤ Dmax = (a + b)/2, we have three possible reconstructions, cases R I, R II and R III (i.e., X̂ = U ∧ Y), so d can have forms D1, D2 or D3.

• The reconstruction function X̂ = Y gives constant distortion D4 = Dmax.

Once we know how the different distortion values can be achieved, the optimal choice for the reconstruction function will be the one that yields the minimum rate. More specifically, we fix a distortion level; then, for every pair of values (p, q) that achieves the fixed distortion value as given by (5.15), we compute the corresponding rate value, with the rate function given by (5.7). The minimum rate gives the optimal transition probabilities (p∗, q∗), and the minimization is done numerically for every distortion value d ∈ [0, (a + b)/2]. The result of the algorithm will be the minimum achievable rate for every distortion level, denoted by R∗WZ(d), and the achievability strategy, namely, for every rate point the corresponding reconstruction function.

An illustration of the rate-distortion points resulting from the numerical computations is given in Fig. 5.2, where we have considered the correlation channel in (5.1) to be given by (a, b) = (0.1, 0.4).

Figure 5.2: RX|Y(D) and R∗WZ(D) for (a, b) = (0.1, 0.4), showing the achievable rate-distortion points for the three reconstruction functions (R I: X̂ = U; R II: X̂ = Y ∨ U; R III: X̂ = Y ∧ U) and the time-sharing region.

By numerical evaluation, we observe that the minimum for R∗WZ(d) can be achieved by letting X̂ = U at lower distortion levels, and by letting X̂ = U ∨ Y for high distortions. Since our goal is to determine the rate-distortion function for the binary WZ problem, we want our proposed bound to be a convex function; R∗WZ(d), on the other hand, may not be convex, as it is the union of two convex curves: one corresponding to X̂ = U, the other to X̂ = U ∨ Y. Fortunately, this union admits a "convexifying" time-sharing strategy: we must consider the lower convex envelope (l.c.e.) of the achievable points: Rbound(d) = l.c.e.{R∗WZ(d)}. The l.c.e. in this case consists of the common tangent of the two curves; the rate-distortion points in the tangent region are achievable through time-sharing. The time-sharing principle allows the use of the system at different rate-distortion characteristics during a certain time interval. Consider that pairs (R1, d1) and (R2, d2) are achievable, and fix λ ∈ [0, 1]. We let the system function in the first regime (R1, d1) for a fraction λ of the time, and in the second regime (R2, d2) for a fraction (1 − λ) of the time. The resulting rate and distortion averages will be Rts = λR1 + (1 − λ)R2 and dts = λd1 + (1 − λ)d2. By varying λ, the pairs (Rts, dts) cover all the functioning points on the segment between (R1, d1) and (R2, d2).
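
The lower convex envelope itself can be computed from the sampled achievable points with a standard lower-hull pass; the sketch below is ours, not the thesis' implementation:

def lower_convex_envelope(points):
    """Lower hull of (distortion, rate) points, sorted by distortion.
    Points on the hull are achievable directly; points on the segments
    between hull vertices are achievable by time-sharing."""
    hull = []
    for x, y in sorted(points):
        # pop while the last hull edge and (x, y) do not make a convex (left) turn
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (x2 - x1) * (y - y1) - (y2 - y1) * (x - x1) <= 0.0:
                hull.pop()
            else:
                break
        hull.append((x, y))
    return hull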

Fig. 5.2 presents R∗WZ(d), the common tangent corresponding to the time-sharing region, which gives the actual proposed rate-distortion bound Rbound(d), and the predictive case RX|Y(d), as an exemplification of the rate loss suffered by WZ coding. The optimal crossover pairs (p, q) that achieve the R∗WZ(d) bound are plotted as p(q) in Fig. 5.3, while Fig. 5.4 presents the same (p, q) pairs as functions of the distortion. The cross markers in both figures show the values that delimit the time-sharing region. For low distortions, when the reconstruction function is X̂ = U, the optimal channels are close to binary symmetric channels. When the transition to the reconstruction function X̂ = U ∨ Y occurs, there is a discontinuity in the p(d) and q(d) functions. That can be noticed in Fig. 5.3 as well, where the two green curve segments are disjoint.

Figure 5.3: Reconstruction regions in the (p, q) plane for a = 0.1 and b = 0.4 (I: X̂ = U; II: X̂ = Y ∨ U; III: X̂ = Y ∧ U), together with the optimal p(q).

Figure 5.4: Optimal p(d) and q(d) for a = 0.1 and b = 0.4.

5.4 The Rate-Distortion Function - the General Case

Let us now consider the generic setup for binary WZ coding:

• the source is a generic X ∼ Bernoulli(π) source

• the correlation channel is binary asymmetric as given by (5.1)

• the distortion is the Hamming distance

In the derivation of the rate-distortion bound we will follow the same steps as in the uniform source case. We will derive the rate and the distortion expressions as functions of the crossover probabilities (p, q) of p(u|x) in (5.3). Next we will pose an optimization problem constrained by the possible distortion function, which will be solved numerically. The non-uniformity of the source makes the analysis more complex, as the maximum distortion for which the WZ function is defined will no longer be given just by the average crossover of the correlation channel in (5.1), but will also be influenced by the source.

Essentially, the system can spend rate in two ways in order to find the realisations of the source X: remove uncertainty in the realisations of the source (given by H(X) = H(π)), or remove uncertainty in the realisations of the side information (given by H(Y|X)). In the uniform case, the source entropy is maximum; however, if the source is close to being constant, say p(X = 1) = 0.95, and the distortion introduced by the correlation channel is larger, the side information Y may not be useful.

5.4.1 Expression of the Rate

The quantity to be minimized in (5.2) is written as:

I(X; U|Y) = H(U|Y) − H(U|X). (5.21)

Given the Markovianity U − X − Y, the channel between Y and U can be expressed as the concatenation of two channels, namely, Y − X and X − U, given by (4.13) and (5.3), respectively.


We know that, as given by (4.13):

p(x|y) = [ 1 − a∗     a∗
           b∗         1 − b∗ ]
       = [ (1 − π)(1 − a)/((1 − π)(1 − a) + πb)     πb/((1 − π)(1 − a) + πb)
           (1 − π)a/((1 − π)a + π(1 − b))           π(1 − b)/((1 − π)a + π(1 − b)) ].

Also,

p(y) = [ (1 − π)(1 − a) + πb     (1 − π)a + π(1 − b) ].   (5.22)

The transition matrix of the channel Y − U is:

p(u|y) = p(x|y) ∙ p(u|x)
       = [ ((1 − π)(1 − a)(1 − p) + πbq)/((1 − π)(1 − a) + πb)     ((1 − π)(1 − a)p + πb(1 − q))/((1 − π)(1 − a) + πb)
           ((1 − π)a(1 − p) + π(1 − b)q)/((1 − π)a + π(1 − b))     ((1 − π)ap + π(1 − b)(1 − q))/((1 − π)a + π(1 − b)) ].   (5.23)

Therefore, based on (5.21), we establish the following definition for the rate function for the WZ problem under a binary asymmetric correlation channel. The rate function R∗WZ(p, q) is defined to be:

I(X; U|Y) ≜ R∗WZ(p, q)
          = H(U|Y) − H(U|X)
          = p(Y = 0) ∙ H( ((1 − π)(1 − a)p + πb(1 − q)) / ((1 − π)(1 − a) + πb) )
          + p(Y = 1) ∙ H( ((1 − π)a(1 − p) + π(1 − b)q) / ((1 − π)a + π(1 − b)) )
          − (1 − π) ∙ H(p) − π ∙ H(q).   (5.24)

5.4.2 Derivation of the Distortion

Following the derivation in Section 5.3.2, we can write the best reconstruction strategy to be:

x̂ = f(y, u) = arg max over x of p(x|y, u),   (5.25)

and the resulting distortion as:

d = Σ over (u, y) of min( p(X = 0, y, u), p(X = 1, y, u) ),   (5.26)


where p(x, y, u) = p(x, y)p(u|x) = p(x)p(y|x)p(u|x) is written as follows:

p(X = 0, y, u) = [ (1 − π)(1 − a)(1 − p)     (1 − π)(1 − a)p
                   (1 − π)a(1 − p)           (1 − π)ap ]   (5.27)

and

p(X = 1, y, u) = [ πbq             πb(1 − q)
                   π(1 − b)q       π(1 − b)(1 − q) ].   (5.28)

From (5.26), (5.27) and (5.28) it follows that the distortion function d(p, q) is defined to be:

d(p, q) = min( (1 − π)(1 − a)(1 − p), πbq ) + min( (1 − π)(1 − a)p, πb(1 − q) )
        + min( (1 − π)a(1 − p), π(1 − b)q ) + min( (1 − π)ap, π(1 − b)(1 − q) ).   (5.29)

It is useful to also know the optimal decision taken by the reconstruction function given in (5.25) for every pair (y, u). As such, we introduce the function xmap(y, u) : Y × U → X, which is defined as:

xmap(y, u) = arg max over x of p(x, y, u).   (5.30)

In our binary case, xmap can be represented as a 2 × 2 matrix Xmap, with Y ∈ {0, 1} and U ∈ {0, 1} as row and column indices, respectively. This matrix will indicate whether the reconstruction was X̂ = 0 or X̂ = 1. For example, if X̂ = U, then

Xmap = [ xmap(Y = 0, U = 0)     xmap(Y = 0, U = 1)
         xmap(Y = 1, U = 0)     xmap(Y = 1, U = 1) ] = [ 0 1
                                                         0 1 ],

while if X̂ = Y, Xmap = [ 0 0
                          1 1 ].

5.4.3 Symmetry Observations

Before establishing a more compact form for the distortion, we make a series of observations regarding the symmetry properties of the rate and distortion previously defined, which can be verified in the formulas for the rate, given in (5.24), and for p(x, y, u), given in (5.27) and (5.28). Our first observation is that if in (5.1) we substitute the pair (a, b) by (1 − a, 1 − b), the functions R∗WZ(D) and d(p, q) do not change. Secondly, if in (5.3) we substitute the pair (p, q) by (1 − p, 1 − q), the functions R∗WZ(D) and d(p, q) do not change. As before, the substitutions are equivalent to a label swap at the output of the respective channels and do not affect the rate-distortion function. The consequence is that the domains of interest for the pairs (a, b) and (p, q) can be reduced from [0, 1] × [0, 1] to 0 ≤ a + b ≤ 1 and 0 ≤ p + q ≤ 1.

Thirdly, if in the distribution of X we substitute π by 1 − π, the functions R∗WZ(D) and d remain the same if the following substitutions occur: (a, b) ↔ (b, a) and (p, q) ↔ (q, p). Therefore, we will consider only 0 ≤ π ≤ 0.5. If π > 0.5, we can reach the same initial input conditions by considering the problem for (1 − π), but then substituting (a, b) ↔ (b, a) and (p, q) ↔ (q, p). The new input is an equivalent "mirrored" version which will comply with the required conditions:

0 ≤ a + b ≤ 1, 0 ≤ p + q ≤ 1 and 0 ≤ π ≤ 0.5. (5.31)

5.4.4 Possible Values for the Distortion Function

We first underline that the maximum value of interest for the distortion is given by (4.12): Dmax = min{π, (1 − π)a + πb}, where the second term is the average distortion incurred if no rate is spent and the reconstruction is given by the side information, that is, when X̂ = Y we obtain D = (1 − π)a + πb.

Therefore

Dmax =
  π,  if π ≤ a/(a + 1 − b)
  (1 − π)a + πb,  if π > a/(a + 1 − b).
(5.32)

The distortion function in (5.29) is formed by the summation of four terms, each of them being the minimum of a pair of values. Given the symmetry observations in Sect. 5.4.3, we only consider setups that comply with the inequalities in (5.31). In theory, based on (5.29), there should be 16 possible forms for the resulting final distortion D - four terms, each with two possible values. In practice, some of the combinations never occur; in fact, we will show that d(p, q) can take one of five forms.

In what follows we analyze each of the terms in (5.29) and establish their respective contributions to the overall distortion D. The initial input of the source coding problem (in the WZ setup in this case) is given by the source distribution π and the correlation channel (a, b). In order to visualize the different reconstruction possibilities for the distortion function, we will illustrate the analysis with a numerical example (Fig. 5.5). We fix the triplet (π, a, b) to define a particular instance of the problem, and we will represent the terms in (5.29) in the (p, q) plane. Each of the four terms divides the (p, q) plane in two half-planes, and the decision boundary is a line. Our analysis will show those lines as well as the corresponding resulting reconstruction functions.

Figure 5.5: Distortion regions in the (p, q) plane for (a, b) = (0.2, 0.4) and (a) π = 0.2 ≤ a/(a + 1 − b) = 0.25, (b) π = 0.3 ≥ a/(a + 1 − b) = 0.25. The lines I, II and III correspond to the separation lines in Proposition 2.

Each of the minimization problems in (5.29) corresponds to a comparison between analogous terms in p(X = 0, y, u) [see (5.27)] and p(X = 1, y, u) [see (5.28)]. We distinguish four cases analogous to the values of (y, u), as follows:

• Regarding the term p(x, Y = 0, U = 0), we evaluate min( (1 − π)(1 − a)(1 − p), πbq ). From (5.31), we have that π ≤ 1 − π, b ≤ 1 − a and q ≤ 1 − p, so we always have πbq ≤ (1 − π)(1 − a)(1 − p). Therefore the term in (5.29) corresponding to X = 1 is smaller, and Xmap(Y = 0, U = 0) = 0 is constant.

• Regarding the term p(x, Y = 0, U = 1), we evaluate min( (1 − π)(1 − a)p, πb(1 − q) ). In the (p, q) plane, as seen in Fig. 5.5, the line that determines the separation between the two terms, which is marked with I, is given by (1 − π)(1 − a)p = πb(1 − q). If we consider the intersection with the axes of the (p, q) plane, we have that: if p = 0 then q = 1; or otherwise, if q = 0 then p = πb/((1 − π)(1 − a)) ≤ 1.

• Regarding the term p(x, Y = 1, U = 0), we evaluate min( (1 − π)a(1 − p), π(1 − b)q ). In the (p, q) plane, the line that determines the separation between the two terms, which is marked with II, is given by (1 − π)a(1 − p) = π(1 − b)q. Considering the intersection with the axes, we have: if q = 0 then p = 1; or otherwise, if p = 0 then q = (1 − π)a/(π(1 − b)), which is bigger than 1 for π ≤ a/(a + 1 − b), as seen in Fig. 5.5a, or smaller than 1 when π > a/(a + 1 − b), as seen in Fig. 5.5b.

• Regarding the term p(x, Y = 1, U = 1), we evaluate min( (1 − π)ap, π(1 − b)(1 − q) ). In the (p, q) plane, the line that determines the separation between the two terms, which is marked with III, is given by (1 − π)ap = π(1 − b)(1 − q). The intersection with the axes gives: if p = 0 then q = 1; or otherwise, if q = 0 then p = π(1 − b)/((1 − π)a), which is smaller than 1 when π ≤ a/(a + 1 − b) [see Fig. 5.5a], or bigger than 1 when π > a/(a + 1 − b) [see Fig. 5.5b].

As a consequence, we establish the following proposition:

Proposition 2. If π ≤ a/(a + 1 − b), there will be only two reconstruction functions; otherwise, for π > a/(a + 1 − b), there are three possible reconstructions.

Proof. This result is a direct consequence of the nature of the terms to be minimized in (5.29). Each of the terms in (5.29) compares one linear function of p and one linear function of q; for each of the 4 terms, the decision boundary is a line in the (p, q) plane. We will call these decision boundaries separation lines.

Both p and q are probabilities, and lie in the interval [0, 1]. The number of possible reconstruction functions is given by the number of regions created in the area (p, q) ∈ [0, 1] × [0, 1] by the 4 separation lines. In what follows we will use the analysis from Section 5.4.4 in order to determine the intersections of the 4 separation lines with the coordinate axes of the (p, q) plane. To visualize the significance of the resulting equations, we will make references to Fig. 5.5.

We make the observation that, based on the symmetry conditions in (5.31), πb < (1 − π)(1 − a).

In (5.29) we have the following 4 terms:

• corresponding to p(x, Y = 0, U = 0), the equation of the separation line is (1 − π)(1 − a)(1 − p) = πbq. The intersection points with the axes are (p = 0, q = (1 − π)(1 − a)/(πb)) and (p = 1, q = 0). Based on the symmetry condition, the first intersection point has coordinate q > 1, so outside the area of interest (p, q) ∈ [0, 1] × [0, 1]. This is why this separation line has no influence on the number of regions and is ignored in Fig. 5.5.

• corresponding to p(x, Y = 0, U = 1), the equation of the separation line is (1 − π)(1 − a)p = πb(1 − q). The intersection points with the axes are (p = 0, q = 1) and (p = πb/((1 − π)(1 − a)), q = 0). Based on the symmetry condition, the second intersection point always has coordinate p < 1, which can be seen in Fig. 5.5 by looking at the line marked with I.


The following two separation lines have a different behavior of the intersection coordinates:

• corresponding to p(x, Y = 1, U = 0), the equation of the separation line is (1 − π)a(1 − p) = π(1 − b)q. The intersection points with the axes are (p = 0, q = (1 − π)a/(π(1 − b))) and (p = 1, q = 0). In Fig. 5.5, this is the line marked with II.

• corresponding to p(x, Y = 1, U = 1), the equation of the separation line is (1 − π)ap = π(1 − b)(1 − q). The intersection points with the axes are (p = 0, q = 1) and (p = π(1 − b)/((1 − π)a), q = 0). In Fig. 5.5, this is the line marked with III.

Observe that the equation yielding the intersection coordinates is the same in both cases:

(1 − π)a = π(1 − b)  ⇔  π = a/(a + 1 − b).

If π < a/(a + 1 − b), we have (1 − π)a > π(1 − b) and line II is outside the interest region (p, q) ∈ [0, 1] × [0, 1]. It immediately follows that lines I and III have the point (p = 0, q = 1) in common, so there are two distinct resulting regions, and only two corresponding reconstruction functions. This can be seen in Fig. 5.5a.

If, however, π > a/(a + 1 − b), we have (1 − π)a < π(1 − b) and line III is outside the interest region (p, q) ∈ [0, 1] × [0, 1]. Since lines I and II don't intersect on the axes, there will be three distinct resulting regions, and three corresponding reconstruction functions. This can be seen in Fig. 5.5b.

We make the additional observation that the threshold value is the same one as in the case of the maximum distortion value given by (5.32), and equal to a/(a + 1 − b). As a consequence, we have established that if Dmax = π then there are two possible reconstruction functions, whereas if Dmax = (1 − π)a + πb (the average crossover of the correlation channel) then there are three possible reconstruction functions.

Proposition 3. There are five possible combinations of values for the expected Wyner-Ziv distortion:

• Xmap = [ 0 1
           0 1 ] is equivalent to having the reconstruction function X̂ = U and a distortion value of D1 = πq + (1 − π)p;

• Xmap = [ 0 1
           1 1 ] is equivalent to having the reconstruction function X̂ = Y ∨ U and a distortion value of D2 = πbq + (1 − π)(1 − a)p + (1 − π)a;

• Xmap = [ 0 0
           0 1 ] is equivalent to having the reconstruction function X̂ = Y ∧ U and a distortion value of D3 = π(1 − b)q + (1 − π)ap + πb;

• Xmap = [ 0 0
           1 1 ] is equivalent to having the reconstruction function X̂ = Y and a distortion value of D4 = πb + (1 − π)a;

• Xmap = [ 0 0
           0 0 ] is equivalent to having the reconstruction function X̂ = 0 and a distortion value of D5 = π,

where ∨ and ∧ are the binary OR and AND operators.

Proof. Since Xmap(0, 0) = 0, there are only 8 possible candidate functions. It is sufficient to make the observation that we cannot have the following two pairs of inequalities satisfied at the same time:

(1 − π)(1 − a)p < πb(1 − q)   [from p(x, Y = 0, U = 1)]
(1 − π)ap > π(1 − b)(1 − q)   [from p(x, Y = 1, U = 1)]

⇔  (1 − π)p/(π(1 − q)) < b/(1 − a)  and  (1 − π)p/(π(1 − q)) > (1 − b)/a,

since from (5.31) we have (1 − b)/a > b/(1 − a); and

(1 − π)a(1 − p) < π(1 − b)q   [from p(x, Y = 1, U = 0)]
(1 − π)ap > π(1 − b)(1 − q)   [from p(x, Y = 1, U = 1)]

⇔  (1 − π)a/(π(1 − b)) < q/(1 − p)  and  (1 − π)a/(π(1 − b)) > (1 − q)/p,

since from (5.31) we have (1 − q)/p > q/(1 − p).

The five possible cases and their respective values for the distortion are obtained by considering all possible combinations that are not excluded by the mentioned inequalities.

The last two cases have a constant output equal to Dmax, therefore they cannot be used to express variable distortions in the range of interest.

5.4.5 Rate-Distortion Bound - numerical algorithm

In order to formulate the rate-distortion minimization problem, we need to establish how a certain distortion level can be achieved, that is, what reconstruction function can be used. Once this is known, out of all the parameters (p, q) that can yield the desired distortion level, the pair corresponding to the minimum rate is chosen. The approach is summarized in Algorithm 2.

Achievable R(D) Points

We begin by pointing out that, excepting some very simple particular cases, the minimization problem does not have an analytic solution. The mathematical derivations can go as far as formulating the logarithmic equations that need to be solved in order to obtain the minimum rate for a given distortion. We have used a basic exhaustive-like algorithm to find optimal pairs (p, q) with a desired precision, typically in the order of 10⁻⁴.

Algorithm 2: Achieving Minimum Rate for a Target Distortion
Input: Triplet (π, a, b) corresponding to the distribution of the source and the crossovers of the correlation channel.
Output: The minimum encoding rate.
1: Determine the maximum distortion Dmax using (4.12).
2: Choose a target expected distortion level D < Dmax.
3: Given (π, a, b), determine the reconstruction functions that can be used by Proposition 2.
4: For the eligible reconstruction functions, find all pairs (p, q) satisfying (5.31) that yield the desired distortion level, as given by Proposition 3.
5: For all the pairs (p, q) found, compute the rate as given by (5.24).
6: Choose the pair (p, q) that corresponds to the minimum rate.
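
A self-contained brute-force rendering of this search in Python (a sketch of ours; the rate follows (5.24), the distortion follows (5.29), and the grid step and tolerance are illustrative):

import math

def H(x):
    return 0.0 if x <= 0.0 or x >= 1.0 else -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def rate_wz(pi, a, b, p, q):
    """R*_WZ(p, q) of (5.24)."""
    py0 = (1 - pi) * (1 - a) + pi * b            # p(Y = 0), see (5.22)
    py1 = (1 - pi) * a + pi * (1 - b)            # p(Y = 1)
    t0 = ((1 - pi) * (1 - a) * p + pi * b * (1 - q)) / py0
    t1 = ((1 - pi) * a * (1 - p) + pi * (1 - b) * q) / py1
    return py0 * H(t0) + py1 * H(t1) - (1 - pi) * H(p) - pi * H(q)

def distortion_wz(pi, a, b, p, q):
    """d(p, q) of (5.29)."""
    return (min((1 - pi) * (1 - a) * (1 - p), pi * b * q)
            + min((1 - pi) * (1 - a) * p, pi * b * (1 - q))
            + min((1 - pi) * a * (1 - p), pi * (1 - b) * q)
            + min((1 - pi) * a * p, pi * (1 - b) * (1 - q)))

def min_rate_for_distortion(pi, a, b, D, n=500, tol=1e-3):
    """Steps 4-6 of Algorithm 2: exhaustive sweep of (p, q) with p + q <= 1."""
    best = math.inf
    for i in range(n + 1):
        for j in range(n + 1 - i):
            p, q = i / n, j / n
            if abs(distortion_wz(pi, a, b, p, q) - D) <= tol:
                best = min(best, rate_wz(pi, a, b, p, q))
    return best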

For every distortion level D ∈ [0, Dmax], the goal is to find the pair (p, q) defining p(u|x) in (5.3) that minimizes the rate, while satisfying the distortion constraints:

minimize over (p, q):  R∗WZ(p, q)

subject to  0 ≤ d(p, q) ≤ D ≤ Dmax,
            d(p, q) has a form ∈ {D1, D2, D3, D4, D5}.   (5.33)

Since this minimization has to be solved for every value of D, the resulting function has D as argument, and we will denote it by R∗WZ(D).

The results of applying the minimization in Algorithm 2 are illustrated in Fig. 5.6. In this example, the crossover probabilities of the correlation channel are kept constant at (a, b) = (0.2, 0.4). The distribution of the source varies from an almost constant source when π = 0.05 to a uniform source when π = 0.5. We make a reference to Proposition 2 to notice that the threshold value of π which gives the transition from two to three possible reconstruction functions can be computed as πT = a/(a + 1 − b) = 0.25.

For every value of π we will look at two distinct aspects, each depicted in a figure:

• the first graph will present the achievable rate-distortion points, the corresponding reconstruction function, as well as the achievable minimum, i.e., R∗WZ(D);

• the second graph will show the optimal crossover probabilities of the X − U channel, i.e., (p, q), together with the corresponding reconstruction function; essentially, it represents the achievability strategy for all points in the previously mentioned rate-distortion graph.

Figure 5.6: Examples of the obtained R(D) and the corresponding optimal probability distribution (p, q) for crossover probabilities (a, b) = (0.2, 0.4) of the binary asymmetric correlation channel. Panels: (a)-(b) π = 0.05, (c)-(d) π = 0.2, (e)-(f) π = 0.25, (g)-(h) π = 0.3, (i)-(j) π = 0.4, (k)-(l) π = 0.45.

We make two general observations regarding the optimal reconstruction function:

• for low distortions the reconstruction used is X̂ = U; the side information is ignored since its quality is worse than the targeted distortion.

• for high distortions the reconstruction varies from X̂ = Y ∧ U (for small values of π) to X̂ = Y ∨ U (for values of π approaching 0.5).

Let us look at the graphs in Fig. 5.6 and discuss them briefly.

• In Fig. 5.6(a)-(b) we have π = 0.05. We can see that, for very low values of π, the side information is not used and X̂ = U. This is to be expected, since the source is almost constant and the average crossover of the correlation channel is high; the side information should be ignored. There are two available reconstruction functions: X̂ = U and X̂ = Y ∧ U. Hypothetically, the latter could also be used, but it never gives an optimal rate-distortion point.

• In Fig. 5.6(c)-(d) the value of π = 0.2 is still such that π < a/(a + 1 − b) = 0.25, so there are only two possible reconstruction functions. As π has increased, the optimal reconstruction takes the side information into account for high distortions. In the (p, q) plane, the optimal p(q) choice is the union of two functions, due to the two distinct reconstruction options.

We observe that the zero-rate point corresponds to (p = 0, q = 1), which yields U = 0 constant and X̂ = U ∧ Y = 0 constant. Hence, we find E[d(X, X̂)] = p(X = 1) = π = 0.2.

• When π = a/(a + 1 − b) = 0.25, the maximum distortion is achieved for (p, q) = (0.5, 0.5), which implies that U is uniform and independent of X, and X̂ = Y; this is shown in Fig. 5.6(e)-(f).

• In Fig. 5.6(g)-(h) the value of π = 0.3 > a/(a + 1 − b) = 0.25. Based on Proposition 2, there are three reconstruction functions available. However, the function X̂ = Y ∨ U is still not used.

We observe that the zero-rate point now corresponds to (p = 1, q = 0), which yields U = 1 constant and X̂ = U ∧ Y = Y.

• There may be cases where all three reconstructions yield points on the bound, as is the case when π = 0.4, presented in Fig. 5.6(i)-(j).


• In Fig. 5.6(k)-(l), π = 0.45 and the optimal rate-distortion points have the same achievability strategy as the case of the uniform source, i.e., π = 0.5 (see Section 5.3.5).

We observe that the zero-rate point now corresponds to (p = 0, q = 1), which yields U = 0 constant and X̂ = U ∨ Y = Y.

Convex envelope

It is easy to observe that this procedure will result in a function, R∗WZ(D), that may not be convex. Nevertheless, its components are convex. In order to obtain the proposed rate-distortion bound, the lower convex envelope of R∗WZ(D) must be considered, i.e., RWZ(D) = l.c.e.{R∗WZ(D)}, which is equivalent to considering the common tangent of two curves. All the points on the rate-distortion curve will then be achievable either directly, or using a time-sharing strategy.

The intersection point of the common tangent with the lower curve will vary in the following manner, which can be followed in Fig. 5.7.

As long as there is only one reconstruction function used, as in Fig. 5.6a and 5.6b, there is no time-sharing region needed, as the function R∗WZ(d) is convex, so R∗WZ(d) = RWZ(d).

As long as the second curve corresponds to the reconstruction function X̂ = Y ∧ U, the tangency point will get closer to the maximum distortion point, as in Fig. 5.7a and 5.7b.

There is a unique value of π for which there is a transition from X̂ = Y ∧ U to X̂ = Y ∨ U, and the tangency point is on the D-axis, i.e., has coordinates (Dmax, 0), as in Fig. 5.7c. This case is worth mentioning, as it resembles the behavior of the doubly symmetric binary case; for low distortions X̂ = U, while for high distortions we only retain the point of maximum distortion, X̂ = Y.

As π further increases, the reconstruction function for the second curve remains X̂ = Y ∨ U, and the tangency point will move towards lower distortions, as seen in Fig. 5.7d and 5.7e.

5.5 From BSC to the Z-channel - examples

We now present representative examples of rate-distortion bounds, both for the predictive coding case and for the WZ coding case.

Uniform Source, BSC correlation

When the correlation channel is a BSC, i.e., a = b = p0 in (5.1), and the input is uniform, i.e., π = 0.5, the channel X − U in (5.3) will also be symmetric, i.e., p = q.

Figure 5.7: R(D) with common tangent. Panels: (a) π = 0.2, (b) π = 0.3, (c) π = 0.403, (d) π = 0.45, (e) π = 0.5.

Then, in (4.4) we have a∗ = b∗ = p0, and (4.9) gives:

RX|Y(D) =
  H(p0) − H(D),  if D ≤ p0
  0,  if D ≥ p0.

Furthermore, (5.7) and (5.10) become:

R∗(D) = H(p0 ∗ D) − H(D),
D = min{p0, p},

where ∗ is the binary convolution operation, i.e., a ∗ b = a ∙ (1 − b) + (1 − a) ∙ b. The zero-rate, a.k.a. maximum distortion, point (where R = 0 and D = p0) is achievable by considering X̂ = Y. RWZ(D) is a convex function, and the lower convex envelope is given by the tangent to R∗(D) that passes through the zero-rate point. This result is known from [103]. However, as shown in Fig. 5.8, if the source is not uniform, the behavior of the rate-distortion function changes, following the generic behavior presented in Section 5.4.5.
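
As a quick numeric illustration of this special case (a sketch of ours, with illustrative values):

import math

def H(x):
    return 0.0 if x <= 0.0 or x >= 1.0 else -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def conv(u, v):
    return u * (1 - v) + (1 - u) * v     # binary convolution u * v

p0, D = 0.25, 0.05                       # BSC crossover and a target distortion D <= p0
print(H(conv(p0, D)) - H(D))             # evaluates R*(D) = H(p0 * D) - H(D)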

Uniform source, Z-channel correlation

When the correlation is given by a Z-channel, we have a = 0 and b = p0. In order to write RX|Y(D), we observe that a∗ = p0/(1 + p0) and b∗ = 0 in (4.4). Then, (4.9) becomes:

RX|Y(D) =
  ((1 + p0)/2) ∙ [ H(p0/(1 + p0)) − H(2∙D/(1 + p0)) ],  if D ≤ p0/2
  0,  if D ≥ p0/2.
(5.34)

On the other hand, (5.7) can be written as:

R∗(p, q) = ((1 + p0)/2) ∙ H( (1 − p + p0 q) / (1 + p0) ) − (1/2) ∙ H(p) − (p0/2) ∙ H(q).   (5.35)

It was proven in [33] that, in the Z-channel correlation case, the optimal solution for the (p, q) pairs in the minimization of (5.35) is of the form:

p(1 − p)(1 − π)² = q(1 − q)(π p0)².   (5.36)

Using (5.36) in (5.35), it can be shown that the terms in (5.34) and (5.35) are identical; hence, in this case, WZ coding does not have a rate loss compared to predictive coding.


Figure 5.8: Examples of the R(D) function for a BSC correlation: (a) a = b = 0.3, π = 0.5, a uniform source, where the result is given in [103]; (b) a = b = 0.3, π = 0.3, a non-uniform source, where the result is obtained in this work.


5.6 Tightness of the Bound

We initially opted to find a closed-form solution for the generic problem, namely, the problem of encoding a non-uniform source with side information given by the binary asymmetric correlation channel. Such a rate-distortion formula can be used to optimally drive a binary coding system; for example, it can be used to allocate the rate required to reach a particular distortion value. Our analysis revealed that, even in the case where |U| = 2, the generic problem does not admit an analytical solution (we note that an analytical solution was available for the special case of the Z-channel correlation, as seen in the previous section). To address this issue, we proposed a novel way to calculate the rate-distortion bound.

5.6.1 Comparison with the Blahut-Arimoto Algorithm

The proposed algorithm constrains the auxiliary variable to be binary, and gives a rate-distortion bound and an achievability strategy. Contrary to the Blahut-Arimoto algorithm [16], which performs successive minimizations of specific cost functions in order to obtain the overall optimal probabilities, our method relies on analytical derivations: we analytically derive the expressions for rate and distortion, and use them to restrict the size of our search space. As such, our method has distinct advantages over the Blahut-Arimoto approach:

• Firstly, our method is much less complex than the Blahut-Arimoto approach; the latter needs as input the slope of the rate-distortion function and repeats the iterative optimization for every individual rate-distortion point. In fact, for certain values of the binary Wyner-Ziv rate-distortion slope, our implementation of the Blahut-Arimoto algorithm was very slow in converging to the optimal solution.

• Secondly, given a target distortion, our method can calculate the corresponding optimal rate; therefore, it can be used in a binary coding system for rate allocation purposes. Conversely, the target distortion level cannot be specified in the Blahut-Arimoto algorithm, because it takes as input the slope of the rate-distortion function.

It is worth mentioning that the Blahut-Arimoto algorithm may use an auxiliary variable of higher cardinality than |U| = 2 to find the optimal rate-distortion points. An important contribution of our approach, as opposed to a numerical method as in [16], is that we conjectured that an auxiliary variable U with cardinality |U| = 2 is sufficient to achieve the rate-distortion bound.

It is proven that the cardinality of the auxiliary variable U that achieves the rate-distortion function is upper bounded by |U| ≤ |X| + 1 [103]. Nevertheless, for simplicity, we have considered U to be binary, |U| = 2. Similarly, the authors of [74] and [49] have shown that, for lossless binary source coding with binary side information, the optimal size for the auxiliary variable is |U| = 2. We conjecture that the same is valid for the binary rate-distortion problem as well.

All along the process of evaluating our bound, we have used the Blahut-Arimotoalgorithm presented in [22] as reference technique, generating numerically the rate-distortion function. Since it is not analytical, the Blahut-Arimoto algorithm cannotbe guaranteed to converge to the precise solution, but it has a convergence stoppingcriterion which allows to specify a maximum accepted error. We fixed that precisionresolution to be 10−4. When computing the difference between our rate-distortionbound with the result obtained with Blahut-Arimoto algorithm [22], we observe thatthis difference is within the numerical precision of our experimental setup, indicatingthat the proposed bound may be tight. Exhaustive tests were performed for valuesof π and (a, b) with the same step of 10−4.

Therefore, we make the following conjecture:

Conjecture 1. Within the imposed precision of the numerical algorithms, theproposed rate-distortion bound is identical to the outcome of the Blahut-Arimotoalgorithm presented in [22], and |U| = 2 is sufficient to achieve the rate-distortionfunction in the case of a binary source with correlated binary side information atthe decoder and the Hamming distortion metric.

5.6.2 Rate-distortion bound for ternary auxiliary variable U

In the case when the side information is available only at the decoder, the rate-distortion function was given by [103] (5.2):

RWZ(D) = infp(u|x)p(x|u,y): E[d(x,x)]≤D

I(X; U |Y ),

where X = f(U, Y ) is a reconstruction function, such that E[d(x, x)] ≤ D.So far we have considered the auxiliary variable U to be binary, even though

the maximum cardinality, which offers the most flexibility to the solution space, is|U | = 3. Since the mathematical formulation of the problem in the ternary auxiliaryvariable case is very complicated, we will apply the same numerical strategy as inthe binary case to the ternary case, with the purpose of observing the correspondingrate-distortion bound.

Therefore, as an example (presented in [78]), we consider the uniform source caseX ∼ Bernoulli(0.5) and we assume that U ∈ U in (5.2) is the outcome of a ternarychannel with input X and the transition matrix given by:

p(u|x) =

[m n (1 − m − n)p q (1 − p − q)

]

, (5.37)

110

Page 133: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

5. Binary Rate-distortion with Decoder Side Information: WZ Coding

with 0 ≤ m + n ≤ 1 and 0 ≤ p + q ≤ 1. We express the formulas for the rate anddistortion as functions of the crossover probabilities (m, n) and (p, q); we only pointthe essential steps in the derivation, since they follow the same reasoning as in thebinary case.

From (5.1) and (5.37), we can write:

p(u|y) =

(1−a)m+bp(1−a+b)

(1−a)n+bq(1−a+b)

(1−a)(1−m−n)+b(1−p−q)(1−a+b)

am+(1−b)p(a+1−b)

an+(1−b)q(a+1−b)

a(1−m−n)+(1−b)(1−p−q)(1−a+b)

(5.38)

It follows that (5.2) can be written as:

I(X; U |Y ) , R∗WZ(m, n, p, q)

= H(U |Y ) − H(U |X)

=∑

y

p(y)H(U |y) −∑

x

p(x)H(U |x), (5.39)

where H(U |∗) is the conditional entropy function, which can be written using (5.37)and (5.38) as:

H(U |∗) = −∑

u

p(u|∗) log2 p(u|∗)

The expression for the distortion can be written as (5.10):

D =∑

u,y

min( p(X = 0, y, u), p(X = 1, y, u) ).

We can write p(x, y, u) as follows:

p(X = 0, y, u) =

12∙

(1 − a)m (1 − a)n (1 − a)(1 − m − n)

am an a(1 − m − n)

,(5.40)

and

p(X = 1, y, u) =

12∙

bp bq b(1 − p − q)

(1 − b)p (1 − b)q (1 − b)(1 − p − q)

.(5.41)

111

Page 134: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 5

Figure 5.9: Rate-distortion curves for (a, b) = (0.1, 0.4). The lower bound is thesame, and is achievable directly for |U | = 3 and by time sharing for |U | = 2.

Equations (5.40) and (5.41) can be used in (5.10) to write:

D(m, n, p, q) = ( min((1 − a)m, bp)+

min((1 − a)n, bq)+

min((1 − a)(1 − m − n), b(1 − p − q))+

min(am, (1 − b)p)+

min(an, (1 − b)q)+

min(a(1 − m − n), (1 − b)(1 − p − q)). (5.42)

Equations (5.39) and (5.42) can be used to perform a full search of the space of thepossible solutions (m, n, p, q) of the following minimization problem:

for each d ∈ [0,a + b

2] minimize R∗

WZ(m, n, p, q)

subject to: 0 ≤ D(m, n, p, q) ≤ d ≤a + b

2.

The example in Fig. 5.9 presents the result of the numerical analysis for (a, b) =

112

Page 135: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

5. Binary Rate-distortion with Decoder Side Information: WZ Coding

(0.1, 0.4) and |U| = 3. It also shows the rate-distortion function as obtained in thecase of |U| = 2, corroborated by the Blahut-Arimoto algorithm in [22]. One noticesthat the two bounds overlap. The main difference between the cases of the binaryand ternary auxiliary variable is the following:

• for card |U| = 3, the space of the possible solutions has a higher dimensionality,so all the points on the rate-distortion function are directly achievable;

• for card |U| = 2, there is a time sharing region (convex envelope, i.e., thecommon tangent) that corresponds to the linear portion of the rate-distortionfunction.

This observation comes as an additional argument to Conjecture 1. A possiblemathematical proof for this is foreseen as future work.

5.7 An Analytical Approximation of the Rate-

Distortion Function

Determining the bound achieving p(u|x) — see (5.3) — for the generic binaryWyner-Ziv problem (5.33) involves using the numerical Algorithm 2. We nowpropose an analytical approximation of the WZ rate-distortion function. FollowingConjecture 1, we will assume that a binary auxiliary variable is sufficient toapproach the rate-distortion function. We already know that, for a simple correlationmodel, i.e., the Z-channel, the rate-distortion function does surprisingly admit ananalytical expression [33], which also assumes a binary auxiliary variable; in thisparticular case, the optimal solution for p(u|x) is known to satisfy eq. (5.36).

Given the input source X ∼ Bernoulli(π) and correlation channel crossovers(a, b), we generalize the solution in the Z-channel case (5.36) to the following generalexpression for the (p, q) pairs:

p(1 − p)(1 − a)2(1 − π)2 = q(1 − q)b2π2 (5.43)

If the reconstruction function is X = U , the side information is not used, which canbe expressed by setting a = 0, b = 1 in (5.43). By doing so, (5.43) becomes:

p(1 − p)(1 − π)2 = q(1 − q)π2. (5.44)

Assume the source distribution π and the correlation channel (a, b) are known.Proposition 3 gives all the possible realisations of the distortion function. By solvingthe system obtained by combining the conditions in (5.43) and (5.44) with theequations for the distortions from Proposition 3, the solutions for the resulting pairs

113

Page 136: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 5

(a)

(b)

Figure 5.10: Proposed approximation for (a) the rate-distortion function R(D) and(b) (p, q) versus the numerically-obtained optimal values. The source and correlation-channel parameters are set to π = 0.4 and (a, b) = (0.2, 0.4).

114

Page 137: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

5. Binary Rate-distortion with Decoder Side Information: WZ Coding

(p, q) as functions of the distortion level D, are:

p = D(π−D)(1−2D)(1−π) , q = D(1−π−D)

(1−2D)(1−π) for X = U ;

p = (D−(1−π)a)(πb+(1−π)a−D)(1−π)(1−a)((1−π)(1+a)+πb−2D) , q = (D−(1−π)a)((1−π)−D)

πb((1−π)(1+a)+πb−2D) for X = Y ∨ U ;

p = (D−πb)(π−D)(1−π)a((1−π)a+π(1+b)−2D) , q = (D−πb)((1−π)a+πb−D)

π(1−b)((1−π)a+π(1+b)−2D) for X = Y ∧ U .

(5.45)

The three resulting (p, q) pairs define the proposed p(u|x), which in turn leadsto an approximation of the rate-distortion function. More specifically, for everydistortion value D, the resulting (p, q) pairs in (5.45) are substituted in (5.24) andthe minimum rate value (out of the three computed ones) is chosen. All the requiredderivations are analytically computable.

By using this approximation, the difference between the estimated and theactual rate-distortion function is below 10−3 bps. The differences between the ratedistortion regions, as well as the optimal (p, q) pairs, are presented in Fig. 5.10.The advantage brought by this approximation is clear: for a negligible error inthe estimated rate, the solutions are usable in an analytical form, and the use ofnumerical algorithms, which take non-negligible time to run, is no longer needed.

5.8 Conclusions

This section provides an in-depth analysis of the binary Wyner-Ziv problem inthe most general setup, i.e., when the source is Bernoulli(π) and the correlationbetween source and side-information is given by a BAC. In spite of numerous studieson source coding with side information, the rate-distortion function for the describedsetting was not known.

By making a simplifying assumption, we proposed a bound for the rate-distortionfunction which may be tight. Since the problem does not admit an analyticalsolution, we devised a numerical algorithm which solves the optimization problemwith a desired numerical precision. Moreover, we also proposed an analytical boundwhich can be used to approximate the true rate-distortion function with negligibleerror.

From an information theoretical perspective, an intriguing result is that therate-distortion function may be achieved using a binary auxiliary variable. We usedthis fact based on empirical evidence, but a mathematically rigorous proof shouldpresent interest for the research community.

The derived rate-distortion bound serves as reference for the proposed analysisof the rate-loss of binary WZ coding when compared to predictive coding.

115

Page 138: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

116

Page 139: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 6

Rate Loss

6.1 Introduction

This chapter proposes the analysis of the rate loss of Wyner-Ziv (WZ) coding whencompared to predictive coding in the case of binary sources. We have presented therate-distortion functions for the predictive case in Chapter 4 and, respectively, forthe WZ case in Chapter 5.

Let us give a representative illustration of the rate-loss variation by the meansof an example. Consider the source to be uniform, i.e., X ∼ Bernoulli(0.5), and thecorrelation between source and side information to be characterized by transitionprobabilities (a, b) as in (6.1).

p(y|x) =

[(1 − a) a

b (1 − b)

]

, (6.1)

with (a, b) ∈ [0, 1]2. In order to follow the rate-loss variation, we keep the averagecrossover of the correlation channel to be constant, but modify the crossoverprobabilities. Therefore, in (6.1), a and b vary, but such that a + b is kept constant.For an average crossover of a+b

2 = 0.25, the variation of rate-loss can be followed inFig. 6.1. The rate distortion bounds presented lead to the following observations: thehighest rate-loss and also the highest rate required to encode at any given distortionlevel D ≤ Dmax are obtained for the binary symmetric correlation channel. As theasymmetricity of the correlation channel increases, the rate-loss as well as the totalrate decreases. The intuitive motivation is that the entropy of the side informationdecreases, meaning that the quality of the side information increases. The rate-lossgoes to zero as we approach the Z-channel correlation case.

The chapter is structured as follows: Section 6.2 sets the stage for the upcominganalysis by citing previous literature on upper and lower bounds on the rate-loss.Section 6.3 gives the proof of the no-rate-loss property of the Z-channel. Sections

117

Page 140: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 6

DISTORTION0 0.05 0.1 0.15 0.2 0.25

RA

TE

[bps

]

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

(a) a = 0.01, b = 0.49

DISTORTION0 0.05 0.1 0.15 0.2 0.25

RA

TE

[bps

]

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

(b) a = 0.05, b = 0.45

Figure 6.1: Rate-distortion bounds for conventional coding with encoder-decoderside information and Wyner-Ziv coding, for constant maximum distortion Dmax =a+b2

= 0.25.

118

Page 141: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

6. Rate Loss

DISTORTION0 0.05 0.1 0.15 0.2 0.25

RA

TE

[bps

]

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

(c) a = 0.1, b = 0.4

DISTORTION0 0.05 0.1 0.15 0.2 0.25

RA

TE

[bps

]

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

(d) a = 0.25, b = 0.25

Figure 6.1: Rate-distortion bounds for conventional coding with encoder-decoderside information and Wyner-Ziv coding, for constant maximum distortion Dmax =a+b2

= 0.25.

119

Page 142: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 6

6.4 and 6.5 analyze the encoding rate and the rate-loss for binary WZ coding, whileSection 6.6 draws the conclusions of this chapter.

6.2 Rate-loss for Binary WZ Coding

The rate-loss in WZ coding has been investigated for the general setup in [105],where it was proven that the loss is upper-bounded by a well defined quantity, theminimax capacity. At the other extreme, there were previously two known cases ofsetups for which WZ coding does not incur a rate-loss when compared to predictivecoding. In Section 6.2.1 we will present the minimax capacity and evaluate themaximum rate-loss for the binary setup with the Hamming distortion metric. InSection 6.2.2 we will look at the no-rate-loss cases known in the literature, with anemphasis on the binary erasure correlation setup.

6.2.1 Upper Bound on the Rate-Loss in the WZ Problem

As defined by [105], the minimax capacity is the worst noise capacity of an additivenoise channel. Formally, let N ∈ X be a random variable independent from W ∈ X ,the input of an additive noise channel W → W + N . Considering d(∙) to be adistortion metric, we can define the capacity of this channel to be:

C(D, N) = maxEd(W )≤D

I(W ; W + N) (6.2)

Then, the worst noise capacity is:

CX (D) = minEd(N)≤D

C(D, N) (6.3)

Then, the minimax capacity bound as given by [105] is given by the followingtheorem:

Theorem 16. The rate-loss in the WZ problem is bounded by

RWZ(D) − RX|Y (D) ≤ CX (D) (6.4)

The theorem is proved and discussed further in [105]. The most importantmessage of this theorem is that the rate-loss corresponding to an uninformed encodercannot be arbitrarily high, but is bounded. In the binary source case CX (D) dependson the allowed distortion; for the Hamming distance, the minimax capacity is equalto:

C(D) = H(D ∗ D) − H(D), 0 ≤ D ≤ 0.5 (6.5)

where H is the binary entropy and ∗ is the binary convolution operator, i.e., D∗D =

120

Page 143: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

6. Rate Loss

0 0.1 0.2 0.3 0.4 0.50

0.05

0.1

0.15

0.2

0.25

Figure 6.2: Binary minimax capacity for the Hamming distance.

2D(1 − D). The function C(D) in (6.5) is presented in Fig. 6.2. The maximum ofthis function is obtained for a value of D = 0.125, and the corresponding upperbound on the rate-loss is RWZ(D)−RX|Y (D) ≤ 0.22 bits per symbol (bps) [105]. Inspite of being useful for setting a maximum on the loss, this bound is by no meanstight.

6.2.2 No-rate-loss Cases

The most well known no-rate-loss case is the Gaussian source with Gaussiancorrelation Wyner-Ziv setup, which was presented by Wyner in [102] and generalizedby Pradhan et. al. in [72]. Nevertheless, given our interest for binary sources, wewill focus more on a more recent instance, namely the binary uniform source casewith binary erasure correlation, presented in [67,68].

Consider the source X ∼ Bernoulli(0.5) and the correlated side information Y

given by an erasure correlation channel:

Yi =

{Xi with probability 1 − p

ε with probability p

The symbol ε is called an erasure, and it corresponds to a loss of information. Thedistortion measure considered is the Hamming distance.

When the side-information is available at both the encoder and the decoder, therate-distortion function for the above-defined problem is given by [67]:

RεX|Y (D) =

p(1 − H(D

p ))

, when D < p2

0, otherwise(6.6)

121

Page 144: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 6

Perron et. al. [67] gave the following result:

Theorem 17. For a binary uniform source and an erasure side-information,considering the Hamming distortion measure, the following holds:

RεX|Y (D) = Rε

WZ(D),

where RεWZ(d) is the corresponding Wyner-Ziv case rate-distortion function.

Proof. An initial note is that any strategy achieving a pair (R, D) in the Wyner-Ziv setup can also be applied in the predictive coding setup, by ignoring the side-information at the encoder. As such, the predictive case RX|Y (D) is always a lowerbound for the Wyner-Ziv rate-distortion RWZ(D).

Remember that the Wyner-Ziv rate-distortion formula is given by

RWZ(d) = minp(u|x)p(x|u,y): E[d(X,X)]≤d

I(X; U |Y ), (6.7)

where X = f(U, Y ) and U is an auxiliary random variable, satisfying the Markovchains: U − X − Y and X − (U, Y ) − X, such that E[d(X, X)] ≤ d.

Let U ∈ U be binary, given by a Binary Symmetric Channel (BSC) with inputX and crossover probability D

p .The rate function can be written as:

R = I(X; U |Y ) = H(X|Y ) − H(X|U, Y )

where H(X|Y ) = p.Then, we can write the following:

H(X|U, Y ) = pH(X|U, Y = ε) +1 − p

2H(X|U, Y = 0)

+1 − p

2H(X|U, Y = 1)

(a)= pH(X|U, Y = ε) +

1 − p

2H(X|U, X = Y = 0)

+1 − p

2H(X|U, X = Y = 1)

= pH(X|U, Y = ε)

(b)= pH(X|U)

= pH(X ⊕ U |U)

(c)= pH(X ⊕ U)

= pH(D

p).

122

Page 145: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

6. Rate Loss

Figure 6.3: Model of a Z-channel with crossover probability p0.

where the following are true:(a) holds since Y = X if Y 6= ε

(b) holds because of the Markov chain Y − X − U

(c) holds because X and U are independent.It follows that, for this choice of U , we have

R(D) = H(X|Y ) − H(X|U, Y ) = p − pH(D

p)

If the decoder chooses X = U if Y = ε and X = Y when Y ∈ {0, 1}, the expecteddistortion will be:

Pr(X 6= X) = (1 − p) ∙ 0 + p ∙D

p= D,

which corresponds to the RX|Y (D) bound.

6.3 No Rate-loss: the Z-channel Case

We will start our analysis of the rate loss with the result that triggered our interestin the problem in the first place, namely, the no-rate-loss property of the Z-channelcorrelation case. This work was published in [33].

Consider that the dependence between the source X and the side information Y

is expressed by a Z-channel, with crossover probabilities

p(y|x) =

{0 , if X = 0 and Y = 1

p0, if X = 1 and Y = 0(6.8)

This channel model is presented in Fig. 6.3.

123

Page 146: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 6

6.3.1 Source Coding with Encoder-Decoder Side Information

In the scenario where the side information is available both at the encoder and thedecoder the R(D) function is given by

RX|Y (D) = minp(x|x,y): E[d(X,X)]≤D

I(X; X|Y ),

where E[∙] denotes the expectation operator and D is the target distortion. A closedform expression of the R(D) function for the case of Z-channel correlation was givenin [84], that is,

RZX|Y (D) = (1 − π + πp0)

[

H

(πp0

1 − π + πp0

)

− H

(D

1 − π + πp0

)]

, (6.9)

where π = Pr[X = 1] parameterizes the probability distribution of the binary sourceto be encoded and H(∙) is the entropy function, H(p) = −p log2(p)− (1−p) log2(1−p), with p ∈ [0, 1].

6.3.2 Source Coding with Decoder Side Information

In the Wyner-Ziv coding case, when the side information is only available at thedecoder, the R(D) function is given by [103]

RWZ(D) = minp(u|x)p(x|u,y): E[d(X,X)]≤D

I(X; U |Y ) (6.10)

where X = f(U, Y ) and U is an auxiliary random variable satisfying the Markovchains: U − X − Y and X − (U, Y ) − X. In what follows we will assume that theauxiliary variable U is binary and we will derive a closed form expression for therate-distortion function in the Z-channel case RZ

WZ(D). By showing that the rate-distortion function in the WZ case is equal to the rate-distortion function in thepredictive case, i.e., RZ

WZ(D) = RZX|Y (D), we will draw two conclusions; firstly, the

Z-channel correlation case has no rate loss; secondly, a binary auxiliary variable issufficient to achieve the rate-distortion bound in the Z-channel correlation case.

6.3.3 No Rate-loss Proof

Theorem 18. Consider binary source coding in the presence of side informationwith the Hamming-distance as distortion metric. When the correlation between thesource and the side information is expressed by the Z-channel, Wyner-Ziv codingdoes not suffer a rate loss compared to source coding with side information available

124

Page 147: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

6. Rate Loss

at both the encoder and the decoder. Specifically,

RZWZ(D) = RZ

X|Y (D) =

(1 − π + πp0)

[

H

(πp0

1 − π + πp0

)

− H

(D

1 − π + πp0

)]

. (6.11)

Proof. The quantity to be minimized in equation (6.10) can be rewritten as

I(X; U |Y ) = H(U |Y ) − H(U |X). (6.12)

Let U be binary and the transition probabilities be given by:

p(u|x) =

[1 − p p

q 1 − q

]

, (6.13)

where p = p(U = 1|X = 0) and q = p(U = 0|X = 1), according to the asymmetricbinary channel model. From (6.8), the inverse Z-channel between Y and X is

p(x|y) =

1−π1−π+πp0

πp01−π+πp0

0 1

. (6.14)

Then, the channel between Y and U is given by the concatenation of the two abovementioned channels, namely,

p(u|y) = p(x|y) p(u|x)

=

(1−π)(1−p)+πp0q1−π+πp0

1 − (1−π)(1−p)+πp0q1−π+πp0

q 1 − q

. (6.15)

The entropies in (6.12) can be therefore expressed as

H(U |X) = −∑

x

p(x)∑

u

p(u|x) log(p(u|x))

= (1 − π) H(p) + π H(q), (6.16)

125

Page 148: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 6

and

H(U |Y ) = −∑

y

p(y)∑

u

p(u|y) log(p(u|y))

= (1 − π + πp0) H

((1 − π)(1 − p) + πp0q

1 − π + πp0

)

+ π(1 − p0) H(q). (6.17)

Replacing (6.16) and (6.17) in (6.12) gives

I(X; U |Y ) , R(p, q)

= (1 − π + πp0) H

((1 − π)(1 − p) + πp0q

1 − π + πp0

)

− (1 − π) H(p) − πp0 H(q). (6.18)

In order to determine the possible distortion functions and their correspondingreconstruction strategies, we remind that the decoder can only make a deterministicdecision given u and y, and the best choice is to reconstruct

x = f(u, y) = arg maxx

p(x|u, y),

which, in the particular case of the Z-channel correlation, becomes:

D =∑

u

minx

p(x, y = 0, u)

= min((1 − p)(1 − π), qπp0) + min(p(1 − π), (1 − q)πp0).

Letting 0 ≤ π ≤ 11+p0

(the complementary case yields the same results), andfollowing the variation of p, the values of the overall distortion will be:

• If (1 − p)(1 − π) < qπp0 then D = (1 − p)(1 − π) + (1 − q)πp0;

• If (1 − p)(1 − π) > (1 − π) − (1 − q)πp0 then D = qπp0 + p(1 − π);

• If (1 − p)(1 − π) ∈ [qπp0, (1 − π) − (1 − q)πp0] then D = πp0.

Therefore, we define the distortion function can be defined as:

D(p, q) , min

(

(1 − p)(1 − π) + (1 − q)πp0, qπp0 + p(1 − π), πp0

)

. (6.19)

Equations (6.18) and (6.19) enable us to formulate a Lagrangian optimizationproblem, as follows:

J(p, q) = R(p, q) + λD(p, q).

126

Page 149: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

6. Rate Loss

We consider (1 − p)(1 − π) < qπp0, since the case D = πp0 is not interesting(X = Y , constant distortion) and (1 − p)(1 − π) > (1 − π) − (1 − q)πp0 yields thesame solution. Computing the partial derivatives with respect to p and q and settingthem to zero gives

∂J

∂p= log

(p(1 − π) + (1 − q)πp0

(1 − p)(1 − π) + qπp0

1 − p

p

)

− λ = 0,

∂J

∂q= log

(p(1 − π) + (1 − q)πp0

(1 − p)(1 − π) + qπp0

q

1 − q

)

+ λ = 0.

By summing up the above we obtain the following:

(p(1 − π) + (1 − q)πp0

(1 − p)(1 − π) + qπp0

)2

=p(1 − q)(1 − p)q

⇔p(1 − p)(1 − π)2 = (πp0)2q(1 − q). (6.20)

Solving for p and q, the system formed by (6.19) and (6.20) gives

p = 1 − D(D−πp0)(1−π)(2D−(1−π+πp0))

q = 1 − D(D−(1−π))(πp0)(2D−(1−π+πp0))

(6.21)

For every D ∈ [0, πp0], equation (6.21) defines the parameters of the binary channelp(U |X) that yield the minimum rate.

Replacing the above forms for p and q in (6.18) gives an expression for thedesired R(D) function. This expression, however, may only be an upper bound.This is because we have assumed that |U| = 2. Next, we prove that the expressionobtained by replacing p and q in (6.18) matches the lowest R(D) bound, which isRZ

X|Y (D). This proves that the expression corresponds to the desired RZWZ(D) and

that Wyner-Ziv coding does not suffer a rate loss in this case.We consider that (1 − p)(1 − π) < qπp0 (for the other relevant case the proof

is similar), and express the entropy quantities in (6.9) and (6.18) in terms of p andq. Equations (6.19) and (6.20) are the expressions for the distortion and p(U |X)achieving the R(D) points. In this case we have D = (1 − p)(1 − π) + (1 − q)πp0,and after division by (1 − π + πp0), the identity to be proven becomes:

H

((1 − p)(1 − π) + qπp0

1 − π + πp0

)

+ H

((1 − p)(1 − π) + (1 − q)πp0

1 − π + πp0

)

=

H

(πp0

1 − π + πp0

)

+(1 − π)

1 − π + πp0H(p) +

πp0

1 − π + πp0H(q) (6.22)

127

Page 150: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 6

From (6.20) we can derive the following basic identities:

(1 − p)(1 − π) + (1 − q)πp0

1 − π + πp0=

(1 − p)(1 − π)(1 − p)(1 − π) + qπp0

, (6.23)

p(1 − π) + qπp0

1 − π + πp0=

qπp0

(1 − p)(1 − π) + qπp0, (6.24)

(1 − p)(1 − π) + qπp0

p(1 − π) + (1 − q)πp0=

qπp0

p(1 − π). (6.25)

The first binary entropy function in (6.22) is expanded as

H

((1 − p)(1 − π) + qπp0

1 − π + πp0

)

=

log(1 − π + πp0) −(1 − p)(1 − π) + qπp0

1 − π + πp0∙ log((1 − p)(1 − π) + qπp0)−

p(1 − π) + (1 − q)πp0

1 − π + πp0∙ log(p(1 − π) + (1 − q)πp0). (6.26)

Moreover, using (6.23), the second binary entropy function of (6.22) becomes

H

((1 − p)(1 − π) + (1 − q)πp0

1 − π + πp0

)

= H

((1 − p)(1 − π)

(1 − p)(1 − π) + qπp0

)

=

log((1−p)(1−π)+qπp0)−(1 − p)(1 − π) log((1 − p)(1 − π))

(1 − p)(1 − π) + qπp0−

qπp0 log(qπp0)(1 − p)(1 − π) + qπp0

.

(6.27)

Summing up (6.26) and (6.27), then substituting the terms on the left-hand sideof (6.23)–(6.25) with their right-hand side correspondents, and following basicarithmetic operations and convenient regrouping of the terms, we get

H

((1 − p)(1 − π) + qπp0

1 − π + πp0

)

+ H

((1 − p)(1 − π) + (1 − q)πp0

1 − π + πp0

)

=

log(1 − π + πp0) −1 − π

1 − π + πp0log(1 − π) −

πp0

1 − π + πp0log(πp0)−

1 − π

1 − π + πp0∙ [−p log(p) − (1 − p) log(1 − p)]−

πp0

1 − π + πp0[−q log(q) − (1 − q) log(1 − q)]. (6.28)

It is straightforward to show that this is equivalent to (6.22).

The fact that under this setup Wyner-Ziv coding does not suffer a rate losscompared to source coding with encoder and decoder side information is a somewhat

128

Page 151: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

6. Rate Loss

surprising discovery. This is another case for which the Wyner-Ziv no-rate-lossproperty is shown and proven except for the quadratic Gaussian case [103] andits extension [72].

6.4 Encoding Rate

We analyze the variation of the rate required to encode a binary source in the WZscenario, when the average crossover of the correlation channel is constant, but thecrossovers (a, b) in (6.1) vary. We address the case of Slepian-Wolf (a.k.a., lossless)coding but the numerical analysis shows that the conclusions can be generalized forany distortion level D = E [d(x, x)] ≤ Dmax.

The rate required to encode the source in the case of perfect reconstruction atthe decoder is H(X|Y ). We consider different sources X ∼ Bernoulli(π). For each ofthe sources, we consider different average crossover probabilities of the correlationchannel. The average crossover of the correlation channel in (6.1) is given by Davg =(1−π)a+πb, and can be achieved by different pairs (a, b) in (6.1). We will let a vary

from 0 to min(1,

Davg

(1−π)

), where the second value corresponds to having b = 0. The

variation of H(X|Y ) can be seen in Fig. 6.4. We make the following observations:

• For the same average crossover probability of the correlation channel Davg,the highest Slepian-Wolf rate corresponds to a symmetric source. The moreunbalanced the source is, the lower the rate needed to encode.

• For each source, considering a constant Davg, the lowest Slepian-Wolf rate isrequired in the case of the Z-channel correlation.

• For each source, considering a constant Davg, the Slepian-Wolf rate increasesfrom a minimum (corresponding to the Z-channel) to a maximum correspond-ing to a variable pair (a, b), which depends on the values of π and Davg.

• Only in the case of a uniform source, i.e., π = 0.5, the evolution of theSlepian-Wolf rate is symmetric, and the highest value corresponds to the BSCcorrelation, a = b = Davg [see Fig. 6.4d].

6.5 Rate Loss: the General Case

In order to describe the variation of the rate loss for WZ coding with respect topredictive coding, we perform an exhaustive set of numerical simulations consideringrepresentative values for all possible distributions for the source X, as well as allcorrelation channels p(y|x). Fig. 6.5 presents the maximum rate loss for variouscombinations of source distributions and average crossover probabilities of thecorrelation channel. For low values of the average crossover probabilities of the

129

Page 152: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 6

(a) π = 0.2

(b) π = 0.3

Figure 6.4: Slepian-Wolf rate, a.k.a., H(X|Y ), as a function of a = p(y = 1|x = 0)for a constant average crossover probability Davg of the binary asymmetric correlationchannel. The distribution of the binary input varies with respect to (a) π = 0.2, (b)π = 0.3, (c) π = 0.4, and (d) π = 0.5.

130

Page 153: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

6. Rate Loss

(c) π = 0.4

(d) π = 0.5

Figure 6.4: Slepian-Wolf rate, a.k.a., H(X|Y ), as a function of a = p(y = 1|x = 0)for a constant average crossover probability Davg of the binary asymmetric correlationchannel. The distribution of the binary input varies with respect to (a) π = 0.2, (b)π = 0.3, (c) π = 0.4, and (d) π = 0.5.

131

Page 154: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 6

0.50.4

AVERAGE CROSSOVER

0.30.20.100

0.25

0

0.08

0.06

0.04

0.02

0.5

MA

XIM

UM

R

AT

E L

OS

S

Figure 6.5: Maximum rate-loss as a function of p(X = 1) = π and the averagecrossover probability of the BAC correlation.

AVERAGE CROSSOVER0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5

MA

XIM

UM

RA

TE

LO

SS

0

0.02

0.04

0.06

0.08

Figure 6.6: Maximum rate-loss as a function of the average crossover probability ofthe BAC correlation, for π = 0.2.

correlation channel, the quality of the side information is very good, while forhigh average crossover values, the quality of the side information is very poor.In both situations, the difference between WZ and predictive coding is small. Itis straightforward that in the trivial cases, where the correlation is perfect (thatis, a = b = 0) or the source and the side information are uncorrelated (that is,a = b = 0.5), the rate loss becomes zero.

Alternatively, for any given source distribution, when the average crossoverprobability of the correlation channel changes, the maximum rate loss has avariation similar to the one presented in Fig. 6.6. There is an increase from zero—corresponding to perfect correlation—to a maximum rate loss, achieved for anaverage crossover which depends on the values of (π, a, b). Then a decrease to zerofollows, which corresponds to high average distortions.

132

Page 155: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

6. Rate Loss

10.80.60.40.200.5

0.40.3

0.2

0.02

0.04

0.06

0.08

0

0.10

Max

imum

Rat

e Lo

ss

Figure 6.7: Maximum rate-loss as a function of crossover probabilities (a, b) of theBAC correlation for a uniform source.

Figure 6.8: Maximum rate-loss as a function of the average crossover probability ofthe BSC correlation for the uniform source.

If the average crossover of the correlation channel is fixed, coding for the uniformsource has the highest rate loss: in Fig. 6.5, the rate loss decreases as π decreases.

We now consider the evolution of the rate loss for a fixed source. For a moredetailed analysis, we take a closer look at the uniform source case, i.e., π = 0.5. Fig.6.7 presents the variation of the rate loss, in function of the crossover probabilitypairs (a, b), in the domain of interest, namely 0 ≤ a + b ≤ 1, for the uniform sourcecase. In the (a, b) plane, the line a = b corresponds to the symmetric correlationchannels, and the sectioning plane observed in Fig. 6.7 corresponds to this line.If the average crossover of the correlation channel is kept constant, the rate losswill start from zero, corresponding to the case when the correlation is given by a Z-channel. It will increase to achieve a maximum value corresponding to the symmetriccorrelation channel. Then, it will decrease from the maximum value, and become

133

Page 156: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 6

zero for the 0.5 crossover probability correlation. It is very important to underlinethat the only nontrivial binary correlation channel with no rate loss for WZ codingis the Z-channel.

It is worth mentioning that, even if in the specific case of the uniform source, therate loss is maximum for the symmetric correlation, when the source is not uniform,the correlation channel achieving the maximum loss will deviate as well from thesymmetric correlation.

We have seen that, firstly, for constant average crossover probability of thecorrelation channel, the highest rate loss corresponds to the uniform sources.Secondly, when the source is uniform, the highest rate loss corresponds to the binarysymmetric correlation channel. If we are looking for the overall maximum rate lossvalue, we can consider only the uniform sources and binary symmetric correlations.

The variation of the rate loss in function of the crossover probability is presentedin Fig. 6.8. We can see again an increase of the rate loss up to a maximum ofΔR = 0.0765 bps, corresponding to an average crossover probability Davg = 0.227– as the correlation channel is symmetric, a = b = Davg = 0.227. It is interesting tonotice that this rate loss is an absolute maximum for our problem, and is significantlysmaller than the theoretical upper bound derived in [105], which was of 0.22 bps.

6.5.1 Remarks

We started our analysis motivated by the higher performances reported in dis-tributed video coding for correlation channel estimation using SID (asymmetric)models, when compared to SII (symmetric) models. By reducing the problem fromthe continuous domain to the binary case, we have not changed the nature ofthe argument and we compare the rate-distortion functions under BAC and BSCcorrelations. The conclusions of the analysis in this section can be summarized asfollows:

• If the source is uniform, the BSC correlation requires the highest rate toencode, and has the highest rate loss.

• If the source is not uniform, for the same expected distortion, there will alwaysbe a BAC that requires lower rate and has lower rate-loss than the BSC, e.g.,the Z-channel.

6.6 Conclusions

This chapter presented an exhaustive analysis of the rate-loss for binary WZcoding, when compared to predictive coding. Two essential contributions are tobe remembered.

134

Page 157: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

6. Rate Loss

First, there is no rate-loss in the case of the Z-channel correlation. This is thethird known ”no-rate-loss” case in the literature. Starting from this property, andconsidering the source-channel coding duality, it is possible to propose an equivalentof the ”Gaussian no-rate-loss” case in [103] and Costa’s ”writing on dirty paper” [24]couple.

Secondly, the maximal rate-loss for binary WZ coding is bounded by 0.076bps. Assume the existence of a practical code construction which works close tothe theoretical bound presented in Chapter 5; then, the maximum rate-loss whencompared to predictive coding is reasonably low, which is encouraging for systemsusing WZ coding.

The maximum rate loss and the maximum rate required to encode the sourceboth correspond to the binary uniform source with symmetric correlation. This isan intuitive result, since uniformity corresponds to the highest incertitude.

135

Page 158: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

136

Page 159: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 7

Epilogue

7.1 General Conclusions

The work presented in this thesis can be seen as the complete solution for a particularcase of the general Wyner-Ziv coding problem, namely, the case of the binary source,with binary side-information and the Hamming distortion metric.

As many other engineering problems, our largely theoretical analysis was basedon a practical observation and a subsequent intuition. While developing correlationchannel models for Distributed Video Coding (DVC), it was observed that assumingan asymmetric correlation leads to better rate-distortion performance when com-pared to the traditional symmetric correlation model [28,86]. DVC systems requirehighly efficient channel codes to achieve the Slepian-Wolf (SW) bound, and this canonly be obtained in practice by using long codewords. On the contrary, by takingthis coding process to the limit of binary symbols, where all the sources are binary,we can model the Wyner-Ziv video coding setup by a binary Wyner-Ziv codingproblem.

The solution for the generic source coding with side-information problem wasgiven in [103] in terms of an equally generic information-theoretical conditionalinformation. For particular instantiations of the problem which may arise in practicalapplications, an explicit form of the generic solution is desired. Unfortunately, owingto its complex nature, finding such an explicit form is often a difficult problem, tothe point where closed-form solutions cannot be expressed.

For the binary doubly symmetric case, i.e., uniform source and symmetriccorrelation, the explicit form of the solution was presented as an example in [103].The case of the non-uniform source and asymmetric correlation was not studiedin the literature. Therefore, we took on the challenge of proposing a solution forthe generic binary Wyner-Ziv coding problem, with the auxiliary motivation ofexplaining the performances of asymmetric correlation models in DVC.

Our first achievement was to observe and later on prove the fact that, for a

137

Page 160: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 7

extreme asymmetric correlation channel, i.e., the Z-channel correlation case, thereis no rate loss when compared to predictive coding [33]. This was a significantfinding, since it is only the third known case of Wyner-Ziv coding setup havingthe no-rate-loss property. Since the structure of the Z-channel is simplified, thederivations were quite straightforward, and we obtained an analytical expression forthe rate-distortion bound in the case of the Z-channel correlation.

Encouraged by this result, we proceeded to searching for an analytical solutionfor another particular case of the binary Wyner-Ziv coding ensemble: the case of theuniform source and asymmetric correlation channel. Alas, this proved to be muchmore challenging, and we had to accept the fact that, for non-trivial correlationchannels, this problem does not have a closed-form solution. The only alternativewas to go as far as possible with the analytical description and then solve the requiredoptimization using a numerical approach.

The numerical approach compromise led to yet another significant observationfrom the information-theoretical perspective. In order to mathematically express therequired quantities, we had to make a simplifying assumption: the auxiliary variablewhich drives the solution was known to be upper-bounded in cardinality by thecardinality of the source plus one. In our case, this translated to a ternary auxiliaryvariable, and the simplifying assumption was to restrict it to be binary. By usinga generic numerical algorithm which gives the solution for the Wyner-Ziv problem,namely, the Blahut-Arimoto algorithm, we were able to corroborate our formulationswith the ground-truth solution. The intriguing observation we made was that thebinary auxiliary variable is sufficient to achieve the rate-distortion bound, since,within our precision bounds, there was no noticeable difference between the Blahut-Arimoto solution and our solution. Unfortunately, a mathematical proof for thisobservation would have required a time investment incompatible with our resources.We had to settle with a formal statement of the observation in the form of aconjecture.

At this stage we had a numerically-conjectured tight bound for the rate-distortion function of Wyner-Ziv coding in the case of the binary uniform source,with asymmetric correlation. The next step was to investigate the correspondingpredictive coding case, with the aim of evaluating the rate-loss incurred by Wyner-Ziv coding. When the side-information is available both at the encoder and at thedecoder, the general formula is a simplified version of the Wyner-Ziv case, and inturned out that the resulting minimization is fortunately analytically solvable, andthe predictive coding admits a closed-form solution.

The rate-loss evaluation for the uniform source case followed. Intuitively, sincethe uniform source has the highest entropy, the rate required to encode it is thehighest, so the rate-loss was expected to be the biggest out of all possible sources.The analysis showed that previous bounds [105] on the rate-loss incurred by binaryWyner-Ziv coding were unnecessarily over-estimated, and the maximum rate-loss

138

Page 161: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

7. Epilogue

is actually one third of the figure proposed by previous literature. Moreover, thisanalysis allowed us to confirm the fact that the Z-channel correlation is the onlynon-trivial binary case exhibiting the no-rate-loss property.

These derivations and observations were the preamble for the challengingstep of generalizing those results to the most general binary Wyner-Ziv codingsetup possible: the non-uniform binary source and the asymmetric correlation.The generalization is non-trivial, as the non-uniformity of the source introducesan unexpected layer of complexity to the problem. The logic behind finding themaximum acceptable distortion for which a problem makes sense, as well as themechanism of determining the reconstruction functions and their correspondingusage intervals, change completely, making the general case an interesting and trulymotivating problem.

As expected, the generalization did not admit a closed-form solution, so thenumerical solution had to be employed again. The observation that a binary auxiliaryvariable suffices to reach all points on the rate-distortion bound holds in the generalcase as well. The corresponding predictive coding setup, with the side-informationavailable at both the encoder and the decoder, did however admit an analyticalexpression. Evaluating the rate-loss and the rate required to encode the non-uniformsources, we were able to confirm the intuition that the highest rate loss correspondsto the binary doubly symmetric setup.

Summing up, the contributions of this thesis are as follows:

• We derived for the first time the analytical rate-distortion function for theconventional predictive binary coding case (Chapter 4 and [75,76]).

• We described a new strategy to derive an achievable bound for the rate-distortion function for WZ coding under the binary asymmetric correlationchannel case by considering the auxiliary random variable (see eq. (12) in [103])to be binary. The resulting formulation does not generally lead to an analyticalbound. Nevertheless, the numerical comparison with the result of the Blahut-Arimoto algorithm [16] allows us to conjecture that the proposed bound istight. (Chapter 5 and [75,76,78])

• Taking a step further, we proposed a novel alternative analytical bound on therate-distortion function of WZ coding, which has an error of less than 10−3 bpswith respect to the bound calculated with the Blahut-Arimoto algorithm [16].(Chapter 5 and [76])

• We assessed the rate-distortion functions of predictive and WZ coding underall possible instantiations of the correlation channel. We show that, for bothpredictive and WZ coding, the minimum encoding rate corresponds to theZ-channel correlation. (Chapter 6 and [76])

139

Page 162: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

Chapter 7

• We computed the rate loss of WZ coding with respect to predictive coding forany binary source and for any binary correlation channel. We show that the Z-channel is the only nontrivial binary channel that has the no-rate-loss property.Moreover, we show that the highest rate loss corresponds to a uniform sourcewith BSC correlations. (Chapter 6 and [76,77])

7.2 Future Work

The theoretical fundaments of distributed source coding were established back in theearly 1970’s. The analysis proposed in this thesis is an instantiation of the Wyner-Ziv coding problem, and can be considered a mathematical exercise with a flavor ofnumerical optimization. It has the merit of proposing a few statements that wereunknown to the community, such as the no-rate-loss property of the Z-channel, themaximum rate loss incurred by binary Wyner-Ziv coding, or the sufficiency of thebinary auxiliary variable to achieve the rate-distortion function.

The topic itself is self-contained, but there are some resulting open questionswhich would deserve some effort investment in order to consider the binary Wyner-Ziv coding a closed chapter.

• coding duality: source coding and channel coding are information-theoreticalduals, in the sense that reversing one scheme could be functionally identicalto the dual.

The classical example of no-rate-loss Wyner-Ziv coding is the doubly Gaussiancase, and this property was mentioned in [103]. It’s dual was presented in acelebrated paper by Costa [24], and is commonly referred to as ”dirty papercoding”. This duality received a considerable amount of attention due to theapplicability of the theory, as Gaussian models are a common tool.

Extending this argument to the no-rate-loss property of the Z-channel, it isknown (but not formally stated) that its dual channel coding problem willalso achieve the channel capacity. A formal proof of this fact and possibleapplication scenarios have the potential to be of interest to the source/channelcoding community

• auxiliary variable cardinality: the fact that a binary auxiliary variablemay be sufficient to achieve the rate-distortion bound in the binary Wyner-Zivcoding problem is not straight-forward for the information theory community.As a matter of fact, our work was criticised precisely for this aspect in one ofthe early submissions of our work to an international conference.

There exists a similar problem in source coding literature, i.e., lossless binarysource coding with binary side information, for which the cardinality of theauxiliary variable was bounded from |X + 2| = 4 to |X | = 2, i.e., binary. The

140

Page 163: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

7. Epilogue

proofs (it was independently proved in [74] and [49]) are not extendable toour problem in a straightforward manner. Nevertheless, proper cardinalitybounding techniques (see [38] - Appendix C) may be applied in order toformally prove our conjecture.

• practical coding schemes: even though the theoretical methodology forachieving the rate-distortion bound in binary Wyner-Ziv coding is well inves-tigated, practical realisations are difficult to deploy.

The scheme proposed by Liveris in [61] requires a specific fixed coding rate forevery distortion level desired, so it can only be used for a limited number ofR(D) points. Polar codes are known to be inefficient for low-sized codewords.The most appealing construction, but also the most complex, remains theLDGM-LDPC compound. Such a construction would be extremely interestingas it would offer a more accurate bound on how close a practical system canapproach the theoretical limits of binary Wyner-Ziv coding.

141

Page 164: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

142

Page 165: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

List of publications

Publications in international journals

1. Deligiannis, N., Sechelea, A., Munteanu, A., Cheng, S. (2014). The no-rate-loss property of Wyner-Ziv coding in the Z-channel correlation case. IEEECommunications Letters, 18(10), 1675-1678.

2. Sechelea, A., Munteanu, A., Cheng, S., Deligiannis, N. (2016). On the Rate-Distortion Function for Binary Source Coding With Side Information. IEEETransactions on Communications, 64(12), 5203-5216.

Conference publications with peer review

1. Sechelea, A., Cheng, S., Munteanu, A., Deligiannis, N. (2015, December).Binary rate distortion with side information: the asymmetric correlationchannel case. In 2015 IEEE Global Conference on Signal and InformationProcessing (GlobalSIP) (pp. 235-239).

2. Sechelea, A., Munteanu, A., Cheng, S., Deligiannis, N. (2016, April). The rateloss in binary source coding with side information. In 2016 Proceedings ofData Compression Conference (DCC) (pp. 633).

3. Sechelea, A., Munteanu, A., Pizurica, A., Deligiannis, N. (2016, May). Achiev-ability of the rate-distortion function in binary uniform source coding with sideinformation. In 23rd International Conference on Telecommunications (ICT)(pp 461-465).

4. Sechelea, A., Do Huu, T., Zimos, E., Deligiannis, N. (2016, May). Twitter dataclustering and visualization. In 23rd International Conference on Telecommu-nications (ICT) (pp 195-199).

143

Page 166: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

144

Page 167: BINARY SOURCE CODING WITH SIDE INFORMATION · Slepian-Wolf (SW) coding, which is concerned with separate lossless compression of correlated sources with joint decoding, forms the

References

[1] High efficiency video coding: document ITU-T Rec. H.265, 23008–2, 2013.

[2] A. Aaron and B. Girod. Compression with side information using turbo codes. In

IEEE Data Compression Conf. (DCC), pages 252–261, Apr. 2002.

[3] A. Aaron, S. Rane, and B. Girod. Wyner-Ziv video coding with hash-based motion

compensation at the receiver. In Image Processing, 2004. ICIP’04. 2004 International

Conference on, volume 5, pages 3097–3100. IEEE, 2004.

[4] A. Aaron, S. D. Rane, E. Setton, and B. Girod. Transform-domain Wyner-Ziv codec

for video. In Electronic Imaging 2004, pages 520–528. International Society for Optics

and Photonics, 2004.

[5] A. Aaron, R. Zhang, and B. Girod. Wyner-Ziv coding of motion video. In Signals,

Systems and Computers, 2002. Conference Record of the Thirty-Sixth Asilomar

Conference on, volume 1, pages 240–244. IEEE, 2002.

[6] E. Arikan. Channel polarization: A method for constructing capacity-achieving codes

for symmetric binary-input memoryless channels. IEEE Transactions on Information

Theory, 55(7):3051–3073, 2009.

[7] S. Arimoto. An algorithm for computing the capacity of arbitrary discrete memory-

less channels. IEEE Transactions on Information Theory, 18(1):14–20, 1972.

[8] X. Artigas, J. Ascenso, M. Dalai, S. Klomp, D. Kubasov, and M. Quaret. The

DISCOVER codec: Architecture, techniques and evaluation. In Picture Coding Symp.

(PCS), Nov. 2007.

[9] J. Ascenso, C. Brites, and F. Pereira. Improving frame interpolation with spatial

motion smoothing for pixel domain distributed video coding. In 5th EURASIP Con-

ference on Speech and Image Processing, Multimedia Communications and Services ,

pages 1–6. Smolenice, Slovak Republic, 2005.

[10] J. Ascenso, C. Brites, and F. Pereira. Content adaptive Wyner-Ziv video coding

driven by motion activity. In 2006 International Conference on Image Processing,

pages 605–608. IEEE, 2006.

[11] J. Ascenso, C. Brites, and F. Pereira. A flexible side information generation frame-

work for distributed video coding. Multimedia Tools and Applications, 48(3):381–409,

2010.

[12] T. Berger. Rate Distortion Theory: A Mathematical Basis for Data Compression. Prentice-Hall, 1971.

[13] T. Berger. Multiterminal source coding. The Information Theory Approach to Communications, 229:171–231, 1978.

[14] T. Berger and R. W. Yeung. Multiterminal source encoding with one distortion criterion. IEEE Transactions on Information Theory, 35(2):228–236, 1989.

[15] C. Berrou and A. Glavieux. Turbo codes. Encyclopedia of Telecommunications, 2003.

[16] R. E. Blahut. Computation of channel capacity and rate-distortion functions. IEEE Trans. Inf. Theory, 18(4):460–473, 1972.

[17] G. Braeckman, W. Chen, J. Hanca, N. Deligiannis, F. Verbist, and A. Munteanu. Demo: Intra-frame compression for 1k-pixel visual sensors. In International Conference on Distributed Smart Cameras (ICDSC), pages 1–6. IEEE, 2013.

[18] C. Brites, J. Ascenso, and F. Pereira. Improving transform domain Wyner-Ziv video coding performance. In 2006 IEEE International Conference on Acoustics, Speech and Signal Processing, volume 2. IEEE, 2006.

[19] C. Brites, J. Ascenso, and F. Pereira. Side information creation for efficient Wyner-Ziv video coding: classifying and reviewing. Signal Processing: Image Communication, 28(7):689–726, 2013.

[20] C. Brites and F. Pereira. Correlation noise modeling for efficient pixel and transform domain Wyner-Ziv video coding. IEEE Trans. Circuits Syst. Video Technol., 18(9):1177–1190, Sep. 2008.

[21] C. Brites and F. Pereira. An efficient encoder rate control solution for transform domain Wyner-Ziv video coding. IEEE Transactions on Circuits and Systems for Video Technology, 21(9):1278–1292, 2011.

[22] S. Cheng, V. Stankovic, and Z. Xiong. Computing the channel capacity and rate-distortion function with two-sided state information. IEEE Trans. Inf. Theory, 51(12):4418–4425, 2005.

[23] S. Cheng and Z. Xiong. Successive refinement for the Wyner-Ziv problem and layered code design. IEEE Trans. Signal Process., 53(8):3269–3281, Aug. 2005.

[24] M. Costa. Writing on dirty paper (corresp.). IEEE Transactions on Information Theory, 29(3):439–441, 1983.

[25] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley Series in Telecommunications. Wiley, New York, USA, 1991.

[26] I. Csiszar. On the computation of rate-distortion functions. IEEE Transactions on Information Theory, 20(1):122–124, 1974.

[27] N. Deligiannis. Distributed Video Coding for Wireless Lightweight Multimedia Applications. PhD thesis, Vrije Universiteit Brussel, 2012.

[28] N. Deligiannis, J. Barbarien, M. Jacobs, A. Munteanu, A. Skodras, and P. Schelkens. Side-information-dependent correlation channel estimation in hash-based distributed video coding. IEEE Trans. Image Process., 21(4):1934–1949, Apr. 2012.

[29] N. Deligiannis, M. Jacobs, J. Barbarien, F. Verbist, J. Skorupa, R. Van de Walle, A. Skodras, P. Schelkens, and A. Munteanu. Joint DC coefficient band decoding and motion estimation in Wyner-Ziv video coding. In Int. Conf. on Digital Signal Process. (DSP), pages 1–6. IEEE, 2011.

[30] N. Deligiannis, A. Munteanu, T. Clerckx, J. Cornelis, and P. Schelkens. On the side-information dependency of the temporal correlation in Wyner-Ziv video coding. In IEEE Int. Conf. on Acoustics, Speech and Signal Process. (ICASSP), pages 709–712. IEEE, 2009.

[31] N. Deligiannis, A. Munteanu, T. Clerckx, J. Cornelis, and P. Schelkens. Overlapped block motion estimation and probabilistic compensation with application in distributed video coding. IEEE Signal Process. Lett., 16(9):743–746, 2009.

[32] N. Deligiannis, A. Munteanu, T. Clerckx, P. Schelkens, and J. Cornelis. Modeling the correlation noise in spatial domain distributed video coding. In Data Compression Conf. (DCC), page 443. IEEE Computer Society, 2009.

[33] N. Deligiannis, A. Sechelea, A. Munteanu, and S. Cheng. The no-rate-loss property of Wyner-Ziv coding in the Z-Channel correlation case. IEEE Commun. Lett., 18(10):1675–1678, 2014.

[34] N. Deligiannis, F. Verbist, J. Barbarien, J. Slowack, R. Van de Walle, P. Schelkens, and A. Munteanu. Distributed coding of endoscopic video. In IEEE Int. Conf. Image Process. (ICIP), Sep. 2011.

[35] N. Deligiannis, F. Verbist, A. Iossifides, J. Slowack, R. Van de Walle, P. Schelkens, and A. Munteanu. Wyner-Ziv video coding for wireless lightweight multimedia applications. EURASIP Jrnl. on Wireless Commun. and Netw., Special Issue on Recent Advances in Mobile Lightweight Wireless Systems, (106), 2012.

[36] N. Deligiannis, F. Verbist, J. Slowack, R. Van de Walle, P. Schelkens, and A. Munteanu. Progressively refined Wyner-Ziv video coding for visual sensors. ACM Transactions on Sensor Networks (TOSN), 10(2):21, 2014.

[37] T. N. Dinh, G. Lee, J.-Y. Chang, and H.-J. Cho. Side information generation using extra information in distributed video coding. In 2007 IEEE International Symposium on Signal Processing and Information Technology, pages 138–143. IEEE, 2007.

[38] A. El Gamal and Y.-H. Kim. Network Information Theory. Cambridge University Press, 2011.

[39] M. Eldib, N. B. Bo, F. Deboeverie, J. Nino, J. Guan, S. Van de Velde, H. Steendam, H. Aghajan, and W. Philips. A low resolution multi-camera system for person tracking. In 2014 IEEE International Conference on Image Processing (ICIP), pages 378–382. IEEE, 2014.

[40] X. Fan, O. C. Au, and N. M. Cheung. Transform-domain adaptive correlation estimation (TRACE) for Wyner-Ziv video coding. IEEE Trans. Circuits Syst. Video Technol., 20(11):1423–1436, Nov. 2010.

[41] A. Fuldseth, G. Bjøntegaard, M. Budagavi, and V. Sze. CE10: Core transform design for HEVC. JCTVC-G495, Nov. 2011.

[42] R. Gallager. Low-density parity-check codes. IRE Trans. Inf. Theory, 8(1):21–28, 1962.

[43] J. Garcia-Frias, Y. Zhao, and W. Zhong. Turbo-like codes for transmission of correlated sources over noisy channels. IEEE Signal Processing Magazine, 24(5):58–66, 2007.

[44] M. Gastpar. To Code or Not to Code. PhD thesis, Ecole Polytechnique Federale de Lausanne, 2002.

[45] M. Gastpar, B. Rimoldi, and M. Vetterli. To code, or not to code: Lossy source-channel communication revisited. IEEE Trans. Inf. Theory, 49(5):1147–1158, 2003.

[46] B. Girod, A. Aaron, S. Rane, and D. Rebollo-Monedero. Distributed video coding. Proc. IEEE, 93(1):71–83, Jan. 2005.

[47] A. Grange and H. Alvestrand. A VP9 bitstream overview, 2013.

[48] S. Gruenwedel, V. Jelaca, P. Van Hese, R. Kleihorst, and W. Philips. PhD forum: Multi-view occupancy maps using a network of low resolution visual sensors. In 2011 Fifth ACM/IEEE International Conference on Distributed Smart Cameras (ICDSC), pages 1–2. IEEE, 2011.

[49] W. Gu, R. Koetter, M. Effros, and T. Ho. On source coding with coded side information for a binary source with binary side information. In Int. Symp. on Inf. Theory (ISIT), pages 1456–1460. IEEE, 2007.

[50] J. Han, A. Saxena, and K. Rose. Towards jointly optimal spatial prediction and adaptive transform in video/image coding. In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 726–729. IEEE, 2010.

[51] J. Hanca, N. Deligiannis, and A. Munteanu. Real-time distributed video coding simulator for 1k-pixel visual sensor. In ACM International Conference on Distributed Smart Cameras, pages 199–200. ACM, 2015.

[52] J. Hanca, N. Deligiannis, and A. Munteanu. Real-time distributed video coding for 1k-pixel visual sensor networks. SPIE Journal of Electronic Imaging, accepted for publication.

[53] X. Huang and S. Forchhammer. Cross-band noise model refinement for transform domain Wyner-Ziv video coding. Sig. Proc.: Image Commun., 27(1):16–30, Jan. 2012.

[54] N. Hussami, S. B. Korada, and R. Urbanke. Performance of polar codes for channel and source coding. In 2009 IEEE International Symposium on Information Theory, pages 1488–1492. IEEE, 2009.

[55] ITU-T. Draft ITU-T recommendation and final draft international standard of joint video specification (ITU-T Rec. H.264 | ISO/IEC 14496-10 AVC). Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, JVT-G050, 2003.

[56] JCT-VC. Encoder-side description of test model under consideration. In Proc. JCT-VC Meeting, 2010.

[57] S. B. Korada. Polar Codes for Channel and Source Coding. PhD thesis, Ecole Polytechnique Federale de Lausanne, 2009.

[58] S. B. Korada and R. L. Urbanke. Polar codes are optimal for lossy source coding. IEEE Transactions on Information Theory, 56(4):1751–1768, 2010.

[59] Z. Li, L. Liu, and E. J. Delp. Rate distortion analysis of motion side estimation in Wyner-Ziv video coding. IEEE Transactions on Image Processing, 16(1):98–113, 2007.

[60] A. Liveris, Z. Xiong, and C. Georghiades. Compression of binary sources with side information at the decoder using LDPC codes. IEEE Commun. Lett., 6(10):440–442, Oct. 2002.

[61] A. D. Liveris, Z. Xiong, and C. N. Georghiades. Nested convolutional/turbo codes for the binary Wyner-Ziv problem. In Proceedings of the 2003 International Conference on Image Processing (ICIP 2003), volume 1, pages I–601. IEEE, 2003.

[62] M. W. Marcellin and T. R. Fischer. Trellis coded quantization of memoryless and Gauss-Markov sources. IEEE Transactions on Communications, 38(1):82–93, 1990.

[63] E. Martinian, A. Vetro, J. S. Yedidia, J. Ascenso, A. Khisti, and D. Malioutov. Hybrid distributed video coding using SCA codes. In 2006 IEEE Workshop on Multimedia Signal Processing, pages 258–261. IEEE, 2006.

[64] E. Martinian and M. J. Wainwright. Low-density constructions can achieve the Wyner-Ziv and Gelfand-Pinsker bounds. In 2006 IEEE International Symposium on Information Theory, pages 484–488. IEEE, 2006.

[65] T. Maugey, G. Petrazzuoli, P. Frossard, M. Cagnazzo, and B. Pesquet-Popescu. Key view selection in distributed multiview coding. In IEEE Visual Commun. and Image Process. Conf., pages 486–489. IEEE, 2014.

[66] F. Pereira, L. Torres, C. Guillemot, T. Ebrahimi, R. Leonardi, and S. Klomp. Distributed video coding: selecting the most promising application scenarios. Signal Processing: Image Communication, 23(5):339–352, 2008.

[67] E. Perron, S. Diggavi, and E. Telatar. The Kaspi rate-distortion problem with encoder side-information: Binary erasure case. Technical report, 2006.

[68] E. Perron, S. Diggavi, and E. Telatar. Lossy source coding with Gaussian or erased side-information. In 2009 IEEE International Symposium on Information Theory, pages 1035–1039. IEEE, 2009.

[69] G. Petrazzuoli, T. Maugey, M. Cagnazzo, and B. Pesquet-Popescu. Depth-based multiview distributed video coding. IEEE Trans. Multimedia, 16(7):1834–1848, 2014.

[70] M. T. Pourazad, C. Doutre, M. Azimi, and P. Nasiopoulos. HEVC - the new gold standard for video compression: How does HEVC compare with H.264/AVC? IEEE Consumer Electronics Magazine, 1(3):36–46, 2012.

[71] S. Pradhan and K. Ramchandran. Distributed source coding using syndromes (DISCUS): Design and construction. IEEE Trans. Inf. Theory, 49(3):626–643, Mar. 2003.

[72] S. S. Pradhan, J. Chou, and K. Ramchandran. Duality between source coding and channel coding and its extension to the side information case. IEEE Transactions on Information Theory, 49(5):1181–1203, 2003.

[73] R. Puri and K. Ramchandran. PRISM: A new robust video coding architecture based on distributed compression principles. In Proceedings of the Annual Allerton Conference on Communication, Control and Computing, volume 40, pages 586–595, 2002.

[74] M. Salehi. Cardinality bounds on auxiliary variables in multiple-user theory via the method of Ahlswede and Korner. Dept. Statistics, Stanford Univ., Stanford, CA, Tech. Rep. 33, 1978.

[75] A. Sechelea, S. Cheng, A. Munteanu, and N. Deligiannis. Binary rate distortion with side information: The asymmetric correlation channel case. In Proc. IEEE Global Conf. on Signal and Inf. Process. (GlobalSIP), Dec. 2015. arXiv preprint arXiv:1510.07517.

[76] A. Sechelea, A. Munteanu, S. Cheng, and N. Deligiannis. On the rate-distortion function for binary source coding with side information. IEEE Transactions on Communications, 64(12):5203–5216, 2016.

[77] A. Sechelea, A. Munteanu, S. Cheng, and N. Deligiannis. The rate loss in binary source coding with side information. In Data Compression Conference (DCC), 2016.

[78] A. Sechelea, A. Munteanu, A. Pizurica, and N. Deligiannis. Achievability of the rate-distortion function in binary uniform source coding with side information. In 23rd International Conference on Telecommunications (ICT), 2016.

[79] S. Shamai, S. Verdu, and R. Zamir. Systematic lossy source/channel coding. IEEE Transactions on Information Theory, 44(2):564–579, 1998.

[80] C. E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27:379–423, 1948.

[81] C. E. Shannon. Coding theorems for a discrete source with a fidelity criterion. IRE Nat. Conv. Rec., 4(1):142–163, 1959.

[82] R. Sidhu, D. S. Sanders, and M. E. McAlindon. Gastrointestinal capsule endoscopy: from tertiary centres to primary care. BMJ, 332(7540):528–531, 2006.

[83] D. Slepian and J. K. Wolf. Noiseless coding of correlated information sources. IEEE Trans. Inf. Theory, 19(4):471–480, Apr. 1973.

[84] Y. Steinberg. Coding and common reconstruction. IEEE Trans. Inf. Theory, 55(11):4995–5010, 2009.

[85] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand. Overview of the high efficiency video coding (HEVC) standard. IEEE Trans. Circuits Syst. Video Technol., 22(12):1649–1668, 2012.

[86] V. Toto-Zarasoa, A. Roumy, and C. Guillemot. Source modeling for distributed video coding. IEEE Trans. Circuits Syst. Video Technol., 22(2):174–187, Feb. 2012.

[87] S.-Y. Tung. Multiterminal source coding. PhD thesis, Cornell University, 1978.

[88] J.-M. Valin, T. B. Terriberry, N. E. Egge, T. Daede, Y. Cho, C. Montgomery, and M. Bebenita. Daala: Building a next-generation video codec from unconventional technology. arXiv preprint arXiv:1608.01947, 2016.

[89] X. H. Van, J. Ascenso, and F. Pereira. HEVC backward compatible scalability: A low encoding complexity distributed video coding based approach. Signal Process.: Image Commun., 33:51–70, 2015.

[90] H. Van Luong, S. Forchhammer, J. Slowack, J. De Cock, and R. Van de Walle. Adaptive mode decision with residual motion compensation for distributed video coding. APSIPA Transactions on Signal and Information Processing, 4:e1, 2015.

[91] H. Van Luong, L. L. Raket, and S. Forchhammer. Re-estimation of motion and reconstruction for distributed video coding. IEEE Trans. Image Process., 23(7):2804–2819, 2014.

[92] D. Varodayan, A. Aaron, and B. Girod. Rate-adaptive codes for distributed source coding. Signal Process., Special Issue on Distributed Source Coding, 86(11):3123–3130, Nov. 2006.

[93] F. Verbist, N. Deligiannis, W. Chen, P. Schelkens, and A. Munteanu. Transform-domain Wyner-Ziv video coding for 1k-pixel visual sensors. In Int. Conf. on Distributed Smart Cameras (ICDSC), pages 1–6. IEEE, 2013.

[94] M. J. Wainwright. Sparse graph codes for side information and binning. IEEE Signal Processing Magazine, 24(5):47–57, 2007.

[95] M. J. Wainwright and E. Maneva. Lossy source encoding via message-passing and decimation over generalized codewords of LDGM codes. In Proceedings of the 2005 International Symposium on Information Theory (ISIT 2005), pages 1493–1497. IEEE, 2005.

[96] M. J. Wainwright, E. Maneva, and E. Martinian. Lossy source compression using low-density generator matrix codes: Analysis and algorithms. IEEE Transactions on Information Theory, 56(3):1351–1368, 2010.

[97] Z. Wang and A. C. Bovik. A universal image quality index. IEEE Signal Processing Letters, 9(3):81–84, 2002.

[98] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.

[99] R. P. Westerlaken, R. K. Gunnewiek, and R. L. Lagendijk. The role of the virtual channel in distributed source coding of video. In IEEE Int. Conf. Image Process. (ICIP), volume 1, pages 581–584, Sep. 2005.

[100] T. Wiegand, G. J. Sullivan, G. Bjøntegaard, and A. Luthra. Overview of the H.264/AVC video coding standard. IEEE Trans. Circuits Syst. Video Technol., 13(7):560–576, 2003.

[101] A. Wyner. Recent results in the Shannon theory. IEEE Trans. Inf. Theory, 20(1):2–10, Jan. 1974.

[102] A. D. Wyner. The rate-distortion function for source coding with side information at the decoder-II: General sources. Information and Control, 38(1):60–80, 1978.

[103] A. D. Wyner and J. Ziv. The rate-distortion function for source coding with side information at the decoder. IEEE Trans. Inf. Theory, 22(1):1–10, Jan. 1976.

[104] Q. Xu, V. Stankovic, and Z. Xiong. Layered Wyner-Ziv video coding for transmission over unreliable channels. Signal Processing, 86(11):3212–3225, 2006.

[105] R. Zamir. The rate loss in the Wyner-Ziv problem. IEEE Trans. Inf. Theory, 42(6):2073–2084, 1996.

[106] R. Zamir, S. Shamai, and U. Erez. Nested linear/lattice codes for structured multiterminal binning. IEEE Trans. Inf. Theory, 48(6):1250–1276, Jun. 2002.
