Unusual coding possibilities –
information theory in practice: statistical physics on the hypercube of 2^𝑛 sequences
1) Entropy coding: remove statistical redundancy in data compression:
Huffman, arithmetic coding, ANS
2) (Forward) Error correction: add deterministic redundancy to repair
damaged message (erasure, BSC, deletion channels):
LDPC, sequential decoding, fountain codes and expansions
3) Rate distortion: shorter approximate description (lossy
compression); Kuznetsov-Tsybakov: (statistical) constraints known
only to the sender (steganography, QR codes) – and their “duality”
4) Distributed source coding – for multiple correlated sources that
don’t communicate with each other.
Jarosław Duda, Kraków, 23.XI.2015
1. ENTROPY CODING – DATA COMPRESSION
We need 𝒏 bits of information to choose one of 𝟐𝒏 possibilities.
For length-𝒏 0/1 sequences with 𝒑𝒏 ones, how many bits do we need to choose one?
𝑛!/((𝑝𝑛)! ((1−𝑝)𝑛)!) ≈ (𝑛/𝑒)^𝑛 / ((𝑝𝑛/𝑒)^{𝑝𝑛} ((1−𝑝)𝑛/𝑒)^{(1−𝑝)𝑛})
= 2^{𝑛 lg(𝑛) − 𝑝𝑛 lg(𝑝𝑛) − (1−𝑝)𝑛 lg((1−𝑝)𝑛)} = 2^{𝑛(−𝑝 lg(𝑝) − (1−𝑝) lg(1−𝑝))} = 2^{𝑛ℎ(𝑝)}
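A quick numeric check of this approximation – a minimal Python sketch (parameters chosen arbitrarily):

```python
# lg of the number of length-n sequences with pn ones, per symbol, vs h(p)
from math import comb, log2

def h(p):                                  # binary Shannon entropy in bits
    return -p * log2(p) - (1 - p) * log2(1 - p)

n, p = 10000, 0.1
print(log2(comb(n, int(p * n))) / n)       # ~ 0.4684
print(h(p))                                # ~ 0.4690
```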
Entropy coder: encodes a sequence with probability distribution (𝑝_𝑠)_{𝑠=0..𝑚−1}
using asymptotically at least 𝑯 = ∑_𝑠 𝒑_𝒔 𝐥𝐠(𝟏/𝒑_𝒔) bits/symbol (𝐻 ≤ lg(𝑚)).
Seen as a weighted average: a symbol of probability 𝒑 contains 𝐥𝐠(𝟏/𝒑) bits.
Huffman coding:
- sort symbols by their frequencies,
- replace the two least frequent with a subtree of them, and so on.
symbol → length-𝑟 bit sequence; assumes Pr(symbol) ≈ 2^{−𝑟}
generally: prefix codes, defined by a tree
unique decoding: no codeword is a prefix of another
http://en.wikipedia.org/wiki/Huffman_coding
𝐴 → 0, 𝐵 → 100, 𝐶 → 101, 𝐷 → 110, 𝐸 → 111
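A minimal sketch of the construction in Python (heapq-based; tie-breaking between equal frequencies is arbitrary, so the exact codes may differ from the example above):

```python
# Huffman code construction: repeatedly merge the two least frequent subtrees
import heapq

def huffman(freqs):
    """freqs: dict symbol -> probability; returns dict symbol -> bit string."""
    heap = [(p, i, {s: ''}) for i, (s, p) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)             # two least frequent subtrees
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: '0' + c for s, c in c1.items()}
        merged.update({s: '1' + c for s, c in c2.items()})
        heapq.heappush(heap, (p1 + p2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

print(huffman({'A': 0.5, 'B': 0.25, 'C': 0.25}))    # e.g. A -> 0, B -> 10, C -> 11
```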
Symbol of probability 𝒑 contains 𝐥𝐠(𝟏/𝒑) bits
Encoding a (𝑝_𝑠) distribution with an entropy coder optimal for a (𝑞_𝑠) distribution costs
Δ𝐻 = ∑_𝑠 𝑝_𝑠 lg(1/𝑞_𝑠) − ∑_𝑠 𝑝_𝑠 lg(1/𝑝_𝑠) = ∑_𝑠 𝑝_𝑠 lg(𝑝_𝑠/𝑞_𝑠) ≈ (1/ln(4)) ∑_𝑠 (𝑝_𝑠 − 𝑞_𝑠)²/𝑝_𝑠
more bits/symbol – the so-called Kullback-Leibler “distance” (not symmetric).
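A quick check of this quadratic approximation in Python (distributions chosen arbitrarily):

```python
# Kullback-Leibler penalty: exact vs the (1/ln 4) * sum (p-q)^2 / p approximation
from math import log2, log

p = [1/3, 1/3, 1/3]                        # true distribution
q = [1/2, 1/4, 1/4]                        # distribution assumed by the coder
dH_exact  = sum(ps * log2(ps / qs) for ps, qs in zip(p, q))
dH_approx = sum((ps - qs)**2 / ps for ps, qs in zip(p, q)) / log(4)
print(dH_exact, dH_approx)                 # ~ 0.082 vs ~ 0.090 bits/symbol
```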
Huffman/prefix codes – encode symbols directly as bit sequences,
approximating probabilities by powers of 1/2:
e.g. p(A) = 1/2, p(B) = 1/4, p(C) = 1/4 gives the tree A → 0, B → 10, C → 11
Example of the Huffman penalty for the truncated 𝝆(𝟏 − 𝝆)^𝒏 distribution
on a 256-symbol alphabet. Huffman maps 8 bits → at least 1 bit (ratio ≤ 8 here),
so it is terrible at encoding very frequent events:

𝜌       8/H     Huffman   ANS
0.5     4.001   3.99      4.00
0.6     4.935   4.78      4.93
0.7     6.344   5.58      6.33
0.8     8.851   6.38      8.84
0.9     15.31   7.18      15.29
0.95    26.41   7.58      26.38
0.99    96.95   7.90      96.43
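The 8/H column can be approximately reproduced with a short Python sketch (small differences from the table are expected, e.g. from the exact normalization used):

```python
# 8/H for the truncated rho*(1-rho)^n distribution on a 256-symbol alphabet
from math import log2

def ratio_limit(rho, m=256):
    w = [rho * (1 - rho)**i for i in range(m)]
    p = [x / sum(w) for x in w]
    return 8 / sum(-pi * log2(pi) for pi in p)

for rho in (0.5, 0.6, 0.7, 0.8, 0.9):
    print(rho, round(ratio_limit(rho), 3))  # ~ 4.00, 4.94, 6.35, 8.87, 15.35
```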
Huffman coding (HC), prefix codes, needs sorting:
most of everyday compression, e.g. zip, gzip, cab, jpeg, gif, png, tiff, pdf, mp3…
(can’t use less than 1 bit/symbol)
zlibh (the fastest generally available implementation):
Encoding ~ 320 MB/s (/core, 3 GHz), Decoding ~ 300–350 MB/s
Range coding (RC): large-alphabet arithmetic coding,
needs multiplication, e.g. 7-Zip, Google VP video codecs (e.g. YouTube, Hangouts).
Encoding ~ 100–150 MB/s, Decoding ~ 80 MB/s
(binary) Arithmetic coding (AC): H.264, H.265 video, ultracompressors e.g. PAQ
Encoding/decoding ~ 20–30 MB/s
– different ratio/speed tradeoffs
Asymmetric Numeral Systems (ANS), tabled variant (tANS): no multiplication or sorting.
FSE implementation of tANS: Encoding ~ 350 MB/s, Decoding ~ 500 MB/s
RC → ANS: ~7× decoding speedup, no multiplication (switched e.g. in the LZA compressor)
HC → ANS: better compression and ~1.5× decoding speedup (e.g. zhuff, lzturbo)
𝜌       8/H     zlibh   FSE
0.5     4.001   3.99    4.00
0.6     4.935   4.78    4.93
0.7     6.344   5.58    6.33
0.8     8.851   6.38    8.84
0.9     15.31   7.18    15.29
0.95    26.41   7.58    26.38
0.99    96.95   7.90    96.43
Huffman vs ANS in compressors (LZ + entropy coder), from Matt Mahoney benchmarks http://mattmahoney.net/dc/text.html
enwik8 (100,000,000 B):

compressor                  size [B]     encode [ns/byte]  decode [ns/byte]
LZA 0.82b –mx9 –b7 –h7      26,396,613   449               9.7
lzturbo 1.2 –39 –b24        26,915,461   582               2.8
WinRAR 5.00 –ma5 –m5        27,835,431   1004              31
WinRAR 5.00 –ma5 –m2        29,758,785   228               30
lzturbo 1.2 –32             30,979,376   19                2.7
zhuff 0.97 –c2              34,907,478   24                3.5
gzip 1.3.5 –9               36,445,248   101               17
pkzip 2.0.4 –ex             36,556,552   171               50
ZSTD 0.0.1                  40,024,854   7.7               3.8
WinRAR 5.00 –ma5 –m1        40,565,268   54                31
zhuff, ZSTD (Yann Collet): LZ4 + tANS (switched from Huffman)
lzturbo (Hamid Bouzidi): LZ + tANS (switched from Huffman)
LZA (Nania Francesco): LZ + rANS (switched from range coding)
e.g. lzturbo vs gzip: better compression, 5× faster encoding, 6× faster decoding –
saving time and energy in an extremely frequent task
Apple LZFSE = Lempel-Ziv + Finite State Entropy
Default in iOS 9 and OS X 10.11: “matching the compression ratio of ZLIB level 5,
but with much higher energy efficiency and speed (between 2x and 3x)
for both encode and decode operations”.
Finite State Entropy is Yann Collet’s (known from e.g. LZ4) implementation of tANS.
CRAM 3.0 of the European Bioinformatics Institute for DNA compression:

format                    size [B]       encoding [s]  decoding [s]
BAM                       8,164,228,924  1216          111
CRAM v2 (LZ77 + Huffman)  5,716,980,996  1247          182
CRAM v3 (order-1 rANS)    4,922,879,082  574           190
gzip is used everywhere … but it’s ancient (1993) … what’s next?
Google – first Zopfli (2013, gzip-compatible), now Brotli (incompatible, Huffman-based) –
Firefox support, all over the news: “Google’s new Brotli compression algorithm
is 26 percent better than existing solutions” (22.09.2015)
First tests of new ZSTD_HC (Yann Collet, open source, tANS, 28.10.2015):
zstd v0.2 427 ms (239 MB/s), 51367682, 189 ms (541 MB/s)
zstd_HC v0.2 -1 996 ms (102 MB/s), 48918586, 244 ms (419 MB/s)
zstd_HC v0.2 -2 1152 ms (88 MB/s), 47193677, 247 ms (414 MB/s)
zstd_HC v0.2 -3 2810 ms (36 MB/s), 45401096, 249 ms (411 MB/s)
zstd_HC v0.2 -4 4009 ms (25 MB/s), 44631716, 243 ms (421 MB/s)
zstd_HC v0.2 -5 4904 ms (20 MB/s), 44529596, 244 ms (419 MB/s)
zstd_HC v0.2 -6 5785 ms (17 MB/s), 44281722, 244 ms (419 MB/s)
zstd_HC v0.2 -7 7352 ms (13 MB/s), 44108456, 241 ms (424 MB/s)
zstd_HC v0.2 -8 9378 ms (10 MB/s), 43978979, 237 ms (432 MB/s)
zstd_HC v0.2 -9 12306 ms (8 MB/s), 43901117, 237 ms (432 MB/s)
brotli 2015-10-11 level 0 1202 ms (85 MB/s), 47882059, 489 ms (209 MB/s)
brotli 2015-10-11 level 3 1739 ms (58 MB/s), 47451223, 475 ms (215 MB/s)
brotli 2015-10-11 level 4 2379 ms (43 MB/s), 46943273, 477 ms (214 MB/s)
brotli 2015-10-11 level 5 6175 ms (16 MB/s), 43363897, 528 ms (193 MB/s)
brotli 2015-10-11 level 6 9474 ms (10 MB/s), 42877293, 463 ms (221 MB/s)
lzturbo (tANS) seems even better, but it is closed-source.
Operating on a fractional number of bits:
restricting a range of size 𝑁 to a length-𝐿 subrange contains 𝐥𝐠(𝑵/𝑳) bits;
a number 𝑥 contains 𝐥𝐠(𝒙) bits.
Adding a symbol of probability 𝑝, containing 𝐥𝐠(𝟏/𝒑) bits:
arithmetic coding: lg(𝑁/𝐿) + lg(1/𝑝) ≈ lg(𝑁/𝐿′) for 𝑳′ ≈ 𝒑 ⋅ 𝑳
ANS: lg(𝑥) + lg(1/𝑝) = lg(𝑥′) for 𝒙′ ≈ 𝒙/𝒑
RENORMALIZATION to prevent 𝑥 → ∞:
enforce 𝑥 ∈ 𝐼 = {𝐿, …, 2𝐿 − 1} by transmitting the lowest bits to the bitstream.
The “buffer” 𝒙 contains 𝐥𝐠(𝒙) bits of information – it produces bits as they accumulate.
A symbol of Pr = 𝑝 contains lg(1/𝑝) bits: lg(𝑥) → lg(𝑥) + lg(1/𝑝) modulo 1
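A minimal rANS sketch of the 𝑥′ ≈ 𝑥/𝑝 step (assumptions: probabilities quantized as 𝑞_𝑠/𝑀 with 𝑀 = ∑ 𝑞_𝑠; Python big integers stand in for the renormalization used in practice):

```python
# rANS: encoding pushes a symbol into x, growing it by ~ lg(M/q_s) = lg(1/p_s) bits
def cdf(q):
    b, c = [], 0
    for qs in q:
        b.append(c); c += qs
    return b, c                            # b[s] = cumulative count, c = M

def rans_encode(symbols, q):
    b, M = cdf(q)
    x = 1
    for s in symbols:
        x = (x // q[s]) * M + b[s] + (x % q[s])
    return x

def rans_decode(x, q, count):
    b, M = cdf(q)
    out = []
    for _ in range(count):
        slot = x % M                       # the slot identifies the symbol
        s = max(i for i in range(len(q)) if b[i] <= slot)
        x = q[s] * (x // M) + slot - b[s]  # exact inverse of the encoding step
        out.append(s)
    return out[::-1]                       # ANS is LIFO: decoding reverses the order

q = [10, 3, 2, 1]                          # p ~ (10/16, 3/16, 2/16, 1/16)
msg = [0, 0, 1, 0, 3, 2, 0, 1, 0, 0]
assert rans_decode(rans_encode(msg, q), q, len(msg)) == msg
```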
What symbol spread should we choose?
(a PRNG seeded with a cryptographic key can be used here to also add encryption)
For example: rate loss for 𝑝 = (0.04, 0.16, 0.16, 0.64),
𝐿 = 16 states, 𝑞 = (1, 3, 2, 10), 𝑞/𝐿 = (0.0625, 0.1875, 0.125, 0.625):
method            symbol spread      dH/H     rate loss
–                 –                  ~0.011   penalty of the quantizer itself
Huffman           0011222233333333   ~0.080   would give a Huffman decoder
spread_range_i()  0111223333333333   ~0.059   increasing order
spread_range_d()  3333333333221110   ~0.022   decreasing order
spread_fast()     0233233133133133   ~0.020   fast
spread_prec()     3313233103332133   ~0.015   close to quantization dH/H
spread_tuned()    3233321333313310   ~0.0046  better than quantization dH/H due to also using 𝑝
spread_tuned_s()  2333312333313310   ~0.0040  L log L complexity (sort)
spread_tuned_p()  2331233330133331   ~0.0058  testing 1/(p ln(1+1/i)) ~ i/p approximation
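The first row (the penalty of the quantization alone) is easy to verify – a short sketch:

```python
# dH/H of quantizing p to q/L: KL(p || q/L) / H(p), matching the ~0.011 row
from math import log2

p = [0.04, 0.16, 0.16, 0.64]
q = [1, 3, 2, 10]; L = 16
dH = sum(ps * log2(ps * L / qs) for ps, qs in zip(p, q))
H  = sum(-ps * log2(ps) for ps in p)
print(dH / H)                              # ~ 0.011
```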
2. (forward) ERROR CORRECTION
Add deterministic redundancy to repair errors that may occur
Binary Symmetric Channel (BSC): (e.g. 0101 → 0111)
Every bit has independent 𝑝𝑏 probability of being flipped (0 ↔ 1)
Simplest error correction – Triple Modular Redundancy:
use 3 (or more) identical copies, decode using majority voting:
0 → 000, 1 → 111 … 001 → 0, 110 → 1
Codewords: the coding sequences actually used (000, 111).
The decoder chooses the closest codeword (the one whose ball contains the received word).
Hamming distance: the number of differing bits (e.g. 𝑑(001, 000) = 1).
Triple Modular Redundancy:
2 balls of radius 1 (around 000 and 111) cover all 3-bit sequences:
2 codewords ⋅ (1 + 3) ball size = 2³ sequences
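A minimal sketch of this repetition scheme:

```python
# Triple Modular Redundancy: repeat each bit 3 times, decode by majority vote
def tmr_encode(bits):
    return [b for b in bits for _ in range(3)]        # 0 -> 000, 1 -> 111

def tmr_decode(coded):
    triples = zip(coded[0::3], coded[1::3], coded[2::3])
    return [1 if sum(t) >= 2 else 0 for t in triples] # closest codeword wins

assert tmr_decode([0, 0, 1,  1, 1, 0]) == [0, 1]      # single flips repaired
```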
(7,4) Hamming codes: 𝑑(𝑐₁, 𝑐₂) ≥ 3 for 𝑐₁ ≠ 𝑐₂.
7 bits – a ball of radius 1 contains 8 sequences;
2⁴ codewords, union of their balls: 2⁴ ⋅ 8 = 2⁷.
Finally: the 2⁷ sequences are divided among the 2⁴ radius-1 balls around codewords.
Bit position   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19  20 …
Encoded bits   p1  p2  d1  p4  d2  d3  d4  p8  d5  d6  d7  d8  d9  d10 d11 p16 d12 d13 d14 d15

Parity bit coverage (p_k covers the positions whose binary representation contains k):
p1: 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, …
p2: 2, 3, 6, 7, 10, 11, 14, 15, 18, 19, …
p4: 4–7, 12–15, 20–23, …
p8: 8–15, 24–31, …
p16: 16–31, …
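A minimal Hamming(7,4) sketch; thanks to this position numbering, the syndrome (XOR of the positions of the set bits) directly points at a single error:

```python
# Hamming(7,4): parity bits at positions 1, 2, 4; syndrome = error position
def hamming74_encode(d):
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4                    # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4                    # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4                    # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_decode(c):
    s = 0
    for pos, bit in enumerate(c, start=1):
        if bit: s ^= pos                 # XOR of positions of set bits
    if s: c[s - 1] ^= 1                  # nonzero syndrome: flip that position
    return [c[2], c[4], c[5], c[6]]      # data bits at positions 3, 5, 6, 7

word = hamming74_encode([1, 0, 1, 1])
word[5] ^= 1                             # flip one bit in transit
assert hamming74_decode(word) == [1, 0, 1, 1]
```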
Triple Modular Redundancy: rate 1/3 (we need to transmit 3 times more).
Assume 𝜖 = 0.01: the probability that 2 of the 3 copies get flipped is ≈ 0.0003 > 0,
so decoding can still fail.
Theory: rate 1 − ℎ(0.01) ≈ 0.919 can fully repair the message.
The size of a radius-𝜖𝑛 ball in the space of length-𝑛 sequences (hypercube) is
(𝑛 choose 𝜖𝑛) ≈ 2^{𝑛ℎ(𝜖)}.
The number of such balls: 2^𝑛 / 2^{𝑛ℎ(𝜖)} = 2^{𝑛(1−ℎ(𝜖))}.
We can send 𝑛(1 − ℎ(𝜖)) bits by choosing one of the balls (codewords).
Alternative explanation of the rate 𝑹(𝝐) = 𝟏 − 𝒉(𝝐) bits/symbol:
if we additionally had the bitflip positions, worth ℎ(𝜖) bits/symbol,
we would get 1 bit/symbol.
The theoretical rate is asymptotic – it requires long bit blocks.
Directly long blocks: Low Density Parity Check (LDPC) codes – randomly distributed parity bits.
Correction uses belief propagation, operating on Pr(c_i = 1):
C_i = Pr(c_i = 1) for the input message (initialization of the process),
q_ij(b) (initialized to C_i for b = 1) – ‘belief’ sent from bit c_i to check f_j,
r_ji(b) – Pr(c_i = b) sent from check f_j back to bit c_i.
Sequential decoding – search the tree of possible corrections:
𝑠𝑡𝑎𝑡𝑒 = 𝑓(𝑖𝑛𝑝𝑢𝑡_𝑏𝑙𝑜𝑐𝑘, 𝑠𝑡𝑎𝑡𝑒),
e.g. 𝑜𝑢𝑡𝑝𝑢𝑡_𝑏𝑙𝑜𝑐𝑘 = ”𝑖𝑛𝑝𝑢𝑡_𝑏𝑙𝑜𝑐𝑘 | (𝑠𝑡𝑎𝑡𝑒 & 1)”.
The 𝒔𝒕𝒂𝒕𝒆 depends on the history – it connects redundancies,
guards the proper correction, uniformly spreading checksums.
Encoding uses only some subset of 𝑠𝑡𝑎𝑡𝑒s (e.g. last bit = 0).
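A toy sketch of this idea (the block layout, state size, and state-update function below are all assumptions for illustration, not any standard code): each 4-bit data block is followed by 4 checksum bits from a rolling state, and the decoder best-first searches the tree of corrections, always expanding the candidate with the fewest assumed bit flips:

```python
import heapq

def step(state, d):                       # toy state update mixing in the data
    return ((state * 33) ^ d) & 0xFFFF

def encode(nibbles, state=1):
    out = []
    for d in nibbles:
        state = step(state, d)
        out.append((d, state & 0xF))      # input block + checksum from state
    return out

def decode(received, state=1):
    heap = [(0, 0, state, ())]            # (assumed flips, position, state, decoded)
    while heap:
        flips, pos, st, dec = heapq.heappop(heap)
        if pos == len(received):
            return list(dec)              # cheapest full correction found
        rd, rc = received[pos]
        for d in range(16):               # every hypothesis for this data block
            st2 = step(st, d)
            cost = bin(d ^ rd).count('1') + bin((st2 & 0xF) ^ rc).count('1')
            heapq.heappush(heap, (flips + cost, pos + 1, st2, dec + (d,)))

msg = [3, 14, 7, 0, 9, 5]
tx = encode(msg)
tx[2] = (tx[2][0] ^ 4, tx[2][1])          # flip one data bit in transit
assert decode(tx) == msg                  # state-linked checksums expose the flip
```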
Channels:
Binary Symmetric Channel (difficult to repair, well understood) –
bits have independent Pr = 𝑝_𝑏 of being flipped (e.g. 0101 → 0111); rate = 1 − ℎ(𝑝_𝑏)
Erasure Channel (simple to repair, well understood) –
bits have independent Pr = 𝑝_𝑒 of being erased (e.g. 0101 → 01?1); rate = 1 − 𝑝_𝑒
Synchronization errors, for example the Deletion Channel –
bits have independent Pr = 𝑝_𝑑 of being deleted (e.g. 0101 → 011);
extremely difficult to repair, rate = ??? (capacity still unknown)
Rateless erasure codes: produce some large number of packets;
any large enough subset (≥ 𝑁) is sufficient for reconstruction.
For example Reed-Solomon codes (slow):
a degree-(𝑁−1) polynomial is determined by its values at any 𝑁 points.
Fountain codes (from 2002, Michael Luby): (𝑋₁, 𝑋₂, …) → (𝑌₁, 𝑌₂, …),
the 𝒀 are XORs of some (random) subsets of the 𝑿,
e.g. if 𝑌₁ = 𝑋₁, 𝑌₂ = 𝑋₁ XOR 𝑋₂ then 𝑋₁ = 𝑌₁, 𝑋₂ = 𝑌₂ XOR 𝑋₁.
What densities of XORed blocks should we choose?
Pr(1) = 1/𝑁, Pr(𝑑) = 1/(𝑑(𝑑−1)) for 𝑑 = 2, …, 𝑁 (the ideal soliton distribution).
Quick, but needs (1 + 𝜖)𝑁 packets to reliably decode (is the matrix invertible?)
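A sketch of such a code with the simplest “peeling” decoder (repeatedly substitute known blocks and harvest degree-1 packets); this toy version can fail to finish, which is exactly why slightly more than 𝑁 packets are needed:

```python
# LT-style fountain code: packets are XORs of random subsets of the data blocks
import random

def soliton(N):                           # Pr(1) = 1/N, Pr(d) = 1/(d(d-1))
    r, p, d = random.random(), 1.0 / N, 1
    while r > p and d < N:
        d += 1; p += 1.0 / (d * (d - 1))
    return d

def make_packet(data):
    idx = random.sample(range(len(data)), soliton(len(data)))
    val = 0
    for i in idx: val ^= data[i]
    return set(idx), val

def peel(packets, N):
    known, pkts, progress = {}, [[set(s), v] for s, v in packets], True
    while progress and len(known) < N:
        progress = False
        for pkt in pkts:
            for i in list(pkt[0]):        # substitute already-known blocks
                if i in known:
                    pkt[0].discard(i); pkt[1] ^= known[i]
            if len(pkt[0]) == 1:          # degree-1 packet reveals a block
                known[pkt[0].pop()] = pkt[1]; progress = True
    return known if len(known) == N else None

data = [random.getrandbits(32) for _ in range(20)]
packets = [make_packet(data) for _ in range(35)]      # ~ (1 + eps) N packets
print(peel(packets, 20) == dict(enumerate(data)))     # usually True
```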
If a packet wasn’t lost, it has probably been damaged.
Protecting it with error correction requires a priori knowledge of the damage level,
unavailable in: data storage, watermarking, unknown packet routes.
Joint Reconstruction Codes (JRC): the same, but using a posteriori knowledge of the damage level.
Some packets are lost, some are damaged.
Adding redundancy (sender) requires knowing the damage level –
not available in many situations (JRC…):
- damage of a storage medium depends on its history,
- a broadcaster doesn’t know the individual damage levels,
- damage of watermarking depends on the capturing conditions,
- damage of a packet in a network depends on its route,
- rapidly varying conditions can prevent adaptation.
We could add some statistical/deterministic knowledge to the decoder,
getting compression/entropy coding without the encoder’s knowledge.
JRC: the undamaged case, when 𝑁 + 1 packets are received.
Large 𝑁: Gaussian distribution, on average 1.5 more nodes to test.
Damaged case (𝑁 = 3: two packets with 𝜖 = 0, the damaged one with 𝜖 = 0.2): ≈ Pareto distribution.
Sequential decoding: choose and expand the most promising candidate (largest weight).
The rate is 𝑅_𝑐(𝜖) = 1 − ℎ_{1/(1+𝑐)}(𝜖) for Pareto coefficient 𝑐: Pr(# steps > 𝑠) ∝ 𝑠^{−𝑐},
where ℎ_𝑢(𝜖) = lg(𝜖^𝑢 + (1−𝜖)^𝑢) / (1−𝑢) is the Renyi entropy (ℎ₁ is the Shannon entropy):
𝑅₀(𝜖) = 1 − ℎ(𝜖) (the Shannon limit),
𝑅₁(𝜖) = 1 − lg(1 + 2√(𝜖(1−𝜖)))
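These rates are easy to evaluate numerically – a short sketch:

```python
# R_c(eps) = 1 - h_{1/(1+c)}(eps), with h_u the Renyi entropy of (eps, 1-eps)
from math import log2

def h_renyi(u, eps):
    if u == 1:                             # limit u -> 1: Shannon entropy
        return -eps * log2(eps) - (1 - eps) * log2(1 - eps)
    return log2(eps**u + (1 - eps)**u) / (1 - u)

def rate(c, eps):
    return 1 - h_renyi(1 / (1 + c), eps)

print(rate(0, 0.01))                       # Shannon limit: 1 - h(0.01) ~ 0.919
print(rate(1, 0.01))                       # 1 - lg(1 + 2*sqrt(0.01*0.99)) ~ 0.74
```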
3. KUZNETSOV-TSYBAKOV PROBLEM, rate distortion
Imagine we can send 𝒏 bits, but 𝒌 = 𝒑_𝒇 𝒏 of them are fixed (cannot be changed).
For example QR-like codes with some fraction of pixels fixed (e.g. as the contour of a plane).
If the receiver knew the positions of these strong constraints,
we could just use the remaining 𝑛 − 𝑘 = (1 − 𝑝_𝑓)𝑛 bits.
Kuznetsov-Tsybakov problem: what if only the sender knows the constraints?
Surprisingly, we can still approach the same capacity.
But the cost (paid only by the sender!) grows to infinity there.
Weak/statistical constraints can be applied to all positions,
e.g. the standard steganography problem:
to avoid detection, we would like to modify as little as possible.
Entropy coder: the difference map carries ℎ(𝑝) ≈ 0.5 bits/symbol … the original picture is needed for the XOR.
Kuznetsov-Tsybakov … does the receiver have to know the constraints?
Homogeneous contrast case: use (𝑐, 1 − 𝑐) or (1 − 𝑐, 𝑐) distributions.
General constraints: the grayness of a pixel is the probability of using a 1 there.
Approaching the optimum: use the freedom of choosing the exact encoding sequence!
Find the encoding sequence most suitable for our constraints.
The cost (search) is paid while encoding; decoding is straightforward.
The decoder cannot directly deduce the constraints.
Uniform grayness: the probability that a random length-𝑛 bit sequence has 𝒈𝒏 ones is
(1/2^𝑛) (𝑛 choose 𝑔𝑛) ≈ 2^{−𝑛(1−ℎ(𝑔))},
so if we have essentially more than 𝟐^{𝒏(𝟏−𝒉(𝒈))} different (random) ways to encode a
single message, with high probability some of them will use 𝒈𝒏 ones.
The cost: the number of different messages is reduced to at most
2^𝑛 / 2^{𝑛(1−ℎ(𝑔))} = 2^{𝑛ℎ(𝑔)}, giving exactly the rate limit we would get if the receiver knew 𝒈.
Kuznetsov-Tsybakov: the probability of accidentally fulfilling all 𝒏𝒑_𝒇 constraints is 𝟐^{−𝒏𝒑_𝒇},
so we need more than 2^{𝑛𝑝_𝑓} different (pseudorandom) ways of encoding –
we can encode 2^𝑛 / 2^{𝑛𝑝_𝑓} = 2^{𝑛(1−𝑝_𝑓)} different messages –
the rate limit is 𝟏 − 𝒑_𝒇, as if the receiver knew the positions of the fixed bits.
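A toy version of this search (all sizes and the map 𝑇 below are assumptions for illustration): 16 payload bits, 8 freedom bits, 𝑛 = 24; only the sender checks the fixed positions, while the receiver simply inverts 𝑇:

```python
# Kuznetsov-Tsybakov toy: search freedom bits until all fixed positions fit
import hashlib

def mask(f):                               # shared pseudorandom 16-bit mask
    return int.from_bytes(hashlib.sha256(bytes([f])).digest()[:2], 'big')

def ks_encode(m, fixed):                   # fixed: dict bit_position -> forced bit
    for f in range(256):                   # ~ 2^(#constraints) attempts expected
        y = (f << 16) | (m ^ mask(f))      # T(m, f)
        if all((y >> pos) & 1 == b for pos, b in fixed.items()):
            return y
    return None                            # too many constraints for 8 freedom bits

def ks_decode(y):                          # needs no knowledge of the constraints
    return (y & 0xFFFF) ^ mask(y >> 16)

fixed = {0: 1, 7: 0, 11: 1, 20: 1, 23: 0}  # 5 bit positions forced by the medium
y = ks_encode(0xBEEF, fixed)
if y is not None:                          # fails only if no freedom value fits
    assert ks_decode(y) == 0xBEEF
    assert all((y >> p) & 1 == b for p, b in fixed.items())
```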
Constrained Coding (CC): find the closest 𝒀 = 𝑻(𝐩𝐚𝐲𝐥𝐨𝐚𝐝_𝐛𝐢𝐭𝐬, 𝐟𝐫𝐞𝐞𝐝𝐨𝐦_𝐛𝐢𝐭𝐬), send 𝒀.
Rate Distortion (RD): find the closest 𝑻(𝟎, 𝐟𝐫𝐞𝐞𝐝𝐨𝐦_𝐛𝐢𝐭𝐬), send the freedom_bits.
Because 𝑇 is a bijection, for the same constraints/message to fulfill/resemble:
density of freedom bits + density of payload bits = 1
rate of RD + rate of CC = 1
– we can translate the previous calculations by just using 1 − rate.
B&W picture: 1 bit per pixel = visual aspect (RD) + hidden message (CC)
4. DISTRIBUTED SOURCE CODING – separate compression and
joint decompression of correlated sources that don’t communicate with each other,
e.g. for correlated sensors, stereo recording (sound, video)
Statistical physics on the hypercube of 2^𝑛 sequences:
(𝑛 choose 𝑝𝑛) ≈ 2^{𝑛ℎ(𝑝)}, where ℎ(𝑝) = −𝑝 lg(𝑝) − (1 − 𝑝) lg(1 − 𝑝)
Entropy coding: a symbol of probability 𝑝 contains lg(1/𝑝) bits of information.
Error correction: with bitflip probability 𝜖, a bit carries 1 − ℎ(𝜖) bits of information;
encode as one of the codewords: the centers of balls of size 2^{𝑛ℎ(𝜖)}.
JRC: reconstruct if the sum of 1 − ℎ(𝜖) over the received packets is sufficient.
Rate distortion: approximate the sequence by the closest codeword.
Constrained coding: for each message there is a (disjoint) set of codewords;
for the given constraints, we find the closest codeword representing our message.
Distributed source coding: e.g. codewords include the correlations.