Unusual coding possibilities –
information theory in practice: statistical physics on the hypercube of 2^𝑛 sequences
1) Entropy coding: remove statistical redundancy in data compression:
Huffman, arithmetic coding, ANS
2) (Forward) Error correction: add deterministic redundancy to repair
damaged message (erasure, BSC, deletion channels):
LDPC, sequential decoding, fountain codes and expansions
3) Rate distortion: shorter approximate description (lossy
compression); Kuznetsov-Tsybakov: (statistical) constraints known
only to the sender (steganography, QR codes) – and their “duality”
4) Distributed source coding – for multiple correlated sources that
don’t communicate with each other.
Jarosław Duda, Kraków, 23.XI.2015
1. ENTROPY CODING – DATA COMPRESSION
We need 𝒏 bits of information to choose one of 𝟐𝒏 possibilities.
For length-𝒏 0/1 sequences with 𝒑𝒏 ones, how many bits do we need to choose one?
𝑛!/((𝑝𝑛)! ((1−𝑝)𝑛)!) ≈ (𝑛/𝑒)^𝑛 / ((𝑝𝑛/𝑒)^{𝑝𝑛} ((1−𝑝)𝑛/𝑒)^{(1−𝑝)𝑛})
= 2^{𝑛 lg(𝑛) − 𝑝𝑛 lg(𝑝𝑛) − (1−𝑝)𝑛 lg((1−𝑝)𝑛)} = 2^{𝑛(−𝑝 lg(𝑝) − (1−𝑝) lg(1−𝑝))} = 2^{𝑛ℎ(𝑝)}
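A quick numeric check of this approximation – a minimal Python sketch (parameters chosen arbitrarily):

```python
# lg of the number of length-n sequences with pn ones, per symbol, vs h(p)
from math import comb, log2

def h(p):                                  # binary Shannon entropy in bits
    return -p * log2(p) - (1 - p) * log2(1 - p)

n, p = 10000, 0.1
print(log2(comb(n, int(p * n))) / n)       # ~ 0.4684
print(h(p))                                # ~ 0.4690
```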
Entropy coder: encodes a sequence with probability distribution (𝑝_𝑠)_{𝑠=0..𝑚−1}
using asymptotically at least 𝑯 = ∑_𝑠 𝒑_𝒔 𝐥𝐠(𝟏/𝒑_𝒔) bits/symbol (𝐻 ≤ lg(𝑚)).
Seen as a weighted average: a symbol of probability 𝒑 contains 𝐥𝐠(𝟏/𝒑) bits.
Huffman coding:
- sort symbols by their frequencies,
- replace the two least frequent with a subtree of them, and so on.
symbol → length-𝑟 bit sequence; assumes Pr(symbol) ≈ 2^{−𝑟}
generally: prefix codes, defined by a tree
unique decoding: no codeword is a prefix of another
http://en.wikipedia.org/wiki/Huffman_coding
𝐴 → 0, 𝐵 → 100, 𝐶 → 101, 𝐷 → 110, 𝐸 → 111
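A minimal sketch of the construction in Python (heapq-based; tie-breaking between equal frequencies is arbitrary, so the exact codes may differ from the example above):

```python
# Huffman code construction: repeatedly merge the two least frequent subtrees
import heapq

def huffman(freqs):
    """freqs: dict symbol -> probability; returns dict symbol -> bit string."""
    heap = [(p, i, {s: ''}) for i, (s, p) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)             # two least frequent subtrees
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: '0' + c for s, c in c1.items()}
        merged.update({s: '1' + c for s, c in c2.items()})
        heapq.heappush(heap, (p1 + p2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

print(huffman({'A': 0.5, 'B': 0.25, 'C': 0.25}))    # e.g. A -> 0, B -> 10, C -> 11
```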
Symbol of probability 𝒑 contains 𝐥𝐠(𝟏/𝒑) bits
Encoding a (𝑝_𝑠) distribution with an entropy coder optimal for a (𝑞_𝑠) distribution costs
Δ𝐻 = ∑_𝑠 𝑝_𝑠 lg(1/𝑞_𝑠) − ∑_𝑠 𝑝_𝑠 lg(1/𝑝_𝑠) = ∑_𝑠 𝑝_𝑠 lg(𝑝_𝑠/𝑞_𝑠) ≈ (1/ln(4)) ∑_𝑠 (𝑝_𝑠 − 𝑞_𝑠)²/𝑝_𝑠
more bits/symbol – the so-called Kullback-Leibler “distance” (not symmetric).
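A quick check of this quadratic approximation in Python (distributions chosen arbitrarily):

```python
# Kullback-Leibler penalty: exact vs the (1/ln 4) * sum (p-q)^2 / p approximation
from math import log2, log

p = [1/3, 1/3, 1/3]                        # true distribution
q = [1/2, 1/4, 1/4]                        # distribution assumed by the coder
dH_exact  = sum(ps * log2(ps / qs) for ps, qs in zip(p, q))
dH_approx = sum((ps - qs)**2 / ps for ps, qs in zip(p, q)) / log(4)
print(dH_exact, dH_approx)                 # ~ 0.082 vs ~ 0.090 bits/symbol
```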
Huffman/prefix codes – encode symbols directly as bit sequences,
approximating probabilities by powers of 1/2:
e.g. p(A) = 1/2, p(B) = 1/4, p(C) = 1/4 gives the tree A → 0, B → 10, C → 11
Example of the Huffman penalty for the truncated 𝝆(𝟏 − 𝝆)^𝒏 distribution
on a 256-symbol alphabet. Huffman maps 8 bits → at least 1 bit (ratio ≤ 8 here),
so it is terrible at encoding very frequent events:

𝜌       8/H     Huffman   ANS
0.5     4.001   3.99      4.00
0.6     4.935   4.78      4.93
0.7     6.344   5.58      6.33
0.8     8.851   6.38      8.84
0.9     15.31   7.18      15.29
0.95    26.41   7.58      26.38
0.99    96.95   7.90      96.43
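The 8/H column can be approximately reproduced with a short Python sketch (small differences from the table are expected, e.g. from the exact normalization used):

```python
# 8/H for the truncated rho*(1-rho)^n distribution on a 256-symbol alphabet
from math import log2

def ratio_limit(rho, m=256):
    w = [rho * (1 - rho)**i for i in range(m)]
    p = [x / sum(w) for x in w]
    return 8 / sum(-pi * log2(pi) for pi in p)

for rho in (0.5, 0.6, 0.7, 0.8, 0.9):
    print(rho, round(ratio_limit(rho), 3))  # ~ 4.00, 4.94, 6.35, 8.87, 15.35
```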
Huffman coding (HC), prefix codes, needs sorting:
most of everyday compression, e.g. zip, gzip, cab, jpeg, gif, png, tiff, pdf, mp3…
(can’t use less than 1 bit/symbol)
zlibh (the fastest generally available implementation):
Encoding ~ 320 MB/s (/core, 3 GHz), Decoding ~ 300–350 MB/s
Range coding (RC): large-alphabet arithmetic coding,
needs multiplication, e.g. 7-Zip, Google VP video codecs (e.g. YouTube, Hangouts).
Encoding ~ 100–150 MB/s, Decoding ~ 80 MB/s
(binary) Arithmetic coding (AC): H.264, H.265 video, ultracompressors e.g. PAQ
Encoding/decoding ~ 20–30 MB/s
– different ratio/speed tradeoffs
Asymmetric Numeral Systems (ANS), tabled variant (tANS): no multiplication or sorting.
FSE implementation of tANS: Encoding ~ 350 MB/s, Decoding ~ 500 MB/s
RC → ANS: ~7× decoding speedup, no multiplication (switched e.g. in the LZA compressor)
HC → ANS: better compression and ~1.5× decoding speedup (e.g. zhuff, lzturbo)
𝜌       8/H     zlibh   FSE
0.5     4.001   3.99    4.00
0.6     4.935   4.78    4.93
0.7     6.344   5.58    6.33
0.8     8.851   6.38    8.84
0.9     15.31   7.18    15.29
0.95    26.41   7.58    26.38
0.99    96.95   7.90    96.43
Huffman vs ANS in compressors (LZ + entropy coder), from Matt Mahoney benchmarks http://mattmahoney.net/dc/text.html
enwik8 (100,000,000 B):

compressor                  size [B]     encode [ns/byte]  decode [ns/byte]
LZA 0.82b –mx9 –b7 –h7      26,396,613   449               9.7
lzturbo 1.2 –39 –b24        26,915,461   582               2.8
WinRAR 5.00 –ma5 –m5        27,835,431   1004              31
WinRAR 5.00 –ma5 –m2        29,758,785   228               30
lzturbo 1.2 –32             30,979,376   19                2.7
zhuff 0.97 –c2              34,907,478   24                3.5
gzip 1.3.5 –9               36,445,248   101               17
pkzip 2.0.4 –ex             36,556,552   171               50
ZSTD 0.0.1                  40,024,854   7.7               3.8
WinRAR 5.00 –ma5 –m1        40,565,268   54                31
zhuff, ZSTD (Yann Collet): LZ4 + tANS (switched from Huffman)
lzturbo (Hamid Bouzidi): LZ + tANS (switched from Huffman)
LZA (Nania Francesco): LZ + rANS (switched from range coding)
e.g. lzturbo vs gzip: better compression, 5× faster encoding, 6× faster decoding –
saving time and energy in an extremely frequent task
Apple LZFSE = Lempel-Ziv + Finite State Entropy
Default in iOS 9 and OS X 10.11: “matching the compression ratio of ZLIB level 5,
but with much higher energy efficiency and speed (between 2x and 3x)
for both encode and decode operations”.
Finite State Entropy is Yann Collet’s (known from e.g. LZ4) implementation of tANS.
CRAM 3.0 of the European Bioinformatics Institute for DNA compression:

format                    size [B]       encoding [s]  decoding [s]
BAM                       8,164,228,924  1216          111
CRAM v2 (LZ77 + Huffman)  5,716,980,996  1247          182
CRAM v3 (order-1 rANS)    4,922,879,082  574           190
gzip is used everywhere … but it’s ancient (1993) … what’s next?
Google – first Zopfli (2013, gzip-compatible), now Brotli (incompatible, Huffman-based) –
Firefox support, all over the news: “Google’s new Brotli compression algorithm
is 26 percent better than existing solutions” (22.09.2015)
First tests of new ZSTD_HC (Yann Collet, open source, tANS, 28.10.2015):
zstd v0.2 427 ms (239 MB/s), 51367682, 189 ms (541 MB/s)
zstd_HC v0.2 -1 996 ms (102 MB/s), 48918586, 244 ms (419 MB/s)
zstd_HC v0.2 -2 1152 ms (88 MB/s), 47193677, 247 ms (414 MB/s)
zstd_HC v0.2 -3 2810 ms (36 MB/s), 45401096, 249 ms (411 MB/s)
zstd_HC v0.2 -4 4009 ms (25 MB/s), 44631716, 243 ms (421 MB/s)
zstd_HC v0.2 -5 4904 ms (20 MB/s), 44529596, 244 ms (419 MB/s)
zstd_HC v0.2 -6 5785 ms (17 MB/s), 44281722, 244 ms (419 MB/s)
zstd_HC v0.2 -7 7352 ms (13 MB/s), 44108456, 241 ms (424 MB/s)
zstd_HC v0.2 -8 9378 ms (10 MB/s), 43978979, 237 ms (432 MB/s)
zstd_HC v0.2 -9 12306 ms (8 MB/s), 43901117, 237 ms (432 MB/s)
brotli 2015-10-11 level 0 1202 ms (85 MB/s), 47882059, 489 ms (209 MB/s)
brotli 2015-10-11 level 3 1739 ms (58 MB/s), 47451223, 475 ms (215 MB/s)
brotli 2015-10-11 level 4 2379 ms (43 MB/s), 46943273, 477 ms (214 MB/s)
brotli 2015-10-11 level 5 6175 ms (16 MB/s), 43363897, 528 ms (193 MB/s)
brotli 2015-10-11 level 6 9474 ms (10 MB/s), 42877293, 463 ms (221 MB/s)
lzturbo (tANS) seems even better, but it is closed-source.
Operating on a fractional number of bits:
restricting a range of size 𝑁 to a length-𝐿 subrange contains 𝐥𝐠(𝑵/𝑳) bits;
a number 𝑥 contains 𝐥𝐠(𝒙) bits.
Adding a symbol of probability 𝑝, containing 𝐥𝐠(𝟏/𝒑) bits:
arithmetic coding: lg(𝑁/𝐿) + lg(1/𝑝) ≈ lg(𝑁/𝐿′) for 𝑳′ ≈ 𝒑 ⋅ 𝑳
ANS: lg(𝑥) + lg(1/𝑝) = lg(𝑥′) for 𝒙′ ≈ 𝒙/𝒑
RENORMALIZATION to prevent 𝑥 → ∞:
enforce 𝑥 ∈ 𝐼 = {𝐿, …, 2𝐿 − 1} by transmitting the lowest bits to the bitstream.
The “buffer” 𝒙 contains 𝐥𝐠(𝒙) bits of information – it produces bits as they accumulate.
A symbol of Pr = 𝑝 contains lg(1/𝑝) bits: lg(𝑥) → lg(𝑥) + lg(1/𝑝) modulo 1
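A minimal rANS sketch of the 𝑥′ ≈ 𝑥/𝑝 step (assumptions: probabilities quantized as 𝑞_𝑠/𝑀 with 𝑀 = ∑ 𝑞_𝑠; Python big integers stand in for the renormalization used in practice):

```python
# rANS: encoding pushes a symbol into x, growing it by ~ lg(M/q_s) = lg(1/p_s) bits
def cdf(q):
    b, c = [], 0
    for qs in q:
        b.append(c); c += qs
    return b, c                            # b[s] = cumulative count, c = M

def rans_encode(symbols, q):
    b, M = cdf(q)
    x = 1
    for s in symbols:
        x = (x // q[s]) * M + b[s] + (x % q[s])
    return x

def rans_decode(x, q, count):
    b, M = cdf(q)
    out = []
    for _ in range(count):
        slot = x % M                       # the slot identifies the symbol
        s = max(i for i in range(len(q)) if b[i] <= slot)
        x = q[s] * (x // M) + slot - b[s]  # exact inverse of the encoding step
        out.append(s)
    return out[::-1]                       # ANS is LIFO: decoding reverses the order

q = [10, 3, 2, 1]                          # p ~ (10/16, 3/16, 2/16, 1/16)
msg = [0, 0, 1, 0, 3, 2, 0, 1, 0, 0]
assert rans_decode(rans_encode(msg, q), q, len(msg)) == msg
```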
What symbol spread should we choose?
(a PRNG seeded with a cryptographic key can be used here to also add encryption)
For example: rate loss for 𝑝 = (0.04, 0.16, 0.16, 0.64),
𝐿 = 16 states, 𝑞 = (1, 3, 2, 10), 𝑞/𝐿 = (0.0625, 0.1875, 0.125, 0.625):
method            symbol spread      dH/H     rate loss
–                 –                  ~0.011   penalty of the quantizer itself
Huffman           0011222233333333   ~0.080   would give a Huffman decoder
spread_range_i()  0111223333333333   ~0.059   increasing order
spread_range_d()  3333333333221110   ~0.022   decreasing order
spread_fast()     0233233133133133   ~0.020   fast
spread_prec()     3313233103332133   ~0.015   close to quantization dH/H
spread_tuned()    3233321333313310   ~0.0046  better than quantization dH/H due to also using 𝑝
spread_tuned_s()  2333312333313310   ~0.0040  L log L complexity (sort)
spread_tuned_p()  2331233330133331   ~0.0058  testing 1/(p ln(1+1/i)) ~ i/p approximation
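The first row (the penalty of the quantization alone) is easy to verify – a short sketch:

```python
# dH/H of quantizing p to q/L: KL(p || q/L) / H(p), matching the ~0.011 row
from math import log2

p = [0.04, 0.16, 0.16, 0.64]
q = [1, 3, 2, 10]; L = 16
dH = sum(ps * log2(ps * L / qs) for ps, qs in zip(p, q))
H  = sum(-ps * log2(ps) for ps in p)
print(dH / H)                              # ~ 0.011
```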
2. (forward) ERROR CORRECTION
Add deterministic redundancy to repair errors that may occur
Binary Symmetric Channel (BSC): (e.g. 0101 → 0111)
Every bit has independent 𝑝𝑏 probability of being flipped (0 ↔ 1)
Simplest error correction – Triple Modular Redundancy:
use 3 (or more) identical copies, decode using majority voting:
0 → 000, 1 → 111 … 001 → 0, 110 → 1
Codewords: the coding sequences actually used (000, 111).
The decoder chooses the closest codeword (the one whose ball contains the received word).
Hamming distance: the number of differing bits (e.g. 𝑑(001, 000) = 1).
Triple Modular Redundancy:
2 balls of radius 1 (around 000 and 111) cover all 3-bit sequences:
2 codewords ⋅ (1 + 3) ball size = 2³ sequences
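A minimal sketch of this repetition scheme:

```python
# Triple Modular Redundancy: repeat each bit 3 times, decode by majority vote
def tmr_encode(bits):
    return [b for b in bits for _ in range(3)]        # 0 -> 000, 1 -> 111

def tmr_decode(coded):
    triples = zip(coded[0::3], coded[1::3], coded[2::3])
    return [1 if sum(t) >= 2 else 0 for t in triples] # closest codeword wins

assert tmr_decode([0, 0, 1,  1, 1, 0]) == [0, 1]      # single flips repaired
```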
(7,4) Hamming codes: 𝑑(𝑐₁, 𝑐₂) ≥ 3 for 𝑐₁ ≠ 𝑐₂.
7 bits – a ball of radius 1 contains 8 sequences;
2⁴ codewords, union of their balls: 2⁴ ⋅ 8 = 2⁷.
Finally: the 2⁷ sequences are divided among the 2⁴ radius-1 balls around codewords.
Bit position   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19  20 …
Encoded bits   p1  p2  d1  p4  d2  d3  d4  p8  d5  d6  d7  d8  d9  d10 d11 p16 d12 d13 d14 d15

Parity bit coverage (p_k covers the positions whose binary representation contains k):
p1: 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, …
p2: 2, 3, 6, 7, 10, 11, 14, 15, 18, 19, …
p4: 4–7, 12–15, 20–23, …
p8: 8–15, 24–31, …
p16: 16–31, …
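A minimal Hamming(7,4) sketch; thanks to this position numbering, the syndrome (XOR of the positions of the set bits) directly points at a single error:

```python
# Hamming(7,4): parity bits at positions 1, 2, 4; syndrome = error position
def hamming74_encode(d):
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4                    # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4                    # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4                    # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_decode(c):
    s = 0
    for pos, bit in enumerate(c, start=1):
        if bit: s ^= pos                 # XOR of positions of set bits
    if s: c[s - 1] ^= 1                  # nonzero syndrome: flip that position
    return [c[2], c[4], c[5], c[6]]      # data bits at positions 3, 5, 6, 7

word = hamming74_encode([1, 0, 1, 1])
word[5] ^= 1                             # flip one bit in transit
assert hamming74_decode(word) == [1, 0, 1, 1]
```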
Triple Modular Redundancy: rate 1/3 (we need to transmit 3 times more).
Assume 𝜖 = 0.01: the probability that 2 of the 3 copies get flipped is ≈ 0.0003 > 0,
so decoding can still fail.
Theory: rate 1 − ℎ(0.01) ≈ 0.919 can fully repair the message.
The size of a radius-𝜖𝑛 ball in the space of length-𝑛 sequences (hypercube) is
(𝑛 choose 𝜖𝑛) ≈ 2^{𝑛ℎ(𝜖)}.
The number of such balls: 2^𝑛 / 2^{𝑛ℎ(𝜖)} = 2^{𝑛(1−ℎ(𝜖))}.
We can send 𝑛(1 − ℎ(𝜖)) bits by choosing one of the balls (codewords).
Alternative explanation of the rate 𝑹(𝝐) = 𝟏 − 𝒉(𝝐) bits/symbol:
if we additionally had the bitflip positions, worth ℎ(𝜖) bits/symbol,
we would get 1 bit/symbol.
The theoretical rate is asymptotic – it requires long bit blocks.
Directly long blocks: Low Density Parity Check (LDPC) codes – randomly distributed parity bits.
Correction uses belief propagation, operating on Pr(c_i = 1):
C_i = Pr(c_i = 1) for the input message (initialization of the process),
q_ij(b) (initialized to C_i for b = 1) – ‘belief’ sent from bit c_i to check f_j,
r_ji(b) – Pr(c_i = b) sent from check f_j back to bit c_i.
Sequential decoding – search the tree of possible corrections:
𝑠𝑡𝑎𝑡𝑒 = 𝑓(𝑖𝑛𝑝𝑢𝑡_𝑏𝑙𝑜𝑐𝑘, 𝑠𝑡𝑎𝑡𝑒),
e.g. 𝑜𝑢𝑡𝑝𝑢𝑡_𝑏𝑙𝑜𝑐𝑘 = ”𝑖𝑛𝑝𝑢𝑡_𝑏𝑙𝑜𝑐𝑘 | (𝑠𝑡𝑎𝑡𝑒 & 1)”.
The 𝒔𝒕𝒂𝒕𝒆 depends on the history – it connects redundancies,
guards the proper correction, uniformly spreading checksums.
Encoding uses only some subset of 𝑠𝑡𝑎𝑡𝑒s (e.g. last bit = 0).
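A toy sketch of this idea (the block layout, state size, and state-update function below are all assumptions for illustration, not any standard code): each 4-bit data block is followed by 4 checksum bits from a rolling state, and the decoder best-first searches the tree of corrections, always expanding the candidate with the fewest assumed bit flips:

```python
import heapq

def step(state, d):                       # toy state update mixing in the data
    return ((state * 33) ^ d) & 0xFFFF

def encode(nibbles, state=1):
    out = []
    for d in nibbles:
        state = step(state, d)
        out.append((d, state & 0xF))      # input block + checksum from state
    return out

def decode(received, state=1):
    heap = [(0, 0, state, ())]            # (assumed flips, position, state, decoded)
    while heap:
        flips, pos, st, dec = heapq.heappop(heap)
        if pos == len(received):
            return list(dec)              # cheapest full correction found
        rd, rc = received[pos]
        for d in range(16):               # every hypothesis for this data block
            st2 = step(st, d)
            cost = bin(d ^ rd).count('1') + bin((st2 & 0xF) ^ rc).count('1')
            heapq.heappush(heap, (flips + cost, pos + 1, st2, dec + (d,)))

msg = [3, 14, 7, 0, 9, 5]
tx = encode(msg)
tx[2] = (tx[2][0] ^ 4, tx[2][1])          # flip one data bit in transit
assert decode(tx) == msg                  # state-linked checksums expose the flip
```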
Channels:
Binary Symmetric Channel (difficult to repair, well understood) –
bits have independent Pr = 𝑝_𝑏 of being flipped (e.g. 0101 → 0111); rate = 1 − ℎ(𝑝_𝑏)
Erasure Channel (simple to repair, well understood) –
bits have independent Pr = 𝑝_𝑒 of being erased (e.g. 0101 → 01?1); rate = 1 − 𝑝_𝑒
Synchronization errors, for example the Deletion Channel –
bits have independent Pr = 𝑝_𝑑 of being deleted (e.g. 0101 → 011);
extremely difficult to repair, rate = ??? (capacity still unknown)
Rateless erasure codes: produce some large number of packets;
any large enough subset (≥ 𝑁) is sufficient for reconstruction.
For example Reed-Solomon codes (slow):
a degree-(𝑁−1) polynomial is determined by its values at any 𝑁 points.
Fountain codes (from 2002, Michael Luby): (𝑋₁, 𝑋₂, …) → (𝑌₁, 𝑌₂, …),
the 𝒀 are XORs of some (random) subsets of the 𝑿,
e.g. if 𝑌₁ = 𝑋₁, 𝑌₂ = 𝑋₁ XOR 𝑋₂ then 𝑋₁ = 𝑌₁, 𝑋₂ = 𝑌₂ XOR 𝑋₁.
What densities of XORed blocks should we choose?
Pr(1) = 1/𝑁, Pr(𝑑) = 1/(𝑑(𝑑−1)) for 𝑑 = 2, …, 𝑁 (the ideal soliton distribution).
Quick, but needs (1 + 𝜖)𝑁 packets to reliably decode (is the matrix invertible?)
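A sketch of such a code with the simplest “peeling” decoder (repeatedly substitute known blocks and harvest degree-1 packets); this toy version can fail to finish, which is exactly why slightly more than 𝑁 packets are needed:

```python
# LT-style fountain code: packets are XORs of random subsets of the data blocks
import random

def soliton(N):                           # Pr(1) = 1/N, Pr(d) = 1/(d(d-1))
    r, p, d = random.random(), 1.0 / N, 1
    while r > p and d < N:
        d += 1; p += 1.0 / (d * (d - 1))
    return d

def make_packet(data):
    idx = random.sample(range(len(data)), soliton(len(data)))
    val = 0
    for i in idx: val ^= data[i]
    return set(idx), val

def peel(packets, N):
    known, pkts, progress = {}, [[set(s), v] for s, v in packets], True
    while progress and len(known) < N:
        progress = False
        for pkt in pkts:
            for i in list(pkt[0]):        # substitute already-known blocks
                if i in known:
                    pkt[0].discard(i); pkt[1] ^= known[i]
            if len(pkt[0]) == 1:          # degree-1 packet reveals a block
                known[pkt[0].pop()] = pkt[1]; progress = True
    return known if len(known) == N else None

data = [random.getrandbits(32) for _ in range(20)]
packets = [make_packet(data) for _ in range(35)]      # ~ (1 + eps) N packets
print(peel(packets, 20) == dict(enumerate(data)))     # usually True
```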
If a packet wasn’t lost, it has probably been damaged.
Protecting it with error correction requires a priori knowledge of the damage level,
unavailable in: data storage, watermarking, unknown packet routes.
Joint Reconstruction Codes (JRC): the same, but using a posteriori knowledge of the damage level.
Some packets are lost, some are damaged.
Adding redundancy (sender) requires knowing the damage level –
not available in many situations (JRC…):
- damage of a storage medium depends on its history,
- a broadcaster doesn’t know the individual damage levels,
- damage of watermarking depends on the capturing conditions,
- damage of a packet in a network depends on its route,
- rapidly varying conditions can prevent adaptation.
We could add some statistical/deterministic knowledge to the decoder,
getting compression/entropy coding without the encoder’s knowledge.
JRC: the undamaged case, when 𝑁 + 1 packets are received.
Large 𝑁: Gaussian distribution, on average 1.5 more nodes to test.
Damaged case (𝑁 = 3: two packets with 𝜖 = 0, the damaged one with 𝜖 = 0.2): ≈ Pareto distribution.
Sequential decoding: choose and expand the most promising candidate (largest weight).
The rate is 𝑅_𝑐(𝜖) = 1 − ℎ_{1/(1+𝑐)}(𝜖) for Pareto coefficient 𝑐: Pr(# steps > 𝑠) ∝ 𝑠^{−𝑐},
where ℎ_𝑢(𝜖) = lg(𝜖^𝑢 + (1−𝜖)^𝑢) / (1−𝑢) is the Renyi entropy (ℎ₁ is the Shannon entropy):
𝑅₀(𝜖) = 1 − ℎ(𝜖) (the Shannon limit),
𝑅₁(𝜖) = 1 − lg(1 + 2√(𝜖(1−𝜖)))
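These rates are easy to evaluate numerically – a short sketch:

```python
# R_c(eps) = 1 - h_{1/(1+c)}(eps), with h_u the Renyi entropy of (eps, 1-eps)
from math import log2

def h_renyi(u, eps):
    if u == 1:                             # limit u -> 1: Shannon entropy
        return -eps * log2(eps) - (1 - eps) * log2(1 - eps)
    return log2(eps**u + (1 - eps)**u) / (1 - u)

def rate(c, eps):
    return 1 - h_renyi(1 / (1 + c), eps)

print(rate(0, 0.01))                       # Shannon limit: 1 - h(0.01) ~ 0.919
print(rate(1, 0.01))                       # 1 - lg(1 + 2*sqrt(0.01*0.99)) ~ 0.74
```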
3. KUZNETSOV-TSYBAKOV PROBLEM, rate distortion
Imagine we can send 𝒏 bits, but 𝒌 = 𝒑_𝒇 𝒏 of them are fixed (cannot be changed).
For example QR-like codes with some fraction of pixels fixed (e.g. as the contour of a plane).
If the receiver knew the positions of these strong constraints,
we could just use the remaining 𝑛 − 𝑘 = (1 − 𝑝_𝑓)𝑛 bits.
Kuznetsov-Tsybakov problem: what if only the sender knows the constraints?
Surprisingly, we can still approach the same capacity.
But the cost (paid only by the sender!) grows to infinity there.
Weak/statistical constraints can be applied to all positions,
e.g. the standard steganography problem:
to avoid detection, we would like to modify as little as possible.
Entropy coder: the difference map carries ℎ(𝑝) ≈ 0.5 bits/symbol … the original picture is needed for the XOR.
Kuznetsov-Tsybakov … does the receiver have to know the constraints?
Homogeneous contrast case: use (𝑐, 1 − 𝑐) or (1 − 𝑐, 𝑐) distributions.
General constraints: the grayness of a pixel is the probability of using a 1 there.
Approaching the optimum: use the freedom of choosing the exact encoding sequence!
Find the encoding sequence most suitable for our constraints.
The cost (search) is paid while encoding; decoding is straightforward.
The decoder cannot directly deduce the constraints.
Uniform grayness: the probability that a random length-𝑛 bit sequence has 𝒈𝒏 ones is
(1/2^𝑛) (𝑛 choose 𝑔𝑛) ≈ 2^{−𝑛(1−ℎ(𝑔))},
so if we have essentially more than 𝟐^{𝒏(𝟏−𝒉(𝒈))} different (random) ways to encode a
single message, with high probability some of them will use 𝒈𝒏 ones.
The cost: the number of different messages is reduced to at most
2^𝑛 / 2^{𝑛(1−ℎ(𝑔))} = 2^{𝑛ℎ(𝑔)}, giving exactly the rate limit we would get if the receiver knew 𝒈.
Kuznetsov-Tsybakov: the probability of accidentally fulfilling all 𝒏𝒑_𝒇 constraints is 𝟐^{−𝒏𝒑_𝒇},
so we need more than 2^{𝑛𝑝_𝑓} different (pseudorandom) ways of encoding –
we can encode 2^𝑛 / 2^{𝑛𝑝_𝑓} = 2^{𝑛(1−𝑝_𝑓)} different messages –
the rate limit is 𝟏 − 𝒑_𝒇, as if the receiver knew the positions of the fixed bits.
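A toy version of this search (all sizes and the map 𝑇 below are assumptions for illustration): 16 payload bits, 8 freedom bits, 𝑛 = 24; only the sender checks the fixed positions, while the receiver simply inverts 𝑇:

```python
# Kuznetsov-Tsybakov toy: search freedom bits until all fixed positions fit
import hashlib

def mask(f):                               # shared pseudorandom 16-bit mask
    return int.from_bytes(hashlib.sha256(bytes([f])).digest()[:2], 'big')

def ks_encode(m, fixed):                   # fixed: dict bit_position -> forced bit
    for f in range(256):                   # ~ 2^(#constraints) attempts expected
        y = (f << 16) | (m ^ mask(f))      # T(m, f)
        if all((y >> pos) & 1 == b for pos, b in fixed.items()):
            return y
    return None                            # too many constraints for 8 freedom bits

def ks_decode(y):                          # needs no knowledge of the constraints
    return (y & 0xFFFF) ^ mask(y >> 16)

fixed = {0: 1, 7: 0, 11: 1, 20: 1, 23: 0}  # 5 bit positions forced by the medium
y = ks_encode(0xBEEF, fixed)
if y is not None:                          # fails only if no freedom value fits
    assert ks_decode(y) == 0xBEEF
    assert all((y >> p) & 1 == b for p, b in fixed.items())
```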
Constrained Coding (CC): find the closest 𝒀 = 𝑻(𝐩𝐚𝐲𝐥𝐨𝐚𝐝_𝐛𝐢𝐭𝐬, 𝐟𝐫𝐞𝐞𝐝𝐨𝐦_𝐛𝐢𝐭𝐬), send 𝒀.
Rate Distortion (RD): find the closest 𝑻(𝟎, 𝐟𝐫𝐞𝐞𝐝𝐨𝐦_𝐛𝐢𝐭𝐬), send the freedom_bits.
Because 𝑇 is a bijection, for the same constraints/message to fulfill/resemble:
density of freedom bits + density of payload bits = 1
rate of RD + rate of CC = 1
– we can translate the previous calculations by just using 1 − rate.
B&W picture: 1 bit per pixel = visual aspect (RD) + hidden message (CC)
4. DISTRIBUTED SOURCE CODING – separate compression and
joint decompression of correlated sources that don’t communicate with each other,
e.g. for correlated sensors, stereo recording (sound, video)
Statistical physics on the hypercube of 2^𝑛 sequences:
(𝑛 choose 𝑝𝑛) ≈ 2^{𝑛ℎ(𝑝)}, where ℎ(𝑝) = −𝑝 lg(𝑝) − (1 − 𝑝) lg(1 − 𝑝)
Entropy coding: a symbol of probability 𝑝 contains lg(1/𝑝) bits of information.
Error correction: with bitflip probability 𝜖, a bit carries 1 − ℎ(𝜖) bits of information;
encode as one of the codewords: the centers of balls of size 2^{𝑛ℎ(𝜖)}.
JRC: reconstruct if the sum of 1 − ℎ(𝜖) over the received packets is sufficient.
Rate distortion: approximate the sequence by the closest codeword.
Constrained coding: for each message there is a (disjoint) set of codewords;
for the given constraints, we find the closest codeword representing our message.
Distributed source coding: e.g. codewords include the correlations.