Entropy Coding

Vicente González Ruiz - Departamento de Informática - UAL

November 4, 2024

Contents

 1 What is entropy coding?
 2 Why EC?
 3 A classification of entropy encoders
 4 Huffman Coding (HC)
 5 Arithmetic Coding (AC)
 6 zlib
 7 Portable Network Graphics (PNG)
 8 References

1 What is entropy coding?

Entropy Coding (EC) [4] encompasses a whole series of coding techniques that exploit the statistical redundancy in data with the ultimate goal of finding a more compact representation, but without losing information. EC is related to the definition of entropy in the context of Information Theory [6]. In this area, the entropy quantifies the average amount of information represented by a data set, so that the higher the entropy, the closer the data already are to their most compact representation.
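For example, the entropy of a sequence can be estimated from the relative frequency of each symbol. The following minimal sketch (the function name and the test strings are ours, chosen only for illustration) shows that a skewed sequence carries fewer bits of information per symbol than a uniform one:

    import math
    from collections import Counter

    def entropy(data):
        """Estimate the entropy, in bits/symbol, of a sequence of symbols."""
        counts = Counter(data)
        total = len(data)
        return -sum((n / total) * math.log2(n / total) for n in counts.values())

    print(entropy(b"aaaaaaab"))  # low entropy (~0.54 bits/symbol)
    print(entropy(b"abcdefgh"))  # high entropy (3.0 bits/symbol)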

2 Why EC?

All compression techniques for data (text, audio, image, video, etc.) rely on EC to achieve an effective reduction in the number of bits. For example, JPEG uses a combination of Huffman Coding (see below) and Run-Length Encoding to represent the quantized DCT coefficients.

3 A classification of entropy encoders

Generally speaking, data are sequences of symbols. There are basically two types of entropy encoders, depending on how the symbols are processed:

  1. Symbol encoders: Those that process the sequence symbol by symbol. Examples of this type of algorithm are Huffman Coding [5] and Arithmetic Coding [2]. These encoders are also called “probabilistic encoders” because the number of bits of code assigned to a symbol \(s\) depends on the probability of the symbol, \(p(s)\).
  2. String encoders: Those that process the sequence by blocks of symbols (strings). Examples are Run-Length Encoding (RLE) [9] and all the dictionary-based text compressors, such as LZW [7]. A minimal RLE sketch follows this list.
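As an illustration of a string encoder, here is a naive RLE-style codec in a few lines of Python (the function names are ours, not taken from any referenced implementation):

    from itertools import groupby

    def rle_encode(data):
        """Encode a sequence as a list of (symbol, run_length) pairs."""
        return [(symbol, len(list(run))) for symbol, run in groupby(data)]

    def rle_decode(pairs):
        """Rebuild the original sequence from the (symbol, run_length) pairs."""
        return bytes(symbol for symbol, length in pairs for _ in range(length))

    pairs = rle_encode(b"aaaabbbcca")
    print(pairs)  # [(97, 4), (98, 3), (99, 2), (97, 1)]
    assert rle_decode(pairs) == b"aaaabbbcca"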

4 Huffman Coding (HC)

A Huffman code [5, 1] is a prefix code that, for each code-word, allows the decoder to “navigate” through the so-called Huffman tree from the root to one of its leaves, without uncertainty. The Huffman tree satisfies that \begin{equation} l(c(s)) = \lceil I(s)\rceil, \label{eq:huffman_performance} \end{equation} where \(l(c(s))\) is the length of the (Huffman) code-word assigned to the symbol \(s\), usually in bits of data, and \(I(s)\) is the amount of information represented by \(s\), measured in bits of information [6].

Notice that the minimum number of bits that can be used for representing a symbol (using Huffman) is 1, which can be inefficient when the length of the alphabet is small.1 Another drawback of HC is that the Huffman tree must be updated each time the probabilistic model is updated, an action that is very frequent when adaptive models are used.
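To make the construction concrete, here is a minimal sketch (not the Huffman.py module listed below) that builds a code-word table from the symbol frequencies, using Python's heapq module:

    import heapq
    from collections import Counter

    def huffman_code(data):
        """Build a {symbol: code-word} table from the frequencies in data."""
        heap = [[count, i, {symbol: ""}]
                for i, (symbol, count) in enumerate(Counter(data).items())]
        heapq.heapify(heap)
        while len(heap) > 1:
            lo = heapq.heappop(heap)  # least probable subtree
            hi = heapq.heappop(heap)  # second least probable subtree
            # Prepend one bit to every code-word of each merged subtree.
            merged = {s: "0" + c for s, c in lo[2].items()}
            merged.update({s: "1" + c for s, c in hi[2].items()})
            heapq.heappush(heap, [lo[0] + hi[0], lo[1], merged])
        return heap[0][2]

    code = huffman_code(b"aaaabbc")
    print(code)  # {99: '00', 98: '01', 97: '1'}

Each iteration merges the two least probable subtrees, so less probable symbols end up deeper in the tree and receive longer code-words, in agreement with the equation above.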

Resources

  1. Huffman.py: Implementation in VCF.
  2. Notebook showing the use of Huffman.py.
  3. Compression algorithms in Python.
  4. Huffman Coding Implementation in Python with Examples.
  5. Huffman Coding.
  6. Prediction by Partial Matching.
  7. Lossless JPEG.
  8. Context-Adaptive Variable-Length Coding.

To do

  1. Create a new module named adaptive_Huffman.py that uses a context-based adaptive probabilistic model [10], where the user can select the order of the model.

5 Arithmetic Coding (AC)

In an arithmetic codec [2, 1], the number of bits of data used for representing the symbols matches exactly the number of bits of information provided by them, i.e., \begin{equation} l(c(s)) = I(s). \end{equation}

This also means that, even if the size of the alphabet is small, the coding performance of an arithmetic code is optimal, although this optimality is only fully reached when the number of symbols to encode tends to infinity. Notice also that, if the alphabet is large, the difference in coding performance between a Huffman code (or any other prefix code) and an arithmetic code vanishes.
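A quick numeric illustration of this advantage, for a binary source with made-up probabilities: HC must spend at least 1 bit per symbol, while AC approaches the entropy of the source.

    import math

    # A binary alphabet with very skewed (hypothetical) probabilities.
    p = {"0": 0.95, "1": 0.05}

    entropy = -sum(ps * math.log2(ps) for ps in p.values())

    print(f"entropy    = {entropy:.3f} bits/symbol")   # ~0.286
    print("Huffman    = 1.000 bits/symbol")            # 1 bit/symbol minimum
    print(f"arithmetic ~ {entropy:.3f} bits/symbol")   # l(c(s)) = I(s)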

Resources

  1. Arithmetic Coding (notebook).
  2. Reference Arithmetic Coding/Reference Arithmetic Coding.
  3. Compression Algorithms in Python.
  4. Prediction by Partial Matching.
  5. Lossless JPEG.
  6. Context-Adaptive Binary Arithmetic Coding.

To do

  1. Create a new module named arith.py that uses a context-based adaptive probabilistic model and Arithmetic Coding [10], where the user can select the order of the model.

6 zlib

zlib is based on DEFLATE, which in turn is based on LZ77 [7, 1] and Huffman coding. Therefore, zlib exploits the repetition of strings of symbols and also the 0-order statistical redundancy2. One of the main advantages of zlib is that it is quite fast compared to pure symbol encoders such as HC and AC.

Nowadays, zlib is a cornerstone of data manipulation because it is the basic library used by such important applications as Zip and Gzip.
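For example, using Python's built-in zlib module (the sample data is made up):

    import zlib

    # Redundant data compresses well: zlib exploits both the repeated
    # strings (LZ77) and the symbol frequencies (Huffman coding).
    data = b"entropy coding " * 1000
    compressed = zlib.compress(data, level=9)  # 9 = best compression

    print(len(data), "->", len(compressed), "bytes")
    assert zlib.decompress(compressed) == data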

Resources

  1. zlib — Compression compatible with gzip.
  2. Lossless JPEG.

7 Portable Network Graphics (PNG)

PNG [8] (pronounced “ping”) is a dictionary-based (string encoding) lossless image compression format used for representing digital images [3]. The entropy encoder of PNG is based on HC and LZSS, and it is preceded by a pixel predictor that removes the spatial redundancy.

We must bear in mind that, being an image compressor, we can only interact with PNG at the image level; that is, it only accepts images (grayscale or color, possibly with an alpha channel), and it only returns images.
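A minimal sketch of this image-level interaction, assuming that the Pillow and NumPy packages are available (the file path is arbitrary):

    import numpy as np
    from PIL import Image

    # A synthetic grayscale image with high spatial redundancy.
    img = np.tile(np.arange(256, dtype=np.uint8), (256, 1))

    # PNG only accepts an image and only returns an image; the
    # prediction and the entropy coding happen inside the format.
    Image.fromarray(img).save("/tmp/gradient.png")
    decoded = np.asarray(Image.open("/tmp/gradient.png"))

    assert (decoded == img).all()  # lossless: the round trip is exact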

8 References

[1]   Vicente González-Ruiz. Compresión Reversible y Transmisión de Imágenes. PhD thesis, University of Almería (UAL), 2000.

[2]   V. González-Ruiz. Arithmetic Coding.

[3]   V. González-Ruiz. Digital Images and Videos.

[4]   V. González-Ruiz. Entropy Coding.

[5]   V. González-Ruiz. Huffman Coding.

[6]   V. González-Ruiz. Information Theory.

[7]   V. González-Ruiz. Lempel-Ziv-Welch.

[8]   V. González-Ruiz. PNG (Portable Network Graphics).

[9]   V. González-Ruiz. Run-Length Encoding.

[10]   M. Nelson and J. Gailly. The Data Compression Book. M&T Books, 1996.

1In this case, Huffman coding can be ineffective. In the extreme case of an alphabet with only two symbols, each symbol is assigned a 1-bit code-word, and therefore the Huffman encoder neither compresses nor expands.

2In other words, that some symbols are more frequent than others.