Entropy Coding

Vicente González Ruiz - Departamento de Informática - UAL

November 4, 2024

Contents

 1 What is entropy coding?
 2 Why EC?
 3 A classification of entropy encoders
 4 Huffman Coding (HC)
 5 Arithmetic Coding (AC)
 6 zlib
 7 Portable Network Graphics (PNG)
 8 References

1 What is entropy coding?

Entropy Coding (EC) [4] encompasses a whole series of coding techniques that exploit the statistical redundancy in data with the ultimate goal of finding a more compact representation, but without losing information. EC is related to the definition of entropy in the context of Information Theory [6]. In this area, the entropy quantifies the average amount of information represented by a data set, so that the higher the entropy, the closer the data already are to their most compact representation.
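For example, the entropy of a sequence can be estimated from the relative frequency of each symbol. The following minimal sketch (the function name and the test strings are ours, chosen only for illustration) shows that a skewed sequence carries fewer bits of information per symbol than a uniform one:

    import math
    from collections import Counter

    def entropy(data):
        """Estimate the entropy, in bits/symbol, of a sequence of symbols."""
        counts = Counter(data)
        total = len(data)
        return -sum((n / total) * math.log2(n / total) for n in counts.values())

    print(entropy(b"aaaaaaab"))  # low entropy (~0.54 bits/symbol)
    print(entropy(b"abcdefgh"))  # high entropy (3.0 bits/symbol)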

2 Why EC?

All compression techniques for data (text, audio, image, video, etc.) rely on EC to achieve an effective reduction in the number of bits. For example, JPEG uses a combination of Huffman Coding (see below) and Run-Length Encoding to represent the quantized DCT coefficients.

3 A classification of entropy encoders

Generally speaking, data are sequences of symbols. There are basically two types of entropy encoders, depending on how the symbols are processed:

  1. Symbol encoders: Those that process the sequence symbol by symbol. Examples of this type of algorithm are Huffman Coding [5] and Arithmetic Coding [2]. These encoders are also called “probabilistic encoders” because the number of bits of code assigned to a symbol \(s\) depends on the probability of the symbol, \(p(s)\).
  2. String encoders: Those that process the sequence by blocks of symbols (strings). Examples are Run-Length Encoding (RLE) [9] and all the dictionary-based text compressors, such as LZW [7]. A minimal RLE sketch follows this list.
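As an illustration of a string encoder, here is a naive RLE-style codec in a few lines of Python (the function names are ours, not taken from any referenced implementation):

    from itertools import groupby

    def rle_encode(data):
        """Encode a sequence as a list of (symbol, run_length) pairs."""
        return [(symbol, len(list(run))) for symbol, run in groupby(data)]

    def rle_decode(pairs):
        """Rebuild the original sequence from the (symbol, run_length) pairs."""
        return bytes(symbol for symbol, length in pairs for _ in range(length))

    pairs = rle_encode(b"aaaabbbcca")
    print(pairs)  # [(97, 4), (98, 3), (99, 2), (97, 1)]
    assert rle_decode(pairs) == b"aaaabbbcca"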

4 Huffman Coding (HC)

A Huffman code [5, 1] is a prefix code that, for each code-word, allows the decoder to “navigate” through the so-called Huffman tree from the root to one of its leaves, without uncertainty. The Huffman tree satisfies that \begin{equation} l(c(s)) = \lceil I(s)\rceil, \label{eq:huffman_performance} \end{equation} where \(l(c(s))\) is the length of the (Huffman) code-word assigned to the symbol \(s\), usually in bits of data, and \(I(s)\) is the amount of information represented by \(s\), measured in bits of information [6].

Notice that the minimum number of bits that can be used for representing a symbol (using Huffman) is 1, which can be inefficient when the length of the alphabet is small.1 Another drawback of HC is that the Huffman tree must be updated each time the probabilistic model is updated, an action that is very frequent when adaptive models are used.
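To make the construction concrete, here is a minimal sketch (not the Huffman.py module listed below) that builds a code-word table from the symbol frequencies, using Python's heapq module:

    import heapq
    from collections import Counter

    def huffman_code(data):
        """Build a {symbol: code-word} table from the frequencies in data."""
        heap = [[count, i, {symbol: ""}]
                for i, (symbol, count) in enumerate(Counter(data).items())]
        heapq.heapify(heap)
        while len(heap) > 1:
            lo = heapq.heappop(heap)  # least probable subtree
            hi = heapq.heappop(heap)  # second least probable subtree
            # Prepend one bit to every code-word of each merged subtree.
            merged = {s: "0" + c for s, c in lo[2].items()}
            merged.update({s: "1" + c for s, c in hi[2].items()})
            heapq.heappush(heap, [lo[0] + hi[0], lo[1], merged])
        return heap[0][2]

    code = huffman_code(b"aaaabbc")
    print(code)  # {99: '00', 98: '01', 97: '1'}

Each iteration merges the two least probable subtrees, so less probable symbols end up deeper in the tree and receive longer code-words, in agreement with the equation above.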

Resources

  1. Huffman.py: Implementation in VCF.
  2. Notebook showing the use of Huffman.py.
  3. Compression algorithms in Python.
  4. Huffman Coding Implementation in Python with Examples.
  5. Huffman Coding.
  6. Prediction by Partial Matching.
  7. Lossless JPEG.
  8. Context-Adaptive Variable-Length Coding.

To do

  1. Create a new module named adaptive_Huffman.py that uses a context-based adaptive probabilistic model [10], where the user can select the order of the model.

5 Arithmetic Coding (AC)

In an arithmetic codec [2, 1], the number of bits of data used for representing the symbols matches exactly the number of bits of information provided by them, i.e., \begin{equation} l(c(s)) = I(s). \end{equation}

This also means that, even if the size of the alphabet is small, the coding performance of an arithmetic code is optimal, although this optimality is only fully reached when the number of symbols to encode tends to infinity. Notice also that, if the alphabet is large, the difference in coding performance between a Huffman code (or any other prefix code) and an arithmetic code vanishes.
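A quick numeric illustration of this advantage, for a binary source with made-up probabilities: HC must spend at least 1 bit per symbol, while AC approaches the entropy of the source.

    import math

    # A binary alphabet with very skewed (hypothetical) probabilities.
    p = {"0": 0.95, "1": 0.05}

    entropy = -sum(ps * math.log2(ps) for ps in p.values())

    print(f"entropy    = {entropy:.3f} bits/symbol")   # ~0.286
    print("Huffman    = 1.000 bits/symbol")            # 1 bit/symbol minimum
    print(f"arithmetic ~ {entropy:.3f} bits/symbol")   # l(c(s)) = I(s)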

Resources

  1. Arithmetic Coding (notebook).
  2. Reference Arithmetic Coding/Reference Arithmetic Coding.
  3. Compression Algorithms in Python.
  4. Prediction by Partial Matching.
  5. Lossless JPEG.
  6. Context-Adaptive Binary Arithmetic Coding.

To do

  1. Create a new module named arith.py that uses a context-based adaptive probabilistic model and Arithmetic Coding [10], where the user can select the order of the model.

6 zlib

zlib is based on DEFLATE, which in turn is based on LZ77 [7, 1] and Huffman coding. Therefore, zlib exploits the repetition of strings of symbols and also the 0-order statistical redundancy2. One of the main advantages of zlib is that it is quite fast compared to pure symbol encoders such as HC and AC.

Nowadays, zlib is a cornerstone of data manipulation because it is the basic library used by such important applications as Zip and Gzip.
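For example, using Python's built-in zlib module (the sample data is made up):

    import zlib

    # Redundant data compresses well: zlib exploits both the repeated
    # strings (LZ77) and the symbol frequencies (Huffman coding).
    data = b"entropy coding " * 1000
    compressed = zlib.compress(data, level=9)  # 9 = best compression

    print(len(data), "->", len(compressed), "bytes")
    assert zlib.decompress(compressed) == data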

Resources

  1. zlib — Compression compatible with gzip.
  2. Lossless JPEG.

7 Portable Network Graphics (PNG)

PNG [8] (pronounced “ping”) is a dictionary-based (string encoding) lossless image compression format used for representing digital images [3]. The entropy encoder of PNG is based on HC and LZSS, and it is preceded by a pixel predictor that removes the spatial redundancy.

We must bear in mind that, being an image compressor, we can only interact with PNG at the image level; that is, it only accepts images (grayscale or color, possibly with an alpha channel), and it only returns images.
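A minimal sketch of this image-level interaction, assuming that the Pillow and NumPy packages are available (the file path is arbitrary):

    import numpy as np
    from PIL import Image

    # A synthetic grayscale image with high spatial redundancy.
    img = np.tile(np.arange(256, dtype=np.uint8), (256, 1))

    # PNG only accepts an image and only returns an image; the
    # prediction and the entropy coding happen inside the format.
    Image.fromarray(img).save("/tmp/gradient.png")
    decoded = np.asarray(Image.open("/tmp/gradient.png"))

    assert (decoded == img).all()  # lossless: the round trip is exact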

8 References

[1]   Vicente González-Ruiz. Compresión Reversible y Transmisión de Imágenes. PhD thesis, University of Almería (UAL), 2000.

[2]   V. González-Ruiz. Arithmetic Coding.

[3]   V. González-Ruiz. Digital Images and Videos.

[4]   V. González-Ruiz. Entropy Coding.

[5]   V. González-Ruiz. Huffman Coding.

[6]   V. González-Ruiz. Information Theory.

[7]   V. González-Ruiz. Lempel-Ziv-Welch.

[8]   V. González-Ruiz. PNG (Portable Network Graphics).

[9]   V. González-Ruiz. Run-Length Encoding.

[10]   M. Nelson and J. Gailly. The Data Compression Book. M&T Books, 1996.

1In this case, Huffman coding can be ineffective. In the extreme case of an alphabet with only two symbols, each symbol is assigned a 1-bit code-word, and therefore the Huffman encoder neither compresses nor expands.

2In other words, that some symbols are more frequent than others.