Sistemas Multimedia - Spatial Transforms

Vicente González Ruiz - Departamento de Informática - UAL

August 6, 2024

Contents

 1 Spatial decorrelation
 2 Benefits of spatial transforms
 3 Scalar quantization and rate/distortion optimization in the transform domain
 4 Learned transforms
 5 2D-partitioning
 6 Resources
 7 To-Do
 8 References

1 Spatial decorrelation

Spatial transforms used in image and video compression exploit the statistical and perceptual1 redundancy that pixels exhibit as a consequence of the spatial (2D) correlation present in most images (and video frames). For example, some areas of an image can occur more than once in the same image, and neighboring pixels usually have similar values.

While color transforms are pixel-wise computations, spatial transforms are image-wise: the transform takes an image (of pixels) as input and outputs a matrix of coefficients, which express the resemblance between the image and a set of basis functions, usually orthogonal. For example, after applying the 2D-DCT (two-dimensional Discrete Cosine Transform) [2], the index of each coefficient identifies a different spatial frequency2 and its value measures the amount of the corresponding basis function found in the image. In the case of the 2D-DWT (two-dimensional Discrete Wavelet Transform) [3], each coefficient additionally identifies a resolution level of the image pyramid and a position inside that level.
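
As an illustration, the following minimal sketch (not part of the original materials; a random array stands in for a real grayscale image, and the wavelet and number of levels are arbitrary choices) computes both transforms with SciPy and PyWavelets:

    import numpy as np
    from scipy.fft import dctn, idctn
    import pywt

    img = np.random.rand(256, 256)          # stand-in for a real grayscale image

    # 2D-DCT: each coefficient measures how much of a 2D cosine basis function
    # (a spatial frequency) is present in the image.
    dct_coeffs = dctn(img, norm="ortho")
    assert np.allclose(idctn(dct_coeffs, norm="ortho"), img)

    # 2D-DWT: the coefficients are organized by resolution level (the pyramid)
    # and by position inside each level.
    dwt_coeffs = pywt.wavedec2(img, wavelet="db5", level=3)
    cA3, *detail_levels = dwt_coeffs        # approximation + (LH, HL, HH) per level
    print(cA3.shape, [d[0].shape for d in detail_levels])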

2 Benefits of spatial transforms

Most spatial transforms provide:

  1. Energy concentration: Usually, a small set of (low-frequency) coefficients represents most of the information (energy) of the image, as the sketch after this list illustrates. This decreases the entropy3 and widens the range of useful quantization step sizes, because the dynamic range of the coefficients is higher than the dynamic range of the pixels4.
  2. Low/High frequency analysis: The human visual system is more sensitive to the low frequencies (for this reason, the contrast sensitivity function is not flat). This means that we can quantize the high frequencies more severely without generating a perceptible distortion.
  3. Multiresolution: Depending on the transform, it is possible to reconstruct the original image by resolution levels [3]. This option can be interesting when the resolution at which the image must be reconstructed is not known a priori. For example, JPEG 2000 (which is based on the 2D-DWT) is used in digital cinema because, although projectors in different movie theaters do not all have the same resolution, the same code-stream (with the maximum resolution) can be used in all of them.
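
A minimal sketch of the energy-concentration property (assuming a synthetic smooth image, since no test image accompanies these notes): most of the energy of the 2D-DCT coefficients lies in a small low-frequency corner.

    import numpy as np
    from scipy.fft import dctn

    x = np.linspace(0, 1, 256)
    img = np.outer(x, x) + 0.01 * np.random.rand(256, 256)   # smooth gradient + mild noise

    coeffs = dctn(img, norm="ortho")
    total_energy = np.sum(coeffs ** 2)
    low_freq_energy = np.sum(coeffs[:32, :32] ** 2)           # only 1/64 of the coefficients
    print(f"energy kept by the lowest frequencies: {100 * low_freq_energy / total_energy:.2f}%")

A similar check of the multiresolution property can be made by zeroing the finest detail subbands of a 2D-DWT decomposition and reconstructing with pywt.waverec2.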

3 Scalar quantization and rate/distortion optimization in the transform domain

Scalar quantization is efficient in the transform domain because the coefficients are decorrelated. The next logical step (after quantization) is the entropy coding of the quantization indexes. Here, depending on how the coefficients are quantized, we can trace different RD curves (all of them starting (and finishing) at the same distortion). For example, if we compress each subband5 independently, we must find the quantization step sizes that select the same slope in the RD curve of each subband [4]. An RD curve is a discrete convex function where each line that connects adjacent RD points has a slope. Furthermore, each RD point generates a line (with a corresponding slope) between it and the lowest-quality RD point. In this last context, given a target distortion or rate, the quantization step size used in each subband should generate the same slope, as the sketch below illustrates.
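
The constant-slope condition can be sketched as follows (a toy model, not the actual codec: Laplacian samples stand in for the subbands, the rate is estimated with the zero-order entropy of the quantization indexes, and the target slope is arbitrary):

    import numpy as np

    def rd_point(subband, step):
        """Return a (rate, distortion) point for a given quantization step size."""
        indexes = np.round(subband / step)
        distortion = np.mean((subband - indexes * step) ** 2)            # MSE
        _, counts = np.unique(indexes, return_counts=True)
        p = counts / counts.sum()
        rate = -np.sum(p * np.log2(p))                                   # bits/coefficient
        return rate, distortion

    subbands = [np.random.laplace(scale=s, size=10_000) for s in (8.0, 4.0, 2.0)]
    steps = [2 ** i for i in range(7)]                                   # candidate step sizes
    target_slope = -2.0                                                  # desired dD/dR

    for i, sb in enumerate(subbands):
        points = [rd_point(sb, step) for step in steps]
        r0, d0 = points[-1]                                              # lowest-quality RD point
        slopes = [(d - d0) / (r - r0) if r != r0 else -np.inf for r, d in points[:-1]]
        best = min(range(len(slopes)), key=lambda k: abs(slopes[k] - target_slope))
        print(f"subband {i}: step size {steps[best]} gives slope {slopes[best]:.2f}")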

4 Learned transforms

Using machine learning techniques (for example, with an artificial neural network), it is possible to build a machine-crafted transform, specifically tuned for some type of images or, at least, capable of detecting specific features in the images.

Potentially, learning-based image (and video) compressors are adaptive algorithms that can be more efficient than those in which the transforms are pre-defined (see the sketch below).
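
As an illustration (the architecture, layer sizes, and loss are arbitrary assumptions, not the ones used in any of the course codecs), a convolutional autoencoder in Keras whose bottleneck plays the role of a learned transform domain could look like this:

    import tensorflow as tf
    from tensorflow.keras import layers, Model

    inputs = tf.keras.Input(shape=(256, 256, 1))

    # Analysis transform (encoder): maps the image to a compact latent tensor.
    x = layers.Conv2D(32, 5, strides=2, padding="same", activation="relu")(inputs)
    x = layers.Conv2D(64, 5, strides=2, padding="same", activation="relu")(x)
    latent = layers.Conv2D(8, 5, strides=2, padding="same", name="latent")(x)

    # Synthesis transform (decoder): maps the latent tensor back to an image.
    y = layers.Conv2DTranspose(64, 5, strides=2, padding="same", activation="relu")(latent)
    y = layers.Conv2DTranspose(32, 5, strides=2, padding="same", activation="relu")(y)
    outputs = layers.Conv2DTranspose(1, 5, strides=2, padding="same")(y)

    autoencoder = Model(inputs, outputs)
    autoencoder.compile(optimizer="adam", loss="mse")   # trained to minimize distortion
    autoencoder.summary()

After training, the latent tensor would be quantized and entropy coded, in the same way as the DCT or DWT coefficients.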

5 2D-partitioning

Depending on the content of the image, it can be necessary to divide the image into 2D chunks (usually called tiles or blocks), and encode each chunk independently. In general:

  1. Tiles are used when the image is made up of very different areas (for example, with text and natural content). Tiles are usually rectangular but can have any size, and are usually defined according to RD or perceptual criteria (for example, text is not well compressed by lossy configurations).
  2. Blocks are smaller than tiles and, in most cases, square. The block partition can be adaptive and, in this case, should be found using RDO (Rate Distortion Optimization). A block-based sketch follows this list.
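
The following sketch (with a fixed, non-adaptive 8x8 partition; a random array stands in for a real grayscale image) shows how a block partition is applied and the 2D-DCT is computed inside each block independently:

    import numpy as np
    from scipy.fft import dctn

    def block_dct(img, block_size=8):
        """Apply the 2D-DCT independently to each block of the image."""
        coeffs = np.empty_like(img, dtype=float)
        for i in range(0, img.shape[0], block_size):
            for j in range(0, img.shape[1], block_size):
                block = img[i:i + block_size, j:j + block_size]
                coeffs[i:i + block_size, j:j + block_size] = dctn(block, norm="ortho")
        return coeffs

    img = np.random.rand(256, 256)   # stand-in for a real grayscale image
    coeffs = block_dct(img)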

6 Resources

  1. Use of 2D-DWT in VCF.
  2. Image Compression with YCoCg + 2D-DCT.
  3. Learned data compression.
  4. Learned Image Compression (LIC) using auto-encoders.
  5. Companion Jupyter notebooks for the book “Deep Learning with Python” [1].

7 To-Do

  1. Modify VCF to use the block-based 2D-DCT in the compression pipeline. Complexity 3.
  2. Using RDO, determine the optimal color transform in the codec 2D-DCT.py. Complexity 2.
  3. Using RDO, determine the optimal color transform in the codec 2D-DWT.py. Complexity 2.
  4. Using RDO, determine the optimal number of levels of the 2D-DWT in the codec 2D-DWT.py. Complexity 2.
  5. Using RDO, determine the optimal DWT basis (defined in PyWavelets) in the codec 2D-DWT.py. Complexity 3.
  6. Create a new image codec (similar to 2D-DWT.py and 2D-DCT.py) where the latent space generated by an autoencoder is entropy encoded (in other words, replace the 2D-DCT or the 2D-DWT by an autoencoder). Complexity 15.
  7. Autoencoders can learn to analyze signals represented in any domain. Create a new DCT- or DWT-based image codec where the autoencoder is applied to the transform domain. Complexity 20.

8 References

[1]   Francois Chollet. Deep learning with Python. Simon and Schuster, 2021.

[2]   V. González-Ruiz. The DCT (Discrete Cosine Transform).

[3]   V. González-Ruiz. The DWT (Discrete Wavelet Transform).

[4]   V. González-Ruiz. Information Theory.

1We will see this with more detail later in this course.

2That depends on the position of the coefficient in the transform domain.

3When the entropy is decreased while the information is preserved, this usually means that an entropy encoding algorithm will perform better.

4Quantization is a discrete operation constrained by the number of bits used to represent the quantization indexes. When the dynamic range of a signal is high, it is possible to use more quantization levels and, therefore, a higher number of available RD points.

5In the case of a spatial transform, a subband is formed by all the coefficients that describe the same frequency components in different areas or resolutions (when available) of the image.