Spatial Transforms

Vicente González Ruiz - Departamento de Informática - UAL

December 11, 2025

Contents

1 Spatial decorrelation
2 Benefits of spatial transforms
3 Scalar quantization and rate-distortion optimization in the transform domain
4 Learned transforms
5 2D-partitioning
6 Resources
7 References

1 Spatial decorrelation

Spatial transforms used in image and video compression exploit the statistical correlation (and the perceptual redundancy1) that pixels exhibit as a consequence of the spatial (2D) correlation present in most images (and video frames). For example, some textures of an image can occur more than once (in the same image). Also, it usually happens that neighboring pixels have similar values.
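
As a quick illustration, the following NumPy sketch estimates the correlation between horizontally neighboring pixels; the synthetic ramp image is an assumption (any natural image should give a similarly high coefficient):

```python
# A minimal sketch (NumPy only) of measuring the correlation between
# horizontally neighboring pixels. The synthetic image is an assumption.
import numpy as np

rng = np.random.default_rng(0)

# Build a smooth synthetic "image": a 2D ramp plus mild noise.
x = np.linspace(0, 1, 256)
img = np.outer(x, x) * 255 + rng.normal(0, 2, (256, 256))

# Pearson correlation between each pixel and its right-hand neighbor.
left = img[:, :-1].ravel()
right = img[:, 1:].ravel()
rho = np.corrcoef(left, right)[0, 1]
print(f"correlation between horizontal neighbors: {rho:.4f}")  # close to 1
```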

While color transforms are pixel-wise computations, spatial transforms are image-wise (or at least, block-wise). This means that the transform inputs an image (of pixels) and outputs a matrix of coefficients, which generally express the resemblance between the image and a set of basis functions, usually orthogonal. For example, after applying the 2D-DCT (two-dimensional Discrete Cosine Transform) [2], (the index of) each coefficient represents a different spatial frequency2, and its value, the amount of the corresponding basis found in the image. In the case of the dyadic 2D-DWT [3], the coefficients additionally “speak” about a spatial resolution in the image pyramid and a position inside the corresponding pyramid level.
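
The following sketch shows this with SciPy's 2D-DCT on a synthetic 8x8 block (the block itself is an assumption); note how the energy concentrates in the low-frequency (top-left) coefficients and how the inverse transform recovers the block exactly:

```python
# A sketch of the forward and inverse 2D-DCT using SciPy. With a smooth
# input block, the energy piles up in the top-left (low-frequency) corner.
import numpy as np
from scipy.fft import dctn, idctn

x = np.linspace(0, 1, 8)
block = np.outer(x, x) * 255                # a smooth 8x8 "image" block

coefs = dctn(block, norm="ortho")           # forward 2D-DCT
reconstructed = idctn(coefs, norm="ortho")  # inverse 2D-DCT

print(np.round(coefs, 1))                   # low-frequency terms dominate
print(np.allclose(block, reconstructed))    # True: the transform is lossless
```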

2 Benefits of spatial transforms

Most spatial transforms provide:

  1. Energy concentration: Usually, a small set of (low-frequency) coefficients represents most of the information (energy) of the image. This decreases the entropy3 and increases the range of the quantization step sizes, because the dynamic range of the coefficients is higher than the dynamic range of the pixels4.
  2. Low/high frequency analysis: The human visual system is more sensitive to the low frequencies (for this reason, the contrast sensitivity function is not flat). This means that we can quantize the high frequencies more severely without generating a perceptible distortion.
  3. Multiresolution: Depending on the transform, it is possible to reconstruct the original image by resolution levels [3] (see the sketch after this list). This option can be interesting when the resolution at which the image must be reconstructed is not known a priori. For example, JPEG 2000 (which is based on the 2D-DWT) is used in digital cinema because, although movie players do not have the same resolution in all movie theaters, the same code-stream (with the maximum resolution) can be used in all of them.
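
A minimal sketch of the energy concentration and multiresolution ideas, assuming the PyWavelets package is available (the stand-in image is also an assumption):

```python
# A sketch of multiresolution analysis with the dyadic 2D-DWT. The
# approximation subband of each level is a lower-resolution version of the
# image, which is what enables resolution scalability (as in JPEG 2000).
import numpy as np
import pywt

x = np.linspace(0, 1, 64)
img = np.outer(x, x) * 255                  # a smooth stand-in image

# Two-level dyadic 2D-DWT.
coeffs = pywt.wavedec2(img, wavelet="haar", level=2)
cA2, (cH2, cV2, cD2), (cH1, cV1, cD1) = coeffs

print(cA2.shape)   # (16, 16): a quarter-resolution approximation
print(cH1.shape)   # (32, 32): half-resolution detail subbands

# Energy concentration: fraction of the total energy in the approximation.
subbands = [cA2, cH2, cV2, cD2, cH1, cV1, cD1]
total = sum(float(np.sum(c ** 2)) for c in subbands)
print(float(np.sum(cA2 ** 2)) / total)      # close to 1 for smooth images
```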

3 Scalar quantization and rate-distortion optimization in the transform domain

Scalar quantization is efficient in the transform domain because the coefficients are decorrelated. The next logical step (after quantization) is the entropy coding of the quantization indexes. Here, depending on how the coefficients are quantized, we can trace different RD curves (all of them starting (and finishing) at the same (rate, distortion) point). For example, if we compress each subband5 independently, we must find the quantization step sizes that select the same slope in the RD curve of each subband [4]. An RD curve is a discrete convex function in which each line that connects adjacent RD points has a slope. In this context, given a target distortion or rate, the quantization step size used in each subband should generate the same slope. A sketch of this idea follows.
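
The following sketch (an assumption, not a full codec) traces the RD points of one synthetic subband under uniform scalar quantization, estimating the rate by the zero-order entropy of the indexes and the distortion by the MSE:

```python
# A minimal sketch of tracing an RD curve for one subband: uniform scalar
# quantization with several step sizes. Equal slopes across subbands is
# the optimality condition mentioned above.
import numpy as np

def entropy(indexes):
    """Zero-order entropy of the quantization indexes, in bits/coefficient."""
    _, counts = np.unique(indexes, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(0)
subband = rng.laplace(0, 10, 10_000)   # coefficients are roughly Laplacian

points = []
for step in (1, 2, 4, 8, 16, 32):
    k = np.round(subband / step)       # quantization indexes
    y = k * step                       # dequantized coefficients
    points.append((entropy(k), np.mean((subband - y) ** 2)))

# Slope of each segment that connects adjacent RD points.
for (r0, d0), (r1, d1) in zip(points, points[1:]):
    print(f"slope = {(d1 - d0) / (r1 - r0):.2f}")
```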

4 Learned transforms

Using machine learning techniques (for example, with an artificial neural network), it is possible to build a machine-crafted transform, specifically tuned for some type of images, or at least, capable of identifying specific features in the images. This technique receives names such as analysis–synthesis dictionary learning, linear autoencoders, or learned transform coding.

Potentially, learning-based image (and video) compressors are adaptive algorithms that can be more efficient than those in which the transforms are pre-defined.

5 2D-partitioning

Depending on the content of an image, it can be necessary to divide the image into 2D chunks (usually called tiles or blocks), and encode each chunk independently. In general:

  1. Tiles are used when the image is made up of very different areas (for example, with text and natural picture content). Tiles are usually rectangular, can have any size, and are usually defined according to RD or perceptual criteria (for example, text is not well compressed by lossy codecs).
  2. Blocks are smaller than tiles and, in most cases, square. The block partition can be adaptive and, in this case, should be found using RDO (Rate Distortion Optimization)6; see the sketch after this list.
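
A minimal NumPy sketch of fixed-size block partitioning (assuming the image dimensions are multiples of the block size; an adaptive, RDO-driven partition would choose the size per region instead):

```python
# Split an image into non-overlapping square blocks and merge them back.
import numpy as np

def split_into_blocks(img, size):
    """Return an array of shape (num_blocks, size, size)."""
    h, w = img.shape
    return (img.reshape(h // size, size, w // size, size)
               .swapaxes(1, 2)
               .reshape(-1, size, size))

def merge_blocks(blocks, h, w):
    """Inverse of split_into_blocks for an h x w image."""
    size = blocks.shape[1]
    return (blocks.reshape(h // size, w // size, size, size)
                  .swapaxes(1, 2)
                  .reshape(h, w))

img = np.arange(64 * 64, dtype=np.float64).reshape(64, 64)
blocks = split_into_blocks(img, 8)                        # 64 blocks of 8x8
print(blocks.shape)                                       # (64, 8, 8)
print(np.array_equal(merge_blocks(blocks, 64, 64), img))  # True
```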

To-Do

  1. Learned Block Transform (LBT): As other transforms do, a neural network can learn to perform energy concentration. Implement a three-layer autoencoder trained (using supervised learning) to maximize, for example7, the transform coding gain [5], for each different image to compress. The size (and shape) of the input layer and the output layer should be the same, but configurable (typically 4x4, 8x8, or 16x16). Notice that the output of the neurons of the center layer should define the transform coefficients, that the (learned) weights of the connections between the input layer and the center layer should form the forward transform matrix, and that the weights of the connections between the center layer and the output layer should define the inverse transform matrix. A minimal sketch of this architecture follows.
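
A minimal Keras sketch of this architecture (the framework and the stand-in training blocks are assumptions; the loss here is a plain reconstruction MSE, which the exercise would replace by a coding-gain objective):

```python
# A linear autoencoder whose encoder weights act as the forward transform
# matrix and whose decoder weights act as the inverse transform matrix.
import numpy as np
from tensorflow import keras

N = 8                                   # block side (4, 8 or 16)
rng = np.random.default_rng(0)

# Stand-in training data: flattened smooth blocks (an assumption; in
# practice, the blocks would come from the images to compress).
x = np.linspace(0, 1, N)
base = np.outer(x, x).ravel()
blocks = base + rng.normal(0, 0.05, (10_000, N * N))

inputs = keras.Input(shape=(N * N,))
code = keras.layers.Dense(N * N, use_bias=False, name="forward")(inputs)
outputs = keras.layers.Dense(N * N, use_bias=False, name="inverse")(code)
autoencoder = keras.Model(inputs, outputs)
encoder = keras.Model(inputs, code)     # maps a block to its coefficients

autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(blocks, blocks, epochs=5, batch_size=256, verbose=0)

# The learned forward/inverse transform matrices, both of shape (N*N, N*N).
A = autoencoder.get_layer("forward").get_weights()[0]
S = autoencoder.get_layer("inverse").get_weights()[0]

recon = autoencoder.predict(blocks[:8], verbose=0)
print(float(np.mean((recon - blocks[:8]) ** 2)))  # decreases with training
```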

6 Resources

  1. Usage of 2D-DWT in VCF.
  2. Usage of 2D-DCT in VCF.
  3. Image Compression with YCoCg + 2D-DCT.
  4. Learned data compression.
  5. Learned Image Compression (LIC) using auto-encoders.
  6. AutoencoderBlockCompression.ipynb.
  7. Companion Jupyter notebooks for the book “Deep Learning with Python” [1].

7 References

[1]   Francois Chollet. Deep learning with Python. Simon and Schuster, 2021.

[2]   V. González-Ruiz. The DCT (Discrete Cosine Transform).

[3]   V. González-Ruiz. The DWT (Discrete Wavelet Transform).

[4]   V. González-Ruiz. Information Theory.

[5]   K. Sayood. Introduction to Data Compression (Slides). Morgan Kaufmann, 2017.

1We will see this in more detail later in this course.

2Which depends on the position of the coefficient in the transform domain.

3When the entropy is decreased while the information is preserved, this usually means that an entropy encoding algorithm will perform better.

4Quantization is a discrete operation constrained by the number of bits used to represent the quantization indexes. When the dynamic range of a signal is high, this makes it possible to use more quantization levels and, therefore, to have a higher number of available RD points.

5In the case of a spatial transform, a subband is formed by all the coefficients that describe the same frequency components in different areas or resolutions (when available) of the image.

6If no other more important requirement exists, such as multiresolution.

7The function optimized during the training of the network can be any function that increases the sparsity of the transform coefficients.