Spatial transforms used in image and video compression exploit the statistical correlation (and perceptual redundancy1) exhibited by the pixels as a consequence of the spatial (2D) correlation present in most images (and video frames). For example, some textures of an image can occur more than once (in the same image). Also, neighboring pixels usually have similar values.
While color transforms are pixel-wise computations, spatial transforms are image-wise (or at least, block-wise). This means that the transform inputs an image (of pixels) and outputs a matrix of coefficients, which generally express the resemblance between the image and a set of basis functions, usually orthogonal. For example, after applying the 2D-DCT (two-dimensional Discrete Cosine Transform) [2], (the index of) each coefficient represents a different spatial frequency2 and its value, the amount of the corresponding basis function found in the image. In the case of the dyadic 2D-DWT [3], the coefficients additionally convey a spatial resolution (a level in the image pyramid) and a position inside that level.
Most spatial transforms provide:
Scalar quantization is efficient in the transform domain because the coefficients are decorrelated. The next logical step (after quantization) is the entropy coding of the quantization indexes. Here, depending on how the coefficients are quantized, we can trace different RD curves (all of them starting (and finishing) at the same (rate, distortion) point). For example, if we compress each subband5 independently, we must find the quantization step sizes that select the same slope in the RD curve of each subband [4]. An RD curve is a discrete convex function in which each line that connects adjacent RD points has a slope. In this context, given a target distortion or rate, the quantization step size used in each subband should generate the same slope.
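The constant-slope criterion can be sketched numerically. In the following toy example the RD curves are hypothetical (not measured from any real codec), and the helper names `slopes` and `pick_point` are illustrative: for each subband we select the RD point whose segment slope is closest to a common target slope.

```python
# Sketch: choosing one operating point per subband so that all subbands
# work at (approximately) the same RD slope. The curves are hypothetical.

def slopes(curve):
    """Slopes of the segments joining adjacent (rate, distortion) points."""
    return [(curve[i + 1][1] - curve[i][1]) / (curve[i + 1][0] - curve[i][0])
            for i in range(len(curve) - 1)]

def pick_point(curve, target_slope):
    """RD point whose incoming segment slope is closest to the target."""
    s = slopes(curve)
    best = min(range(len(s)), key=lambda i: abs(s[i] - target_slope))
    return curve[best + 1]

# Two hypothetical convex RD curves (rate in bpp, distortion in MSE).
subband_LL = [(0.0, 100.0), (0.5, 40.0), (1.0, 15.0), (2.0, 5.0)]
subband_HH = [(0.0, 30.0), (0.2, 20.0), (0.6, 8.0), (1.2, 3.0)]

target = -50.0  # the common slope selected by the bit budget
print(pick_point(subband_LL, target), pick_point(subband_HH, target))
```

Sweeping the target slope from very steep to very flat traces the combined RD curve of the whole codec.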
Depending on the content of an image and the spatial transform, it can be necessary to divide the image into 2D chunks (usually called tiles or blocks), and encode each chunk independently. In general:
Using machine learning techniques (for example, with an artificial neural network), it is possible to build a machine-crafted transform, specifically tuned for some type of images, or at least, capable of detecting specific features in the images. This technique receives names such as analysis–synthesis dictionary learning, linear autoencoders, or learned transform coding.
Potentially, learning-based image (and video) compressors are adaptive algorithms that can be more efficient than those in which the transforms are predefined.
Linear systems satisfy the Superposition Principle, which basically says that the transform of a linear combination of signals is equal to the linear combination of the transforms of the signals. This makes such systems predictable and easier to analyze. Thus, for example, if we double the input of a linear system, we know that the output will also be doubled.
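A minimal numerical check of the Superposition Principle, using a 2-point Haar transform as the linear system (any linear transform behaves identically):

```python
import math

def haar2(x):
    """2-point Haar transform: scaled sum and difference of the samples."""
    s = 1 / math.sqrt(2)
    return [s * (x[0] + x[1]), s * (x[0] - x[1])]

x, y = [3.0, 1.0], [2.0, -4.0]
a, b = 2.0, 5.0

# T(a*x + b*y) == a*T(x) + b*T(y)
lhs = haar2([a * xi + b * yi for xi, yi in zip(x, y)])
rhs = [a * u + b * v for u, v in zip(haar2(x), haar2(y))]
assert all(abs(l - r) < 1e-12 for l, r in zip(lhs, rhs))

# Doubling the input doubles the output.
doubled = haar2([2 * xi for xi in x])
assert all(abs(d - 2 * c) < 1e-12 for d, c in zip(doubled, haar2(x)))
```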
In Transform Coding, orthogonality is interesting (among other reasons) because the resulting coefficients do not share information (in terms of the similarity between the input signal and the basis functions). This maximizes the energy concentration in the transform domain. Orthogonality implies linearity.
From a signal processing point of view, the basis functions that form the transform matrices can also be considered filters. When the filters are symmetric (linear phase), all the filters of the transform generate the same phase shift of the signal; but when the filters are asymmetric this does not happen, and ringing artifacts are generated at the edges of the objects of the encoded image.
It is mathematically proven that a wavelet cannot be finite (compactly supported), orthogonal, and symmetric all at the same time (with the exception of the very blocky Haar wavelet). Bi-orthogonal transforms, on the other hand, can be created using finite-support symmetric filters, at the cost of using filters with different gains that usually are not orthogonal. A bi-orthogonal transform can be easily recognized because the synthesis matrix is not the transpose of the analysis matrix.
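As a toy illustration (real bi-orthogonal filter banks, such as the CDF family, use longer filters), the unnormalized average/difference transform already shows the trait described above: the analysis filters have different gains, and the synthesis matrix is the inverse of the analysis matrix, not its transpose.

```python
A = [[0.5, 0.5],    # low-pass analysis filter: average (gain 1)
     [1.0, -1.0]]   # high-pass analysis filter: difference (gain 2)
S = [[1.0, 0.5],    # synthesis matrix: the inverse of A ...
     [1.0, -0.5]]   # ... which is NOT the transpose of A

def matvec(M, x):
    return [sum(M[i][j] * x[j] for j in range(2)) for i in range(2)]

x = [7.0, 3.0]
y = matvec(A, x)       # analysis: [average, difference]
x_back = matvec(S, y)  # synthesis recovers the input exactly
print(y, x_back)
```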
An orthonormal transform is an orthogonal transform that satisfies that, after applying the analysis and then the synthesis transform, we obtain exactly the same input signal. In a (merely) orthogonal transform, the output could be amplified (depending on the transform).
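This can be verified numerically with the \(N\)-point orthonormal DCT-II matrix, whose synthesis matrix is simply the transpose of the analysis matrix; a minimal sketch:

```python
import math

# Orthonormal DCT-II matrix: A[k][n] with unit-norm rows, so that the
# synthesis transform is just the transpose of the analysis transform.
N = 4
A = [[math.sqrt((1 if k == 0 else 2) / N)
      * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
      for n in range(N)] for k in range(N)]

def matvec(M, x):
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(M))]

x = [1.0, 5.0, -2.0, 0.5]
coeffs = matvec(A, x)                   # analysis
x_back = matvec(list(zip(*A)), coeffs)  # synthesis with the transpose of A
assert all(abs(u - v) < 1e-12 for u, v in zip(x, x_back))
```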
A separable transform is one in which the transform of a multidimensional signal is equal to the unidimensional transform applied along each dimension.
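For example, assuming a 2-point Haar transform as the 1D transform, the 2D transform of a \(2\times 2\) block can be computed by transforming first the rows and then the columns:

```python
import math

def haar2(x):
    """2-point Haar transform (our 1D transform in this toy example)."""
    s = 1 / math.sqrt(2)
    return [s * (x[0] + x[1]), s * (x[0] - x[1])]

block = [[4.0, 2.0],
         [1.0, 3.0]]

rows = [haar2(r) for r in block]                   # 1D transform of each row
cols = list(zip(*[haar2(c) for c in zip(*rows)]))  # then of each column
print(cols)  # the 2D Haar coefficients of the block
```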
In Gaussian signals, the samples follow a Gaussian distribution. In the real world, signals are rarely perfectly mathematical. However, the Central Limit Theorem states that when you add many independent random signals together, their sum naturally gravitates toward a Gaussian distribution.
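A quick (seeded, synthetic) numerical illustration: the sum of 12 independent uniform samples has mean 6 and variance 1, and its distribution is already close to Gaussian.

```python
import random

random.seed(0)  # deterministic toy experiment
sums = [sum(random.random() for _ in range(12)) for _ in range(10000)]

mean = sum(sums) / len(sums)
var = sum((s - mean) ** 2 for s in sums) / len(sums)
print(round(mean, 2), round(var, 2))  # close to 6 and 1
```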
The DCT is orthonormal and separable. It is usually applied to blocks of \(8\times 8\) pixels because:
VCF implements the 2D-DCT.
To avoid blockiness we can overlap the blocks. The MDCT was specifically designed to avoid this artifact. See 2D-MDCT.
In the KLT, the basis functions are the eigenvectors of the covariance matrix of the input data. In terms of energy concentration (also called transform coding gain [5]), for zero-mean Gaussian sources the KLT is optimal among all orthogonal transforms.
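The following sketch computes the KLT basis for two-sample vectors, using a closed-form eigendecomposition of the \(2\times 2\) covariance matrix; the strongly correlated input data is synthetic, and the result shows that the KLT coefficients are decorrelated.

```python
import math
import random

# Synthetic, strongly correlated, (approximately) zero-mean pairs.
random.seed(1)
data = []
for _ in range(5000):
    x = random.gauss(0, 1)
    data.append((x, x + 0.3 * random.gauss(0, 1)))

# Sample covariance matrix [[a, b], [b, c]] (zero mean assumed).
n = len(data)
a = sum(p[0] * p[0] for p in data) / n
b = sum(p[0] * p[1] for p in data) / n
c = sum(p[1] * p[1] for p in data) / n

# Closed-form eigendecomposition of the symmetric 2x2 matrix.
disc = math.sqrt((a - c) ** 2 + 4 * b * b)
lam1 = (a + c + disc) / 2
lam2 = (a + c - disc) / 2

def unit(v):
    m = math.hypot(*v)
    return (v[0] / m, v[1] / m)

v1 = unit((b, lam1 - a))  # KLT basis functions = eigenvectors
v2 = unit((b, lam2 - a))

# KLT coefficients: projections of each pair onto the eigenvectors.
coeffs = [(p[0] * v1[0] + p[1] * v1[1],
           p[0] * v2[0] + p[1] * v2[1]) for p in data]
cross = sum(u * w for u, w in coeffs) / n
print(lam1, lam2, cross)  # cross-covariance of the coefficients is ~0
```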
Like the other transforms studied here, the KLT is orthonormal but, in general, not separable, and it is applied block-by-block for two main reasons:
A 2D-blocks version of the KLT can be found in 2D-KLT. Notice that only a single analysis matrix has been used for encoding the whole image (and only a single synthesis matrix is needed for decoding).
The DWT is linear, separable, and at least bi-orthogonal. Compared to the DCT, the DWT basis functions have local support (they are not infinite like the cosine functions), and the DWT coefficients express both frequency and locality (in the original image). This has two main advantages:
VCF implements 2D-DWT.
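The locality of the DWT basis functions can be seen with a one-level Haar DWT: a step edge only perturbs the detail coefficient whose (finite) support straddles it, whereas the global cosine basis of the DCT would spread the step over all of its coefficients. A minimal sketch:

```python
import math

def haar_dwt(x):
    """One-level Haar DWT: low-pass (averages) and high-pass (details)."""
    s = 1 / math.sqrt(2)
    low = [s * (x[i] + x[i + 1]) for i in range(0, len(x), 2)]
    high = [s * (x[i] - x[i + 1]) for i in range(0, len(x), 2)]
    return low, high

signal = [1.0] * 3 + [9.0] * 5   # a step edge at sample 3
low, high = haar_dwt(signal)
# Only the coefficient whose support straddles the edge is nonzero.
print(high)
```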
Neural networks are good at learning. Specifically, autoencoders are capable of learning correspondences between input and output tensors. If we design a two-layer autoencoder whose neurons do not use any activation function and we train it to minimize the cost function \[J = D + \lambda R,\]
where \(D\) represents distortion and \(R\) bit-rate, we will build a Rate/Distortion (RD) optimized transformation. Notice that, unlike the KLT, now we are considering the quantization and the entropy coding subsystems (and their parameters, such as the quantization step size and the order of the context modeling) to minimize the \(J\) tradeoff between \(D\) and \(R\) (in the KLT we only minimize \(D\)). \(\lambda \) controls our preference:
VCF implements a block-wise 2D-LBT.
The current implementation of the LBT is not separable (the transform matrices have been computed for the complete blocks). If \(T(\mathbf {X})\) represents the transform of the block \(\mathbf {X}\), modify 2D-LBT to train \(\mathbf {A}\in \mathbb {R}^{B\times B}\), where \(B\) is the side of the square blocks, being7
Summarizing, the transform is applied first by rows and then by columns. The transform matrix is smaller.
In the current implementation
[1] Francois Chollet. Deep learning with Python. Simon and Schuster, 2021.
[2] V. González-Ruiz. The DCT (Discrete Cosine Transform).
[3] V. González-Ruiz. The DWT (Discrete Wavelet Transform).
[4] V. González-Ruiz. Information Theory.
[5] K. Sayood. Introduction to Data Compression (Slides). Morgan Kaufmann, 2017.
1We will see this with more detail later in this course.
2That depends on the position of the coefficient in the transformed domain.
3When the entropy is decreased while the information is preserved, this usually means that an entropy encoding algorithm will perform better.
4Quantization is a discrete operation constrained by the number of bits used to represent the quantization indexes. When the dynamic range of a signal is high, this makes it possible to use more quantization levels and, therefore, a higher number of available RD points.
5In the case of a spatial transform, a subband is formed by all the coefficients that describe the same frequency components in different areas or resolutions (when available) of the image.
6If no other more important requirement exists, such as multiresolution.
7In the current implementation,