Spatial Transforms

Vicente González Ruiz - Departamento de Informática - UAL

March 4, 2026

Contents

1 Spatial decorrelation
2 Benefits of spatial transforms
3 Scalar quantization and rate-distortion optimization in the transform domain
4 Block-based transforms
5 “Hand-crafted” VS learned transforms
6 Linearity
7 Orthogonality
8 Filter symmetry
9 Bi-orthogonality
10 Orthonormality
11 Separability
12 Gaussian sources and the Central Limit Theorem
13 Discrete Cosine Transform (DCT)
14 Modified DCT (MDCT)
15 Karhunen–Loève Transform (KLT)
16 Discrete Wavelet Transform (DWT)
17 RD-optimized Learned Block Transform (LBT)
18 To-Do
19 Resources
20 References

1 Spatial decorrelation

Spatial transforms used in image and video compression exploit the statistical correlation (and perceptual redundancy1) that pixels show as a consequence of the spatial (2D) correlation present in most images (and video frames). For example, some textures of an image can occur more than once (in the same image). Also, it usually happens that neighboring pixels have similar values.

While color transforms are pixel-wise computations, spatial transforms are image-wise (or at least, block-wise). This means that the transform inputs an image (of pixels) and outputs a matrix of coefficients, which generally express the resemblance between the image and a set of basis functions, usually orthogonal. For example, after using the 2D-DCT (two-dimensional Discrete Cosine Transform) [2], (the index of) each coefficient represents a different spatial frequency2, and its value, the amount of the corresponding basis found in the image. In the case of the dyadic 2D-DWT [3], the coefficients additionally “speak” about a spatial resolution in the image pyramid and a position inside the pyramid level.

2 Benefits of spatial transforms

Most spatial transforms provide:

  1. Energy concentration: Usually, a small set of (low-frequency) coefficients represents most of the information (energy) of the image. This decreases the entropy3 and increases the range of the quantization step sizes, because the dynamic range of the coefficients is higher than the dynamic range of the pixels4.
  2. Low/High frequency analysis: The human visual system is more sensitive to the low frequencies (for this reason, the contrast sensitivity function is not flat). This means that we can quantize more severely the high frequencies without generating a perceptible distortion.
  3. Multiresolution: Depending on the transform, it is possible to reconstruct the original image by resolution levels [3]. This option can be interesting when the resolution at which the image must be reconstructed is not known a priori. For example, JPEG 2000 (which is based on the 2D-DWT) is used in digital cinema because, although movie players do not have the same resolution in all movie theaters, the same code-stream (with the maximum resolution) can be used in all of them.
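The energy-concentration property can be checked numerically. The following sketch (with a hypothetical smooth test signal and a hand-built orthonormal DCT-II matrix, not part of VCF) measures the fraction of the signal energy captured by the 4 largest coefficients:

```python
import numpy as np

def dct_matrix(N):
    # orthonormal DCT-II matrix; rows are the basis functions
    k, n = np.meshgrid(np.arange(N), np.arange(N), indexing='ij')
    T = np.sqrt(2 / N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
    T[0] /= np.sqrt(2)
    return T

rng = np.random.default_rng(0)
N = 64
# a smooth (highly correlated) test signal: a ramp plus a little noise
x = np.linspace(0, 1, N) + 0.01 * rng.standard_normal(N)
y = dct_matrix(N) @ x
energy = np.sort(y ** 2)[::-1]
top4 = energy[:4].sum() / energy.sum()  # fraction of energy in the 4 largest coefficients
```

For a smooth input like this one, `top4` is close to 1; for white noise it would be close to 4/64.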

3 Scalar quantization and rate-distortion optimization in the transform domain

Scalar quantization is efficient in the transform domain because the coefficients are decorrelated. The next logical step (after quantization) is the entropy coding of the quantization indexes. Here, depending on how the coefficients are quantized, we can trace different RD curves (all of them starting (and finishing) at the same (rate, distortion) point). An RD curve is a discrete convex function where each line that connects adjacent RD points has a slope. For example, if we compress each subband5 independently, we must find the quantization step sizes that select the same slope in the RD curve of each subband [4]: given a target distortion or rate, the quantization step size used in each subband should generate the same slope.
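Under the classic high-rate model \(D_i = \sigma_i^2 2^{-2R_i}\) for subband \(i\) (an assumption used here for illustration, not part of VCF), equating the RD slopes across subbands amounts to giving every subband the same distortion. A minimal sketch with made-up subband variances:

```python
import numpy as np

var = np.array([100.0, 25.0, 4.0, 1.0])  # hypothetical subband variances
R_total = 8.0                            # total rate budget
# equal-slope allocation for D_i = var_i * 2**(-2 R_i):
# setting all slopes equal forces var_i * 2**(-2 R_i) to be the same constant
R_mean = R_total / len(var)
gm = np.exp(np.log(var).mean())          # geometric mean of the variances
R = R_mean + 0.5 * np.log2(var / gm)     # more rate to high-variance subbands
D = var * 2.0 ** (-2 * R)                # per-subband distortion (all equal)
```

Note that this closed form ignores the non-negativity constraint on the rates, which holds here only because the budget is large enough.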

4 Block-based transforms

Depending on the content of an image and the spatial transform, it can be necessary to divide the image into 2D chunks (usually called tiles or blocks), and encode each chunk independently. In general:

  1. Tiles are used when the image is made up of very different areas (for example, with text and natural picture content). Tiles are usually rectangular but can have any size, and are usually defined according to RD or perceptual issues (for example, text is not well compressed by lossy codecs).
  2. Blocks are smaller than tiles and, in most cases, square. The block partition can be adaptive and, in this case, should be found using RDO (Rate Distortion Optimization)6.

5 “Hand-crafted” VS learned transforms

Using machine learning techniques (for example, an artificial neural network), it is possible to build a machine-crafted transform, specifically tuned for some type of images, or at least, capable of detecting specific features in the images. This technique receives names such as analysis–synthesis dictionary learning, linear autoencoders, or learned transform coding.

Potentially, learning-based image (and video) compressors are adaptive algorithms that can be more efficient than those in which the transforms are pre-defined.

6 Linearity

Linear systems satisfy the Superposition Principle, which basically says that the transform of a linear combination of signals is equal to the linear combination of the transforms of the signals. This makes the systems predictable and easier to analyze. Thus, for example, if we double the input of a linear system we know that the output will also be doubled.
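Superposition is easy to verify for any matrix transform, since every matrix defines a linear operator:

```python
import numpy as np

rng = np.random.default_rng(0)
T = rng.standard_normal((8, 8))      # any matrix acts as a linear transform
x1, x2 = rng.standard_normal(8), rng.standard_normal(8)
a, b = 2.0, -3.0
lhs = T @ (a * x1 + b * x2)          # transform of the linear combination
rhs = a * (T @ x1) + b * (T @ x2)    # linear combination of the transforms
```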

7 Orthogonality

In Transform Coding, orthogonality is interesting (among other reasons) because the resulting coefficients do not share information (in terms of the similarity between the input signal and the basis functions). This maximizes the energy compaction in the transform domain. Orthogonality implies linearity.
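Orthogonality can be verified on the DCT-II basis built from its closed-form definition: the dot product of any two different basis functions is zero, so the Gram matrix \(\mathbf{T}\mathbf{T}^T\) is diagonal (here, the identity, because this version is also normalized):

```python
import numpy as np

N = 8
k, n = np.meshgrid(np.arange(N), np.arange(N), indexing='ij')
T = np.sqrt(2 / N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
T[0] /= np.sqrt(2)   # orthonormal DCT-II; rows are the basis functions
G = T @ T.T          # Gram matrix: identity when the rows are orthonormal
```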

8 Filter symmetry

From a signal processing point of view, the basis functions that form the transform matrices can also be considered as filters. When the filters are symmetric (linear in phase), all the filters of the transform generate the same phase shift of the signal; when the filters are asymmetric, this does not happen and ringing artifacts are generated at the edges of the objects in the encoded image.

9 Bi-orthogonality

It is mathematically proven that a wavelet cannot be finite, orthogonal, and symmetric all at the same time (with the exception of the very blocky Haar wavelet). Bi-orthogonal transforms, on the other hand, can be created using finite-support symmetric filters, at the cost of using filters with different gains that usually are not orthogonal. A bi-orthogonal transform can be easily recognized because the synthesis matrix is not the transpose of the analysis matrix.
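A toy sketch of the bi-orthogonality idea (the matrix is made up, not an actual wavelet filter bank): take any invertible, non-orthogonal analysis matrix; perfect reconstruction still holds with the inverse as synthesis matrix, but that synthesis matrix is no longer the transpose:

```python
import numpy as np

A = np.array([[1.0, 0.5],
              [0.0, 1.0]])   # toy non-orthogonal analysis matrix
S = np.linalg.inv(A)         # synthesis = inverse, not transpose
x = np.array([3.0, 1.0])
x_hat = S @ (A @ x)          # analysis followed by synthesis
```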

10 Orthonormality

An orthonormal transform is an orthogonal transform that satisfies that, after applying the analysis and then the synthesis transform, we obtain exactly the same input signal. In an orthogonal (but not orthonormal) transform, the output can be amplified (depending on the transform).
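The difference is easy to see with the Haar matrix, which is orthogonal in both versions but orthonormal only when scaled by \(1/\sqrt{2}\):

```python
import numpy as np

x = np.array([3.0, 1.0])
H = np.array([[1.0, 1.0],
              [1.0, -1.0]])   # orthogonal Haar, but H @ H.T = 2 I (gain 2)
Hn = H / np.sqrt(2)           # orthonormal version: Hn @ Hn.T = I
amplified = H.T @ (H @ x)     # analysis + synthesis doubles the signal
recovered = Hn.T @ (Hn @ x)   # analysis + synthesis returns the input
```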

11 Separability

A separable transform is one in which the transform of a multidimensional signal is equal to the unidimensional transform applied along each dimension.
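For example, the 2D-DCT is separable: applying the 1D-DCT to the rows and then to the columns gives the same coefficients as the direct 2D transform (a sketch assuming SciPy is available):

```python
import numpy as np
from scipy.fft import dct, dctn

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
y_2d = dctn(x, norm='ortho')                                     # direct 2D-DCT
y_sep = dct(dct(x, axis=1, norm='ortho'), axis=0, norm='ortho')  # rows, then columns
```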

12 Gaussian sources and the Central Limit Theorem

In Gaussian signals, the samples follow a Gaussian distribution. In the real world, signals are rarely perfectly Gaussian. However, the Central Limit Theorem states that when many independent random signals are added together, their sum naturally gravitates toward a Gaussian distribution.
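A quick numerical illustration of the CLT: the (standardized) sum of many independent uniform variables already matches the Gaussian CDF closely:

```python
import numpy as np
from math import erf

rng = np.random.default_rng(0)
s = rng.uniform(-1, 1, size=(100_000, 50)).sum(axis=1)  # sum of 50 uniforms
s = (s - s.mean()) / s.std()                            # standardize

def gauss_cdf(z):
    # CDF of the standard Gaussian distribution
    return 0.5 * (1 + erf(z / np.sqrt(2)))

# empirical CDF at z = 1 vs. the Gaussian prediction (about 0.84)
empirical = (s < 1.0).mean()
```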

13 Discrete Cosine Transform (DCT)

The DCT is orthonormal and separable. It is usually applied by blocks of 8x8 pixels because:

  1. In most natural images the spatial correlation is only significant in small regions (we are not going to gain compression ratio using the full-resolution DCT).
  2. It is faster: the full-resolution DCT requires \(O(N\log _2(N))\) operations, while a blocked DCT (with a fixed block size) requires only \(O(N)\).

The 2D-DCT has been implemented in VCF.
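A minimal sketch of a block-wise 2D-DCT (this is not the VCF code; the DCT-II matrix is built from its closed-form definition, and the image sides are assumed to be multiples of the block size):

```python
import numpy as np

def dct_matrix(B):
    # orthonormal DCT-II matrix of size B x B
    k, n = np.meshgrid(np.arange(B), np.arange(B), indexing='ij')
    T = np.sqrt(2 / B) * np.cos(np.pi * (2 * n + 1) * k / (2 * B))
    T[0] /= np.sqrt(2)
    return T

def blocked_dct2(img, B=8):
    # 2D-DCT applied independently to each B x B block
    T = dct_matrix(B)
    out = np.empty(img.shape)
    for i in range(0, img.shape[0], B):
        for j in range(0, img.shape[1], B):
            out[i:i+B, j:j+B] = T @ img[i:i+B, j:j+B] @ T.T
    return out
```

Being orthonormal, the transform preserves the energy of each block, and the inverse is obtained by transposing: `T.T @ Y @ T`.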

14 Modified DCT (MDCT)

To avoid blockiness we can overlap the blocks. The MDCT was specifically designed to avoid this artifact. See 2D-MDCT.

15 Karhunen–Loève Transform (KLT)

In the KLT the basis functions are the eigenvectors of the covariance matrix of the input data. In terms of energy concentration (also called transform coding gain [5]), for zero-mean Gaussian sources the KLT is optimal among all orthogonal transforms.

Like the other transforms studied here, the KLT is orthonormal, but in general it is not separable, and it is applied block-by-block for two main reasons:

  1. The basis functions (eigenvectors) are signal-dependent and, therefore, the decoder needs the inverse transform matrix, whose size is \(N^2\) for an input of \(N\) pixels (\(N\) eigenvectors of length \(N\)). Remember that the KLT is not separable.
  2. Most natural images are non-stationary, meaning their statistics change drastically from one area to another (the top half of a photo might be a smooth, highly correlated blue sky, while the bottom half might be highly detailed, uncorrelated grass). Therefore, if you apply a global KLT to the whole image, it creates a set of basis functions that average out the sky and the grass. This "average" basis isn’t optimal for either region.

A 2D-blocks version of the KLT can be found in 2D-KLT. Notice that only a single analysis matrix has been used to encode the whole image (and only a single synthesis matrix is needed for decoding).
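A sketch of a block KLT on synthetic data (an AR(1)-like 1D source stands in for image blocks; this is not the 2D-KLT code):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8                                      # samples per (1D) block
rho = 0.95                                 # correlation between neighbors
C = rho ** np.abs(np.subtract.outer(np.arange(N), np.arange(N)))
x = rng.multivariate_normal(np.zeros(N), C, size=10_000)  # training blocks
# KLT basis = eigenvectors of the (sample) covariance matrix
_, V = np.linalg.eigh(np.cov(x, rowvar=False))
K = V[:, ::-1].T                           # rows sorted by decreasing eigenvalue
y = x @ K.T                                # transformed blocks
cy = np.cov(y, rowvar=False)               # (almost exactly) diagonal
```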

16 Discrete Wavelet Transform (DWT)

The DWT is linear, separable and at least bi-orthogonal. Compared to the DCT, in the DWT the basis functions have local support (they are not infinite like the cosine functions), and the DWT coefficients express both frequency and locality (in the original image). This has two main advantages:

  1. The DWT can be applied without using blocks, avoiding the blocking artifact usually visible at very low bit-rates.
  2. In general, the energy compaction achieved by the DWT is superior to that achieved by the DCT.

VCF implements the 2D-DWT.
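A self-contained sketch of one level of the separable 2D Haar DWT (the orthonormal variant; VCF uses other filters as well):

```python
import numpy as np

def haar_dwt2(x):
    # one decomposition level: low/high-pass the columns, then the rows
    s = np.sqrt(2)
    lo = (x[:, 0::2] + x[:, 1::2]) / s        # horizontal low-pass
    hi = (x[:, 0::2] - x[:, 1::2]) / s        # horizontal high-pass
    LL, LH = (lo[0::2] + lo[1::2]) / s, (lo[0::2] - lo[1::2]) / s
    HL, HH = (hi[0::2] + hi[1::2]) / s, (hi[0::2] - hi[1::2]) / s
    return LL, LH, HL, HH

def haar_idwt2(LL, LH, HL, HH):
    # exact inverse of haar_dwt2
    s = np.sqrt(2)
    lo = np.empty((2 * LL.shape[0], LL.shape[1]))
    hi = np.empty_like(lo)
    lo[0::2], lo[1::2] = (LL + LH) / s, (LL - LH) / s
    hi[0::2], hi[1::2] = (HL + HH) / s, (HL - HH) / s
    x = np.empty((lo.shape[0], 2 * lo.shape[1]))
    x[:, 0::2], x[:, 1::2] = (lo + hi) / s, (lo - hi) / s
    return x
```

For smooth images, most of the energy ends up in the LL subband, which is itself a half-resolution version of the input (this is the multiresolution property).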

17 RD-optimized Learned Block Transform (LBT)

Neural networks are good at learning. Specifically, autoencoders are capable of learning correspondences between input and output tensors. If we design a two-layer autoencoder in which the neurons do not use any activation function and we train it to minimize the cost function

\begin{equation} J = D + \lambda R, \end{equation}

where \(D\) represents distortion and \(R\) bit-rate, we will build a Rate/Distortion (RD) optimized transform. Notice that, unlike with the KLT, now we are considering the quantization and the entropy coding subsystems (and their parameters, such as the quantization step size and the order of the context modeling) to minimize the tradeoff \(J\) between \(D\) and \(R\) (with the KLT we only minimize \(D\)). \(\lambda \) controls our preference:

  1. If \(\lambda \) is small, for a given quantization step size and entropy encoding parameters, we are interested in finding a transform matrix that is efficient for high-quality reconstruction, at the expense of increasing the bit-rate.
  2. If \(\lambda \) is big, we are telling the optimizer: “Bit-rate is expensive”. The autoencoder will update its weights to learn a transform that aggressively groups and compresses features (“transform coefficients”, in this case), even at the cost of a high distortion. Therefore, the transform will be optimized for low bit-rates.

A block-wise 2D-LBT has been implemented in VCF.
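The idea can be sketched with plain NumPy and hand-written gradients. This is NOT the VCF implementation: the training data, the initialization, and the learning rate are all made up, and the rate term is the unconstrained log-variance proxy, which a real codec would bound through quantization:

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, lam, lr = 500, 16, 0.01, 0.05    # blocks, block size, lambda, step size
# synthetic correlated "blocks" (flattened 4x4 blocks, unit variance per sample)
L = np.tril(np.ones((N, N))) / np.sqrt(np.arange(1, N + 1))[:, None]
X = rng.standard_normal((n, N)) @ L.T
A = np.eye(N) + 0.01 * rng.standard_normal((N, N))   # analysis matrix
S = np.eye(N)                                        # synthesis matrix

def cost(A, S):
    Y = X @ A.T
    D = ((X - Y @ S.T) ** 2).mean()                  # distortion
    R = np.log(Y.var(axis=0)).mean()                 # log-variance rate proxy
    return D + lam * R

J0 = cost(A, S)
for _ in range(300):                                  # plain gradient descent on J = D + lam*R
    Y = X @ A.T                                       # transform coefficients
    E = X - Y @ S.T                                   # reconstruction error
    gY = -2 / (n * N) * E @ S                         # dD/dY
    gY += lam * 2 / (N * n) * (Y - Y.mean(axis=0)) / Y.var(axis=0)  # dR/dY
    A -= lr * gY.T @ X                                # chain rule through Y = X A^T
    S -= lr * (-2 / (n * N)) * E.T @ Y                # dD/dS
```

A real LBT would also model the quantizer (for example, with additive uniform noise) so that the rate proxy cannot be driven down by simply rescaling \(A\) and \(S\) in opposite directions.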

18 To-Do

  1. The current implementation of the LBT is not separable (the transform matrices have been found for the complete blocks). If \(T(\mathbf {X})\) represents the transform of the block \(\mathbf {X}\), modify 2D-LBT to train \(\mathbf {A}\in \mathbb {R}^{B\times B}\), where \(B\) is the side of the square blocks, so that7

    \begin{equation} T(\mathbf {X}) = \mathbf {A}\mathbf {X}\mathbf {A}^T. \end{equation}

    Summarizing, the transform is applied first by rows and then by columns, and the transform matrix is smaller.

  2. In the current implementation

    \begin{equation} R = \frac {1}{N}\sum _{i=1}^{N} \log (\sigma _i^2), \end{equation}
    where \(N\) is the number of pixels per block and \(\sigma ^2_i\) is the variance of the \(i\)-th coefficient inside each block. Obviously, \(R\) is only an estimation of (the inverse of) the real bit-rate, which also depends on the QSS and the entropy encoder parameter(s). Try to improve the accuracy of this term of the cost function. Modify any other training parameter (such as the number of epochs) to achieve this goal.
  3. Remove from the code everything related to perceptual quantization. The QSS matrices are specific to the 2D-DCT.
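Regarding the first item, the equivalence between the separable form \(T(\mathbf{X}) = \mathbf{A}\mathbf{X}\mathbf{A}^T\) and the current non-separable form can be checked numerically (a random matrix stands in for the learned \(\mathbf{A}\)):

```python
import numpy as np

rng = np.random.default_rng(0)
B = 4
A = rng.standard_normal((B, B))            # stand-in for the learned B x B matrix
X = rng.standard_normal((B, B))            # one block
Y_sep = A @ X @ A.T                        # separable form: B*B parameters
Bmat = np.kron(A, A)                       # equivalent non-separable matrix
Y_full = (Bmat @ X.ravel()).reshape(B, B)  # B^2 x B^2 parameters
```

With row-major flattening, \(\mathbf{B} = \mathbf{A}\otimes\mathbf{A}\), so the separable training only has to learn \(B^2\) parameters instead of \(B^4\).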

19 Resources

  1. Image Compression with YCoCg + 2D-DCT.
  2. Learned data compression.
  3. Learned Image Compression (LIC) using auto-encoders.
  4. AutoencoderBlockCompression.ipynb.
  5. Companion Jupyter notebooks for the book “Deep Learning with Python” [1].
  6. Learned transform autoencoder.

20 References

[1]   Francois Chollet. Deep learning with Python. Simon and Schuster, 2021.

[2]   V. González-Ruiz. The DCT (Discrete Cosine Transform).

[3]   V. González-Ruiz. The DWT (Discrete Wavelet Transform).

[4]   V. González-Ruiz. Information Theory.

[5]   K. Sayood. Introduction to Data Compression (Slides). Morgan Kaufmann, 2017.

1We will see this with more detail later in this course.

2That depends on the position of the coefficient in the transformed domain.

3When the entropy is decreased while the information is preserved, this usually means that an entropy encoding algorithm will perform better.

4Quantization is a discrete operation constrained by the number of bits used to represent the quantization indexes. When the dynamic range of a signal is high, it is possible to use more quantization levels and, therefore, to reach a higher number of available RD points.

5In the case of a spatial transform, a subband is formed by all the coefficients that describe the same frequency components in different areas or resolutions (when available) of the image.

6If no other more important requirement exists, such as multiresolution.

7In the current implementation,

\begin{equation} T(\mathbf {X}) = \mathbf {B}\mathbf {X}, \end{equation}
where \(\mathbf {B}\in \mathbb {R}^{B^2\times B^2}\).