Spatial transforms used in image and video compression exploit the statistical correlation (and perceptual redundancy1) exhibited by the pixels as a consequence of the spatial (2D) correlation present in most images (and video frames). For example, some textures of an image can occur more than once (in the same image). Also, neighboring pixels usually have similar values.
While color transforms are pixel-wise computations, spatial transforms are image-wise (or at least, block-wise). This means that the transform inputs an image (of pixels) and outputs a matrix of coefficients, which generally express the resemblance between the image and a set of basis functions, usually orthogonal. For example, after applying the 2D-DCT (two-dimensional Discrete Cosine Transform) [2], (the index of) each coefficient represents a different spatial frequency2 and its value, the amount of the corresponding basis function found in the image. In the case of the dyadic 2D-DWT [3], the coefficients additionally convey a spatial resolution (a level in the image pyramid) and a position inside that level.
Most spatial transforms provide:
Scalar quantization is efficient in the transform domain because the coefficients are decorrelated. The next logical step (after quantization) is the entropy coding of the quantization indexes. Here, depending on how the coefficients are quantized, we can trace different RD curves (all of them starting (and finishing) at the same (rate, distortion) point). For example, if we compress each subband5 independently, we must find the quantization step sizes that select the same slope in the RD curve of each subband [4]. An RD curve is a discrete convex function in which each line that connects adjacent RD points has a slope. In this context, given a target distortion or rate, the quantization step size used in each subband should generate the same slope.
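The constant-slope criterion can be sketched numerically. In the following toy example the RD curves are hypothetical (not measured from any real codec), and the helper names `slopes` and `pick_point` are illustrative: for each subband we select the RD point whose segment slope is closest to a common target slope.

```python
# Sketch: choosing one operating point per subband so that all subbands
# work at (approximately) the same RD slope. The curves are hypothetical.

def slopes(curve):
    """Slopes of the segments joining adjacent (rate, distortion) points."""
    return [(curve[i + 1][1] - curve[i][1]) / (curve[i + 1][0] - curve[i][0])
            for i in range(len(curve) - 1)]

def pick_point(curve, target_slope):
    """RD point whose incoming segment slope is closest to the target."""
    s = slopes(curve)
    best = min(range(len(s)), key=lambda i: abs(s[i] - target_slope))
    return curve[best + 1]

# Two hypothetical convex RD curves (rate in bpp, distortion in MSE).
subband_LL = [(0.0, 100.0), (0.5, 40.0), (1.0, 15.0), (2.0, 5.0)]
subband_HH = [(0.0, 30.0), (0.2, 20.0), (0.6, 8.0), (1.2, 3.0)]

target = -50.0  # the common slope selected by the bit budget
print(pick_point(subband_LL, target), pick_point(subband_HH, target))
```

Sweeping the target slope from very steep to very flat traces the combined RD curve of the whole codec.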
Depending on the content of an image and the spatial transform, it can be necessary to divide the image into 2D chunks (usually called tiles or blocks), and encode each chunk independently. In general:
Using machine learning techniques (for example, with an artificial neural network), it is possible to build a machine-crafted transform, specifically tuned for some type of images, or at least, capable of detecting specific features in the images. This technique receives names such as analysis–synthesis dictionary learning, linear autoencoders, or learned transform coding.
Potentially, learning-based image (and video) compressors are adaptive algorithms that can be more efficient than those in which the transforms are predefined.
Linear systems satisfy the Superposition Principle, which basically says that the transform of a linear combination of signals is equal to the linear combination of the transforms of the signals. This makes such systems predictable and easier to analyze. Thus, for example, if we double the input of a linear system, we know that the output will also be doubled.
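A minimal numerical check of the Superposition Principle, using a 2-point Haar transform as the linear system (any linear transform behaves identically):

```python
import math

def haar2(x):
    """2-point Haar transform: scaled sum and difference of the samples."""
    s = 1 / math.sqrt(2)
    return [s * (x[0] + x[1]), s * (x[0] - x[1])]

x, y = [3.0, 1.0], [2.0, -4.0]
a, b = 2.0, 5.0

# T(a*x + b*y) == a*T(x) + b*T(y)
lhs = haar2([a * xi + b * yi for xi, yi in zip(x, y)])
rhs = [a * u + b * v for u, v in zip(haar2(x), haar2(y))]
assert all(abs(l - r) < 1e-12 for l, r in zip(lhs, rhs))

# Doubling the input doubles the output.
doubled = haar2([2 * xi for xi in x])
assert all(abs(d - 2 * c) < 1e-12 for d, c in zip(doubled, haar2(x)))
```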
In Transform Coding, orthogonality is interesting (among other reasons) because the resulting coefficients do not share information (in terms of the similarity between the input signal and the basis functions). This maximizes the energy concentration in the transform domain. Orthogonality implies linearity.
From a signal processing point of view, the basis functions that form the transform matrices can also be considered filters. When the filters are symmetric (linear phase), all the filters of the transform generate the same phase shift of the signal; but when the filters are asymmetric this does not happen, and ringing artifacts are generated at the edges of the objects of the encoded image.
It is mathematically proven that a wavelet cannot be finite (compactly supported), orthogonal, and symmetric all at the same time (with the exception of the very blocky Haar wavelet). Bi-orthogonal transforms, on the other hand, can be created using finite-support symmetric filters, at the cost of using filters with different gains that usually are not orthogonal. A bi-orthogonal transform can be easily recognized because the synthesis matrix is not the transpose of the analysis matrix.
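As a toy illustration (real bi-orthogonal filter banks, such as the CDF family, use longer filters), the unnormalized average/difference transform already shows the trait described above: the analysis filters have different gains, and the synthesis matrix is the inverse of the analysis matrix, not its transpose.

```python
A = [[0.5, 0.5],    # low-pass analysis filter: average (gain 1)
     [1.0, -1.0]]   # high-pass analysis filter: difference (gain 2)
S = [[1.0, 0.5],    # synthesis matrix: the inverse of A ...
     [1.0, -0.5]]   # ... which is NOT the transpose of A

def matvec(M, x):
    return [sum(M[i][j] * x[j] for j in range(2)) for i in range(2)]

x = [7.0, 3.0]
y = matvec(A, x)       # analysis: [average, difference]
x_back = matvec(S, y)  # synthesis recovers the input exactly
print(y, x_back)
```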
An orthonormal transform is an orthogonal transform that satisfies that, after applying the analysis and then the synthesis transform, we obtain exactly the same input signal. In a (merely) orthogonal transform, the output could be amplified (depending on the transform).
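This can be verified numerically with the \(N\)-point orthonormal DCT-II matrix, whose synthesis matrix is simply the transpose of the analysis matrix; a minimal sketch:

```python
import math

# Orthonormal DCT-II matrix: A[k][n] with unit-norm rows, so that the
# synthesis transform is just the transpose of the analysis transform.
N = 4
A = [[math.sqrt((1 if k == 0 else 2) / N)
      * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
      for n in range(N)] for k in range(N)]

def matvec(M, x):
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(M))]

x = [1.0, 5.0, -2.0, 0.5]
coeffs = matvec(A, x)                   # analysis
x_back = matvec(list(zip(*A)), coeffs)  # synthesis with the transpose of A
assert all(abs(u - v) < 1e-12 for u, v in zip(x, x_back))
```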
A separable transform is one in which the transform of a multidimensional signal is equal to the unidimensional transform applied along each dimension.
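For example, assuming a 2-point Haar transform as the 1D transform, the 2D transform of a \(2\times 2\) block can be computed by transforming first the rows and then the columns:

```python
import math

def haar2(x):
    """2-point Haar transform (our 1D transform in this toy example)."""
    s = 1 / math.sqrt(2)
    return [s * (x[0] + x[1]), s * (x[0] - x[1])]

block = [[4.0, 2.0],
         [1.0, 3.0]]

rows = [haar2(r) for r in block]                   # 1D transform of each row
cols = list(zip(*[haar2(c) for c in zip(*rows)]))  # then of each column
print(cols)  # the 2D Haar coefficients of the block
```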
In Gaussian signals, the samples follow a Gaussian distribution. In the real world, signals are rarely perfectly mathematical. However, the Central Limit Theorem states that when you add many independent random signals together, their sum naturally gravitates toward a Gaussian distribution.
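A quick (seeded, synthetic) numerical illustration: the sum of 12 independent uniform samples has mean 6 and variance 1, and its distribution is already close to Gaussian.

```python
import random

random.seed(0)  # deterministic toy experiment
sums = [sum(random.random() for _ in range(12)) for _ in range(10000)]

mean = sum(sums) / len(sums)
var = sum((s - mean) ** 2 for s in sums) / len(sums)
print(round(mean, 2), round(var, 2))  # close to 6 and 1
```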
The DCT is orthonormal and separable. It is usually applied to blocks of \(8\times 8\) pixels because:
VCF implements the 2D-DCT.
To avoid blockiness we can overlap the blocks. The MDCT was specifically designed to avoid this artifact. See 2D-MDCT.
In the KLT, the basis functions are the eigenvectors of the covariance matrix of the input data. In terms of energy concentration (also called transform coding gain [5]), for zero-mean Gaussian sources the KLT is optimal among all orthogonal transforms.
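The following sketch computes the KLT basis for two-sample vectors, using a closed-form eigendecomposition of the \(2\times 2\) covariance matrix; the strongly correlated input data is synthetic, and the result shows that the KLT coefficients are decorrelated.

```python
import math
import random

# Synthetic, strongly correlated, (approximately) zero-mean pairs.
random.seed(1)
data = []
for _ in range(5000):
    x = random.gauss(0, 1)
    data.append((x, x + 0.3 * random.gauss(0, 1)))

# Sample covariance matrix [[a, b], [b, c]] (zero mean assumed).
n = len(data)
a = sum(p[0] * p[0] for p in data) / n
b = sum(p[0] * p[1] for p in data) / n
c = sum(p[1] * p[1] for p in data) / n

# Closed-form eigendecomposition of the symmetric 2x2 matrix.
disc = math.sqrt((a - c) ** 2 + 4 * b * b)
lam1 = (a + c + disc) / 2
lam2 = (a + c - disc) / 2

def unit(v):
    m = math.hypot(*v)
    return (v[0] / m, v[1] / m)

v1 = unit((b, lam1 - a))  # KLT basis functions = eigenvectors
v2 = unit((b, lam2 - a))

# KLT coefficients: projections of each pair onto the eigenvectors.
coeffs = [(p[0] * v1[0] + p[1] * v1[1],
           p[0] * v2[0] + p[1] * v2[1]) for p in data]
cross = sum(u * w for u, w in coeffs) / n
print(lam1, lam2, cross)  # cross-covariance of the coefficients is ~0
```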
Like the other transforms studied here, the KLT is orthonormal but, in general, not separable, and it is applied block-by-block for two main reasons:
A 2D-blocks version of the KLT can be found in 2D-KLT. Notice that only a single analysis matrix has been used for encoding the whole image (and only a single synthesis matrix is needed for decoding).
The DWT is linear, separable, and at least bi-orthogonal. Compared to the DCT, the DWT basis functions have local support (they are not infinite like the cosine functions), and the DWT coefficients express both frequency and locality (in the original image). This has two main advantages:
VCF implements 2D-DWT.
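The locality of the DWT basis functions can be seen with a one-level Haar DWT: a step edge only perturbs the detail coefficient whose (finite) support straddles it, whereas the global cosine basis of the DCT would spread the step over all of its coefficients. A minimal sketch:

```python
import math

def haar_dwt(x):
    """One-level Haar DWT: low-pass (averages) and high-pass (details)."""
    s = 1 / math.sqrt(2)
    low = [s * (x[i] + x[i + 1]) for i in range(0, len(x), 2)]
    high = [s * (x[i] - x[i + 1]) for i in range(0, len(x), 2)]
    return low, high

signal = [1.0] * 3 + [9.0] * 5   # a step edge at sample 3
low, high = haar_dwt(signal)
# Only the coefficient whose support straddles the edge is nonzero.
print(high)
```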
Neural networks are good at learning. Specifically, autoencoders are capable of learning correspondences between input and output tensors. If we design a two-layer autoencoder whose neurons do not use any activation function and we train it to minimize the cost function \[J = D + \lambda R,\]
where \(D\) represents distortion and \(R\) bit-rate, we will build a Rate/Distortion (RD) optimized transformation. Notice that, unlike the KLT, now we are considering the quantization and the entropy coding subsystems (and their parameters, such as the quantization step size and the order of the context modeling) to minimize the \(J\) tradeoff between \(D\) and \(R\) (in the KLT we only minimize \(D\)). \(\lambda \) controls our preference:
VCF implements a block-wise 2D-LBT.
The current implementation of the LBT is not separable (the transform matrices have been computed for the complete blocks). If \(T(\mathbf {X})\) represents the transform of the block \(\mathbf {X}\), modify 2D-LBT to train \(\mathbf {A}\in \mathbb {R}^{B\times B}\), where \(B\) is the side of the square blocks, being7
Summarizing, the transform is applied first by rows and then by columns. The transform matrix is smaller.
In the current implementation
[1] Francois Chollet. Deep learning with Python. Simon and Schuster, 2021.
[2] V. González-Ruiz. The DCT (Discrete Cosine Transform).
[3] V. González-Ruiz. The DWT (Discrete Wavelet Transform).
[4] V. González-Ruiz. Information Theory.
[5] K. Sayood. Introduction to Data Compression (Slides). Morgan Kaufmann, 2017.
1We will see this with more detail later in this course.
2That depends on the position of the coefficient in the transformed domain.
3When the entropy is decreased while the information is preserved, this usually means that an entropy encoding algorithm will perform better.
4Quantization is a discrete operation constrained by the number of bits used to represent the quantization indexes. When the dynamic range of a signal is high, this makes it possible to use more quantization levels and, therefore, a higher number of available RD points.
5In the case of a spatial transform, a subband is formed by all the coefficients that describe the same frequency components in different areas or resolutions (when available) of the image.
6If no other more important requirement exists, such as multiresolution.
7In the current implementation,