Spatial transforms used in image and video compression exploit the statistical correlation (and the perceptual redundancy1) that pixels exhibit as a consequence of the spatial (2D) correlation present in most images (and video frames). For example, some areas of an image can occur more than once (in the same image), and neighboring pixels usually have similar values.
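The similarity between neighboring pixels can be measured directly. The following sketch (a hypothetical illustration, using a synthetic smooth image rather than a real photograph) computes the correlation between each pixel and its right-hand neighbor:

```python
import numpy as np

# Toy "image": a smooth ramp plus a little noise, standing in for the
# smooth regions that dominate most natural images.
rng = np.random.default_rng(0)
img = np.add.outer(np.arange(64.0), np.arange(64.0)) \
    + rng.normal(scale=1.0, size=(64, 64))

# Correlation between each pixel and its horizontal neighbor.
corr = np.corrcoef(img[:, :-1].ravel(), img[:, 1:].ravel())[0, 1]
print(round(corr, 3))  # close to 1: neighboring pixels are highly similar
```

For smooth content the correlation approaches 1, which is precisely the redundancy that a spatial transform removes.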
While color transforms are pixel-wise computations, spatial transforms are image-wise: the transform inputs an image (of pixels) and outputs a matrix of coefficients, which generally express the resemblance between the image and a set of (usually orthogonal) basis functions. For example, after applying the 2D-DCT (two-dimensional Discrete Cosine Transform) [2], the index of each coefficient identifies a different spatial frequency2, and its value measures the amount of the corresponding basis function found in the image. In the case of the 2D-DWT (two-dimensional Discrete Wavelet Transform) [3], each coefficient additionally indicates a spatial resolution (a level of the image pyramid) and a position inside that level.
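The behavior of the 2D-DCT on a smooth block can be sketched with a few lines of NumPy (a minimal illustration built here from the DCT-II definition, not a production implementation). The separable 2D transform is computed as C · B · Cᵀ, and the energy concentrates in the low-frequency (DC) coefficient:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix (rows are the 1D basis functions)."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    c = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    c[0, :] = np.sqrt(1.0 / n)  # DC row has its own normalization
    return c

C = dct_matrix(8)
block = np.add.outer(np.arange(8.0), np.arange(8.0))  # smooth toy 8x8 block
coefs = C @ block @ C.T                               # separable 2D-DCT
energy = coefs ** 2
print(energy[0, 0] / energy.sum())          # DC coefficient concentrates the energy
print(np.allclose(C.T @ coefs @ C, block))  # orthogonality: exact inverse
```

Because the basis is orthonormal, the inverse transform recovers the block exactly; compression comes later, from quantizing the (mostly tiny) high-frequency coefficients.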
Most spatial transforms provide:
1. A reduction of the entropy of the signal, while preserving its information3.
2. An increase of the dynamic range of the (transformed) signal4.
Scalar quantization is efficient in the transform domain because the coefficients are decorrelated. The next logical step (after quantization) is the entropy coding of the quantization indexes. Here, depending on how the coefficients are quantized, we can trace different RD curves (all of them starting (and finishing) at the same distortion). For example, if we compress each subband5 independently, we must find the quantization step sizes that select the same slope in the RD curve of each subband [4]. An RD curve is a discrete convex function in which each line that connects adjacent RD points has a slope. Furthermore, each RD point generates a line (with a corresponding slope) between it and the lowest-quality RD point. In this last context, given a target distortion or rate, the quantization step size used in each subband should generate the same slope.
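The RD points and their slopes can be traced empirically. The sketch below (a toy experiment under assumed conditions: Laplacian-distributed coefficients as a stand-in for a real subband, a uniform scalar quantizer, and the empirical entropy of the indexes as the rate) shows the convexity of the resulting RD curve:

```python
import numpy as np

def entropy(indexes):
    """Empirical first-order entropy (bits/symbol) of the quantization indexes."""
    _, counts = np.unique(indexes, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
subband = rng.laplace(scale=10.0, size=10_000)  # toy subband coefficients

points = []
for step in (1.0, 2.0, 4.0, 8.0, 16.0):
    idx = np.round(subband / step)              # uniform scalar quantizer
    rec = idx * step                            # dequantization
    rate = entropy(idx)                         # bits/coefficient
    dist = float(np.mean((subband - rec) ** 2)) # MSE distortion
    points.append((rate, dist))

# Slopes of the RD curve between adjacent points, sorted by increasing rate
points.sort()
slopes = [(d2 - d1) / (r2 - r1)
          for (r1, d1), (r2, d2) in zip(points, points[1:])]
print(all(s < 0 for s in slopes))  # distortion falls as rate grows
```

The slopes are negative and flatten as the rate grows (the curve is convex): each extra bit buys less distortion reduction, which is why matching slopes across subbands is the optimal allocation criterion.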
Using machine learning techniques (for example, an artificial neural network), it is possible to build a machine-crafted transform, specifically tuned for some type of images or, at least, capable of detecting specific features in the images.
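A minimal example of a transform learned from data (not a neural network, but the classical linear case that neural transforms generalize) is the Karhunen-Loève Transform, whose basis is estimated from the covariance of a training set:

```python
import numpy as np

# "Learn" a linear transform (the KLT) from a toy training set of
# length-8 signals with strong neighbor correlation (random walks).
rng = np.random.default_rng(0)
walks = np.cumsum(rng.normal(size=(1000, 8)), axis=1)

cov = np.cov(walks, rowvar=False)          # estimated from the data
eigvals, eigvecs = np.linalg.eigh(cov)     # eigendecomposition of the covariance
klt = eigvecs[:, ::-1].T                   # rows = basis, by decreasing variance

coefs = walks @ klt.T                      # analysis transform
var = coefs.var(axis=0)
print(var[0] / var.sum())  # the first coefficient captures most of the variance
```

A learned neural transform replaces this single data-dependent matrix with trainable nonlinear layers, but the goal is the same: concentrate the signal's energy into few coefficients for the images it was trained on.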
Potentially, learning-based image (and video) compressors are adaptive algorithms that can be more efficient than those in which the transforms are predefined.
Depending on the content of the image, it can be necessary to divide the image into 2D chunks (usually called tiles or blocks) and encode each chunk independently. In general:
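The tiling step can be sketched as follows (hypothetical helper functions, assuming non-overlapping square tiles and edge padding at the borders):

```python
import numpy as np

def split_into_tiles(img, size):
    """Split `img` into non-overlapping size x size tiles, padding the borders."""
    h, w = img.shape
    ph, pw = -h % size, -w % size            # padding up to a multiple of `size`
    padded = np.pad(img, ((0, ph), (0, pw)), mode='edge')
    H, W = padded.shape
    tiles = (padded.reshape(H // size, size, W // size, size)
                   .swapaxes(1, 2)
                   .reshape(-1, size, size))
    return tiles, (h, w)

def join_tiles(tiles, size, shape):
    """Inverse of split_into_tiles: reassemble and crop to the original shape."""
    h, w = shape
    H, W = -(-h // size) * size, -(-w // size) * size
    img = (tiles.reshape(H // size, W // size, size, size)
                .swapaxes(1, 2)
                .reshape(H, W))
    return img[:h, :w]

img = np.arange(100.0).reshape(10, 10)
tiles, shape = split_into_tiles(img, 8)
print(tiles.shape)  # (4, 8, 8): a 10x10 image padded to 16x16, in 8x8 tiles
print(np.allclose(join_tiles(tiles, 8, shape), img))
```

Each tile can then be transformed, quantized, and entropy-coded on its own, at the cost of ignoring the correlation across tile boundaries.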
[1] Francois Chollet. Deep learning with Python. Simon and Schuster, 2021.
[2] V. González-Ruiz. The DCT (Discrete Cosine Transform).
[3] V. González-Ruiz. The DWT (Discrete Wavelet Transform).
[4] V. González-Ruiz. Information Theory.
1We will see this with more detail later in this course.
2That depends on the position of the coefficient in the transformed domain.
3When the entropy is decreased while the information is preserved, this usually means that an entropy encoding algorithm will perform better.
4Quantization is a discrete operation constrained by the number of bits used to represent the quantization indexes. When the dynamic range of a signal is high, it is possible to use more quantization levels and, therefore, a higher number of RD points becomes available.
5In the case of a spatial transform, a subband is formed by all the coefficients that describe the same frequency components in different areas or resolutions (when available) of the image.