So far, we have focused on minimizing the Lagrangian [6] \begin{equation} J = R + \lambda D, \label{eq:RD} \end{equation} where \(R\) is the data rate and \(D\) is an additive distortion metric, such as the RMSE, the PSNR, or the SSIM index [7]. However, the way in which human beings perceive distortion is generally different from how these metrics express it. This chapter introduces some of the most common ways of exploiting how humans perceive visual distortion.
Notice that if, according to the requirements of the encoding process, \(D\) is below a Threshold of Noticeable Distortion (ToND), the RDO process described by Eq. \eqref{eq:RD} boils down to selecting the option with the smallest \(R\).
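The selection rule above can be sketched as follows. This is a minimal, hypothetical illustration (the function name, candidate format, and parameters are not from any standard codec): each coding option is a (rate, distortion) pair, and a ToND threshold, when given, short-circuits the Lagrangian minimization.

```python
# Hypothetical sketch of Lagrangian RDO with a ToND shortcut.
# Each candidate is a (rate, distortion) pair; `lam` is the Lagrange
# multiplier and `tond` the (optional) Threshold of Noticeable Distortion.

def rdo_select(candidates, lam, tond=None):
    """Return the index of the best (rate, distortion) candidate."""
    if tond is not None:
        # Candidates whose distortion is imperceptible (D <= ToND):
        ok = [i for i, (r, d) in enumerate(candidates) if d <= tond]
        if ok:
            # RDO boils down to picking the smallest rate.
            return min(ok, key=lambda i: candidates[i][0])
    # Otherwise, minimize the Lagrangian J = R + lam * D.
    return min(range(len(candidates)),
               key=lambda i: candidates[i][0] + lam * candidates[i][1])

choices = [(100, 2.0), (60, 5.0), (80, 3.5)]
rdo_select(choices, lam=10.0)            # minimizes J over all options
rdo_select(choices, lam=10.0, tond=4.0)  # smallest R among D <= ToND
```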
The Weber-Fechner law states that the minimum perceivable difference in a visual stimulus increases with the background luminance [5], up to a point at which it decreases. Therefore, the distortion generated by the lossy coding of an image is perceived less in areas with very high and very low intensity values. For this reason, one of the most widely used quantizers is the deadzone quantizer, which, in general, also replaces signal noise (for example, electronic noise) with quantization noise where the SNR of the signal is small (around 0).
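A deadzone quantizer can be sketched in a few lines. This is an illustrative implementation, not the one defined by any particular standard; it obtains the widened zero bin simply by truncating toward zero, which maps all values in \((-\Delta, \Delta)\) to 0 and thereby suppresses low-amplitude noise around 0.

```python
# Minimal deadzone quantizer sketch (uniform step `step`, widened zero bin).
# Truncation toward zero yields a deadzone of width 2*step, so small
# (noise-like) values around 0 are quantized away.

def deadzone_quantize(x, step):
    return int(x / step)  # int() truncates toward zero in Python

def deadzone_dequantize(k, step):
    # Reconstruct at the midpoint of each non-zero bin.
    if k == 0:
        return 0.0
    sign = 1 if k > 0 else -1
    return sign * (abs(k) + 0.5) * step
```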
The HVS can be modeled as a low-pass filter whose cutoff frequency depends on the distance between the observer and the image, and on the spatial-frequency content of the image.
Some DCT-based image and video coding standards, such as JPEG and H.264, define quantization matrices designed for perceptual coding [2]. These matrices specify a different quantization step size for each \(8\times 8\)-DCT coefficient; their values were found through a study of the subjective impact of the quantization of each coefficient on the ToND. In the case of H.264, such matrices can change between images [5].
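The use of a quantization matrix can be sketched as an element-wise division of a DCT block. The matrix `Q` below is illustrative only (it is not a standard JPEG or H.264 table); it merely reproduces the typical pattern of small steps for low frequencies (top-left) and larger steps for high frequencies.

```python
import numpy as np

# Hypothetical quantization matrix: step sizes grow with the sum of the
# horizontal and vertical frequency indices (NOT a standard table).
Q = np.add.outer(np.arange(8), np.arange(8)) * 4 + 16

def quantize_block(coeffs, Q):
    """Quantize an 8x8 block of DCT coefficients element-wise."""
    return np.round(coeffs / Q).astype(int)

def dequantize_block(levels, Q):
    """Reconstruct the coefficients from the quantization indices."""
    return levels * Q
```

Note that a single step size per coefficient position amounts to shaping the quantization noise in the frequency domain, which is where the HVS sensitivity studies were carried out.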
In the case of JPEG 2000, each subband uses a different quantization step size [4]. However, note that these values depend on the selected DWT filter.
The quantizer generates quantization noise, which produces different coding artifacts; these are hardly perceived when the affected area of the encoded image is textured [8]. This masking effect holds as long as the distortion remains below the ToND.
Another important aspect of our perception is its directionality, which leads the HVS to be more sensitive to distortions added to horizontal and vertical frequencies than to diagonal frequencies [5].
Finally, the rationale behind temporal masking is that the HVS sensitivity to coding artifacts is lower in areas with very high motion activity.
In video, modeling temporal masking is more challenging because the spatio-temporal sensitivity function of the HVS is not separable, i.e., it depends on both the spatial and temporal frequencies [5]. However, sources of distortion such as mosquito noise are hardly perceived in video because this type of noise is temporally uncorrelated.
Loop filters are used in motion-compensated video codecs to improve visual quality (and also the RD performance of the encoder). For example, H.264/AVC uses (usually directional) deblocking filters in the encoding loop to smooth the transitions between blocks when the boundaries between them become perceptible. Loop filters significantly improve the perceived quality of the video in “flat” areas, where blocking can be more easily noticed.
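The key decision inside a deblocking filter, keeping real edges while smoothing blocking artifacts, can be illustrated with a toy 1-D example. This is a didactic sketch, not the H.264 filter: the threshold `beta` and the averaging rule are hypothetical.

```python
# Toy 1-D deblocking sketch: smooth the two samples that straddle a
# block boundary when their jump is small (likely a blocking artifact)
# and leave large jumps alone (likely a real edge in the content).
# `beta` is a hypothetical threshold, not the H.264 parameter.

def deblock_boundary(p, q, beta=8):
    """p, q: samples on each side of the boundary; returns filtered pair."""
    if abs(p - q) < beta:                      # weak discontinuity
        avg = (p + q) / 2
        return (p + avg) / 2, (q + avg) / 2    # pull both toward the mean
    return p, q                                # strong discontinuity: keep
```

In a real codec, the filtering strength also depends on the quantization step and on the coding modes of the two blocks, since those determine how likely the discontinuity is to be an artifact.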
The HVS can perceive only a finite number of different intensities (luma). This number depends on the dynamic range of the pixels, but, in general, we are unable to distinguish more than 64 intensity values [3].
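Assuming the 64-level limit stated above, an 8-bit luma sample could in principle be represented with only 64 levels by dropping its two least-significant bits. A minimal sketch (the function name is hypothetical):

```python
# Reduce 8-bit luma (256 levels) to 64 levels by dropping the two
# least-significant bits; assumes the ~64-level perceptual limit.

def reduce_levels(pixel):
    return (pixel >> 2) << 2  # 64 levels, step size 4
```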
Humans do not perceive (spatial) detail in chrominance as well as in luminance [1]. For this reason, the chroma can be downsampled to 1/4 of the original sampling rate without noticeable distortion. This feature is used in most lossy image and video encoding algorithms.
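The 1/4-rate chroma downsampling mentioned above (4:2:0 sampling) can be sketched by averaging each \(2\times 2\) block of a chroma plane. This is a simple box-filter sketch; real encoders may use other anti-aliasing filters, and `u` is assumed to have even dimensions.

```python
import numpy as np

# 4:2:0-style chroma subsampling sketch: average each 2x2 block of a
# chroma plane, keeping 1/4 of the original samples.

def subsample_420(u):
    h, w = u.shape
    return u.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

u = np.arange(16, dtype=float).reshape(4, 4)  # toy 4x4 chroma plane
subsample_420(u)                              # -> 2x2 plane of block means
```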
[1] W. Burger and M.J. Burge. Digital Image Processing: An Algorithmic Introduction Using Java. Springer, 2016.
[2] F. Ernawan and S.H. Nugraini. The optimal quantization matrices for JPEG image compression from psychovisual threshold. Journal of Theoretical and Applied Information Technology, 70(3):566–572, 2014.
[3] V. González-Ruiz. Visual Redundancy.
[4] F. Liu, E. Ahanonu, M.W. Marcellin, Y. Lin, A. Ashok, and A. Bilgin. Visibility of quantization errors in reversible JPEG2000. Signal Processing: Image Communication, 84:115812, 2020.
[5] M. Naccari and M. Mrak. Perceptually optimized video compression. In Academic Press Library in Signal Processing, volume 5, pages 155–196. Elsevier, 2014.
[6] G.J. Sullivan and T. Wiegand. Rate-distortion optimization for video compression. IEEE Signal Processing Magazine, 15(6):74–90, 1998.
[7] Z. Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
[8] H.R. Wu and K.R. Rao. Digital Video Image Quality and Perceptual Coding. CRC Press, 2017.