So far, we have focused on minimizing the Lagrangian [6] \begin{equation} J = R + \lambda D, \label{eq:RD} \end{equation} where \(R\) is the data rate and \(D\) is an additive distortion metric, such as the RMSE, the PSNR, or the SSIM index [7]. However, the way in which human beings perceive distortion is generally different from how these metrics express it. This chapter introduces some of the most common ways of exploiting how humans perceive visual distortion.
Notice that if, according to the requirements of the encoding process, \(D\) is below a Threshold of Noticeable Distortion (ToND), the RDO process described by Eq. \eqref{eq:RD} boils down to selecting the option with the smallest \(R\).
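The selection rule above can be sketched as follows. This is a minimal, hypothetical illustration (the function name, candidate format, and parameters are not from any standard codec): each coding option is a (rate, distortion) pair, and a ToND threshold, when given, short-circuits the Lagrangian minimization.

```python
# Hypothetical sketch of Lagrangian RDO with a ToND shortcut.
# Each candidate is a (rate, distortion) pair; `lam` is the Lagrange
# multiplier and `tond` the (optional) Threshold of Noticeable Distortion.

def rdo_select(candidates, lam, tond=None):
    """Return the index of the best (rate, distortion) candidate."""
    if tond is not None:
        # Candidates whose distortion is imperceptible (D <= ToND):
        ok = [i for i, (r, d) in enumerate(candidates) if d <= tond]
        if ok:
            # RDO boils down to picking the smallest rate.
            return min(ok, key=lambda i: candidates[i][0])
    # Otherwise, minimize the Lagrangian J = R + lam * D.
    return min(range(len(candidates)),
               key=lambda i: candidates[i][0] + lam * candidates[i][1])

choices = [(100, 2.0), (60, 5.0), (80, 3.5)]
rdo_select(choices, lam=10.0)            # minimizes J over all options
rdo_select(choices, lam=10.0, tond=4.0)  # smallest R among D <= ToND
```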
The Weber-Fechner law states that the minimum perceivable difference in a visual stimulus increases with the background luminance [5], up to a point at which it decreases. Therefore, the distortion generated by the lossy coding of an image is perceived less in areas with very high and very low intensity values. For this reason, one of the most widely used quantizers is the deadzone quantizer, which, in general, also replaces signal noise (for example, electronic noise) with quantization noise where the SNR of the signal is small (around 0).
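A deadzone quantizer can be sketched in a few lines. This is an illustrative implementation, not the one defined by any particular standard; it obtains the widened zero bin simply by truncating toward zero, which maps all values in \((-\Delta, \Delta)\) to 0 and thereby suppresses low-amplitude noise around 0.

```python
# Minimal deadzone quantizer sketch (uniform step `step`, widened zero bin).
# Truncation toward zero yields a deadzone of width 2*step, so small
# (noise-like) values around 0 are quantized away.

def deadzone_quantize(x, step):
    return int(x / step)  # int() truncates toward zero in Python

def deadzone_dequantize(k, step):
    # Reconstruct at the midpoint of each non-zero bin.
    if k == 0:
        return 0.0
    sign = 1 if k > 0 else -1
    return sign * (abs(k) + 0.5) * step
```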
The HVS can be modeled as a low-pass filter whose cutoff frequency depends on the distance between the observer and the image, and on the spatial-frequency content of the image.
Some DCT-based image and video coding standards, such as JPEG and H.264, define quantization matrices designed for perceptual coding [2]. These matrices specify a different quantization step size for each \(8\times 8\)-DCT coefficient; their values were found through a study of the subjective impact of the quantization of each coefficient on the ToND. In the case of H.264, such matrices can change between images [5].
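The use of a quantization matrix can be sketched as an element-wise division of a DCT block. The matrix `Q` below is illustrative only (it is not a standard JPEG or H.264 table); it merely reproduces the typical pattern of small steps for low frequencies (top-left) and larger steps for high frequencies.

```python
import numpy as np

# Hypothetical quantization matrix: step sizes grow with the sum of the
# horizontal and vertical frequency indices (NOT a standard table).
Q = np.add.outer(np.arange(8), np.arange(8)) * 4 + 16

def quantize_block(coeffs, Q):
    """Quantize an 8x8 block of DCT coefficients element-wise."""
    return np.round(coeffs / Q).astype(int)

def dequantize_block(levels, Q):
    """Reconstruct the coefficients from the quantization indices."""
    return levels * Q
```

Note that a single step size per coefficient position amounts to shaping the quantization noise in the frequency domain, which is where the HVS sensitivity studies were carried out.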
In the case of JPEG 2000, each subband uses a different quantization step size [4]. However, note that these values depend on the selected DWT filter.
The quantizer generates quantization noise, which produces different coding artifacts; these are hardly perceived when the affected area of the encoded image is textured [8]. This masking effect holds as long as the distortion remains below the ToND.
Another important aspect of our perception is its directionality, which leads the HVS to be more sensitive to distortions added to horizontal and vertical frequencies than to diagonal frequencies [5].
Finally, the rationale behind temporal masking is that the HVS sensitivity to coding artifacts is lower in areas with very high motion activity.
In video, modeling temporal masking is more challenging because the spatio-temporal sensitivity function of the HVS is not separable, i.e., it depends on both the spatial and temporal frequencies [5]. However, sources of distortion such as mosquito noise are hardly perceived in video because this type of noise is temporally uncorrelated.
Loop filters are used in motion-compensated video codecs to improve visual quality (and also the RD performance of the encoder). For example, H.264/AVC uses (usually directional) deblocking filters in the encoding loop to smooth the transitions between blocks when the boundaries between them become perceptible. Loop filters significantly improve the perceived quality of the video in “flat” areas, where blocking can be more easily noticed.
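The key decision inside a deblocking filter, keeping real edges while smoothing blocking artifacts, can be illustrated with a toy 1-D example. This is a didactic sketch, not the H.264 filter: the threshold `beta` and the averaging rule are hypothetical.

```python
# Toy 1-D deblocking sketch: smooth the two samples that straddle a
# block boundary when their jump is small (likely a blocking artifact)
# and leave large jumps alone (likely a real edge in the content).
# `beta` is a hypothetical threshold, not the H.264 parameter.

def deblock_boundary(p, q, beta=8):
    """p, q: samples on each side of the boundary; returns filtered pair."""
    if abs(p - q) < beta:                      # weak discontinuity
        avg = (p + q) / 2
        return (p + avg) / 2, (q + avg) / 2    # pull both toward the mean
    return p, q                                # strong discontinuity: keep
```

In a real codec, the filtering strength also depends on the quantization step and on the coding modes of the two blocks, since those determine how likely the discontinuity is to be an artifact.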
The HVS can perceive only a finite number of different intensities (luma). This number depends on the dynamic range of the pixels, but, in general, we are unable to distinguish more than 64 intensity values [3].
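Assuming the 64-level limit stated above, an 8-bit luma sample could in principle be represented with only 64 levels by dropping its two least-significant bits. A minimal sketch (the function name is hypothetical):

```python
# Reduce 8-bit luma (256 levels) to 64 levels by dropping the two
# least-significant bits; assumes the ~64-level perceptual limit.

def reduce_levels(pixel):
    return (pixel >> 2) << 2  # 64 levels, step size 4
```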
Humans do not perceive (spatial) detail in chrominance as well as in luminance [1]. For this reason, the chroma can be downsampled to 1/4 of the original sampling rate without noticeable distortion. This feature is used in most lossy image and video encoding algorithms.
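The 1/4-rate chroma downsampling mentioned above (4:2:0 sampling) can be sketched by averaging each \(2\times 2\) block of a chroma plane. This is a simple box-filter sketch; real encoders may use other anti-aliasing filters, and `u` is assumed to have even dimensions.

```python
import numpy as np

# 4:2:0-style chroma subsampling sketch: average each 2x2 block of a
# chroma plane, keeping 1/4 of the original samples.

def subsample_420(u):
    h, w = u.shape
    return u.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

u = np.arange(16, dtype=float).reshape(4, 4)  # toy 4x4 chroma plane
subsample_420(u)                              # -> 2x2 plane of block means
```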
[1] W. Burger and M.J. Burge. Digital Image Processing: An Algorithmic Introduction Using Java. Springer, 2016.
[2] F. Ernawan and S.H. Nugraini. The optimal quantization matrices for JPEG image compression from psychovisual threshold. Journal of Theoretical and Applied Information Technology, 70(3):566–572, 2014.
[3] V. González-Ruiz. Visual Redundancy.
[4] F. Liu, E. Ahanonu, M.W. Marcellin, Y. Lin, A. Ashok, and A. Bilgin. Visibility of quantization errors in reversible JPEG2000. Signal Processing: Image Communication, 84:115812, 2020.
[5] M. Naccari and M. Mrak. Perceptually optimized video compression. In Academic Press Library in Signal Processing, volume 5, pages 155–196. Elsevier, 2014.
[6] G.J. Sullivan and T. Wiegand. Rate-distortion optimization for video compression. IEEE Signal Processing Magazine, 15(6):74–90, 1998.
[7] Z. Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
[8] H.R. Wu and K.R. Rao. Digital Video Image Quality and Perceptual Coding. CRC Press, 2017.