Sistemas Multimedia - Code-stream Scalability

Vicente González Ruiz - Depto Informática - UAL

August 9, 2024

Contents

 1 What is code-stream scalability?
 2 Temporal scalability in video coding [5]
  2.1 GOF-level scalability
  2.2 Frame-level scalability using MCTF
 3 Spatial scalability in image coding [1]
  3.1 Using the LPT (Laplacian Pyramid Transform)
  3.2 Using the DWT (Discrete Wavelet Transform) [1]
 4 Spatial scalability in video coding [5]
 5 Quality scalability in image coding [1]
  5.1 Using the DCT
  5.2 Using the DWT [4]
 6 Quality scalability in video coding [5]
 7 Simulcast vs. adaptive bit-rate streaming vs. data scalability
 8 To-Do
 9 References

1 What is code-stream scalability?

Image and video codecs represent multidimensional signals, which makes it possible to decode the information in several ways. When the code-stream allows for this, we say that the code-stream generated by such a scheme is scalable.

Scalability is interesting in several contexts, especially in streaming1, and is implemented in most image and video coding standards.

As a general remark, data scalability in media coding implies some loss of rate-distortion (RD) efficiency.

2 Temporal scalability in video coding [5]

In video coding, temporal scalability provides flexibility with the number of decoded frames.2

2.1 GOF-level scalability

GOF-splitting (see temporal transforms) provides basic temporal scalability (GOFs can be decoded independently). This is used by video streaming services (such as YouTube) and by video players to seek along the time axis in fast-forward and fast-backward modes.
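The random-access property of independently decodable GOFs can be sketched as follows. This is an illustrative model, not code from any standard; the GOF size of 8 is an assumption for the example.

```python
# Sketch: random access in a GOF-partitioned stream.
# Each GOF starts with an I-frame, so decoding can begin at any GOF boundary.

GOF_SIZE = 8  # frames per GOF (an assumption for this example)

def gof_of(frame_index):
    """Return the GOF that contains a frame, and the offset inside it."""
    return frame_index // GOF_SIZE, frame_index % GOF_SIZE

def frames_to_decode(target_frame):
    """Frames that must be decoded to render `target_frame`:
    everything from the start of its GOF (the I-frame) up to it."""
    gof, _ = gof_of(target_frame)
    start = gof * GOF_SIZE
    return list(range(start, target_frame + 1))

# Seeking to frame 21 only requires decoding frames 16..21,
# not the whole sequence from frame 0.
print(frames_to_decode(21))  # [16, 17, 18, 19, 20, 21]
```

In the intra-coding mode (III...) mentioned above, GOF_SIZE would be 1 and any frame could be decoded on its own.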

Notice that, in this context, the maximum degree of temporal scalability is achieved with the intra-coding mode (III...), in which every frame can be decoded independently.

2.2 Frame-level scalability using MCTF

Random access modes can provide dyadic temporal scalability inside each GOF if only B-type frames are used, generated using Motion Compensated Temporal Filtering (MCTF) [2].
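The dyadic structure can be illustrated with a motion-free Haar temporal filter (a simplification of MCTF [2], which would additionally motion-compensate each pair of frames). Decoding only the low-pass frames halves the frame rate at each level; adding the high-pass frames restores the full rate.

```python
import numpy as np

def haar_temporal_analysis(frames):
    """One level of (motion-free) Haar temporal filtering: each pair of
    frames produces one low-pass (average) and one high-pass (difference)."""
    L = [(frames[2*i] + frames[2*i + 1]) / 2 for i in range(len(frames) // 2)]
    H = [(frames[2*i] - frames[2*i + 1]) / 2 for i in range(len(frames) // 2)]
    return L, H

def haar_temporal_synthesis(L, H):
    """Inverse of the analysis step (perfect reconstruction)."""
    frames = []
    for l, h in zip(L, H):
        frames.append(l + h)
        frames.append(l - h)
    return frames

# A GOF of 4 "frames" (tiny 2x2 arrays for the example).
gof = [np.full((2, 2), float(v)) for v in (10, 12, 20, 22)]

L1, H1 = haar_temporal_analysis(gof)  # 1/2 of the frame rate
L2, H2 = haar_temporal_analysis(L1)   # 1/4 of the frame rate

# Decoding only L2 renders 1/4 of the frames; adding H2 gives 1/2;
# adding H1 recovers the full frame rate exactly.
rec = haar_temporal_synthesis(haar_temporal_synthesis(L2, H2), H1)
```

In real MCTF the averages and differences are computed along motion trajectories, which concentrates more energy in the low-pass frames.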

3 Spatial scalability in image coding [1]

Compressed images can be partially decoded, resulting in a reconstruction with smaller resolution or a reconstruction of a WOI (Window Of Interest) [1]. Such forms of scalability are used in interactive streaming to minimize latency and avoid sending resolutions that some devices cannot display. An example of spatial scalability using JPEG2000 [1] can be found in the JHelioviewer service.

3.1 Using the LPT (Laplacian Pyramid Transform)

Laplacian pyramids are 2D multiresolution structures that can be used to provide spatial scalability in image coding. The main issue to solve here is the data redundancy overhead of the LPT domain (in the transform domain we have more coefficients than pixels in the image domain).
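A minimal sketch of an LPT, using 2x2-average downsampling and pixel-replication upsampling (cruder filters than a real implementation would use), makes the redundancy overhead concrete: a 2-level pyramid of an 8x8 image holds 84 coefficients for 64 pixels.

```python
import numpy as np

def downsample(x):
    """2x decimation by averaging 2x2 blocks."""
    return x.reshape(x.shape[0] // 2, 2, x.shape[1] // 2, 2).mean(axis=(1, 3))

def upsample(x):
    """2x expansion by pixel replication (a crude interpolation)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def laplacian_pyramid(img, levels):
    pyramid, g = [], img
    for _ in range(levels):
        g_next = downsample(g)
        pyramid.append(g - upsample(g_next))  # detail (Laplacian) level
        g = g_next
    pyramid.append(g)                         # coarsest (Gaussian) level
    return pyramid

def reconstruct(pyramid):
    g = pyramid[-1]
    for detail in reversed(pyramid[:-1]):
        g = upsample(g) + detail
    return g

img = np.arange(64, dtype=float).reshape(8, 8)
pyr = laplacian_pyramid(img, 2)

coeffs = sum(level.size for level in pyr)
print(coeffs, img.size)  # 84 64 -> about 31% more coefficients than pixels
```

Each prefix of the pyramid (coarsest level, plus as many detail levels as desired) yields a valid reconstruction at the corresponding resolution, which is exactly the spatial scalability we are after.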

3.2 Using the DWT (Discrete Wavelet Transform) [1]

2D-DWT domains are 2D multiresolution data structures that enable spatial scalability, and in this case, compared to LPT, the data redundancy overhead is avoided. JPEG2000 [1] is based on DWT.
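A one-level 2D Haar DWT (the simplest wavelet, used here only for illustration) shows both properties: the transform is critically sampled (same number of coefficients as pixels), and the LL subband alone is already a half-resolution version of the image, so a partial decode gives spatial scalability for free.

```python
import numpy as np

def haar_dwt2(x):
    """One level of the 2D Haar DWT: filter rows, then columns."""
    a = (x[:, 0::2] + x[:, 1::2]) / 2   # horizontal low-pass
    d = (x[:, 0::2] - x[:, 1::2]) / 2   # horizontal high-pass
    LL = (a[0::2] + a[1::2]) / 2
    LH = (a[0::2] - a[1::2]) / 2
    HL = (d[0::2] + d[1::2]) / 2
    HH = (d[0::2] - d[1::2]) / 2
    return LL, LH, HL, HH

img = np.arange(64, dtype=float).reshape(8, 8)
LL, LH, HL, HH = haar_dwt2(img)

# Critically sampled: exactly as many coefficients as pixels ...
assert LL.size + LH.size + HL.size + HH.size == img.size

# ... and LL is already a half-resolution reconstruction.
print(LL.shape)  # (4, 4)
```

Applying the same decomposition recursively to LL produces the multiresolution structure that JPEG2000 [1] exploits.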

4 Spatial scalability in video coding [5]

In the case of video, spatial scalability provides 2D multiresolution rendering using only one (partially decoded) code-stream. This possibility is usually generated using the LPT because the DWT domain is not shift invariant (the DWT is not invariant to displacement of the pixels in the image domain).3 The concept here is to apply MC [3] to each level of the Laplacian pyramid.

Spatial scalability can be used in video streaming to avoid interruptions during playback by switching between resolutions, in video databases to save storage, and when rendering video on displays with different resolutions.

5 Quality scalability in image coding [1]

Quality scalability makes it possible to add or remove visual information depending on the amount of decoded code-stream, while the spatial and temporal resolutions remain constant.

5.1 Using the DCT

Some DCT-based image coding standards, such as JPEG, allow for progressive decoding, where a variable number of coefficients, or bit planes of those coefficients, are rendered. Note that if \(11\) is the number of bit planes required to represent the coefficients, a total of \(11\times 64\times 64\) quality levels is possible.
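The coefficient-by-coefficient flavor of progressive decoding can be sketched for a single 8x8 block. The low-frequency-first ordering by \(u+v\) below is a rough stand-in for JPEG's zigzag scan, not the exact scan order.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix (rows = frequencies)."""
    k = np.arange(n)
    M = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    M[0] /= np.sqrt(2)
    return M * np.sqrt(2 / n)

D = dct_matrix()
block = np.arange(64, dtype=float).reshape(8, 8)
coeffs = D @ block @ D.T                  # forward 2D DCT of one block

def progressive_decode(coeffs, kept):
    """Reconstruct using only the `kept` lowest-frequency coefficients,
    ordered by u+v (a simplification of the JPEG zigzag scan)."""
    u, v = np.meshgrid(np.arange(8), np.arange(8), indexing="ij")
    order = np.argsort((u + v).ravel(), kind="stable")
    mask = np.zeros(64)
    mask[order[:kept]] = 1
    return D.T @ (coeffs * mask.reshape(8, 8)) @ D  # inverse 2D DCT

# The reconstruction improves as more coefficients arrive;
# with all 64 the block is recovered exactly.
for kept in (1, 8, 64):
    err = np.abs(progressive_decode(coeffs, kept) - block).max()
    print(kept, round(err, 6))
```

Interleaving this coefficient progression with bit-plane (successive approximation) progression multiplies the number of available quality levels.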

5.2 Using the DWT [4]

The idea of bit-plane encoding in the DWT domain is used in JPEG 2000 [1]. Compared to JPEG, the number of quality levels is much higher, because in this case we can have up to \(R\times C\times B\), where \(R\) is the number of rows, \(C\) the number of columns, and \(B\) the number of bit planes in the DWT domain. In JPEG 2000, the code-stream that represents a quality level is called a quality layer. All the code-streams that belong to quality layer \(L\) generate points on the RD curve that lie to the left of the RD points generated by layer \(L+1\). In other words, the quality layers are sorted by their contribution to the quality of the reconstruction.
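The refinement behavior of bit-plane coding can be shown directly: transmitting the planes from most to least significant halves the maximum reconstruction error with each extra plane. This sketch works on non-negative integer coefficients and ignores the entropy coding and context modeling that a real JPEG 2000 codec would add.

```python
import numpy as np

def bitplanes(coeffs, n_planes):
    """Split non-negative integer coefficients into bit planes,
    most significant plane first (each plane ~ one refinement step)."""
    return [(coeffs >> p) & 1 for p in range(n_planes - 1, -1, -1)]

def partial_decode(planes, n_planes, received):
    """Reconstruct from the first `received` (most significant) planes;
    the missing, less significant bits are assumed to be zero."""
    x = np.zeros_like(planes[0])
    for i in range(received):
        x = (x << 1) | planes[i]
    return x << (n_planes - received)

c = np.array([[200, 13], [97, 255]])   # toy "coefficients"
planes = bitplanes(c, 8)

# Each extra plane roughly halves the maximum reconstruction error:
for r in (2, 5, 8):
    print(r, np.abs(partial_decode(planes, 8, r) - c).max())  # 63, 7, 0
```

In JPEG 2000 a quality layer packages contributions from many code-blocks, but the underlying principle is this most-significant-first refinement.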

6 Quality scalability in video coding [5]

In the case of video, most video standards provide quality scalability by applying MC to successive refinements of the reconstructed sequence, at the maximum spatial resolution. This idea can be easily understood if we imagine a spatially scalable code-stream generated with the LPT, but in this case all the levels of the pyramid have the same spatial resolution4.

7 Simulcast vs. adaptive bit-rate streaming vs. data scalability

Although these terms are sometimes confused, simulcast is a streaming technique, whereas data scalability is a coding technique.

Simulcast (used, for example, in DVB, ATSC and ISDB) is the parallel transmission of media at different resolutions and/or qualities. It is usually deployed using different code-streams on the sender side, although it could also be implemented using a single code-stream that is scalable in spatial resolution.

Adaptive bit-rate streaming adapts the transmission bit-rate of a point-to-point communication (of digital media) to the available capacity of the link, which is typically time-varying. This technique is used, for example, by the DASH standard, employed by services such as YouTube.
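The selection logic of adaptive bit-rate streaming can be sketched as a throughput-based choice among pre-encoded representations. The bit-rate ladder and the 0.8 safety factor are hypothetical values for illustration, not taken from the DASH standard.

```python
# Sketch of DASH-style rate adaptation: pick the highest-rate
# representation that fits the measured link throughput.

REPRESENTATIONS = [250_000, 800_000, 2_400_000, 6_000_000]  # bits/s (hypothetical ladder)

def select_representation(throughput_bps, safety=0.8):
    """Choose the best representation not exceeding a safety
    fraction of the measured throughput; fall back to the lowest."""
    budget = throughput_bps * safety
    candidates = [r for r in REPRESENTATIONS if r <= budget]
    return max(candidates) if candidates else min(REPRESENTATIONS)

print(select_representation(1_000_000))  # 800000
print(select_representation(100_000))    # 250000 (lowest, as a fallback)
```

With simulcast each rung of the ladder is an independent code-stream; with a scalable code-stream the sender would instead truncate a single stream at the corresponding layer.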

8 To-Do

  1. The \(L\)-levels DWT provides \(L+1\) spatial resolution levels of an image. Modify 2D-DWT.py to include this functionality. Complexity 3.
  2. The \(2^n\times 2^n\)-DCT domain can be decoded by resolution levels by applying an inverse \(2^m\times 2^m\)-DCT, where \(m=0,1,\cdots ,n\), to the lower frequency subbands (notice that the inverse \(0\times 0\)-DCT does not perform any computation). Implement such an image decoder in 2D-DCT.py. See the notebook Image Compression with YCoCg + 2D-DCT. Complexity 4.
  3. In video coding, we can obtain spatial scalability if we build a Laplacian pyramid of the frames and compress each level of the sequence using a normal video encoder. Notice that we can use the reconstructed sequence at the spatial level \(l\) to improve the predictions for the level \(l-1\). Incorporate this functionality to MPNG.py. Complexity 8.
  4. The spatial resolution level \(l\) of the reconstructed video can be used (after interpolation) to estimate the motion at the \(l-1\) level5, making the transmission of motion fields unnecessary for resolution level \(l-1\). Explore this in MPNG.py. Complexity 7.

9 References

[1]   V. González-Ruiz. The JPEG2000 Standard.

[2]   V. González-Ruiz. Motion Compensated Temporal Filtering (MCTF).

[3]   V. González-Ruiz. Motion Compensation.

[4]   V. González-Ruiz. Transform Coding.

[5]   V. González-Ruiz. Video Scalability.

1Specifically, in real-time streaming scenarios we cannot prefetch much data before starting the rendering of the image or video (basically, because we cannot wait too long). In this case, we can adapt the quality of the rendering to the available bandwidth, a factor that we cannot control in most situations.

2Notice that the concept of temporal scalability cannot be applied to image coding.

3The DWT domain is not redundant, but it is not shift invariant. To solve this problem, the DWT subbands must be interpolated to restore the lost phases. In this overcomplete domain the ME/MC algorithms work, but the phase used for the predicted images must be represented in the code-stream.

4Notice that in this case, we should use the term “cubic building” instead of “pyramid”.

5Obviously, expecting worse motion fields than if we estimated them at the encoder because, at the decoder, the available information has been degraded by lossy coding.