Image and video codecs represent multidimensional signals, which makes it possible to decode the encoded information in several ways (for example, at different resolutions or qualities). When the code-stream allows for this, we say that the code-stream generated by such a scheme is scalable.
Scalability is interesting in several contexts, but especially in streaming¹, and it has been implemented in most image and video coding standards.
As a general remark, data scalability in media coding implies some loss of RD (rate-distortion) efficiency.
In video coding, temporal scalability provides flexibility in the number of decoded frames.²
GOF-splitting (see temporal transforms) provides basic temporal scalability because GOFs (Groups Of Frames) can be decoded independently. This is used in video streaming services (such as YouTube) and in video players to move along time in fast-forward and fast-backward modes.
Notice that, in this context, the maximum degree of temporal scalability is achieved with the intra-coding mode (III...), in which every frame can be decoded independently of the rest.
Random access modes can provide dyadic temporal scalability in each GOF if only B-type frames are used and they are generated using Motion Compensated Temporal Filtering (MCTF) [2, 3].
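As a rough illustration of what "dyadic" means here, the following sketch lists which frames of a GOF would be rendered at each temporal resolution level. It assumes a GOF of 2^k frames and a level numbering in which level 0 keeps only the first frame of the GOF. The function name and the numbering are hypothetical and do not come from any standard.

```python
def frames_at_level(gof_size, level):
    """Return the frame indices (inside one GOF) rendered at a temporal level.

    Assumption: gof_size is a power of two. Level 0 keeps only the first
    frame of the GOF and each higher level doubles the frame rate.
    """
    if gof_size < 1 or gof_size & (gof_size - 1):
        raise ValueError("gof_size must be a power of two")
    max_level = gof_size.bit_length() - 1
    if not 0 <= level <= max_level:
        raise ValueError("level out of range")
    step = gof_size >> level  # temporal distance between rendered frames
    return list(range(0, gof_size, step))


# Frames rendered at each level for a GOF of 8 frames.
for level in range(4):
    print(level, frames_at_level(8, level))
```

Each level contains all the frames of the previous one, so a decoder can stop at any level and still render a regularly sampled sequence.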
Compressed images can be partially decoded, resulting in a reconstruction with lower resolution or in the reconstruction of a WOI (Window Of Interest) [1]. Such forms of scalability are used in interactive streaming to minimize latency and to avoid sending resolutions that some devices cannot display. An example of spatial scalability using JPEG2000 [1] can be found in the JHelioviewer service.
Laplacian pyramids are 2D multiresolution structures that can be used to provide spatial scalability in image coding. The main issue to solve here is the data redundancy overhead of the Laplacian Pyramid Transform (LPT) domain: in the transform domain we have more coefficients than pixels in the image domain.
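The size of this overhead can be estimated by counting coefficients. The sketch below assumes that each pyramid level halves both dimensions of the previous one (rounding up) and that `levels` is the number of decimation steps. Both function names are hypothetical.

```python
def laplacian_pyramid_shapes(rows, cols, levels):
    """Return the (rows, cols) shape of every level of the pyramid.

    Assumption: level 0 has the resolution of the image and each further
    level halves both dimensions (rounding up).
    """
    shapes = [(rows, cols)]
    for _ in range(levels):
        rows, cols = (rows + 1) // 2, (cols + 1) // 2
        shapes.append((rows, cols))
    return shapes


def redundancy(rows, cols, levels):
    """Ratio between the number of LPT coefficients and the number of pixels."""
    coefficients = sum(r * c for r, c in laplacian_pyramid_shapes(rows, cols, levels))
    return coefficients / (rows * cols)


print(redundancy(512, 512, 3))  # 1 + 1/4 + 1/16 + 1/64 = 1.328125
```

As the number of levels grows, the ratio approaches 4/3, that is, roughly one third more coefficients than pixels.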
2D-DWT (Discrete Wavelet Transform) domains are 2D multiresolution data structures that enable spatial scalability and, compared to the LPT, avoid the data redundancy overhead. JPEG2000 [1] is based on the DWT.
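A minimal sketch of this idea is shown below. It uses one level of an unnormalized 2D Haar transform written in plain Python, which is an assumption made for brevity (JPEG2000 uses other wavelet filters). The names `haar_dwt2`, `ll`, `dh`, `dv` and `dd` are hypothetical. The point is that the four subbands together hold exactly as many coefficients as the image has pixels, and that the low-pass subband alone is a half-resolution version of the image.

```python
def haar_dwt2(image):
    """One level of an unnormalized 2D Haar DWT.

    `image` is a list of rows with even dimensions. Returns four subbands
    (ll, dh, dv, dd), each with half the rows and half the columns.
    """
    ll, dh, dv, dd = [], [], [], []
    for y in range(0, len(image), 2):
        row_ll, row_dh, row_dv, row_dd = [], [], [], []
        for x in range(0, len(image[0]), 2):
            a, b = image[y][x], image[y][x + 1]
            c, d = image[y + 1][x], image[y + 1][x + 1]
            row_ll.append((a + b + c + d) / 4)  # local average (low-pass)
            row_dh.append((a - b + c - d) / 4)  # differences between columns
            row_dv.append((a + b - c - d) / 4)  # differences between rows
            row_dd.append((a - b - c + d) / 4)  # diagonal differences
        ll.append(row_ll)
        dh.append(row_dh)
        dv.append(row_dv)
        dd.append(row_dd)
    return ll, dh, dv, dd


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
subbands = haar_dwt2(image)
n_coefficients = sum(len(row) for band in subbands for row in band)
print(n_coefficients)  # same count as pixels in the image
print(subbands[0])     # half-resolution approximation
```

Decoding only `ll` gives the lower resolution. Adding the three detail subbands restores the full resolution without having transmitted any extra coefficient.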
In the case of video, spatial scalability provides 2D multiresolution rendering using only one (partially decoded) code-stream. This is usually implemented using the LPT because the DWT domain is not shift invariant (a displacement of the pixels in the image domain does not produce an equivalent displacement of the DWT coefficients).³ The concept here is to apply MC (Motion Compensation) [3] to each level of the Laplacian pyramid.
Spatial scalability can be used in video streaming to avoid playback interruptions by switching between resolutions, in video databases to save memory, and to render videos on displays with different resolutions.
Quality scalability makes it possible to add or remove visual information, depending on the amount of code-stream that is decoded. The spatial and temporal resolutions remain constant.
Some DCT-based image coding standards, such as JPEG, allow for progressive decoding, where a variable number of coefficients or bit planes of those coefficients are rendered. Note that if \(11\) is the number of bit planes required to represent the coefficients, a total of \(11\times 64\times 64\) quality levels is possible.
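To make the bit-plane idea concrete, the sketch below reconstructs a single coefficient from only its most significant bit planes. It assumes a sign-magnitude representation and fills the missing low-order planes with zeros. Practical decoders may reconstruct differently (for example, at the midpoint of the remaining uncertainty interval). The function name and parameters are hypothetical.

```python
def decode_bit_planes(coefficient, total_planes, received_planes):
    """Approximate a coefficient using only its most significant bit planes.

    Assumptions: sign-magnitude representation; bit planes that were not
    received are replaced by zeros.
    """
    if not 0 <= received_planes <= total_planes:
        raise ValueError("received_planes out of range")
    dropped = total_planes - received_planes
    magnitude = (abs(coefficient) >> dropped) << dropped  # clear missing planes
    return -magnitude if coefficient < 0 else magnitude


# A coefficient that needs 11 bit planes, decoded progressively.
for planes in (1, 4, 8, 11):
    print(planes, decode_bit_planes(1234, 11, planes))
```

Each additional bit plane halves the maximum reconstruction error, which is what produces the successive quality levels.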
The idea of bit-plane encoding in the DWT domain is used in JPEG 2000 [1]. Compared to JPEG, the number of quality levels is much higher, because in this case we can have up to \(R\times C\times B\) levels, where \(R\) is the number of rows, \(C\) the number of columns, and \(B\) the number of bit planes in the DWT domain. In JPEG 2000, the code-stream that represents a quality level is called a quality layer. All the code-streams that belong to the quality layer \(L\) generate points on the RD curve that are to the left of the RD points generated by the layer \(L+1\). In other words, the quality layers are sorted by their contribution to the quality of the reconstruction.
In the case of video, most video standards provide quality scalability by applying MC to successive refinements of the reconstructed sequence, at the maximum spatial resolution. This idea can be easily understood if we imagine a spatially scalable code-stream generated with the LPT, but in this case all the levels of the pyramid have the same spatial resolution.⁴
Although the two terms can be confused, simulcast is a streaming technique and data scalability is a coding technique.
Simulcast (used, for example, in DVB, ATSC and ISDB) is the parallel transmission of media with different resolutions and/or qualities. It is usually deployed using different code-streams on the sender side, although it could also be done using only one code-stream that is scalable in spatial resolution.
Adaptive bit-rate streaming allows us to adapt the transmission bit-rate of a point-to-point communication (of digital media) to the available capacity of the link (which is typically time-varying). This technique is used, for example, in the DASH standard, which is in turn used by services such as YouTube.
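The adaptation step can be sketched as a simple selection rule: among the bit-rates offered by the sender, pick the highest one that fits in the measured throughput. This is only an illustrative heuristic under stated assumptions, not the behavior of any particular player. The function name, the `safety` margin and the bit-rate ladder are hypothetical.

```python
def select_bitrate(available_kbps, measured_kbps, safety=0.8):
    """Pick the highest offered bit-rate that fits in the measured throughput.

    Assumptions: `available_kbps` lists the bit-rates offered by the sender,
    and `safety` is a margin that absorbs throughput fluctuations. If nothing
    fits, fall back to the lowest bit-rate.
    """
    budget = measured_kbps * safety
    candidates = [rate for rate in available_kbps if rate <= budget]
    return max(candidates) if candidates else min(available_kbps)


ladder = [400, 1000, 2500, 5000]  # hypothetical bit-rate ladder (kbps)
for throughput in (300, 3500, 10000):
    print(throughput, select_bitrate(ladder, throughput))
```

A client would re-evaluate this choice periodically (for example, before requesting each new segment) because the link capacity changes over time.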
[1] V. González-Ruiz. The JPEG2000 Standard.
[2] V. González-Ruiz. Motion Compensated Temporal Filtering (MCTF).
[3] V. González-Ruiz. Motion Compensation.
[4] V. González-Ruiz. Transform Coding.
[5] V. González-Ruiz. Video Scalability.
¹ Specifically, in real-time streaming scenarios we cannot prefetch much data before starting the rendering of the image or video (basically, because we cannot wait too long). In this case, we can adapt the quality of the rendering to the available bandwidth, a factor that we cannot control in most situations.
² Notice that the concept of temporal scalability cannot be applied to image coding.
³ The DWT domain is not redundant, but it is not shift invariant. To solve this problem, the DWT subbands must be interpolated to restore the lost phases. In this overcomplete domain the ME/MC algorithms work, but the phase used for the predicted images must be represented in the code-stream.
⁴ Notice that in this case, we should use the term “cubic building” instead of “pyramid”.
⁵ Obviously, we should expect worse motion fields than if we estimated them at the encoder because, in the decoder, the available information has been reduced by the lossy coding.