Temporal Transforms

In general, neighboring frames (or images) in a (video) sequence exhibit a high degree of temporal correlation, which can be exploited to significantly improve the RD (rate-distortion) curves. This correlation produces a temporal redundancy that can be removed using a (temporal) transform.

A temporal transform inputs two or more frames¹ and outputs at least one residual (frame) whose pixels have a higher dynamic range but, in general, also a lower entropy (see the spatial transform theory).
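
The effect described above can be illustrated with a minimal sketch. It assumes the simplest possible temporal transform, a plain frame difference, which is chosen here only for illustration. The toy 1D "frames", the helper names `entropy` and `frame_difference`, and the use of the zero-order empirical entropy are all assumptions, not something taken from the text.

```python
from collections import Counter
from math import log2

def entropy(samples):
    """Zero-order (empirical) entropy, in bits per sample."""
    counts = Counter(samples)
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in counts.values())

def frame_difference(current, previous):
    """Residual of a trivial temporal transform: current - previous."""
    return [c - p for c, p in zip(current, previous)]

# Hypothetical 8-bit toy frames: a ramp and a slightly modified copy,
# so that the two neighboring frames are highly correlated.
frame_0 = [i % 256 for i in range(1024)]
frame_1 = [min(255, v + (i % 2)) for i, v in enumerate(frame_0)]

residual = frame_difference(frame_1, frame_0)

# For 8-bit inputs the residual can take values in [-255, 255]
# (a higher dynamic range), but its entropy is typically much lower.
print(entropy(frame_1), entropy(residual))
```

In this toy case the residual only takes the values 0 and 1, so it is far cheaper to code than the original frame, and adding it back to `frame_0` recovers `frame_1` exactly.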

2 Motion Compensation (MC)

Most video coding standards use Motion Compensation (MC) to generate the residual frames [2]. MC exploits the temporal correlation and reduces the entropy of the residuals². Basically, MC consists of subtracting from each original frame a prediction (frame) built only from information that is also available³ at the decoder. Notice that, after using MC, the number of residual pixels is equal to the number of pixels in the compensated frame.⁴
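
The subtraction at the encoder and the matching addition at the decoder can be sketched as follows. This is a minimal illustration that assumes a single global motion vector for the whole frame and border clamping for out-of-frame coordinates. Both choices, as well as the names `predict`, `mc_encode` and `mc_decode`, are assumptions made for this example.

```python
def predict(reference, mv):
    """Build a prediction by displacing the reference frame by mv = (dy, dx).
    Coordinates falling outside the frame are clamped to the border."""
    dy, dx = mv
    h, w = len(reference), len(reference[0])
    return [[reference[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
             for x in range(w)] for y in range(h)]

def mc_encode(frame, reference, mv):
    """Encoder side: residual = original frame - prediction."""
    prediction = predict(reference, mv)
    return [[f - p for f, p in zip(f_row, p_row)]
            for f_row, p_row in zip(frame, prediction)]

def mc_decode(residual, reference, mv):
    """Decoder side: reconstruction = residual + the same prediction."""
    prediction = predict(reference, mv)
    return [[r + p for r, p in zip(r_row, p_row)]
            for r_row, p_row in zip(residual, prediction)]

# Hypothetical data: the scene moves one pixel to the left and new
# content (reference value + 5) enters through the right edge.
reference = [[10 * y + x for x in range(6)] for y in range(4)]
frame = [row[1:] + [row[-1] + 5] for row in reference]
mv = (0, 1)

residual = mc_encode(frame, reference, mv)
reconstruction = mc_decode(residual, reference, mv)
```

Because the decoder builds exactly the same prediction as the encoder, the process is reversible. The residual has the same number of pixels as the frame, and it is nonzero only where the prediction fails (here, the last column).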

3 Motion Estimation (ME)

To compensate the motion, we first need to estimate⁵ it using Motion Estimation (ME) techniques [3]. Using the motion fields generated by the motion estimator, both the encoder and the decoder generate the predictions, which are subtracted from the original images by the encoder and added to the residuals by the decoder [2]. However, notice that in most video coding standards ME is performed only by the encoder, because it is a costly operation, and for this reason the motion vector fields must be transmitted to the decoder. This responds to the idea of “compress once, decompress many”.
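
As an illustration of why ME is costly, the following sketch estimates the motion of one block by exhaustive search. The text does not name a specific estimator, so full-search block matching with the sum of absolute differences (SAD) as the matching criterion is an assumption, as are the function names and the toy frames.

```python
def sad(frame, reference, y, x, dy, dx, size):
    """Sum of absolute differences between the block of `frame` at (y, x)
    and the block of `reference` displaced by (dy, dx)."""
    return sum(abs(frame[y + j][x + i] - reference[y + dy + j][x + dx + i])
               for j in range(size) for i in range(size))

def estimate_block_motion(frame, reference, y, x, size, search_range):
    """Exhaustive search: try every displacement within +/- search_range
    and keep the one with the lowest SAD. Returns (motion_vector, cost)."""
    h, w = len(reference), len(reference[0])
    best_mv, best_cost = (0, 0), sad(frame, reference, y, x, 0, 0, size)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            inside = (0 <= y + dy and y + dy + size <= h and
                      0 <= x + dx and x + dx + size <= w)
            if not inside:
                continue  # skip candidates that leave the reference frame
            cost = sad(frame, reference, y, x, dy, dx, size)
            if cost < best_cost:
                best_mv, best_cost = (dy, dx), cost
    return best_mv, best_cost

# Hypothetical 8x8 frames: `frame` shows the content of `reference`
# displaced by (1, 2), with clamping at the borders.
reference = [[8 * y + x for x in range(8)] for y in range(8)]
frame = [[reference[min(y + 1, 7)][min(x + 2, 7)] for x in range(8)]
         for y in range(8)]

mv, cost = estimate_block_motion(frame, reference, 2, 2, 4, 2)
print(mv, cost)
```

Each block requires up to (2·search_range + 1)² SAD evaluations, which gives an idea of why only the encoder runs the search and then transmits the resulting vectors.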

4 GOF-ing

The RD performance of ME/MC depends on the amount of temporal redundancy in the sequence. If this amount is low, it can be more RD-efficient to interrupt the (ME/)MC process. The set of consecutive frames in which MC is active is usually known as a GOF⁶ (Group of Frames). Notice that (under the RD prism) the length of the GOFs is variable, and therefore the GOF partition should be an adaptive process controlled by an RDO (rate-distortion optimization) algorithm.
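
A minimal sketch of such an adaptive partition is given below. It assumes that, for each frame, the encoder already has an estimate of the RD cost of coding it with MC and of coding it independently. These per-frame costs, the decision rule and the name `adaptive_gofs` are hypothetical and only serve to illustrate the idea.

```python
def adaptive_gofs(mc_costs, intra_costs):
    """Start a new GOF whenever coding frame k with MC is estimated to
    cost more than coding it independently of the previous frames.
    Frame 0 always starts the first GOF, so mc_costs[0] is ignored."""
    gofs = [[0]]
    for k in range(1, len(mc_costs)):
        if mc_costs[k] > intra_costs[k]:
            gofs.append([k])      # low temporal redundancy: interrupt MC
        else:
            gofs[-1].append(k)    # MC is still worthwhile: extend the GOF
    return gofs

# Hypothetical costs: something like a scene change happens at frame 3.
mc_costs = [0, 3, 4, 9, 2, 3]
intra_costs = [5, 5, 5, 5, 5, 5]
print(adaptive_gofs(mc_costs, intra_costs))  # [[0, 1, 2], [3, 4, 5]]
```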

However, in some contexts⁷ it may be necessary to use a fixed GOF partition [2]. For example, if we want to give users the option to move fast forward or backward along the sequence, we need to set a given GOF size. Another reason to use a defined GOF size is to limit the propagation of decoding errors (for example, because some data has not been received in a streaming session). When a new GOF starts, the propagation of such errors is stopped.
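
The two uses of a fixed GOF size mentioned above can be sketched as follows. The example assumes that every frame inside a GOF depends, directly or indirectly, on the frames that precede it in the same GOF. This dependency model and the function names are assumptions made for illustration.

```python
def fixed_gofs(num_frames, gof_size):
    """Split the frame indices into consecutive GOFs of (at most)
    gof_size frames each."""
    return [list(range(start, min(start + gof_size, num_frames)))
            for start in range(0, num_frames, gof_size)]

def random_access_points(gofs):
    """First frame of each GOF: where a decoder can start (seek)."""
    return [gof[0] for gof in gofs]

def frames_affected_by_loss(gofs, lost_frame):
    """Frames that cannot be decoded correctly if `lost_frame` is lost,
    assuming errors propagate until the end of the current GOF."""
    for gof in gofs:
        if lost_frame in gof:
            return gof[gof.index(lost_frame):]
    return []

gofs = fixed_gofs(10, 4)
print(gofs)                              # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(random_access_points(gofs))        # [0, 4, 8]
print(frames_affected_by_loss(gofs, 5))  # [5, 6, 7]
```

A smaller GOF size gives more access points and shorter error bursts, at the price of removing less temporal redundancy.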

5 Block-based MC and RDO

The MC schemes used in most video coding standards compensate blocks of pixels [3]. In this context, depending on the block mode decision implemented in the RDO procedure⁸, blocks can be of different types: I (intra), P (predicted), B (bidirectionally predicted), and S (skipped) [2]. An I-block is used when we do not find enough temporal correlation between frames and, from an RD perspective, it is more advantageous to use intra-coding. When we find one or more reference blocks that provide a good prediction, we use predictive coding. Notice that the number of reference blocks can be higher than one, a number that is also controlled by RDO. In the intra-coding mode, all the frames are I-type, because otherwise we could not reset the propagation of errors. In Motion Compensated Temporal Filtering [1], the frames are I or B.
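
The block mode decision can be sketched as a cost minimization. The text does not specify the cost function, so the Lagrangian cost J = D + λR used below (a common formulation of RDO) is an assumption, and the distortion and rate figures are hypothetical.

```python
def choose_block_mode(candidates, lagrange_multiplier):
    """Return the block type with the lowest cost J = D + lambda * R.
    `candidates` maps a block type ('I', 'P', 'B' or 'S') to a
    (distortion, rate) pair measured or estimated by the encoder."""
    def cost(mode):
        distortion, rate = candidates[mode]
        return distortion + lagrange_multiplier * rate
    return min(candidates, key=cost)

# Hypothetical (distortion, rate in bits) figures for one block.
candidates = {
    "I": (10.0, 120),  # intra: accurate but expensive
    "P": (12.0, 40),   # one reference block
    "B": (11.0, 55),   # two reference blocks
    "S": (60.0, 1),    # skipped: almost free, high distortion
}

print(choose_block_mode(candidates, 0.01))  # I
print(choose_block_mode(candidates, 0.5))   # P
print(choose_block_mode(candidates, 2.0))   # S
```

A small multiplier favors quality (here, the I-block), while a large one favors a low rate (here, the skipped block). This is how the same procedure can yield different block types for the same content.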

6 Frame types

Depending on the type of blocks used in the frames, we have different types of frames: I, P, and B [2]. In an I-frame, all blocks are I-type. In a P-frame, I- and P-type blocks can be found. In a B-frame, all types of blocks can be used.
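
The classification above can be written as a small function that returns the most restrictive frame type able to hold a given set of blocks. It follows the rules literally as stated here (for example, skipped blocks are only accepted in B-frames), which may differ from what a particular standard allows. The name `frame_type` is an assumption.

```python
def frame_type(block_types):
    """Most restrictive frame type ('I', 'P' or 'B') whose allowed
    block types include every block in `block_types`."""
    kinds = set(block_types)
    if not kinds:
        raise ValueError("a frame must contain at least one block")
    if not kinds <= {"I", "P", "B", "S"}:
        raise ValueError("unknown block type")
    if kinds <= {"I"}:
        return "I"   # only intra blocks
    if kinds <= {"I", "P"}:
        return "P"   # intra and predicted blocks
    return "B"       # any block type

print(frame_type(["I", "I", "I"]))       # I
print(frame_type(["I", "P", "P"]))       # P
print(frame_type(["P", "B", "S", "I"]))  # B
```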

7 Resources

8 To-Do

9 References

[1] V. González-Ruiz. Motion Compensated Temporal Filtering (MCTF).

[2] V. González-Ruiz. Motion Compensation.

[3] V. González-Ruiz. Motion Estimation.

¹With pixels or coefficients, depending on the current domain in which the frame has been represented.

²The better the prediction, the lower the entropy of the residuals.

³In order to make a reversible process.

⁴At least, when we compensate in the image domain.

⁵In most situations, determining the true motion of the objects in a real scene is an ill-posed problem, because it is impossible to find it using only a sequence of 2D images. The situation is different when we use at least 2 cameras.

⁶Some standards also use GOP (Group Of Pictures).

⁷Specifically, constant bit-rate encodings.

⁸Obviously, the part of the RDO procedure that controls the block-type.