In general, neighboring frames (or images) in a (video) sequence exhibit a high degree of temporal correlation, which can be exploited to significantly improve the RD (rate-distortion) curves. This correlation produces temporal redundancy that can be removed using a (temporal) transform.
A temporal transform inputs two or more frames¹ and outputs at least one residual (frame) whose pixels have a higher dynamic range but, in general, also a lower entropy (see the spatial transform theory).
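As a rough illustration of this trade-off between dynamic range and entropy, the following sketch uses the simplest possible temporal transform, a plain difference between two consecutive frames. The two "frames" are synthetic and flattened to 1D, and `entropy` is a hypothetical helper introduced only for this example; a real temporal transform would normally be motion compensated.

```python
import math
from collections import Counter

def entropy(samples):
    """Zero-order entropy of a sequence of samples, in bits per sample."""
    counts = Counter(samples)
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Synthetic 8-bit "frame": each value in 0..255 appears exactly once.
frame_0 = [(17 * i) % 256 for i in range(256)]
# Next "frame": same content, slightly brighter at odd positions (clipped to 255).
frame_1 = [min(255, p + (i % 2)) for i, p in enumerate(frame_0)]

# Simplest temporal transform: difference between consecutive frames.
residual = [b - a for a, b in zip(frame_0, frame_1)]

print("entropy(frame_1)  =", entropy(frame_1))
print("entropy(residual) =", entropy(residual))
# A difference of two 8-bit samples can take any value in [-255, 255]
# (9 bits), even though here only a few residual values actually occur.
print("residual range    =", (min(residual), max(residual)))
```

The residual needs one more bit per sample in the worst case, but its values concentrate around zero, which is what an entropy coder can exploit.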
Most video coding standards use Motion Compensation (MC) to generate the residual frames [2]. MC exploits the temporal correlation and reduces the entropy of the residuals². Basically, MC consists of subtracting from each original frame a prediction (frame) built only with information that must also be available³ at the decoder. Notice that, after MC, the number of residual pixels is equal to the number of pixels in the compensated frame⁴.
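The subtraction and its inverse can be sketched as follows. This is a minimal example under the assumption that the prediction is simply the previous (reference) frame, that is, zero motion; the function names `mc_encode` and `mc_decode` and the pixel values are hypothetical.

```python
def mc_encode(frame, prediction):
    """Encoder side: residual = original frame - prediction."""
    return [[o - p for o, p in zip(row_o, row_p)]
            for row_o, row_p in zip(frame, prediction)]

def mc_decode(residual, prediction):
    """Decoder side: reconstruction = residual + prediction."""
    return [[r + p for r, p in zip(row_r, row_p)]
            for row_r, row_p in zip(residual, prediction)]

# Two tiny made-up frames (2 rows x 4 columns).
reference = [[10, 10, 80, 80],
             [10, 10, 80, 80]]
current = [[10, 12, 80, 79],
           [11, 10, 80, 80]]

# Assumption: the prediction is the reference frame itself (zero motion).
# The decoder already has the reference, so it can build the same prediction.
prediction = reference

residual = mc_encode(current, prediction)
reconstructed = mc_decode(residual, prediction)

print(residual)
print(reconstructed == current)
```

Because the decoder builds exactly the same prediction, adding it back to the residual recovers the original frame, and the residual has one sample per pixel of the compensated frame.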
To compensate the motion, we first need to estimate⁵ it using Motion Estimation (ME) techniques [3]. Using the motion fields generated by the motion estimator, both the encoder and the decoder generate the predictions, which the encoder subtracts from the original images and the decoder adds to the residuals [2]. However, notice that in most video coding standards ME is performed only by the encoder because it is a costly operation; for this reason, the motion vector fields must be transmitted to the decoder. This follows the idea of “compress once, decompress many”.
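One common way to estimate a block-based motion field is full-search block matching with a sum-of-absolute-differences (SAD) criterion. The sketch below assumes that technique, which is one option among many and is not necessarily the method of any particular standard. All names (`sad`, `estimate_motion`, `predict`), the block size, the search range and the test frames are hypothetical, and the frame dimensions are assumed to be multiples of the block size.

```python
def sad(cur, ref, by, bx, dy, dx, bs):
    """Sum of absolute differences between a block of `cur` at (by, bx)
    and the block of `ref` displaced by (dy, dx)."""
    return sum(abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
               for y in range(bs) for x in range(bs))

def estimate_motion(cur, ref, bs=2, search=1):
    """Encoder only: full-search block matching, one (dy, dx) per block."""
    h, w = len(cur), len(cur[0])
    field = {}
    for by in range(0, h, bs):
        for bx in range(0, w, bs):
            best = None
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    # Skip candidates that fall outside the reference frame.
                    inside = (0 <= by + dy and by + dy + bs <= h and
                              0 <= bx + dx and bx + dx + bs <= w)
                    if not inside:
                        continue
                    cost = sad(cur, ref, by, bx, dy, dx, bs)
                    if best is None or cost < best[0]:
                        best = (cost, (dy, dx))
            field[(by, bx)] = best[1]
    return field

def predict(ref, field, bs=2):
    """Encoder and decoder: build the prediction from `ref` and the field."""
    pred = [[0] * len(ref[0]) for _ in range(len(ref))]
    for (by, bx), (dy, dx) in field.items():
        for y in range(bs):
            for x in range(bs):
                pred[by + y][bx + x] = ref[by + dy + y][bx + dx + x]
    return pred

# Made-up 4x6 reference frame, and a current frame that is the reference
# shifted one pixel to the right (left border replicated).
ref = [[10 * y + x * x for x in range(6)] for y in range(4)]
cur = [[ref[y][max(x - 1, 0)] for x in range(6)] for y in range(4)]

field = estimate_motion(cur, ref)   # computed once, at the encoder
pred = predict(ref, field)          # computable at both sides
residual = [[c - p for c, p in zip(rc, rp)] for rc, rp in zip(cur, pred)]
print(field)
print(residual)
```

Only `estimate_motion` performs the costly search; a decoder that receives `field` along with the residual only needs `predict` and an addition.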
The RD performance of ME/MC depends on the amount of temporal redundancy in the sequence. If this amount is low, it can be more RD-efficient to interrupt the (ME/)MC process. The set of consecutive frames in which MC is active is usually known as a GOF⁶ (Group of Frames). Notice that (from the RD point of view) the length of the GOFs is variable and, therefore, the GOF partition should be an adaptive process controlled by an RDO (rate-distortion optimization) algorithm.
However, in some contexts⁷ it may be necessary to use a fixed GOF partition [2]. For example, if we want to let users move fast forward or backward through the sequence, we need to set a maximum GOF size. Another reason to use a maximum GOF size is to limit the propagation of decoding errors (for example, when some data has not been received during a streaming session). When a new GOF starts, the propagation of such errors stops.
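The two criteria discussed above (stop MC when it does not pay off, and never exceed a maximum GOF size) can be combined in a simple greedy rule. The sketch below is only an illustration: `partition_gofs` is a hypothetical function, the per-frame costs are made-up numbers standing in for whatever RD cost an actual RDO algorithm would compute, and a real encoder could use a more elaborate (non-greedy) decision.

```python
def partition_gofs(intra_cost, inter_cost, max_gof_size):
    """Greedy GOF partition.

    intra_cost[i]: hypothetical RD cost of coding frame i without MC.
    inter_cost[i]: hypothetical RD cost of coding frame i with MC
                   (index 0 is never read: the first frame has no reference).
    Returns the indices of the frames that start a new GOF.
    """
    starts = [0]  # the first frame always opens a GOF
    for i in range(1, len(intra_cost)):
        too_long = i - starts[-1] >= max_gof_size
        mc_not_worth_it = intra_cost[i] <= inter_cost[i]
        if too_long or mc_not_worth_it:
            starts.append(i)
    return starts

# Made-up costs; frame 3 behaves like a scene change (MC is very costly).
intra = [100, 100, 100, 100, 100, 100, 100, 100]
inter = [None, 20, 30, 150, 25, 20, 30, 35]

print(partition_gofs(intra, inter, max_gof_size=4))
print(partition_gofs(intra, inter, max_gof_size=100))
```

With a small `max_gof_size`, a new GOF is forced even when MC is still efficient, which bounds both the random-access delay and the span over which a decoding error can propagate.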
The MC schemes used in most video coding standards compensate blocks of pixels [3]. In this context, depending on the block mode decision implemented in the RDO procedure⁸, blocks can be of different types: I (intra), P (predicted), B (bidirectionally predicted) and S (skipped) [2]. An I-block is used when we do not find enough temporal correlation between frames and, from an RD perspective, it is more advantageous to use intra-coding. When we find one or more reference blocks that provide a good prediction, we use predictive coding. Notice that the number of reference blocks can be higher than two, a number that is also controlled by RDO.
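A block mode decision of this kind is often formulated as the minimization of a Lagrangian cost J = D + λR, where D is the distortion and R the rate of each candidate mode. The sketch below assumes that formulation (the text does not prescribe one); `choose_block_mode` is a hypothetical function and the distortion and rate figures are invented for illustration.

```python
def choose_block_mode(candidates, lam):
    """Return the mode with the minimum Lagrangian cost J = D + lam * R.

    candidates: dict mapping a mode name to (distortion, rate_in_bits).
    lam: Lagrange multiplier trading distortion against rate.
    """
    return min(candidates,
               key=lambda mode: candidates[mode][0] + lam * candidates[mode][1])

# Made-up (distortion, rate) pairs for one block.
candidates = {
    "I": (4.0, 96),   # intra: accurate but expensive
    "P": (5.0, 40),   # one reference block
    "B": (4.5, 52),   # two reference blocks
    "S": (30.0, 1),   # skipped: almost free, but high distortion
}

for lam in (0.0, 0.1, 1.0):
    print(lam, choose_block_mode(candidates, lam))
```

A small λ favors quality (here the I mode), while a large λ favors low rate (here the S mode), so the same block may be coded differently depending on the target operating point.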
Depending on the types of blocks used in the frames, we have different types of frames: I, P, and B [2]. For example, in the intra-coding mode all the frames are I-type, because otherwise we could not reset the propagation of errors. In Motion Compensated Temporal Filtering [1], the frames are I or B.
[1] V. González-Ruiz. Motion Compensated Temporal Filtering (MCTF).
[2] V. González-Ruiz. Motion Compensation.
[3] V. González-Ruiz. Motion Estimation.
¹ With pixels or coefficients, depending on the domain in which the frame is currently represented.
² The better the prediction, the lower the entropy of the residuals.
³ In order to make the process reversible.
⁴ At least, when we compensate in the image domain.
⁵ In most situations, determining the true motion of the objects in a real scene is an ill-posed problem because it is impossible to find it using only a sequence of 2D images. The situation is different when we use at least 2 cameras.
⁶ Some standards also use GOP (Group Of Pictures).
⁷ Specifically, constant bit-rate encodings.
⁸ That is, the part of the RDO procedure that controls the block type.