In general, neighboring frames (or images) in (video) sequences exhibit a high degree of temporal correlation that can be exploited to significantly improve the rate-distortion (RD) curves. This correlation produces a temporal redundancy that can be removed using (temporal) decorrelating techniques. Such techniques can be considered a special type of transform applied to the temporal domain: temporal transforms.
A temporal transform inputs two or more frames¹ and outputs at least one residual (frame) in which the residual pixels have a higher dynamic range than the original ones (the difference of two 8-bit pixels, for example, lies in [-255, 255]) but, in general, also a high energy concentration.
Most video coding standards use Motion Compensation (MC) to generate the residual frames [3]. MC exploits the temporal correlation and reduces the entropy of the residuals². Basically, MC consists of subtracting from each original frame a prediction (frame) built with information that must also be available³ at the decoder. Notice that, after using MC, the number of pixels in the residual frame is equal to the number of pixels in the compensated frame⁴.
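The subtraction performed by MC, and its inverse at the decoder, can be sketched as follows. This is only a minimal illustration, assuming 8-bit grayscale frames stored as lists of rows and using the previous frame as the prediction (that is, no motion is actually compensated). The names `residual_frame`, `reconstruct_frame` and `entropy`, as well as the two small frames, are hypothetical and not taken from any standard.

```python
import math
from collections import Counter

def residual_frame(frame, prediction):
    """Encoder side: subtract the prediction from the original frame."""
    return [[o - p for o, p in zip(orow, prow)]
            for orow, prow in zip(frame, prediction)]

def reconstruct_frame(residual, prediction):
    """Decoder side: add the same prediction back to the residual."""
    return [[r + p for r, p in zip(rrow, prow)]
            for rrow, prow in zip(residual, prediction)]

def entropy(frame):
    """Zero-order entropy, in bits per pixel, of the values in a frame."""
    values = [v for row in frame for v in row]
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Two hypothetical 4x4 frames: the second differs from the first in one pixel.
previous = [[10, 20, 30, 40],
            [50, 60, 70, 80],
            [90, 100, 110, 120],
            [130, 140, 150, 160]]
current = [row[:] for row in previous]
current[1][2] += 3

residual = residual_frame(current, previous)
reconstructed = reconstruct_frame(residual, previous)

print(reconstructed == current)              # the process is reversible
print(len(residual) * len(residual[0]))      # same number of pixels as the frame
print(entropy(current), entropy(residual))   # the residual has the lower entropy here
```

The reconstruction only works because the decoder uses exactly the same prediction as the encoder, which is why the prediction must be built from information available at both ends.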
To compensate the motion, we first need to estimate⁵ it using Motion Estimation (ME) techniques [4]. Using the motion fields generated by the motion estimator (and, obviously, the pixel data), both the encoder and the decoder generate the predictions, which are subtracted from the original frames (by the encoder) or added to the residual frames (by the decoder) [3]. However, notice that in most video coding standards ME is performed only by the encoder because it is a costly operation and, for this reason, the motion vector fields must be transmitted to the decoder. This responds to the idea of “compress once, decompress many”.
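As an illustration of the split between ME (encoder only) and prediction (encoder and decoder), the following sketch uses full-search block matching with the sum of absolute differences (SAD) as the matching criterion. Block matching with SAD is just one possible ME technique, chosen here for brevity. The function names, the block size, the search range and the test frames are all assumptions, and the frame dimensions are assumed to be multiples of the block size.

```python
def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def get_block(frame, y, x, size):
    """Extract the size x size block whose top-left corner is (y, x)."""
    return [row[x:x + size] for row in frame[y:y + size]]

def estimate_motion(current, reference, size=2, search=1):
    """Encoder only: for each block of `current`, find the displacement
    (dy, dx) into `reference` that minimizes the SAD (full search)."""
    h, w = len(current), len(current[0])
    field = {}
    for y in range(0, h, size):
        for x in range(0, w, size):
            target = get_block(current, y, x, size)
            best = None
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    ry, rx = y + dy, x + dx
                    # Skip candidates that fall outside the reference frame.
                    if ry < 0 or rx < 0 or ry + size > h or rx + size > w:
                        continue
                    cost = sad(target, get_block(reference, ry, rx, size))
                    if best is None or cost < best[0]:
                        best = (cost, (dy, dx))
            field[(y, x)] = best[1]
    return field

def predict(reference, field, size=2):
    """Encoder and decoder: build the prediction from the reference frame
    and the (transmitted) motion field."""
    h, w = len(reference), len(reference[0])
    prediction = [[0] * w for _ in range(h)]
    for (y, x), (dy, dx) in field.items():
        block = get_block(reference, y + dy, x + dx, size)
        for i in range(size):
            prediction[y + i][x:x + size] = block[i]
    return prediction

# Hypothetical frames: a bright 2x2 object moves from the center to the top-left corner.
reference = [[0, 0, 0, 0],
             [0, 9, 9, 0],
             [0, 9, 9, 0],
             [0, 0, 0, 0]]
current = [[9, 9, 0, 0],
           [9, 9, 0, 0],
           [0, 0, 0, 0],
           [0, 0, 0, 0]]

field = estimate_motion(current, reference)   # costly step, run by the encoder
prediction = predict(reference, field)        # cheap step, run at both ends
print(field[(0, 0)])
print(sad(current, prediction), sad(current, reference))
```

Only `estimate_motion` contains the nested search loops. The decoder runs just `predict`, using the motion field it receives from the encoder.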
The RD performance of ME/MC depends on the amount of temporal redundancy in the sequence. If this amount is low, it can be more RD-efficient to interrupt the (ME/)MC process. The set of consecutive frames in which MC is active is usually known as a GOF⁶ (Group of Frames). Notice that, from the RD point of view, the GOF partitioning (the length of the GOFs) should be an adaptive process controlled by an RD control algorithm.
However, in some contexts⁷ it may be necessary to use a fixed GOF partition [3]. For example, if we want to let users move fast forward or backward along the sequence, we need to set a given GOF size. Another reason to use a constant GOF size is to limit the propagation of decoding errors (for example, because some data has not been received in a streaming session). When a new GOF starts, the propagation of such errors stops.
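The way a new GOF stops the propagation of errors can be seen in a toy simulation. The sketch below makes strong simplifying assumptions: each "frame" is a single integer, the first frame of each GOF is sent as is, every other frame is sent as the difference from the previous one, and a lost packet is concealed by repeating the previous reconstruction. The functions `encode` and `decode` and the packet layout are hypothetical.

```python
def encode(frames, gof_size):
    """Code the first frame of each GOF as is (intra) and every other frame
    as the difference from the previous frame (zero-motion prediction)."""
    stream = []
    for i, frame in enumerate(frames):
        if i % gof_size == 0:
            stream.append(("I", frame))
        else:
            stream.append(("P", frame - frames[i - 1]))
    return stream

def decode(stream):
    """Rebuild the frames; a lost packet (None payload) is concealed by
    repeating the previous reconstruction."""
    frames, previous = [], 0
    for kind, payload in stream:
        if payload is None:
            current = previous
        elif kind == "I":
            current = payload
        else:
            current = previous + payload
        frames.append(current)
        previous = current
    return frames

frames = [10, 12, 15, 19, 24, 30, 37, 45]    # hypothetical one-pixel "frames"
stream = encode(frames, gof_size=4)
stream[1] = ("P", None)                      # simulate a packet lost in the first GOF
decoded = decode(stream)
errors = [d - f for d, f in zip(decoded, frames)]
print(errors)
```

In this run the error introduced by the lost packet affects the remaining frames of the first GOF only. The intra-coded frame that opens the second GOF does not depend on earlier frames, so decoding is correct from there on.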
The MC schemes used in most video coding standards compensate blocks of pixels [4]. In this context, depending on the block-mode decision implemented in the RDO (rate-distortion optimization) procedure⁸, blocks can be of different types: I (intra), P (predicted), B (bidirectionally predicted) and S (skipped) [3]. An I-block is used when, for this block, we do not find enough temporal correlation between frames and, from an RD perspective, it is more advantageous to use intra-coding. When we find one (P) or more (B) reference blocks that provide a good prediction, we use predictive coding. Notice that the number of reference blocks can be higher than one, a number that should also be controlled by RDO. In the intra-coding mode, all the frames are I-type because otherwise we could not reset the propagation of errors. In Motion Compensated Temporal Filtering [2], the frames can also be I, P or B.
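A very simplified block-mode decision could look like the sketch below. It is not the RDO procedure of any standard: it assumes lossless coding (so only the rate matters), a single candidate reference block, and a rate estimated from the zero-order entropy of the data, ignoring the cost of motion vectors and mode signaling. The names `estimated_bits` and `choose_mode` are hypothetical.

```python
import math
from collections import Counter

def estimated_bits(block):
    """Zero-order entropy of the block values times the number of values:
    a rough estimate of the bits an entropy codec would need."""
    values = [v for row in block for v in row]
    n = len(values)
    counts = Counter(values)
    return -sum(c * math.log2(c / n) for c in counts.values())

def choose_mode(block, reference_block):
    """Return 'S', 'P' or 'I' for a block, given one candidate reference block."""
    residual = [[b - r for b, r in zip(brow, rrow)]
                for brow, rrow in zip(block, reference_block)]
    if all(v == 0 for row in residual for v in row):
        return "S"   # perfect prediction: no residual has to be sent
    # Predictive coding pays off only if the residual is cheaper than the block.
    return "P" if estimated_bits(residual) < estimated_bits(block) else "I"

print(choose_mode([[1, 2], [3, 4]], [[1, 2], [3, 4]]))   # identical blocks
print(choose_mode([[1, 2], [3, 4]], [[1, 2], [3, 5]]))   # good prediction
print(choose_mode([[5, 5], [5, 5]], [[1, 9], [3, 7]]))   # poor prediction
```

A B-type decision would follow the same pattern with a prediction built from more than one reference block. A lossy codec would also have to weigh the distortion of each mode against its rate.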
Depending on the type of blocks used in the frames, we have different types of frames: I, P, and B [3]. In an I-frame, all blocks are I-type. In a P-frame, I- and P-type blocks can be found. In a B-frame, all types of blocks can be used.
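The rule above can be written as a small function. This is a literal transcription of the three cases, with a hypothetical name `frame_type` and block types given as one-letter strings. Any frame containing a block type other than I or P is classified here as a B-frame, which is an assumption about how S-blocks are counted.

```python
def frame_type(block_types):
    """Classify a frame from the types of the blocks it contains."""
    kinds = set(block_types)
    if kinds <= {"I"}:
        return "I"          # only intra blocks
    if kinds <= {"I", "P"}:
        return "P"          # intra and predicted blocks
    return "B"              # any type of block may appear

print(frame_type(["I", "I", "I"]))
print(frame_type(["I", "P", "P"]))
print(frame_type(["I", "P", "B"]))
```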
[1] Gunnar Farnebäck. Polynomial Expansion for Orientation and Motion Estimation. PhD thesis, Linköping University, Sweden, SE-581 83 Linköping, Sweden, 2002. Dissertation No 790, ISBN 91-7373-475-6.
[2] V. González-Ruiz. Motion Compensated Temporal Filtering (MCTF).
[3] V. González-Ruiz. Motion Compensation.
[4] V. González-Ruiz. Motion Estimation.
[5] Samson Joshua Timoner. Subpixel motion estimation from sequences of video images. PhD thesis, Massachusetts Institute of Technology, 2000.
¹ With pixels or coefficients, depending on the domain in which the frame is represented.
² The better the prediction, the lower the entropy of the residuals.
³ In order to make the process reversible.
⁴ At least, when we compensate in the image domain.
⁵ In most situations, determining the true motion of the objects in a real scene is an ill-posed problem because it is impossible to find it using only a sequence of 2D images. The situation is different when we use at least 2 cameras.
⁶ Some standards also use GOP (Group Of Pictures).
⁷ Specifically, constant bit-rate encodings.
⁸ The part of the RDO procedure that controls the block type.
⁹ Notice that we can estimate the number of bits that a block requires when it is encoded with a 0-order context-based arithmetic codec by using the entropy of the data. If there are several contexts, the entropy can also be computed per context.