Temporal Transforms

Vicente González Ruiz - Depto Informática - UAL

March 9, 2026

Contents

1 Temporal correlation
2 Motion Compensation (MC)
3 Motion Estimation (ME)
4 GOF-ing
5 Block-based MC and Rate/Distortion Optimization (RDO)
6 Frame types
7 Resources
8 To-Do
9 References

1 Temporal correlation

In general, neighboring frames (or images) in video sequences exhibit a high degree of temporal correlation that can be exploited to significantly improve the RD curves. This correlation produces a temporal redundancy that can be removed using (temporal) decorrelating techniques. Such techniques can be considered a special type of transform applied in the temporal domain: temporal transforms.

A temporal transform inputs two or more frames1 and outputs at least one residual (frame) in which the pixels have a higher dynamic range but, in general, also a high energy concentration.
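The simplest temporal transform is a plain frame difference. The following sketch (with synthetic frame contents) shows why the residual pixels need a higher dynamic range, and how the residual energy concentrates in the few pixels that changed:

```python
import numpy as np

def temporal_residual(frame, reference):
    # Work in a signed type: residuals of 8-bit frames live in
    # [-255, 255], a higher dynamic range than the original pixels.
    return frame.astype(np.int16) - reference.astype(np.int16)

reference = np.full((4, 4), 100, dtype=np.uint8)
frame = reference.copy()
frame[1, 1] = 120                      # only one pixel changed
residual = temporal_residual(frame, reference)
# Most residual pixels are 0: the energy concentrates in few values.
```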

2 Motion Compensation (MC)

Most video coding standards use Motion Compensation (MC) to generate the residual frames [3]. MC exploits the temporal correlation and reduces the entropy of the residuals2. Basically, MC consists of subtracting from each original frame a prediction (frame) built with information that must also be available3 at the decoder. Notice that, after using MC, the number of pixels in the residual frame is equal to the number of pixels in the compensated frame.4
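A minimal sketch of this idea, using a single global motion vector and toy helper names (a real codec compensates per block and handles borders more carefully than the np.roll stand-in used here):

```python
import numpy as np

def predict(reference, mv):
    # Build the prediction by displacing the reference frame by mv.
    # np.roll is a toy stand-in for real border handling.
    dy, dx = mv
    return np.roll(np.roll(reference, dy, axis=0), dx, axis=1)

def compensate(frame, reference, mv):
    prediction = predict(reference, mv)
    residual = frame.astype(np.int16) - prediction.astype(np.int16)
    return prediction, residual

reference = np.zeros((8, 8), dtype=np.uint8)
reference[2, 2] = 255                  # a bright "object"
frame = np.zeros((8, 8), dtype=np.uint8)
frame[3, 4] = 255                      # the object moved by (1, 2)
prediction, residual = compensate(frame, reference, (1, 2))
# With the true motion vector the prediction is exact, and the
# residual has as many pixels as the compensated frame.
```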

3 Motion Estimation (ME)

To compensate the motion we first need to estimate5 it using Motion Estimation (ME) techniques [4]. Using the motion fields generated by the motion estimator (and, obviously, the pixel data), both the encoder and the decoder generate the predictions, which are subtracted (by the encoder) from the original frames and added back (by the decoder) to the residuals [3]. However, notice that in most video coding standards ME is performed only by the encoder because it is a costly operation, and for this reason the motion vector fields must be transmitted to the decoder. This responds to the idea of “compress once, decompress many times”.
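One of the estimation techniques listed in the Resources section is full-search block matching. A self-contained sketch for a single block (names are illustrative): every displacement inside a search window is tried, and the one minimizing the sum of absolute differences (SAD) is kept.

```python
import numpy as np

def full_search(block, reference, top_left, search_range):
    # Exhaustively test every displacement (dy, dx) of the block
    # inside the reference frame and keep the best SAD match.
    y0, x0 = top_left
    bh, bw = block.shape
    best_mv, best_sad = (0, 0), float("inf")
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = y0 + dy, x0 + dx
            if y < 0 or x < 0 or y + bh > reference.shape[0] \
                    or x + bw > reference.shape[1]:
                continue  # candidate falls outside the reference
            candidate = reference[y:y + bh, x:x + bw]
            sad = np.abs(block.astype(int) - candidate.astype(int)).sum()
            if sad < best_sad:
                best_mv, best_sad = (dy, dx), sad
    return best_mv, best_sad

reference = np.zeros((16, 16), dtype=np.uint8)
reference[6:10, 6:10] = 200            # object in the reference frame
frame = np.zeros((16, 16), dtype=np.uint8)
frame[8:12, 5:9] = 200                 # same object, displaced
block = frame[8:12, 5:9]               # the block to predict
mv, sad = full_search(block, reference, (8, 5), search_range=4)
```

The quadratic cost of this double loop is precisely why ME is performed only by the encoder.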

4 GOF-ing

The RD performance of ME/MC depends on the amount of temporal redundancy in the sequence. If this amount is low, it can be more RD-efficient to interrupt the (ME/)MC process. The set of consecutive frames in which MC is active is usually known as a GOF6 (Group of Frames). Notice that, under the RD prism, the GOF partitioning (the length of the GOFs) should be an adaptive process controlled by an RD control algorithm.

However, in some contexts7 it may be necessary to use a fixed GOF partition [3]. For example, if we want to give users the option to move fast forward or backward along the sequence, we need to set some fixed GOF size. Another reason to use a constant GOF size is to limit the propagation of decoding errors (for example, because in a streaming session some data has not been received). When a new GOF starts, the propagation of such errors vanishes.
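The random-access and error-containment arguments can be made concrete with a toy fixed-GOF index (the GOF size below is an arbitrary choice for the sketch):

```python
GOF_SIZE = 8  # an arbitrary fixed GOF size for this illustration

def gof_start(frame_index, gof_size=GOF_SIZE):
    # First frame of the GOF containing frame_index: decoding can
    # (re)start here, which bounds error propagation and enables
    # seeking without decoding the whole sequence.
    return (frame_index // gof_size) * gof_size

def frames_to_decode_for_seek(frame_index, gof_size=GOF_SIZE):
    # Random-access cost: frames that must be decoded before
    # displaying frame_index.
    return frame_index - gof_start(frame_index, gof_size) + 1
```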

5 Block-based MC and Rate/Distortion Optimization (RDO)

The MC schemes used in most video coding standards compensate blocks of pixels [4]. In this context, depending on the block-decision mode implemented in the RDO procedure8, blocks can be of different types (I (intra), P (predicted), B (bidirectionally predicted), and S (skipped)) [3]. An I-block is used when, for that block, we do not find enough temporal correlation between frames and, from an RD perspective, it is more advantageous to use intra-coding. When we find one (P) or more (B) reference blocks that provide a good prediction, we are using predictive-coding. Notice that the number of reference blocks can be higher than one, a number that should also be controlled by RDO. In the intra-coding mode, all the frames are I-type because otherwise we could not stop the propagation of errors. In Motion Compensated Temporal Filtering [2], the frames can also be I, P, or B.
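A common way to implement this block-mode decision (a sketch, not the method of any particular standard) is a Lagrangian cost comparison: for each candidate type, estimate a distortion D and a rate R, and pick the mode minimizing J = D + λR. The candidate numbers below are invented for illustration.

```python
def best_mode(candidates, lam):
    # candidates: {mode_name: (distortion, rate_in_bits)}
    # Pick the mode minimizing the Lagrangian cost J = D + lam * R.
    return min(candidates,
               key=lambda m: candidates[m][0] + lam * candidates[m][1])

candidates = {
    "I": (10.0, 400),   # intra: low distortion, many bits
    "P": (12.0, 120),   # one reference block
    "B": (11.0, 150),   # two reference blocks
    "S": (40.0, 1),     # skipped: almost free, but high distortion
}
# A small lambda favors quality (distortion); a large one favors rate.
```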

6 Frame types

Depending on the type of blocks used in the frames, we have different types of frames: I, P, and B [3]. In an I-frame, all blocks are I-type. In a P-frame, I- and P-type blocks can be found. In a B-frame, all types of blocks can be used.

7 Resources

  1. IPP_DCT.ipynb in VCF.
  2. MCTF.ipynb in VCF.
  3. Full search block-based ME (Motion Estimation).
  4. Full search dense (1x1) ME.
  5. Farnebäck’s motion estimation.
  6. Introducing the Low-delay (IPP...) Mode.
  7. Multi-Resolution Video Coding (MRVC).

8 To-Do

  1. The module used in IPP_DCT.ipynb should determine the block type (I, P, or B) using the number of bits that the block actually requires after quantization (currently, a heuristic based on the variance of the quantized block is implemented). Notice that this implies that the blocks must be entropy encoded independently (the codec must operate at the block level, not at the image level). For this, arithmetic coding is highly recommended9.
  2. Both IPP_DCT.ipynb and MCTF.ipynb estimate the motion at the frame resolution. Better predictions can be generated using sub-pixel motion estimation [5].
  3. Both IPP_DCT.ipynb and MCTF.ipynb first load the complete video into memory to process it. This is a problem when the videos are long or have a high resolution. Work at the frame level. Disks are fast enough.
  4. In the current implementation of MCTF.ipynb the reference frames have the maximum quality when encoding, and this is not the case when the video is decoded (obviously, if the QSS is larger than 1). To avoid the drift error along temporal resolution levels, use at the encoder the same references that the decoder will use. Notice that this implies that the quantization step size must be known at compression time.
  5. Create a video codec similar to the one used in IPP_DCT.ipynb, but using an “IBP...” scheme. In this case, the predicted (B) images use more than one reference image [3]. For now, use only the adjacent frames as references.
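The bit estimate suggested in item 1 can be sketched as follows: the 0-order entropy of the quantized block, in bits per symbol, multiplied by the number of symbols, approximates the length of a 0-order arithmetic code for the block (helper name is illustrative).

```python
import numpy as np

def estimated_bits(block):
    # 0-order entropy of the block's symbols, times the block size,
    # approximates its arithmetic-coded length in bits.
    values, counts = np.unique(block, return_counts=True)
    p = counts / counts.sum()
    entropy = -(p * np.log2(p)).sum()      # bits per symbol
    return entropy * block.size
```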

9 References

[1]   Gunnar Farnebäck. Polynomial Expansion for Orientation and Motion Estimation. PhD thesis, Linköping University, Sweden, SE-581 83 Linköping, Sweden, 2002. Dissertation No 790, ISBN 91-7373-475-6.

[2]   V. González-Ruiz. Motion Compensated Temporal Filtering (MCTF).

[3]   V. González-Ruiz. Motion Compensation.

[4]   V. González-Ruiz. Motion Estimation.

[5]   Samson Joshua Timoner. Subpixel motion estimation from sequences of video images. PhD thesis, Massachusetts Institute of Technology, 2000.

1With pixels or coefficients, depending on the current domain in which the frame has been represented.

2The better the prediction, the lower the entropy of the residuals.

3In order to make a reversible process.

4At least, when we compensate in the image domain.

5In most situations, the determination of the true motion of the objects in a real scene is an ill-posed problem because it is impossible to find it using only a sequence of 2D images. A different situation arises when we use at least two cameras.

6Some standards also use GOP (Group Of Pictures).

7Specifically, constant bit-rate encodings.

8The part of the RDO procedure that controls the block-type.

9Notice that we can estimate the number of bits that a block will require when we encode it with a 0-order context-based arithmetic codec using the entropy of the data. If we have several contexts, the entropy can also be computed per context.