Historically, video was principally professionally made, using expensive (and heavy) cameras, usually sitting on expensive and very stable tripods. One effect of this was to bias the video content itself to contain little camera movement, with relatively little camera acceleration and almost no camera rotation. Traditional video codecs, such as MPEG, meet the need for streaming such content.
The arrival of camera phones is changing the way video is made. The cameras themselves now have relatively low mass compared with traditional cameras, and in general use are often moved erratically, with pans, rotation and, in the case of selfies in particular, rapid zooms. In addition, the relatively small lenses restrict the light which can reach the sensor, making the source content often noisier than on a professional camera. Furthermore, despite all these additional complications with the source content, mobile devices have limited and sometimes expensive bandwidth and limited processing power. This invention is a video codec designed to address, amongst other things, issues relating to real time capture, compression, sharing and playback of video over limited bandwidth on a modern mobile phone.
For the purposes of this patent, I have called the invention described herein “Blackbird 8”.
The present invention, in accordance with a first aspect is directed to a method of encoding a series of frames in a video or media, including receiving a first key frame, receiving subsequent chunks of frames including at least one key frame, dividing each frame into a plurality of blocks, subdividing a first block of the plurality of blocks into a plurality of pixel groups, and averaging the pixels in each pixel group to generate a single value. The method further includes creating a first mini-block wherein each element of the first mini block corresponds with a pixel group of the corresponding first block and contains the single value, and repeating for each block of each frame of the chunk, and comparing a first of said plurality of mini blocks of a first frame with mini blocks of a second frame, where said second frame mini blocks are not necessarily aligned to mini blocks in the first frame, until a best match is achieved.
Preferably the method further comprises the step of generating a motion vector to map said first mini block of the first frame to the best match mini block of the second frame.
Preferably the method further comprises generating respective motion vectors to map respective mini blocks of a first frame onto respective achieved best matches of a second frame.
Preferably the achieving a best match relies on an error function.
Preferably the error function includes utilising mean square error or errors between pixel values from the first frame and pixel values from the second frame.
Preferably the first frame is a temporally earlier frame than the second frame of the chunk of frames.
Preferably the first frame is a temporally nearest earlier frame to the second frame of said chunk of frames.
Preferably the first frame is a temporally later frame than the second frame of the chunk of frames.
Preferably the first frame is a temporally nearest later frame to the second frame of the chunk of frames.
Preferably the motion vectors are based on a combination of a temporally earlier and a temporally later frame to the second frame of said chunk of frames.
Preferably separate motion vectors are based on a temporally nearest earlier frame and a temporally nearest later frame to the second frame of the chunk of frames.
Preferably the method further comprises recalculating each mini block of the first frame to be an integer multiple larger, to provide a larger mini block, and comparing each larger mini block of the first frame with the comparisons of larger mini blocks starting in the vicinity of the best match mini block of the second frame, not necessarily aligned to tiled larger block positions, to find a better best match.
Preferably the recalculating step is repeated until the mini block size is the same as the block size, and each pixel being compared is an original pixel not an averaged pixel.
Preferably the step of comparing a first of said plurality of mini blocks of a first frame with respective mini blocks of a second frame includes comparing the first mini block with a mini block in a corresponding position in a second frame, then comparing the first mini block with a block in the second frame with a displacement where both the horizontal and vertical components of the displacement are integer numbers of pixels from the mini block in a corresponding position in the second frame.
Preferably the step of comparing a first of the plurality of mini blocks of a first frame with respective mini blocks of a second frame includes comparing the first mini block with a mini block in a corresponding position in a second frame, then comparing the first mini block with a block in the second frame with a displacement where both the horizontal and vertical components of the displacement are integer numbers of pixels from the mini block in a corresponding position in the second frame.
Preferably the integer number of pixels is one pixel.
Preferably the integer number of pixels is a plurality of pixels.
Preferably the pixel displacement is in a vertical direction. Preferably the pixel displacement is in a horizontal direction. Preferably the pixel displacements are in a horizontal and a vertical direction.
Preferably encoding the chunk of frames includes providing a key frame and motion vectors derived from the comparisons.
Preferably encoding the chunk of frames includes providing block information.
Preferably the block information includes block size.
Preferably the block information indicates the block type, such as whether the first frame is an earlier frame or a later frame than the second frame, or whether both an earlier frame and a later frame are used to generate separate vectors.
Preferably the comparing step is adapted to cover remote matches.
Preferably the encoding method includes an error function to estimate errors over the blocks.
Preferably the method includes filtering, but said filtering steps are carried out after the motion vectors are calculated.
Preferably the number of blocks is sufficient to ensure each block has a match on an earlier or later frame, or both.
Preferably all the blocks in a frame are the same size.
Preferably the block size is a divisor of the image size of the frame.
Preferably the blocks are sized to tile the respective frame.
Preferably all the blocks in all the frames in a chunk are the same size.
Preferably estimates of motion vectors for each frame where the binary number of the frame ends in a ‘1’ are generated in a first pass at a first magnification from raw data of said frame and from temporally adjacent, neighbouring frames, and estimates of motion vectors for each frame where the binary number of the frame ends in ‘10’ are generated at a second, less granular, magnification from said earlier motion vector estimates, and estimates for each frame where the binary number of the frame ends in ‘100’, ‘1000’, ‘10000’, and so on, this list of frame numbers continuing until the last power of two which is not a key frame, are generated at further, less granular, magnifications from earlier motion vector estimates until a final motion estimate for the frame mid-way between the first and last frame depends on earlier estimates, such that each motion vector estimate is based on values calculated from neighbouring estimates.
Preferably the method includes the further step of, before the frames of the chunk are encoded for transmission, reconstructing the video from the compressed chunk by applying the motion vectors thereto; comparing the result to the frames as they will be decompressed, adjusting the motion vectors to obtain a best result, and replacing the original frames to be compressed with the resulting decompressed frames to be used as this process continues.
Preferably the method further includes the step of transmitting the encoded chunk, with at least one key frame and motion vector transmitted to the decoder for the decoder to reconstruct the chunk.
Preferably the method further includes the step of adding additional data when encoding, selected from any number of: bias information, gaps to correction, and corrections to temporal estimates.
Preferably the method includes the further step of receiving a compressed key frame, and at least one subsequent compressed encoded data chunk, including motion vector data, and applying relevant motion vectors to either one or both said key frames, or using other encoded intra frame data, to reconstruct a first delta frame.
Preferably the method further comprises constructing a subsequent delta frame from said constructed delta frame and relevant motion vectors.
Preferably the data chunk relates to an integer number of frames, wherein the number of frames is divisible by 2.
Preferably subsequent delta frames are constructed at mid-points between already constructed delta frames.
The present invention, in accordance with a second aspect, is directed to an apparatus for decoding a received compressed video or other media data, the compressed data including at least two key frames and information identifying, for frames compressed in the compressed data, a block size and, for each block, a block type, the block type indicating whether a first frame to be reconstructed is based on zero, one or two of an earlier or later second frame in the compressed data, motion vectors and other data, comprising: means to receive the compressed data; means to apply the motion vectors to the compressed data to reconstruct frames.
Preferably the apparatus is further adapted to reconstruct delta frames from key frames and the motion vectors.
Preferably the apparatus is further adapted to reconstruct delta frames utilising additional data.
Preferably the apparatus further comprises means to apply the motion vectors to reconstructed delta frames to generate further delta frames.
Preferably the apparatus is adapted to reconstruct delta frames until all the frames have been reconstructed.
Preferably the means is adapted to receive the data key frames first, followed by motion vectors for constructing delta frames therefrom.
Preferably the apparatus is further adapted to receive compressed data chunks including an integer number of frames.
Preferably the integer is a power of 2.
Preferably the delta frames are reconstructed to generate a video or media stream.
The present invention, in accordance with a third aspect is directed to a media player including an encoder as set out above.
The present invention, in accordance with a fourth aspect is directed to a media player including a decoder as set out above.
The present invention, in accordance with a fifth aspect is directed to an apparatus for carrying out any of the steps set out above.
Preferred embodiments of the present invention will now be described by way of example only and with reference to the accompanying drawings in which:
The video codec has two parts—an encoder and a decoder. The decoder must be installed on all playback devices, so is not changed very often. The encoder, on the other hand, only has to be installed on the device making the recording. Provided the encoder stays compatible with the installed base of decoders, it can be changed at will, such as with each update to an app. In accordance with the present disclosure, the present invention (Blackbird 8) has a flexible bitstream format which allows refinements and variations to the compression, or encoding, without needing to change the decoders in any devices—though optimisations to the decoder which read the same or earlier bitstream formats are expected.
The bitstream format provides for videos of a wide range of sizes—such as but not limited to: 3840×2160 (4k), 1920×1080 (1080p), 1280×720 (720p), 640×360 (360p), 432×240, 384×288, 384×216, 320×240 and 320×180—in each case with the video frames split into preferably rectangular blocks. This can be seen in
The bitstream format includes information as to which of several pre-defined formats have been utilised for each block during encoding, and this information is available and used during decoding. In one format, shown in
In addition, the bitstream format includes motion vectors, and the decoding steps include decoding such motion vectors for each block. The bitstream format may also contain simple adjustments across blocks which when decoded and applied, make decoded blocks a more accurate representation of the compressed block. In certain cases, for example in the fourth representation above, the decoding step includes decoding blocks which have been encoded utilising the intra frame compression option, for example utilised for any block whose compressed representation does not depend on any other frame or part of the frame.
Filter on Source
Forbidden's codecs have traditionally included a filter stage to remove noise. Filtering before knowledge of the motion adds compression artefacts to the video, and so to minimise these filtering artefacts caused by motion, in accordance with the present invention, filtering is applied after the motion estimation stages of the codec. The motion search which leads to the motion vectors for each block is quite accurate, and an advantage of filtering after motion estimation can be best illustrated by considering for example an image for an object which may be present in several frames. Each of the several frames will include information about the object and information may be combined and used or relied upon in any, some or all of the appropriate frames in order to create a temporally more consistent representation of the object—but only if the filter handles corresponding pixels from the version of each object in each frame, i.e. only if the filter is applied after motion compensation.
Compressor Chunks
The video frames can be thought of as being grouped in chunks. Each chunk contains a key frame and, apart from the earliest chunk, a number of delta frames. Referring again to
For example, referring again to
The method of compression of the key frames, which use their own intra frame compression techniques, is not described in this invention, which covers the compression and decompression of the delta frames.
YUV
The luminance (Y) and the colour components (e.g. UV) can be compressed individually, or more efficiently from a data rate standpoint, together, in which case the motion vectors for each block are applied to Y, U and V.
Motion Search Blocks
As discussed, the video or other media being compressed is composed of frames, and in accordance with the present invention these frames are split into blocks each of which are contemplated to have their own motion vectors. It is convenient, though not necessary, for these blocks all to be the same size, and for this size to be a divisor of the image size. These blocks are additionally converted into mini blocks, where an integer number of blocks of the size of the mini blocks can tile a block the size of the bigger blocks. These blocks may be square, but this is not necessary.
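The conversion of blocks into mini blocks by averaging pixel groups can be sketched as follows. This is a minimal illustration only, assuming a frame held as a list of rows of luminance values; the function name `make_mini_frame` and its parameters are hypothetical and do not appear in the specification.

```python
def make_mini_frame(frame, group_w, group_h):
    """Average each group_w x group_h pixel group into a single value.

    Assumption: `frame` is a list of rows of numeric pixel values and the
    pixel groups tile the frame exactly.
    """
    height, width = len(frame), len(frame[0])
    if width % group_w or height % group_h:
        raise ValueError("pixel groups must tile the frame")
    mini = []
    for gy in range(0, height, group_h):
        row = []
        for gx in range(0, width, group_w):
            total = sum(frame[y][x]
                        for y in range(gy, gy + group_h)
                        for x in range(gx, gx + group_w))
            row.append(total / (group_w * group_h))
        mini.append(row)
    return mini


# A 4x4 frame reduced with 2x2 pixel groups gives a 2x2 mini frame.
frame = [[0, 2, 10, 10],
         [2, 4, 10, 10],
         [8, 8, 1, 3],
         [8, 8, 5, 7]]
mini = make_mini_frame(frame, 2, 2)
```

Each element of the result stands for one pixel group of the original frame, so a block of the original frame corresponds to a proportionally smaller mini block in the result.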
For example only, some typical video image sizes are set out below, with corresponding big block and mini block sizes (in pixels). The invention is not limited to the exemplary image and block sizes set out herein.
| Image size | Big block size | Small block size |
|---|---|---|
| 1920×1080 (1080p) | 30×30 | 5×5 |
| 1280×720 (720p) | 20×20 | 4×4 |
| 640×360 (360p) | 20×20 | 5×5 |
| 432×240 | 16×16 | 4×4 |
| 384×288 | 16×16 | 4×4 |
| 384×216 | 12×12 | 4×4 |
| 320×240 | 16×16 | 4×4 |
| 320×180 | 10×10 | 5×5 |
Implementation is simplest when the following conditions are met:
1. Big blocks tile the image (i.e. each frame of the video).
2. The horizontal and vertical pixel sizes of the mini blocks are divisors of the horizontal and vertical pixel sizes of the big blocks to which they correspond.
3. There are sufficient big blocks that each big block has a corresponding big block on earlier frames, later frames, or both, which match the similar image content.
4. The mini blocks are large enough to have enough content to match motion vectors accurately.
Where the big block has a different arrangement of pixels, e.g. 30×30, a correspondingly different arrangement of mini blocks is generated, but the principle is the same.
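Conditions 1 and 2 above are simple divisibility checks, and the exemplary sizes listed earlier can be checked against them as in the following sketch. The function name `check_block_sizes` and the `EXAMPLES` list layout are hypothetical; the sizes themselves are the exemplary ones given in this description.

```python
def check_block_sizes(image, big, mini):
    """Return (big blocks tile the image, mini size divides big size)."""
    image_w, image_h = image
    big_w, big_h = big
    mini_w, mini_h = mini
    tiles_image = image_w % big_w == 0 and image_h % big_h == 0
    divides_big = big_w % mini_w == 0 and big_h % mini_h == 0
    return tiles_image, divides_big


# (image size, big block size, mini block size), all in pixels.
EXAMPLES = [
    ((1920, 1080), (30, 30), (5, 5)),
    ((1280, 720), (20, 20), (4, 4)),
    ((640, 360), (20, 20), (5, 5)),
    ((432, 240), (16, 16), (4, 4)),
    ((384, 288), (16, 16), (4, 4)),
    ((384, 216), (12, 12), (4, 4)),
    ((320, 240), (16, 16), (4, 4)),
    ((320, 180), (10, 10), (5, 5)),
]

all_ok = all(check_block_sizes(*example) == (True, True) for example in EXAMPLES)
```

A 16×16 big block, by contrast, would not tile a 1920×1080 frame, because 1080 is not a multiple of 16.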
Adjacent Frame Motion Search (AFMS)
In the case where the frame or frames adjacent to the frame being compressed are being used as predictors as in the first three block formats discussed above and also as shown in
With unlimited processor time, all possible motion vectors for the big blocks could be searched to find the optimal match. For example, each pixel of a first big block in a first frame could be compared with each pixel of each big block in a second frame until it can be established that an optimal match has been found, i.e. until the location of the best match of the first big block in the second frame has been established. Once the location of the first big block in the second frame has been established, it is possible to determine exactly the motion vector that resulted in the repositioning of the first big block between the first and second frames.
On current typical devices, the resolution and frame rate of the video expected can preclude the possibility of an exhaustive search. Instead, the process is carried out on the mini blocks on corresponding mini versions of the video frames scaled as described above. When a (possibly limited) search is complete using integer ‘average’ pixel motions of the mini blocks, sub pixel motions of the mini blocks are examined to refine the search. The sub pixel motions utilise the averaged information contained in each of the ‘averaged’ pixels in the pixel group (for example a 4×4 pixel group). The small size of these mini frame images reduces the search time and allows a much wider range of candidate motion vectors to be tested in the available time on a real time system, using an error function to estimate any errors over the blocks, to find a candidate best motion vector for each mini block. The error function can be any suitable error function, for example a mean square error function applied to pixel values from the first frame and pixel values from the second frame.
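The integer displacement search over a mini frame, using mean square error as the error function, might look like the sketch below. It is an illustration under stated assumptions rather than the codec's actual search: frames are lists of rows, the search window is a square of hypothetical `radius`, and the names `block_mse` and `search_motion` are invented for this example. Sub pixel refinement is not shown.

```python
def block_mse(cur, ref, bx, by, dx, dy, size):
    """Mean square error between the block at (bx, by) in `cur` and the
    block displaced by (dx, dy) from that position in `ref`."""
    total = 0.0
    for y in range(size):
        for x in range(size):
            diff = cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x]
            total += diff * diff
    return total / (size * size)


def search_motion(cur, ref, bx, by, size, radius):
    """Try every integer displacement within `radius`; candidate blocks need
    not be aligned to tiled block positions. Returns (error, dx, dy)."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            x0, y0 = bx + dx, by + dy
            if x0 < 0 or y0 < 0 or x0 + size > width or y0 + size > height:
                continue  # displaced block would leave the frame
            err = block_mse(cur, ref, bx, by, dx, dy, size)
            if best is None or err < best[0]:
                best = (err, dx, dy)
    return best


# Synthetic 8x8 frames: `cur` shows the content of `ref` displaced by (2, 1).
ref = [[3 * x * x + 7 * y * y for x in range(8)] for y in range(8)]
cur = [[ref[min(y + 1, 7)][min(x + 2, 7)] for x in range(8)] for y in range(8)]
best = search_motion(cur, ref, 2, 2, 2, 2)
```

Because the search runs on the much smaller mini frames, a given `radius` covers a proportionally wider range of motion in the original frame for the same number of comparisons.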
Once a candidate best motion vector is found for a mini block, the process is repeated with new mini blocks at a larger size (for example four or nine times the previous mini block area, but never exceeding the original block size area). So a 5×5 mini block could be replaced with a 10×10 mini block covering the same proportion of a larger version of the video frame. Corresponding motion vectors for these larger mini blocks are tested, and an iterative appropriate adjacent motion vector search (IAAMVS) is repeated until a local minimum for the error is found. For example, as shown in
So in summary: first, the best available motion vector is found with the smallest mini blocks. Then the block size is increased and the motion vector from the smaller motion blocks is used as the basis for the starting point for a new search for the best available motion vector at this new size. This process is repeated until the motion vector found is based on actual pixels in the unscaled video frame being compressed.
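The summary above can be sketched as a coarse-to-fine loop. This is a simplified illustration, not the IAAMVS itself: it assumes square blocks, scale factors of 4, 2 and 1, a small fixed search radius at each level, and mean square error as the error function. All names (`downscale`, `block_error`, `local_search`, `coarse_to_fine`) and default values are hypothetical.

```python
def downscale(frame, f):
    """Average each f x f pixel group; f == 1 keeps the original pixels."""
    h, w = len(frame) // f, len(frame[0]) // f
    return [[sum(frame[y * f + j][x * f + i]
                 for j in range(f) for i in range(f)) / (f * f)
             for x in range(w)] for y in range(h)]


def block_error(cur, ref, bx, by, dx, dy, size):
    """Mean square error, or infinity if the displaced block leaves the frame."""
    if not (0 <= bx + dx and bx + dx + size <= len(ref[0])
            and 0 <= by + dy and by + dy + size <= len(ref)):
        return float("inf")
    return sum((cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x]) ** 2
               for y in range(size) for x in range(size)) / (size * size)


def local_search(cur, ref, bx, by, size, start, radius):
    """Search integer displacements within `radius` of the vector `start`."""
    best_err, best_vec = float("inf"), start
    for dy in range(start[1] - radius, start[1] + radius + 1):
        for dx in range(start[0] - radius, start[0] + radius + 1):
            err = block_error(cur, ref, bx, by, dx, dy, size)
            if err < best_err:
                best_err, best_vec = err, (dx, dy)
    return best_vec


def coarse_to_fine(cur, ref, bx, by, block, factors=(4, 2, 1), radii=(2, 1, 1)):
    """Find a vector on the smallest mini blocks, then rescale it and use it
    as the starting point at each larger size, ending on actual pixels."""
    vec, prev = (0, 0), None
    for f, radius in zip(factors, radii):
        if prev is not None:
            scale = prev // f  # each factor is assumed to divide the previous one
            vec = (vec[0] * scale, vec[1] * scale)
        vec = local_search(downscale(cur, f), downscale(ref, f),
                           bx // f, by // f, block // f, vec, radius)
        prev = f
    return vec


# Synthetic 32x32 frames: `cur` shows the content of `ref` displaced by (4, -4).
ref = [[3 * x * x + 7 * y * y for x in range(32)] for y in range(32)]
cur = [[ref[min(max(y - 4, 0), 31)][min(max(x + 4, 0), 31)]
        for x in range(32)] for y in range(32)]
vector = coarse_to_fine(cur, ref, 8, 8, 8)
```

Only the first level searches a wide window; later levels examine a handful of candidates around the rescaled vector, which is where the saving over an exhaustive full-resolution search comes from.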
Non Adjacent Frame Motion Search (NAFMS)
In the preferred invention, use is made of motion vectors already calculated for adjacent frames when calculating motion vectors between non adjacent frames, where non-adjacent frames include a first frame and a second frame where the first frame is not the nearest earlier or nearest later frame to the second frame. Examples of non-adjacent frames include frame 0 and frame 2, or frame 2 and frame 4. A consequence of using such already calculated motion vectors is that motion vectors for non adjacent frames are calculated later than those of the adjacent frames. In particular it is contemplated that the motion vectors for the non adjacent frames use the sum of the appropriate motion vectors for frames which are half the time apart as a first estimate of their values, with smaller search spaces around these values to improve the accuracy of the search.
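As a sketch of the first estimate described above, the vector between two non adjacent frames can be seeded with the sum of the two vectors spanning half the time each, and then refined over a small neighbourhood. The function names, the `radius` parameter and the `error_fn` callback are hypothetical; any block error function, such as mean square error, could be supplied.

```python
def initial_estimate(first_half, second_half):
    """Sum the motion vectors of the two half-interval steps, for example
    frame 0 to 1 and frame 1 to 2, to seed the frame 0 to 2 search."""
    return (first_half[0] + second_half[0], first_half[1] + second_half[1])


def refine_estimate(estimate, error_fn, radius=1):
    """Search a small space around the estimate and keep the candidate
    with the lowest error."""
    candidates = [(estimate[0] + dx, estimate[1] + dy)
                  for dy in range(-radius, radius + 1)
                  for dx in range(-radius, radius + 1)]
    return min(candidates, key=error_fn)


# Example: the two half-interval vectors sum to (5, -1); a toy error
# function whose minimum lies at (6, -1) pulls the refined vector there.
seed = initial_estimate((3, -1), (2, 0))
refined = refine_estimate(seed, lambda v: (v[0] - 6) ** 2 + (v[1] + 1) ** 2)
```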
Initial Motion Search Vector Estimation
In one preferred implementation, the frames are decoded by the decoder in increasingly fine temporal granularity, starting with key frames, such as frames 0 and 32 in for example
The motion vectors for the delta frames which are furthest apart (and in this implementation decoded first) are generally the largest, and potentially require the largest search to find the best match. For example, referring again to
As frames F0, F1, F2, . . . arrive for compressing (with Eab representing the motion vector between two earlier frames a and b and Lcd representing the motion vector between two later frames c and d), this is one possible order for calculating the motion vectors:
It can be seen that when the number of frames reaches a power of two, all the motion vectors can be estimated from the subdivisions.
For example, a simplified way of considering this process is to consider the raw data of a series of frames beginning with frame 0 and ending with frame 32: the intervening delta frames can be built up in the manner as follows.
First it is helpful to set out a binary representation of the frame number, so that instead of referring in decimal to frames 0, 1, 2, 3, . . . 32 we also refer to them as frames 00000 (0) to 100000 (32), including 00000 (0), 00001 (1), 00010 (2), 00011 (3), 00100 (4), 00101 (5), 00110 (6), 00111 (7), 01000 (8), 01001 (9), 01010 (10), 01011 (11), 01100 (12), 01101 (13), 01110 (14), 01111 (15), 10000 (16), and so on to frame 100000 (32).
Based on the raw data in frame 00000 (0) and 00010 (2) we can generate motion vector estimates that provide frame 00001 (1). Similarly based on the raw data in frame 00010 (2) and 00100 (4) we can generate motion vector estimates that provide frame 00011 (3). This may be generalised such that adjacent frames ending in binary 0 (2, 4, 6 . . . ) provide motion vector estimates for intermediate frames ending in binary 1 (1, 3, 5, . . . ). Thus in a first run through a first estimate is generated for odd frames, from respective adjacent frames.
Zooming out, we can in an analogous fashion generate, in a second run through, motion vectors for frames that end in binary 10 (2, 6, 10, . . . ) from these already estimated motion vectors. For example the motion vector estimates for frame 2 will be based on the values estimated in the first run through, where the motion vectors for frame 1 were estimated based on the raw data of frames 0 and 2, and the motion vectors for frame 3 were estimated based on the raw data of frames 2 and 4. The values generated provide motion vector estimates for frame 2 based on the motion vector estimates of frame 0 to 2 and 2 to 4.
We can in an analogous fashion generate motion vectors for frames that end in binary 100 (4, 12, 20 . . . ), binary 1000 (8, 24, . . . ) then binary 10000 (16). It is important to note that in each case except the initial odd frames case, the motion vector search starts with the estimates based on motion vector estimates of neighbouring values.
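The grouping of frames into run-throughs by the ending of their binary frame number can be expressed compactly. The sketch below assumes a chunk of 32 frames between key frames 0 and 32; the function names `pass_number` and `frames_by_pass` are hypothetical.

```python
def pass_number(frame):
    """1 for frame numbers ending in binary 1, 2 for those ending in 10,
    3 for those ending in 100, and so on."""
    if frame <= 0:
        raise ValueError("only delta frames (frame > 0) have a pass number")
    lowest_set_bit = frame & -frame
    return lowest_set_bit.bit_length()


def frames_by_pass(chunk_length):
    """Group delta frames 1 .. chunk_length - 1 by run-through."""
    passes = {}
    for frame in range(1, chunk_length):
        passes.setdefault(pass_number(frame), []).append(frame)
    return passes


passes = frames_by_pass(32)
# passes[1] holds the odd frames, passes[2] holds 2, 6, 10, ...,
# passes[3] holds 4, 12, 20, 28, passes[4] holds 8, 24, passes[5] holds 16.
```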
Confirmation of Motion Vectors
It is important that the compressor sees exactly what the decompressor will see so that it can find the motion vectors and other corrections which will work best on the decompressor. When the chunk of frames has arrived in the compressor, and the motion vector estimates have been made as above as part of the process of compression, the motion vectors can be verified, while encoding, before the video bitstream is output, for example transmitted to a decoder. The compression is not loss free, and the frames being compressed are updated to match the video being represented in the bitstream, i.e. to what the decompressor produces (the decompressor doesn't have access to the exact original source, but only to a compressed and in general degraded version of it). This can have the effect of making the motion vectors with the best matches on the compressed/decompressed video different from those estimated from the original source. So at this stage, the motion vectors are checked using the estimated motion vectors, but in a different order, namely the order in which the frames will be decompressed, starting with the coarsest temporal resolution deltas. The possibly improved motion vectors are searched for using the same search techniques as the original estimates used, but with the decompressed frames replacing the original source frames. Where a better match is found, this replaces the original motion vector estimate.
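The decompression order referred to above, coarsest temporal resolution first, can be sketched as follows. The sketch assumes the chunk length is a power of two and that each delta frame is rebuilt from the two already decoded frames on either side of it; `decode_order` and `reference_frames` are hypothetical names.

```python
def decode_order(chunk_length):
    """Key frames first, then mid-points at increasingly fine granularity.
    Assumes chunk_length is a power of two."""
    order = [0, chunk_length]
    step = chunk_length
    while step > 1:
        half = step // 2
        order.extend(range(half, chunk_length, step))
        step = half
    return order


def reference_frames(frame):
    """Earlier and later frames which the given delta frame sits mid-way
    between, derived from the lowest set bit of the frame number."""
    step = frame & -frame
    return frame - step, frame + step


order = decode_order(8)          # [0, 8, 4, 2, 6, 1, 3, 5, 7]
neighbours = reference_frames(6)  # (4, 8)
```

In this ordering every delta frame appears after both of its reference frames, so the compressor can re-check each motion vector against frames exactly as the decompressor will have reconstructed them.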
Bias
Modern mobile phone cameras can have an auto exposure function, which can change the values of pixels while preserving their relative values. Rather than update all the pixels individually, the compressor may adjust Y values by, for example, calculating the average values of Y over each motion block and adjusting for this in the bitstream as a single value for each block. This allows any Y bias in the block estimate to be corrected efficiently, if required.
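A per-block Y bias of this kind might be computed and applied as in the following sketch. It assumes 8-bit luminance values, rounds the bias to an integer and clamps results to the range 0 to 255; these choices and the function names are assumptions made for illustration only.

```python
def block_mean(block):
    """Average value over all pixels of a block (a list of rows)."""
    count = sum(len(row) for row in block)
    return sum(sum(row) for row in block) / count


def y_bias(source_block, predicted_block):
    """Single value describing the average Y offset between the source
    block and its motion compensated prediction."""
    return round(block_mean(source_block) - block_mean(predicted_block))


def apply_bias(predicted_block, bias):
    """Add the bias to every pixel, clamped to the assumed 8-bit range."""
    return [[min(255, max(0, value + bias)) for value in row]
            for row in predicted_block]


# An exposure change that brightens every pixel by 10 is captured by a
# single bias value for the block.
source = [[110, 112], [114, 116]]
predicted = [[100, 102], [104, 106]]
bias = y_bias(source, predicted)
corrected = apply_bias(predicted, bias)
```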
Block Representations
There are four classes of block representations, or block types, supported by the compressor, making use of information available at the time of decompression:
1. Interpolated. These are based on the average between motion adjusted and bias adjusted blocks from the temporally closest known past and future video frames. For example this block type is suitable where there is some or no motion within a frame between the past and future video frames.
2. Earlier. These are based on the motion adjusted and bias adjusted blocks from the temporally closest known past video frame. For example this block type is suitable where there is motion near an edge of a frame, for example where motion near an edge of a frame in the temporally closest known past video frame will no longer be present in the temporally closest future video frame.
3. Later. These are based on the motion adjusted and bias adjusted blocks from the temporally closest known future video frame. For example this block type is suitable where there is motion near an edge of a frame, for example where motion near an edge of a frame in the temporally closest future video frame is new, i.e. not present in the earlier frame.
4. Intraframe. These are based on high quality intraframe compression. For example this block type is suitable where a small item or part of a frame is moving while the rest remains relatively stable, for example where the video covers a tennis match and the motion of a ball is followed through frames, i.e. when the motion search fails to find a good match from either the previous or later frame for this block.
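The choice between the four block types could be organised along the following lines. This is a hedged sketch only: it assumes the earlier and later blocks passed in are already motion adjusted and bias adjusted, uses mean square error as the measure of fit, and falls back to the intraframe type when none of the other three is within a hypothetical `max_error` threshold. The function names are invented for this example.

```python
def mse(a, b):
    """Mean square error between two equally sized blocks."""
    count = sum(len(row) for row in a)
    return sum((p - q) ** 2
               for row_a, row_b in zip(a, b)
               for p, q in zip(row_a, row_b)) / count


def predict(block_type, earlier, later):
    """Build the block estimate for types 1 to 3."""
    if block_type == "interpolated":
        return [[(p + q) / 2 for p, q in zip(row_e, row_l)]
                for row_e, row_l in zip(earlier, later)]
    if block_type == "earlier":
        return earlier
    if block_type == "later":
        return later
    raise ValueError("intraframe blocks are not predicted from other frames")


def choose_block_type(source, earlier, later, max_error):
    """Pick the best fitting predicted type, or intraframe if none fits."""
    errors = {t: mse(source, predict(t, earlier, later))
              for t in ("interpolated", "earlier", "later")}
    best = min(errors, key=errors.get)
    return best if errors[best] <= max_error else "intraframe"


source = [[10, 20], [30, 40]]
choice = choose_block_type(source,
                           [[8, 18], [28, 38]],
                           [[12, 22], [32, 42]],
                           max_error=1.0)
```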
Cases 1-3 allow for arbitrary pixels to be corrected. The corrections are typically small, so compress well, and in the preferred implementation of the invention are chosen to give more accurate representation in lower contrast areas. In one implementation, these blocks are split up and the variance in each section is used to represent the smoothness. In another implementation, the local contrast is calculated explicitly, and this value is used to set a maximum allowed error for each pixel.
Case 4 allows for filtering of the image to reduce noise, and hence data-rate. When this option is enabled, the filtering generally reduces the contrast of pixels which have locally outlying values, and hence which are more likely to be accentuated by noise. Reducing the contrast in this way has minor effects on the perceived image quality but gives a good saving in data rate.
The bitstream for the delta frames contains information such as the larger block size (i.e. 20×20, 30×30, and so on, as shown above); and also the block type for each block. For types 1-3, for each larger block the bitstream also contains information such as motion vectors, bias, identification of each pixel to be corrected (by for example sending the gap from the previous correction) and corrections to any estimates which used earlier or later frames. For type 4, the bitstream contains a high fidelity representation of each pixel in the block.
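Identifying corrected pixels by the gap from the previous correction can be illustrated as below. The sketch assumes pixel positions are ascending raster-scan indices within a block and that the first gap is measured from index 0; both conventions, and the function names, are assumptions rather than details of the bitstream format.

```python
def positions_to_gaps(positions):
    """Convert ascending pixel indices into gaps from the previous one."""
    gaps, previous = [], 0
    for position in positions:
        gaps.append(position - previous)
        previous = position
    return gaps


def gaps_to_positions(gaps):
    """Recover the pixel indices by accumulating the gaps."""
    positions, current = [], 0
    for gap in gaps:
        current += gap
        positions.append(current)
    return positions


gaps = positions_to_gaps([3, 4, 10, 25])
restored = gaps_to_positions(gaps)
```

Since corrections are typically sparse and clustered, the gaps tend to be small numbers, which suits a subsequent entropy coding stage.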
Accuracy
Because some (delta) frames are constructed from other (key or delta) frames, the frames used as sources by most other frames (key frames and some delta frames) should be more accurately represented than the remaining frames. The choice of block representation (i.e. block type) is dependent on the accuracy required, with type 4 being used when types 1-3 are insufficiently accurate. The acceptable error rate where options 1-3 (lower data rate options) are selected is lowest on delta frames which are decoded earliest and are used in the construction of a higher number of subsequent delta frames, and highest on delta frames which are decoded last and are not used in the construction of any further delta frames, as this reduces the average error rate for a given data rate. Where the blocks are far apart in time and will be used to generate multiple further blocks at decreasing separation, accuracy is required so that errors are not propagated over multiple frames.
The acceptable error can also depend on the variance within the block, where variance is a result of for example contrast within the block, with less varied blocks requiring a more accurate fit.
The compressor generates its corrections using only the information available to the decompressor so that the data sent to the decompressor applies to the information at the decompressor's disposal. This is significant because the information sent is most useful to the decompressor and is independent of any additional information the compressor may have available to it (but which is not available to the decoder).
Sub Pixel
Sub pixel motion can be simulated by averaging two motion blocks each with motion vectors which are an integer number of pixels horizontally and vertically, and which are for example either side of the correct motion. This means that averaging the blocks from earlier and later frames can result in estimates of sub pixel motion.
Temporally close together frames with motions of an integer number of pixels often give the effect of appearing to have sub pixel motion when averaged. Hence the motion estimates between past and future can be chosen to be integer numbers of pixels in such a way that the averages between past and future frames in type 1 blocks effectively give sub pixel motion estimation but, importantly, without having to send or search for sub pixel motion vectors.
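A one-dimensional example shows the effect. In the sketch below a linear brightness ramp moves by one whole pixel between an earlier and a later frame, so the frame mid-way between them has moved by half a pixel; averaging the two integer-aligned rows reproduces that half pixel position. The data and the function name `average_blocks` are illustrative assumptions, and the result is exact only because the ramp is linear; for general content the average is an approximation.

```python
def average_blocks(earlier, later):
    """Average corresponding pixels of two rows of equal length."""
    return [(p + q) / 2 for p, q in zip(earlier, later)]


# The same ramp seen in the earlier frame and, one pixel further on,
# in the later frame.
earlier_row = [0, 10, 20, 30, 40]
later_row = [10, 20, 30, 40, 50]

# Values the ramp takes half a pixel along from the earlier position.
true_half_pixel_row = [5, 15, 25, 35, 45]

interpolated_row = average_blocks(earlier_row, later_row)
```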
Transition Tables
In a preferred implementation of the invention, the bitstream encoding uses Forbidden's previously patented loss free “transition table” technology.
Some Advantages of Temporal Compression as Described in this Patent
Firstly, substantially every part of the contents of a middle frame is likely to be represented either as part of a previous frame, or as part of a future frame, or both. This means that representing a delta frame between two key frames almost never requires completely new pixels to allow for motion, particularly camera motion.
Secondly, the use of averaging simulates sub pixel motion without the computational or data rate costs of sending it explicitly—integer only motion vectors suffice.
Thirdly, for editing or low data rate connections, video can be downloaded and played back easily at a wide range of frame rates without requiring download or decompression of non-displayed frames.
The advantages are not limited to those outlined herein.
The invention is not restricted to the details of the foregoing embodiments. For example the number of frames between key frames may not be a power of two, but may be any other number, and determining the content of delta frames between key frames separated by a number other than a power of two may involve subdividing the time by a different, more suitable number.
| Number | Date | Country | Kind |
|---|---|---|---|
| 1513610.4 | Jul 2015 | GB | national |

| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/GB2016/052223 | 7/22/2016 | WO | 00 |