The present application relates to Australian Patent Application No. 2022200086, the contents of which are incorporated herein by reference in their entirety.
The present invention relates generally to digital video signal processing and, in particular, to a method, apparatus and system for encoding and decoding tensors from a convolutional neural network. The present invention also relates to a computer program product including a computer readable medium having recorded thereon a computer program for encoding and decoding tensors from a convolutional neural network using video compression technology.
Video compression is a ubiquitous technology used to support many applications, including applications for transmission and storage of video data. Many video coding standards have been developed and others are currently in development. Recent developments in video coding standardisation have led to the formation of a group called the “Joint Video Experts Team” (JVET). The Joint Video Experts Team (JVET) includes members of two Standards Setting Organisations (SSOs), namely: Study Group 16, Question 6 (SG16/Q6) of the Telecommunication Standardisation Sector (ITU-T) of the International Telecommunication Union (ITU), also known as the “Video Coding Experts Group” (VCEG) and the International Organisation for Standardisation/International Electrotechnical Commission Joint Technical Committee 1/Subcommittee 29/Working Group 11 (ISO/IEC JTC1/SC29/WG11), also known as the “Moving Picture Experts Group” (MPEG).
The Joint Video Experts Team (JVET) has developed a video compression standard, named ‘versatile video coding’ (VVC).
Convolutional neural networks (CNNs) are an emerging technology addressing, among other things, use cases involving machine vision such as object recognition, object tracking, human pose estimation and action recognition. CNNs typically include many layers, such as convolution layers and fully connected layers, with data passing from one layer to the next in the form of ‘tensors’. Weights for each of the layers are determined in a training stage, where a very large amount of training data is passed through the CNN and a determined result is compared to ground truth associated with the training data. A process for updating network weights, such as stochastic gradient descent, is applied to iteratively refine the network weights until the network performs at a desired level of accuracy. Where a convolution stage has a ‘stride’ greater than one, an output tensor from the convolution has a lower spatial resolution than a corresponding input tensor. Operations such as ‘max pooling’ also reduce the spatial size of the output tensor compared to the input tensor. Max pooling produces an output tensor by dividing the input tensor into groups of data samples (e.g., a 2×2 group of data samples), and from each group selecting a maximum value as output for a corresponding value in the output tensor. The process of executing a CNN with an input and progressively transforming the input into an output is commonly referred to as ‘inferencing’.
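By way of illustration only, the max pooling operation described above may be sketched as follows for a single two-dimensional feature map. The function name `max_pool_2x2` and the nested-list representation are assumptions made for this example and do not form part of any described arrangement.

```python
def max_pool_2x2(feature_map):
    """Max pooling over non-overlapping 2x2 groups of a 2-D feature map.

    feature_map is a list of rows (hypothetical representation).
    Any trailing odd row or column is dropped in this sketch.
    """
    height = len(feature_map)
    width = len(feature_map[0])
    pooled = []
    for y in range(0, height - 1, 2):
        row = []
        for x in range(0, width - 1, 2):
            group = (
                feature_map[y][x], feature_map[y][x + 1],
                feature_map[y + 1][x], feature_map[y + 1][x + 1],
            )
            # The maximum of each 2x2 group becomes one output value.
            row.append(max(group))
        pooled.append(row)
    return pooled


example_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 2, 1, 1],
]
pooled_map = max_pool_2x2(example_map)  # 4x4 input becomes 2x2 output
```

The output has half the width and half the height of the input, consistent with the reduction in spatial size noted above.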
Generally, a tensor has four dimensions, namely: batch, channels, height and width. The first dimension, ‘batch’, of size ‘one’ when inferencing on video data indicates that one frame is passed through a CNN at a time. When training a network, the value of the batch dimension may be increased so that multiple frames are passed through the network before the network weights are updated, according to a predetermined ‘batch size’. A multi-frame video may be passed through as a single tensor with the batch dimension increased in size according to the number of frames of a given video. However, for practical considerations relating to memory consumption and access, inferencing on video data is typically performed on a frame-wise basis. The ‘channels’ dimension indicates the number of concurrent ‘feature maps’ for a given tensor and the height and width dimensions indicate the size of the feature maps at the particular stage of the CNN. Channel count varies through a CNN according to the network architecture. Feature map size also varies, depending on subsampling occurring in specific network layers.
Input to the first layer of a CNN is an image or video frame, typically resized for compatibility with the dimensionality of the tensor input to the first layer. The dimensionality of tensors is dependent on the CNN architecture, generally having some dimensions relating to input width and height and a further ‘channel’ dimension.
Slicing a tensor based on channel results in a set of ‘feature maps’, so-called because each slice of the tensor has some relationship to the corresponding input image, capturing some property such as edges. At layers further from the input to the network, the relationship can be more abstract. The ‘task performance’ of a CNN is measured by comparing the result of the CNN in performing a task using specific input with a provided ground truth (i.e., ‘training data’), generally prepared by humans and intended to indicate a ‘correct’ result.
Once a network topology is decided, the network weights may be updated over time as more training data becomes available. It is also possible to retrain a portion of a CNN, leaving weights in other portion(s) of the network unchanged. The overall complexity of the CNN tends to be quite high, with large numbers of multiply-accumulate operations being performed and numerous intermediate tensors being written to and read from memory. In some applications, the CNN is implemented entirely in the ‘cloud’, resulting in a need for high and costly processing power. In other applications, the CNN is implemented in an edge device, such as a camera or mobile phone, resulting in less flexibility but a more distributed processing load.
VVC is anticipated to address ongoing demand for ever-higher compression performance, especially as video formats increase in capability (e.g., with higher resolution and higher frame rate) and to address increasing market demand for service delivery over WANs, where bandwidth costs are relatively high. VVC is implementable in contemporary silicon processes and offers an acceptable trade-off between achieved performance and implementation cost. The implementation cost may be considered, for example, in terms of one or more of silicon area, CPU processor load, memory utilisation and bandwidth. Part of the versatility of the VVC standard is in the wide selection of tools available for compressing video data, as well as the wide range of applications for which VVC is suitable.
Video data includes a sequence of frames of image data, each frame including one or more colour channels. Generally, one primary colour channel and two secondary colour channels are needed. The primary colour channel is generally referred to as the ‘luma’ channel and the secondary colour channel(s) are generally referred to as the ‘chroma’ channels. Although video data is typically displayed in an RGB (red-green-blue) colour space, this colour space has a high degree of correlation between the three respective components. The video data representation seen by an encoder or a decoder often uses a colour space such as YCbCr. YCbCr concentrates luminance, mapped to ‘luma’ according to a transfer function, in a Y (primary) channel and chroma in Cb and Cr (secondary) channels. Due to the use of a decorrelated YCbCr signal, the statistics of the luma channel differ markedly from those of the chroma channels. A primary difference is that after quantisation, the chroma channels contain relatively few significant coefficients for a given block compared to the coefficients for a corresponding luma channel block. Moreover, the Cb and Cr channels may be sampled spatially at a lower rate (subsampled) compared to the luma channel, for example half horizontally and half vertically, known as a ‘4:2:0 chroma format’. The 4:2:0 chroma format is commonly used in ‘consumer’ applications, such as internet video streaming, broadcast television, and storage on Blu-Ray™ disks. When only luma samples are present, the resulting monochrome frames are said to use a “4:0:0 chroma format”.
The VVC standard specifies a ‘block based’ architecture, in which frames are firstly divided into a square array of regions known as ‘coding tree units’ (CTUs). CTUs generally occupy a relatively large area, such as 128×128 luma samples. Other possible CTU sizes when using the VVC standard are 32×32 and 64×64. However, CTUs at the right and bottom edge of each frame may be smaller in area, with implicit splitting occurring to ensure the CBs remain in the frame. Associated with each CTU is a ‘coding tree’ either for both the luma channel and the chroma channels (a ‘shared tree’) or a separate tree each for the luma channel and the chroma channels. A coding tree defines a decomposition of the area of the CTU into a set of blocks, also referred to as ‘coding blocks’ (CBs). When a shared tree is in use a single coding tree specifies blocks both for the luma channel and the chroma channels, in which case the collections of collocated coding blocks are referred to as ‘coding units’ (CUs) (i.e., each CU having a coding block for each colour channel). The CBs are processed for encoding or decoding in a particular order. As a consequence of the use of the 4:2:0 chroma format, a CTU with a luma coding tree for a 128×128 luma sample area has a corresponding chroma coding tree for a 64×64 chroma sample area, collocated with the 128×128 luma sample area. When a single coding tree is in use for the luma channel and the chroma channels, the collections of collocated blocks for a given area are generally referred to as ‘units’, for example, the above-mentioned CUs, as well as ‘prediction units’ (PUs), and ‘transform units’ (TUs). A single tree with CUs spanning the colour channels of 4:2:0 chroma format video data results in chroma blocks half the width and height of the corresponding luma blocks. When separate coding trees are used for a given area, the above-mentioned CBs, as well as ‘prediction blocks’ (PBs), and ‘transform blocks’ (TBs) are used.
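As an illustrative sketch only, the division of a frame into CTUs and the relationship between luma and chroma areas in the 4:2:0 chroma format may be expressed as follows. The function names `ctu_grid` and `chroma_size_420` and the 1920×1080 frame size are assumptions chosen for the example.

```python
def ctu_grid(frame_width, frame_height, ctu_size=128):
    """Return (columns, rows, last_column_width, last_row_height) in luma samples.

    CTUs at the right and bottom edges may cover fewer samples than
    ctu_size when the frame size is not a multiple of the CTU size.
    """
    columns = (frame_width + ctu_size - 1) // ctu_size   # ceiling division
    rows = (frame_height + ctu_size - 1) // ctu_size
    last_column_width = frame_width - (columns - 1) * ctu_size
    last_row_height = frame_height - (rows - 1) * ctu_size
    return columns, rows, last_column_width, last_row_height


def chroma_size_420(luma_width, luma_height):
    """Chroma area collocated with a luma area in the 4:2:0 chroma format."""
    return luma_width // 2, luma_height // 2


grid = ctu_grid(1920, 1080)              # hypothetical frame size
chroma_ctu = chroma_size_420(128, 128)   # 128x128 luma area maps to 64x64 chroma
```

In this example the bottom row of CTUs covers only 56 luma rows, which is the situation in which implicit splitting keeps the coding blocks inside the frame.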
Notwithstanding the above distinction between ‘units’ and ‘blocks’, the term ‘block’ may be used as a general term for areas or regions of a frame for which operations are applied to all colour channels.
For each CU, a prediction of the contents (sample values) of the corresponding area of frame data is generated (a ‘prediction unit’, or PU). Further, a representation of the difference (or ‘spatial domain’ residual) between the prediction and the contents of the area as seen at input to the encoder is formed. The difference in each colour channel may be transformed and coded as a sequence of residual coefficients, forming one or more TUs for a given CU. The applied transform may be a Discrete Cosine Transform (DCT) or other transform, applied to each block of residual values. The transform is applied separably (i.e., the two-dimensional transform is performed in two passes). The block is firstly transformed by applying a one-dimensional transform to each row of samples in the block. Then, the partial result is transformed by applying a one-dimensional transform to each column of the partial result to produce a final block of transform coefficients that substantially decorrelates the residual samples. Transforms of various sizes are supported by the VVC standard, including transforms of rectangular-shaped blocks, with each side dimension being a power of two. Transform coefficients are quantised for entropy encoding into a bitstream.
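The two-pass, separable application of a transform described above may be sketched as follows. The sketch uses a floating-point orthonormal DCT-II purely for illustration; it is an assumption of this example and is not the integer transform specified by the VVC standard. The names `dct_1d`, `transpose` and `dct_2d_separable` are likewise hypothetical.

```python
import math


def dct_1d(samples):
    """Orthonormal floating-point DCT-II of a 1-D list (illustrative only)."""
    n = len(samples)
    coefficients = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(
            s * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i, s in enumerate(samples)
        )
        coefficients.append(scale * total)
    return coefficients


def transpose(block):
    """Swap rows and columns of a block held as a list of rows."""
    return [list(column) for column in zip(*block)]


def dct_2d_separable(block):
    """First pass transforms each row; second pass transforms each column."""
    row_pass = [dct_1d(row) for row in block]
    column_pass = [dct_1d(column) for column in transpose(row_pass)]
    return transpose(column_pass)


# A flat 4x4 residual block: all energy is expected in the DC coefficient.
flat_block = [[10.0] * 4 for _ in range(4)]
coefficients = dct_2d_separable(flat_block)
```

Because each pass operates on one dimension at a time, the same routine also handles rectangular blocks such as 2×4 without modification.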
VVC features intra-frame prediction and inter-frame prediction. Intra-frame prediction involves the use of previously processed samples in a frame being used to generate a prediction of a current block of data samples in the frame. Inter-frame prediction involves generating a prediction of a current block of samples in a frame using a block of samples obtained from a previously decoded frame. The block of samples obtained from a previously decoded frame is offset from the spatial location of the current block according to a motion vector, which often has filtering applied. Intra-frame prediction blocks can be (i) a uniform sample value (“DC intra prediction”), (ii) a plane having an offset and horizontal and vertical gradient (“planar intra prediction”), (iii) a population of the block with neighbouring samples applied in a particular direction (“angular intra prediction”) or (iv) the result of a matrix multiplication using neighbouring samples and selected matrix coefficients. Further discrepancy between a predicted block and the corresponding input samples may be corrected to an extent by encoding a ‘residual’ into the bitstream. The residual is generally transformed from the spatial domain to the frequency domain to form residual coefficients in a ‘primary transform domain’, which may be further transformed by application of a ‘secondary transform’ to produce residual coefficients in a ‘secondary transform domain’. Residual coefficients are quantised according to a quantisation parameter, resulting in a loss of accuracy of the reconstruction of the samples produced at the decoder but with a reduction in bitrate in the bitstream.
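The loss of accuracy introduced by quantisation may be illustrated with a simple uniform quantiser. In this sketch the step size is passed in directly as a hypothetical parameter; the mapping from a quantisation parameter to a step size used by the VVC standard is not reproduced here, and the function names are assumptions of the example.

```python
def quantise(coefficient, step):
    """Map a residual coefficient to an integer level (round half away from zero)."""
    sign = -1 if coefficient < 0 else 1
    return sign * int(abs(coefficient) / step + 0.5)


def dequantise(level, step):
    """Reconstruct an approximate coefficient from its quantised level."""
    return level * step


original = 37
level = quantise(original, step=8)         # fewer distinct levels to encode
reconstructed = dequantise(level, step=8)  # not equal to the original
error = reconstructed - original
```

A larger step size produces smaller levels, which cost fewer bits to encode, at the price of a larger reconstruction error.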
It is an object of the present invention to substantially overcome, or at least ameliorate, one or more disadvantages of existing arrangements.
An aspect of the present disclosure provides an apparatus configured to perform a method of converting a sample value of feature map frame data to a tensor value, the method comprising the steps of: determining a sample sign and a sample magnitude of the sample value; determining an adjusted sample magnitude based on the determined sample magnitude; determining a tensor magnitude based on the adjusted sample magnitude; determining a normalised tensor magnitude based on the determined tensor magnitude; and determining the tensor value based on the normalised tensor magnitude.
Another aspect of the present disclosure provides a method of converting a sample value of feature map frame data to a tensor value, the method comprising the steps of: determining a sample sign and a sample magnitude of the sample value; determining an adjusted sample magnitude based on the determined sample magnitude; determining a tensor magnitude based on the adjusted sample magnitude; determining a normalised tensor magnitude based on the determined tensor magnitude; and determining the tensor value based on the normalised tensor magnitude.
Another aspect of the present disclosure provides a computer-readable storage medium comprising a computer program that is executable by a processor to perform a method of converting a sample value of feature map frame data to a tensor value, the method comprising the steps of: determining a sample sign and a sample magnitude of the sample value; determining an adjusted sample magnitude based on the determined sample magnitude; determining a tensor magnitude based on the adjusted sample magnitude; determining a normalised tensor magnitude based on the determined tensor magnitude; and determining the tensor value based on the normalised tensor magnitude.
Other aspects are also disclosed.
At least one embodiment of the present invention will now be described with reference to the following drawings and an appendix, in which:
Where reference is made in any one or more of the accompanying drawings to steps and/or features, which have the same reference numerals, those steps and/or features have for the purposes of this description the same function(s) or operation(s), unless the contrary intention appears.
A distributed machine task system may include an edge device, such as a network camera or smartphone producing intermediate compressed data. The distributed machine task system may also include a final device, such as a server farm based (‘cloud’) application, operating on the intermediate compressed data to produce some task result. Additionally, the edge device functionality may be embodied in the cloud and the intermediate compressed data may be stored for later processing, potentially for multiple different tasks depending on need.
A convenient form of intermediate compressed data is a compressed video bitstream, owing to the availability of high-performing compression standards and implementations thereof. Video compression standards typically operate on integer samples of some given bit depth, such as 10 bits, arranged in planar arrays. Colour video has three planar arrays, corresponding, for example, to colour components Y, Cb, Cr, or R, G, B, depending on application. CNNs typically operate on floating point data in the form of tensors, which generally have a much smaller spatial dimensionality compared to incoming video data upon which the CNN operates but have many more channels than the three channels typical of colour video data.
Tensors typically have the following dimensions: Frames, channels, height, and width. For example, a tensor of dimensions [1, 256, 76, 136] would be said to contain two-hundred and fifty-six (256) feature maps, each of size 136×76. For video data, inferencing is typically performed one frame at a time, rather than using tensors containing multiple frames.
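As a minimal sketch, slicing a tensor of dimensions [1, 256, 76, 136] along the channel dimension may be illustrated as follows. The nested-list representation and the helper names `make_tensor`, `tensor_shape` and `slice_feature_maps` are assumptions made for this example.

```python
def make_tensor(frames, channels, height, width, fill=0.0):
    """Build a 4-D tensor as nested lists indexed [frame][channel][row][column]."""
    return [
        [[[fill for _ in range(width)] for _ in range(height)]
         for _ in range(channels)]
        for _ in range(frames)
    ]


def tensor_shape(tensor):
    """Return [frames, channels, height, width] of a nested-list tensor."""
    return [len(tensor), len(tensor[0]), len(tensor[0][0]), len(tensor[0][0][0])]


def slice_feature_maps(tensor, frame_index=0):
    """Slice along the channel dimension: one 2-D feature map per channel."""
    return [channel for channel in tensor[frame_index]]


tensor = make_tensor(1, 256, 76, 136)
feature_maps = slice_feature_maps(tensor)
map_height = len(feature_maps[0])      # rows in each feature map
map_width = len(feature_maps[0][0])    # columns in each feature map
```

The slice yields 256 feature maps, each 136 samples wide and 76 samples high, matching the example dimensions given above.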
VVC encoders and decoders include a capability signalling mechanism known as ‘constraints’. Early in a bitstream, a set of constraints is present indicating which capabilities of the VVC standard are not used in the bitstream. Constraints are signalled along with ‘profile’ and ‘level’ of the bitstream. The profile indicates broadly which set of tools is required to be available to decode the bitstream. Constraints also provide a fine granularity of control of which tools are further constrained in the specified profile. The further constraining is referred to as ‘subprofiling’. Depending on the type of data being encoded by the video encoder, defining a subset of tools using a subprofile allows the decoder to know before commencing bitstream decoding that a subset of the coding tools of the indicated profile of the bitstream are to be used.
The system 100 includes a source device 110 for generating encoded data in the form of encoded video information. The system 100 also includes a destination device 140. A communication channel 130 is used to communicate the encoded video information from the source device 110 to the destination device 140. In some arrangements, the source device 110 and destination device 140 may either or both comprise respective mobile telephone handsets (e.g., “smartphones”) or network cameras and cloud applications. The communication channel 130 may be a wired connection, such as Ethernet, or a wireless connection, such as WiFi or 5G. Moreover, the source device 110 and the destination device 140 may comprise applications where encoded video data is captured on some computer-readable storage medium, such as a hard disk drive in a file server.
As shown in
The CNN backbone 114 receives the video frame data 113 and performs specific layers of an overall CNN, such as layers corresponding to the ‘backbone’ of the CNN. The backbone layers of the CNN may produce multiple tensors as output, for example, corresponding to different spatial scales of an input image represented by the video frame data 113. A ‘feature pyramid network’ (FPN) architecture may result in three tensors, corresponding to three layers, output from the backbone 114, with varying spatial resolution and channel count. The feature map quantiser and packer 116 receives tensors 115, which are output from the CNN backbone 114. The feature map quantiser and packer 116 acts to interface an internal layer of the overall CNN, which is the output of the CNN backbone 114, to the video encoder 120 by quantising floating point values in the tensors 115 into data samples that are packed into frames 119. For example, the resolution of frames 119 may be 2056×1224, and the bit depth of frames 119 may be 10 bits. Slicing the tensors 115 along the channel dimension results in extracting one feature map per channel, where the feature maps of a given tensor have a specific size that is determined from additional dimensions of the tensor. Where an FPN is used, multiple tensors per incoming frame are produced including multiple sets of feature maps, each set of feature maps having a different spatial resolution. Feature maps of all layers are packed into planar video frames, such as packed feature map frames 117. The multiplexor 118 selects the packed feature map frames 117 if the source device 110 is configured to encode feature maps or the frame data 113 if the source device 110 is configured to encode video data, outputting frames 119 to an encoding unit in the form of the video encoder 120. The selection between feature maps and regular video data is encoded in the bitstream using a ‘frame_type’ syntax element in a metadata SEI message. 
The metadata SEI message is described with reference to Appendix A. The frames 119 are input to the video encoder 120 where lossy compression is applied to the frames 119 to produce the bitstream 121. The bitstream 121 is supplied to the transmitter 122 for transmission over the communications channel 130 or the bitstream 121 is written to storage 132 for later use.
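The role of the feature map quantiser and packer 116 may be illustrated by the following sketch, which converts floating point feature map values into 10-bit integer samples and tiles the feature maps into a planar frame. The linear min/max quantisation and the raster-order tiling shown are assumptions made for this example only; they are not stated here to be the quantisation or packing arrangement of the packer 116, and the function names are hypothetical.

```python
def quantise_feature_map(feature_map, min_value, max_value, bit_depth=10):
    """Linearly map floats in [min_value, max_value] to integer samples.

    Assumed scheme for illustration: values outside the range are clipped.
    """
    max_sample = (1 << bit_depth) - 1
    scale = max_sample / (max_value - min_value)
    return [
        [min(max_sample, max(0, int((v - min_value) * scale + 0.5))) for v in row]
        for row in feature_map
    ]


def pack_feature_maps(feature_maps, frame_width, frame_height, fill=0):
    """Tile equally sized feature maps into one planar frame in raster order."""
    map_height = len(feature_maps[0])
    map_width = len(feature_maps[0][0])
    maps_per_row = frame_width // map_width
    frame = [[fill] * frame_width for _ in range(frame_height)]
    for index, feature_map in enumerate(feature_maps):
        top = (index // maps_per_row) * map_height
        left = (index % maps_per_row) * map_width
        if top + map_height > frame_height:
            raise ValueError("frame too small for all feature maps")
        for y in range(map_height):
            frame[top + y][left:left + map_width] = feature_map[y]
    return frame


map_a = quantise_feature_map([[-1.0, 1.0], [0.0, 0.5]], -1.0, 1.0)
map_b = quantise_feature_map([[1.0, 1.0], [-1.0, -1.0]], -1.0, 1.0)
packed_frame = pack_feature_maps([map_a, map_b], frame_width=4, frame_height=2)
```

The resulting planar array of integer samples has the form expected at the input of a video encoder operating at the chosen bit depth.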
After conversion to tensors by the CNN backbone 114, the content of the resulting feature maps can no longer identify individuals that would be clearly identifiable in the video data 113. Storage of the feature maps (e.g., in compressed form) using the storage 132 may be more secure from a user privacy point of view, particularly in relation to European General Data Protection Regulation (GDPR) requirements for pseudonymisation or anonymisation.
The source device 110 supports a particular network for the CNN backbone 114. However, the destination device 140 may use one of several networks for the head CNN 150. In this way, partially processed data in the form of packed feature maps may be stored for later use in performing various tasks without needing to again perform the operation of the CNN backbone 114. The video encoder 120 uses a particular set of coding tools (or ‘profile’) of VVC to encode the frame data 119.
The bitstream 121 is transmitted by the transmitter 122 over the communication channel 130 as encoded video data (or “encoded video information”). The bitstream 121 can in some implementations be stored in the storage 132, where the storage 132 is a non-transitory storage device such as a “Flash” memory or a hard disk drive, until later being transmitted over the communication channel 130 (or in-lieu of transmission over the communication channel 130). For example, encoded video data may be served upon demand to customers over a wide area network (WAN) for a video streaming application.
The destination device 140 includes a receiver 142, a video decoder 144, a demultiplexor 146, a feature map unpacker and inverse quantiser 148, a CNN head 150, a CNN task 152, and a display device 160. The receiver 142 receives encoded video data from the communication channel 130 and passes received video data to the video decoder 144 as a bitstream (indicated by an arrow 143). The video decoder 144 then outputs decoded frame data (indicated by an arrow 145) to the demultiplexor 146. Decoded metadata 155 is also extracted from the bitstream 143 by the video decoder 144 and passed to a feature map unpacker and inverse quantiser 148. The decoded metadata 155 is typically obtained from a ‘supplementary enhancement information’ (SEI) message 1413 (see
Notwithstanding the example devices mentioned above, each of the source device 110 and destination device 140 may be configured within a general purpose computing system, typically through a combination of hardware and software components.
The computer module 201 typically includes at least one processor unit 205, and a memory unit 206. For example, the memory unit 206 may have semiconductor random access memory (RAM) and semiconductor read only memory (ROM). The computer module 201 also includes a number of input/output (I/O) interfaces including: an audio-video interface 207 that couples to the video display 214, loudspeakers 217 and microphone 280; an I/O interface 213 that couples to the keyboard 202, mouse 203, scanner 226, camera 227 and optionally a joystick or other human interface device (not illustrated); and an interface 208 for the external modem 216 and printer 215. The signal from the audio-video interface 207 to the computer monitor 214 is generally the output of a computer graphics card. In some implementations, the modem 216 may be incorporated within the computer module 201, for example within the interface 208. The computer module 201 also has a local network interface 211, which permits coupling of the computer system 200 via a connection 223 to a local-area communications network 222, known as a Local Area Network (LAN). As illustrated in
The I/O interfaces 208 and 213 may afford either or both of serial and parallel connectivity, the former typically being implemented according to the Universal Serial Bus (USB) standards and having corresponding USB connectors (not illustrated). Storage devices 209 are provided and typically include a hard disk drive (HDD) 210. Other storage devices such as a floppy disk drive and a magnetic tape drive (not illustrated) may also be used. An optical disk drive 212 is typically provided to act as a non-volatile source of data. Portable memory devices, such as optical disks (e.g. CD-ROM, DVD, Blu-ray Disc™), USB-RAM, portable external hard drives, and floppy disks, for example, may be used as appropriate sources of data to the computer system 200. Typically, any of the HDD 210, optical drive 212, networks 220 and 222 may also be configured to operate as the video source 112, or as a destination for decoded video data to be stored for reproduction via the display 214. The source device 110 and the destination device 140 of the system 100 may be embodied in the computer system 200.
The components 205 to 213 of the computer module 201 typically communicate via an interconnected bus 204 and in a manner that results in a conventional mode of operation of the computer system 200 known to those in the relevant art. For example, the processor 205 is coupled to the system bus 204 using a connection 218. Likewise, the memory 206 and optical disk drive 212 are coupled to the system bus 204 by connections 219. Examples of computers on which the described arrangements can be practised include IBM-PC's and compatibles, Sun SPARCstations, Apple Mac™ or like computer systems.
Where appropriate or desired, the video encoder 120 and the video decoder 144, as well as methods described below, may be implemented using the computer system 200. In particular, the video encoder 120, the video decoder 144 and methods to be described, may be implemented as one or more software application programs 233 executable within the computer system 200. The video encoder 120, the video decoder 144 and the steps of the described methods are effected by instructions 231 (see
The software may be stored in a computer readable medium, including the storage devices described below, for example. The software is loaded into the computer system 200 from the computer readable medium, and then executed by the computer system 200. A computer readable medium having such software or computer program recorded on the computer readable medium is a computer program product. The use of the computer program product in the computer system 200 preferably effects an advantageous apparatus for implementing the source device 110 and the destination device 140 and the described methods.
The software 233 is typically stored in the HDD 210 or the memory 206. The software is loaded into the computer system 200 from a computer readable medium, and executed by the computer system 200. Thus, for example, the software 233 may be stored on an optically readable disk storage medium (e.g., CD-ROM) 225 that is read by the optical disk drive 212.
In some instances, the application programs 233 may be supplied to the user encoded on one or more CD-ROMs 225 and read via the corresponding drive 212, or alternatively may be read by the user from the networks 220 or 222. Still further, the software can also be loaded into the computer system 200 from other computer readable media. Computer readable storage media refers to any non-transitory tangible storage medium that provides recorded instructions and/or data to the computer system 200 for execution and/or processing. Examples of such storage media include floppy disks, magnetic tape, CD-ROM, DVD, Blu-ray Disc™, a hard disk drive, a ROM or integrated circuit, USB memory, a magneto-optical disk, or a computer readable card such as a PCMCIA card and the like, whether or not such devices are internal or external of the computer module 201. Examples of transitory or non-tangible computer readable transmission media that may also participate in the provision of the software, application programs, instructions and/or video data or encoded video data to the computer module 201 include radio or infra-red transmission channels, as well as a network connection to another computer or networked device, and the Internet or Intranets including e-mail transmissions and information recorded on Websites and the like.
The second part of the application program 233 and the corresponding code modules mentioned above may be executed to implement one or more graphical user interfaces (GUIs) to be rendered or otherwise represented upon the display 214. Through manipulation of typically the keyboard 202 and the mouse 203, a user of the computer system 200 and the application may manipulate the interface in a functionally adaptable manner to provide controlling commands and/or input to the applications associated with the GUI(s). Other forms of functionally adaptable user interfaces may also be implemented, such as an audio interface utilizing speech prompts output via the loudspeakers 217 and user voice commands input via the microphone 280.
When the computer module 201 is initially powered up, a power-on self-test (POST) program 250 executes. The POST program 250 is typically stored in a ROM 249 of the semiconductor memory 206 of
The operating system 253 manages the memory 234 (209, 206) to ensure that each process or application running on the computer module 201 has sufficient memory in which to execute without colliding with memory allocated to another process. Furthermore, the different types of memory available in the computer system 200 of
As shown in
The application program 233 includes a sequence of instructions 231 that may include conditional branch and loop instructions. The program 233 may also include data 232 which is used in execution of the program 233. The instructions 231 and the data 232 are stored in memory locations 228, 229, 230 and 235, 236, 237, respectively. Depending upon the relative size of the instructions 231 and the memory locations 228-230, a particular instruction may be stored in a single memory location as depicted by the instruction shown in the memory location 230. Alternately, an instruction may be segmented into a number of parts each of which is stored in a separate memory location, as depicted by the instruction segments shown in the memory locations 228 and 229.
In general, the processor 205 is given a set of instructions which are executed therein. The processor 205 waits for a subsequent input, to which the processor 205 reacts by executing another set of instructions. Each input may be provided from one or more of a number of sources, including data generated by one or more of the input devices 202, 203, data received from an external source across one of the networks 220, 222, data retrieved from one of the storage devices 206, 209 or data retrieved from a storage medium 225 inserted into the corresponding reader 212, all depicted in
The video encoder 120, the video decoder 144 and the described methods may use input variables 254, which are stored in the memory 234 in corresponding memory locations 255, 256, 257. The video encoder 120, the video decoder 144 and the described methods produce output variables 261, which are stored in the memory 234 in corresponding memory locations 262, 263, 264. Intermediate variables 258 may be stored in memory locations 259, 260, 266 and 267.
Referring to the processor 205 of
Thereafter, a further fetch, decode, and execute cycle for the next instruction may be executed. Similarly, a store cycle may be performed by which the control unit 239 stores or writes a value to a memory location 232.
Each step or sub-process in the methods of
As seen in
The CBL module 360 takes as input a tensor 361, which is passed to a convolutional layer 362 to produce a tensor 363. When the convolutional layer 362 has a stride of one, the tensor 363 has the same spatial dimensions as the tensor 361. When the convolutional layer 362 has a larger stride, such as two, the tensor 363 has smaller spatial dimensions compared to the tensor 361, for example, halved in size for the stride of two. Regardless of the stride, the size of the channel dimension of the tensor 363 may vary compared to the channel dimension of the tensor 361 for a particular CBL block. The tensor 363 is passed to a batch normalisation module 364 which outputs a tensor 365. The batch normalisation module 364 normalises the input tensor 363 and applies a scaling factor and offset value to produce the output tensor 365. The scaling factor and offset value are derived from a training process. The tensor 365 is passed to a leaky rectified linear activation (“LeakyReLU”) module 366 to produce a tensor 367. The module 366 provides a ‘leaky’ activation function whereby positive values in the tensor are passed through and negative values are severely reduced in magnitude, for example, to 0.1× their former value.
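The stride and activation behaviour described above may be illustrated with a minimal sketch. The function names `leaky_relu` and `conv_output_size` are hypothetical, and the output-size rule assumes a convolution padded so that a stride of one preserves the spatial size; neither is taken from the described arrangements.

```python
def leaky_relu(values, negative_slope=0.1):
    """Pass positive values through and scale the remaining values by negative_slope."""
    return [v if v > 0 else negative_slope * v for v in values]


def conv_output_size(input_size, stride):
    """Spatial size after a convolution, assuming padding that preserves size at stride one."""
    return (input_size + stride - 1) // stride  # ceiling division


print(leaky_relu([2.0, -1.0, 0.0, -4.0]))   # [2.0, -0.1, 0.0, -0.4]
print(conv_output_size(608, 1))             # 608 (stride of one keeps the size)
print(conv_output_size(608, 2))             # 304 (stride of two halves the size)
```

The slope of 0.1 matches the example reduction given for the module 366; other slopes could be substituted through the `negative_slope` argument.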
The tensor 316 is passed from the CBL block 314 to a residual block 11 module 320, containing a concatenation of 11 residual units internally.
A residual block is described with reference to a ResBlock 340 as shown in
The Res11 module 320 outputs a tensor 322, which is output from the backbone module 310 as one of the layers and also provided to a Res8 module 324. The Res8 module 324 is a residual block (i.e., 340), which includes eight residual units (i.e. 350). The Res8 module 324 produces a tensor 326, which is passed to a Res4 module 328 and also output from the backbone module 310 as one of the layers. The Res4 module 328 is a residual block (i.e., 340), which includes four residual units (i.e. 350). The Res4 module 328 produces a tensor 329 which is output from the backbone module 310 as one of the layers. Collectively, the layer tensors 322, 326, and 329 are output as tensors 115. The backbone CNN 310 may take as input a video frame of resolution 1088×608 and produce three tensors, corresponding to three layers, with the following dimensions: [1, 256, 76, 136], [1, 512, 38, 68], [1, 1024, 19, 34]. Another example of the three tensors corresponding to three layers may be [1, 512, 34, 19], [1, 256, 68, 38], [1, 128, 136, 76] which are respectively separated at the 75th feature map, the 90th feature map, and the 105th feature map in the CNN 310. The separating points depend on the CNN 310.
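The first set of example dimensions is consistent with the three layers being downsampled by factors of 8, 16 and 32 relative to the input frame. The following sketch reproduces the dimensions under that assumption; the downsampling factors are inferred from the example figures rather than stated, and the function name is hypothetical.

```python
def backbone_layer_shapes(width, height,
                          channels=(256, 512, 1024),
                          downsampling=(8, 16, 32)):
    """Return [batch, channels, rows, columns] for each output layer.

    The downsampling factors are an assumption inferred from the
    example dimensions for a 1088x608 input frame.
    """
    return [[1, c, height // d, width // d]
            for c, d in zip(channels, downsampling)]


for shape in backbone_layer_shapes(1088, 608):
    print(shape)
# [1, 256, 76, 136]
# [1, 512, 38, 68]
# [1, 1024, 19, 34]
```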
The bitstream 121 includes a ‘qr_update’ flag in metadata (see Appendix A) indicating whether the quantisation ranges were updated or not. A single quantisation range may be used to represent the maximum magnitude of any value prior to quantisation within the feature maps of the group to which the quantisation range belongs. In another arrangement, separate quantisation ranges are used for the maximum positive value and the maximum negative value within the feature map group, resulting in an asymmetric quantisation range, with two values per group.
The tensors 115 generally have 32-bit floating-point precision values and so each quantisation range is also a floating point value. Other floating point precisions are possible, such as 16-bit and 8-bit, and various allocations of bits to the exponent and fraction portions of the floating point values are also possible.
The quantisation ranges 516 are passed to a quantiser module 518 and output as part of the metadata 125. The quantiser module 518 quantises each feature map into sample values in two stages. Firstly, the quantisation range of the feature map group to which the feature map belongs is used to normalise the feature map values, resulting in values in the range [−1, 1]. Secondly, the normalised feature map values are scaled into a sample range corresponding to the bit-depth of the video encoder 120. For 10-bit operation, the normalised feature maps are multiplied by 512, then an offset of 512 is added and the sum is converted to integer precision and output as integerised feature maps 520. The multiplication and addition operation results in utilisation of at least one value at the minimum or maximum allowed sample value (i.e. zero (0) or one-thousand and twenty-three (1023) for 10-bit video) among the feature maps of a given feature map group. To provide some resilience to overshoot that may occur at the output of the video decoder 144, the multiplicative factor applied to the normalised feature maps may be reduced compared to the maximum possible multiplicative factor that could be used without introducing clipping. For regular video represented in YCbCr colour space, a ‘video range’ of sixteen (16) to two-hundred and thirty-five (235) for 8-bit video data and sixty-four (64) to nine-hundred and forty (940) for 10-bit video data is defined. Accordingly, a reduction of the multiplicative factor to ⅞ of the full value can be applied, resulting in a similar sample range as seen in the video range of YCbCr video data. The resulting multiplicative factor would be ⅞×(1<<(bit_depth−1)). The offset factor used to shift the negative tensor values into a positive range is left at the half-way point, i.e. 1<<(bit_depth−1), corresponding to the default predictor for unavailable reference samples for intra-prediction, as described with reference to
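The two-stage quantisation may be sketched as follows, assuming a single symmetric quantisation range per feature map group. The function name, the use of round-to-nearest for the integer conversion, and the final clipping step are illustrative assumptions and are not taken from the quantiser module 518.

```python
def quantise_feature_map(values, quant_range, bit_depth=10, headroom=7 / 8):
    """Map floating-point feature map values to integer sample values.

    quant_range: maximum magnitude within the feature map group.
    headroom:    fraction of the full multiplicative factor (7/8 leaves
                 a margin similar to the YCbCr 'video range').
    """
    half = 1 << (bit_depth - 1)          # offset, e.g. 512 for 10-bit
    scale = headroom * half              # multiplicative factor
    max_sample = (1 << bit_depth) - 1    # e.g. 1023 for 10-bit
    samples = []
    for v in values:
        normalised = v / quant_range                      # stage 1: [-1, 1]
        sample = int(round(normalised * scale + half))    # stage 2 (rounding assumed)
        samples.append(min(max(sample, 0), max_sample))   # clip to sample range
    return samples


print(quantise_feature_map([-4.0, 0.0, 2.0, 4.0], quant_range=4.0))
# [64, 512, 736, 960]
print(quantise_feature_map([-4.0, 4.0], quant_range=4.0, headroom=1.0))
# [0, 1023]
```

With the ⅞ factor the 10-bit samples span 64 to 960, close to the 64 to 940 video range, whereas the full factor reaches the extreme sample values of 0 and 1023.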
Although the video encoder 120 of
Although operation is generally described on a CTU-by-CTU basis, the video encoder 120 and the video decoder 144 may operate on a smaller-sized region to reduce memory consumption. For example, each CTU may be divided into smaller regions, known as ‘virtual pipeline data units’ (VPDUs) of size 64×64. The VPDUs form a granularity of data that is more amenable to pipeline processing in hardware architectures where the reduction in memory footprint reduces silicon area and hence cost, compared to operating on full CTUs. When the CTU size is 128×128, restrictions on allowed coding trees are in place to ensure that processing of one VPDU is fully completed before progressing to the next VPDU. For example, at the root node of the coding tree of a 128×128 CTU, ternary splitting is prohibited as the resulting CUs (such as 32×128/128×32 or further decompositions thereof) could not be processed with the required progression from one 64×64 region to a subsequent 64×64 region. When the CTU size is 64×64, regardless of the coding tree selected by the encoder, processing necessarily completes one 64×64 region before progressing to the next 64×64 region (i.e. from one CTU to the next).
The CTUs resulting from the first division of the frame data 119 may be scanned in raster scan order and may be grouped into one or more ‘slices’. A slice may be an ‘intra’ (or ‘I’) slice. An intra slice (I slice) indicates that every CU in the slice is intra predicted. Generally, the first picture in a coded layer video sequence (CLVS) contains only I slices, and is referred to as an ‘intra picture’. The CLVS may contain periodic intra pictures, forming ‘random access points’ (i.e., intermediate frames in a video sequence upon which decoding can commence). Alternatively, a slice may be uni- or bi-predicted (‘P’ or ‘B’ slice, respectively), indicating additional availability of uni- and bi-prediction in the slice, respectively.
The video encoder 120 encodes sequences of pictures according to a picture structure. One picture structure is ‘low delay’, in which case pictures using inter-prediction may only reference pictures occurring previously in the sequence. Low delay enables each picture to be output as soon as it is decoded, in addition to being stored for possible reference by a subsequent picture. Another picture structure is ‘random access’, whereby the coding order of pictures differs from the display order. Random access allows inter-predicted pictures to reference other pictures that, although decoded, have not yet been output. A degree of picture buffering is needed so that the reference pictures in the future in terms of display order are present in the decoded picture buffer, resulting in a latency of multiple frames.
When a chroma format other than 4:0:0 is in use, in an I slice, the coding tree of each CTU may diverge below the 64×64 level into two separate coding trees, one for luma and another for chroma. Use of separate trees allows different block structure to exist between luma and chroma within a luma 64×64 area of a CTU. For example, a large chroma CB may be collocated with numerous smaller luma CBs and vice versa. In a P or B slice, a single coding tree of a CTU defines a block structure common to luma and chroma. The resulting blocks of the single tree may be intra predicted or inter predicted.
In addition to a division of pictures into slices, pictures may also be divided into ‘tiles’. A tile is a sequence of CTUs covering a rectangular region of a picture. CTU scanning occurs in a raster-scan manner within each tile and progresses from one tile to the next. A slice can be either an integer number of tiles, or an integer number of consecutive rows of CTUs within a given tile.
For each CTU, the video encoder 120 operates in two stages. In the first stage (referred to as a ‘search’ stage), the block partitioner 610 tests various potential configurations of a coding tree. Each potential configuration of a coding tree has associated ‘candidate’ CBs. The first stage involves testing various candidate CBs to select CBs providing relatively high compression efficiency with relatively low distortion. The testing generally involves a Lagrangian optimisation whereby a candidate CB is evaluated based on a weighted combination of rate (i.e., coding cost) and distortion (i.e., error with respect to the input frame data 119). ‘Best’ candidate CBs (i.e., the CBs with the lowest evaluated rate/distortion) are selected for subsequent encoding into the bitstream 121. Included in evaluation of candidate CBs is an option to use a CB for a given area or to further split the area according to various splitting options and code each of the smaller resulting areas with further CBs, or split the areas even further. As a consequence, both the coding tree and the CBs themselves are selected in the search stage.
The video encoder 120 produces a prediction block (PB), indicated by an arrow 620, for each CB, for example, CB 612. The PB 620 is a prediction of the contents of the associated CB 612. A subtracter module 622 produces a difference, indicated as 624 (or ‘residual’, referring to the difference being in the spatial domain), between the PB 620 and the CB 612. The difference 624 is a block-size difference between corresponding samples in the PB 620 and the CB 612. The difference 624 is transformed, quantised and represented as a transform block (TB), indicated by an arrow 636. The PB 620 and associated TB 636 are typically chosen from one of many possible candidate CBs, for example, based on evaluated cost or distortion.
A candidate coding block (CB) is a CB resulting from one of the prediction modes available to the video encoder 120 for the associated PB and the resulting residual. When combined with the predicted PB in the video encoder 120, the TB 636 reduces the difference between a decoded CB and the original CB 612 at the expense of additional signalling in a bitstream.
Each candidate coding block (CB), that is prediction block (PB) in combination with a transform block (TB), thus has an associated coding cost (or ‘rate’) and an associated difference (or ‘distortion’). The distortion of the CB is typically estimated as a difference in sample values, such as a sum of absolute differences (SAD), a sum of squared differences (SSD) or a Hadamard transform applied to the differences. The estimate resulting from each candidate PB may be determined by a mode selector 686 using the difference 624 to determine a prediction mode 687. The prediction mode 687 indicates the decision to use a particular prediction mode for the current CB, for example, intra-frame prediction or inter-frame prediction. Estimation of the coding costs associated with each candidate prediction mode and corresponding residual coding may be performed at significantly lower cost than entropy coding of the residual. Accordingly, a number of candidate modes may be evaluated to determine an optimum mode in a rate-distortion sense even in a real-time video encoder.
Determining an optimum mode in terms of rate-distortion is typically achieved using a variation of Lagrangian optimisation.
Lagrangian or similar optimisation processing can be employed both to select an optimal partitioning of a CTU into CBs (by the block partitioner 610) and to select a best prediction mode from a plurality of possibilities. Through application of a Lagrangian optimisation process to the candidate modes in the mode selector module 686, the intra prediction mode with the lowest cost measurement is selected as the ‘best’ mode. The lowest cost mode includes the selected secondary transform index 688, which is also encoded in the bitstream 121 by an entropy encoder 638.
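A Lagrangian selection of this kind commonly minimises a cost of the form J = D + λ·R, where D is the distortion, R is the rate and λ is a multiplier weighting rate against distortion. The following sketch shows such a selection over a list of candidates; the candidate names, distortion values, rates and multipliers are invented purely for illustration and do not represent the operation of the mode selector 686.

```python
def select_best_candidate(candidates, lagrange_multiplier):
    """Return the candidate with the lowest cost J = D + lambda * R.

    candidates: list of (name, distortion, rate_in_bits) tuples.
    """
    return min(candidates,
               key=lambda c: c[1] + lagrange_multiplier * c[2])


# Hypothetical candidates: (mode name, distortion, rate in bits).
candidates = [
    ("dc", 1200.0, 10),
    ("planar", 900.0, 14),
    ("angular", 400.0, 40),
]

print(select_best_candidate(candidates, 10.0)[0])    # angular
print(select_best_candidate(candidates, 50.0)[0])    # planar
print(select_best_candidate(candidates, 100.0)[0])   # dc
```

A small multiplier favours the low-distortion candidate despite its higher rate, while a larger multiplier shifts the selection towards cheaper candidates.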
In the second stage of operation of the video encoder 120 (referred to as a ‘coding’ stage), an iteration over the determined coding tree(s) of each CTU is performed in the video encoder 120. For a CTU using separate trees, for each 64×64 luma region of the CTU, a luma coding tree is firstly encoded followed by a chroma coding tree. Within the luma coding tree, only luma CBs are encoded and within the chroma coding tree only chroma CBs are encoded. For a CTU using a shared tree, a single tree describes the CUs (i.e., the luma CBs and the chroma CBs) according to the common block structure of the shared tree.
The entropy encoder 638 supports bitwise coding of syntax elements using variable-length and fixed-length codewords, and an arithmetic coding mode for syntax elements. Portions of the bitstream such as ‘parameter sets’, for example, sequence parameter set (SPS) and picture parameter set (PPS) use a combination of fixed-length codewords and variable-length codewords. Slices, also referred to as contiguous portions, have a slice header that uses variable length coding followed by slice data, which uses arithmetic coding. The slice header defines parameters specific to the current slice, such as slice-level quantisation parameter offsets. The slice data includes the syntax elements of each CTU in the slice. Use of variable length coding and arithmetic coding requires sequential parsing within each portion of the bitstream. The portions may be delineated with a start code to form ‘network abstraction layer units’ or ‘NAL units’. Arithmetic coding is supported using a context-adaptive binary arithmetic coding process.
Arithmetically coded syntax elements consist of sequences of one or more ‘bins’. Bins, like bits, have a value of ‘0’ or ‘1’. However, bins are not encoded in the bitstream 121 as discrete bits. Bins have an associated predicted (or ‘likely’ or ‘most probable’) value and an associated probability, known as a ‘context’. When the actual bin to be coded matches the predicted value, a ‘most probable symbol’ (MPS) is coded. Coding a most probable symbol is relatively inexpensive in terms of consumed bits in the bitstream 121, including costs that amount to less than one discrete bit. When the actual bin to be coded mismatches the likely value, a ‘least probable symbol’ (LPS) is coded. Coding a least probable symbol has a relatively high cost in terms of consumed bits. The bin coding techniques enable efficient coding of bins where the probability of a ‘0’ versus a ‘1’ is skewed. For a syntax element with two possible values (i.e., a ‘flag’), a single bin is adequate. For syntax elements with many possible values, a sequence of bins is needed.
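The relative costs of the most probable and least probable symbols can be illustrated using the ideal information cost of −log2(p) bits for a symbol of probability p. This is an idealised model of an arithmetic coder, not the context state machine of any particular standard, and the function name is hypothetical.

```python
import math


def bin_cost_bits(probability_of_mps, is_mps):
    """Ideal cost in bits of coding one bin under a given context probability."""
    p = probability_of_mps if is_mps else 1.0 - probability_of_mps
    return -math.log2(p)


# A skewed context: the predicted value is expected 90% of the time.
print(round(bin_cost_bits(0.9, True), 3))    # 0.152 (less than one bit)
print(round(bin_cost_bits(0.9, False), 3))   # 3.322 (relatively expensive)
# An equiprobable bin costs exactly one bit either way.
print(bin_cost_bits(0.5, True))              # 1.0
```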
The presence of later bins in the sequence may be determined based on the value of earlier bins in the sequence. Additionally, each bin may be associated with more than one context. The selection of a particular context may be dependent on earlier bins in the syntax element, the bin values of neighbouring syntax elements (i.e. those from neighbouring blocks) and the like. Each time a context-coded bin is encoded, the context that was selected for that bin (if any) is updated in a manner reflective of the new bin value. As such, the binary arithmetic coding scheme is said to be adaptive.
Also supported by the entropy encoder 638 are bins that lack a context, referred to as “bypass bins”. Bypass bins are coded assuming an equiprobable distribution between a ‘0’ and a ‘1’. Thus, each bypass bin has a coding cost of one bit in the bitstream 121. The absence of a context saves memory and reduces complexity, and thus bypass bins are used where the distribution of values for the particular bin is not skewed. One example of an entropy coder employing context and adaptation is known in the art as CABAC (context adaptive binary arithmetic coder) and many variants of this coder have been employed in video coding.
The entropy encoder 638 encodes a quantisation parameter 692 and, if in use for the current CB, the LFNST index 688, using a combination of context-coded and bypass-coded bins. The quantisation parameter 692 is encoded using a ‘delta QP’. The delta QP is signalled at most once in each area known as a ‘quantisation group’. The quantisation parameter 692 is applied to residual coefficients of the luma CB. An adjusted quantisation parameter is applied to the residual coefficients of collocated chroma CBs. The adjusted quantisation parameter may include mapping from the luma quantisation parameter 692 according to a mapping table and a CU-level offset, selected from a list of offsets. The secondary transform index 688 is signalled when the residual associated with the transform block includes significant residual coefficients only in those coefficient positions subject to transforming into primary coefficients by application of a secondary transform.
Residual coefficients of each TB associated with a CB are coded using a residual syntax. The residual syntax is designed to efficiently encode coefficients with low magnitudes, using mainly arithmetically coded bins to indicate significance of coefficients, along with lower-valued magnitudes and reserving bypass bins for higher magnitude residual coefficients. Accordingly, residual blocks comprising very low magnitude values and sparse placement of significant coefficients are efficiently compressed. Moreover, two residual coding schemes are present. A regular residual coding scheme is optimised for TBs with significant coefficients predominantly located in the upper-left corner of the TB, as is seen when a transform is applied. A transform-skip residual coding scheme is available for TBs where a transform is not performed and is able to efficiently encode residual coefficients regardless of their distribution throughout the TB.
A multiplexer module 684 outputs the PB 620 from an intra-frame prediction module 664 according to the determined best intra prediction mode, selected from the tested prediction modes of each candidate CB. The candidate prediction modes need not include every conceivable prediction mode supported by the video encoder 120. Intra prediction falls into three types. The first type is “DC intra prediction”, which involves populating a PB with a single value representing the average of nearby reconstructed samples. The second type is “planar intra prediction”, which involves populating a PB with samples according to a plane, with a DC offset and a vertical and horizontal gradient being derived from nearby reconstructed neighbouring samples. The nearby reconstructed samples typically include a row of reconstructed samples above the current PB, extending to the right of the PB to an extent, and a column of reconstructed samples to the left of the current PB, extending downwards beyond the PB to an extent. The third type is “angular intra prediction”, which involves populating a PB with reconstructed neighbouring samples filtered and propagated across the PB in a particular direction (or ‘angle’). In VVC, sixty-five (65) angles are supported, with rectangular blocks able to utilise additional angles, not available to square blocks, to produce a total of eighty-seven (87) angles.
A fourth type of intra prediction is available to chroma PBs, whereby the PB is generated from collocated luma reconstructed samples according to a ‘cross-component linear model’ (CCLM) mode. Three different CCLM modes are available, each mode using a different model derived from the neighbouring luma and chroma samples. The derived model is used to generate a block of samples for the chroma PB from the collocated luma samples. Luma blocks may be intra predicted using a matrix multiplication of the reference samples using one matrix selected from a predefined set of matrices. This matrix intra prediction (MIP) achieves gain by using matrices trained on a large set of video data, with the matrices representing relationships between reference samples and a predicted block that are not easily captured in angular, planar, or DC intra prediction modes.
The module 664 may also produce a prediction unit by copying a block from a nearby area of the current frame using an ‘intra block copy’ (IBC) method. The location of the reference block is constrained to an area equivalent to one CTU, divided into 64×64 regions known as VPDUs, with the area covering the processed VPDUs of the current CTU and VPDUs of the previous CTU(s) within each row of CTUs and within each slice or tile, up to the area limit corresponding to 128×128 luma samples, regardless of the configured CTU size for the bitstream. This area is known as an ‘IBC virtual buffer’ and limits the IBC reference area, thus limiting the required storage. The IBC buffer is populated with reconstructed samples 654 (i.e. prior to loop filtering), and so a buffer separate from the frame buffer 672 is needed. When the CTU size is 128×128 the virtual buffer includes samples only from the CTU adjacent and to the left of the current CTU. When the CTU size is 64×64 or 32×32 the virtual buffer includes samples from up to four or sixteen CTUs, respectively, to the left of the current CTU. Regardless of the CTU size, access to neighbouring CTUs for obtaining samples for IBC reference blocks is constrained by boundaries such as edges of pictures, slices, or tiles. Especially for feature maps of FPN layers having smaller dimensions, use of a CTU size such as 32×32 or 64×64 results in a reference area more aligned to cover a set of previous feature maps. Where feature map placement is ordered based on SAD, SSE or other difference metric, access to similar feature maps for IBC prediction offers a coding efficiency advantage.
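Because the virtual buffer has a fixed area of 128×128 luma samples, the number of CTUs it can hold follows directly from the CTU size. A minimal sketch of that arithmetic is given below; the function and constant names are hypothetical.

```python
IBC_BUFFER_LUMA_SAMPLES = 128 * 128  # fixed area limit of the virtual buffer


def ctus_in_ibc_virtual_buffer(ctu_size):
    """Number of square CTUs of the given size that fit in the virtual buffer."""
    return IBC_BUFFER_LUMA_SAMPLES // (ctu_size * ctu_size)


for ctu_size in (128, 64, 32):
    print(ctu_size, ctus_in_ibc_virtual_buffer(ctu_size))
# 128 1
# 64 4
# 32 16
```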
The residual for a predicted block when encoding feature map data is different to the residual seen for natural video, which is typically captured by an imaging sensor, or for screen content, as generally seen in operating system user interfaces and the like. Feature map residuals tend to contain much detail, which is more amenable to transform skip coding than to the predominantly low-frequency coefficients produced by various transforms. Nevertheless, experiments show that the feature map residual has enough local similarity to benefit from transform coding, although the distribution of feature map residual coefficients is not clustered towards the DC (top-left) coefficient of a transform block. In other words, sufficient correlation exists for a transform to show gain when encoding feature map data, and this is also true when intra block copy is used to produce prediction blocks for the feature map data. Accordingly, a Hadamard cost estimate may be used when evaluating residuals resulting from candidate block vectors for intra block copy when encoding feature map data, instead of relying solely on a SAD or SSD cost estimate. SAD or SSD cost estimates tend to select block vectors with residuals more amenable to transform skip coding and may miss block vectors with residuals that would be compactly encoded using transforms. The multiple transform selection (MTS) tool of the VVC standard may be used when encoding feature map data so that, in addition to the DCT-2 transform, combinations of DST-7 and DCT-8 transforms are available horizontally and vertically for residual encoding.
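The difference between a SAD estimate and a Hadamard estimate can be seen in a small sketch comparing two hypothetical 4×4 residuals. The Hadamard cost here is an unnormalised sum of absolute transformed differences; the residual values and the function names are illustrative assumptions, and practical encoders typically apply a normalisation that is omitted here.

```python
H4 = [[1, 1, 1, 1],
      [1, -1, 1, -1],
      [1, 1, -1, -1],
      [1, -1, -1, 1]]  # symmetric 4x4 Hadamard matrix


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def sad(residual):
    """Sum of absolute differences in the spatial domain."""
    return sum(abs(v) for row in residual for v in row)


def hadamard_cost_4x4(residual):
    """Unnormalised sum of absolute Hadamard-transformed differences."""
    transformed = matmul(matmul(H4, residual), H4)
    return sum(abs(v) for row in transformed for v in row)


flat = [[3] * 4 for _ in range(4)]                   # smooth residual
impulse = [[12, 0, 0, 0]] + [[0] * 4 for _ in range(3)]  # single spike

print(sad(flat), hadamard_cost_4x4(flat))        # 48 48
print(sad(impulse), hadamard_cost_4x4(impulse))  # 12 192
```

The SAD estimate prefers the impulse residual, which suits transform skip coding, whereas the Hadamard estimate prefers the flat residual, whose energy compacts into a single transform coefficient.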
An intra-predicted luma coding block may be partitioned into a set of equal-sized prediction blocks, either vertically or horizontally, with each block having a minimum area of sixteen (16) luma samples. This intra sub-partition (ISP) approach enables separate transform blocks to contribute to prediction block generation from one sub-partition to the next sub-partition in the luma coding block, improving compression efficiency.
Where previously reconstructed neighbouring samples are unavailable, for example at the edge of the frame, a default half-tone value of one half the range of the samples is used. For example, for 10-bit video a value of five-hundred and twelve (512) is used. As no previously reconstructed samples are available for a CB located at the top-left position of a frame, angular and planar intra-prediction modes produce the same output as the DC prediction mode (i.e. a flat plane of samples having the half-tone value as magnitude).
For inter-frame prediction a prediction block 682 is produced using samples from one or two frames preceding the current frame in the coding order of frames in the bitstream by a motion compensation module 680 and output as the PB 620 by the multiplexer module 684. Moreover, for inter-frame prediction, a single coding tree is typically used for both the luma channel and the chroma channels. The order of coding frames in the bitstream may differ from the order of the frames when captured or displayed. When one frame is used for prediction, the block is said to be ‘uni-predicted’ and has one associated motion vector. When two frames are used for prediction, the block is said to be ‘bi-predicted’ and has two associated motion vectors. For a P slice, each CU may be intra predicted or uni-predicted. For a B slice, each CU may be intra predicted, uni-predicted, or bi-predicted.
Frames are typically coded using a ‘group of pictures’ structure, enabling a temporal hierarchy of frames. Frames may be divided into multiple slices, each of which encodes a portion of the frame. A temporal hierarchy of frames allows a frame to reference a preceding and a subsequent picture in the order of displaying the frames. The images are coded in the order necessary to ensure the dependencies for decoding each frame are met. An affine inter prediction mode is available where instead of using one or two motion vectors to select and filter reference sample blocks for a prediction unit, the prediction unit is divided into multiple smaller blocks and a motion field is produced so each smaller block has a distinct motion vector. The motion field uses the motion vectors of nearby points to the prediction unit as ‘control points’. Affine prediction allows coding of motion different to translation with less need to use deeply split coding trees. A bi-prediction mode available to VVC performs a geometric blend of the two reference blocks along a selected axis, with angle and offset from the centre of the block signalled. This geometric partitioning mode (“GPM”) allows larger coding units to be used along the boundary between two objects, with the geometry of the boundary coded for the coding unit as an angle and centre offset. Motion vector differences, instead of using a Cartesian (x, y) offset, may be coded as a direction (up/down/left/right) and a distance, with a set of power-of-two distances supported. The motion vector predictor is obtained from a neighbouring block (‘merge mode’); if no offset is applied, the current block shares the same motion vector as the selected neighbouring block.
The samples are selected according to a motion vector 678 and reference picture index. The motion vector 678 and reference picture index apply to all colour channels and thus inter prediction is described primarily in terms of operation upon PUs rather than PBs. The decomposition of each CTU into one or more inter-predicted blocks is described with a single coding tree. Inter prediction methods may vary in the number of motion parameters and their precision. Motion parameters typically comprise a reference frame index, indicating which reference frame(s) from lists of reference frames are to be used plus a spatial translation for each of the reference frames, but may include more frames, special frames, or complex affine parameters such as scaling and rotation. In addition, a pre-determined motion refinement process may be applied to generate dense motion estimates based on referenced sample blocks.
Having determined and selected the PB 620, and subtracted the PB 620 from the original sample block at the subtractor 622, a residual with lowest coding cost, represented as 624, is obtained and subjected to lossy compression. The lossy compression process comprises the steps of transformation, quantisation and entropy coding. A forward primary transform module 626 applies a forward transform to the difference 624, converting the difference 624 from the spatial domain to the frequency domain, and producing primary transform coefficients represented by an arrow 628. The largest primary transform size in one dimension is either a 32-point DCT-2 or a 64-point DCT-2 transform, configured by a ‘sps_max_luma_transform_size_64_flag’ in the sequence parameter set. If the CB being encoded is larger than the largest supported primary transform size expressed as a block size (e.g. 64×64 or 32×32), the primary transform 626 is applied in a tiled manner to transform all samples of the difference 624. Where a non-square CB is used, tiling is also performed using the largest available transform size in each dimension of the CB. For example, when a maximum transform size of thirty-two (32) is used, a 64×16 CB uses two 32×16 primary transforms arranged in a tiled manner. When a CB is larger in size than the maximum supported transform size, the CB is filled with TBs in a tiled manner. For example, a 128×128 CB with 64-pt transform maximum size is filled with four 64×64 TBs in a 2×2 arrangement. A 64×128 CB with a 32-pt transform maximum size is filled with eight 32×32 TBs in a 2×4 arrangement.
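The tiling examples above can be reproduced with a short sketch that limits each TB dimension to the maximum transform size and fills the CB in raster order. The function name and the (x, y, width, height) tuple layout are illustrative assumptions.

```python
def tile_transform_blocks(cb_width, cb_height, max_transform_size):
    """Return (x, y, width, height) of each TB tiling a CB in raster order."""
    tb_width = min(cb_width, max_transform_size)
    tb_height = min(cb_height, max_transform_size)
    return [(x, y, tb_width, tb_height)
            for y in range(0, cb_height, tb_height)
            for x in range(0, cb_width, tb_width)]


print(len(tile_transform_blocks(128, 128, 64)))  # 4 TBs of 64x64 (2x2)
print(len(tile_transform_blocks(64, 128, 32)))   # 8 TBs of 32x32 (2x4)
print(tile_transform_blocks(64, 16, 32))
# [(0, 0, 32, 16), (32, 0, 32, 16)]
```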
Application of the transform 626 results in multiple TBs for the CB. Where each application of the transform operates on a TB of the difference 624 larger than 32×32, e.g. 64×64, all resulting primary transform coefficients 628 outside of the upper-left 32×32 area of the TB are set to zero (i.e., discarded). The remaining primary transform coefficients 628 are passed to a quantiser module 634. The primary transform coefficients 628 are quantised according to a quantisation parameter 692 associated with the CB to produce primary transform coefficients 632. In addition to the quantisation parameter 692, the quantiser module 634 may also apply a ‘scaling list’ to allow non-uniform quantisation within the TB by further scaling residual coefficients according to their spatial position within the TB. The quantisation parameter 692 may differ for a luma CB versus each chroma CB. The primary transform coefficients 632 are passed to a forward secondary transform module 630 to produce transform coefficients represented by the arrow 636 by performing either a non-separable secondary transform (NSST) operation or bypassing the secondary transform. The forward primary transform is typically separable, transforming a set of rows and then a set of columns of each TB. The forward primary transform module 626 uses either a type-II discrete cosine transform (DCT-2) in the horizontal and vertical directions, or bypass of the transform horizontally and vertically, or combinations of a type-VII discrete sine transform (DST-7) and a type-VIII discrete cosine transform (DCT-8) in either horizontal or vertical directions for luma TBs not exceeding 16 samples in width and height. Use of combinations of a DST-7 and DCT-8 is referred to as ‘multi transform selection set’ (MTS) in the VVC standard.
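The discarding of high-frequency primary transform coefficients for TBs larger than 32×32 may be sketched as follows. The block is represented as a list of rows with the DC coefficient at position [0][0]; the function name and the `keep` parameter are hypothetical.

```python
def zero_out_high_frequencies(coefficients, keep=32):
    """Retain only the upper-left keep x keep coefficients of a TB."""
    return [[value if (x < keep and y < keep) else 0
             for x, value in enumerate(row)]
            for y, row in enumerate(coefficients)]


# A 64x64 TB filled with ones keeps 32 * 32 = 1024 non-zero coefficients.
tb = [[1] * 64 for _ in range(64)]
retained = zero_out_high_frequencies(tb)
print(sum(sum(row) for row in retained))  # 1024
```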
The forward secondary transform of the module 630 is generally a non-separable transform, which is only applied for the residual of intra-predicted CUs and may nonetheless also be bypassed. The forward secondary transform operates either on sixteen (16) samples (arranged as the upper-left 4×4 sub-block of the primary transform coefficients 628) or forty-eight (48) samples (arranged as three 4×4 sub-blocks in the upper-left 8×8 coefficients of the primary transform coefficients 628) to produce a set of secondary transform coefficients. The set of secondary transform coefficients may be fewer in number than the set of primary transform coefficients from which they are derived. Due to application of the secondary transform to only a set of coefficients adjacent to each other and including the DC coefficient, the secondary transform is referred to as a ‘low frequency non-separable secondary transform’ (LFNST). Moreover, when the LFNST is applied, all remaining coefficients in the TB are zero, both in the primary transform domain and the secondary transform domain.
The quantisation parameter 692 is constant for a given TB and thus results in a uniform scaling for the production of residual coefficients in the primary transform domain for a TB. The quantisation parameter 692 may vary periodically with a signalled ‘delta quantisation parameter’. The delta quantisation parameter (delta QP) is signalled once for CUs contained within a given area, referred to as a ‘quantisation group’. If a CU is larger than the quantisation group size, delta QP is signalled once with one of the TBs of the CU. That is, the delta QP is signalled by the entropy encoder 638 once for the first quantisation group of the CU and not signalled for any subsequent quantisation groups of the CU. A non-uniform scaling is also possible by application of a ‘quantisation matrix’, whereby the scaling factor applied for each residual coefficient is derived from a combination of the quantisation parameter 692 and the corresponding entry in a scaling matrix. The scaling matrix may have a size that is smaller than the size of the TB, and when applied to the TB a nearest neighbour approach is used to provide scaling values for each residual coefficient from a scaling matrix smaller in size than the TB size. The residual coefficients 636 are supplied to the entropy encoder 638 for encoding in the bitstream 121. Typically, the residual coefficients of each TB with at least one significant residual coefficient of the TU are scanned to produce an ordered list of values, according to a scan pattern. The scan pattern generally scans the TB as a sequence of 4×4 ‘sub-blocks’, providing a regular scanning operation at the granularity of 4×4 sets of residual coefficients, with the arrangement of sub-blocks dependent on the size of the TB. The scan within each sub-block and the progression from one sub-block to the next typically follow a backward diagonal scan pattern.
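The nearest neighbour application of a scaling matrix smaller than the TB, described above, may be sketched as follows. The function name `expand_scaling_matrix` is hypothetical, and the sketch is a simplified illustration rather than the exact derivation of a video coding standard. It assumes the TB dimensions are integer multiples of the scaling matrix dimensions.

```python
def expand_scaling_matrix(matrix, tb_width, tb_height):
    """Expand a small scaling matrix to TB size by nearest neighbour.

    Each residual coefficient position (x, y) in the TB takes the entry
    of `matrix` covering the corresponding position in the small matrix.
    """
    rows = len(matrix)
    cols = len(matrix[0])
    return [
        [matrix[y * rows // tb_height][x * cols // tb_width]
         for x in range(tb_width)]
        for y in range(tb_height)
    ]


if __name__ == "__main__":
    small = [[1, 2],
             [3, 4]]
    for row in expand_scaling_matrix(small, 4, 4):
        print(row)
```

In the sketch each entry of the small matrix supplies the scaling value for a rectangular block of coefficient positions in the TB.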
Additionally, the quantisation parameter 692 is encoded into the bitstream 121 using a delta QP syntax element and the secondary transform index 688 is encoded in the bitstream 121.
As described above, the video encoder 120 needs access to a frame representation corresponding to the decoded frame representation seen in the video decoder 144. Thus, the residual coefficients 636 are passed through an inverse secondary transform module 644, operating in accordance with the secondary transform index 688 to produce intermediate inverse transform coefficients, represented by an arrow 642. The intermediate inverse transform coefficients 642 are inverse quantised by a dequantiser module 640 according to the quantisation parameter 692 to produce inverse transform coefficients, represented by an arrow 646. The dequantiser module 640 may also perform an inverse non-uniform scaling of residual coefficients using a scaling list, corresponding to the forward scaling performed in the quantiser module 634. The inverse transform coefficients 646 are passed to an inverse primary transform module 648 to produce residual samples, represented by an arrow 650, of the TU. The inverse primary transform module 648 applies DCT-2 transforms horizontally and vertically, constrained by the maximum available transform size as described with reference to the forward primary transform module 626. The types of inverse transform performed by the inverse secondary transform module 644 correspond with the types of forward transform performed by the forward secondary transform module 630. The types of inverse transform performed by the inverse primary transform module 648 correspond with the types of primary transform performed by the primary transform module 626. A summation module 652 adds the residual samples 650 and the PU 620 to produce reconstructed samples (indicated by an arrow 654) of the CU.
The reconstructed samples 654 are passed to a reference sample cache 656 and an in-loop filters module 668. The reference sample cache 656, typically implemented using static RAM on an ASIC to avoid costly off-chip memory access, provides minimal sample storage needed to satisfy the dependencies for generating intra-frame PBs for subsequent CUs in the frame. The minimal dependencies typically include a ‘line buffer’ of samples along the bottom of a row of CTUs, for use by the next row of CTUs, and column buffering, the extent of which is set by the height of the CTU. The reference sample cache 656 supplies reference samples (represented by an arrow 658) to a reference sample filter 660. The sample filter 660 applies a smoothing operation to produce filtered reference samples (indicated by an arrow 662). The filtered reference samples 662 are used by an intra-frame prediction module 664 to produce an intra-predicted block of samples, represented by an arrow 666. For each candidate intra prediction mode the intra-frame prediction module 664 produces a block of samples, that is 666. The block of samples 666 is generated by the module 664 using techniques such as DC, planar or angular intra prediction. The block of samples 666 may also be produced using a matrix-multiplication approach with neighbouring reference samples as input and a matrix selected from a set of matrices by the video encoder 120, with the selected matrix signalled in the bitstream 121 using an index to identify which matrix of the set of matrices is to be used by the video decoder 144.
The in-loop filters module 668 applies several filtering stages to the reconstructed samples 654. The filtering stages include a ‘deblocking filter’ (DBF) which applies smoothing aligned to the CU boundaries to reduce artefacts resulting from discontinuities. Another filtering stage present in the in-loop filters module 668 is an ‘adaptive loop filter’ (ALF), which applies a Wiener-based adaptive filter to further reduce distortion. A further available filtering stage in the in-loop filters module 668 is a ‘sample adaptive offset’ (SAO) filter. The SAO filter operates by firstly classifying reconstructed samples into one or multiple categories and, according to the allocated category, applying an offset at the sample level.
Filtered samples, represented by an arrow 670, are output from the in-loop filters module 668. The filtered samples 670 are stored in a frame buffer 672. The frame buffer 672 typically has the capacity to store several (e.g., up to sixteen (16)) pictures and thus is stored in the memory 206. The frame buffer 672 is not typically stored using on-chip memory due to the large memory consumption required. As such, access to the frame buffer 672 is costly in terms of memory bandwidth. The frame buffer 672 provides reference frames (represented by an arrow 674) to a motion estimation module 676 and the motion compensation module 680.
The motion estimation module 676 estimates a number of ‘motion vectors’ (indicated as 678), each being a Cartesian spatial offset from the location of the present CB, referencing a block in one of the reference frames in the frame buffer 672. A filtered block of reference samples (represented as 682) is produced for each motion vector. The filtered reference samples 682 form further candidate modes available for potential selection by the mode selector 686. Moreover, for a given CU, the PU 620 may be formed using one reference block (‘uni-predicted’) or may be formed using two reference blocks (‘bi-predicted’). For the selected motion vector, the motion compensation module 680 produces the PB 620 in accordance with a filtering process supportive of sub-pixel accuracy in the motion vectors. As such, the motion estimation module 676 (which operates on many candidate motion vectors) may perform a simplified filtering process compared to that of the motion compensation module 680 (which operates on the selected candidate only) to achieve reduced computational complexity. When the video encoder 120 selects inter prediction for a CU the motion vector 678 is encoded into the bitstream 121.
Although the video encoder 120 of
The video decoder 144 is shown in
The bitstream 143 is input to an entropy decoder module 720. The entropy decoder module 720 extracts syntax elements from the bitstream 143 by decoding sequences of ‘bins’ and passes the values of the syntax elements to other modules in the video decoder 144. The entropy decoder module 720 uses variable-length and fixed-length decoding to decode the SPS, PPS or slice header, and an arithmetic decoding engine to decode syntax elements of the slice data as a sequence of one or more bins. Each bin may use one or more ‘contexts’, with a context describing probability levels to be used for coding a ‘one’ and a ‘zero’ value for the bin. Where multiple contexts are available for a given bin, a ‘context modelling’ or ‘context selection’ step is performed to choose one of the available contexts for decoding the bin. The process of decoding bins forms a sequential feedback loop, thus each slice may be decoded in the slice's entirety by a given entropy decoder 720 instance. A single (or few) high-performing entropy decoder 720 instances may decode all slices for a frame from the bitstream 143, or multiple lower-performing entropy decoder 720 instances may concurrently decode the slices for a frame from the bitstream 143.
The entropy decoder module 720 applies an arithmetic coding algorithm, for example ‘context adaptive binary arithmetic coding’ (CABAC), to decode syntax elements from the bitstream 143. The decoded syntax elements are used to reconstruct parameters within the video decoder 144. Parameters include residual coefficients (represented by an arrow 724), a quantisation parameter 774, a secondary transform index 770, and mode selection information such as an intra prediction mode (represented by an arrow 758). The mode selection information also includes information such as motion vectors, and the partitioning of each CTU into one or more CBs. Parameters are used to generate PBs, typically in combination with sample data from previously decoded CBs.
The residual coefficients 724 are passed to an inverse secondary transform module 736 where either a secondary transform is applied or no operation is performed (bypass) according to a secondary transform index. The inverse secondary transform module 736 produces reconstructed transform coefficients 732, that is primary transform domain coefficients, from secondary transform domain coefficients. The reconstructed transform coefficients 732 are input to a dequantiser module 728. The dequantiser module 728 performs inverse quantisation (or ‘scaling’) on the residual coefficients 732, that is, in the primary transform coefficient domain, to create reconstructed intermediate transform coefficients, represented by an arrow 740, according to the quantisation parameter 774. The dequantiser module 728 may also apply a scaling matrix to provide non-uniform dequantization within the TB, corresponding to operation of the dequantiser module 640. Should use of a non-uniform inverse quantisation matrix be indicated in the bitstream 143, the video decoder 144 reads a quantisation matrix from the bitstream 143 as a sequence of scaling factors and arranges the scaling factors into a matrix. The inverse scaling uses the quantisation matrix in combination with the quantisation parameter to create the reconstructed intermediate transform coefficients 740.
The reconstructed transform coefficients 740 are passed to an inverse primary transform module 744. The module 744 transforms the coefficients 740 from the frequency domain back to the spatial domain. The inverse primary transform module 744 applies inverse DCT-2 transforms horizontally and vertically, constrained by the maximum available transform size as described with reference to the forward primary transform module 626. The result of operation of the module 744 is a block of residual samples, represented by an arrow 748. The block of residual samples 748 is equal in size to the corresponding CB. The residual samples 748 are supplied to a summation module 750.
At the summation module 750 the residual samples 748 are added to a decoded PB (represented as 752) to produce a block of reconstructed samples, represented by an arrow 756. The reconstructed samples 756 are supplied to a reconstructed sample cache 760 and an in-loop filtering module 788. The in-loop filtering module 788 produces reconstructed blocks of frame samples, represented as 792. The frame samples 792 are written to a frame buffer 796.
The reconstructed sample cache 760 operates similarly to the reconstructed sample cache 656 of the video encoder 120. The reconstructed sample cache 760 provides storage for reconstructed samples needed to intra predict subsequent CBs without the memory 206 (e.g., by using the data 232 instead, which is typically on-chip memory). Reference samples, represented by an arrow 764, are obtained from the reconstructed sample cache 760 and supplied to a reference sample filter 768 to produce filtered reference samples indicated by arrow 772. The filtered reference samples 772 are supplied to an intra-frame prediction module 776. The module 776 produces a block of intra-predicted samples, represented by an arrow 780, in accordance with the intra prediction mode parameter 758 signalled in the bitstream 143 and decoded by the entropy decoder 720. The intra prediction module 776 supports the modes of the module 664, including IBC and MIP. The block of samples 780 is generated using modes such as DC, planar or angular intra prediction.
When the prediction mode of a CB is indicated to use intra prediction in the bitstream 143, the intra-predicted samples 780 form the decoded PB 752 via a multiplexor module 784. Intra prediction produces a prediction block (PB) of samples, which is a block in one colour component, derived using ‘neighbouring samples’ in the same colour component. The neighbouring samples are samples adjacent to the current block and by virtue of being preceding in the block decoding order have already been reconstructed. Where luma and chroma blocks are collocated, the luma and chroma blocks may use different intra prediction modes. However, the two chroma CBs share the same intra prediction mode.
When the prediction mode of the CB is indicated to be inter prediction in the bitstream 143, a motion compensation module 734 produces a block of inter-predicted samples, represented as 738. The block of inter-predicted samples 738 are produced using a motion vector, decoded from the bitstream 143 by the entropy decoder 720, and reference frame index to select and filter a block of samples 798 from a frame buffer 796. The block of samples 798 is obtained from a previously decoded frame stored in the frame buffer 796. For bi-prediction, two blocks of samples are produced and blended together to produce samples for the decoded PB 752. The frame buffer 796 is populated with filtered block data 792 from an in-loop filtering module 788. As with the in-loop filtering module 668 of the video encoder 120, the in-loop filtering module 788 applies any of the DBF, the ALF and SAO filtering operations. Generally, the motion vector is applied to both the luma and chroma channels, although the filtering processes for sub-sample interpolation in the luma and chroma channel are different.
Not shown in
Once all groups of feature maps are scaled, the result is output as intermediate data in the form of tensors 149. The tensors 149 may contain multiple tensors each having a different spatial resolution, for example when the CNN backbone 114 includes an FPN. In addition to using a zero-centred linear, symmetric, quantisation process, other quantisation processes are also possible. For example, an asymmetric approach, where a positive and a negative quantisation range are signalled for each feature map group, may be used. The positive and negative quantisation ranges map the range utilised by floating point values of the group of feature maps into the full sample range afforded by the bit depth of the samples, which results in an asymmetric quantisation as the mid-point of the sample range is no longer guaranteed to correspond to a zero floating point value. A ‘quant_type’ syntax element in the SEI message 1413 selects the quantisation approach and is described with reference to Appendix A.
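The difference between the zero-centred symmetric approach and the asymmetric approach may be illustrated by the following sketch. The function names, the rounding and clipping behaviour, and the choice of mid-point sample value are all assumptions made for illustration, and the sketch is not the normative mapping of the described arrangements.

```python
def quantise_symmetric(value, quant_range, bit_depth=10):
    """Zero-centred linear mapping: 0.0 maps to the mid-point sample."""
    max_sample = (1 << bit_depth) - 1
    mid = 1 << (bit_depth - 1)          # assumed mid-point, e.g. 512
    half = mid - 1                      # assumed usable half-range
    sample = round(mid + value / quant_range * half)
    return min(max_sample, max(0, sample))


def quantise_asymmetric(value, negative_range, positive_range, bit_depth=10):
    """Map [negative_range, positive_range] onto the full sample range.

    negative_range is given here as a signed (non-positive) value.
    """
    max_sample = (1 << bit_depth) - 1
    span = positive_range - negative_range
    sample = round((value - negative_range) / span * max_sample)
    return min(max_sample, max(0, sample))


if __name__ == "__main__":
    print(quantise_symmetric(0.0, 4.0))          # zero at the mid-point
    print(quantise_asymmetric(0.0, -2.0, 6.0))   # zero no longer at the mid-point
```

With the asymmetric mapping the full sample range is used even when the positive and negative extents of the group differ, at the cost of the zero value no longer being located at the mid-point of the sample range.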
Although a quantisation range for a given group of feature maps is derived from the values within the feature maps of the group, the quantisation range need not retain the same data type as the values within the feature maps of the group. A coarser floating point precision may be used, with rounding applied so that the range, when expressed back in the original floating point format (e.g. 32-bit IEEE 754 format) is not reduced. For example, the coarser floating point precision may be used with upward rounding. The precision of the quantisation range in terms of bits allocated to the fraction portion is selected using a ‘qr_fraction_precision’ syntax element which is described with reference to Appendix A. To produce a mantissa for a quantisation range, a leading ‘1’ is prepended to the fraction portion (i.e., the quantisation range may not be a ‘denormal’ value). As a quantisation range is always positive, there is no need to encode a sign bit for each quantisation range. The quantisation range may be greater than one or less than one, so a sign bit for the quantisation range exponent is needed. In an arrangement of the system 100, quantisation ranges below 1.0 are not permitted and the quantisation exponent sign bit may be omitted from the SEI message 1413. When the quantisation exponent sign bit is not coded, quantisation ranges less than 1.0 are clipped to the value 1.0 in the quantisation range determiner module 514.
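Representation of a quantisation range at a coarser floating point precision with upward rounding may be sketched as follows. The function names and the returned (exponent, fraction) pair are hypothetical and do not reproduce the syntax of Appendix A; the parameter `fraction_bits` stands in for the precision selected by ‘qr_fraction_precision’, and `allow_below_one` models the arrangement in which ranges below 1.0 are clipped.

```python
import math


def encode_quantisation_range(value, fraction_bits, allow_below_one=True):
    """Return (exponent, fraction) with the fraction rounded upward."""
    if value <= 0.0:
        raise ValueError("a quantisation range must be positive")
    if not allow_below_one:
        value = max(value, 1.0)            # clip ranges below 1.0
    mantissa, exponent = math.frexp(value)  # 0.5 <= mantissa < 1.0
    mantissa *= 2.0                         # normalise to 1.0 <= mantissa < 2.0
    exponent -= 1
    scale = 1 << fraction_bits
    fraction = math.ceil((mantissa - 1.0) * scale)  # upward rounding
    if fraction == scale:                   # rounding carried into the exponent
        fraction = 0
        exponent += 1
    return exponent, fraction


def decode_quantisation_range(exponent, fraction, fraction_bits):
    """Prepend the leading '1' to the fraction and apply the exponent."""
    mantissa = 1.0 + fraction / (1 << fraction_bits)
    return math.ldexp(mantissa, exponent)


if __name__ == "__main__":
    exponent, fraction = encode_quantisation_range(3.3, 4)
    print(exponent, fraction, decode_quantisation_range(exponent, fraction, 4))
```

Because the fraction is rounded upward, the decoded range in the sketch is never smaller than the range supplied to the encoder, so values within the original range remain representable after normalisation.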
Notwithstanding that the operation of the inverse quantiser module 814 and the quantiser module 518 is referred to as ‘quantisation’, operation of the modules 518 and 814 is distinct from quantisation operation of the video encoder 120 and the video decoder 144, which involves the use of the quantisation parameter. Moreover, operation of the modules 518 and 814 may be viewed as a form of tone mapping operation, involving the conversion between floating point domain of tensors and sample domain of frames. Although there is a scaling (i.e., via the quantisation range of each group of feature maps) for the purpose of utilising a wide range of the sample value space, there is no quantisation parameter applicable to the modules 518 and 814 to further alter the quantiser step size.
The upscaler module 960 accepts a tensor 962 as input, which is passed to a CBL module 966 to produce a tensor 968. The tensor 968 is passed to an upsampler 970 to produce an upsampled tensor 972. A concatenation module 974 produces a tensor 976 by concatenating the upsampled tensor 972 with an input tensor 964. The detection modules 916, 930, and 944 are instances of a detection module 980 as shown in
Input to the ROI pooler 1028 are the P2-P5 feature maps 1010, 1012, 1014, and 1016, and region of interest proposals 1026. Each proposal (ROI) from 1026 is associated with a portion of the feature maps (1010-1016) to produce a fixed-size map. The fixed-size map is of a size independent of the underlying portion of the feature map 1010-1016. One of the feature maps 1010-1016 is selected such that the resulting cropped map has sufficient detail, for example, according to the following rule: floor(4 + log2(sqrt(box_area)/224)), where 224 is the canonical box size. The ROI pooler 1028 thus crops incoming feature maps according to the proposals 1026 producing a tensor 1030. The tensor 1030 is fed into a fully connected (FC) neural network head 1032. The FC head 1032 performs two fully connected layers to produce class score and bounding box predictor delta tensor 1034. The class score is generally an 80 element tensor, each element corresponding to a prediction score for the corresponding object category. The bounding box prediction deltas tensor is an 80×4=320 element tensor, containing bounding boxes for the corresponding object categories. Final processing is performed by an output layers module 1036, receiving the tensor 1034 and performing a filtering operation to produce a filtered tensor 1038. Low-scoring (low classification) objects are removed from further consideration. A non-maximum suppression module 1040 removes overlapping bounding boxes by removing the overlapped box with a lower classification score, resulting in an inference output tensor 151.
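The feature map selection rule given above may be sketched as follows. The function name `select_fpn_level` is hypothetical. Clamping of the result to the available P2-P5 levels is an assumption added for illustration and is not stated in the rule itself.

```python
import math


def select_fpn_level(box_width, box_height, min_level=2, max_level=5,
                     canonical_size=224):
    """Select a pyramid level for a proposal of the given size.

    Implements floor(4 + log2(sqrt(box_area) / 224)); the clamp to
    [min_level, max_level] is an assumption of this sketch.
    The box must have a positive area.
    """
    box_area = box_width * box_height
    level = math.floor(4 + math.log2(math.sqrt(box_area) / canonical_size))
    return max(min_level, min(max_level, level))


if __name__ == "__main__":
    for size in (32, 112, 224, 448):
        print(size, select_fpn_level(size, size))
```

Under the rule, smaller proposals are associated with lower-numbered levels and larger proposals with higher-numbered levels, with a proposal of the canonical size mapping to level 4.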
In placing feature maps in the two-dimensional array in the form of the monochrome frame 1102, feature maps of the same group of frames are placed adjacent in the frame 1102. For example, group 1106 contains feature map 1110, with group 1108 and group 1109 containing the remaining feature maps in the layer. Likewise, group 1114 contains feature map 1112, with two additional groups for the layer. For brevity, grouping is not shown for the layer containing the smallest feature maps (i.e. feature map 1120), however the same groupwise packing approach is used. Within each group, feature maps are present in a determined ordering and the placement in the monochrome frame 1102 reflects the ordering.
In placing feature maps into the monochrome frame 1202 of
In addition to aligning feature maps to a specific alignment grid, a minimum padding between feature maps, such as two samples, may also be enforced. The minimum padding helps prevent artefacts in one feature map caused by content in an adjacent feature map in cases where the feature map size is a multiple of the alignment grid. For example, a feature map of size 136×76 fits onto a 4×4 alignment grid with no inserted unused sample space between itself and the adjacent feature maps. A minimum padding area ensures some separation between adjacent feature maps, which may help reduce coding artefacts crossing from one feature map to an adjacent feature map.
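The combined effect of the alignment grid and the minimum padding may be sketched as follows. The function name `padded_slot_size` is hypothetical, and the sketch assumes the padding is added on the right and bottom edges of each feature map only.

```python
def padded_slot_size(width, height, grid=4, min_padding=2):
    """Return the (width, height) of frame area reserved for a feature map.

    The feature map size plus the minimum padding is rounded up to the
    next multiple of the alignment grid in each dimension.
    """
    def align_up(value):
        return -(-value // grid) * grid   # ceiling to a multiple of grid

    return align_up(width + min_padding), align_up(height + min_padding)


if __name__ == "__main__":
    # The 136x76 example: without padding no unused space is inserted.
    print(padded_slot_size(136, 76, grid=4, min_padding=0))
    # With a minimum padding of two samples, separation is guaranteed.
    print(padded_slot_size(136, 76, grid=4, min_padding=2))
```

In the sketch the 136×76 feature map occupies exactly 136×76 samples when no minimum padding is enforced, and is separated from adjacent feature maps when a minimum padding of two samples is applied.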
In an arrangement of the system 100, feature maps of a given image, that is one frame from the video source 112, are packed into more than one frame. For example, feature maps from one image may be packed into four frames. Feature maps may be grouped based on similarity into fixed-sized groups, such as groups of size four. Each feature map group may be placed into the four frames such that the feature maps of a given group are spatially collocated across the four frames. Such a packing arrangement results in the frames 117 having a frame rate four times greater than the frame rate of the frame data 113. The sets of four frames may then be encoded by the video encoder 120 using low-delay or random-access picture structures, allowing inter-prediction coding tools to exploit correlation between the spatially collocated feature maps of a given feature map group.
An SEI message 1413 encodes a feature map grouping 1430, as determined by the group determiner module 510 and quantisation ranges 1432, as determined by the range determiner module 514. Appendix A shows example syntax and semantics for the SEI message 1413. The packing format used by the packer module 522 may also be encoded in the SEI message 1413, using an index to select one feature packing format from an enumeration of all available feature packing formats. The particular CNN backbone that was used to produce the feature maps may also be indicated in the SEI message 1413 using an index to select one CNN backbone from an enumeration of a set of predetermined CNN backbones, some or all of which are available to the source device 110. From the CNN backbone type index, the number of layers and number of channels in each layer and resolution of each feature map in each layer may be determined. For groupings where feature maps within a given group are in the same layer, separate group lists of feature map indices are coded for each layer. For groupings where feature maps within a given group may span multiple layers, a feature map index and layer index pair are coded as items in each group. For groupings where at most one feature map in each layer is present, and those that are present are in adjacent layers, the layer index is only needed for the first feature map in the group. If the group includes feature maps of all layers, for example all three layers, no group index is required as the feature map indices apply implicitly to one feature map in each layer.
Each frame is encoded in the bitstream 1400 as an ‘access unit’, such as access unit 1414 as seen in
The method 1500 begins at a perform CNN first portion step 1510. At the step 1510, the CNN backbone 114, under execution of the processor 205, performs a subset of the layers of a particular CNN to convert an input frame 113 into intermediate tensors 115. Due to use of a prediction head or FPN the tensors 115 may contain multiple tensors. The method 1500 operates to encode tensors corresponding to one frame of video data from the video source 112. Control in the processor 205 then progresses from the step 1510 to a determine feature map similarity step 1520. The intermediate tensors 115 may be stored, for example, in the memory 206 and/or hard disk drive 210.
At the determine feature map similarity step 1520 the module 116, under execution of the processor 205, produces a similarity matrix containing a measure of the similarity of each feature map with each other feature map within each layer. The similarity matrix may be stored, for example, in the memory 206 and/or hard disk drive 210. The similarity measure may be mean squared difference (MSE) of two feature maps or sum of absolute differences (SAD) of two feature maps or some other measure of difference. Where it is desired to measure similarity of feature maps in different layers, the feature maps having a lower spatial resolution may be upscaled (e.g. using nearest neighbour interpolation), to produce a compatible resolution with the higher spatial resolution for the purpose of difference measurement. To reduce computational overhead, the step 1520 is performed infrequently, for example, only on the first picture of a CLVS, or on each random-access point in a CLVS. Control in the processor 205 then progresses from the step 1520 to a determine feature map grouping step 1530.
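Determination of the similarity matrix may be sketched as follows. All function names are hypothetical, feature maps are represented as lists of rows for simplicity, and the integer upscaling factor is an assumption of the sketch.

```python
def mse(map_a, map_b):
    """Mean squared difference of two equally sized feature maps."""
    diffs = [(a - b) ** 2
             for row_a, row_b in zip(map_a, map_b)
             for a, b in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


def sad(map_a, map_b):
    """Sum of absolute differences of two equally sized feature maps."""
    return sum(abs(a - b)
               for row_a, row_b in zip(map_a, map_b)
               for a, b in zip(row_a, row_b))


def upscale_nearest(feature_map, factor):
    """Nearest neighbour upscaling by an integer factor, for comparing
    feature maps from layers of different spatial resolution."""
    return [[value for value in row for _ in range(factor)]
            for row in feature_map for _ in range(factor)]


def similarity_matrix(feature_maps, measure=mse):
    """Measure of difference of each feature map with each other one."""
    count = len(feature_maps)
    return [[measure(feature_maps[i], feature_maps[j]) for j in range(count)]
            for i in range(count)]


if __name__ == "__main__":
    maps = [[[0, 0], [0, 0]], [[1, 1], [1, 1]], [[0, 2], [0, 0]]]
    for row in similarity_matrix(maps):
        print(row)
```

A lower value in the resulting matrix indicates greater similarity, and the diagonal is zero because each feature map is identical to itself.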
At the determine feature map group step 1530 the group determiner 510, under execution of the processor 205, determines sets of groups to which feature maps are assigned. The groups of feature maps may be stored, for example, in the memory 206 and/or hard disk drive 210. One example of grouping is for feature maps of a given layer to be assigned to one group, resulting in one group per layer of the FPN. The step 1530 needs to be performed when the similarity matrix of the step 1520 has been determined, for example, on the first picture of a CLVS or on every random-access point in the CLVS. Control in the processor 205 progresses from the step 1530 to a determine feature map placement step 1540.
At the determine feature map placement step 1540 the packer module 522, under execution of the processor 205, determines the location at which each feature map will be placed in a frame. When the frame is a monochrome frame, the feature maps are placed in a raster scan order filling the frame area, with the frame area initialised based on the total area of all feature maps to be packed into the frame and a target aspect ratio. Packing arrangements are described with reference to
At the determine group ranges step 1550 the range determiner 514, under execution of the processor 205, determines the range of the floating-point data in each group of feature maps determined in step 1530. The determined ranges may be stored, for example, in the memory 206 and/or hard disk drive 210. For symmetric operation, the range for the group is the largest magnitude (absolute) value of the values in the feature maps belonging to the group. The range provides a value for normalisation of the feature map data prior to conversion and quantisation to integer sample values. For asymmetric operation, a positive and negative range is determined for each group of feature maps, indicating the largest positive and largest negative value encountered within the group of feature maps. A quantisation range is determined for each group of feature maps in the tensors 115. The quantisation ranges may be determined for tensors of every frame of video data, or a less frequent update may be applied. To reduce signalling overhead, quantisation ranges may be determined for intra pictures or random-access pictures only in the video bitstream. The range of floating-point data tensors of a subsequent frame, where the quantisation range was not determined, may exceed the earlier determined quantisation ranges. A safety margin may be introduced by increasing the magnitude of the determined quantisation ranges by some specified scaling factor. Multiplying quantisation ranges by a fixed factor, for example 8/7, results in compressing the utilised sample range of the data into a range approximately corresponding to video range used in YCbCr video data. Later frames where the quantisation range might not be determined have some headroom to exceed this range up to the limit of the sample bit depth, e.g. [0 . . . 1023] for 10-bit video. Control in the processor 205 then progresses from the step 1550 to a quantise feature maps step 1560.
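Determination of the quantisation range of a group of feature maps, including the optional safety margin, may be sketched as follows. The function names are hypothetical, and in the asymmetric case the negative range is returned as a signed value, which is an assumption of the sketch.

```python
def symmetric_range(group, margin=1.0):
    """Largest magnitude value in the group, scaled by a safety margin."""
    return margin * max(abs(value)
                        for feature_map in group
                        for row in feature_map
                        for value in row)


def asymmetric_range(group, margin=1.0):
    """Return (negative_range, positive_range) for the group.

    The negative range is returned as a signed, non-positive value.
    """
    values = [value
              for feature_map in group
              for row in feature_map
              for value in row]
    negative = min(0.0, min(values))
    positive = max(0.0, max(values))
    return margin * negative, margin * positive


if __name__ == "__main__":
    group = [[[0.5, -3.0], [1.0, 2.0]]]          # one 2x2 feature map
    print(symmetric_range(group))                 # no safety margin
    print(symmetric_range(group, margin=8 / 7))   # with the 8/7 margin
    print(asymmetric_range(group))
```

Applying a margin greater than one enlarges the range used for normalisation, so that tensors of later frames for which no new range is determined retain some headroom before reaching the limits of the sample bit depth.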
At the quantise feature maps step 1560 the quantiser module 518, under execution of the processor 205, quantises each feature map from floating point values into integer sample values according to the quantisation range of the group to which the feature map belongs. The determined integer sample values may be stored, for example, in the memory 206 and/or hard disk drive 210. The step 1560 is described with reference to
At the pack feature maps step 1570 the packer module 522, under execution of the processor 205, packs integer feature maps 520 to produce the packed feature map frame 117. Quantised feature maps 520, corresponding to feature maps from each layer of the tensors 115 may be stored in a memory buffer configured, for example, within the memory 206 and/or hard disk drive 210, holding one frame of video data. Packing formats for the feature maps are described with reference to
At the encode metadata step 1580 the entropy encoder 638, under execution of the processor 205, encodes the feature map groupings 512 and quantisation ranges 516, i.e. the metadata 125, into the bitstream 121. The metadata 125 may be encoded as the SEI message 1413. The format of the SEI message 1413 is described with reference to Appendix A. Control in the processor 205 then progresses from the step 1580 to an encode frame step 1590.
At the encode frame step 1590 the video encoder 120, under execution of the processor 205, encodes the frame 119 into the bitstream 121. When the source device 110 is configured to encode feature maps, the frame 119 is obtained from the packed feature map frame 117 via the multiplexor 118. When the source device 110 is configured to encode feature maps, the video encoder 120 may use a subset of the coding tools available to a profile of the video coding standard. The subset of coding tools may be signalled using general constraint flags. For example, the “Main10” profile may be signalled in the profile level tier syntax 1438 in the bitstream 121 and general constraint flags 1440 may signal the following tools are not used in the bitstream 121: LFNST (via gci_no_lfnst_constraint_flag), MIP (via gci_no_mip_constraint_flag), LMCS (via gci_no_lmcs_constraint_flag), ISP (via gci_no_isp_constraint_flag), Affine (via gci_no_affine_motion_constraint_flag), GPM (via gci_no_gpm_constraint_flag), MMVD (via gci_no_mmvd_constraint_flag). Disabling the deblocking filter results in greater compression efficiency and higher task performance when encoding feature maps. In the VVC coding standard, the deblocking filter is disabled for pictures referencing a picture parameter set in the bitstream 121 having pps_deblocking_filter_disabled_flag set to ‘1’, unless overridden at the slice or picture level by coding sh_deblocking_filter_disabled_flag with a value of ‘1’ or by coding ph_deblocking_filter_disabled_flag with a value of ‘1’. Deblocking is not explicitly disabled using a constraint flag in the VVC standard and thus disabling the deblocking filter does not constitute part of the definition of a subprofile for feature map encoding, even though such disablement shows advantage. The method 1500 completes and processing in the processor 205 proceeds to the next frame.
At the step 1605 the quantiser module 518, under execution of the processor 205, normalises a floating-point value from a feature map into a [−1.0, 1.0] range by dividing the value by the quantisation range for the feature map, producing a normalised floating-point value. Control in the processor 205 progresses from step 1605 to a determine sign and magnitude step 1610.
At the step 1610, the quantiser module 518, under execution of the processor 205, separates the sign and the magnitude from the normalised floating-point value. Control in the processor 205 progresses from the step 1610 to an apply scaling step 1620.
At the apply scaling step 1620 the quantiser module 518, under execution of the processor 205, multiplies the floating-point magnitude from the step 1610 with a prescaling constant to produce a prescaled magnitude. The prescaling constant has two components, a power-of-two factor that effectively shifts a number of fractional bits of the floating-point value into integer bits, and a scaling factor. The scaling factor has a value greater than one. The scaling factor is selected to compensate for a quantisation from floating point precision to integer precision using a floor operation, which introduces a downward bias in the magnitude. A scaling factor of 1.31 was found to minimise the error in the reconstructed floating-point values after quantisation and inverse quantisation, although other values may also be used, such as a value approximating the square root of two, i.e. 1.41. The power-of-two factor of 65,536 results in a normalised range that, after a log 2 operation, can be reduced to a range of zero to sixteen, requiring four sample bits. The presence of the scaling factor may add another bit to the sample. The overall sample bit width remains less than the minimum bit-depth of 8 bits supported by the video encoder 120. Accordingly, the video encoder 120 may be configured to use an 8-bit sample depth and operate at 8-bits internally, i.e., an 8-bit profile may be used. Other power-of-two factors may also be used. Control in the processor 205 progresses from the step 1620 to a perform log 2 step 1630.
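By way of illustration only, the arithmetic of the step 1620 may be sketched as follows. The names `POWER_OF_TWO_FACTOR`, `SCALING_FACTOR` and `prescale` are hypothetical and are not taken from the described arrangements. The values 65,536 and 1.31 are the example values given above. The exponent calculation assumes the integer log 2 interpretation of the step 1630.

```python
import math

POWER_OF_TWO_FACTOR = 65536  # 2**16, example value from the description
SCALING_FACTOR = 1.31        # example bias-compensation value from the description


def prescale(magnitude: float) -> float:
    """Multiply a normalised magnitude in [0.0, 1.0] by the prescaling constant."""
    return magnitude * POWER_OF_TWO_FACTOR * SCALING_FACTOR


# Largest integer magnitude that can occur after the floor operation.
largest_integer_magnitude = math.floor(prescale(1.0))

# Exponent of (1 + magnitude), assuming an integer (floor) log 2 is taken.
largest_log2_value = (largest_integer_magnitude + 1).bit_length() - 1
```

With these example values the largest integer magnitude is 85852 and the largest log 2 value is 16, consistent with the range of zero to sixteen stated above.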
At the step 1630 the quantiser module 518, under execution of the processor 205, truncates the fractional portion of the prescaled magnitude to remove any portion to the right of the decimal point (i.e. a floor operation is applied) to produce an integer magnitude. A log 2 operation is performed on the result of one plus the integer magnitude, producing a log 2 value. The addition of the value one to the integer magnitude allows the logarithm operation to handle zero-valued tensor magnitudes. In other words, the power-of-two exponent is extracted from the prescaled magnitude to produce a log 2 value. As a result of steps 1620 and 1630 tensor magnitudes are converted from a linear space to a logarithmic space with low complexity. As the tensors contain many samples, low complexity quantisation offers an implementation benefit. Moreover, experimentation has shown that overall task performance is more dependent on preserving the exponent of floating-point values in each tensor than the precise values, i.e. the fractional portion of each floating-point value. Control in the processor 205 progresses from the step 1630 to a log 2 value threshold test step 1640.
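A minimal sketch of the step 1630 follows, assuming that the log 2 operation yields an integer exponent (i.e. the floor of the base-two logarithm), as suggested by the statement that the power-of-two exponent is extracted. The function name `log2_of_one_plus` is hypothetical. In Python, the integer exponent of a positive integer can be obtained from `int.bit_length()` without any floating-point logarithm.

```python
import math


def log2_of_one_plus(prescaled_magnitude: float) -> int:
    """Floor the prescaled magnitude, add one, and return the integer log 2."""
    if prescaled_magnitude < 0.0:
        raise ValueError("a magnitude must be non-negative")
    integer_magnitude = math.floor(prescaled_magnitude)
    # For n >= 1, n.bit_length() - 1 equals floor(log2(n)).
    return (integer_magnitude + 1).bit_length() - 1
```

A zero magnitude maps to a log 2 value of zero, which is the reason for adding one before taking the logarithm.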
At the step 1640 the quantiser module 518, under execution of the processor 205, compares the log 2 value with a predetermined threshold. If the log 2 value is less than or equal to the predetermined threshold, an adjusted log 2 value is set to zero and control in the processor 205 progresses from step 1640 to a produce sample value step 1660. If the log 2 value is greater than the predetermined threshold, control in the processor 205 progresses from step 1640 to a log 2 value adjustment step 1650. The predetermined threshold results in a number of narrow quantisation bins close to zero being merged into the zero bin. In particular, the zero bin covers approximately the same range as used in a linear quantisation scheme with a uniform bin spacing over the quantisation range of the floating-point values in the feature map with a 10-bit range. The bins for +1 and −1 also cover a similar bin spacing as seen in the case of linear quantisation to a 10-bit range. Alignment of bin sizes for the −1, 0, and +1 bins to the linear quantisation case is beneficial as many tensor values utilise these bins, which may be compressed by the video encoder 120 mainly using significance map coding (with a suitably set quantisation parameter 692).
At the step 1650 the quantiser module 518, under execution of the processor 205, subtracts the predetermined threshold from the log 2 value to produce an adjusted log 2 value. The predetermined threshold value may have a value of eight when the power-of-two factor is 65536 and an 8-bit sample bit-width is used. Control in the processor 205 progresses from the step 1650 to a produce sample value step 1660.
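The threshold test of the step 1640 and the adjustment of the step 1650 may, for example, be sketched as below. The names `adjust_log2` and `ZERO_BIN_EDGE` are hypothetical. The zero-bin edge is derived under the assumptions of an integer log 2, a power-of-two factor of 65536, a scaling factor of 1.31 and a threshold of eight.

```python
def adjust_log2(log2_value: int, threshold: int = 8) -> int:
    """Merge log 2 values up to the threshold into zero, then re-base the rest."""
    if log2_value <= threshold:
        return 0
    return log2_value - threshold


# Normalised magnitudes below this edge land in the zero bin, because
# floor(m * 65536 * 1.31) + 1 then stays below 2**9.
ZERO_BIN_EDGE = (2 ** 9 - 1) / (65536 * 1.31)
```

Under these assumptions log 2 values of 9 to 16 become adjusted values of 1 to 8, and the zero bin spans normalised magnitudes up to roughly 0.006.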
At the step 1660 the quantiser module 518, under execution of the processor 205, produces a sample for the packed feature map frames 117 by adding the adjusted log 2 value to, or subtracting the adjusted log 2 value from, a DC offset according to the sign determined at the step 1610. The DC offset may be set to the mid-point of the range of sample values afforded by the sample bit depth. For example, a DC offset of 128 may be used when the frames 117 use 8-bit samples. The method 1600 terminates and control in the processor 205 progresses to the next sample in the feature map to be quantised. The method 1600 results in sample values quite close to the DC value, with the popular bin values of −1, 0, +1 approximately preserved compared to the linear quantisation case. The maximum excursions away from the DC value are limited to −8 to +8, which is quite a narrow range, necessitating the use of low QPs to reduce error due to loss in the video encoder 120. As the quantised samples now encode tensor values in a logarithmic space, loss in the video encoder 120 has an exponential effect on reconstructed sample values.
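The steps 1605 to 1660 may be combined into a single routine, sketched below under stated assumptions. The function name `quantise_sample` and its parameter names are hypothetical. The clamp to [−1.0, 1.0] is a defensive assumption that is not part of the described method, and the log 2 operation is assumed to be an integer (floor) logarithm.

```python
import math


def quantise_sample(value: float, quant_range: float, *,
                    power_of_two: int = 65536, scaling: float = 1.31,
                    threshold: int = 8, dc_offset: int = 128) -> int:
    """Map one floating-point tensor value to an integer sample value."""
    # Step 1605: normalise by the quantisation range (clamp is an assumption).
    normalised = max(-1.0, min(1.0, value / quant_range))
    # Step 1610: separate sign and magnitude.
    negative = normalised < 0.0
    magnitude = abs(normalised)
    # Steps 1620 and 1630: prescale, floor, then integer log 2 of (1 + magnitude).
    integer_magnitude = math.floor(magnitude * power_of_two * scaling)
    log2_value = (integer_magnitude + 1).bit_length() - 1
    # Steps 1640 and 1650: merge small bins into zero, then subtract the threshold.
    adjusted = 0 if log2_value <= threshold else log2_value - threshold
    # Step 1660: apply the sign around the DC offset.
    return dc_offset - adjusted if negative else dc_offset + adjusted
```

With the default example parameters the produced samples lie between 120 and 136, i.e. within eight of the DC offset of 128.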
At the decode feature map groupings step 1710 the entropy decoder 720, under execution of the processor 205, decodes from the SEI message 1413 a structure indicating the assignment of each feature map of each layer to one or more groups of feature maps (i.e. the feature map groups 820). The decoded structure may be stored, for example, in the memory 206 and/or hard disk drive 210. The syntax of the feature map grouping in the SEI message 1413 is described with reference to Appendix A. Control in the processor 205 then progresses from the step 1710 to a decode quantisation ranges step 1720.
At the decode quantisation ranges step 1720 the entropy decoder 720, under execution of the processor 205, decodes a parameter in the form of a quantisation range 822 for each feature map group of the feature map groups 820, as determined at the step 1710 from the SEI message 1413. The quantisation range 822 is shared by each of a plurality of feature maps in a feature map group. The quantisation range 822 determined at step 1720 may be stored, for example, in the memory 206 and/or hard disk drive 210. When symmetric quantisation is in use, a single value is decoded at step 1720 for each feature map group, representing the maximum magnitude of the floating-point data within the feature maps belonging to the respective group. When asymmetric quantisation is in use at step 1720, a pair of values is decoded for each feature map group, representing the maximum and minimum values of the floating-point data within the feature maps belonging to the respective group. The processor 205 may operate to perform the step 1720 on every frame of video data, or the processor 205 may operate to perform the step 1720 less frequently. The step 1720 may be performed in intra pictures or on random access points in the bitstream 143. When the step 1720 is not performed for every frame, the feature map grouping and quantisation range data is carried over subsequent frames for reuse, until a new set of feature map grouping and/or quantisation range data is decoded from the bitstream 143. Control in the processor 205 then progresses from the step 1720 to a decode frame step 1730.
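The carry-over behaviour described for the step 1720 may be illustrated with the following sketch. The class `QuantRangeState`, the method `on_frame` and the representation of a range as either a single value (symmetric) or a (minimum, maximum) pair (asymmetric) are hypothetical and do not reflect the syntax of the SEI message 1413.

```python
from typing import Dict, Optional, Tuple, Union

# Symmetric: maximum magnitude. Asymmetric: (minimum, maximum). Hypothetical.
Range = Union[float, Tuple[float, float]]


class QuantRangeState:
    """Hold the most recently decoded quantisation range per feature map group."""

    def __init__(self) -> None:
        self._ranges: Dict[int, Range] = {}

    def on_frame(self, decoded: Optional[Dict[int, Range]]) -> Dict[int, Range]:
        """Return the ranges for this frame; `decoded` is None if none were sent."""
        if decoded is not None:
            # A newly decoded set replaces the set carried over so far.
            self._ranges = dict(decoded)
        if not self._ranges:
            raise RuntimeError("no quantisation ranges have been decoded yet")
        return dict(self._ranges)
```

In this sketch a frame without new metadata simply reuses the ranges from the most recent frame that carried them.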
At the decode frame step 1730 the entropy decoder 114, under execution of the processor 205, operates to produce the frame 145 by decoding a portion of the bitstream 143, corresponding to an access unit, such as AU 1414. The frame 145 may contain packed feature maps or may contain an image corresponding to a frame, for example from the video source 112. If the frame 145 contains an image frame, that is, does not contain packed feature maps, the method 1700 terminates and decoding progresses to the next frame. The frame 145 produced at step 1730 may be stored, for example, in the memory 206 and/or hard disk drive 210. If the frame 145 contains packed feature maps, the processor 205 progresses from the step 1730 to a determine feature map placement step 1740.
At the determine feature map placement step 1740 the unpacker module 810, under execution of the processor 205, determines the location of each feature map of each layer in the frame 145. Using the spatial size of each feature map, the feature map groupings, and the number of feature maps in each layer, placement information is determined in accordance with the approach of the step 1540 and as described with reference to
At the unpack feature maps step 1750 the unpacker module 810, under execution of the processor 205, extracts samples from the frame 147 to produce integer feature maps 812 according to the determined feature map placement from the step 1740. The integer feature maps 812 determined at step 1750 may be stored, for example, in the memory 206 and/or hard disk drive 210. Control in the processor 205 then progresses from the step 1750 to an inverse quantise feature maps step 1760.
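As an illustrative sketch only, the extraction of the step 1750 may be expressed as copying rectangular regions out of a decoded frame. The placement tuple (x, y, width, height) and the function name `unpack_feature_maps` are hypothetical. The placement approach of the step 1740 is not reproduced here and is supplied as an input.

```python
from typing import List, Sequence, Tuple

# Hypothetical placement record: (x, y, width, height) in samples.
Placement = Tuple[int, int, int, int]


def unpack_feature_maps(frame: Sequence[Sequence[int]],
                        placements: Sequence[Placement]) -> List[List[List[int]]]:
    """Copy each placed rectangle out of a 2-D frame of integer samples."""
    feature_maps = []
    for x, y, width, height in placements:
        rows = [list(frame[row][x:x + width]) for row in range(y, y + height)]
        feature_maps.append(rows)
    return feature_maps
```

For example, a frame of two rows and four columns with placements (0, 0, 2, 2) and (2, 0, 2, 2) yields two 2×2 integer feature maps.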
At the inverse quantise feature maps step 1760 the inverse quantiser module 814, under execution of the processor 205, converts the integer feature maps 812 into floating point feature maps, assembled into the tensors 149 as input to the CNN head 150. The floating-point feature maps may be stored, for example, in the memory 206 and/or hard disk drive 210. Operation of the step 1760 is described with reference to
At the perform CNN second portion step 1770 the CNN head 150, under execution of the processor 205, performs the remaining stages of the CNN (i.e. the stages specific to a particular task). The decoded, unpacked and inverse quantised tensors 149 are input to the CNN head 150. Within the CNN head 150 a series of convolutions, normalisations, fully connected layer operations, and activation stages are performed leading to a CNN result 151. The CNN result 151 is stored in the task result buffer 152, for example, configured within the memory 206. The method 1700 terminates and control in the processor 205 progresses to the next frame.
At the step 1810 the inverse quantiser 814, under execution of the processor 205, subtracts a DC offset from a sample value obtained from the frame 147, separating the sign and magnitude of the result to obtain a sample sign and a sample magnitude. The DC offset may be set to the mid-point of the range of sample values afforded by the sample bit depth. For example, a DC offset of 128 may be used when the frames 147 use 8-bit samples. As the sample value encodes a tensor value in a logarithmic space, the maximum excursions away from the DC value are limited to −8 to +8. Control in the processor 205 progresses from the step 1810 to a magnitude test step 1820.
At the step 1820 the inverse quantiser module 814, under execution of the processor 205, determines whether the sample magnitude is equal to zero or greater than zero. If the sample magnitude is equal to zero (NO), an adjusted sample magnitude value is set to zero and control in the processor 205 progresses from step 1820 to an apply exponent step 1840. If the sample magnitude is greater than zero (YES), control in the processor 205 progresses from step 1820 to an apply offset step 1830.
At the step 1830 the inverse quantiser module 814, under execution of the processor 205, adds a predetermined threshold to the sample magnitude to produce an adjusted sample magnitude. The predetermined threshold value may have a value of eight when the power-of-two factor is 65536 and an 8-bit sample bit-width is used. In other words, the adjusted sample magnitude is the sum of the sample magnitude and the predetermined threshold, if the sample magnitude is greater than zero. Control in the processor 205 progresses from the step 1830 to the step 1840.
At the step 1840 the inverse quantiser module 814, under execution of the processor 205, computes two to the power of the adjusted sample magnitude, minus one, to produce an integer tensor magnitude (i.e., $\text{tensor magnitude} = 2^{\text{adjusted sample magnitude}} - 1$). Subtraction of the value of one (and addition of the value of one at corresponding step 1630) allows propagation of zero-valued tensor magnitudes to and from the logarithmic domain. Control in the processor 205 progresses from the step 1840 to a generate normalised tensor value step 1850.
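The steps 1810 to 1840 may be sketched in the integer domain as follows. The function name `sample_to_integer_magnitude` and its parameter names are hypothetical. The default DC offset of 128 and threshold of eight are the example values given above.

```python
from typing import Tuple


def sample_to_integer_magnitude(sample: int, *, dc_offset: int = 128,
                                threshold: int = 8) -> Tuple[int, int]:
    """Return (sign, integer tensor magnitude) for one decoded sample value."""
    # Step 1810: remove the DC offset and split sign from magnitude.
    delta = sample - dc_offset
    sign = -1 if delta < 0 else 1
    magnitude = abs(delta)
    # Steps 1820 and 1830: restore the threshold for non-zero magnitudes only.
    adjusted = magnitude + threshold if magnitude > 0 else 0
    # Step 1840: two to the power of the adjusted magnitude, minus one.
    return sign, (1 << adjusted) - 1
```

A sample equal to the DC offset yields an integer tensor magnitude of zero, and a sample eight above or below the DC offset yields 65535.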
At the step 1850, the inverse quantiser module 814, under execution of the processor 205, divides the integer tensor magnitude by the power-of-two factor, for example 65536, and applies the sample sign from the step 1810 to produce a normalised tensor magnitude. The normalised tensor magnitude lies within the range −1.0 to 1.0. Control in the processor 205 progresses from the step 1850 to an apply quantisation range step 1860.
At the step 1860 the inverse quantiser module 814, under execution of the processor 205, multiplies the normalised tensor magnitude (which is a floating-point value) from the step 1850 by the quantisation range for the feature map (which is part of the metadata 125 and the associated decoded metadata 155, both of which are related to the decoded frame 147) to determine a tensor value, restoring the tensor value to the range seen at the input to the quantiser module 518. One quantisation range for one or more feature maps specifies the magnitude, that is, one value represents the largest magnitude seen within the one or more feature maps regardless of whether such a value is positive or negative. The method 1800 then terminates and control in the processor 205 progresses to the next sample in the frame 147.
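The complete method 1800 may be sketched as a single routine, again under stated assumptions. The function name `dequantise_sample`, its parameter names and the list `levels` are hypothetical. Symmetric quantisation with a single quantisation range per feature map is assumed.

```python
def dequantise_sample(sample: int, quant_range: float, *,
                      power_of_two: int = 65536, threshold: int = 8,
                      dc_offset: int = 128) -> float:
    """Map one decoded integer sample back to a floating-point tensor value."""
    # Steps 1810 to 1840: DC removal, threshold restoration, exponentiation.
    delta = sample - dc_offset
    magnitude = abs(delta)
    adjusted = magnitude + threshold if magnitude > 0 else 0
    integer_magnitude = (1 << adjusted) - 1
    # Step 1850: normalise and re-apply the sign.
    normalised = integer_magnitude / power_of_two
    if delta < 0:
        normalised = -normalised
    # Step 1860: restore the original range.
    return normalised * quant_range


# Reconstruction levels for samples 128 to 136 with a quantisation range of 1.0.
levels = [dequantise_sample(s, 1.0) for s in range(128, 137)]
```

In this sketch adjacent non-zero reconstruction levels differ by approximately a factor of two, which illustrates why an error of one sample step in the video codec has a large effect on the reconstructed tensor value.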
In an arrangement of the system 100, and having the division of a network (i.e. the boundary between the CNN backbone 114 and the CNN head 150) at the output of a leaky rectified linear (LeakyReLU) activation function, negative tensor values are quantised using a linear quantisation scheme and positive tensor values are quantised using the logarithmic scheme of
The use of power-of-two and log 2 exponential functions in methods 1600 and 1800 enables low-complexity implementations. However, it is possible to use other base values, including non-integer values, at the expense of introducing more complex logic including additional floating-point operations into the design. Regardless of the base value used, ensuring quantisation bin width remains comparable to the linear case for small tensor magnitudes is necessary to avoid excessively large sample values, which do not contribute to task performance but do increase the difficulty of encoding the packed feature map frame 117.
In another arrangement of the system 100, the conversion of the methods 1600 and 1800 is approximated with a piecewise-linear model with n segments, where n is an odd number and the middle bin of the centre segment corresponds to the zero bin. In one example, n is set to three, resulting in a central linear segment and two outer linear segments, one for positive values above a threshold and another for negative values below the threshold. The piecewise-linear model can be applied in the floating-point domain or can be applied on integerised values. For example, the result of the step 1620 could be integerised (‘floor’ operation applied) and the piecewise-linear model performed instead of the steps 1630-1650. In the method 1800, an inverse of the piecewise-linear model is performed on the received sample values from the feature map, after removal of any DC offset. Use of a symmetric linear model, that is, having an odd number of segments and with the central value of the central segment corresponding to the zero bin, means that there is no need to separately store the sign while converting the magnitude between the sample domain and the tensor domain.
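One possible three-segment piecewise-linear model, applied in the floating-point domain, is sketched below. The knee position of 0.125 and the four levels allotted to each side of the central and outer segments are purely illustrative assumptions, as are the function names `pwl_forward` and `pwl_inverse`. No particular slopes or thresholds are specified by the arrangement described above.

```python
import math


def pwl_forward(x: float, knee: float = 0.125,
                inner: int = 4, outer: int = 4) -> int:
    """Map a normalised value in [-1.0, 1.0] to a signed integer offset."""
    if x > knee:
        y = inner + (x - knee) / (1.0 - knee) * outer
    elif x < -knee:
        y = -inner + (x + knee) / (1.0 - knee) * outer
    else:
        y = x / knee * inner  # central segment, centred on the zero bin
    return math.floor(y + 0.5)  # round to the nearest level


def pwl_inverse(q: int, knee: float = 0.125,
                inner: int = 4, outer: int = 4) -> float:
    """Map a signed integer offset back to a normalised value."""
    if q > inner:
        return knee + (q - inner) / outer * (1.0 - knee)
    if q < -inner:
        return -knee + (q + inner) / outer * (1.0 - knee)
    return q / inner * knee
```

Because the model is symmetric about zero, the sign is carried by the value itself and the offset returned by `pwl_forward` can be added directly to a DC offset.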
In another arrangement of the system 100, the conversion of the methods 1600 and 1800 uses a three-segment model. A central segment provides a linear model, with a central bin corresponding to the zero bin. Two outer segments provide logarithmic models, for example using integer log 2 operations as described with reference to
The arrangements described are applicable to the computer and data processing industries and particularly for the digital signal processing for the encoding and decoding of signals such as video and image signals, achieving high compression efficiency.
Arrangements for quantising floating-point tensor data in groups of channels, or feature maps, and packing the resulting integer values into planar frames using a logarithmic quantised domain are also disclosed. Quantisation and inverse quantisation methods employing a logarithmic quantised domain enable greater compression efficiency due to the absence of bits spent encoding precise values for large magnitude tensor values, where such precision does not result in additional improvement in task performance for the network in use.
The foregoing describes only some embodiments of the present invention, and modifications and/or changes can be made thereto without departing from the scope and spirit of the invention, the embodiments being illustrative and not restrictive.
In the context of this specification, the word “comprising” means “including principally but not necessarily solely” or “having” or “including”, and not “consisting only of”. Variations of the word “comprising”, such as “comprise” and “comprises” have correspondingly varied meanings.
| Number | Date | Country | Kind |
|---|---|---|---|
| 2022200086 | Jan 2022 | AU | national |

| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/AU2022/051286 | 10/26/2022 | WO | |