The project leading to this application has received funding from the ECSEL Joint Undertaking (JU) under grant agreement No 783162. The JU receives support from the European Union's Horizon 2020 research and innovation programme and Netherlands, Czech Republic, Finland, Spain, Italy.
The examples and non-limiting embodiments relate generally to multimedia transport and neural networks, and more particularly, to a method, an apparatus, and a computer program product for implementing mechanisms for training or finetuning at least one neural network.
It is known to provide standardized formats for exchange of neural networks.
An example apparatus includes at least one processor; and at least one non-transitory memory including computer program code; wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus at least to perform: train or finetune at least one neural network (NN) based at least on a temporal persistence scope; and encode or decode one or more media elements based at least on the trained or finetuned at least one neural network.
The example apparatus may further include, wherein the temporal persistence scope comprises one or more of the following: any test video, and wherein the at least one NN is used to encode or decode any test video; a first set of videos, and wherein the at least one NN is used to encode or decode a video in the first set of videos; a first video, and wherein the at least one NN is used to encode or decode any frame or any patch of the first video; one or more sets of consecutive video frames from a second video, and wherein the at least one NN is used to encode or decode any frame or any patch in the one or more sets of consecutive video frames from the second video; one or more video frames from a third video, and wherein the at least one NN is used to encode or decode any patch in the one or more video frames from the third video; or one or more patches from one or more video frames, and wherein the at least one NN is used to encode or decode the one or more patches from a video frame of the one or more video frames from a fourth video.
The example apparatus may further include, wherein when the temporal persistence scope comprises any test video, the at least one NN is pretrained on a training dataset, in an offline pretraining phase.
The example apparatus may further include, wherein when the temporal persistence scope comprises the first set of videos, the at least one NN is trained based on a base NN by using content from the first set of videos as training data.
The example apparatus may further include, wherein the base NN comprises one of the following: a randomly initialized NN; or an NN pretrained on a training dataset.
The example apparatus may further include, wherein when the temporal persistence scope comprises the first video, the at least one NN is trained based on a base NN by using content from the first video as training data.
The example apparatus may further include, wherein the base NN comprises one of the following: a randomly initialized NN; an NN pretrained on a training dataset; or an NN pretrained or finetuned on a second set of videos comprising the first video.
The example apparatus may further include, wherein when the temporal persistence scope comprises the one or more sets of consecutive video frames, the at least one NN is trained based on a base NN by using content from the one or more sets of consecutive video frames from the second video as training data.
The example apparatus may further include, wherein the base NN comprises one of the following: a randomly initialized NN; an NN pretrained on a training dataset; an NN pretrained or finetuned on a second set of videos comprising the first video; or an NN pretrained or finetuned on part or all frames in the second video.
The example apparatus may further include, wherein when the temporal persistence scope comprises the one or more video frames from the third video, the at least one NN is trained based on a base NN by using content from the one or more video frames from the third video as training data.
The example apparatus may further include, wherein the base NN comprises one of the following: a randomly initialized NN; an NN pretrained on a training dataset; an NN pretrained or finetuned on a second set of videos comprising the first video; an NN pretrained or finetuned on part or all frames in the second video; or an NN pretrained or finetuned on one or more sets of consecutive video frames in the third video.
The example apparatus may further include, wherein when the temporal persistence scope comprises the one or more patches from the one or more video frames, the at least one NN is trained based on a base NN by using content from the one or more patches from the fourth video as training data.
The example apparatus may further include, wherein the base NN comprises one of the following: a randomly initialized NN; an NN pretrained on a training dataset; an NN pretrained or finetuned on a second set of videos comprising the first video; an NN pretrained or finetuned on part or all frames in the second video; an NN pretrained or finetuned on one or more sets of consecutive video frames in the third video; or an NN pretrained or finetuned on one or more video frames in the fourth video.
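As a non-limiting illustration, the relationship between a temporal persistence scope and its candidate base NNs may be sketched as follows. The sketch assumes that the scopes can be ordered from broadest to narrowest and that each scope may start from a randomly initialized NN or from an NN obtained at any broader scope. The names Scope, RANDOM_INIT, and allowed_base_nns are hypothetical and are introduced only for this illustration.

```python
from enum import IntEnum


class Scope(IntEnum):
    """Temporal persistence scopes, ordered from broadest to narrowest (assumed ordering)."""
    ANY_TEST_VIDEO = 0      # NN pretrained on a training dataset, offline
    SET_OF_VIDEOS = 1
    VIDEO = 2
    CONSECUTIVE_FRAMES = 3  # one or more sets of consecutive video frames
    FRAMES = 4
    PATCHES = 5


RANDOM_INIT = "random_init"  # hypothetical label for a randomly initialized NN


def allowed_base_nns(target: Scope) -> list:
    """Return the base-NN candidates for an NN trained at the `target` scope."""
    if target == Scope.ANY_TEST_VIDEO:
        # Pretrained offline; no base NN is described for this scope.
        return []
    broader = [s for s in Scope if s < target]
    return [RANDOM_INIT] + broader
```

For example, allowed_base_nns(Scope.VIDEO) lists a randomly initialized NN, an NN pretrained on a training dataset, and an NN pretrained or finetuned on a set of videos.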
The example apparatus may further include, wherein the apparatus is further caused to: encode at least one of a topology, weights, or a weight-update of the at least one NN; or specify a uniform resource identifier (URI) from which at least one of the topology or weights of the at least one NN are obtained.
The example apparatus may further include, wherein the apparatus is further caused to signal an indication of which base NN to update, wherein the indication comprises a first high-level syntax element.
The example apparatus may further include, wherein the first high-level syntax element comprises a base neural network identity, comprising a value from a set of predetermined values.
The example apparatus may further include, wherein the indicated base NN comprises an NN pretrained on a training dataset, or an NN trained or finetuned on a second set of videos comprising the first video.
The example apparatus may further include, wherein the indicated base NN comprises a randomly initialized NN; an NN pretrained on a training dataset; an NN pretrained or finetuned on a second set of videos comprising the first video; or an NN pretrained or finetuned on a part or all frames in the second video.
The example apparatus may further include, wherein the indicated base NN comprises a randomly initialized NN; an NN pretrained on a training dataset; an NN pretrained or finetuned on a second set of videos comprising the first video; an NN pretrained or finetuned on part or all frames in the second video; or an NN pretrained or finetuned on one or more sets of consecutive video frames in the third video.
The example apparatus may further include, wherein the indicated base NN comprises a randomly initialized NN; an NN pretrained on a training dataset; an NN pretrained or finetuned on a second set of videos comprising the first video; an NN pretrained or finetuned on part or all frames in the second video; an NN pretrained or finetuned on one or more sets of consecutive video frames in the third video; or an NN pretrained or finetuned on one or more video frames in the fourth video.
The example apparatus may further include, wherein the apparatus is further caused to: signal a unique identifier for each NN.
The example apparatus may further include, wherein the apparatus is further caused to signal a flag to indicate whether an NN comprises a base NN.
The example apparatus may further include, wherein to train or finetune the at least one neural network based on the temporal persistence scope, the apparatus is further caused to finetune the at least one neural network jointly on one or more video frames from a first random access segment and one or more video frames from a second random access segment, wherein the second random access segment comprises a segment following the first random access segment.
The example apparatus may further include, wherein the one or more video frames from the first random access segment comprises all video frames from the first random access segment, and wherein the one or more video frames from the second random access segment comprises at least one initial video frame from the second random access segment.
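As a non-limiting illustration, the selection of training frames for such joint finetuning may be sketched as follows. The sketch assumes that each random access segment is available as a sequence of frames; the function name joint_finetuning_frames and the parameter num_initial are hypothetical.

```python
def joint_finetuning_frames(first_segment, second_segment, num_initial=1):
    """Collect frames for jointly finetuning an NN on two random access segments.

    All frames of the first segment are used, together with the first
    `num_initial` frames of the segment that follows it.
    """
    if num_initial < 1:
        raise ValueError("at least one initial frame of the second segment is required")
    return list(first_segment) + list(second_segment[:num_initial])
```

With first_segment = [0, 1, 2, 3], second_segment = [4, 5, 6, 7], and num_initial = 2, the frames selected for finetuning are [0, 1, 2, 3, 4, 5].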
The example apparatus may further include, wherein the apparatus is further caused to process the one or more video frames from the first random access segment and the second random access segment by using one of the following NNs: an NN trained or finetuned on a previous RA segment; an NN trained or finetuned on a current RA segment; or an NN trained or finetuned on a next RA segment.
The example apparatus may further include, wherein the apparatus is further caused to process the one or more video frames from the first random access segment and the second random access segment by using an NN obtained by combining two or more of the following: an NN trained or finetuned on a previous RA segment; an NN trained or finetuned on a current RA segment; or an NN trained or finetuned on a next RA segment.
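As a non-limiting illustration, one possible way of combining two or more NNs is a weighted average of their corresponding weights. This particular combination rule is an assumption made for the sketch and is not stated to be the only option; the function name combine_nns and the representation of an NN as a dictionary of weight lists are likewise hypothetical.

```python
def combine_nns(nns, coeffs):
    """Combine NNs with identical topology by a weighted sum of their weights.

    nns:    list of dicts mapping a parameter name to a list of floats
    coeffs: one coefficient per NN (assumed here to be chosen by the encoder)
    """
    if not nns or len(nns) != len(coeffs):
        raise ValueError("need one coefficient per NN")
    combined = {}
    for name, values in nns[0].items():
        combined[name] = [
            sum(c * nn[name][i] for nn, c in zip(nns, coeffs))
            for i in range(len(values))
        ]
    return combined
```

For instance, averaging an NN finetuned on the previous RA segment with an NN finetuned on the current RA segment corresponds to coeffs = [0.5, 0.5].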
The example apparatus may further include, wherein the apparatus is further caused to signal one or more NNs from different examples that are to be used to encode or decode different parts of the content in the one or more media elements.
The example apparatus may further include, wherein the signal comprises a second high-level syntax element.
The example apparatus may further include, wherein the second high-level syntax element comprises a multiple_nn_scopes.
The example apparatus may further include, wherein the apparatus is further caused to indicate an NN that is to be used for each patch or CTU of the one or more media elements.
The example apparatus may further include, wherein the apparatus is further caused to associate with each of the one or more media elements an identifier of an associated NN.
The example apparatus may further include, wherein the identifier comprises ref_nn_id, wherein the ref_nn_id comprises one of the predetermined values of an nn_id.
The example apparatus may further include, wherein the apparatus is further caused to indicate a default NN, wherein the default NN is used to encode or decode all media elements.
The example apparatus may further include, wherein the apparatus is caused to signal the default NN by using a third high-level syntax.
The example apparatus may further include, wherein the third high-level syntax comprises a default_NN_flag.
The example apparatus may further include, wherein the third high-level syntax comprises a default_nn_id, wherein the default_nn_id is signaled once for the one or more media elements, and wherein the default_nn_id comprises one of the predetermined values of nn_id.
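As a non-limiting illustration, the interplay of ref_nn_id and default_nn_id may be sketched as follows. The sketch assumes that a ref_nn_id associated with a media element, when present, takes precedence over the default_nn_id signaled once for the one or more media elements; this precedence rule and the function name select_nn_id are assumptions made for the illustration.

```python
def select_nn_id(ref_nn_id, default_nn_id, known_nn_ids):
    """Pick the NN identifier to use for one media element (e.g. a patch or CTU).

    Assumption: a per-element ref_nn_id, when present (not None), overrides
    the default_nn_id that is signaled once for all media elements.
    """
    nn_id = default_nn_id if ref_nn_id is None else ref_nn_id
    if nn_id not in known_nn_ids:
        raise ValueError(f"nn_id {nn_id} is not one of the predetermined values")
    return nn_id
```

A media element without its own ref_nn_id is then encoded or decoded with the default NN, while a media element carrying ref_nn_id = 2 uses the NN whose nn_id is 2.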
Another example apparatus includes: at least one processor; and at least one non-transitory memory including computer program code; wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus at least to perform: receive a weight-update prediction error from an encoder-side; predict a weight-update based on one or more reference weight-updates and a prediction function or algorithm; and reconstruct a weight-update by combining the predicted weight-update and the prediction error.
The example apparatus may further include, wherein the two or more weight-updates are represented as a single weight update.
The example apparatus may further include, wherein to represent the two or more weight-updates as the single weight update, the apparatus is further caused to perform summarization.
The example apparatus may further include, wherein to perform summarization, the apparatus is further caused to cluster the two or more weight-updates.
The example apparatus may further include, wherein to perform summarization, the apparatus is further caused to combine the two or more weight-updates by using a linear combination.
The example apparatus may further include, wherein one or more of the weight-updates are dropped or removed from a memory or a storage.
Yet another example apparatus includes at least one processor; and at least one non-transitory memory including computer program code; wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus at least to: perform a prediction process, on an encoder-side, to generate a predicted weight-update based on one or more reference weight-updates and a prediction function or algorithm; generate a weight-update prediction error based on a weight-update and on the predicted weight-update; encode the weight-update prediction error; and provide the encoded weight-update prediction error to a decoder-side; wherein the decoder-side decodes the encoded weight-update prediction error, predicts a weight-update based on one or more reference weight-updates and a prediction function or algorithm, and reconstructs a weight-update by combining the predicted weight-update and the decoded weight-update prediction error.
The example apparatus may further include, wherein the prediction process is performed based at least on one or more of previously decoded weight-updates or at least part of a decoded content.
The example apparatus may further include, wherein the decoded content comprises at least one of: a decoded frame that needs to be post-processed by the NN; or one or more of the previously decoded frames.
The example apparatus may further include, wherein the prediction process comprises one or more of the following techniques: use one of the previous weight-updates as a predicted weight-update; combine one or more of the previous weight-updates by using a predetermined function; combine one or more of the previous weight-updates by using a parametric function; or use an auxiliary neural network to predict the weight-update, by using at least one of one or more of the previous weight-updates or one or more of the previously decoded content.
The example apparatus may further include, wherein the predetermined function comprises a linear combination with predetermined coefficients.
The example apparatus may further include, wherein the parametric function comprises a linear combination with coefficients signaled from the encoder-side to the decoder-side.
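As a non-limiting illustration, predictive coding of a weight-update with a linear-combination prediction function may be sketched as follows. The sketch assumes that weight-updates are flat lists of floats and omits quantization and entropy coding of the prediction error, so the reconstruction here is exact; the function names predict, prediction_error, and reconstruct are hypothetical.

```python
def predict(reference_updates, coeffs):
    """Predict a weight-update as a linear combination of reference weight-updates."""
    if not reference_updates or len(reference_updates) != len(coeffs):
        raise ValueError("need one coefficient per reference weight-update")
    size = len(reference_updates[0])
    return [
        sum(c * ref[i] for ref, c in zip(reference_updates, coeffs))
        for i in range(size)
    ]


def prediction_error(weight_update, reference_updates, coeffs):
    """Encoder-side: error between the actual and the predicted weight-update."""
    predicted = predict(reference_updates, coeffs)
    return [w - p for w, p in zip(weight_update, predicted)]


def reconstruct(error, reference_updates, coeffs):
    """Decoder-side: repeat the prediction and add the received error."""
    predicted = predict(reference_updates, coeffs)
    return [p + e for p, e in zip(predicted, error)]
```

Both sides run the same prediction on the same reference weight-updates, so only the prediction error (and, for a parametric function, the coefficients) needs to be provided to the decoder-side.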
The example apparatus may further include, wherein the apparatus is further caused to indicate previous weight-updates and content to use to predict the weight-update.
The example apparatus may further include, wherein the apparatus is further caused to: use a weight-update identifier to uniquely identify each weight-update; and signal the weight-update identifier and the corresponding weight-update prediction error to the decoder-side.
An example method includes training or finetuning at least one neural network (NN) based at least on a temporal persistence scope; and encoding or decoding one or more media elements based at least on the trained or finetuned at least one neural network.
The example method may further include, wherein the temporal persistence scope comprises one or more of the following: any test video, and wherein the at least one NN is used to encode or decode any test video; a first set of videos, and wherein the at least one NN is used to encode or decode a video in the first set of videos; a first video, and wherein the at least one NN is used to encode or decode any frame or any patch of the first video; one or more sets of consecutive video frames from a second video, and wherein the at least one NN is used to encode or decode any frame or any patch in the one or more sets of consecutive video frames from the second video; one or more video frames from a third video, and wherein the at least one NN is used to encode or decode any patch in the one or more video frames from the third video; or one or more patches from one or more video frames, and wherein the at least one NN is used to encode or decode the one or more patches from a video frame of the one or more video frames from a fourth video.
The example method may further include, wherein when the temporal persistence scope comprises any test video, the at least one NN is pretrained on a training dataset, in an offline pretraining phase.
The example method may further include, wherein when the temporal persistence scope comprises the first set of videos, the at least one NN is trained based on a base NN by using content from the first set of videos as training data.
The example method may further include, wherein the base NN comprises one of the following: a randomly initialized NN; or an NN pretrained on a training dataset.
The example method may further include, wherein when the temporal persistence scope comprises the first video, the at least one NN is trained based on a base NN by using content from the first video as training data.
The example method may further include, wherein the base NN comprises one of the following: a randomly initialized NN; an NN pretrained on a training dataset; or an NN pretrained or finetuned on a second set of videos comprising the first video.
The example method may further include, wherein when the temporal persistence scope comprises the one or more sets of consecutive video frames, the at least one NN is trained based on a base NN by using content from the one or more sets of consecutive video frames from the second video as training data.
The example method may further include, wherein the base NN comprises one of the following: a randomly initialized NN; an NN pretrained on a training dataset; an NN pretrained or finetuned on a second set of videos comprising the first video; or an NN pretrained or finetuned on part or all frames in the second video.
The example method may further include, wherein when the temporal persistence scope comprises the one or more video frames from the third video, the at least one NN is trained based on a base NN by using content from the one or more video frames from the third video as training data.
The example method may further include, wherein the base NN comprises one of the following: a randomly initialized NN; an NN pretrained on a training dataset; an NN pretrained or finetuned on a second set of videos comprising the first video; an NN pretrained or finetuned on part or all frames in the second video; or an NN pretrained or finetuned on one or more sets of consecutive video frames in the third video.
The example method may further include, wherein when the temporal persistence scope comprises the one or more patches from the one or more video frames, the at least one NN is trained based on a base NN by using content from the one or more patches from the fourth video as training data.
The example method may further include, wherein the base NN comprises one of the following: a randomly initialized NN; an NN pretrained on a training dataset; an NN pretrained or finetuned on a second set of videos comprising the first video; an NN pretrained or finetuned on part or all frames in the second video; an NN pretrained or finetuned on one or more sets of consecutive video frames in the third video; or an NN pretrained or finetuned on one or more video frames in the fourth video.
The example method may further include encoding at least one of a topology, weights, or a weight-update of the at least one NN; or specifying a uniform resource identifier (URI) from which at least one of the topology or weights of the at least one NN are obtained.
The example method may further include signaling an indication of which base NN to update, wherein the indication comprises a first high-level syntax element.
The example method may further include, wherein the first high-level syntax element comprises a base neural network identity, comprising a value from a set of predetermined values.
The example method may further include, wherein the indicated base NN comprises an NN pretrained on a training dataset, or an NN trained or finetuned on a second set of videos comprising the first video.
The example method may further include, wherein the indicated base NN comprises a randomly initialized NN; an NN pretrained on a training dataset; an NN pretrained or finetuned on a second set of videos comprising the first video; or an NN pretrained or finetuned on a part or all frames in the second video.
The example method may further include, wherein the indicated base NN comprises a randomly initialized NN; an NN pretrained on a training dataset; an NN pretrained or finetuned on a second set of videos comprising the first video; an NN pretrained or finetuned on part or all frames in the second video; or an NN pretrained or finetuned on one or more sets of consecutive video frames in the third video.
The example method may further include, wherein the indicated base NN comprises a randomly initialized NN; an NN pretrained on a training dataset; an NN pretrained or finetuned on a second set of videos comprising the first video; an NN pretrained or finetuned on part or all frames in the second video; an NN pretrained or finetuned on one or more sets of consecutive video frames in the third video; or an NN pretrained or finetuned on one or more video frames in the fourth video.
The example method may further include signaling a unique identifier for each NN.
The example method may further include signaling a flag to indicate whether an NN comprises a base NN.
The example method may further include, wherein finetuning or training the at least one neural network based on the temporal persistence scope comprises finetuning the at least one neural network jointly on one or more video frames from a first random access segment and one or more video frames from a second random access segment, wherein the second random access segment comprises a segment following the first random access segment.
The example method may further include, wherein the one or more video frames from the first random access segment comprises all video frames from the first random access segment, and wherein the one or more video frames from the second random access segment comprises at least one initial video frame from the second random access segment.
The example method may further include processing the one or more video frames from the first random access segment and the second random access segment by using one of the following NNs: an NN trained or finetuned on a previous RA segment; an NN trained or finetuned on a current RA segment; or an NN trained or finetuned on a next RA segment.
The example method may further include processing the one or more video frames from the first random access segment and the second random access segment by using an NN obtained by combining two or more of the following: an NN trained or finetuned on a previous RA segment; an NN trained or finetuned on a current RA segment; or an NN trained or finetuned on a next RA segment.
The example method may further include signaling one or more NNs from different examples that are to be used for encoding or decoding different parts of the content in the one or more media elements.
The example method may further include, wherein the signal comprises a second high-level syntax element.
The example method may further include, wherein the second high-level syntax element comprises a multiple_nn_scopes.
The example method may further include indicating an NN that is to be used for each patch or CTU of the one or more media elements.
The example method may further include associating with each of the one or more media elements an identifier of an associated NN.
The example method may further include, wherein the identifier comprises ref_nn_id, wherein the ref_nn_id comprises one of the predetermined values of an nn_id.
The example method may further include indicating a default NN, wherein the default NN is used to encode or decode all media elements.
The example method may further include signaling the default NN by using a third high-level syntax.
The example method may further include, wherein the third high-level syntax comprises a default_NN_flag.
The example method may further include, wherein the third high-level syntax comprises a default_nn_id, wherein the default_nn_id is signaled once for the one or more media elements, and wherein the default_nn_id comprises one of the predetermined values of nn_id.
Another example method includes receiving a weight-update prediction error from an encoder-side; predicting a weight-update based on one or more reference weight-updates and a prediction function or algorithm; and reconstructing a weight-update by combining the predicted weight-update and the prediction error.
The example method may further include representing the two or more weight-updates as a single weight update.
The example method may further include, wherein the representing the two or more weight-updates as the single weight update comprises: performing summarization.
The example method may further include, wherein performing summarization comprises clustering the two or more weight-updates.
The example method may further include, wherein performing summarization comprises combining the two or more weight-updates by using a linear combination.
The example method may further include, wherein one or more of the weight-updates are dropped or removed from a memory or a storage.
Yet another example method includes performing a prediction process, on an encoder-side, to generate a predicted weight-update based on one or more reference weight-updates and a prediction function or algorithm; generating a weight-update prediction error based on a weight-update and on the predicted weight-update; encoding the weight-update prediction error; and providing the encoded weight-update prediction error to a decoder-side; wherein the decoder-side decodes the encoded weight-update prediction error, predicts a weight-update based on one or more reference weight-updates and a prediction function or algorithm, and reconstructs a weight-update by combining the predicted weight-update and the decoded weight-update prediction error.
The example method may further include, wherein the prediction process is performed based at least on one or more of previously decoded weight-updates or at least part of a decoded content.
The example method may further include, wherein the decoded content comprises at least one of: a decoded frame that needs to be post-processed by the NN; or one or more of the previously decoded frames.
The example method may further include, wherein the prediction process comprises one or more of the following techniques: using one of the previous weight-updates as a predicted weight-update; combining one or more of the previous weight-updates by using a predetermined function; combining one or more of the previous weight-updates by using a parametric function; or using an auxiliary neural network to predict the weight-update, by using at least one of one or more of the previous weight-updates or one or more of the previously decoded content.
The example method may further include, wherein the predetermined function comprises a linear combination with predetermined coefficients.
The example method may further include, wherein the parametric function comprises a linear combination with coefficients signaled from the encoder-side to the decoder-side.
The example method may further include indicating previous weight-updates and content to use to predict the weight-update.
The example method may further include using a weight-update identifier to uniquely identify each weight-update; and signaling the weight-update identifier and the corresponding weight-update prediction error to the decoder-side.
An example computer readable medium includes program instructions for causing an apparatus to perform at least the methods as claimed in any of the claims 51 to 100.
The example computer readable medium may further include, wherein the computer readable medium comprises a non-transitory computer readable medium.
The foregoing aspects and other features are explained in the following description, taken in connection with the accompanying drawings, wherein:
The following acronyms and abbreviations that may be found in the specification and/or the drawing figures are defined as follows:
Some embodiments will now be described more fully hereinafter with reference to the accompanying drawings, in which some, but not all, embodiments of the invention are shown. Indeed, various embodiments of the invention may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will satisfy applicable legal requirements. Like reference numerals refer to like elements throughout. As used herein, the terms ‘data,’ ‘content,’ ‘information,’ and similar terms may be used interchangeably to refer to data capable of being transmitted, received and/or stored in accordance with embodiments of the present invention. Thus, use of any such terms should not be taken to limit the spirit and scope of embodiments of the present invention.
Additionally, as used herein, the term ‘circuitry’ refers to (a) hardware-only circuit implementations (e.g., implementations in analog circuitry and/or digital circuitry); (b) combinations of circuits and computer program product(s) comprising software and/or firmware instructions stored on one or more computer readable memories that work together to cause an apparatus to perform one or more functions described herein; and (c) circuits, such as, for example, a microprocessor(s) or a portion of a microprocessor(s), that require software or firmware for operation even if the software or firmware is not physically present. This definition of ‘circuitry’ applies to all uses of this term herein, including in any claims. As a further example, as used herein, the term ‘circuitry’ also includes an implementation comprising one or more processors and/or portion(s) thereof and accompanying software and/or firmware. As another example, the term ‘circuitry’ as used herein also includes, for example, a baseband integrated circuit or applications processor integrated circuit for a mobile phone or a similar integrated circuit in a server, a cellular network device, other network device, and/or other computing device.
As defined herein, a ‘computer-readable storage medium,’ which refers to a non-transitory physical storage medium (e.g., volatile or non-volatile memory device), can be differentiated from a ‘computer-readable transmission medium,’ which refers to an electromagnetic signal.
A method, apparatus and computer program product are provided in accordance with example embodiments for implementing mechanisms for finetuning at least one neural network. A method, apparatus and computer program product are provided in accordance with other example embodiments for implementing mechanisms for training or finetuning at least one neural network for encoding or decoding one or more media elements. Some examples of media elements include, but are not limited to, frames, blocks of a frame, patches, CTUs, and the like. In some embodiments, the terms patch and CTU may be used interchangeably. In some examples, the patch or the CTU may mean a portion of a video frame, such as a 2-dimensional portion (e.g. a rectangle, a square, or a portion covering an object in the video frame).
The following describes in detail suitable apparatus and possible mechanisms for training or finetuning of at least one neural network for media compression. In this regard reference is first made to
The electronic device 50 may for example be a mobile terminal or user equipment of a wireless communication system, a sensor device, a tag, or a lower power device. However, it would be appreciated that embodiments of the examples described herein may be implemented within any electronic device or apparatus which may process data by neural networks.
The apparatus 50 may comprise a housing 30 for incorporating and protecting the device. The apparatus 50 may further comprise a display 32, for example, in the form of a liquid crystal display, light emitting diode display, organic light emitting diode display, and the like. In other embodiments of the examples described herein the display may be any suitable display technology suitable to display media or multimedia content, for example, an image or a video. The apparatus 50 may further comprise a keypad 34. In other embodiments of the examples described herein any suitable data or user interface mechanism may be employed. For example, the user interface may be implemented as a virtual keyboard or data entry system as part of a touch-sensitive display.
The apparatus may comprise a microphone 36 or any suitable audio input which may be a digital or analogue signal input. The apparatus 50 may further comprise an audio output device which in embodiments of the examples described herein may be any one of: an earpiece 38, speaker, or an analogue audio or digital audio output connection. The apparatus 50 may also comprise a battery (or in other embodiments of the examples described herein the device may be powered by any suitable mobile energy device such as solar cell, fuel cell or clockwork generator). The apparatus may further comprise a camera capable of recording or capturing images and/or video. The apparatus 50 may further comprise an infrared port for short range line of sight communication to other devices. In other embodiments the apparatus 50 may further comprise any suitable short range communication solution such as for example a Bluetooth® wireless connection or a USB/firewire wired connection.
The apparatus 50 may comprise a controller 56, a processor or a processor circuitry for controlling the apparatus 50. The controller 56 may be connected to a memory 58 which in embodiments of the examples described herein may store both data in the form of an image, audio data, video data, and/or may also store instructions for implementation on the controller 56. The controller 56 may further be connected to codec circuitry 54 suitable for carrying out coding and/or decoding of audio, image, and/or video data or assisting in coding and/or decoding carried out by the controller.
The apparatus 50 may further comprise a card reader 48 and a smart card 46, for example, a UICC and UICC reader for providing user information and being suitable for providing authentication information for authentication and authorization of the user at a network.
The apparatus 50 may comprise radio interface circuitry 52 connected to the controller and suitable for generating wireless communication signals, for example, for communication with a cellular communications network, a wireless communications system or a wireless local area network. The apparatus 50 may further comprise an antenna 44 connected to the radio interface circuitry 52 for transmitting radio frequency signals generated at the radio interface circuitry 52 to other apparatus(es) and/or for receiving radio frequency signals from other apparatus(es).
The apparatus 50 may comprise a camera 42 capable of recording or detecting individual frames which are then passed to the codec 54 or the controller for processing. The apparatus may receive the video image data for processing from another device prior to transmission and/or storage. The apparatus 50 may also receive either wirelessly or by a wired connection the image for coding/decoding. The structural elements of apparatus 50 described above represent examples of means for performing a corresponding function.
With respect to
The system 10 may include both wired and wireless communication devices and/or apparatus 50 suitable for implementing embodiments of the examples described herein.
For example, the system shown in
The example communication devices shown in the system 10 may include, but are not limited to, an electronic device or apparatus 50, a combination of a personal digital assistant (PDA) and a mobile telephone 14, a PDA 16, an integrated messaging device (IMD) 18, a desktop computer 20, a notebook computer 22. The apparatus 50 may be stationary or mobile when carried by an individual who is moving. The apparatus 50 may also be located in a mode of transport including, but not limited to, a car, a truck, a taxi, a bus, a train, a boat, an airplane, a bicycle, a motorcycle or any similar suitable mode of transport.
The embodiments may also be implemented in a set-top box; for example, a digital TV receiver, which may/may not have a display or wireless capabilities, in tablets or (laptop) personal computers (PC), which have hardware and/or software to process neural network data, in various operating systems, and in chipsets, processors, DSPs and/or embedded systems offering hardware/software based coding.
Some or further apparatus may send and receive calls and messages and communicate with service providers through a wireless connection 25 to a base station 24. The base station 24 may be connected to a network server 26 that allows communication between the mobile telephone network 11 and the Internet 28. The system may include additional communication devices and communication devices of various types.
The communication devices may communicate using various transmission technologies including, but not limited to, code division multiple access (CDMA), global system for mobile communications (GSM), universal mobile telecommunications system (UMTS), time division multiple access (TDMA), frequency division multiple access (FDMA), transmission control protocol-internet protocol (TCP-IP), short messaging service (SMS), multimedia messaging service (MMS), email, instant messaging service (IMS), Bluetooth, IEEE 802.11, 3GPP Narrowband IoT and any similar wireless communication technology. A communications device involved in implementing various embodiments of the examples described herein may communicate using various media including, but not limited to, radio, infrared, laser, cable connections, and any suitable connection.
In telecommunications and data networks, a channel may refer either to a physical channel or to a logical channel. A physical channel may refer to a physical transmission medium such as a wire, whereas a logical channel may refer to a logical connection over a multiplexed medium, capable of conveying several logical channels. A channel may be used for conveying an information signal, for example a bitstream, from one or several senders (or transmitters) to one or several receivers.
The embodiments may also be implemented in so-called Internet of Things (IoT) devices. The IoT may be defined, for example, as an interconnection of uniquely identifiable embedded computing devices within the existing Internet infrastructure. The convergence of various technologies has enabled, and may further enable, many fields of embedded systems, such as wireless sensor networks, control systems, home/building automation, and the like, to be included in the IoT. In order to utilize the Internet, IoT devices are provided with an IP address as a unique identifier. IoT devices may be provided with a radio transmitter, such as a WLAN or Bluetooth® transmitter, or an RFID tag. Alternatively, IoT devices may have access to an IP-based network via a wired network, such as an Ethernet-based network or a power-line connection (PLC).
The devices/system described in
An MPEG-2 transport stream (TS), specified in ISO/IEC 13818-1 or equivalently in ITU-T Recommendation H.222.0, is a format for carrying audio, video, and other media as well as program metadata or other metadata, in a multiplexed stream. A packet identifier (PID) is used to identify an elementary stream (a.k.a. packetized elementary stream) within the TS. Hence, a logical channel within an MPEG-2 TS may be considered to correspond to a specific PID value.
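By way of a non-limiting illustration, the correspondence between a logical channel and a PID value may be sketched as follows. The sketch relies on the 188-byte packet length, the 0x47 sync byte, and the 13-bit PID field of the MPEG-2 TS packet header; the function names and the helper that builds test packets are hypothetical and are not part of any standard or library.

```python
from typing import Dict, List

TS_PACKET_SIZE = 188  # fixed MPEG-2 TS packet length in bytes
SYNC_BYTE = 0x47      # first byte of every TS packet


def packet_pid(packet: bytes) -> int:
    """Return the 13-bit PID carried in the header of one TS packet."""
    if len(packet) != TS_PACKET_SIZE or packet[0] != SYNC_BYTE:
        raise ValueError("not a valid 188-byte TS packet")
    # The PID is the low 5 bits of byte 1 followed by all 8 bits of byte 2.
    return ((packet[1] & 0x1F) << 8) | packet[2]


def demultiplex(stream: bytes) -> Dict[int, List[bytes]]:
    """Group the packets of a multiplexed stream by PID (one list per logical channel)."""
    channels: Dict[int, List[bytes]] = {}
    for offset in range(0, len(stream) - TS_PACKET_SIZE + 1, TS_PACKET_SIZE):
        packet = stream[offset:offset + TS_PACKET_SIZE]
        channels.setdefault(packet_pid(packet), []).append(packet)
    return channels


def make_packet(pid: int) -> bytes:
    """Build a zero-filled test packet with the given PID (illustrative helper only)."""
    header = bytes([SYNC_BYTE, (pid >> 8) & 0x1F, pid & 0xFF, 0x10])
    return header + bytes(TS_PACKET_SIZE - len(header))


# Three packets interleaved on two PIDs, for example video and audio.
stream = make_packet(0x100) + make_packet(0x101) + make_packet(0x100)
channels = demultiplex(stream)
```

In this sketch each key of `channels` plays the role of one logical channel within the multiplexed stream.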
Available media file format standards include ISO base media file format (ISO/IEC 14496-12, which may be abbreviated ISOBMFF) and file format for NAL unit structured video (ISO/IEC 14496-15), which derives from the ISOBMFF.
A video codec consists of an encoder that transforms the input video into a compressed representation suited for storage/transmission and a decoder that can decompress the compressed video representation back into a viewable form, or into a form that is suitable as an input to one or more algorithms for analysis or processing. A video encoder and/or a video decoder may also be separate from each other, that is, they need not form a codec. Typically, the encoder discards some information in the original video sequence in order to represent the video in a more compact form (that is, at a lower bitrate).
Typical hybrid video encoders, for example, many encoder implementations of ITU-T H.263 and H.264, encode the video information in two phases. Firstly, pixel values in a certain picture area (or ‘block’) are predicted, for example, by motion compensation means (finding and indicating an area in one of the previously coded video frames that corresponds closely to the block being coded) or by spatial means (using the pixel values around the block to be coded in a specified manner). Secondly, the prediction error, that is, the difference between the predicted block of pixels and the original block of pixels, is coded. This is typically done by transforming the difference in pixel values using a specified transform (for example, Discrete Cosine Transform (DCT) or a variant of it), quantizing the coefficients and entropy coding the quantized coefficients. By varying the fidelity of the quantization process, the encoder can control the balance between the accuracy of the pixel representation (picture quality) and the size of the resulting coded video representation (file size or transmission bitrate).
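The trade-off between quantization fidelity and coded size may be illustrated with the following minimal sketch. It is not the method of any particular codec: the transform and entropy coding stages are omitted for brevity, a uniform quantizer with a hypothetical step size is assumed, and the sample values and function names are illustrative only.

```python
def encode_block(original, predicted, step):
    """Quantize the prediction error (original minus prediction) with a uniform step."""
    return [round((o - p) / step) for o, p in zip(original, predicted)]


def decode_block(levels, predicted, step):
    """Reconstruct the block by adding the dequantized prediction error to the prediction."""
    return [p + level * step for p, level in zip(predicted, levels)]


def max_abs_error(a, b):
    """Largest per-sample difference between two blocks."""
    return max(abs(x - y) for x, y in zip(a, b))


original = [52, 55, 61, 66]    # samples of the block being coded (illustrative values)
predicted = [50, 54, 60, 70]   # prediction obtained by temporal or spatial means

# A fine quantizer keeps every residual value; a coarse one zeroes most of them.
fine_levels = encode_block(original, predicted, step=1)
coarse_levels = encode_block(original, predicted, step=4)

fine_recon = decode_block(fine_levels, predicted, step=1)
coarse_recon = decode_block(coarse_levels, predicted, step=4)

fine_nonzero = sum(1 for level in fine_levels if level != 0)
coarse_nonzero = sum(1 for level in coarse_levels if level != 0)
```

With the larger step, fewer non-zero levels remain to be entropy coded, at the cost of a reconstruction error that is bounded by half the step size per sample.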
In temporal prediction, the sources of prediction are previously decoded pictures (a.k.a. reference pictures). In intra block copy (IBC; a.k.a. intra-block-copy prediction and current picture referencing), prediction is applied similarly to temporal prediction, but the reference picture is the current picture and only previously decoded samples can be referred to in the prediction process. Inter-layer or inter-view prediction may be applied similarly to temporal prediction, but the reference picture is a decoded picture from another scalable layer or from another view, respectively. In some cases, inter prediction may refer to temporal prediction only, while in other cases inter prediction may refer collectively to temporal prediction and any of intra block copy, inter-layer prediction, and inter-view prediction provided that they are performed with the same or a similar process as temporal prediction. Inter prediction or temporal prediction may sometimes be referred to as motion compensation or motion-compensated prediction.
Inter prediction, which may also be referred to as temporal prediction, motion compensation, or motion-compensated prediction, reduces temporal redundancy. In inter prediction the sources of prediction are previously decoded pictures. Intra prediction utilizes the fact that adjacent pixels within the same picture are likely to be correlated. Intra prediction can be performed in spatial or transform domain, for example, either sample values or transform coefficients can be predicted. Intra prediction is typically exploited in intra-coding, where no inter prediction is applied.
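As one non-limiting illustration of intra prediction in the sample domain, the sketch below predicts every sample of a block from the mean of the already reconstructed samples above and to the left of it, in the manner of a DC prediction mode. The function names, the block size, and the sample values are assumptions made for this example only.

```python
def dc_predict(above, left, size):
    """Predict a size x size block as the rounded mean of its neighboring samples."""
    neighbors = list(above) + list(left)
    dc = (sum(neighbors) + len(neighbors) // 2) // len(neighbors)
    return [[dc] * size for _ in range(size)]


def residual(block, prediction):
    """Prediction error that would subsequently be transformed and quantized."""
    return [[o - p for o, p in zip(row_o, row_p)]
            for row_o, row_p in zip(block, prediction)]


above = [100, 102, 104, 106]   # reconstructed row directly above the block
left = [98, 100, 102, 104]     # reconstructed column directly left of the block
block = [
    [101, 102, 103, 105],
    [100, 102, 103, 104],
    [101, 101, 103, 104],
    [102, 102, 104, 105],
]

prediction = dc_predict(above, left, size=4)
error = residual(block, prediction)
```

Because adjacent samples tend to be correlated, the residual values in `error` are small compared with the original sample values.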
One outcome of the coding procedure is a set of coding parameters, such as motion vectors and quantized transform coefficients. Many parameters can be entropy-coded more efficiently when they are predicted first from spatially or temporally neighboring parameters. For example, a motion vector may be predicted from spatially adjacent motion vectors and only the difference relative to the motion vector predictor may be coded. Prediction of coding parameters and intra prediction may be collectively referred to as in-picture prediction.
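The motion vector example above may be sketched as follows. The component-wise median of three spatially adjacent motion vectors is used here as the predictor; this choice is an assumption made for illustration, and other predictor derivations are possible. All names and values are hypothetical.

```python
def median_predictor(neighbors):
    """Component-wise median of the neighboring motion vectors."""
    xs = sorted(mv[0] for mv in neighbors)
    ys = sorted(mv[1] for mv in neighbors)
    mid = len(neighbors) // 2
    return (xs[mid], ys[mid])


def encode_mv(mv, neighbors):
    """Return the motion vector difference that would be entropy-coded."""
    px, py = median_predictor(neighbors)
    return (mv[0] - px, mv[1] - py)


def decode_mv(mvd, neighbors):
    """Recover the motion vector from the difference and the same predictor."""
    px, py = median_predictor(neighbors)
    return (mvd[0] + px, mvd[1] + py)


neighbors = [(4, -2), (5, -1), (3, -2)]   # left, above, above-right (illustrative)
mv = (5, -2)                              # motion vector of the current block

mvd = encode_mv(mv, neighbors)
recovered = decode_mv(mvd, neighbors)
```

Since the decoder derives the same predictor from already decoded neighbors, only the small difference `mvd` needs to be conveyed.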
Depending on which encoding mode is selected to encode the current block, the output of the inter-predictor 306, 406 or the output of one of the optional intra-predictor modes or the output of a surface encoder within the mode selector is passed to the output of the mode selector 310, 410. The output of the mode selector 310, 410 is passed to a first summing device 321, 421. The first summing device may subtract the output of the pixel predictor 302, 402 from the base layer image 300/enhancement layer image 400 to produce a first prediction error signal 320, 420 which is input to the prediction error encoder 303, 403.
The pixel predictor 302, 402 further receives from a preliminary reconstructor 339, 439 the combination of the prediction representation of the image block 312, 412 and the output 338, 438 of the prediction error decoder 304, 404. The preliminary reconstructed image 314, 414 may be passed to the intra-predictor 308, 408 and to a filter 316, 416. The filter 316, 416 receiving the preliminary representation may filter the preliminary representation and output a final reconstructed image 340, 440 which may be saved in a reference frame memory 318, 418. The reference frame memory 318 may be connected to the inter-predictor 306 to be used as the reference image against which a future base layer image 300 is compared in inter-prediction operations. Subject to the base layer being selected and indicated to be source for inter-layer sample prediction and/or inter-layer motion information prediction of the enhancement layer according to some embodiments, the reference frame memory 318 may also be connected to the inter-predictor 406 to be used as the reference image against which a future enhancement layer image 400 is compared in inter-prediction operations. Moreover, the reference frame memory 418 may be connected to the inter-predictor 406 to be used as the reference image against which a future enhancement layer image 400 is compared in inter-prediction operations.
Filtering parameters from the filter 316 of the first encoder section 500 may be provided to the second encoder section 502 subject to the base layer being selected and indicated to be source for predicting the filtering parameters of the enhancement layer according to some embodiments.
The prediction error encoder 303, 403 comprises a transform unit 342, 442 and a quantizer 344, 444. The transform unit 342, 442 transforms the first prediction error signal 320, 420 to a transform domain. The transform is, for example, the DCT transform. The quantizer 344, 444 quantizes the transform domain signal, for example, the DCT coefficients, to form quantized coefficients.
The prediction error decoder 304, 404 receives the output from the prediction error encoder 303, 403 and performs the opposite processes of the prediction error encoder 303, 403 to produce a decoded prediction error signal 338, 438 which, when combined with the prediction representation of the image block 312, 412 at the second summing device 339, 439, produces the preliminary reconstructed image 314, 414. The prediction error decoder may be considered to comprise a dequantizer 346, 446, which dequantizes the quantized coefficient values, for example, DCT coefficients, to reconstruct the transform signal and an inverse transformation unit 348, 448, which performs the inverse transformation to the reconstructed transform signal wherein the output of the inverse transformation unit 348, 448 contains reconstructed block(s). The prediction error decoder may also comprise a block filter which may filter the reconstructed block(s) according to further decoded information and filter parameters.
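The complementary roles of the prediction error encoder and the prediction error decoder may be illustrated with the following sketch. It assumes a one-dimensional floating-point orthonormal DCT and a uniform quantizer, whereas practical codecs typically use two-dimensional integer approximations; the function names, step size, and sample values are hypothetical.

```python
import math


def dct(samples):
    """Orthonormal DCT-II: the role of the transform unit."""
    n = len(samples)
    return [
        math.sqrt((1 if k == 0 else 2) / n)
        * sum(v * math.cos(math.pi * (i + 0.5) * k / n) for i, v in enumerate(samples))
        for k in range(n)
    ]


def idct(coeffs):
    """Inverse of dct(): the role of the inverse transformation unit."""
    n = len(coeffs)
    return [
        sum(math.sqrt((1 if k == 0 else 2) / n) * c * math.cos(math.pi * (i + 0.5) * k / n)
            for k, c in enumerate(coeffs))
        for i in range(n)
    ]


def quantize(coeffs, step):
    """Map transform coefficients to integer levels: the role of the quantizer."""
    return [round(c / step) for c in coeffs]


def dequantize(levels, step):
    """Scale levels back to coefficient values: the role of the dequantizer."""
    return [level * step for level in levels]


prediction_error = [2, 1, 1, -4, 0, 3, -1, 0]   # illustrative first prediction error signal
step = 2.0                                       # hypothetical quantizer step size

levels = quantize(dct(prediction_error), step)   # prediction error encoder
decoded_error = idct(dequantize(levels, step))   # prediction error decoder

# Because the transform is orthonormal, the reconstruction error energy equals
# the quantization error energy in the transform domain.
error_norm = math.sqrt(sum((a - b) ** 2 for a, b in zip(prediction_error, decoded_error)))
```

Only the quantization step is lossy in this sketch; without it, the inverse transform recovers the prediction error up to floating-point precision.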
The entropy encoder 330, 430 receives the output of the prediction error encoder 303, 403 and may perform a suitable entropy encoding/variable length encoding on the signal to provide a compressed signal. The outputs of the entropy encoders 330, 430 may be inserted into a bitstream, for example, by a multiplexer 508.
The general analysis or processing algorithm may be part of the decoder 504. The decoder 504 uses a decoder or decompression algorithm, for example, to perform the neural network decoding 505 (e.g., decoding by using one or more neural networks) to decode the compressed data 512 (for example, compressed video) which was encoded by the encoder 501. The decoder 504 produces decompressed data 513 (for example, reconstructed data).
The encoder 501 and decoder 504 may be entities implementing an abstraction, may be separate entities or the same entities, or may be part of the same physical device.
An out-of-band transmission, signaling, or storage may refer to the capability of transmitting, signaling, or storing information in a manner that associates the information with a video bitstream. The out-of-band transmission may use a more reliable transmission mechanism compared to the protocols used for carrying coded video data, such as slices. The out-of-band transmission, signaling or storage can additionally or alternatively be used e.g. for ease of access or session negotiation. For example, a sample entry of a track in a file conforming to the ISO Base Media File Format may comprise parameter sets, while the coded data in the bitstream is stored elsewhere in the file or in another file. Another example of out-of-band transmission, signaling, or storage comprises including information, such as NN and/or NN updates in a file format track that is separate from track(s) containing coded video data.
The phrase along the bitstream (e.g. indicating along the bitstream) or along a coded unit of a bitstream (e.g. indicating along a coded tile) may be used in claims and described embodiments to refer to transmission, signaling, or storage in a manner that the ‘out-of-band’ data is associated with, but not included within, the bitstream or the coded unit, respectively. The phrase decoding along the bitstream or along a coded unit of a bitstream or alike may refer to decoding the referred out-of-band data (which may be obtained from out-of-band transmission, signaling, or storage) that is associated with the bitstream or the coded unit, respectively. For example, the phrase along the bitstream may be used when the bitstream is contained in a container file, such as a file conforming to the ISO Base Media File Format, and certain file metadata is stored in the file in a manner that associates the metadata to the bitstream, such as boxes in the sample entry for a track containing the bitstream, a sample group for the track containing the bitstream, or a timed metadata track associated with the track containing the bitstream. In another example, the phrase along the bitstream may be used when the bitstream is made available as a stream over a communication protocol and a media description, such as a streaming manifest, is provided to describe the stream.
An elementary unit for the output of a video encoder and the input of a video decoder, respectively, may be a network abstraction layer (NAL) unit. For transport over packet-oriented networks or storage into structured files, NAL units may be encapsulated into packets or similar structures. A bytestream format encapsulating NAL units may be used for transmission or storage environments that do not provide framing structures. The bytestream format may separate NAL units from each other by attaching a start code in front of each NAL unit. To avoid false detection of NAL unit boundaries, encoders may run a byte-oriented start code emulation prevention algorithm, which may add an emulation prevention byte to the NAL unit payload if a start code would have occurred otherwise. In order to enable straightforward gateway operation between packet and stream-oriented systems, start code emulation prevention may be performed regardless of whether the bytestream format is in use or not. A NAL unit may be defined as a syntax structure containing an indication of the type of data to follow and bytes containing that data in the form of a raw byte sequence payload interspersed as necessary with emulation prevention bytes. A raw byte sequence payload (RBSP) may be defined as a syntax structure containing an integer number of bytes that is encapsulated in a NAL unit. An RBSP is either empty or has the form of a string of data bits containing syntax elements followed by an RBSP stop bit and followed by zero or more subsequent bits equal to 0.
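A minimal sketch of byte-oriented start code emulation prevention is given below. It follows the scheme used in H.264/AVC and H.265/HEVC, in which the byte 0x03 is inserted whenever two consecutive zero bytes would otherwise be followed by a byte with a value of 0x03 or less. The function names are hypothetical, and, as a simplification, the sketch operates on the payload only and assumes that the payload ends with a non-zero byte, as a non-empty RBSP does.

```python
START_CODE = b"\x00\x00\x01"


def add_emulation_prevention(rbsp: bytes) -> bytes:
    """Insert 0x03 so that no start code pattern appears inside the payload."""
    out = bytearray()
    zeros = 0  # number of consecutive zero bytes written so far
    for byte in rbsp:
        if zeros >= 2 and byte <= 0x03:
            out.append(0x03)  # emulation prevention byte
            zeros = 0
        out.append(byte)
        zeros = zeros + 1 if byte == 0 else 0
    return bytes(out)


def remove_emulation_prevention(payload: bytes) -> bytes:
    """Recover the RBSP by dropping each 0x03 that follows two zero bytes."""
    out = bytearray()
    zeros = 0
    for byte in payload:
        if zeros >= 2 and byte == 0x03:
            zeros = 0
            continue
        out.append(byte)
        zeros = zeros + 1 if byte == 0 else 0
    return bytes(out)


def to_bytestream(payloads) -> bytes:
    """Attach a start code in front of each escaped payload."""
    return b"".join(START_CODE + add_emulation_prevention(p) for p in payloads)


rbsp = bytes([0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 3, 5, 0x80])
escaped = add_emulation_prevention(rbsp)
stream = to_bytestream([rbsp, b"\x25\x80"])
```

In the resulting stream, the three-byte start code occurs only at the two unit boundaries, so a receiver can locate the units by searching for it.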
In some coding standards, NAL units consist of a header and payload. The NAL unit header indicates the type of the NAL unit. In some coding standards, the NAL unit header indicates a scalability layer identifier (e.g. called nuh_layer_id in H.265/HEVC and H.266/VVC), which could be used e.g. for indicating spatial or quality layers, views of a multiview video, or auxiliary layers (such as depth maps or alpha planes). In some coding standards, the NAL unit header includes a temporal sublayer identifier, which may be used for indicating temporal subsets of the bitstream, such as a 30-frames-per-second subset of a 60-frames-per-second bitstream.
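For illustration, the sketch below parses the two-byte NAL unit header of H.265/HEVC, which carries a one-bit forbidden_zero_bit, a six-bit nal_unit_type, a six-bit nuh_layer_id, and a three-bit nuh_temporal_id_plus1, and then selects a temporal subset of NAL units. The class and function names are hypothetical; other coding standards use different header layouts.

```python
from typing import NamedTuple


class NalHeader(NamedTuple):
    nal_unit_type: int
    nuh_layer_id: int
    temporal_id: int


def parse_hevc_nal_header(data: bytes) -> NalHeader:
    """Decode the first two bytes of an H.265/HEVC NAL unit."""
    if len(data) < 2:
        raise ValueError("an HEVC NAL unit header is two bytes long")
    value = (data[0] << 8) | data[1]
    if value >> 15:
        raise ValueError("forbidden_zero_bit must be 0")
    temporal_id_plus1 = value & 0x07
    if temporal_id_plus1 == 0:
        raise ValueError("nuh_temporal_id_plus1 must not be 0")
    return NalHeader(
        nal_unit_type=(value >> 9) & 0x3F,
        nuh_layer_id=(value >> 3) & 0x3F,
        temporal_id=temporal_id_plus1 - 1,
    )


def extract_temporal_subset(nal_units, max_temporal_id):
    """Keep only NAL units whose temporal sublayer does not exceed the target."""
    return [nal for nal in nal_units
            if parse_hevc_nal_header(nal).temporal_id <= max_temporal_id]


# Header bytes only; payloads are omitted for brevity.
vps = b"\x40\x01"        # nal_unit_type 32 (video parameter set), sublayer 0
slice_t0 = b"\x02\x01"   # coded slice, sublayer 0
slice_t1 = b"\x02\x02"   # coded slice, sublayer 1

subset = extract_temporal_subset([vps, slice_t0, slice_t1, slice_t0], max_temporal_id=0)
```

Dropping the units of the highest sublayer in this manner corresponds to extracting, for example, a lower-frame-rate subset of the bitstream.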
NAL units may be categorized into Video Coding Layer (VCL) NAL units and non-VCL NAL units. VCL NAL units are typically coded slice NAL units.
A non-VCL NAL unit may be, for example, one of the following types: a video parameter set (VPS), a sequence parameter set (SPS), a picture parameter set (PPS), an adaptation parameter set (APS), a supplemental enhancement information (SEI) NAL unit, an access unit delimiter, an end of sequence NAL unit, an end of bitstream NAL unit, or a filler data NAL unit. Parameter sets may be needed for the reconstruction of decoded pictures, whereas many of the other non-VCL NAL units are not necessary for the reconstruction of decoded sample values.
Some coding formats specify parameter sets that may carry parameter values needed for the decoding or reconstruction of decoded pictures. A parameter may be defined as a syntax element of a parameter set. A parameter set may be defined as a syntax structure that contains parameters and that can be referred to from or activated by another syntax structure, for example, using an identifier.
Some types of parameter sets are briefly described in the following, but it is to be understood that other types of parameter sets may exist and that embodiments may be applied to, but are not limited to, the described types of parameter sets.
Parameters that remain unchanged through a coded video sequence may be included in a sequence parameter set. Alternatively, an SPS may be limited to apply to a layer that references the SPS, e.g. an SPS may remain valid for a coded layer video sequence. In addition to the parameters that may be needed by the decoding process, the sequence parameter set may optionally contain video usability information (VUI), which includes parameters that may be important for buffering, picture output timing, rendering, and resource reservation.
A picture parameter set contains such parameters that are likely to be unchanged in several coded pictures. A picture parameter set may include parameters that can be referred to by the VCL NAL units of one or more coded pictures.
A video parameter set (VPS) may be defined as a syntax structure containing syntax elements that apply to zero or more entire coded video sequences and may contain parameters applying to multiple layers. The VPS may provide information about the dependency relationships of the layers in a bitstream, as well as other information that is applicable to all slices across all layers in the entire coded video sequence.
A video parameter set RBSP may include parameters that can be referred to by one or more sequence parameter set RBSPs.
The relationship and hierarchy between a video parameter set (VPS), a sequence parameter set (SPS), and a picture parameter set (PPS) may be described as follows. A VPS resides one level above an SPS in the parameter set hierarchy and in the context of scalability. The VPS may include parameters that are common for all slices across all layers in the entire coded video sequence. The SPS includes the parameters that are common for all slices in a particular layer in the entire coded video sequence, and may be shared by multiple layers. The PPS includes the parameters that are common for all slices in a particular picture and are likely to be shared by all slices in multiple pictures.
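The referencing chain described above may be sketched as a simple data structure in which each parameter set carries its own identifier and the identifier of the parameter set one level above it. The class names and the fields (such as max_layers and init_qp) are hypothetical and do not reproduce the syntax of any coding standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vps:
    vps_id: int
    max_layers: int


@dataclass(frozen=True)
class Sps:
    sps_id: int
    vps_id: int   # reference to the VPS one level above
    width: int
    height: int


@dataclass(frozen=True)
class Pps:
    pps_id: int
    sps_id: int   # reference to the SPS one level above
    init_qp: int


class ParameterSetStore:
    """Holds received parameter sets and resolves the PPS -> SPS -> VPS chain."""

    def __init__(self):
        self.vps, self.sps, self.pps = {}, {}, {}

    def store(self, parameter_set):
        if isinstance(parameter_set, Vps):
            self.vps[parameter_set.vps_id] = parameter_set
        elif isinstance(parameter_set, Sps):
            self.sps[parameter_set.sps_id] = parameter_set
        elif isinstance(parameter_set, Pps):
            self.pps[parameter_set.pps_id] = parameter_set
        else:
            raise TypeError("unknown parameter set type")

    def resolve(self, pps_id):
        """Follow the identifiers referred to by a slice of a coded picture."""
        pps = self.pps[pps_id]
        sps = self.sps[pps.sps_id]
        vps = self.vps[sps.vps_id]
        return vps, sps, pps


store = ParameterSetStore()
store.store(Vps(vps_id=0, max_layers=1))
store.store(Sps(sps_id=0, vps_id=0, width=1920, height=1080))
store.store(Pps(pps_id=0, sps_id=0, init_qp=26))
store.store(Pps(pps_id=1, sps_id=0, init_qp=32))

# Slices of different pictures may refer to different PPSs that share one SPS.
vps_a, sps_a, pps_a = store.resolve(0)
vps_b, sps_b, pps_b = store.resolve(1)
```

In this sketch, sequence-level values are stored once in the SPS, while picture-level values may change by referring to a different PPS.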
An adaptation parameter set (APS) may be specified in some coding formats, such as H.266/VVC. An APS may be applied to one or more image segments, such as slices. In H.266/VVC, an APS may be defined as a syntax structure containing syntax elements that apply to zero or more slices as determined by zero or more syntax elements found in slice headers or in a picture header. An APS may comprise a type (aps_params_type in H.266/VVC) and an identifier (aps_adaptation_parameter_set_id in H.266/VVC). The combination of an APS type and an APS identifier may be used to identify a particular APS. H.266/VVC comprises three APS types: an adaptive loop filtering (ALF) APS, a luma mapping with chroma scaling (LMCS) APS, and a scaling list APS. The ALF APS(s) are referenced from a slice header (thus, the referenced ALF APSs can change slice by slice), and the LMCS and scaling list APS(s) are referenced from a picture header (thus, the referenced LMCS and scaling list APSs can change picture by picture). In H.266/VVC, the APS RBSP has the following syntax:
Video coding specifications may enable the use of supplemental enhancement information (SEI) messages or alike. Some video coding specifications include SEI NAL units, and some video coding specifications contain both prefix SEI NAL units and suffix SEI NAL units. A prefix SEI NAL unit can start a picture unit or alike; and a suffix SEI NAL unit can end a picture unit or alike. Hereafter, an SEI NAL unit may equivalently refer to a prefix SEI NAL unit or a suffix SEI NAL unit. An SEI NAL unit includes one or more SEI messages, which are not required for the decoding of output pictures but may assist in related processes, such as picture output timing, post-processing of decoded pictures, rendering, error detection, error concealment, and resource reservation.
Several SEI messages are specified in H.264/AVC, H.265/HEVC, H.266/VVC, and H.274/VSEI standards, and the user data SEI messages enable organizations and companies to specify SEI messages for specific use. The standards may contain the syntax and semantics for the specified SEI messages but a process for handling the messages in the recipient might not be defined. Consequently, encoders may be required to follow the standard specifying a SEI message when they create SEI message(s), and decoders might not be required to process SEI messages for output order conformance. One of the reasons to include the syntax and semantics of SEI messages in standards is to allow different system specifications to interpret the supplemental information identically and hence interoperate. It is intended that system specifications can require the use of particular SEI messages both in the encoding end and in the decoding end, and additionally the process for handling particular SEI messages in the recipient can be specified.
The method and apparatus of an example embodiment may be utilized in a wide variety of systems, including systems that rely upon the compression and decompression of media data and possibly also the associated metadata. In one embodiment, however, the method and apparatus are configured to compress the media data and associated metadata streamed from a source via a content delivery network to a client device, at which point the compressed media data and associated metadata is decompressed or otherwise processed. In this regard,
An apparatus 700 is provided in accordance with an example embodiment as shown in
The processing circuitry 702 may be in communication with the memory device 704 via a bus for passing information among components of the apparatus 700. The memory device may be non-transitory and may include, for example, one or more volatile and/or non-volatile memories. In other words, for example, the memory device may be an electronic storage device (e.g., a computer readable storage medium) comprising gates configured to store data (e.g., bits) that may be retrievable by a machine (e.g., a computing device like the processing circuitry). The memory device may be configured to store information, data, content, applications, instructions, or the like for enabling the apparatus to carry out various functions in accordance with an example embodiment of the present disclosure. For example, the memory device could be configured to buffer input data for processing by the processing circuitry. Additionally or alternatively, the memory device could be configured to store instructions for execution by the processing circuitry.
The apparatus 700 may, in some embodiments, be embodied in various computing devices as described above. However, in some embodiments, the apparatus may be embodied as a chip or chip set. In other words, the apparatus may comprise one or more physical packages (e.g., chips) including materials, components and/or wires on a structural assembly (e.g., a baseboard). The structural assembly may provide physical strength, conservation of size, and/or limitation of electrical interaction for component circuitry included thereon. The apparatus may therefore, in some cases, be configured to implement an embodiment of the present disclosure on a single chip or as a single ‘system on a chip.’ As such, in some cases, a chip or chipset may constitute means for performing one or more operations for providing the functionalities described herein.
The processing circuitry 702 may be embodied in a number of different ways. For example, the processing circuitry may be embodied as one or more of various hardware processing means such as a coprocessor, a microprocessor, a controller, a digital signal processor (DSP), a processing element with or without an accompanying DSP, or various other circuitry including integrated circuits such as, for example, an ASIC (application specific integrated circuit), an FPGA (field programmable gate array), a microcontroller unit (MCU), a hardware accelerator, a special-purpose computer chip, or the like. As such, in some embodiments, the processing circuitry may include one or more processing cores configured to perform independently. A multi-core processing circuitry may enable multiprocessing within a single physical package. Additionally or alternatively, the processing circuitry may include one or more processors configured in tandem via the bus to enable independent execution of instructions, pipelining and/or multithreading.
In an example embodiment, the processing circuitry 702 may be configured to execute instructions stored in the memory device 704 or otherwise accessible to the processing circuitry. Alternatively or additionally, the processing circuitry may be configured to execute hard coded functionality. As such, whether configured by hardware or software methods, or by a combination thereof, the processing circuitry may represent an entity (e.g., physically embodied in circuitry) capable of performing operations according to an embodiment of the present disclosure while configured accordingly. Thus, for example, when the processing circuitry is embodied as an ASIC, FPGA or the like, the processing circuitry may be specifically configured hardware for conducting the operations described herein. Alternatively, as another example, when the processing circuitry is embodied as an executor of instructions, the instructions may specifically configure the processing circuitry to perform the algorithms and/or operations described herein when the instructions are executed. However, in some cases, the processing circuitry may be a processor of a specific device (e.g., an image or video processing system) configured to employ an embodiment of the present invention by further configuration of the processing circuitry by instructions for performing the algorithms and/or operations described herein. The processing circuitry may include, among other things, a clock, an arithmetic logic unit (ALU) and logic gates configured to support operation of the processing circuitry.
The communication interface 706 may be any means such as a device or circuitry embodied in either hardware or a combination of hardware and software that is configured to receive and/or transmit data, including video bitstreams. In this regard, the communication interface may include, for example, an antenna (or multiple antennas) and supporting hardware and/or software for enabling communications with a wireless communication network. Additionally or alternatively, the communication interface may include the circuitry for interacting with the antenna(s) to cause transmission of signals via the antenna(s) or to handle receipt of signals received via the antenna(s). In some environments, the communication interface may alternatively or also support wired communication. As such, for example, the communication interface may include a communication modem and/or other hardware/software for supporting communication via cable, digital subscriber line (DSL), universal serial bus (USB) or other mechanisms.
In some embodiments, the apparatus 700 may optionally include a user interface that may, in turn, be in communication with the processing circuitry 702 to provide output to a user, such as by outputting an encoded video bitstream and, in some embodiments, to receive an indication of a user input. As such, the user interface may include a display and, in some embodiments, may also include a keyboard, a mouse, a joystick, a touch screen, touch areas, soft keys, a microphone, a speaker, or other input/output mechanisms. Alternatively or additionally, the processing circuitry may comprise user interface circuitry configured to control at least some functions of one or more user interface elements such as a display and, in some embodiments, a speaker, ringer, microphone and/or the like. The processing circuitry and/or user interface circuitry comprising the processing circuitry may be configured to control one or more functions of one or more user interface elements through computer program instructions (e.g., software and/or firmware) stored on a memory accessible to the processing circuitry (e.g., memory device, and/or the like).
A neural network (NN) is a computation graph consisting of several layers of computation. Each layer consists of one or more units, where each unit performs a computation. A unit is connected to one or more other units, and a connection may be associated with a weight. The weight may be used for scaling the signal passing through an associated connection. Weights are learnable parameters, for example, values which can be learned from training data. There may be other learnable parameters, such as those of batch-normalization layers.
A couple of examples of architectures for neural networks are feed-forward and recurrent architectures. Feed-forward neural networks are such that there is no feedback loop: each layer takes input from one or more of the previous layers and provides its output as the input for one or more of the subsequent layers. Also, units inside a certain layer take input from units in one or more of the preceding layers and provide output to units in one or more of the following layers.
Initial layers, those close to the input data, extract semantically low-level features, for example, edges and textures in images, and intermediate and final layers extract more high-level features. After the feature extraction layers, there may be one or more layers performing a certain task, for example, classification, semantic segmentation, object detection, denoising, style transfer, super-resolution, and the like. In recurrent neural networks, there is a feedback loop, so that the neural network becomes stateful, for example, it is able to memorize information or a state.
Neural networks are being utilized in an ever-increasing number of applications for many different types of devices, for example, mobile phones, chat bots, IoT devices, smart cars, voice assistants, and the like. Some of these applications include, but are not limited to, image and video analysis and processing, social media data analysis, device usage data analysis, and the like.
One of the properties of neural networks, and other machine learning tools, is that they are able to learn properties from input data, either in a supervised way or in an unsupervised way. Such learning is a result of a training algorithm, or of a meta-level neural network providing the training signal.
In general, the training algorithm consists of changing some properties of the neural network so that its output is as close as possible to a desired output. For example, in the case of classification of objects in images, the output of the neural network can be used to derive a class or category index which indicates the class or category that the object in the input image belongs to. Training usually happens by minimizing or decreasing the output error, also referred to as the loss. Examples of losses are mean squared error, cross-entropy, and the like. In recent deep learning techniques, training is an iterative process, where at each iteration the algorithm modifies the weights of the neural network to make a gradual improvement in the network's output, for example, gradually decrease the loss.
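As a non-limiting illustration, the iterative training described above may be sketched as follows. The sketch assumes a model with a single weight, a mean squared error loss, and plain gradient descent with a fixed learning rate; the names `train`, `lr`, and `steps` are hypothetical and are not taken from any embodiment.

```python
def mse_loss(w, xs, ys):
    """Mean squared error of the one-weight model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(xs, ys, w=0.0, lr=0.05, steps=200):
    """Plain gradient descent: each iteration nudges w to decrease the loss."""
    history = []
    for _ in range(steps):
        # Analytic gradient of the MSE loss with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
        history.append(mse_loss(w, xs, ys))
    return w, history


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # toy data generated with an assumed target weight of 2.0
w, history = train(xs, ys)
```

In this toy setting the recorded loss values decrease from iteration to iteration, which corresponds to the gradual improvement of the network's output mentioned above.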
Training a neural network is an optimization process, but the final goal is different from the typical goal of optimization. In optimization, the only goal is to minimize a function. In machine learning, the goal of the optimization or training process is to make the model learn the properties of the data distribution from a limited training dataset. In other words, the goal is to learn to use a limited training dataset in order to learn to generalize to previously unseen data, for example, data which was not used for training the model. This is usually referred to as generalization. In practice, data is usually split into at least two sets, the training set and the validation set. The training set is used for training the network, for example, to modify its learnable parameters in order to minimize the loss. The validation set is used for checking the performance of the network on data, which was not used to minimize the loss, as an indication of the final performance of the model. In particular, the errors on the training set and on the validation set are monitored during the training process to understand the following:
Lately, neural networks have been used for compressing and de-compressing data such as images. The most widely used architecture for such a task is the auto-encoder, which is a neural network consisting of two parts: a neural encoder and a neural decoder. In various embodiments, the neural encoder and the neural decoder are referred to as encoder and decoder, even though these refer to algorithms which are learned from data instead of being tuned manually. The encoder takes an image as an input and produces a code, to represent the input image, which requires fewer bits than the input image. This code may have been obtained by a binarization or quantization process after the encoder. The decoder takes in this code and reconstructs the image which was input to the encoder.
Such an encoder and decoder are usually trained to minimize a combination of bitrate and distortion, where the distortion may be based on one or more of the following metrics: mean squared error (MSE), peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), or the like. These distortion metrics are meant to be correlated to the human visual perception quality, so that minimizing or maximizing one or more of these distortion metrics results in improving the visual quality of the decoded image as perceived by humans.
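As a non-limiting illustration, two of the metrics named above may be computed as in the following sketch. It assumes 8-bit sample values (a peak value of 255) and flat lists of samples; the sample values themselves are made up for the example. MSE is a metric to be minimized, whereas PSNR is a metric to be maximized.

```python
import math


def mse(original, decoded):
    """Mean squared error between two equally long lists of samples."""
    return sum((a - b) ** 2 for a, b in zip(original, decoded)) / len(original)


def psnr(original, decoded, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    error = mse(original, decoded)
    if error == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / error)


original = [52, 55, 61, 66]  # hypothetical original samples
decoded = [50, 56, 60, 68]   # hypothetical decoded samples
distortion = mse(original, decoded)
quality_db = psnr(original, decoded)
```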
In various embodiments, terms ‘model’, ‘neural network’, ‘neural net’ and ‘network’ may be used interchangeably, and also the weights of neural networks may be sometimes referred to as learnable parameters or as parameters.
A video codec consists of an encoder that transforms the input video into a compressed representation suited for storage/transmission and a decoder that can decompress the compressed video representation back into a viewable form. Typically, an encoder discards some information in the original video sequence in order to represent the video in a more compact form, for example, at a lower bitrate.
Typical hybrid video codecs, for example ITU-T H.263 and H.264, encode the video information in two phases. Firstly, pixel values in a certain picture area (or ‘block’) are predicted. In an example, the pixel values may be predicted by using a motion compensation algorithm. This prediction technique includes finding and indicating an area in one of the previously coded video frames that corresponds closely to the block being coded.
In another example, the pixel values may be predicted by using spatial prediction techniques. This prediction technique uses the pixel values around the block to be coded in a specified manner. Secondly, the prediction error, for example, the difference between the predicted block of pixels and the original block of pixels, is coded. This is typically done by transforming the difference in pixel values using a specified transform, for example, discrete cosine transform (DCT) or a variant of it; quantizing the coefficients; and entropy coding the quantized coefficients. By varying the fidelity of the quantization process, the encoder can control the balance between the accuracy of the pixel representation, for example, picture quality, and the size of the resulting coded video representation, for example, file size or transmission bitrate.
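As a non-limiting illustration, the transform and quantization of a prediction error may be sketched as follows. The sketch assumes a one-dimensional residual of four samples, an orthonormal DCT-II, and a uniform quantizer with a single step size; real codecs use two-dimensional integer transforms and more elaborate quantization, and entropy coding is omitted here. The function name `code_residual` is hypothetical.

```python
import math


def _scale(k, n):
    """Normalization factor of the orthonormal DCT-II."""
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)


def dct(block):
    """Orthonormal one-dimensional DCT-II."""
    n = len(block)
    return [
        _scale(k, n) * sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                           for i, x in enumerate(block))
        for k in range(n)
    ]


def idct(coeffs):
    """Inverse of dct()."""
    n = len(coeffs)
    return [
        sum(_scale(k, n) * c * math.cos(math.pi * (i + 0.5) * k / n)
            for k, c in enumerate(coeffs))
        for i in range(n)
    ]


def code_residual(residual, step):
    """Transform, quantize, and reconstruct a residual block."""
    levels = [round(c / step) for c in dct(residual)]  # values to be entropy coded
    recon = idct([q * step for q in levels])           # decoder-side reconstruction
    return levels, recon


def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


residual = [4.0, 3.0, -2.0, 1.0]  # hypothetical prediction error samples
fine_levels, fine_recon = code_residual(residual, step=1.0)
coarse_levels, coarse_recon = code_residual(residual, step=8.0)
```

With the coarse step size all quantized levels become zero, which would cost very few bits but leaves a larger reconstruction error than the fine step size; this is the quality versus size balance controlled by the quantization fidelity.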
Inter prediction, which may also be referred to as temporal prediction, motion compensation, or motion-compensated prediction, exploits temporal redundancy. In inter prediction the sources of prediction are previously decoded pictures.
Intra prediction utilizes the fact that adjacent pixels within the same picture are likely to be correlated. Intra prediction can be performed in spatial or transform domain, for example, either sample values or transform coefficients can be predicted. Intra prediction is typically exploited in intra-coding, where no inter prediction is applied.
One outcome of the coding procedure is a set of coding parameters, such as motion vectors and quantized transform coefficients. Many parameters can be entropy-coded more efficiently when they are predicted first from spatially or temporally neighboring parameters. For example, a motion vector may be predicted from spatially adjacent motion vectors and only the difference relative to the motion vector predictor may be coded. Prediction of coding parameters and intra prediction may be collectively referred to as in-picture prediction.
The decoder reconstructs the output video by applying prediction techniques similar to those of the encoder to form a predicted representation of the pixel blocks, for example, using the motion or spatial information created by the encoder and stored in the compressed representation, and by applying prediction error decoding, which is the inverse operation of the prediction error coding and recovers the quantized prediction error signal in the spatial pixel domain. After applying prediction and prediction error decoding techniques, the decoder sums up the prediction and prediction error signals, for example, pixel values, to form the output video frame. The decoder and encoder can also apply additional filtering techniques to improve the quality of the output video before passing it for display and/or storing it as prediction reference for the forthcoming frames in the video sequence.
In typical video codecs the motion information is indicated with motion vectors associated with each motion compensated image block. Each of these motion vectors represents the displacement between the image block in the picture to be coded (at the encoder side) or decoded (at the decoder side) and the prediction source block in one of the previously coded or decoded pictures.
In order to represent motion vectors efficiently, the motion vectors are typically coded differentially with respect to block specific predicted motion vectors. In typical video codecs, the predicted motion vectors are created in a predefined way, for example, calculating the median of the encoded or decoded motion vectors of the adjacent blocks.
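As a non-limiting illustration, differential coding of a motion vector against a median predictor may be sketched as follows. The sketch assumes that the median is taken separately for the horizontal and vertical components and that three adjacent blocks are available; the neighbor values and the function names are hypothetical.

```python
from statistics import median


def predict_mv(neighbor_mvs):
    """Component-wise median of the motion vectors of adjacent blocks."""
    xs = [mv[0] for mv in neighbor_mvs]
    ys = [mv[1] for mv in neighbor_mvs]
    return (median(xs), median(ys))


def encode_mv(mv, neighbor_mvs):
    """Return the motion vector difference that would be written to the bitstream."""
    px, py = predict_mv(neighbor_mvs)
    return (mv[0] - px, mv[1] - py)


def decode_mv(mvd, neighbor_mvs):
    """Rebuild the motion vector from the difference and the same predictor."""
    px, py = predict_mv(neighbor_mvs)
    return (mvd[0] + px, mvd[1] + py)


neighbors = [(4, -2), (6, 0), (5, -1)]  # hypothetical adjacent-block vectors
mvd = encode_mv((5, 0), neighbors)
restored = decode_mv(mvd, neighbors)
```

Because the encoder and the decoder derive the predictor from the same already coded or decoded neighbors, only the typically small difference needs to be coded.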
Another way to create motion vector predictions is to generate a list of candidate predictions from adjacent blocks and/or co-located blocks in temporal reference pictures and to signal the chosen candidate as the motion vector predictor. In addition to predicting the motion vector values, the reference index of a previously coded/decoded picture can be predicted. The reference index is typically predicted from adjacent blocks and/or co-located blocks in a temporal reference picture.
Moreover, typical high efficiency video codecs employ an additional motion information coding/decoding mechanism, often called merging/merge mode, where all the motion field information, which includes a motion vector and a corresponding reference picture index for each available reference picture list, is predicted and used without any modification/correction. Similarly, predicting the motion field information is carried out using the motion field information of adjacent blocks and/or co-located blocks in temporal reference pictures, and the used motion field information is signaled among a motion field candidate list filled with the motion field information of available adjacent/co-located blocks.
In typical video codecs, the prediction residual after motion compensation is first transformed with a transform kernel, for example, DCT, and then coded. The reason for this is that often there still exists some correlation among the residual samples, and the transform can in many cases help reduce this correlation and provide more efficient coding.
Typical video encoders utilize Lagrangian cost functions to find optimal coding modes, for example, the desired macroblock mode and associated motion vectors. This kind of cost function uses a weighting factor λ to tie together the exact or estimated image distortion due to lossy coding methods and the exact or estimated amount of information that is required to represent the pixel values in an image area:
C = D + λR    (equation 1)
In equation 1, C is the Lagrangian cost to be minimized, D is the image distortion, for example, mean squared error with the mode and motion vectors considered, and R is the number of bits needed to represent the required data to reconstruct the image block in the decoder including the amount of data to represent the candidate motion vectors.
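As a non-limiting illustration, a mode decision based on the Lagrangian cost C = D + λR may be sketched as follows. The candidate modes, their distortion values, and their bit counts are made-up numbers chosen only to show how the weighting factor λ shifts the decision.

```python
def lagrangian_cost(distortion, rate_bits, lam):
    """C = D + lambda * R."""
    return distortion + lam * rate_bits


def choose_mode(candidates, lam):
    """Pick the candidate coding mode with the smallest Lagrangian cost."""
    return min(
        candidates,
        key=lambda c: lagrangian_cost(c["distortion"], c["rate_bits"], lam),
    )


# Hypothetical candidates for one block.
candidates = [
    {"mode": "skip", "distortion": 120.0, "rate_bits": 2},
    {"mode": "inter_16x16", "distortion": 40.0, "rate_bits": 30},
    {"mode": "intra_4x4", "distortion": 25.0, "rate_bits": 90},
]

low_lambda_choice = choose_mode(candidates, lam=0.1)["mode"]
mid_lambda_choice = choose_mode(candidates, lam=1.0)["mode"]
high_lambda_choice = choose_mode(candidates, lam=10.0)["mode"]
```

A small λ favors the mode with the lowest distortion, while a large λ favors the mode that needs the fewest bits.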
Video coding specifications may enable the use of supplemental enhancement information (SEI) messages or alike. Some video coding specifications include SEI NAL units, and some video coding specifications contain both prefix SEI NAL units and suffix SEI NAL units, where the former type can start a picture unit or alike and the latter type can end a picture unit or alike. An SEI NAL unit contains one or more SEI messages, which are not required for the decoding of output pictures but may assist in related processes, such as picture output timing, post-processing of decoded pictures, rendering, error detection, error concealment, and resource reservation.
Several SEI messages are specified in H.264/AVC, H.265/HEVC, H.266/VVC, and H.274/VSEI standards, and the user data SEI messages enable organizations and companies to specify SEI messages for their own use. The standards may contain the syntax and semantics for the specified SEI messages but a process for handling the messages in the recipient might not be defined. Consequently, encoders may be required to follow the standard specifying an SEI message when they create SEI message(s), and decoders might not be required to process SEI messages for output order conformance. One of the reasons to include the syntax and semantics of SEI messages in standards is to allow different system specifications to interpret the supplemental information identically and hence interoperate. It is intended that system specifications can require the use of particular SEI messages both in the encoding end and in the decoding end, and additionally the process for handling particular SEI messages in the recipient can be specified.
A design principle has been followed for SEI message specifications: the SEI messages are generally not extended in future amendments or versions of the standard.
Conventional image and video codecs use a set of filters to enhance the visual quality of the predicted visual content. These filters can be applied either in-loop or out-of-loop, or both. In the case of in-loop filters, the filter applied on one block in the currently-encoded frame will affect the encoding of another block in the same frame and/or in another frame which is predicted from the current frame. An in-loop filter can affect the bitrate and/or the visual quality. An enhanced block may cause a smaller residual (the difference between the original block and the predicted-and-filtered block), thus using fewer bits in the bitstream output by the encoder. An out-of-loop filter may be applied on a frame after it has been reconstructed; the filtered visual content may not be a source for prediction, and thus it may only impact the visual quality of the frames that are output by the decoder.
Recently, neural networks (NNs) have been used in the context of image and video compression, by following mainly two approaches.
In one approach, NNs are used to replace or are used as an addition to one or more of the components of a traditional codec such as VVC/H.266. Here ‘traditional’ means those codecs whose components and their parameters are typically not learned from data by means of a training process, for example, those codecs whose components are not neural networks. Some examples of uses of neural networks within a traditional codec include but are not limited to:
In another approach, commonly referred to as ‘end-to-end learned compression’, NNs are used as the main components of the image/video codecs. A couple of examples of this second approach are described below:
Option 1: re-use the video coding pipeline but replace most or all the components with NNs. Referring to
In order to train the neural networks of this system, a training objective function, referred to as training loss, is typically utilized, which usually comprises one or more terms, or loss terms, or simply losses. Although here the Option 2 and
The rate loss encourages the system to compress the output of the encoding stage, such as the output of the arithmetic encoder. ‘Compressing’ for example, means reducing the number of bits output by the encoding stage.
When an entropy-based lossless encoder is used, such as the arithmetic encoder, the rate loss typically encourages the output of the Encoder NN to have low entropy. The rate loss may be computed on the output of the Encoder NN, or on the output of the quantization operation, or on the output of the probability model. Following are some examples of rate losses:
For training one or more neural networks that are part of a codec, such as one or more neural networks in
For the sake of explanation, video is considered as data type in various embodiments. However, it would be understood that the embodiments are also applicable to other media items, for example, images and audio data.
It is to be understood that even in end-to-end learned approaches, there may be components which are not learned from data, such as an arithmetic codec.
Option 2 is illustrated in
On the encoding side, the encoder 1001 takes a video/image as an input 1009 and converts the video/image in original signal space into a latent representation that may comprise a more compressible representation of the input. The latent representation is normally a 3-dimensional tensor for image compression, where 2 dimensions represent spatial information and the third dimension contains information at that specific location.
Consider an example in which the input data is an image. When the input image is a 128×128×3 RGB image (with horizontal size of 128 pixels, vertical size of 128 pixels, and 3 channels for the Red, Green, Blue color components), and when the encoder downsamples the input tensor by 2 and expands the channel dimension to 32 channels, then the latent representation is a tensor of dimensions (or ‘shape’) 64×64×32 (e.g., with horizontal size of 64 elements, vertical size of 64 elements, and 32 channels). Please note that the order of the different dimensions may differ depending on the convention which is used. In some embodiments, for the input image, the channel dimension may be the first dimension, so for the above example, the shape of the input tensor may be represented as 3×128×128, instead of 128×128×3.
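As a non-limiting illustration, the shape arithmetic of the above example may be sketched as follows. The helper name `latent_shape` and its parameters are hypothetical, and the sketch assumes that the same downsampling factor is applied to both spatial dimensions.

```python
def latent_shape(size_a, size_b, downsample_factor, latent_channels,
                 channels_first=False):
    """Shape of the latent tensor for a given spatial input size."""
    a = size_a // downsample_factor
    b = size_b // downsample_factor
    if channels_first:
        return (latent_channels, a, b)
    return (a, b, latent_channels)


channels_last = latent_shape(128, 128, 2, 32)
channels_first = latent_shape(128, 128, 2, 32, channels_first=True)

input_elements = 128 * 128 * 3
latent_elements = channels_last[0] * channels_last[1] * channels_last[2]
```

In this particular example the latent tensor has more elements (131072) than the input tensor (49152), so any reduction in size has to come from the subsequent quantization and lossless coding stages rather than from the element count alone.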
In the case of an input video (instead of just an input image), another dimension in the input tensor may be used to represent temporal information.
The quantizer 1002 quantizes the latent representation into discrete values given a predefined set of quantization levels. The probability model 1003 and the arithmetic encoder 1005 work together to perform lossless compression for the quantized latent representation and generate bitstreams to be sent to the decoder side. Given a symbol to be encoded to the bitstream, the probability model 1003 estimates the probability distribution of all possible values for that symbol based on a context that is constructed from available information at the current encoding/decoding state, such as the data that has already been encoded/decoded. The arithmetic encoder 1005 encodes the input symbols into a bitstream using the estimated probability distributions.
On the decoding side, opposite operations are performed. The arithmetic decoder 1006 and the probability model 1003 first decode symbols from the bitstream to recover the quantized latent representation. Then, the dequantizer 1007 reconstructs the latent representation in continuous values and passes it to the decoder 1008 to recover the input video/image. The recovered input video/image is provided as an output 1010. Note that the probability model 1003, in this system 1000, is shared between the arithmetic encoder 1005 and arithmetic decoder 1006. In practice, this means that a copy of the probability model 1003 is used at the arithmetic encoder 1005 side, and another exact copy is used at the arithmetic decoder 1006 side.
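As a non-limiting illustration, quantization to a predefined set of levels and the bit cost implied by a probability model may be sketched as follows. The sketch assumes a fixed, context-free probability table, whereas the probability model described above is context-dependent, and it uses the common approximation that an ideal arithmetic coder spends about −log2(p) bits on a symbol of probability p. All values and function names are hypothetical.

```python
import math


def quantize(latent, levels):
    """Map each latent value to the nearest predefined quantization level."""
    return [min(levels, key=lambda q: abs(q - v)) for v in latent]


def estimated_bits(symbols, probabilities):
    """Approximate cost of arithmetic coding under the given probabilities."""
    return sum(-math.log2(probabilities[s]) for s in symbols)


latent = [0.1, -0.2, 0.9, 0.05, -1.2, 0.3]     # hypothetical latent values
levels = [-1, 0, 1]                            # predefined quantization levels
probabilities = {-1: 0.125, 0: 0.75, 1: 0.125}  # assumed static probability model

symbols = quantize(latent, levels)
bits = estimated_bits(symbols, probabilities)
```

The more probability the model assigns to the symbols that actually occur, the fewer bits the arithmetic encoder needs.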
In this system 1000, the encoder 1001, the probability model 1003, and the decoder 1008 are normally based on deep neural networks. The system 1000 is trained in an end-to-end manner by minimizing the following rate-distortion loss function, which may be referred to simply as training loss, or loss:
L = D + λR    (equation 2)
In equation 2, D is the distortion loss term, R is the rate loss term, and λ is the weight that controls the balance between the two losses.
The distortion loss term may be referred to also as reconstruction loss. It encourages the system to decode data that is similar to the input data, according to some similarity metric. Examples of reconstruction losses are:
Multiple distortion losses may be used and integrated into D.
Minimizing the rate loss encourages the system to compress the quantized latent representation so that the quantized latent representation can be represented by a smaller number of bits. The rate loss may be computed on the output of the encoder NN, or on the output of the quantization operation, or on the output of the probability model. In one example embodiment, the rate loss may comprise multiple rate losses. Examples of rate losses are the following:
A similar training loss may be used for training the systems illustrated in
For training one or more neural networks that are part of a codec, such as one or more neural networks in
In one example embodiment, the rate loss and the reconstruction loss may be minimized jointly at each iteration. In another example embodiment, the rate loss and the reconstruction loss may be minimized alternately, e.g., in one iteration the rate loss is minimized and in the next iteration the reconstruction loss is minimized, and so on. In yet another example embodiment, the rate loss and the reconstruction loss may be minimized sequentially, e.g., first one of the two losses is minimized for a certain number of iterations, and then the other loss is minimized for another number of iterations. These different ways of minimizing rate loss and reconstruction loss may also be combined.
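As a non-limiting illustration, the joint, alternating, and sequential ways of minimizing the two losses may be sketched as a schedule that selects which loss terms are active at each iteration. The strategy names, the `switch_at` parameter, and the choice of starting with the rate loss are assumptions made only for this sketch.

```python
def active_losses(strategy, iteration, switch_at=100):
    """Return the loss terms minimized at a given (0-based) iteration."""
    if strategy == "joint":
        return ("rate", "distortion")
    if strategy == "alternate":
        return ("rate",) if iteration % 2 == 0 else ("distortion",)
    if strategy == "sequential":
        return ("rate",) if iteration < switch_at else ("distortion",)
    raise ValueError(f"unknown strategy: {strategy}")


def training_loss(distortion, rate, lam, active):
    """Combine only the active terms of L = D + lambda * R."""
    total = 0.0
    if "distortion" in active:
        total += distortion
    if "rate" in active:
        total += lam * rate
    return total


joint_value = training_loss(2.0, 10.0, 0.5, active_losses("joint", 0))
rate_only_value = training_loss(2.0, 10.0, 0.5, active_losses("alternate", 0))
```

A combined schedule could, for example, apply the sequential strategy for a number of iterations and then switch to the joint strategy.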
For lossless video/image compression, the system 1000 contains the probability model 1003, the arithmetic encoder 1005 and the arithmetic decoder 1006. The system loss function contains only the rate loss, since the distortion loss is always zero, in other words, there is no loss of information.
Reducing the distortion in image and video compression is often intended to increase human perceptual quality, as humans are considered to be the end users, e.g. consuming or watching the decoded images or videos. Recently, with the advent of machine learning, especially deep learning, there is a rising number of machines (e.g., autonomous agents) that analyze or process data independently from humans and may even take decisions based on the analysis results without human intervention. Examples of such analysis are object detection, scene classification, semantic segmentation, video event detection, anomaly detection, pedestrian tracking, and the like. Example use cases and applications are self-driving cars, video surveillance cameras and public safety, smart sensor networks, smart TV and smart advertisement, person re-identification, smart traffic monitoring, drones, and the like. Accordingly, when decoded data is consumed by machines, a quality metric for the decoded data may be defined, which is different from a quality metric for human perceptual quality. Also, dedicated algorithms for compressing and decompressing data for machine consumption may be different than those for compressing and decompressing data for human consumption. The set of tools and concepts for compressing and decompressing data for machine consumption is referred to here as Video Coding for Machines.
The decoder-side device may have multiple ‘machines’ or neural networks (NNs) for analyzing or processing decoded data. These multiple machines may be used in a certain combination which is for example determined by an orchestrator sub-system. The multiple machines may be used for example in temporal succession, based on the output of the previously used machine, and/or in parallel. For example, a video which was compressed and then decompressed may be analyzed by one machine (NN) for detecting pedestrians, by another machine (another NN) for detecting cars, and by another machine (another NN) for estimating the depth of objects in the frames.
An ‘encoder-side device’ may encode input data, such as a video, into a bitstream which represents compressed data. The bitstream is provided to a ‘decoder-side device’. The term ‘receiver-side’ or ‘decoder-side’ refers to a physical or abstract entity or device which performs decoding of compressed data, and the decoded data may be input to one or more machines, circuits or algorithms.
The encoded video data may be stored into a memory device, for example as a file. The stored file may later be provided to another device.
Alternatively, the encoded video data may be streamed from one device to another.
One of the possible approaches to realize video coding for machines is an end-to-end learned approach.
The rate loss 1302 and the task loss 1310 may then be used to train 1318 the neural networks used in the system, such as a neural network encoder 1308, a probability model, and a neural network decoder 1320. Training may be performed by first computing gradients of each loss with respect to the trainable parameters of the neural networks that are contributing to or affecting the computation of that loss. The gradients are then used by an optimization method, such as Adam, for updating the trainable parameters of the neural networks.
Another possible approach to realize video coding for machines is to use a video codec which is mainly based on traditional components, that is components which are not obtained or derived by machine learning means. For example, H.266/VVC codec can be used. However, some of the components of such a codec may still be obtained or derived by machine learning means. In one example, one or more of the in-loop filters of the video codec may be a neural network. In another example, a neural network may be used as a post-processing operation (out-of-loop). A neural network filter or other type of filter may be used in-loop or out-of-loop for adapting the reconstructed or decoded frames in order to improve the performance or accuracy of one or more machine neural networks.
In some implementations, machine tasks may be performed at decoder side (instead of at encoder side). Some reasons for performing machine tasks at decoder side include, for example, that the encoder-side device may not have the capabilities (computational, power, memory, and the like) for running the neural networks that perform these tasks, or that some aspects or the performance of the task neural networks may have changed or improved by the time that the decoder-side device needs the task results (e.g., different or additional semantic classes, better neural network architecture). Also, there could be a customization need, where different clients would run different neural networks for performing these machine learning tasks.
At encoding phase, when an input content needs to be encoded (e.g., an input image or a video sequence), the encoder may decide to optimize some of the parameters of the neural network with respect to the specific input content. In proposed embodiments, the terms ‘optimize’, ‘adapt’, ‘finetune’, and ‘overfit’ the parameters may refer to the same operation, e.g., making the parameters more optimal to the input content, in order to improve the rate-distortion performance or to minimize the distortion or to minimize the rate. The parameters to be adapted may belong to one or more of the following categories of parameters:
In an embodiment, the parameters to be adapted may be a subset of one or more of the above categories of parameters. For example, they may be a subset of the trainable parameters or weights of the decoder, or a subset of the parameters of a post-processing neural network filter.
The optimization or finetuning may be performed at encoder-side, and may comprise an iterative process, where at each iteration a loss function is computed by using one or more outputs of the codec, the loss function is differentiated with respect to the parameters to be optimized in order to compute gradients (for example, one gradient for each parameter to be optimized), the computed gradients are then used for updating the parameters to be optimized, for example by using an optimizer routine such as stochastic gradient descent (SGD) or Adam. The neural network whose parameters represent the initial parameters which are then finetuned by the finetuning process, may be referred to as the base model or base neural network in some of the embodiments. The finetuning process may be performed until one or more criteria are met. One example criterion may be a predetermined number of iterations. Another example criterion may be a predetermined distortion value, a predetermined rate, or a predetermined rate-distortion performance. Yet another example criterion may be a predetermined time elapsed from the beginning of finetuning. Still another example criterion may be a loss term value or the loss function value not changing more than a predetermined amount for a predetermined number of iterations. After a neural network has been finetuned, it is possible to compute a weight-update, which may be the difference between one or more parameters of the neural network before the finetuning process and the corresponding one or more parameters of the neural network after the finetuning process.
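As a non-limiting illustration, the finetuning loop, two of the stopping criteria, and the computation of the weight-update may be sketched as follows. The sketch assumes plain gradient descent, a toy quadratic loss standing in for a content-specific loss, and a weight-update defined as the finetuned parameters minus the base parameters; the sign convention, the function name `finetune`, and all parameter names are assumptions.

```python
def finetune(base_weights, grad_fn, loss_fn, lr=0.1, max_iters=500,
             min_delta=1e-9, patience=5):
    """Finetune a copy of the base weights until a stopping criterion is met."""
    weights = list(base_weights)
    previous = loss_fn(weights)
    stalled = 0
    iteration = 0
    for iteration in range(1, max_iters + 1):  # criterion: iteration budget
        grads = grad_fn(weights)               # one gradient per parameter
        weights = [w - lr * g for w, g in zip(weights, grads)]
        current = loss_fn(weights)
        # Criterion: loss not changing by more than min_delta for
        # `patience` consecutive iterations.
        stalled = stalled + 1 if abs(previous - current) < min_delta else 0
        previous = current
        if stalled >= patience:
            break
    weight_update = [w - b for w, b in zip(weights, base_weights)]
    return weights, weight_update, iteration


# Toy loss: squared distance to parameters assumed optimal for the content.
target = [0.5, -1.0]
loss_fn = lambda w: sum((wi - ti) ** 2 for wi, ti in zip(w, target))
grad_fn = lambda w: [2 * (wi - ti) for wi, ti in zip(w, target)]

base = [0.0, 0.0]
finetuned, weight_update, iterations = finetune(base, grad_fn, loss_fn)
```

Under the assumed sign convention, adding the weight-update to the base parameters yields the finetuned parameters, so only the weight-update needs to be conveyed to a device that already holds the base neural network.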
Some examples of loss functions include, but are not limited to:
The one or more outputs from the codec, that may be used to compute the loss terms may be:
As an example, various embodiments consider the case of finetuning a post-processing filter, which is applied on the output frames from the decoder, e.g. VVC/H.266 decoder.
It may be noted that:
Various embodiments enable determining an optimal persistence scope of a certain finetuned NN, and therefore of the corresponding weight-update, with respect to rate-distortion performance, or simply with respect to distortion performance. Various embodiments also describe procedures and/or mechanisms to re-use, and optionally modify, finetuned NNs for applying them on different persistence scopes.
An embodiment proposes neural networks for different levels of temporal persistence, which may be referred to as neural network options, e.g.:
One or more of the above neural network options may be used for coding, reconstructing, and/or filtering a certain video sequence. Finetuning may be performed by using a certain base NN as the initial NN. The base NN may be any of the above mentioned neural network options. In one example, finetuning a NN for a certain frame may be performed by using a pretrained NN as the base NN. In another example, finetuning a NN for a certain frame may be performed by using a NN finetuned on the whole video sequence as the base NN, where this base NN may have been finetuned from the pretrained NN.
Information about which NN needs to be used for a certain sequence may be signaled from an encoder side to a decoder side in or along a video bitstream. For example, the information may indicate that a pretrained NN may be used for the whole video.
Another embodiment proposes to use predictive coding for the weight-updates e.g., a prediction of weight-updates may be performed at decoder-side, and a prediction error may be encoded and provided by the encoder-side to the decoder-side in or along a video bitstream. A reconstructed weight-update may be obtained at decoder-side by combining the decoded prediction error with the predicted weight-update. The prediction may be based on one or more previously decoded weight-updates, and/or based on at least part of the decoded content. In some examples, one of the previously decoded weight-updates may be re-used without further modification. In some examples, the prediction may be based also on one or more coefficients to be used as the parameters of a parametric prediction function.
In another embodiment, two or more encoded or decoded weight-updates are represented as a single weight-update, for example, in order to reduce memory complexity. In one example implementation, the weight-updates may be clustered by using a clustering algorithm such as k-means. In this implementation, the encoder side may signal to the decoder side when a clustering operation needs to be performed. In one example, the encoder side may then signal a cluster index to indicate which weight-update may be re-used for a certain frame or random access (RA) segment. An RA segment may be specified to start with a picture that enables random access, e.g. enables starting a decoding process from that picture. For example, an RA segment may start from an intra-coded picture, such as an IRAP picture in some video coding standards, or a gradual decoding refresh picture. The RA segment may, in some cases, be specified to pertain up to (but excluding) the next picture, in decoding order, that can start an RA segment. In another example, the encoder side may signal one or more cluster indexes to indicate the reference weight-updates from which to predict a new weight-update. In one example implementation, the clustering may be performed over pre-defined structures in weight updates, e.g., blocks of weight-update values, channels (matrices).
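As an illustration of the clustering described above, the following minimal sketch runs k-means over flattened weight-updates and re-uses the cluster centroid as the single representative weight-update. The deterministic initialization from the first k vectors, the fixed iteration count, and all names and values are assumptions made for the example.

```python
def squared_distance(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))


def kmeans(vectors, k, iters=20):
    """Cluster flattened weight-updates; return (centroids, cluster indexes)."""
    centroids = [list(v) for v in vectors[:k]]  # assumption: init from first k
    assign = [0] * len(vectors)
    for _ in range(iters):
        # Assign each weight-update to its nearest centroid.
        assign = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign


# Four hypothetical weight-updates, e.g. one per frame or RA segment.
weight_updates = [[0.10, -0.20], [-0.30, 0.40], [0.12, -0.18], [-0.28, 0.42]]
centroids, cluster_index = kmeans(weight_updates, k=2)

# A frame or RA segment is then served by the centroid of its cluster index.
reused_for_item_2 = centroids[cluster_index[2]]
```

Only the k centroids need to be kept in memory, and the per-frame or per-segment cluster index identifies which of them to apply.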
Yet another embodiment proposes to finetune a neural network jointly on the K1 final video frames belonging to one RA segment and on the K2 initial frames belonging to the following RA segment, where K1 and K2 are two integer numbers. Information about which finetuned NN needs to be used for each frame may be signaled from an encoder side to the decoder side, for example, as one binary flag for each frame, where the resulting set of binary flags may be compressed.
In an embodiment, a set of neural networks are finetuned for the K1 final video frames belonging to one RA segment and the K2 initial frames belonging to the following RA segment, where K1 and K2 are two integer numbers. Information about which finetuned neural network or networks are used for each frame may be signaled from an encoder side to the decoder side.
In an embodiment, a set of neural networks are generated for the video frames belonging to a first segment of frames and another set of neural networks is generated for a second segment of frames. The encoder may signal, and the decoder may decode an indication that a frame in the first segment uses a neural network or a set of neural networks generated for the second segment. This indication may be signaled or decoded for a frame in the first segment which uses a reference frame belonging to the second segment.
In another embodiment, a neural network or a set of neural networks are indicated for a first RA segment, and another neural network or set of neural networks are indicated for a second RA segment. The encoder may signal, and the decoder may decode an indication that a frame in the first RA segment uses a neural network or some set of neural networks indicated for the second RA segment. This indication may be signaled or decoded for a frame in the first RA segment which uses a reference frame belonging to the second RA segment.
In an embodiment, one or more frames of an RA segment may be processed by one of the following NNs:
In another embodiment, one or more frames of an RA segment may be processed by a NN which was obtained by combining two or more of the following:
The combination may be performed directly on the neural networks, or on the weight-updates associated to the neural networks.
The combination may be, for example, a linear combination, where the coefficients may be signaled from an encoder-side to a decoder-side in or along a video bitstream.
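A linear combination performed on weight-updates can be sketched as below. The sketch assumes that each weight-update is a flat list of equal length and that the coefficients have already been decoded from the bitstream; the function name and the sample values are hypothetical.

```python
def combine_weight_updates(weight_updates, coeffs):
    """Return sum_k coeffs[k] * weight_updates[k], element by element."""
    if len(weight_updates) != len(coeffs):
        raise ValueError("one coefficient is required per weight-update")
    length = len(weight_updates[0])
    if any(len(wu) != length for wu in weight_updates):
        raise ValueError("all weight-updates must have the same length")
    return [
        sum(c * wu[i] for c, wu in zip(coeffs, weight_updates))
        for i in range(length)
    ]


# Two hypothetical weight-updates combined with equal, signaled coefficients.
combined = combine_weight_updates([[1.0, 2.0], [3.0, 4.0]], [0.5, 0.5])
```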
In another embodiment, two different versions or portions of a NN may be obtained, and then each version or portion is finetuned for a different RA segment. For example, a version or portion of a NN may be finetuned for a certain RA segment, another version may be finetuned for the following RA segment, and this is repeated for the following pairs of RA segments. Different portions of a NN may be, for example, two different subsets of the NN. Different versions of a NN may be obtained, for example, by quantizing the weights and/or the activations of the NN by using different quantization granularities.
Various embodiments consider the examples of compressing and decompressing data. For the sake of simplicity, in various embodiments, video is considered as an example of data type. However, it should be noted that the embodiments are also applicable to other data types, e.g., image or audio data.
In some embodiments, it is assumed that an encoder-side device performs a compression or encoding operation by using an encoder. A decoder-side device performs decompression or decoding operation by using a decoder. The encoder-side device may also use some decoding operations, for example, in a coding loop. The encoder-side device and the decoder-side device may be the same physical device, or different physical devices.
In some embodiments, it is assumed that the decoder contains one or more neural networks. Some examples of such decoder side neural networks may include the following:
The pretrained NN filter 1408 is deployed into the encoder-side device and into the decoder-side device. The trained NN filter may be delivered into the encoder-side device and into the decoder-side device by any means, such as but not limited to i) pre-defining the trained NN filter in a coding standard and thus having it as an integral part of the encoder and the decoder implementation; ii) out-of-band delivery prior to encoding or decoding the video bitstream; iii) out-of-band delivery in relation to encoding or decoding the video bitstream; or iv) in-band delivery with the video bitstream to the decoder.
During the finetuning and encoding stage 1414, the NN filter (e.g., the pretrained NN filter 1408) is finetuned by using a finetuning process 1416. In particular, some of the trainable parameters of the neural network are finetuned. During the encoding stage 1414, original input data or test uncompressed frames 1420 (e.g., frames extracted from images or videos) are given as input to a non-learned codec 1422 (e.g., VTM 11 codec) to obtain video decoded frames or test reconstructed frames 1424. The original-decoded pairs of frames (e.g., the original input data frames 1420 and video decoded frames 1424) are used for updating the weights of the NN filter. The output of the finetuning process 1416 is a weight-updated or a finetuned NN filter 1418. The finetuned NN filter 1418 and the pretrained NN filter 1408 are then used in a process 1419 for computing a weight-update 1421, for example, as a difference between the finetuned parameters of the finetuned NN filter 1418 and the corresponding parameters of the pretrained NN filter 1408 prior to finetuning. The weight-update 1421 may then optionally be compressed or encoded 1425 to obtain a compressed weight-update 1426 and included into or along the bitstream 1428 together with an encoded video bitstream 1430 (e.g., VTM's encoded video bitstream) obtained from a VTM encoder 1432 (e.g., VTM 11 encoder with NN support). Alternatively, instead of encoding the weight-update 1421, the finetuned parameters of the finetuned NN filter 1418 may be encoded.
During the decoding and filtering stage 1434, the encoded video bitstream 1430 is decoded by the codec 1436 (e.g. VTM 11 decoder) to obtain decoded frames or test reconstructed frames 1438, the encoded weight-update 1426 for the post-processing NN filter is decompressed 1433 (when it was compressed), the decompressed weight-update 1435 is used for updating 1440 the corresponding parameters of the pretrained NN filter 1408, and the updated or finetuned NN filter 1441 is used to filter 1442 the decoded video frames 1438 to obtain reconstructed and filtered video or video frames 1444.
It is to be understood that one or more of the operations or blocks 1433, 1435, 1440, 1408, 1441, 1438, 1442 may be performed within a decoder with NN support. A decoder with NN support may be, for example, a VTM decoder which integrates one or more neural networks, such as NN for in-loop filtering, a NN for intra-frame prediction, a NN for inter-frame prediction, a NN representing the probability model for a lossless decoder, and the like.
It is also to be understood that, in some embodiments, the compressed weight-update 1426 may be part of the encoded video bitstream 1430.
The encoded video bitstream 1430 may include encoded signaling which may indicate to the decoder when and how to use the NN and/or the weight-update, according to some embodiments.
Further details on each of these blocks or stages are provided in the following paragraphs.
The training stage is aimed at training the learnable parameters of one or more neural networks in the encoder and in the decoder. Usually, in this stage, the learnable parameters of all neural networks in the encoder and decoder are trained.
The training process may be performed offline, e.g., before the time when the codec is deployed for compressing and decompressing data. However, after an initial training process, the codec and the neural networks in the codec may be deployed and later updated. The updating of the codec and the neural networks may occur multiple times.
The test phase is when the codec is used for compressing and decompressing data. The encoder-side device performs an optimization operation in order to obtain updated parameters for one or more decoder-side neural networks.
The optimization process (which may also be referred to as finetuning in several embodiments) may comprise computing a loss, such as a rate-distortion loss, computing gradients of the loss with respect to the one or more parameters present in one or more decoder-side neural networks, updating the one or more parameters present in the one or more decoder-side neural networks using an optimization routine such as stochastic gradient descent (SGD), and repeating these operations until a stopping criterion is satisfied. A stopping criterion may be based on a predefined number of iterations, on the value for the loss, on the value for the distortion metric, or the like. For example, the optimization may stop when the loss does not decrease more than a predetermined amount, during a predetermined temporal span.
In an additional embodiment, the optimization process may perform additional operations to make the updates to the parameters more robust to compression operations such as quantization and/or sparsification. This may comprise using an additional term in the training objective function, such as the L1 norm of the updates to the parameters.
Once the optimization process terminates, the updated parameters may be combined with the initial parameters for obtaining the updates to the parameters. For example, the initial parameters may be subtracted from the updated parameters, thus obtaining the updates to the parameters. The updates to the parameters may be referred to as weight-update in several embodiments. For this example, the decoder-side updating mechanism may comprise adding the weight-update to the initial parameters.
The updates to the parameters may undergo lossless compression, or lossy compression, or both. Lossless compression may comprise using an entropy encoder, such as an arithmetic encoder. Lossy compression may comprise applying sparsification, quantization, predictive coding with lossy compression of prediction error, and other lossy operations to the updates to the parameters. Quantization may comprise converting the updates to the parameters from floating-point 32 bits values to fixed precision 8 bits values. Sparsification may comprise setting to zero the values which are below a predetermined threshold.
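The lossy steps above, followed by the decoder-side update, can be sketched as follows. The sketch assumes that sparsification thresholds the magnitude of each value and that quantization is uniform and symmetric with a single scale factor sent alongside the 8-bit integers; both are choices made for this example rather than requirements of the embodiments, and all names and values are hypothetical. Entropy coding of the quantized values is omitted.

```python
def sparsify(update, threshold):
    """Set to zero the values whose magnitude is below the threshold."""
    return [0.0 if abs(u) < threshold else u for u in update]


def quantize_int8(update):
    """Map float values to integers in [-127, 127] using one shared scale."""
    max_abs = max((abs(u) for u in update), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return [int(round(u / scale)) for u in update], scale


def dequantize(quantized, scale):
    return [q * scale for q in quantized]


initial = [0.5, -0.2, 0.1]          # parameters before finetuning
finetuned = [0.627, -0.201, 0.036]  # parameters after finetuning

# Encoder side: weight-update = finetuned minus initial, then lossy compression.
weight_update = [f - i for f, i in zip(finetuned, initial)]
quantized, scale = quantize_int8(sparsify(weight_update, threshold=0.01))

# Decoder side: dequantize, then add the weight-update to the initial parameters.
reconstructed = [i + u for i, u in zip(initial, dequantize(quantized, scale))]
```

The reconstructed parameters differ from the finetuned ones only by the sparsification and quantization error, which here is at most the sparsification threshold.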
In an embodiment, the weight updates are encoded by using a traditional image or video encoder. For example, the weight updates may be reshaped in a way to form a rectangular image frame(s). These reshaped weight update images may then be fed to the traditional video codec, e.g., VVC/H.266, and make use of the existing coding tools such as spatial/temporal prediction tools.
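One possible way to reshape a flat list of quantized weight-update values into a rectangular frame, and to recover it after decoding, is sketched below. Row-major packing, zero padding of the last row, and the need to know the original value count at the decoder are assumptions of this example; the function names are hypothetical.

```python
import math


def to_frame(values, width):
    """Pack values row by row into a height x width frame, zero-padded."""
    height = math.ceil(len(values) / width)
    padded = list(values) + [0] * (height * width - len(values))
    return [padded[r * width:(r + 1) * width] for r in range(height)]


def from_frame(frame, count):
    """Flatten a decoded frame and drop the padding samples."""
    flat = [v for row in frame for v in row]
    return flat[:count]


values = [1, 2, 3, 4, 5]
frame = to_frame(values, width=3)
restored = from_frame(frame, count=len(values))
```

With a lossless video coding configuration the round trip is exact; with lossy coding the restored values would carry the codec's distortion.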
In an embodiment, the rectangular weight update frames may be encoded into a scalable layer of scalable video coding. For example, rectangular update frames may be assigned a layer identifier value (e.g., nuh_layer_id value in HEVC/H.265 or VVC/H.266) that is separate from a layer identifier value for conventional video content. In an embodiment, rectangular update frames may be encoded into a sequence of image segments, such as subpictures in VVC/H.266, that reside in pictures also containing conventional video content. It needs to be understood that there are similar embodiments for decoding of weight updates with a traditional image or video decoder from a video bitstream, from a layer of a video bitstream, or from a sequence of coded image segments.
The bitstream representing the updates to the parameters may be concatenated with the bitstream representing the encoded video. In an embodiment, the bitstream representing the updates to the parameters may be transmitted, signaled, or stored along the bitstream representing the encoded video. In another embodiment, the bitstream representing the updates to the parameters may be included in the bitstream representing the encoded video.
The bitstream representing the updates to the parameters may be decompressed, depending on the compression operations performed at the encoder-side device. For example, when the parameters were losslessly compressed by an arithmetic encoder, the bitstream needs to be decompressed by an arithmetic decoder.
The decompressed updates to the parameters, also referred to as updates to the parameters (or as weight-update), even when lossy compression was performed, are used to update the initial parameters. The NN with updated parameters may then be used for its task, such as for post-processing one or more decoded video frames.
This embodiment proposes to train and/or finetune neural networks based on the temporal persistence scope. The following examples are proposed:
In example 1, a temporal persistence scope of a NN may be any test video. In this example, a NN may be used for any test video. The NN may be pretrained on a training dataset, during an offline pretraining phase. The training dataset may not include the video data used at the test stage. No finetuning of the NN on a specific video or frame is performed. The base NN may be a randomly initialized NN.
In example 2, a temporal persistence scope of a NN may be one set of videos. In this example, a NN may be used for any video in the set of videos. The NN may be trained based on a base NN, by using content from the set of videos as training data. The base NN may be one of the following:
In example 3, a temporal persistence scope of a NN may be one whole video. In this example, a NN may be used for any frame or any patch in a certain video. The NN may be trained based on a base NN, by using content from this video as training data. The base NN may be one of the following:
In example 4, a temporal persistence scope of a NN may be one or more sets of consecutive video frames. In this example, a NN may be used for any frame or any patch in one or more sets of consecutive video frames in a certain video, such as one or more RA segments. The NN may be trained based on a base NN, by using content from the one or more sets of consecutive video frames as training data. The base NN may be one of the following:
In example 5, a temporal persistence scope of a NN is one or more video frames. In this example, a NN may be used for any patch of one or more video frames in a video. The NN may be trained based on a base NN, by using content from the one or more video frames as training data. The base NN may be one of the following:
In example 6, a temporal persistence scope of a NN is one or more patches from one or more video frames. In this case, a NN may be used for one or more patches from a video frame. The NN may be trained based on a base NN, by using content from the one or more patches from a video frame as training data. The base NN may be one of the following:
The encoder-side may decide, for each video and each frame, which example may be the optimal with respect to a criterion such as based on a value of a rate-distortion function. NNs from multiple examples may be used for encoding and/or decoding the same video and/or the same frame.
In one example, given an input video, the encoder-side may decide to use NN described in the example 3. In this example, the encoder-side would train a NN using content from the input video, and the trained NN is used at decoder-side for at least some of the content in the video (e.g., for some of the CTUs in the video).
In another example, given an input video, the encoder-side may decide to use a NN described in the example 3 and one or more NNs described in the example 4. In this example, for some RA segments, the decoder-side would use the example 4 NNs, and for the rest of the RA segments the example 3 NN would be used.
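The encoder-side decision between NN options can be illustrated with a simple rate-distortion comparison. The sketch assumes a Lagrangian cost J = D + lambda * R, which is one common form of a rate-distortion function; the embodiments do not mandate it. The distortion and rate figures are made-up illustrative numbers, and the option names are hypothetical labels for the example 3 NN and an example 4 NN evaluated on one RA segment.

```python
def choose_option(candidates, lam):
    """Return the option name with the lowest cost D + lam * R."""
    return min(
        candidates,
        key=lambda name: candidates[name][0] + lam * candidates[name][1],
    )


# (distortion, extra rate in bits) per option for one RA segment; hypothetical.
# The example 3 NN is already available for the whole video, so it adds no rate;
# the example 4 NN lowers distortion but needs its own weight-update.
candidates = {"example_3": (4.0, 0.0), "example_4": (3.0, 50.0)}

low_lambda_choice = choose_option(candidates, lam=0.01)   # rate is cheap
high_lambda_choice = choose_option(candidates, lam=0.05)  # rate is expensive
```

Repeating this comparison per RA segment yields the mixed use of example 3 and example 4 NNs described above.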
In the following paragraphs, some examples of the proposed signaling are described.
Signaling the NN, as Described in the Example 1, from Encoder to Decoder
The encoder may encode the topology and/or weights of the NN into the bitstream or may specify a URI from which the topology and/or weights of the NN may be obtained.
Signaling NN, as Described in the Example 2, from Encoder to Decoder
The encoder may encode the topology, weights, and/or weight-update of the NN into the bitstream, or may specify a URI from which the topology, weights, and/or weight-update of the NN may be obtained. In case a weight-update is signaled (either by encoding it into the bitstream, or by including a URI), the encoder-side may also signal an indication of which base NN to update. This indication may be a high-level syntax element, such as ‘base_nn_id’, which may take one out of a set of possible predetermined values. For example, the indicated base NN may be a NN which was pretrained on a training dataset.
Signaling One or More NNs, as Described in the Example 3, from Encoder to Decoder
The encoder may encode the topology, weights, and/or weight-update of the NN into the bitstream, or may specify a URI from which the topology, weights, and/or weight-update of the NN may be obtained. In case a weight-update is signaled (either by encoding it into the bitstream, or by including a URI), the encoder-side may also signal an indication of which base NN to update. This indication may be a high-level syntax element, such as ‘base_nn_id’, which may take one out of a set of possible predetermined values. For example, the indicated base NN may be a NN which was pretrained on a training dataset or may be a NN which was trained or finetuned on a set of videos including this video.
Signaling One or More NNs, as Described in the Example 4, from Encoder to Decoder
The encoder may encode the topology, weights, and/or weight-update of each NN into the bitstream, or may specify a URI for each NN from which the topology, weights, and/or weight-update of the NN may be obtained. In case a weight-update is signaled for one or more NNs (either by encoding it into the bitstream, or by including a URI), then the encoder-side may also signal an indication of one or more base NNs to update. This indication may be a high-level syntax element, such as one ‘base_nn_id’ element for each NN, which may take one out of a set of possible predetermined values. For example, the indicated base NN may be a NN which was pretrained on a training dataset, or may be a NN which was trained or finetuned on a set of videos including this video, or may be a NN which was trained or finetuned on this video. For each NN, the encoder may also signal one or more RA segment identifiers, which allow the decoder to apply each NN to the corresponding RA segments.
Signaling One or More NNs, as Described in the Example 5, from Encoder to Decoder
The encoder may encode the topology, weights, and/or weight-update of each NN into the bitstream, or may specify a URI for each NN from which the topology, weights, and/or weight-update of the NN may be obtained. In case a weight-update is signaled for the one or more NNs (either by encoding it into the bitstream, or by including a URI), then the encoder-side may also signal an indication of one or more base NNs to update. This indication may be a high-level syntax element, such as one ‘base_nn_id’ element for each NN, which may take one out of a set of possible predetermined values. For example, the indicated base NN may be a NN which was pretrained on a training dataset, or may be a NN which was trained or finetuned on a set of videos including this video, or may be a NN which was trained or finetuned on this video, or may be a NN which was trained or fine-tuned on one or more sets of consecutive frames. For each NN, the encoder may also signal one or more frame identifiers, which allow the decoder to apply each NN to the corresponding frames.
Signaling One or More NNs, as Described in the Example 6, from Encoder to Decoder
The encoder may encode the topology, weights, and/or weight-update of each NN into the bitstream or may specify a URI for each NN from which the topology, weights, and/or weight-update of the NN may be obtained. In case a weight-update is signaled for the one or more NNs (either by encoding it into the bitstream, or by including a URI), then the encoder-side may also signal an indication of one or more base NNs to update. This indication may be a high-level syntax element, such as one ‘base_nn_id’ element for each NN, which may take one out of a set of possible predetermined values. For example, the indicated base NN may be a NN which was pretrained on a training dataset, or may be a NN which was trained or finetuned on a set of videos including this video, or may be a NN which was trained or finetuned on this video, or may be a NN which was trained or fine-tuned on one or more sets of consecutive frames, or may be a NN which was trained or fine-tuned on one or more video frames. For each NN, the encoder may also signal one or more patch identifiers, which allows the decoder to apply each NN to the corresponding patch.
The encoder-side may signal a unique identifier for each NN, for example, as a high-level syntax element ‘nn_id’.
The encoder-side may signal, for each NN, whether the NN may be used as a base NN. This signaling may comprise a high-level syntax element, such as a ‘base_nn_flag’, associated to information about the NN itself, which when set to 1, indicates that the NN may be used as a base NN.
Signaling that Only a Single Sequence-Level NN May be Used
For the example 1 NN, the example 2 NN, or the example 3 NN, the encoder may signal that only this NN may be used for the whole video, except when indicated that no NN may be used for a certain CTU, frame, or RA segment. This signaling may be a high-level syntax element, for example, a flag ‘single_nn_only_flag’ which, when set to 1, indicates that a single NN may be used for the current video. This signaling may be performed only once for the whole video. However, the encoder may signal one flag for each CTU or for each frame or for each RA segment, indicating whether the NN may be used or not for that CTU, frame, or RA segment.
Signaling that Only RA Segment-Level NNs May be Used
The encoder may signal that only one or more of the example 4 NNs may be used for one or more sets of consecutive frames, except when indicated that no NN may be used for a certain CTU or a frame. This signaling may be a high-level syntax element, for example, a flag ‘ra_nn_only_flag’, which when set to 1, indicates that one or more NNs may be used for one or more sets of consecutive frames, and no NNs are used for the whole video or for individual frames. In an embodiment, this signaling may be performed only once for the whole video. However, the encoder may signal one flag for each CTU or for each frame, indicating whether the NN may be used or not for that CTU or frame.
Signaling that Only Frame-Level NNs May be Used
The encoder may signal that only one or more NNs may be used for one or more frames of this video, except when indicated that no NN may be used for a certain CTU. This signaling may comprise a high-level syntax element, such as ‘frame_nn_only_flag’, which when set to 1, indicates that one or more NNs may be used for one or more frames of this video, and no NNs are used for the whole video or for sets of consecutive frames. In an embodiment, this signaling may be performed only once for the whole video. However, the encoder may signal one flag for each CTU, indicating whether the NN may be used or not for that CTU.
Signaling that Only Patch-Level NNs May be Used
The encoder may signal that only one or more NNs may be used for one or more CTUs of this video, except when indicated that no NN may be used for a certain CTU. This signaling may comprise a high-level syntax element, such as ‘ctu_nn_only_flag’, which when set to 1, indicates that one or more NNs may be used for one or more CTUs of this video, and no NNs are used for the whole video, for sets of consecutive frames, or for one or more entire frames. This signaling may be performed only once for the whole video. However, the encoder may signal one flag for each CTU, indicating whether the NN may be used or not for that CTU.
Signaling that NNs from Different Examples May be Used
The encoder may signal that NNs from different examples may be used for processing different parts of the content in the video. This signaling may comprise a high-level syntax element, such as ‘multiple_nn_scopes_flag’, which when set to 1, indicates that NNs from different examples may be used for processing different parts of the content in the video. In an embodiment, this signaling may be performed only once in the whole video.
Furthermore, signaling is needed to indicate which NN may be used for each CTU, frame and RA segment. One example implementation proposes that each CTU, frame or RA segment is associated with an identifier of the NN to be applied on that CTU, frame or RA segment. The identifier, such as ‘ref_nn_id’ may take one of the predetermined values of the ‘nn_id’ element of each NN.
For example, when ‘multiple_nn_scopes_flag’ is set to 1, the encoder-side may signal one NN of example 1, one NN of example 3, one or more NNs of example 4, and one or more NNs of example 5. Then for each CTU, frame, or segment, the decoder-side may read the ‘ref_nn_id’ element and apply the corresponding NN, out of the NN of example 1, NN of example 3, NNs of example 4, or NNs of example 5.
As an alternative to the previous ‘single_nn_only_flag’, ‘ra_nn_only_flag’, ‘frame_nn_only_flag’, and ‘multiple_nn_scopes_flag’ binary flags, the encoder may signal these four modes as a single high-level syntax element ‘nn_scope’, which may take one out of four (or more) predetermined values, where the mapping between the predetermined values and their meaning is either already known by the decoder side, or is signaled from an encoder to a decoder.
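A decoder-side interpretation of a single ‘nn_scope’ element might look like the sketch below. The numeric values 0 to 3 and their assignment to the four modes are purely hypothetical; as stated above, the actual mapping is either known to the decoder in advance or signaled by the encoder.

```python
from enum import IntEnum


class NnScope(IntEnum):
    # Hypothetical value assignment, chosen only for this illustration.
    SINGLE_NN_ONLY = 0       # replaces 'single_nn_only_flag'
    RA_NN_ONLY = 1           # replaces 'ra_nn_only_flag'
    FRAME_NN_ONLY = 2        # replaces 'frame_nn_only_flag'
    MULTIPLE_NN_SCOPES = 3   # replaces 'multiple_nn_scopes_flag'


def parse_nn_scope(value):
    """Map a decoded 'nn_scope' value to a mode, rejecting unknown values."""
    try:
        return NnScope(value)
    except ValueError:
        raise ValueError(f"unknown nn_scope value: {value}") from None


mode = parse_nn_scope(1)
```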
In an embodiment, it is proposed that the encoder-side may indicate a default NN for the whole video. For example, the default NN may be the NN of the example 1, the NN of the example 2, or the NN of the example 3. Once the decoder-side is indicated of a default NN for a certain video, the decoder-side may use the default NN for all frames and/or CTUs, unless the encoder-side indicates to use another NN.
In one example implementation, for each NN that the encoder signals to the decoder, the encoder may signal a high-level syntax element, such as ‘default_NN_flag’, which when set to 1, indicates that this NN may be used as the default NN. In an embodiment, only one NN may be used as the default NN.
In another example implementation, the encoder-side may indicate a high-level syntax element, such as ‘default_nn_id’, only once for the whole video, whose value may be one of the predetermined values that ‘nn_id’ may take.
The following is an example of using a default NN. The encoder-side trains, for example, the NN of the example 3 by using content from the input video, and one NN of the example 4 on one RA segment. The encoder-side signals these two NNs to the decoder-side. The encoder-side signals or indicates to the decoder-side that the default NN is the NN of the example 3. Also, the encoder-side indicates to the decoder-side that the NN of the example 4 is to be used for one specific RA segment. The decoder-side would then apply the NN of the example 3 on all RA segments, except for the indicated RA segment. In this example, the NN of the example 4 is applied on the indicated RA segment.
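The decoder-side selection in this example can be sketched as a lookup with a fallback. The sketch assumes that each NN carries an ‘nn_id’, that ‘default_nn_id’ is signaled once for the video, and that per-segment overrides arrive as ‘ref_nn_id’ values keyed by an RA segment index; the concrete identifier values and the dictionary layout are hypothetical.

```python
def select_nn_id(segment_index, default_nn_id, ref_nn_ids):
    """Return the signaled 'ref_nn_id' for the segment, else the default NN."""
    return ref_nn_ids.get(segment_index, default_nn_id)


default_nn_id = 3    # hypothetical nn_id of the example 3 NN (whole video)
ref_nn_ids = {2: 4}  # RA segment 2 uses the NN with hypothetical nn_id 4

# NN applied to each of four RA segments.
applied = [select_nn_id(s, default_nn_id, ref_nn_ids) for s in range(4)]
```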
In this embodiment, it is proposed to use predictive coding for the weight-updates. A prediction of weight-updates is performed at decoder-side, and a prediction error may be encoded and provided by the encoder-side to the decoder-side.
The prediction may be performed also at encoder-side, in order to determine the prediction error. The prediction may be a process that takes as input one or more of the previously reconstructed weight-updates, and/or at least part of the decoded content.
In one example, a post-processing NN filter is considered as a decoder-side neural network. The decoded content that is input to the prediction process may be the decoded frame that needs to be post-processed by the NN. In another example, the decoded content that is input to the prediction process may be the decoded frame that needs to be post-processed by the NN and one or more of the previously reconstructed frames.
The prediction process may use one or more of the following modes or algorithms:
The encoder-side may indicate to the decoder-side which of the above prediction modes or algorithms needs to be used for predicting a certain weight-update. This indication may be performed by using a syntax element in the bitstream, such as a ‘wu_pred_mode’ syntax element, which may take one out of a set of predetermined values, where the mapping between the predetermined values and their meaning (e.g., which prediction mode or algorithm they refer to) is either already known by the decoder-side, or is signaled from the encoder-side to the decoder-side.
For each weight-update to be predicted, the encoder-side may indicate which previous reconstructed weight-updates to use and which decoded content to use. In order to identify the weight-updates uniquely, each weight-update may be associated to a weight-update identifier, such as by using a syntax element ‘wu_id’ in the bitstream. This identifier may be signaled from the encoder-side to the decoder-side, together with the corresponding prediction error of weight-update. The encoder-side may indicate the reference weight-updates to be used for prediction by means of a syntax element ‘ref_wu_ids’, which may be a list of unique identifiers of previously reconstructed weight-updates. The encoder-side may indicate the reference content to be used for prediction by means of a syntax element ‘ref_content_ids’, which may be a list of unique identifiers of previously decoded content, such as previously decoded patches or frames.
In case the prediction mode or algorithm is a parametric function where the parameters are signaled from an encoder-side to a decoder-side, the coefficients may be signaled by using a syntax element ‘wu_pred_coeffs’, which may be a list of coefficients to be used for predicting a weight-update from one or more previously reconstructed weight-updates.
Therefore, in one example implementation, the encoder-side may signal to the decoder-side a ‘wu_pred_mode’ syntax element indicating the weight-update prediction algorithm to use, a ‘ref_wu_ids’ syntax element indicating one or more previously reconstructed weight-updates to be used as reference weight-updates for the prediction process, optionally (based on the indicated prediction algorithm) a ‘ref_content_ids’ syntax element indicating one or more previously decoded content to be used as reference content for the prediction process, a ‘wu_id’ syntax element indicating the identifier of the current weight-update to be predicted, optionally (based on the indicated prediction algorithm) a ‘wu_pred_coeffs’ syntax element indicating the coefficients for a parametric prediction function, and an encoded prediction error.
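As a non-limiting illustration, the set of signaled syntax elements may be grouped in a single container as sketched below. The numeric values assigned to ‘wu_pred_mode’, the class name, the consistency check, and the assumption of one coefficient per reference weight-update are hypothetical and serve only to make the sketch concrete.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical 'wu_pred_mode' values; the actual mapping is either known by
# the decoder-side or signaled, and is not fixed by this sketch.
MODE_COPY_PREVIOUS = 0
MODE_PREDETERMINED_FUNCTION = 1
MODE_PARAMETRIC_FUNCTION = 2
MODE_NN_PREDICTOR = 3


@dataclass
class WeightUpdateMessage:
    wu_id: int                       # identifier of the current weight-update
    wu_pred_mode: int                # prediction mode or algorithm to use
    ref_wu_ids: List[int]            # reference weight-updates
    encoded_pred_error: bytes        # compressed prediction error
    ref_content_ids: Optional[List[int]] = None    # present only for some modes
    wu_pred_coeffs: Optional[List[float]] = None   # present only for parametric mode

    def is_consistent(self) -> bool:
        """Check that the optional elements match the indicated mode."""
        if not self.ref_wu_ids and not self.ref_content_ids:
            return False  # nothing to predict from
        if self.wu_pred_mode == MODE_PARAMETRIC_FUNCTION:
            # Assumption: one coefficient per reference weight-update.
            return (self.wu_pred_coeffs is not None
                    and len(self.wu_pred_coeffs) == len(self.ref_wu_ids))
        return True
```

A decoder-side implementation could run such a check before invoking the prediction process, so that a missing ‘wu_pred_coeffs’ element is detected early.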
The predicted weight-update is used at encoder-side for determining the prediction error. For example, the prediction error may be the difference between the weight-update and the predicted weight-update. This prediction error may then be compressed using a lossy and/or lossless compression algorithm. The compressed prediction error may then be signaled to the decoder-side. The decoder-side may decompress the compressed prediction error, and then the decompressed prediction error may be combined with the predicted weight-update, for example, by adding the decompressed prediction error to the predicted weight-update, thus obtaining a reconstructed weight-update.
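As a non-limiting illustration, the encoder-side and decoder-side steps above may be sketched as follows. The sketch assumes a parametric prediction (a weighted sum of reference weight-updates) and assumes uniform scalar quantization as the lossy compression step; both choices, and all function names, are illustrative assumptions.

```python
from typing import List


def predict(ref_updates: List[List[float]], coeffs: List[float]) -> List[float]:
    """Predict a weight-update as a weighted sum of reference weight-updates."""
    size = len(ref_updates[0])
    return [sum(c * ref[i] for c, ref in zip(coeffs, ref_updates))
            for i in range(size)]


def encode(weight_update: List[float], ref_updates: List[List[float]],
           coeffs: List[float], step: float) -> List[int]:
    """Encoder-side: compute the prediction error and quantize it."""
    predicted = predict(ref_updates, coeffs)
    error = [w - p for w, p in zip(weight_update, predicted)]
    return [round(e / step) for e in error]  # assumed lossy compression


def decode(levels: List[int], ref_updates: List[List[float]],
           coeffs: List[float], step: float) -> List[float]:
    """Decoder-side: repeat the prediction and add the decompressed error."""
    predicted = predict(ref_updates, coeffs)
    error = [q * step for q in levels]
    return [p + e for p, e in zip(predicted, error)]


# Hypothetical data: two previously reconstructed weight-updates as references.
refs = [[0.10, -0.20, 0.05], [0.12, -0.18, 0.07]]
current = [0.13, -0.17, 0.02]
levels = encode(current, refs, coeffs=[0.5, 0.5], step=0.01)
reconstructed = decode(levels, refs, coeffs=[0.5, 0.5], step=0.01)
```

Because both sides run the same prediction on the same reconstructed references, only the quantized error needs to be transmitted, and the reconstruction error is bounded by half the quantization step in this sketch.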
In this embodiment, the decoder-side may represent two or more encoded or decoded weight-updates as a single weight-update, for example, in order to reduce memory complexity at decoder-side. This may be needed, for example, when using the predictive coding embodiment, where one or more previously decoded weight-updates may be used for predicting another weight-update. The encoder-side may signal several weight-updates for a video, for example, one weight-update every RA segment of a video, which may cause the decoder-side to use substantial memory or storage for keeping the received weight-updates. Building a representation of two or more encoded or decoded previous weight-updates as a single weight-update may be referred to as a summarization of the set of previous weight-updates. This summarization may need to be performed at both the encoder-side and the decoder-side.
In one example implementation of the summarization process, two or more of the previous weight-updates may be clustered by using a clustering algorithm such as k-means. The encoder side may signal to the decoder side when a clustering operation needs to be performed. Also, the encoder-side may signal a set of input parameters for the clustering algorithm, such as the number of clusters, a random seed, and the like.
After the clustering has been performed, the encoder side may then indicate the weight-updates in terms of cluster indices. For example, the encoder side may signal one or more cluster indexes to indicate the reference weight-updates from which to predict a new weight-update.
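As a non-limiting illustration, a clustering-based summarization driven by signaled parameters may be sketched as follows. The sketch assumes a plain k-means with a fixed iteration count and assumes that both sides use an identical pseudo-random generator; a practical codec would need a bit-exact specification of that generator, which this sketch does not provide.

```python
import random
from typing import List, Tuple

Vector = List[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(updates: List[Vector], num_clusters: int, seed: int,
           iterations: int = 10) -> Tuple[List[Vector], List[int]]:
    """Cluster weight-updates; returns (centroids, cluster index per update)."""
    rng = random.Random(seed)  # the signaled random seed makes both sides agree
    centroids = [list(u) for u in rng.sample(updates, num_clusters)]
    assignments = [0] * len(updates)
    for _ in range(iterations):
        # Assign each weight-update to its nearest centroid.
        assignments = [
            min(range(num_clusters),
                key=lambda k: squared_distance(u, centroids[k]))
            for u in updates
        ]
        # Recompute each centroid as the mean of its members.
        for k in range(num_clusters):
            members = [u for u, a in zip(updates, assignments) if a == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments
```

After such a step, only the centroids would need to be stored, and a cluster index could play the role of a reference weight-update identifier.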
In another example implementation of the summarization process, one or more of the previous weight-updates may be dropped or removed from the memory or storage, or simply tagged as dropped. For example, the encoder-side may simply tag the dropped previous weight-updates as dropped, whereas the decoder-side may remove the dropped previous weight-updates from the memory or storage. The encoder-side may decide which previous weight-updates to drop or remove based on an analysis or processing operation. For example, the encoder-side may decide to drop a previous weight-update when a measure (such as the L1 norm or the L2 norm) computed on the values in that previous weight-update is less than a predetermined threshold. In another example, the encoder-side may decide to keep a predetermined number C of previous weight-updates, by first ranking all the previous weight-updates according to a measure (such as the L1 norm or the L2 norm) computed on the values of each previous weight-update and then selecting the C previous weight-updates with highest measure. Other suitable methods for dropping previous weight-updates may be utilized.
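As a non-limiting illustration, the two dropping criteria above may be sketched as follows, using the L2 norm as the measure. The function names and the dictionary keyed by weight-update identifier are assumptions of this sketch.

```python
import math
from typing import Dict, List


def l2_norm(update: List[float]) -> float:
    return math.sqrt(sum(v * v for v in update))


def drop_below_threshold(updates: Dict[int, List[float]],
                         threshold: float) -> Dict[int, List[float]]:
    """Keep only weight-updates whose L2 norm is not less than the threshold."""
    return {wu_id: u for wu_id, u in updates.items()
            if l2_norm(u) >= threshold}


def keep_top_c(updates: Dict[int, List[float]], c: int) -> Dict[int, List[float]]:
    """Keep the C weight-updates with the highest L2 norm."""
    ranked = sorted(updates, key=lambda wu_id: l2_norm(updates[wu_id]),
                    reverse=True)
    kept = set(ranked[:c])
    return {wu_id: u for wu_id, u in updates.items() if wu_id in kept}
```

Replacing `l2_norm` with an L1 norm, or with another measure, would not change the structure of either function.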
In another example implementation of the summarization process, two or more of the previous weight-updates may be combined by linear combination. The coefficients for the linear combination may be predetermined or may be signaled from encoder-side to decoder-side. The encoder-side may signal to the decoder-side which weight-updates need to be combined, for example, by means of a high-level syntax element ‘wu_comb_ids’ which may be a list of identifiers of weight-updates. The encoder may signal to the decoder-side the coefficients for linearly combining the previous weight-updates by means of a high-level syntax element ‘wu_comb_coeffs’ which may be a list of coefficients.
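As a non-limiting illustration, a summarization by linear combination may be sketched as follows. The function name and the ‘new_wu_id’ argument, which identifies the combined weight-update, are assumptions of this sketch; the source does not specify how the combined weight-update is identified.

```python
from typing import Dict, List


def summarize_by_linear_combination(stored: Dict[int, List[float]],
                                    wu_comb_ids: List[int],
                                    wu_comb_coeffs: List[float],
                                    new_wu_id: int) -> Dict[int, List[float]]:
    """Replace the listed weight-updates with one linearly combined update."""
    size = len(stored[wu_comb_ids[0]])
    combined = [
        sum(c * stored[wu_id][i] for c, wu_id in zip(wu_comb_coeffs, wu_comb_ids))
        for i in range(size)
    ]
    # Weight-updates that were combined are no longer kept individually.
    remaining = {wu_id: u for wu_id, u in stored.items()
                 if wu_id not in wu_comb_ids}
    remaining[new_wu_id] = combined
    return remaining
```

Running the same function with the same ‘wu_comb_ids’ and ‘wu_comb_coeffs’ at the encoder-side and at the decoder-side keeps the two sets of stored weight-updates identical.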
In this embodiment, it is proposed to finetune a neural network jointly on the K1 final video frames belonging to one Random Access (RA) segment and on the K2 initial frames belonging to the following RA segment. In one example, the K1 video frames are all the video frames in one RA segment, and K2 are the first few video frames in the next RA segment.
Information about which finetuned NN needs to be used for each frame may be signaled from encoder side to decoder side, for example, as one binary flag for each frame, which may be compressed by lossless coding.
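As a non-limiting illustration, the selection of frames for the joint finetuning may be sketched as follows. Representing each RA segment as a list of frame indices, and the function name, are assumptions of this sketch.

```python
from typing import List


def joint_finetuning_frames(segments: List[List[int]], index: int,
                            k1: int, k2: int) -> List[int]:
    """Return the K1 final frames of one RA segment and the K2 initial
    frames of the following RA segment."""
    current = segments[index]
    following = segments[index + 1]
    # Slice from an explicit start so that k1 == 0 selects no frame.
    final_frames = current[len(current) - k1:]
    initial_frames = following[:k2]
    return final_frames + initial_frames


# Hypothetical video with two RA segments of eight frames each.
segments = [list(range(0, 8)), list(range(8, 16))]
frames_all_of_first = joint_finetuning_frames(segments, 0, k1=8, k2=2)
frames_tail_only = joint_finetuning_frames(segments, 0, k1=3, k2=2)
```

The first call corresponds to the example in which the K1 frames are all the frames of one RA segment and the K2 frames are the first few frames of the next RA segment.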
Using NN from a Different RA Segment
In this embodiment, one or more frames of an RA segment may be processed by one of the following NNs:
In this embodiment, one or more frames of an RA segment may be processed by a NN which was obtained by combining two or more of the following:
In this embodiment, the combination may be performed directly on the neural networks, or on the weight-updates associated to the neural networks. The combination may be, for example, an average of the weight values or of the weight-update values, or it can be a linear combination where the coefficients may be predetermined or signaled from encoder-side to decoder-side. The coefficients may be determined by the encoder-side, for example, by optimizing them by using gradient descent for computing gradients of an objective function, such as a rate-distortion loss or a distortion loss, and then using the gradients for updating the coefficients.
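As a non-limiting illustration, the optimization of combination coefficients by gradient descent may be sketched as follows. The sketch assumes a toy single-layer linear model and a mean squared error as the distortion loss, with a hand-derived gradient; a practical NN filter and a rate-distortion loss would instead rely on automatic differentiation. All names are assumptions of this sketch.

```python
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def combine(weight_sets: List[List[float]], coeffs: List[float]) -> List[float]:
    """Linear combination of the weights of several NNs."""
    size = len(weight_sets[0])
    return [sum(c * w[i] for c, w in zip(coeffs, weight_sets))
            for i in range(size)]


def optimize_coeffs(weight_sets: List[List[float]], inputs: List[List[float]],
                    targets: List[float], steps: int = 200,
                    lr: float = 0.1) -> List[float]:
    """Gradient descent on the coefficients, minimizing a mean squared error."""
    n = len(weight_sets)
    coeffs = [1.0 / n] * n  # start from a plain average
    for _ in range(steps):
        combined = combine(weight_sets, coeffs)
        residuals = [dot(combined, x) - t for x, t in zip(inputs, targets)]
        # d(loss)/d(coeff_j) = mean of 2 * residual * (w_j . x)
        grads = [
            2.0 * sum(r * dot(w, x) for r, x in zip(residuals, inputs)) / len(inputs)
            for w in weight_sets
        ]
        coeffs = [c - lr * g for c, g in zip(coeffs, grads)]
    return coeffs
```

The resulting coefficients could then be signaled from the encoder-side to the decoder-side, which would apply `combine` with the same weights.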
In another embodiment, the combination may happen adaptively, that is coefficients for combining the NNs or their weight updates may change for different RA segments according to the content in the RA segments.
In this embodiment, two different versions or portions of a NN may be obtained, and then each version or portion is finetuned for a different RA segment. For example, a version or portion of a NN may be finetuned for a certain RA segment, another version or portion of a NN may be finetuned for the following RA segment, and this is repeated again for the following pairs of RA segments. Different portions of a NN may be, for example, two different subsets of the NN. In one example, one subset may be the initial layers of the NN, and another subset may be the final layers of the NN. In another example, the NN architecture comprises a common initial set of layers, followed by two distinct sets of layers (e.g. branches); one branch may be finetuned on one RA segment, and another branch may be finetuned on another RA segment.
Different versions of a NN may be obtained for example by quantizing the weights and/or the activations of the NN by using different quantization granularities. For example, one NN version is obtained by quantizing the weights to 8 bits and another NN version is obtained by quantizing the weights to 16 bits.
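As a non-limiting illustration, obtaining two versions of a NN with different quantization granularities may be sketched as follows. The sketch assumes uniform symmetric quantization with one scale per set of weights; the source does not prescribe a particular quantization scheme, and the function names are assumptions.

```python
from typing import List, Tuple


def quantize_weights(weights: List[float], num_bits: int) -> Tuple[List[int], float]:
    """Quantize weights to signed integers of num_bits; returns (levels, scale)."""
    max_level = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid a zero scale
    scale = max_abs / max_level
    levels = [round(w / scale) for w in weights]
    return levels, scale


def dequantize_weights(levels: List[int], scale: float) -> List[float]:
    return [q * scale for q in levels]


# Hypothetical weights of one layer.
weights = [0.5, -0.25, 0.1234]
levels_8, scale_8 = quantize_weights(weights, 8)     # first NN version
levels_16, scale_16 = quantize_weights(weights, 16)  # second NN version
```

Each version could then be finetuned for a different RA segment, as described above.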
In another embodiment, different weight update(s) may be determined separately for each channel of the image/video content. For example, separate weight updates may be sent for luma and chroma components of the content. In another example, two weight updates may be sent in which one is used for luma (Y channel in YUV color space) and a second weight update is used for both chroma channels (U and V in YUV color space). The choice of signaling channel-wise weight update(s) may follow the same principles as described in the above embodiments.
In another embodiment, the signaling of channel-wise weight update(s) may be done based on a rate-distortion optimization process. The encoder may use a single weight update for all channels or use different weight updates for different channels in different RA intervals. A high-level syntax flag may be signaled to the decoder in order to indicate the type of weight update that is used for each channel. This high-level syntax signaling may be done once for a certain RA segment, or may be done at a picture level, a CTU level, or a CU level.
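As a non-limiting illustration, a rate-distortion decision between a single weight update and channel-wise weight updates may be sketched as follows. The Lagrangian cost D + λ·R is a common formulation and is assumed here; the flag values (0 for a single weight update, 1 for channel-wise weight updates) and the function names are hypothetical.

```python
from typing import Tuple


def rd_cost(distortion: float, rate_bits: float, lam: float) -> float:
    """Lagrangian rate-distortion cost."""
    return distortion + lam * rate_bits


def choose_weight_update_type(shared: Tuple[float, float],
                              channel_wise: Tuple[float, float],
                              lam: float) -> int:
    """Each candidate is a (distortion, rate_bits) pair measured by the encoder.

    Returns a hypothetical flag: 0 for a single weight update shared by all
    channels, 1 for separate channel-wise weight updates.
    """
    cost_shared = rd_cost(shared[0], shared[1], lam)
    cost_channel_wise = rd_cost(channel_wise[0], channel_wise[1], lam)
    return 1 if cost_channel_wise < cost_shared else 0
```

With a small λ the lower distortion of channel-wise weight updates tends to win, whereas with a larger λ their additional rate tends to favor a single weight update.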
Various embodiments for signaling an NN and/or weight update(s) may be realized by including the NN and/or weight update(s) in a parameter set, such as an APS, where the type of an APS may indicate that it includes an NN and/or weight update(s). A parameter set may include a parameter set identifier, which may, for example, be an unsigned integer value. When a parameter set with a particular parameter set identifier value includes weight update(s), it may update the previous parameter set of the same type and of the same parameter set identifier value.
Various embodiments for signaling an NN and/or weight update(s) may be realized by including the NN and/or weight update(s) in an SEI message, where the type of an SEI message may indicate that it contains an NN and/or weight update(s). An SEI message may comprise an identifier, which may, for example, be an unsigned integer value. When an SEI message with a particular identifier value comprises weight update(s), it may update the previous SEI message of the same type and of the same identifier value.
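As a non-limiting illustration, the decoder-side bookkeeping for NNs and weight updates carried in parameter sets or SEI messages may be sketched as follows. The sketch assumes that a weight update is applied additively to the previously received content having the same type and the same identifier value; the class name, the method names, and the type label are assumptions of this sketch.

```python
from typing import Dict, List, Tuple


class NNParameterStore:
    """Keeps the latest NN weights per (type, identifier) pair."""

    def __init__(self) -> None:
        self._weights: Dict[Tuple[str, int], List[float]] = {}

    def receive_nn(self, unit_type: str, identifier: int,
                   weights: List[float]) -> None:
        # A full NN replaces whatever was stored under the same key.
        self._weights[(unit_type, identifier)] = list(weights)

    def receive_weight_update(self, unit_type: str, identifier: int,
                              update: List[float]) -> None:
        key = (unit_type, identifier)
        if key not in self._weights:
            raise KeyError(f"no previous content for {key}")
        # Assumption: the weight update is added to the stored weights.
        self._weights[key] = [w + u for w, u in zip(self._weights[key], update)]

    def get(self, unit_type: str, identifier: int) -> List[float]:
        return list(self._weights[(unit_type, identifier)])


store = NNParameterStore()
store.receive_nn("NN_APS", 0, [1.0, 2.0])              # hypothetical type label
store.receive_weight_update("NN_APS", 0, [0.5, -1.0])
current_weights = store.get("NN_APS", 0)
```

The same structure applies whether the carrier is a parameter set or an SEI message, since in both cases the type and the identifier value together select the content to be updated.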
In another embodiment, for a certain data unit, such as a CTU, a frame, a RA segment, a video, and the like, the decoder may decide which weight update or filter to use based on an analysis of previous decoded frames or CTUs. This may be done also based on some texture analysis on the reconstructed samples in the decoder side.
The apparatus 1500 optionally includes a display 1508 that may be used to display content during rendering. The apparatus 1500 optionally includes one or more network (NW) interfaces (I/F(s)) 1510. The NW I/F(s) 1510 may be wired and/or wireless and communicate over the Internet/other network(s) via any communication technique. The NW I/F(s) 1510 may comprise one or more transmitters and one or more receivers. The NW I/F(s) 1510 may comprise standard well-known components such as an amplifier, filter, frequency-converter, (de)modulator, and encoder/decoder circuitry(ies) and one or more antennas.
The apparatus 1500 may be a remote, virtual or cloud apparatus. The apparatus 1500 may be either a coder or a decoder, or both a coder and a decoder. The at least one memory 1504 may be implemented using any suitable data storage technology, such as semiconductor based memory devices, flash memory, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The at least one memory 1504 may comprise a database for storing data. The apparatus 1500 need not comprise each of the features mentioned, or may comprise other features as well. The apparatus 1500 may correspond to or be another embodiment of the apparatus 50 shown in
In an embodiment, the temporal persistence scope includes: a test video, and wherein the at least one NN is used to encode or decode all frames of the test video; a first set of videos, and wherein the at least one NN is used to encode or decode all frames of a video in the first set of videos; a first video, and wherein the at least one NN is used to encode or decode all frames of the first video; one or more sets of consecutive video frames from a second video, and wherein the at least one NN is used to encode or decode all frames in the one or more sets of consecutive video frames from the second video; one or more video frames from a third video, and wherein, the at least one NN is used to encode or decode the one or more video frames from the third video; or one or more patches from one or more video frames, and wherein the at least one NN is used to encode or decode the one or more patches from a video frame of the one or more video frames from a fourth video.
In an embodiment, some examples of the at least one NN include, but are not limited to, a NN randomly initialized by using a specified random seed; a pretrained NN for videos; a NN finetuned on one whole video sequence; a NN finetuned on one or more sets of consecutive frames of one video sequence; a NN finetuned on one or more frames of one video sequence; and/or a NN finetuned on one or more patches of one frame.
In another embodiment, an example of the at least one NN includes a decoder-side NN. Some examples of the decoder-side NN include, but are not limited to:
In one example embodiment, the weight-update prediction error may be first compressed by the encoder-side and then provided to the decoder-side in the compressed form. In this embodiment, the decoder first decompresses the weight-update prediction error and uses it for the subsequent steps.
In an embodiment, the prediction process includes one or more of the following techniques: use one of the previous weight-updates as a predicted weight-update; combine one or more of the previous weight-updates by using a predetermined function; combine one or more of the previous weight-updates by using a parametric function; or use the neural network to predict the weight-update, by using at least one of one or more of the previous weight-updates or one or more of the previously decoded content.
Referring to
The RAN node 170 in this example is a base station that provides access by wireless devices such as the UE 110 to the wireless network 100. The RAN node 170 may be, for example, a base station for 5G, also called New Radio (NR). In 5G, the RAN node 170 may be a NG-RAN node, which is defined as either a gNB or an ng-eNB. A gNB is a node providing NR user plane and control plane protocol terminations towards the UE, and connected via the NG interface to a 5GC (such as, for example, the network element(s) 190). The ng-eNB is a node providing E-UTRA user plane and control plane protocol terminations towards the UE, and connected via the NG interface to the 5GC. The NG-RAN node may include multiple gNBs, which may also include a central unit (CU) (gNB-CU) 196 and distributed unit(s) (DUs) (gNB-DUs), of which DU 195 is shown. Note that the DU may include or be coupled to and control a radio unit (RU). The gNB-CU is a logical node hosting radio resource control (RRC), SDAP and PDCP protocols of the gNB or RRC and PDCP protocols of the en-gNB that controls the operation of one or more gNB-DUs. The gNB-CU terminates the F1 interface connected with the gNB-DU. The F1 interface is illustrated as reference 198, although reference 198 also illustrates a link between remote elements of the RAN node 170 and centralized elements of the RAN node 170, such as between the gNB-CU 196 and the gNB-DU 195. The gNB-DU is a logical node hosting RLC, MAC and PHY layers of the gNB or en-gNB, and its operation is partly controlled by gNB-CU. One gNB-CU supports one or multiple cells. One cell is supported by only one gNB-DU. The gNB-DU terminates the F1 interface 198 connected with the gNB-CU. Note that the DU 195 is considered to include the transceiver 160, for example, as part of a RU, but some examples of this may have the transceiver 160 as part of a separate RU, for example, under control of and connected to the DU 195. 
The RAN node 170 may also be an eNB (evolved NodeB) base station, for LTE (long term evolution), or any other suitable base station or node.
The RAN node 170 includes one or more processors 152, one or more memories 155, one or more network interfaces (N/W I/F(s)) 161, and one or more transceivers 160 interconnected through one or more buses 157. Each of the one or more transceivers 160 includes a receiver, Rx, 162 and a transmitter, Tx, 163. The one or more transceivers 160 are connected to one or more antennas 158. The one or more memories 155 include computer program code 153. The CU 196 may include the processor(s) 152, memories 155, and network interfaces 161. Note that the DU 195 may also contain its own memory/memories and processor(s), and/or other hardware, but these are not shown.
The RAN node 170 includes a module 150, comprising one of or both parts 150-1 and/or 150-2, which may be implemented in a number of ways. The module 150 may be implemented in hardware as module 150-1, such as being implemented as part of the one or more processors 152. The module 150-1 may be implemented also as an integrated circuit or through other hardware such as a programmable gate array. In another example, the module 150 may be implemented as module 150-2, which is implemented as computer program code 153 and is executed by the one or more processors 152. For instance, the one or more memories 155 and the computer program code 153 are configured to, with the one or more processors 152, cause the RAN node 170 to perform one or more of the operations as described herein. Note that the functionality of the module 150 may be distributed, such as being distributed between the DU 195 and the CU 196, or be implemented solely in the DU 195.
The one or more network interfaces 161 communicate over a network such as via the links 176 and 131. Two or more gNBs 170 may communicate using, for example, link 176. The link 176 may be wired or wireless or both and may implement, for example, an Xn interface for 5G, an X2 interface for LTE, or other suitable interface for other standards.
The one or more buses 157 may be address, data, or control buses, and may include any interconnection mechanism, such as a series of lines on a motherboard or integrated circuit, fiber optics or other optical communication equipment, wireless channels, and the like. For example, the one or more transceivers 160 may be implemented as a remote radio head (RRH) 195 for LTE or a distributed unit (DU) 195 for gNB implementation for 5G, with the other elements of the RAN node 170 possibly being physically in a different location from the RRH/DU, and the one or more buses 157 could be implemented in part as, for example, fiber optic cable or other suitable network connection to connect the other elements (for example, a central unit (CU), gNB-CU) of the RAN node 170 to the RRH/DU 195. Reference 198 also indicates those suitable network link(s).
It is noted that the description herein indicates that ‘cells’ perform functions, but it should be clear that equipment which forms the cell may perform the functions. The cell makes up part of a base station. That is, there can be multiple cells per base station. For example, there could be three cells for a single carrier frequency and associated bandwidth, each cell covering one-third of a 360 degree area so that the single base station's coverage area covers an approximate oval or circle. Furthermore, each cell can correspond to a single carrier and a base station may use multiple carriers. So when there are three 120 degree cells per carrier and two carriers, then the base station has a total of 6 cells.
The wireless network 100 may include a network element or elements 190 that may include core network functionality, and which provides connectivity via a link or links 181 with a further network, such as a telephone network and/or a data communications network (for example, the Internet). Such core network functionality for 5G may include access and mobility management function(s) (AMF(s)) and/or user plane functions (UPF(s)) and/or session management function(s) (SMF(s)). Such core network functionality for LTE may include MME (Mobility Management Entity)/SGW (Serving Gateway) functionality. These are merely example functions that may be supported by the network element(s) 190, and note that both 5G and LTE functions might be supported. The RAN node 170 is coupled via a link 131 to the network element 190. The link 131 may be implemented as, for example, an NG interface for 5G, or an S1 interface for LTE, or other suitable interface for other standards. The network element 190 includes one or more processors 175, one or more memories 171, and one or more network interfaces (N/W I/F(s)) 180, interconnected through one or more buses 185. The one or more memories 171 include computer program code 173. The one or more memories 171 and the computer program code 173 are configured to, with the one or more processors 175, cause the network element 190 to perform one or more operations.
The wireless network 100 may implement network virtualization, which is the process of combining hardware and software network resources and network functionality into a single, software-based administrative entity, a virtual network. Network virtualization involves platform virtualization, often combined with resource virtualization. Network virtualization is categorized as either external, combining many networks, or parts of networks, into a virtual unit, or internal, providing network-like functionality to software containers on a single system. Note that the virtualized entities that result from the network virtualization are still implemented, at some level, using hardware such as processors 152 or 175 and memories 155 and 171, and also such virtualized entities create technical effects.
The computer readable memories 125, 155, and 171 may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor based memory devices, flash memory, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The computer readable memories 125, 155, and 171 may be means for performing storage functions. The processors 120, 152, and 175 may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs) and processors based on a multi-core processor architecture, as non-limiting examples. The processors 120, 152, and 175 may be means for performing functions, such as controlling the UE 110, RAN node 170, network element(s) 190, and other functions as described herein.
In general, the various embodiments of the user equipment 110 can include, but are not limited to, cellular telephones such as smart phones, tablets, personal digital assistants (PDAs) having wireless communication capabilities, portable computers having wireless communication capabilities, image capture devices such as digital cameras having wireless communication capabilities, gaming devices having wireless communication capabilities, music storage and playback appliances having wireless communication capabilities, Internet appliances permitting wireless Internet access and browsing, tablets with wireless communication capabilities, as well as portable units or terminals that incorporate combinations of such functions.
One or more of modules 140-1, 140-2, 150-1, and 150-2 may be configured to implement mechanisms for finetuning or training at least one neural network. Computer program code 173 may also be configured to implement mechanisms for finetuning or training at least one neural network.
As described above,
A computer program product is therefore defined in those instances in which the computer program instructions, such as computer-readable program code portions, are stored by at least one non-transitory computer-readable storage medium with the computer program instructions, such as the computer-readable program code portions, being configured, upon execution, to perform the functions described above, such as in conjunction with the flowchart(s) of
Accordingly, blocks of the flowcharts support combinations of means for performing the specified functions and combinations of operations for performing the specified functions. It will also be understood that one or more blocks of the flowcharts, and combinations of blocks in the flowcharts, may be implemented by special purpose hardware-based computer systems which perform the specified functions, or combinations of special purpose hardware and computer instructions.
In some embodiments, certain ones of the operations above may be modified or further amplified. Furthermore, in some embodiments, additional optional operations may be included. Modifications, additions, or amplifications to the operations above may be performed in any order and in any combination.
In the above, some example embodiments have been described with reference to an SEI message or an SEI NAL unit. It needs to be understood, however, that embodiments can be similarly realized with any similar structures or data units. Where example embodiments have been described with SEI messages contained in a structure, any independently parsable structures could likewise be used in embodiments. Specific SEI NAL unit and SEI message syntax structures have been presented in example embodiments, but it needs to be understood that embodiments generally apply to any syntax structures with a similar intent as SEI NAL units and/or SEI messages.
In the above, some embodiments have been described in relation to a particular type of a parameter set (namely adaptation parameter set). It needs to be understood, however, that embodiments could be realized with any type of parameter set or other syntax structure in the bitstream.
In the above, some example embodiments have been described with the help of syntax of the bitstream. It needs to be understood, however, that the corresponding structure and/or computer program may reside at the encoder for generating the bitstream and/or at the decoder for decoding the bitstream.
In the above, where example embodiments have been described with reference to an encoder, it needs to be understood that the resulting bitstream and the decoder have corresponding elements in them. Likewise, where example embodiments have been described with reference to a decoder, it needs to be understood that the encoder has structure and/or computer program for generating the bitstream to be decoded by the decoder.
Many modifications and other embodiments of the inventions set forth herein will come to mind to one skilled in the art to which these inventions pertain having the benefit of the teachings presented in the foregoing descriptions and the associated drawings. Therefore, it is to be understood that the inventions are not to be limited to the specific embodiments disclosed and that modifications and other embodiments are intended to be included within the scope of the appended claims. Moreover, although the foregoing descriptions and the associated drawings describe example embodiments in the context of certain example combinations of elements and/or functions, it should be appreciated that different combinations of elements and/or functions may be provided by alternative embodiments without departing from the scope of the appended claims. In this regard, for example, different combinations of elements and/or functions than those explicitly described above are also contemplated as may be set forth in some of the appended claims. Accordingly, the description is intended to embrace all such alternatives, modifications and variances which fall within the scope of the appended claims. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.
It should be understood that the foregoing description is only illustrative. Various alternatives and modifications may be devised by those skilled in the art. For example, features recited in the various dependent claims could be combined with each other in any suitable combination(s). In addition, features from different embodiments described above could be selectively combined into a new embodiment. Accordingly, the description is intended to embrace all such alternatives, modifications and variances which fall within the scope of the appended claims.
References to a ‘computer’, ‘processor’, etc. should be understood to encompass not only computers having different architectures such as single/multi-processor architectures and sequential (Von Neumann)/parallel architectures but also specialized circuits such as field-programmable gate arrays (FPGA), application specific circuits (ASIC), signal processing devices and other processing circuitry. References to computer program, instructions, code etc. should be understood to encompass software for a programmable processor or firmware such as, for example, the programmable content of a hardware device such as instructions for a processor, or configuration settings for a fixed-function device, gate array or programmable logic device, and the like.
As used herein, the term ‘circuitry’ may refer to any of the following: (a) hardware circuit implementations, such as implementations in analog and/or digital circuitry, and (b) combinations of circuits and software (and/or firmware), such as (as applicable): (i) a combination of processor(s) or (ii) portions of processor(s)/software including digital signal processor(s), software, and memory(ies) that work together to cause an apparatus to perform various functions, and (c) circuits, such as a microprocessor(s) or a portion of a microprocessor(s), that require software or firmware for operation, even if the software or firmware is not physically present. This description of ‘circuitry’ applies to uses of this term in this application. As a further example, as used herein, the term ‘circuitry’ would also cover an implementation of merely a processor (or multiple processors) or a portion of a processor and its (or their) accompanying software and/or firmware. The term ‘circuitry’ would also cover, for example and if applicable to the particular element, a baseband integrated circuit or applications processor integrated circuit for a mobile phone or a similar integrated circuit in a server, a cellular network device, or another network device.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/IB2022/053577 | 4/15/2022 | WO |
Number | Date | Country | |
---|---|---|---|
63179168 | Apr 2021 | US |