At present, encoding and decoding of pictures and video may use either a traditional method or an intelligent method based on a neural network. The traditional method eliminates redundancy in the input data. For example, encoding and decoding of a picture or video eliminates redundancy by exploiting the spatial correlation within each picture and the temporal correlation between multiple pictures. The intelligent method uses a neural network to process the picture information and extract feature data.
In the related art, the decoded data obtained through the encoding and decoding process is generally used directly as the input data of the intelligent task network. However, the decoded data may include a large amount of redundant information that is not needed by the intelligent task network, and transmitting this redundant information wastes bandwidth or reduces the efficiency of the intelligent task network. In addition, there is almost no correlation between the end-to-end encoding and decoding process and the intelligent task network, so the encoding and decoding process cannot be optimized for the intelligent task network.
The embodiments of the present disclosure relate to the technical field of intelligent coding, in particular to an encoding and decoding method, a bitstream, an encoder, a decoder, a storage medium and a system.
The embodiments of the present disclosure provide an encoding method and a decoding method, a bitstream, an encoder, a decoder, a storage medium and a system.
The technical solution of the embodiments of the disclosure may be implemented as follows.
In a first aspect, the embodiments of the present application provide a decoding method, the method includes that:
In a second aspect, the embodiments of the present application provide an encoding method, the method includes that:
In a third aspect, the embodiments of the present disclosure provide a decoder, the decoder includes: a second memory and a second processor.
The second memory is configured to store a computer program capable of running on the second processor.
The second processor is configured to, when running the computer program: parse a bitstream to determine reconstruction feature data; and perform feature analysis on the reconstruction feature data by using an intelligent task network to determine a target result.
In order to enable a more detailed understanding of the features and technical content of the embodiments of the present disclosure, the implementation of the embodiments of the present disclosure will be described in detail below in conjunction with the accompanying drawings, which are provided for illustration only and are not intended to limit the embodiments of the present disclosure.
Unless otherwise defined, all technical and scientific terms used herein have the same meanings as are commonly understood by those skilled in the art of the present disclosure. Terms used herein are for the purpose of describing the embodiments of the disclosure only and are not intended to limit the present disclosure.
In the following description, reference is made to “some embodiments” that describe a subset of all possible embodiments. However, it is to be understood that “some embodiments” may be the same subset or different subsets of all possible embodiments and may be combined with each other without conflict. It is further to be pointed out that the terms “first/second/third” referred to in the embodiments of the present disclosure are merely used to distinguish similar objects and do not represent a particular order for the objects. It is to be understood that “first/second/third” may be interchanged in a particular order or sequence where allowed, such that the embodiments of the disclosure described herein may be implemented in an order other than that illustrated or described herein.
Presently, encoding and decoding of pictures and video may use either a traditional method or an intelligent method based on a neural network. The traditional method eliminates redundancy in the input data. For example, encoding and decoding of a picture or video eliminates redundancy by exploiting the spatial correlation within each picture and the temporal correlation between multiple pictures. The intelligent method uses a neural network to process the picture information and extract feature data.
In a particular embodiment, for pictures, picture-based encoding and decoding may be categorized into a traditional method and a neural network-based intelligent method. The traditional method uses the spatial correlation of pixels to eliminate redundancy in the picture, obtains the bitstream through transform, quantization and entropy coding, and transmits the bitstream. The intelligent method uses a neural network to perform the encoding and decoding process. At present, neural network-based picture encoding and decoding has introduced many efficient neural network structures, which can be used to extract feature information from the picture. Herein, the Convolutional Neural Network (CNN) is the earliest network structure used for picture encoding and decoding. On the basis of the CNN, many improved neural network structures as well as probability estimation models have been derived. Taking the neural network structure as an example, it may include network structures such as the Generative Adversarial Network (GAN) and the Recurrent Neural Network (RNN), which can improve neural network-based end-to-end picture compression performance. The GAN-based picture encoding and decoding method has achieved significant results in improving the subjective quality of pictures.
In another specific embodiment, for video, video-based encoding and decoding may also be categorized into a traditional method and a neural network-based intelligent method. The traditional method performs the encoding and decoding process for the video through intra prediction coding or inter prediction coding, transform, quantization, entropy coding, in-loop filtering, etc. At present, the intelligent method mainly focuses on three aspects: hybrid neural network coding (that is, embedding neural networks into the video framework in place of traditional coding modules), neural network rate-distortion optimization coding, and end-to-end video coding. Hybrid neural network coding is generally applied to the inter prediction module, the in-loop filtering module and the entropy coding module. Neural network rate-distortion optimization coding uses the highly nonlinear characteristics of the neural network to train the neural network to be an efficient discriminator and classifier; for example, it is applied to the mode decision stage of video coding. At present, end-to-end video coding is generally classified into replacing all modules of the traditional coding method with CNNs, or expanding the input dimension of the neural network to perform end-to-end compression for all frames.
In the related art, referring to
In addition, in the embodiments of the present disclosure, the intelligent task network generally analyzes the picture or video to complete task objectives such as target detection, target tracking or behavior recognition. The input of the intelligent task network is the decoded data obtained through the encoding method and the decoding method, and the processing flow of the intelligent task network generally includes A1 and A2. A1 refers to the process of performing feature extraction on the input decoded data and obtaining the feature data for the target of the intelligent task network, while A2 refers to the process of processing the feature data and obtaining the result. Specifically, referring
It is to be understood that the input data of the intelligent task network generally is the decoded data obtained through the encoding method and decoding method, and the decoded data is directly used as the input of the intelligent task network. Referring to
In this way, the decoded data obtained through the encoding method and the decoding method is directly input into the intelligent task network. On one hand, the decoded data may include a large amount of redundant information that is not needed by the intelligent task network, and transmitting this redundant information wastes bandwidth or reduces the efficiency of the intelligent task network. On the other hand, the correlation between the end-to-end encoding and decoding process and the intelligent task network is almost zero, so the encoding and decoding process cannot be optimized for the intelligent task network.
On the basis of the above, the embodiment of the disclosure provides an encoding method, which is applied to an encoder. The feature extraction for input picture data is performed by using an intelligent task network to obtain initial feature data. The encoding process is performed for the initial feature data by using an encoding network, and the obtained encoded bits are signalled in a bitstream.
The embodiment of the disclosure also provides a decoding method, which is applied to the decoder. The bitstream is parsed to determine reconstruction feature data. The feature analysis is performed on the reconstruction feature data by using an intelligent task network to determine a target result.
In this way, the feature data extracted by the intelligent task network is used as the input of the encoding network. The encoding network can thus better learn the picture information required by the intelligent task network, and the processes in the related art of restoring the picture and then extracting feature data from the restored picture can be omitted, so that the output of the decoding network can be processed by the intelligent task network without being restored to the picture dimension. This greatly reduces the complexity of the intelligent task network and further improves its accuracy and speed.
The embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
Referring to
Referring to
In an embodiment of the present disclosure, referring to
At S501, the feature extraction for input picture data is performed by using an intelligent task network to obtain initial feature data.
At S502, the encoding process for the initial feature data is performed by using an encoding network, and the obtained encoded bits are signalled in a bitstream.
It is to be noted that the encoding method is applied to the encoder. In embodiments of the present disclosure, the encoder may include an intelligent task network and an encoding network. The intelligent task network is used to perform the feature extraction on the input picture data, and the encoding network is used to perform the encoding process on the initial feature data. In this way, the feature data extracted by the intelligent task network is used as the input of the encoding network, which helps the encoding network better learn the picture information needed by the intelligent task network.
It is also to be noted that in the encoder, after the initial feature data is extracted, the intelligent task network does not execute its subsequent processing flow. Instead, the encoding node with the same dimension is used directly to perform the encoding process on the initial feature data, such that in the decoder, after the reconstruction feature data is determined by the decoding network, the subsequent processing flow of the intelligent task network can be executed on the reconstruction feature data. In this way, the encoding network can better learn the picture information required by the intelligent task network, and the processes in the related art of restoring the picture and then extracting feature data from the restored picture can be omitted, so that the output of the decoding network can be processed by the intelligent task network without being restored to the picture dimension, thereby greatly reducing the complexity of the intelligent task network.
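As an illustration only, operations S501 and S502 can be sketched as a two-stage pipeline. The following Python sketch is not the implementation of the disclosure: the names `encode_picture`, `toy_extract` and `toy_encode` are hypothetical, the picture is assumed to be a flat list of samples in [0, 1], and pairwise averaging and 8-bit quantization merely stand in for the intelligent task network and the encoding network.

```python
from typing import Callable, List

Feature = List[float]


def encode_picture(picture: List[float],
                   extract_features: Callable[[List[float]], Feature],
                   encode_features: Callable[[Feature], bytes]) -> bytes:
    """Run S501 (feature extraction) and then S502 (encoding into a bitstream)."""
    initial_feature_data = extract_features(picture)  # S501
    return encode_features(initial_feature_data)      # S502


def toy_extract(picture: List[float]) -> Feature:
    """Hypothetical stand-in for feature extraction: average neighbouring pairs."""
    return [(picture[i] + picture[i + 1]) / 2
            for i in range(0, len(picture) - 1, 2)]


def toy_encode(features: Feature) -> bytes:
    """Hypothetical stand-in for the encoding network: 8-bit uniform quantization."""
    return bytes(min(255, max(0, int(v * 255 + 0.5))) for v in features)


bitstream = encode_picture([0.0, 1.0, 0.2, 0.2], toy_extract, toy_encode)
```

Note that the picture itself is never encoded in this sketch; only the extracted feature data reaches the stand-in encoding stage, which mirrors the division of work described above.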
In some embodiments, for an intelligent task network, the intelligent task network may at least include a feature extraction sub-network. The operation of performing feature extraction for the input picture data by using the intelligent task network to determine the initial feature data may include that: feature extraction for the input picture data is performed by using the feature extraction sub-network to obtain initial feature data at a first feature node.
Further, in some embodiments, the feature extraction sub-network may include N feature extraction layers, where N is an integer greater than or equal to 1. Correspondingly, the operation of performing feature extraction for the input picture data by using the feature extraction sub-network to obtain the initial feature data at the first feature node may include following operations.
When N is equal to 1, the feature extraction for the input picture data is performed by using the feature extraction layer to obtain initial feature data at a first feature node.
When N is greater than 1, the feature extraction for the input picture data is performed by using the N feature extraction layers to obtain initial feature data at a first feature node.
It is to be understood that the first feature node may be the feature node corresponding to any one of the feature extraction layers, and which feature extraction layer it corresponds to is determined according to the actual situation. For example, when it is determined in the intelligent task network that the encoding and decoding processes are needed after a certain feature extraction layer, the feature node corresponding to this feature extraction layer is the first feature node, the feature extraction layers up to and including this layer form the feature extraction sub-network, and the initial feature data extracted after passing through this feature extraction layer is input into the encoding network.
That is, the initial feature data at the first feature node may be obtained by performing feature extraction by one feature extraction layer or by performing feature extraction by two or more feature extraction layers, which is not limited in the embodiments of the present disclosure.
For example, if the encoding and decoding processes are required after the first feature extraction layer, then the first feature node is the feature node corresponding to the first feature extraction layer, the extracted feature data is the initial feature data to be input into the encoding network, and the feature extraction sub-network is only the first feature extraction layer. If the encoding and decoding processes are required after the second feature extraction layer, then the first feature node is the feature node corresponding to the second feature extraction layer, the feature data extracted at this case is the initial feature data to be input into the encoding network, and the feature extraction sub-network is the first feature extraction layer and the second feature extraction layer.
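The choice of the first feature node can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each feature extraction layer is a function from a list of values to a list of values; `extract_at_node`, `halve` and `relu` are hypothetical names and are not layers named by the disclosure.

```python
from typing import Callable, List, Sequence

Layer = Callable[[List[float]], List[float]]


def extract_at_node(layers: Sequence[Layer],
                    picture: List[float],
                    node_index: int) -> List[float]:
    """Run the first `node_index` layers (counted from 1) and return the
    feature data at the corresponding feature node."""
    if not 1 <= node_index <= len(layers):
        raise ValueError("node_index must be between 1 and the number of layers")
    data = picture
    for layer in layers[:node_index]:
        data = layer(data)
    return data


def halve(x: List[float]) -> List[float]:
    """Toy layer: average neighbouring pairs, halving the length."""
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x) - 1, 2)]


def relu(x: List[float]) -> List[float]:
    """Toy layer: clamp negative values to zero."""
    return [max(0.0, v) for v in x]


layers = [halve, relu, halve]
picture = [1.0, 3.0, -4.0, -2.0]
after_first_layer = extract_at_node(layers, picture, 1)   # sub-network = layer 1
after_second_layer = extract_at_node(layers, picture, 2)  # sub-network = layers 1-2
```

With `node_index` equal to 1 the feature extraction sub-network is only the first layer; with 2 it is the first and second layers, matching the two cases in the example above.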
Further, after the initial feature data at the first feature node is obtained, which encoding node in the encoding network it corresponds to as to-be-encoded feature data depends on the data dimensions of the two nodes. Therefore, in some embodiments, the method further includes that:
It is to be understood that in the embodiments of the present disclosure, when the parameters such as the number of feature space channels and the resolution at the first feature node are completely consistent with the parameters such as the number of feature space channels and the resolution at the first encoding node, it may be determined that the data dimension of the first feature node in the intelligent task network matches the data dimension of the first encoding node in the encoding network. That is, after the initial feature data at the first feature node is extracted by the intelligent task network, the encoding process may be performed directly by using the corresponding first encoding node with the same data dimension in the encoding network. The initial feature data is input to the first encoding node in the encoding network, the encoding process for the initial feature data is performed by the encoding network, and the obtained encoded bits are signalled in the bitstream.
It is further to be understood that in the embodiments of the present disclosure, when the parameters such as the number of feature space channels and the resolution at the first feature node are not completely consistent with the parameters such as the number of feature space channels and the resolution at the first encoding node in the encoding network, the data dimension of the first feature node in the intelligent task network does not match the data dimension of the first encoding node in the encoding network. In this case, the data dimension conversion for the initial feature data at the first feature node is performed by using an adaptation network to obtain the to-be-encoded feature data at the first encoding node. Thus, the operations of performing the encoding process for the initial feature data by the encoding network and signalling the obtained encoded bits in the bitstream may include that: the to-be-encoded feature data is input to the first encoding node in the encoding network, the encoding process for the to-be-encoded feature data is performed by using the encoding network, and the obtained encoded bits are signalled in the bitstream.
It is further to be understood that in the embodiments of the present disclosure, when the intelligent task network and the encoding network are cascaded, the spatial resolution or the number of channels of the feature map inputted to the analysis network and of the reconstruction feature map may not match each other. On the basis of the above, a single-layer or multi-layer network structure can be added as an adapter (that is, the adaptation network herein) to perform the feature dimension conversion process, so as to adapt the cascade of the two networks. The network structure of the adapter may use, but is not limited to, up-sampling, down-sampling, selecting or repeating part of the channels, etc., which is not limited herein.
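The conversions named above (up-sampling, down-sampling, selecting or repeating part of the channels) can be illustrated with a non-learned Python sketch. It assumes a feature map stored as nested lists in [channels][height][width] order and uses nearest-neighbour resampling; the names `adapt_channels`, `resample` and `adapt` are hypothetical, and the adaptation network of the disclosure may instead be a trained single-layer or multi-layer structure.

```python
from typing import List, Tuple

FeatureMap = List[List[List[float]]]  # [channels][height][width]


def shape(fm: FeatureMap) -> Tuple[int, int, int]:
    """Return (channels, height, width) of a feature map."""
    return (len(fm), len(fm[0]), len(fm[0][0]))


def adapt_channels(fm: FeatureMap, target_channels: int) -> FeatureMap:
    """Select the leading channels, or repeat channels cyclically,
    until the feature map has `target_channels` channels."""
    return [[row[:] for row in fm[c % len(fm)]] for c in range(target_channels)]


def resample(fm: FeatureMap, target_h: int, target_w: int) -> FeatureMap:
    """Nearest-neighbour resampling: up-samples when the target is larger
    than the source and down-samples when it is smaller."""
    _, src_h, src_w = shape(fm)
    return [[[channel[y * src_h // target_h][x * src_w // target_w]
              for x in range(target_w)]
             for y in range(target_h)]
            for channel in fm]


def adapt(fm: FeatureMap, target: Tuple[int, int, int]) -> FeatureMap:
    """Convert `fm` to the (channels, height, width) given by `target`."""
    channels, height, width = target
    return resample(adapt_channels(fm, channels), height, width)


feature_map = [[[1.0, 2.0], [3.0, 4.0]]]       # shape (1, 2, 2)
enlarged = adapt(feature_map, (2, 4, 4))       # repeat channel, up-sample
reduced = adapt(feature_map, (1, 1, 1))        # down-sample
```

A fixed resampling rule like this discards or duplicates information; a learned adapter could be preferable where the mismatch is large, which is why the sketch should be read only as an illustration of the dimension conversion.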
It is to be understood that the embodiments of the present disclosure mainly provide an intelligent fusion network model in which an end-to-end encoding and decoding network and an intelligent task network are cascaded. The end-to-end encoding and decoding network includes an encoding network and a decoding network. In other words, the intelligent fusion network model may include the encoding network, the decoding network and the intelligent task network. The encoding network and part of the intelligent task network are used in the encoder, and the decoding network and the other part of the intelligent task network are used in the decoder.
In the embodiments of the present application, the training for the intelligent fusion network model may be performed in an encoder, a decoder, or even in both the encoder and the decoder, which is not limited herein.
In one possible implementation, for the training of the intelligent fusion network model, the method may further include following operations.
At least one training sample is determined.
A preset network model is trained by using the at least one training sample. The preset network model includes an initial encoding network, an initial decoding network and an initial intelligent task network, and the initial encoding network and the initial decoding network are connected with the initial intelligent task network through a node.
When a loss function corresponding to the preset network model converges to a preset threshold value, a model obtained after training is determined as an intelligent fusion network model. The intelligent fusion network model includes the encoding network, a decoding network and the intelligent task network.
In a specific embodiment, the loss function corresponding to the preset network model may be divided into two parts: a loss function of the encoding and decoding network and a loss function of the intelligent task network. Therefore, in some embodiments, the method further includes following operations.
The first rate-distortion tradeoff parameter of the intelligent task network, the loss value of the intelligent task network, the second rate-distortion tradeoff parameter of the encoding and decoding network, the distortion value and the bit rate of the bitstream of the encoding and decoding network are determined.
The loss function corresponding to the preset network model is determined according to the first rate-distortion tradeoff parameter, the second rate-distortion tradeoff parameter, the loss value of the intelligent task network, the distortion value and the bit rate of the bitstream of the encoding and decoding network.
That is, in the embodiments of the present disclosure, the retraining method of the intelligent fusion network model may be to jointly train a fusion network formed by connecting the initial intelligent task network and the initial encoding and decoding network through the node. Exemplarily, the loss function may be as follows:
$$\mathrm{loss} = R + \lambda_1 \cdot \mathrm{loss}_{task} + \lambda_2 \cdot D(x, \hat{x}) \qquad (1)$$
where $R$ denotes the bit rate of the bitstream of the encoding and decoding network; $\lambda_1$ and $\lambda_2$ denote the rate-distortion tradeoff parameters, and different $\lambda_1$, $\lambda_2$ correspond to different models, i.e., different total bit rates; $\mathrm{loss}_{task}$ denotes the loss value of the intelligent task network; and $D(x, \hat{x})$ denotes the distortion value between the input data and the decoded data. Here, $x$ and $\hat{x}$ denote the data at the encoding node and at the reconstruction node used by the encoding and decoding network, respectively, rather than picture data. In addition, the distortion value here may be measured by using the Mean Squared Error (MSE).
Equation (1) may be regarded as two parts: $\lambda_1 \cdot \mathrm{loss}_{task}$ denotes the loss function of the intelligent task network, and $R + \lambda_2 \cdot D(x, \hat{x})$ denotes the loss function of the encoding and decoding network. That is, the loss function of the intelligent fusion network model may be obtained from both the loss function of the intelligent task network and the loss function of the encoding and decoding network. The values of $\lambda_1$ and $\lambda_2$ are set according to the actual situation. For example, the value of $\lambda_1$ is 0.3 and the value of $\lambda_2$ is 0.7, which is not limited herein.
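As a worked illustration, the following Python sketch evaluates the loss under the assumption that equation (1) is the sum of the two parts just described, namely the rate R, the task loss weighted by λ1, and the distortion weighted by λ2, with the distortion measured by MSE. The function names and the numeric inputs are hypothetical.

```python
from typing import Sequence


def mse(x: Sequence[float], x_hat: Sequence[float]) -> float:
    """Mean squared error between the data at the encoding node and the
    data at the reconstruction node."""
    if len(x) != len(x_hat) or not x:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def fusion_loss(rate: float, task_loss: float,
                x: Sequence[float], x_hat: Sequence[float],
                lambda1: float, lambda2: float) -> float:
    """Assumed form of equation (1): R + lambda1 * loss_task + lambda2 * D(x, x_hat)."""
    return rate + lambda1 * task_loss + lambda2 * mse(x, x_hat)


# Illustrative numbers only: D = ((1-1)^2 + (2-4)^2) / 2 = 2.0,
# so loss = 0.5 + 0.3 * 2.0 + 0.7 * 2.0 = 2.5.
example_loss = fusion_loss(rate=0.5, task_loss=2.0,
                           x=[1.0, 2.0], x_hat=[1.0, 4.0],
                           lambda1=0.3, lambda2=0.7)
```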
It is to be noted that the retraining method of the intelligent fusion network model in the embodiments of the present disclosure may also be performed step by step. For example, the value of λ2 may first be set to zero and the value of λ1 set to any value; in this case, the training is performed for the intelligent task network. Then the value of λ1 is set to zero and the value of λ2 is set to any value; in this case, the training is performed for the encoding and decoding network (including the encoding network and the decoding network). Finally, joint training is performed. The training method is not limited herein; other training methods, or even a combination of different training methods, may also be used.
It is also to be noted that the intelligent task network may include a feature extraction sub-network and a feature analysis sub-network. The feature extraction sub-network may be used to perform the feature extraction on the input picture data to determine the initial feature data, and then the encoding network performs the encoding process on the initial feature data. The feature analysis sub-network may be used to perform the feature analysis on the input feature data to determine the target result, which may refer to completing task objectives such as target detection, target tracking or behavior recognition.
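The step-by-step procedure can be sketched as a three-stage schedule combined with the stopping rule described earlier (training ends once the loss reaches a preset threshold). This Python sketch is only an assumption-laden illustration: `staged_schedule`, `train_until_converged` and `make_toy_step` are hypothetical names, and the toy step simply halves a stored loss value instead of updating any real network.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, float, float]  # (stage name, lambda1, lambda2)


def staged_schedule(lambda1: float, lambda2: float) -> List[Stage]:
    """Three stages: task network only, codec only, then joint training."""
    return [
        ("intelligent task network", lambda1, 0.0),
        ("encoding and decoding network", 0.0, lambda2),
        ("joint training", lambda1, lambda2),
    ]


def train_until_converged(step: Callable[[float, float], float],
                          lambda1: float, lambda2: float,
                          threshold: float, max_steps: int = 1000) -> int:
    """Call `step` until the loss it returns is at or below `threshold`;
    return the number of steps taken."""
    for count in range(1, max_steps + 1):
        if step(lambda1, lambda2) <= threshold:
            return count
    raise RuntimeError("loss did not reach the threshold within max_steps")


def make_toy_step(start: float = 1.0, decay: float = 0.5) -> Callable[[float, float], float]:
    """Hypothetical training step: ignores the weights and halves a stored loss."""
    loss = start

    def step(lambda1: float, lambda2: float) -> float:
        nonlocal loss
        loss *= decay
        return loss

    return step


steps_per_stage = [
    (name, train_until_converged(make_toy_step(), l1, l2, threshold=0.1))
    for name, l1, l2 in staged_schedule(0.3, 0.7)
]
```

In a real setting the `step` callable would perform one optimization step on the preset network model with the given weights and return the resulting loss value.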
Thus, after training to obtain the intelligent task network, in some embodiments, the method may further include following operations.
The feature extraction for the input picture data is performed by using the feature extraction sub-network to obtain initial feature data.
The feature analysis for the initial feature data is performed by using the feature analysis sub-network to determine the target result.
That is, after the training, the trained intelligent task network may directly perform the feature extraction and feature analysis on the input picture data, and in this case, the target result may be determined without passing through the encoding network and the decoding network. For example, when a local picture is processed, there is no need to pass through the end-to-end encoding and decoding network for data transmission. In this case, after the intelligent task network (including the feature extraction sub-network and the feature analysis sub-network) is obtained by training, it may also be applied to picture analysis and processing to obtain the target result of the intelligent task.
The embodiments of the disclosure provide an encoding method, which is applied to an encoder. The feature extraction for input picture data is performed by using an intelligent task network to obtain initial feature data. The encoding process for the initial feature data is performed by using an encoding network, and the obtained encoded bits are signalled in a bitstream. In this way, the feature data extracted by the intelligent task network is used as the input of the encoding network. The encoding network can thus better learn the picture information required by the intelligent task network, and the processes in the related art of restoring the picture and then extracting feature data from the restored picture can be omitted, thereby greatly reducing the complexity of the intelligent task network and further improving its accuracy and speed.
Another embodiment of the present disclosure provides a bitstream. The bitstream is generated by performing bit encoding on to-be-encoded information.
In the embodiments of the present disclosure, the to-be-encoded information at least includes initial feature data, and the initial feature data is obtained by performing feature extraction on the input picture data through an intelligent task network. In this way, after the encoder generates the bitstream, the bitstream may be transmitted to the decoder, so that the decoder can subsequently obtain the reconstruction feature data by parsing the bitstream.
In another embodiment of the present disclosure, referring to
At S601, a bitstream is parsed to determine reconstruction feature data.
At S602, the feature analysis is performed on the reconstruction feature data by using an intelligent task network to determine a target result.
It is to be noted that the decoding method is applied to the decoder. In the embodiments of the present disclosure, the decoder may include a decoding network. Thus, for operation S601, the operation that the bitstream is parsed to determine the reconstruction feature data may further include that: the bitstream is parsed by using the decoding network to determine the reconstruction feature data.
It is further to be understood that in the embodiments of the present disclosure, the decoder not only has a decoding function, but also may have an intelligent analysis function. That is, the decoder in the embodiments of the present application includes an intelligent task network in addition to the decoding network. In this way, the decoder does not need to reconstruct the decoded picture and only needs to reconstruct to the feature space. That is, after the reconstruction feature data is obtained by decoding, the intelligent task network may be used to perform feature analysis on the reconstruction feature data so as to determine the target result, which may indicate completing task objectives such as target detection, target tracking or behavior recognition.
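Operations S601 and S602 can be sketched in the same spirit. In the following Python sketch, `decode_and_analyze`, `toy_decode` and `toy_analyze` are hypothetical names: de-quantizing bytes stands in for the decoding network, and a simple threshold stands in for the feature analysis of the intelligent task network. No picture is reconstructed at any point.

```python
from typing import Callable, List


def decode_and_analyze(bitstream: bytes,
                       decode_features: Callable[[bytes], List[float]],
                       analyze_features: Callable[[List[float]], str]) -> str:
    """Run S601 (parse the bitstream into reconstruction feature data)
    and then S602 (feature analysis to determine the target result)."""
    reconstruction_feature_data = decode_features(bitstream)  # S601
    return analyze_features(reconstruction_feature_data)      # S602


def toy_decode(bitstream: bytes) -> List[float]:
    """Hypothetical stand-in for the decoding network: 8-bit de-quantization."""
    return [b / 255 for b in bitstream]


def toy_analyze(features: List[float]) -> str:
    """Hypothetical stand-in for feature analysis: threshold the strongest feature."""
    return "target detected" if max(features) >= 0.5 else "no target"


target_result = decode_and_analyze(bytes([128, 51]), toy_decode, toy_analyze)
```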
In some embodiments, the operation of parsing the bitstream to determine the reconstruction feature data may include:
It is to be noted that not all the reconstruction feature data obtained by decoding meets the requirements, which depends on the data dimension of the feature node in the intelligent task network. Specifically, when a data dimension of a first feature node in the intelligent task network matches a data dimension of a first reconstruction node, candidate reconstruction feature data at the first reconstruction node may be determined as the reconstruction feature data at the first feature node of the intelligent task network. That is, it is to be understood that in the embodiments of the present disclosure, when the parameters such as the number of feature space channels and the resolution at the first feature node are completely consistent with the parameters such as the number of feature space channels and the resolution at the first reconstruction node, it may be determined that the data dimension of the first feature node in the intelligent task network matches the data dimension of the first reconstruction node in the decoding network.
It is further to be understood that in the embodiments of the present disclosure, when the parameters such as the number of feature space channels and the resolution at the first feature node in the intelligent task network are not completely consistent with the parameters such as the number of feature space channels and the resolution at the first reconstruction node, the data dimension of the first feature node in the intelligent task network does not match the data dimension of the first reconstruction node. In this case, the data dimension conversion for the candidate reconstruction feature data at the first reconstruction node is performed by using an adaptation network to obtain the reconstruction feature data.
It is further to be understood that when the intelligent task network and the decoding network are cascaded, the spatial resolution or the number of channels of the feature map inputted to the analysis network and of the reconstruction feature map may not match each other. On the basis of the above, a single-layer or multi-layer network structure can be added as an adapter (that is, the adaptation network herein) to perform the feature dimension conversion process, so as to adapt the cascade of the two networks. The network structure of the adapter may use, but is not limited to, up-sampling, down-sampling, selecting or repeating part of the channels, etc., which is not limited herein.
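The decoder-side decision between the matched and unmatched cases can be illustrated as follows. The Python sketch assumes a feature map stored as nested lists in [channels][height][width] order and treats the adaptation network as an arbitrary callable supplied by the caller; `reconstruction_at_feature_node` and `keep_leading_channels` are hypothetical names, and the toy adapter handles only a surplus of channels.

```python
from typing import Callable, List, Tuple

Dimension = Tuple[int, int, int]        # (channels, height, width)
FeatureMap = List[List[List[float]]]    # [channels][height][width]


def dimension_of(fm: FeatureMap) -> Dimension:
    """Return the data dimension of a feature map."""
    return (len(fm), len(fm[0]), len(fm[0][0]))


def reconstruction_at_feature_node(
        candidate: FeatureMap,
        feature_node_dimension: Dimension,
        adaptation_network: Callable[[FeatureMap, Dimension], FeatureMap],
) -> FeatureMap:
    """Use the candidate directly when the dimensions match; otherwise convert
    it with the adaptation network and verify the converted dimension."""
    if dimension_of(candidate) == feature_node_dimension:
        return candidate
    adapted = adaptation_network(candidate, feature_node_dimension)
    if dimension_of(adapted) != feature_node_dimension:
        raise ValueError("adaptation network produced an unexpected dimension")
    return adapted


def keep_leading_channels(fm: FeatureMap, target: Dimension) -> FeatureMap:
    """Hypothetical toy adapter: select the leading channels only."""
    return fm[:target[0]]


candidate = [[[1.0, 2.0]], [[3.0, 4.0]]]                      # dimension (2, 1, 2)
matched = reconstruction_at_feature_node(candidate, (2, 1, 2), keep_leading_channels)
converted = reconstruction_at_feature_node(candidate, (1, 1, 2), keep_leading_channels)
```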
Further, after determining the reconstruction feature data, in some embodiments, for S602, the operation that the feature analysis for the reconstruction feature data is performed by using an intelligent task network to determine a target result may include that:
the reconstruction feature data is input to the first feature node in the intelligent task network, and feature analysis for the reconstruction feature data is performed by using the intelligent task network to obtain the target result.
It is to be understood that the first feature node may be the feature node corresponding to any one of the feature extraction layers, and which feature extraction layer it corresponds to is determined according to the actual situation. For example, when it is determined in the intelligent task network that the encoding and decoding processes are needed after a certain feature extraction layer, the feature node corresponding to this feature extraction layer is the first feature node, and the initial feature data extracted after passing through that feature extraction layer is processed through the encoding network and the decoding network, such that the reconstruction feature data at the first feature node may be obtained, and then the target result can be obtained through analysis.
It is to be understood that the intelligent task network may include a feature extraction sub-network and a feature analysis sub-network. Accordingly, in a specific embodiment, the operation that the feature analysis is performed on the reconstruction feature data by using an intelligent task network to determine a target result may include the following operation.
When the first feature node is a feature node obtained after passing through the feature extraction sub-network, the reconstruction feature data is input to the first feature node, and feature analysis for the reconstruction feature data is performed by using the feature analysis sub-network to obtain the target result.
It is to be noted that the feature extraction sub-network may include several feature extraction layers. The first feature node may be obtained by passing through a feature extraction layer or passing through two or more feature extraction layers, which is not limited in the embodiments of the present disclosure.
For example, if it is determined in the intelligent task network that the encoding and decoding processes are required after the fourth feature extraction layer, the first feature node is the feature node corresponding to the fourth feature extraction layer, and the feature extraction sub-network includes four feature extraction layers. The reconstruction feature data obtained after the fourth feature extraction layer is input to the feature analysis sub-network for feature analysis, and the target result can be obtained. Similarly, if it is determined in the intelligent task network that the encoding and decoding processes are required after the second feature extraction layer, the first feature node is the feature node corresponding to the second feature extraction layer, and the feature extraction sub-network includes two feature extraction layers. The reconstruction feature data obtained after the second feature extraction layer is input to the feature analysis sub-network for feature analysis, and thus the target result may be obtained.
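By way of illustration only, the choice of the first feature node may be viewed as splitting an ordered list of layers into a feature extraction sub-network and a feature analysis sub-network. The following is a minimal sketch; the names `split_at_feature_node` and `run`, and the use of plain callables as stand-ins for network layers, are assumptions made for the example.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in for a network layer: any callable on a value.
Layer = Callable[[float], float]


def split_at_feature_node(
    layers: Sequence[Layer], node_index: int
) -> Tuple[List[Layer], List[Layer]]:
    """Split the task network after the `node_index`-th layer (1-based).

    The first part plays the role of the feature extraction sub-network
    (encoder side); the second part plays the role of the feature analysis
    sub-network (decoder side).
    """
    if not 1 <= node_index <= len(layers):
        raise ValueError("node_index must be between 1 and the number of layers")
    return list(layers[:node_index]), list(layers[node_index:])


def run(layers: Sequence[Layer], data: float) -> float:
    """Apply the layers in order."""
    for layer in layers:
        data = layer(data)
    return data


# Toy layers used only to exercise the split.
toy_layers: List[Layer] = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
extraction, analysis = split_at_feature_node(toy_layers, 2)
initial_feature = run(extraction, 5.0)      # computed at the encoder
target_result = run(analysis, initial_feature)  # computed at the decoder
```

With a lossless path between the two parts, the split network produces the same result as the unsplit one; in practice the encoding and decoding networks sit between the two parts and the reconstruction feature data only approximates the initial feature data.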
In some embodiments, it is also to be noted that the feature analysis sub-network may include a region proposal network (RPN) and a region of interest_heads (ROI_Heads). The output end of the region proposal network is connected with the input end of the region of interest_heads, the input end of the region proposal network is also connected with the input end of the region of interest_heads, and the output end of the region of interest_heads is used for outputting the target result.
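Accordingly, in some embodiments, the operation that the feature analysis is performed on the reconstruction feature data by using a feature analysis sub-network to obtain a target result may include: processing the reconstruction feature data by the region proposal network to obtain a target region; and performing intelligent analysis for the reconstruction feature data and the target region by the region of interest_heads to obtain the target result.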
That is, the reconstruction feature data is first processed through the region proposal network to obtain the target region, and then the reconstruction feature data and the target region are intelligently analyzed through the region of interest_heads, so that the target result may be obtained.
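By way of illustration only, the data flow through the region proposal network and the region of interest_heads may be sketched as follows. The two toy functions are hypothetical stand-ins and do not reproduce any real detection network; only the wiring (the features are fed to both stages, and the regions are fed from the first stage to the second) reflects the description above.

```python
from typing import Callable, Dict, List, Tuple

Features = List[List[float]]
Region = Tuple[int, int]


def analyze_features(
    features: Features,
    region_proposal_network: Callable[[Features], List[Region]],
    roi_heads: Callable[[Features, List[Region]], List[Dict]],
) -> List[Dict]:
    """Feed the features to the RPN, then feed features and regions to the ROI heads."""
    target_regions = region_proposal_network(features)
    return roi_heads(features, target_regions)


def toy_rpn(features: Features) -> List[Region]:
    """Hypothetical stand-in: propose every position whose activation exceeds 0.5."""
    return [
        (r, c)
        for r, row in enumerate(features)
        for c, value in enumerate(row)
        if value > 0.5
    ]


def toy_roi_heads(features: Features, regions: List[Region]) -> List[Dict]:
    """Hypothetical stand-in: report each proposed region with its activation as a score."""
    return [{"region": (r, c), "score": features[r][c]} for r, c in regions]


reconstruction_features = [[0.1, 0.9], [0.8, 0.2]]
result = analyze_features(reconstruction_features, toy_rpn, toy_roi_heads)
```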
It is further to be understood that the embodiments of the present disclosure mainly provide an intelligent fusion network model in which an end-to-end encoding and decoding network and an intelligent task network are cascaded, and the goal is that the intelligent task network can implement optimal performance through the processing and retraining of the intelligent fusion network model. The end-to-end encoding and decoding network includes an encoding network and a decoding network. In other words, the intelligent fusion network model may include the encoding network, the decoding network and the intelligent task network. The encoding network and part of the intelligent task network are used in the encoder, and the decoding network and the other part of the intelligent task network are used in the decoder.
Further, for the training of the intelligent fusion network model, in some embodiments, the method further includes the following operations.
At least one training sample is determined.
A preset network model is trained by using the at least one training sample. The preset network model includes an initial encoding network, an initial decoding network and an initial intelligent task network, and the initial encoding network and the initial decoding network are connected with the initial intelligent task network through a node.
When a loss function corresponding to the preset network model converges to a preset threshold value, the model obtained after training is determined as the intelligent fusion network model. The intelligent fusion network model includes the encoding network, the decoding network and the intelligent task network.
In a specific embodiment, the loss function corresponding to the preset network model may be divided into two parts: the loss function of the encoding and decoding network and the loss function of the intelligent task network. Therefore, in some embodiments, the method further includes the following operations.
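By way of illustration only, the stopping rule described above may be sketched as a loop that runs training steps until the loss reaches the preset threshold value. This is a minimal sketch: `train_until_converged`, the `max_steps` safeguard and the one-parameter toy model are assumptions made for the example, not the training procedure of the disclosure.

```python
from typing import Callable, Tuple


def train_until_converged(
    train_step: Callable[[], float],
    threshold: float,
    max_steps: int = 10_000,
) -> Tuple[int, float, bool]:
    """Run `train_step` until its returned loss is at or below `threshold`.

    Returns (steps taken, last loss, whether the threshold was reached).
    `max_steps` is a safeguard added for this sketch.
    """
    loss = float("inf")
    for step in range(1, max_steps + 1):
        loss = train_step()
        if loss <= threshold:
            return step, loss, True
    return max_steps, loss, False


class ToyModel:
    """Hypothetical one-parameter model with loss (w - 3)^2, trained by gradient descent."""

    def __init__(self) -> None:
        self.w = 0.0

    def step(self) -> float:
        gradient = 2.0 * (self.w - 3.0)
        self.w -= 0.1 * gradient
        return (self.w - 3.0) ** 2  # loss after the update


model = ToyModel()
steps, final_loss, converged = train_until_converged(model.step, threshold=1e-6)
```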
The first rate-distortion tradeoff parameter of the intelligent task network, the loss value of the intelligent task network, the second rate-distortion tradeoff parameter of the encoding and decoding network, the distortion value and the bit rate of the bitstream of the encoding and decoding network are determined.
The loss function corresponding to the preset network model is determined according to the first rate-distortion tradeoff parameter, the second rate-distortion tradeoff parameter, the loss value of the intelligent task network, the distortion value and the bit rate of the bitstream of the encoding and decoding network.
That is, in the embodiment of the present disclosure, the retraining method of the intelligent fusion network model may be to jointly train a fusion network formed by connecting the initial intelligent task network and the initial encoding and decoding network through the node. Exemplarily, the loss function may be shown as Equation (1) above.
With regard to Equation (1), it may be regarded as two parts: λ1·loss_task denotes the loss function of the intelligent task network, and R + λ2·D(x, x̂) denotes the loss function of the encoding and decoding network. That is, the loss function of the intelligent fusion network model may be obtained through both the loss function of the intelligent task network and the loss function of the encoding and decoding network. The values of λ1 and λ2 are specifically set according to the actual situation. For example, the value of λ1 is 0.3, and the value of λ2 is 0.7, which is not limited herein.
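By way of illustration only, the combination of the two parts may be sketched as follows. The sketch assumes that the two parts are simply summed and that the distortion D is a mean squared error; both are assumptions made for the example, and the function names are hypothetical.

```python
from typing import Sequence


def mean_squared_error(x: Sequence[float], x_hat: Sequence[float]) -> float:
    """Assumed distortion measure D(x, x_hat) for this sketch."""
    if len(x) != len(x_hat):
        raise ValueError("x and x_hat must have the same length")
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def fusion_loss(
    rate: float,
    distortion: float,
    task_loss: float,
    lambda1: float = 0.3,
    lambda2: float = 0.7,
) -> float:
    """Assumed form: lambda1 * loss_task + (R + lambda2 * D)."""
    task_part = lambda1 * task_loss
    codec_part = rate + lambda2 * distortion
    return task_part + codec_part


distortion = mean_squared_error([1.0, 2.0], [1.0, 4.0])
loss = fusion_loss(rate=1.0, distortion=distortion, task_loss=4.0)
```

The default weights are the example values given in the text (λ1 = 0.3, λ2 = 0.7); the rate, distortion and task loss values are arbitrary inputs chosen to exercise the function.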
It is to be noted that the retraining method of the intelligent fusion network model in the embodiments of the present disclosure may also be performed step by step. For example, the value of λ2 may first be set to zero and the value of λ1 set to any value; in this case, the training is performed for the intelligent task network. Then the value of λ1 is set to zero and the value of λ2 is set to any value; in this case, the training is performed for the encoding and decoding network (including the encoding network and the decoding network). Finally, joint training is performed. The training method is not limited herein; other training methods, or even a combination of different training methods, may also be used.
The embodiments of the disclosure also provide a decoding method, which is applied to the decoder. The bitstream is parsed to determine reconstruction feature data, and feature analysis is performed on the reconstruction feature data by using an intelligent task network to determine a target result. In this way, not only can the picture information required by the intelligent task network be better learned, but the process in the related art of restoring the picture and then extracting feature data from the restored picture can also be omitted. The decoding network can thus execute the processing of the intelligent task network without restoring to the picture dimension, thereby greatly reducing the complexity of the intelligent task network and further improving its accuracy and speed.
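By way of illustration only, the step-by-step procedure may be expressed as a schedule of (λ1, λ2) pairs. This is a minimal sketch: the stage names, the value 1.0 used for "any value", and the assumed additive form of the loss are assumptions made for the example.

```python
from typing import List, NamedTuple


class Stage(NamedTuple):
    name: str
    lambda1: float  # weight of the intelligent task network loss
    lambda2: float  # weight of the codec distortion


def step_by_step_schedule(
    joint_lambda1: float = 0.3,
    joint_lambda2: float = 0.7,
    free_value: float = 1.0,  # stands in for "any value" in the text
) -> List[Stage]:
    """Three stages: task network first, then codec, then joint training."""
    return [
        Stage("task_network", free_value, 0.0),
        Stage("codec", 0.0, free_value),
        Stage("joint", joint_lambda1, joint_lambda2),
    ]


def stage_loss(stage: Stage, rate: float, distortion: float, task_loss: float) -> float:
    """Assumed additive loss: lambda1 * loss_task + R + lambda2 * D."""
    return stage.lambda1 * task_loss + rate + stage.lambda2 * distortion


schedule = step_by_step_schedule()
losses = [stage_loss(s, rate=1.0, distortion=2.0, task_loss=4.0) for s in schedule]
```

Under the assumed additive form, the rate term remains present in every stage; only the task loss and the distortion are switched on and off by the weights.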
In another embodiment of the present disclosure, in the encoder, the feature extraction for the input picture data may first be performed by an intelligent task network, and the extracted initial feature data is then input into the encoding network. That is, the feature extraction part of the intelligent task network is taken as the preprocessing flow of the encoding network, and the features extracted by the intelligent task network are used as the input of the encoding network, which may help the encoding and decoding network to better learn the picture information required by the intelligent task network. After the bitstream is obtained by encoding the initial feature data with the encoding network and is transmitted to the decoder, the reconstruction feature data may be obtained by parsing the bitstream and may be input to the intelligent task network for feature analysis. That is, the analysis and processing part of the intelligent task network is used as the subsequent processing flow of the decoding network, such that the decoding network may execute the analysis processing of the intelligent task network without restoring to the picture dimension, thereby greatly reducing the complexity of the intelligent task network.
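By way of illustration only, the overall flow (feature extraction, encoding into a bitstream, parsing, and feature analysis without restoring the picture) may be sketched as follows. Every function here is a hypothetical stand-in: row averaging replaces the feature extraction sub-network, uniform quantization with fixed-length packing replaces the encoding and decoding networks, and an arg-max replaces the feature analysis sub-network.

```python
import struct
from typing import List

QUANT_STEP = 0.5  # assumed quantization step for this sketch


def extract_features(picture: List[List[float]]) -> List[float]:
    """Encoder side: toy feature extraction (mean of each row)."""
    return [sum(row) / len(row) for row in picture]


def encode(features: List[float]) -> bytes:
    """Encoder side: quantize the features and write them into a bitstream."""
    symbols = [round(value / QUANT_STEP) for value in features]
    return struct.pack(f"<H{len(symbols)}h", len(symbols), *symbols)


def parse(bitstream: bytes) -> List[float]:
    """Decoder side: parse the bitstream into reconstruction feature data."""
    (count,) = struct.unpack_from("<H", bitstream, 0)
    symbols = struct.unpack_from(f"<{count}h", bitstream, 2)
    return [symbol * QUANT_STEP for symbol in symbols]


def analyze(features: List[float]) -> int:
    """Decoder side: toy feature analysis (index of the strongest feature)."""
    return max(range(len(features)), key=features.__getitem__)


picture = [[1.0, 2.0, 3.0], [7.0, 8.0, 9.0], [4.0, 5.0, 6.0]]
bitstream = encode(extract_features(picture))     # encoder
reconstruction_features = parse(bitstream)        # decoder: no picture is restored
target_result = analyze(reconstruction_features)  # decoder
```

The point of the sketch is the ordering of the stages: the picture never appears on the decoder side, and the analysis operates directly on the reconstruction feature data.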
Referring to
It is apparent from
In the embodiments of the present application, the encoding and decoding network may be divided into an encoding network and a decoding network. Specifically, the encoding network may use the feature extraction sub-network of the intelligent task network and some nodes of the end-to-end encoding network, and the feature extraction for the input picture data may be performed through the intelligent task network. The intelligent task network may no longer be executed after a certain feature node; instead, the end-to-end picture compression network is used directly for compression, starting from the corresponding encoding node with the same dimension. After decoding reaches the corresponding reconstruction node with the same dimension as the encoding node, the decoding network inputs the reconstruction feature data at the reconstruction node to the intelligent task network, and the subsequent processing flow of the intelligent task network is performed.
In addition, in the embodiments of the present disclosure, the encoding and decoding network and the intelligent task network used herein may be various commonly used end-to-end encoding and decoding networks and intelligent task networks, regardless of the specific network structure and type. For example, the encoding and decoding network itself can use various variants of neural network structures such as CNN, RNN and GAN. The task objective and network structure of the intelligent task network are not limited, and the task objective may be target detection, target tracking, behavior recognition, pattern recognition or another task objective related to picture processing.
For example, referring to
It is further to be noted that the encoding and decoding network in
Referring to
Based on the end-to-end encoding and decoding network illustrated in
In
That is, the intelligent fusion network model of the embodiment of the present disclosure is shown in
It is also to be noted that the number of the feature space channels and the resolution extracted at node d and node e corresponding to different feature layers as illustrated in
It is further to be noted that the encoding and decoding network described in the embodiments of the present disclosure may be, for example, traditional video encoding and decoding, intelligent end-to-end picture encoding and decoding, partially intelligent traditional video encoding and decoding, or end-to-end video encoding and decoding, which is not limited herein. In addition, the intelligent task network and the end-to-end encoding and decoding network provided by the embodiments of the present disclosure may also be replaced by other common network structures. For example, in the field of end-to-end encoding and decoding, the Lee network and the Duan network can be used for specific implementation. The Lee network adopts a transfer learning method to improve the quality of the picture reconstructed by the network, while the Duan network uses high-level semantic maps to enhance low-level visual features, and it has been verified that this method may effectively improve the rate-precision-distortion performance of picture compression. The compositional structure of the Lee network model is illustrated in
Accordingly, in the field of intelligent task networks, the target recognition network You Only Look Once version 3 (yolo_v3) may be used for specific implementation, and the compositional structure of the network model is illustrated in
As can be seen from the above, the feature space vector of the encoding and decoding network instead of the original picture is input into the intelligent task network, such that the process of picture restoration and extracting the restored picture feature may be saved, and the accuracy and speed of the intelligent task network may be better improved. Meanwhile, the feature extraction of intelligent task network is used as the input of the end-to-end picture encoding and decoding network, which may facilitate the encoding and decoding network to learn the picture information required by the intelligent task network better. In this way, in the embodiments of the disclosure, the feature extraction part of the intelligent task network is taken as the pre-processing flow of the end-to-end encoding network, and the analysis processing part of the intelligent task network is taken as the subsequent processing flow of the picture end-to-end decoding network, such that the decoding network can execute the processing of the intelligent task network without restoring to the picture dimension, thereby greatly reducing the complexity of the intelligent task network.
The present embodiment describes the specific implementation of the foregoing embodiments in detail. According to the technical solutions of the foregoing embodiments, it can be seen that not only the picture information required by the intelligent task network can be better learned, but also the complexity of the intelligent task network can be reduced, thereby improving the accuracy and speed of the intelligent task network.
In an embodiment of the present disclosure, on the basis of the same inventive concept of the foregoing embodiments, referring to
The first feature extraction unit 1401 is configured to perform the feature extraction for input picture data by using an intelligent task network to obtain initial feature data.
The coding unit 1402 is configured to perform the encoding process for the initial feature data by using an encoding network, and to signal the obtained encoded bits in a bitstream.
In some embodiments, the intelligent task network at least includes a feature extraction sub-network, and accordingly, the first feature extraction unit 1401 is specifically configured to perform feature extraction for the input picture data by using the feature extraction sub-network to obtain initial feature data at the first feature node.
In some embodiments, the feature extraction sub-network includes N feature extraction layers, where N is an integer greater than or equal to 1. Accordingly, the first feature extraction unit 1401 is further configured to: when N is equal to 1, perform feature extraction for the input picture data by using the feature extraction layer to obtain the initial feature data at the first feature node; and when N is greater than 1, perform feature extraction for the input picture data by using N feature extraction layers to obtain initial feature data at the first feature node.
In some embodiments, referring to
The coding unit 1402 is further configured to: when a data dimension of a first encoding node in the encoding network matches with a data dimension of the first feature node, determine the initial feature data at the first feature node as to-be-encoded feature data at the first encoding node; or when the data dimension of the first encoding node in the encoding network does not match with the data dimension of the first feature node, perform data dimension conversion for the initial feature data at the first feature node by using an adaptation network through the first dimension conversion unit 1403, to obtain to-be-encoded feature data at the first encoding node.
In some embodiments, the coding unit 1402 is specifically configured to input the to-be-encoded feature data to the first encoding node of the encoding network, perform the encoding process for the to-be-encoded feature data by using the encoding network, and signal obtained encoded bits in a bitstream.
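By way of illustration only, the choice between passing the initial feature data through unchanged and converting it with an adaptation network may be sketched as follows. The function names and the channel-repeating adapter are hypothetical; the adapter is only a stand-in for whatever adaptation network an implementation uses.

```python
from typing import Callable, List, Tuple

# Hypothetical layout: feature_map[channel][row][column]
FeatureMap = List[List[List[float]]]
Shape = Tuple[int, int, int]


def shape(fmap: FeatureMap) -> Shape:
    """Data dimension as (channels, height, width)."""
    return (len(fmap), len(fmap[0]), len(fmap[0][0]))


def repeat_channels(fmap: FeatureMap, target_channels: int) -> FeatureMap:
    """Toy adapter: repeat channels cyclically until the target count is reached."""
    return [[list(row) for row in fmap[i % len(fmap)]] for i in range(target_channels)]


def to_be_encoded_features(
    initial: FeatureMap,
    encoding_node_shape: Shape,
    adaptation_network: Callable[[FeatureMap], FeatureMap],
) -> FeatureMap:
    """Return the data to feed into the first encoding node."""
    if shape(initial) == encoding_node_shape:
        # Dimensions match: the initial feature data is used directly.
        return initial
    converted = adaptation_network(initial)
    if shape(converted) != encoding_node_shape:
        raise ValueError("adaptation network did not produce the expected dimension")
    return converted


initial_features = [[[1.0, 2.0]]]  # 1 channel, 1 x 2
same = to_be_encoded_features(initial_features, (1, 1, 2), lambda f: f)
adapted = to_be_encoded_features(
    initial_features, (2, 1, 2), lambda f: repeat_channels(f, 2)
)
```

The same check can be mirrored on the decoder side between the first reconstruction node and the first feature node.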
In some embodiments, the adaptation network includes a one-layer or multi-layer network structure.
In some embodiments, referring to
In some embodiments, the intelligent task network includes a feature extraction sub-network and a feature analysis sub-network. Referring to
The first feature extraction unit 1401 is further configured to perform feature extraction for the input picture data by using the feature extraction sub-network to obtain initial feature data.
The first feature analysis unit 1405 is configured to perform feature analysis for the initial feature data by using the feature analysis sub-network to determine the target result.
It is to be understood that in the embodiments of the present disclosure, the “unit” may be a part of a circuit, a part of a processor, a part of a program or software, etc.; it may also be modular or non-modular. Further, in the embodiments, the various composition units may be integrated in one processing unit, or each unit may exist physically alone, or two or more units may be integrated in one unit. The above-mentioned integrated unit may be implemented either in the form of hardware or in the form of a software function module.
When the integrated unit is implemented in the form of a software function module and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present embodiments in essence, or the part contributing to the related art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for enabling a computer device (which can be a personal computer, a server, a network device, etc.) or a processor to perform all or part of the steps of the method of the present embodiments. The aforementioned storage media include a USB flash disk, a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk and other media that can store program code.
Thus, the embodiments of the present disclosure provide a computer storage medium, applied to the encoder 140. The computer storage medium stores a computer program which, when executed by the first processor, implements the method of any one of the preceding embodiments.
Based on the above composition of the encoder 140 and the computer storage medium, referring to
The first communication interface 1501 is configured to receive and transmit the signal in the process of transmitting and receiving information with other external network elements.
The first memory 1502 is configured to store a computer program capable of running on the first processor 1503.
The first processor 1503 is configured to, when running the computer program:
It is to be understood that the first memory 1502 in the embodiments of the present disclosure may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory can be read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM) or flash memory. Volatile memory can be random access memory (RAM), which is used as an external cache. By way of exemplary illustration, but not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM) and direct Rambus RAM (DR RAM). The first memory 1502 of the systems and methods described herein is intended to include, but is not limited to, these and any other suitable types of memory.
The first processor 1503 may be an integrated circuit chip with signal processing capability. During implementation, the various steps of the above method may be completed by an integrated logic circuit of hardware in the first processor 1503 or by instructions in the form of software. The first processor 1503 may be a general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps and logic block diagrams disclosed in the embodiments of the present disclosure. The general-purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present disclosure can be directly embodied as being executed and completed by a hardware decoding processor, or by a combination of hardware and software modules in the decoding processor. The software module can be located in a storage medium that is mature in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory or a register. The storage medium is located in the first memory 1502, and the first processor 1503 reads the information in the first memory 1502 and completes the steps of the above method in combination with its hardware.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode or a combination thereof. For the hardware implementation, the processing unit may be implemented in one or more Application Specific Integrated Circuits (ASIC), Digital Signal Processors (DSP), Digital Signal Processing Devices (DSPD), Programmable Logic Devices (PLD), Field-Programmable Gate Arrays (FPGA), general purpose processors, controllers, microcontrollers, microprocessors, other electronic units for performing the functions described herein, or a combination thereof. For the software implementation, the techniques described herein may be implemented by modules (e.g. procedures, functions, etc.) that perform the functions described herein. The software code may be stored in the memory and executed by a processor. The memory may be implemented in the processor or outside the processor.
Alternatively, as another embodiment, the first processor 1503 is further configured to, when running the computer program, perform the method of any of the preceding embodiments.
The embodiments of the present disclosure provide an encoder. In the encoder, the features extracted by the intelligent task network are used as the input of the encoding network. In this way, not only can the picture information required by the intelligent task network be better learned, but the process in the related art of restoring the picture and then extracting feature data from the restored picture can also be omitted. The decoding network can thus execute the processing of the intelligent task network without restoring to the picture dimension, thereby greatly reducing the complexity of the intelligent task network and further improving its accuracy and speed.
In another embodiment of the present disclosure, on the basis of the same inventive concept of the foregoing embodiments, referring to
The parsing unit 1601 is configured to parse the bitstream to determine the reconstruction feature data.
The second feature analysis unit 1602 is configured to perform feature analysis for the reconstruction feature data by using the intelligent task network to determine the target result.
In some embodiments, referring to
The parsing unit 1601 is further configured to: parse the bitstream, and when a data dimension of a first feature node in the intelligent task network matches with a data dimension of the first reconstruction node, determine the candidate reconstruction feature data at the first reconstruction node as the reconstruction feature data; or parse the bitstream, and when the data dimension of the first feature node in the intelligent task network does not match with the data dimension of the first reconstruction node, perform data dimension conversion for the candidate reconstruction feature data at the first reconstruction node by using an adaptation network through the second dimension conversion unit 1603, to obtain the reconstruction feature data.
In some embodiments, the second feature analysis unit 1602 is specifically configured to input the reconstruction feature data to the first feature node in the intelligent task network, and perform feature analysis for the reconstruction feature data by using the intelligent task network to obtain the target result.
In some embodiments, the adaptation network includes a one-layer or multi-layer network structure.
In some embodiments, the intelligent task network includes a feature extraction sub-network and a feature analysis sub-network. Accordingly, the second feature analysis unit 1602 is specifically configured to: when the first feature node is a feature node obtained after passing through the feature extraction sub-network, input the reconstruction feature data to the first feature node, and perform feature analysis for the reconstruction feature data by using the feature analysis sub-network to obtain the target result.
In some embodiments, the feature analysis sub-network includes a region proposal network and a region of interest_heads. Accordingly, the second feature analysis unit 1602 is specifically configured to: process the reconstruction feature data by the region proposal network to obtain a target region; and perform intelligent analysis for the reconstruction feature data and the target region by the region of interest_heads to obtain the target result.
In some embodiments, the parsing unit 1601 is further configured to: parse the bitstream by using the decoding network to determine the reconstruction feature data.
In some embodiments, referring to
It is to be understood that in the embodiments, the “unit” may be a part of a circuit, a part of a processor, a part of a program or software, etc.; it may also be modular or non-modular. Further, in the embodiments, the various composition units may be integrated in one processing unit, or each unit may exist physically alone, or two or more units may be integrated in one unit. The above-mentioned integrated unit may be implemented either in the form of hardware or in the form of a software function module.
When the integrated unit is implemented in the form of a software function module and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such understanding, the embodiments of the present disclosure provide a computer storage medium, applied to the decoder 160. The computer storage medium stores a computer program which, when executed by the second processor, implements the method of any one of the preceding embodiments.
Based on the above composition of the decoder 160 and the computer storage medium, referring to
The second communication interface 1701 is configured to receive and transmit the signal in the process of transmitting and receiving information with other external network elements.
The second memory 1702 is configured to store a computer program capable of running on the second processor 1703.
The second processor 1703 is configured to, when running the computer program:
Alternatively, as another embodiment, the second processor 1703 is further configured to, when running the computer program, perform the method of any of the preceding embodiments.
It is to be understood that the second memory 1702 is similar in hardware function to the first memory 1502, and the second processor 1703 is similar in hardware function to the first processor 1503, and will not be elaborated here.
The present embodiment provides a decoder. The decoder may include a parsing unit and a feature analysis unit. In this way, not only can the picture information required by the intelligent task network be better learned, but the process in the related art of restoring the picture and then extracting feature data from the restored picture can also be omitted. The decoding network can thus execute the processing of the intelligent task network without restoring to the picture dimension, thereby greatly reducing the complexity of the intelligent task network and further improving its accuracy and speed.
In an embodiment of the present disclosure, referring to
In the embodiments of the present disclosure, the intelligent analysis system 180 includes an intelligent fusion network model, and the intelligent fusion network model may include an encoding network, a decoding network and an intelligent task network. The encoding network and part of the intelligent task network are used in the encoder 1801, and the decoding network and the other part of the intelligent task network are used in the decoder 1802. The features extracted by the intelligent task network are used as the input of the encoding network. In this way, not only can the picture information required by the intelligent task network be better learned, but the process in the related art of restoring the picture and then extracting feature data from the restored picture can also be omitted. The decoding network can thus execute the processing of the intelligent task network without restoring to the picture dimension, thereby greatly reducing the complexity of the intelligent task network and further improving its accuracy and speed.
It is to be noted that the terms “including”, “comprising” or any other variation thereof used herein are intended to encompass non-exclusive inclusion, so that a process, a method, an article or a device that includes a set of elements includes not only those elements but also other elements that are not explicitly listed, or elements inherent to such a process, method, article or device. In the absence of further limitations, an element defined by the phrase “includes an . . . ” does not exclude the existence of another identical element in the process, method, article or device in which the element is included.
The above serial numbers of the embodiments of the present disclosure are for description only and do not represent the advantages and disadvantages of the embodiments.
The methods disclosed in several embodiments of the methods provided in the disclosure can be arbitrarily combined as long as there is no conflict therebetween to obtain a new embodiment of a method.
The features disclosed in several embodiments of the product provided in the disclosure can be arbitrarily combined as long as there is no conflict therebetween to obtain a new embodiment of a product.
The features disclosed in several embodiments of methods or devices provided in the disclosure can be arbitrarily combined as long as there is no conflict therebetween to obtain a new embodiment of a method or a device.
The descriptions above are only specific embodiments of the present disclosure, and are not intended to limit the scope of protection of the embodiments of the present disclosure. Any change or replacement that can be readily conceived by those skilled in the art within the technical scope of the embodiments of the present disclosure shall fall within the protection scope of the present disclosure. Therefore, the scope of protection of the embodiments of the present disclosure shall be subject to the scope of protection of the claims.
In the embodiments of the present disclosure, at the encoder side, feature extraction is performed on input picture data by using an intelligent task network to obtain initial feature data, encoding processing is performed on the initial feature data by using an encoding network, and the obtained encoded bits are signalled in a bitstream. At the decoder side, the bitstream is parsed to determine reconstruction feature data, and feature analysis is performed on the reconstruction feature data by using an intelligent task network to determine a target result. In this way, the feature data extracted by the intelligent task network is used as the input of the encoding network. This not only allows the picture information required by the intelligent task network to be learned better, but also omits the process, used in the related art, of restoring the picture and then extracting feature data from the restored picture. The decoding network can therefore execute the processing of the intelligent task network without restoring to the picture dimension, which greatly reduces the complexity of the intelligent task network and further improves its accuracy and speed.
This is a continuation application of International Patent Application No. PCT/CN2021/122480, filed on Sep. 30, 2021, the content of which is hereby incorporated by reference in its entirety.
| | Number | Date | Country |
|---|---|---|---|
| Parent | PCT/CN2021/122480 | Sep 2021 | WO |
| Child | 18618752 | | US |