The present disclosure relates to an image decoding method, an image encoding method, an image decoding device, and an image encoding device.
A neural network is a series of algorithms that attempts to recognize underlying relationships in a dataset through a process that imitates the way the human brain works. In this sense, a neural network refers to a system of neurons, whether organic or artificial. Different types of neural networks in deep learning, for example, the convolutional neural network (CNN), the recurrent neural network (RNN), and the artificial neural network (ANN), will change the way we interact with the world. These different types of neural networks are at the core of the deep learning revolution and of powerful applications such as unmanned aerial vehicles, autonomous vehicles, and speech recognition. The CNN, which includes a plurality of stacked layers, is the class of deep neural network most commonly applied to the analysis of visual images.
A feature image is a unique representation indicating a feature of an image or of an object included in the image. For example, in a convolutional layer of a neural network, a feature image is obtained as the output of applying a desired filter to the entire image. A plurality of feature images is obtained by applying a plurality of filters in a plurality of convolutional layers, and a feature map can be created by arranging the plurality of feature images.
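As a purely illustrative sketch of this arrangement step (not taken from the disclosure; the function name, the use of NumPy, and the square grid layout are assumptions), the feature images output by one convolutional layer could be tiled into a single two-dimensional feature map as follows.

```python
import math
import numpy as np

def tile_feature_images(feature_images: np.ndarray) -> np.ndarray:
    """Arrange C feature images of shape (H, W) into one 2-D feature map."""
    c, h, w = feature_images.shape
    cols = math.ceil(math.sqrt(c))        # tiles per row
    rows = math.ceil(c / cols)            # tile rows
    feature_map = np.zeros((rows * h, cols * w), dtype=feature_images.dtype)
    for i in range(c):
        r, col = divmod(i, cols)
        feature_map[r * h:(r + 1) * h, col * w:(col + 1) * w] = feature_images[i]
    return feature_map

# Example: 108 feature images of 16x16 tiled into a 10x11 grid (the last two tiles stay zero)
feature_map = tile_feature_images(np.random.rand(108, 16, 16).astype(np.float32))
print(feature_map.shape)  # (160, 176)
```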
The feature map is typically associated with a task processing device that executes a task process such as a neural network task. This setup usually enables the best inference result for a particular machine analysis task.
When the decoder side uses the feature map created by the encoder side, the encoder encodes the created feature map to transmit a bitstream including encoded data on the feature map to the decoder. The decoder decodes the feature map on the basis of the received bitstream. The decoder inputs the decoded feature map into a task processing device that executes the prescribed task process such as the neural network task.
According to the background art, when a plurality of task processing devices on the decoder side executes a plurality of neural network tasks by using a plurality of feature maps, it is necessary to install a plurality of sets of encoders and decoders corresponding to each of the plurality of task processing devices, complicating the system configuration.
Note that the image encoding system architecture according to the background art is disclosed, for example, in Patent Literatures 1 and 2.
Patent Literature 1: US Patent Publication No. 2010/0046635
Patent Literature 2: US Patent Publication No. 2021/0027470
An object of the present disclosure is to simplify the system configuration.
An image decoding method according to one aspect of the present disclosure includes, by an image decoding device: receiving, from an image encoding device, a bitstream including encoded data of a plurality of feature maps for an image; decoding the plurality of feature maps using the bitstream; selecting a first feature map from the plurality of decoded feature maps and outputting the first feature map to a first task processing device that executes a first task process based on the first feature map; and selecting a second feature map from the plurality of decoded feature maps and outputting the second feature map to a second task processing device that executes a second task process based on the second feature map.
FIG. 21 is a block diagram showing a configuration of a decoding device according to the second embodiment of the present disclosure.
For example, the encoding device 1101A creates a feature map A on the basis of the input image or feature, and encodes the created feature map A, thereby transmitting a bitstream including encoded data on the feature map A to the decoding device 1102A. The decoding device 1102A decodes the feature map A on the basis of the received bitstream, and inputs the decoded feature map A into the task processing unit 1103A. The task processing unit 1103A executes the prescribed task process by using the input feature map A, thereby outputting the estimation result.
The problem of the background art shown in
To solve such a problem, the present inventor introduces a new method in which an image encoding device transmits a plurality of feature maps included in the same bitstream to an image decoding device, and the image decoding device selects a desired feature map from the plurality of decoded feature maps and inputs the selected feature map into each of the plurality of task processing devices. This eliminates the need to install a plurality of sets of image encoding devices and image decoding devices corresponding to the plurality of task processing devices, respectively, and can simplify the system configuration because one set of image encoding device and image decoding device is sufficient.
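The following minimal sketch illustrates that decoder-side dispatch under assumed names (dispatch_feature_maps, decode_all_feature_maps, detect_objects, and segment_objects are placeholders, not interfaces defined by the disclosure): all feature maps are decoded from one bitstream, and each task processing device receives only the feature map selected for it.

```python
from typing import Any, Callable, Dict, List, Tuple

def dispatch_feature_maps(
    decoded_maps: Dict[str, Any],
    tasks: List[Tuple[str, str, Callable[[Any], Any]]],
) -> Dict[str, Any]:
    """decoded_maps: every feature map decoded from the single received bitstream.
    tasks: (task name, key of the desired feature map, task function) per device."""
    results = {}
    for task_name, map_key, task_fn in tasks:
        selected = decoded_maps[map_key]        # select the desired feature map
        results[task_name] = task_fn(selected)  # run this device's task process
    return results

# Usage idea: one bitstream -> feature maps A..N -> two task processing devices
# results = dispatch_feature_maps(
#     decode_all_feature_maps(bitstream),
#     [("detection", "A", detect_objects), ("segmentation", "B", segment_objects)])
```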
Next, each aspect of the present disclosure will be described.
An image decoding method according to one aspect of the present disclosure includes, by an image decoding device: receiving, from an image encoding device, a bitstream including encoded data of a plurality of feature maps for an image; decoding the plurality of feature maps using the bitstream; selecting a first feature map from the plurality of decoded feature maps and outputting the first feature map to a first task processing device that executes a first task process based on the first feature map; and selecting a second feature map from the plurality of decoded feature maps and outputting the second feature map to a second task processing device that executes a second task process based on the second feature map.
According to the present aspect, the image decoding device selects the first feature map from the plurality of decoded feature maps and outputs the first feature map to the first task processing device, and selects the second feature map from the plurality of decoded feature maps and outputs the second feature map to the second task processing device. This eliminates the need to install a plurality of sets of image encoding devices and image decoding devices corresponding to each of the plurality of task processing devices, simplifying the system configuration.
In the above-described aspect, the image decoding device selects the first feature map and the second feature map based on index information of each of the plurality of feature maps.
According to the present aspect, using the index information allows the selection of the feature map to be executed appropriately.
In the above-described aspect, the image decoding device selects the first feature map and the second feature map based on size information of each of the plurality of feature maps.
According to the present aspect, using the size information allows the selection of the feature map to be executed simply.
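A minimal sketch of both selection criteria follows, assuming each decoded feature map is represented as a dictionary carrying signalled "index", "height", and "width" fields; these field names are illustrative, not defined by the disclosure.

```python
def select_by_index(feature_maps, wanted_index):
    """Pick the feature map whose signalled index matches the task's requirement."""
    return next(fm for fm in feature_maps if fm["index"] == wanted_index)

def select_by_size(feature_maps, wanted_height, wanted_width):
    """Pick the feature map whose signalled size matches the task's expected input size."""
    return next(fm for fm in feature_maps
                if fm["height"] == wanted_height and fm["width"] == wanted_width)
```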
In the above-described aspect, the image decoding device decodes the second feature map by inter prediction using the first feature map.
According to the present aspect, using inter prediction for decoding the feature map allows reduction in the encoding amount.
In the above-described aspect, the image decoding device decodes the first feature map and the second feature map by intra prediction.
According to the present aspect, using intra prediction for decoding the feature map allows the plurality of feature maps to be decoded independently of each other.
In the above-described aspect, each of the plurality of feature maps includes a plurality of feature images for the image.
According to the present aspect, since the task processing device can execute the task process by using the plurality of feature images included in each feature map, accuracy of the task process can be improved.
In the above-described aspect, the image decoding device constructs each of the plurality of feature maps by decoding the plurality of feature images and arranging the plurality of decoded feature images in a prescribed scan order.
According to the present aspect, the feature map can be appropriately constructed by arranging the plurality of feature images in the prescribed scan order.
In the above-described aspect, each of the plurality of feature maps includes a plurality of segments, each of the plurality of segments includes the plurality of feature images, the image decoding device constructs each of the plurality of segments by arranging the plurality of decoded feature images in the prescribed scan order, and constructs each of the plurality of feature maps by arranging the plurality of segments in a prescribed order.
According to the present aspect, it is possible to control the process of dividing the stream on a segment-by-segment basis or the decoding process on a segment-by-segment basis, and flexible system configurations can be implemented.
In the above-described aspect, the image decoding device switches, based on a size of each of the plurality of decoded feature images, between ascending order and descending order for the prescribed scan order.
According to the present aspect, switching between ascending order and descending order for the scan order based on the size of each feature image makes it possible to construct the feature map appropriately.
In the above-described aspect, the bitstream includes order information which sets one of ascending order or descending order for the prescribed scan order, and the image decoding device switches, based on the order information, between ascending order and descending order for the prescribed scan order.
According to the present aspect, switching between ascending order and descending order for the scan order based on the order information makes it possible to construct the feature map appropriately.
In the above-described aspect, the plurality of feature images includes a plurality of types of feature images of different sizes, and the image decoding device decodes the plurality of feature images with a constant decoding block size corresponding to the smallest size of the plurality of sizes of the plurality of types of feature images.
According to the present aspect, by decoding the plurality of feature images with a constant decoding block size, the device configuration of the image decoding device can be simplified.
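The following sketch, under the assumption of rectangular feature images described by (height, width) pairs, derives one constant decoding block size from the smallest type of feature image and splits every feature image into blocks of that size; the function names are illustrative only, not part of the disclosure.

```python
def constant_block_size(feature_image_sizes):
    """feature_image_sizes: list of (height, width) for each type of feature image."""
    return min(min(h, w) for h, w in feature_image_sizes)

def split_into_blocks(height, width, block):
    """Yield the top-left corner of every decoding block of one feature image."""
    for y in range(0, height, block):
        for x in range(0, width, block):
            yield y, x

# Example: feature images of 64x64, 32x32, and 16x16 -> constant block size 16
blk = constant_block_size([(64, 64), (32, 32), (16, 16)])
blocks_64 = list(split_into_blocks(64, 64, blk))  # 16 decoding blocks for the largest image
```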
In the above-described aspect, the plurality of feature images includes a plurality of types of feature images of different sizes, and the image decoding device decodes the plurality of feature images with a plurality of decoding block sizes corresponding to the plurality of sizes of the plurality of types of feature images.
According to the present aspect, by decoding each feature image with a decoding block size corresponding to the size of each feature image, the number of headers required for each decoding block can be reduced, and encoding in a large area is possible, improving compression efficiency.

In the above-described aspect, the prescribed scan order is raster scan order.
According to the present aspect, using the raster scan order enables fast processing by GPU or the like.
In the above-described aspect, the prescribed scan order is Z scan order.
According to the present aspect, using the Z scan order enables support for general video codecs.
In the above-described aspect, the bitstream includes encoded data on the image, the image decoding device decodes the image using the bitstream, and executes the decoding of the plurality of feature maps and the decoding of the image using a common decoding processing unit.
According to the present aspect, by executing the decoding of the feature maps and the decoding of the image by using a common decoding processing unit, the device configuration of the image decoding device can be simplified.
In the above-described aspect, the first task process and the second task process include at least one of object detection, object segmentation, object tracking, action recognition, pose estimation, pose tracking, and hybrid vision.
According to the present aspect, accuracy of each of these processes can be improved.
An image encoding method according to one aspect of the present disclosure includes, by an image encoding device: encoding a first feature map for an image; encoding a second feature map for the image; generating a bitstream including encoded data of the first feature map and the second feature map; and transmitting the generated bitstream to an image decoding device.
According to the present aspect, the image encoding device transmits the bitstream including the encoded data of the first feature map and the second feature map to the image decoding device. This eliminates the need to install a plurality of sets of image encoding devices and image decoding devices corresponding to each of the plurality of task processing devices installed on the image decoding device side, simplifying the system configuration.
An image decoding device according to one aspect of the present disclosure is configured to: receive, from an image encoding device, a bitstream including encoded data of a plurality of feature maps for an image; decode the plurality of feature maps using the bitstream; select a first feature map from the plurality of decoded feature maps and output the first feature map to a first task processing device that executes a first task process based on the first feature map; and select a second feature map from the plurality of decoded feature maps and output the second feature map to a second task processing device that executes a second task process based on the second feature map.
According to the present aspect, the image decoding device selects the first feature map from the plurality of decoded feature maps and outputs the first feature map to the first task processing device, and selects the second feature map from the plurality of decoded feature maps and outputs the second feature map to the second task processing device. This eliminates the need to install a plurality of sets of image encoding devices and image decoding devices corresponding to each of the plurality of task processing devices, simplifying the system configuration.
An image encoding device according to one aspect of the present disclosure is configured to: encode a first feature map for an image; encode a second feature map for the image; generate a bitstream including encoded data of the first feature map and the second feature map; and transmit the generated bitstream to an image decoding device.
According to the present aspect, the image encoding device transmits the bitstream including the encoded data of the first feature map and the second feature map to the image decoding device. This eliminates the need to install a plurality of sets of image encoding devices and image decoding devices corresponding to each of the plurality of task processing devices installed on the image decoding device side, simplifying the system configuration.
Embodiments of the present disclosure will be described in detail below with reference to the drawings. Note that elements denoted by the same reference numerals in different drawings represent the same or corresponding elements.
Note that each of the embodiments described below shows one specific example of the present disclosure. Numerical values, shapes, components, steps, the order of steps, and the like shown in the following embodiments are merely examples, and are not intended to limit the present disclosure. Among the components in the embodiments below, a component that is not described in an independent claim representing the highest concept is described as an optional component. The contents of the respective embodiments can be combined with one another.
The encoding device 1201 creates a plurality of feature maps A to N on the basis of an input image or features. The encoding device 1201 encodes the created feature maps A to N to generate a bitstream including encoded data on the feature maps A to N. The encoding device 1201 transmits the generated bitstream to the decoding device 1202. The decoding device 1202 decodes the feature maps A to N on the basis of the received bitstream. The decoding device 1202 selects the feature map A as a first feature map from among the decoded feature maps A to N, and inputs the selected feature map A into the task processing unit 1203A as the first task processing device. The decoding device 1202 selects the feature map B as the second feature map from among the decoded feature maps A to N, and inputs the selected feature map B into the task processing unit 1203B as the second task processing device. The task processing unit 1203A executes a first task process such as the neural network task on the basis of the input feature map A, and outputs the estimation result. The task processing unit 1203B executes a second task process such as the neural network task on the basis of the input feature map B, and outputs the estimation result.
Image data from a camera 1301 is input into the image encoding unit 1305 and the feature extraction unit 1302. The image encoding unit 1305 encodes the input image and inputs the encoded data into the transmission unit 1306. Note that the image encoding unit 1305 may use a general video codec or still image codec as it is. The feature extraction unit 1302 extracts a plurality of feature images representing the features of the image from the input image, and inputs the plurality of extracted feature images into the feature transformation unit 1303. The feature transformation unit 1303 generates a feature map by arranging the plurality of feature images. The feature transformation unit 1303 generates a plurality of feature maps for one input image, and inputs the plurality of generated feature maps into the feature encoding unit 1304. The feature encoding unit 1304 encodes the plurality of input feature maps and inputs the encoded data into the transmission unit 1306. The transmission unit 1306 generates a bitstream including the encoded data on the input image and the encoded data on the plurality of feature maps, and transmits the generated bitstream to the decoding device 1202.
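As a rough sketch of how the transmission unit might multiplex the two kinds of encoded data into a single bitstream, the following uses a simple length-prefixed payload layout; this framing is an assumption made for illustration and is not the bitstream syntax defined by the disclosure.

```python
import struct

def build_bitstream(encoded_image: bytes, encoded_feature_maps: list) -> bytes:
    """Concatenate length-prefixed payloads: the encoded image first, then each feature map."""
    stream = bytearray()
    for payload in [encoded_image, *encoded_feature_maps]:
        stream += struct.pack(">I", len(payload))  # 4-byte big-endian length header
        stream += payload
    return bytes(stream)

def parse_bitstream(stream: bytes) -> list:
    """Recover the payloads in order (image data, then the feature map data)."""
    payloads, pos = [], 0
    while pos < len(stream):
        (length,) = struct.unpack_from(">I", stream, pos)
        payloads.append(stream[pos + 4:pos + 4 + length])
        pos += 4 + length
    return payloads
```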
The reception unit 1309 receives the bitstream transmitted from the encoding device 1201, and inputs the received bitstream into the image decoding unit 1308 and the feature decoding unit 1307. The image decoding unit 1308 decodes the image on the basis of the input bitstream. The feature decoding unit 1307 decodes the plurality of feature maps on the basis of the input bitstream. Note that the example shown in FIG. S has a configuration in which both the image and the feature maps are encoded and decoded. However, if image display for human vision is not necessary, a configuration in which only the feature maps are encoded and decoded may be adopted. In that case, a configuration in which the image encoding unit 1305 and the image decoding unit 1308 are omitted may be adopted.
The feature transformation unit 1303 generates a plurality of feature maps for one input image, and inputs the plurality of generated feature maps into the image encoding unit 1305. The image encoding unit 1305 encodes the input image and the plurality of feature maps, and inputs the encoded data on the input image and the plurality of feature maps into the transmission unit 1306. The transmission unit 1306 generates a bitstream including the encoded data on the input image and the plurality of feature maps, and transmits the generated bitstream to the decoding device 1202.
The reception unit 1309 receives the bitstream transmitted from the encoding device 1201, and inputs the received bitstream into the image decoding unit 1308. The image decoding unit 1308 decodes the image and the plurality of feature maps on the basis of the input bitstream. That is, in the configuration shown in
As shown in
In step S2001 of
More specifically, the encoding device 1201 encodes the plurality of feature maps about the input image. Each feature map indicates a unique attribute about the image, and each feature map is, for example, arithmetically encoded. Arithmetic encoding is, for example, context adaptive binary arithmetic coding (CABAC).
For example, the plurality of feature images F1 to F108 is arranged according to the hierarchical order of the neural network. That is, the arrangement is made in ascending order (order of size from smallest) or descending order (order of size from largest) of the hierarchy of the neural network.
With reference to
Note that the selection unit 2403 may select the feature maps A to N on the basis of a combination of the index information IA to IN and the size information SA to SN.
In step S2002 of
The task processing unit 2404A outputs a signal indicating execution results of the neural network task. The signal may include at least one of the number of detected objects, the confidence level of the detected objects, boundary information or location information on the detected objects, and the classification category of the detected objects.
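As an illustration only, such an output signal could be organized as a structure like the following; the field names and types are assumptions, not a format defined by the disclosure.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class DetectionResult:
    num_objects: int                        # number of detected objects
    confidences: List[float]                # confidence level of each detected object
    boxes: List[Tuple[int, int, int, int]]  # boundary/location information (x, y, w, h)
    categories: List[str]                   # classification category of each detected object
```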
In step S2003 of
Note that the configuration shown in
As shown in
The bitstream including encoded data on the plurality of feature maps A to N is input into the decoding device 1202. The decoding device 1202 decodes the image from the input bitstream as necessary, and outputs an image signal for human vision to a display device. The decoding device 1202 decodes the plurality of feature maps A to N from the input bitstream and inputs the decoded feature maps A to N into the selection unit 1400. The plurality of feature maps A to N of the same time instance can be decoded independently of each other; one example of independent decoding is intra prediction. The plurality of feature maps A to N of the same time instance can also be decoded in a correlated manner; one example of correlated decoding is inter prediction, in which the second feature map is decoded by inter prediction using the first feature map. The selection unit 1400 selects a desired feature map from among the plurality of decoded feature maps A to N, and inputs the selected feature map into each of the task processing units 1203A to 1203N.
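The following is a deliberately simplified sketch of the two options: an intra-coded feature map is reconstructed without referring to any other map, while an inter-coded second map is reconstructed from the already-decoded first map plus a residual. Real codecs would use block-based prediction and motion compensation; treating the whole first map as the predictor, and assuming both maps have the same shape, are simplifying assumptions made only for illustration.

```python
import numpy as np

def encode_inter_residual(ref_map: np.ndarray, cur_map: np.ndarray) -> np.ndarray:
    """Encoder side: residual of the current feature map against its reference."""
    return cur_map - ref_map

def decode_inter(ref_map: np.ndarray, residual: np.ndarray) -> np.ndarray:
    """Decoder side: reconstruct the current feature map (e.g. map B) from the
    already-decoded reference feature map (e.g. map A) plus the decoded residual."""
    return ref_map + residual
```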
Note that the selection unit 1400 may select the feature maps A to N on the basis of a combination of the index information IA to IN and the size information SA to SN.
In step S1002 of
The task processing unit 1203A outputs a signal indicating execution results of the neural network task. The signal may include at least one of the number of detected objects, the confidence level of the detected objects, boundary information or location information on the detected objects, and the classification category of the detected objects.
In step S1003 of
According to the present embodiment, the encoding device 1201 transmits the bitstream including encoded data on the first feature map A and the second feature map B to the decoding device 1202. The decoding device 1202 selects the first feature map A from the plurality of decoded feature maps A to N and outputs the first feature map A to the first task processing unit 1203A, and selects the second feature map B from the plurality of decoded feature maps A to N and outputs the second feature map B to the second task processing unit 1203B. This eliminates the need to install a plurality of sets of encoding devices and decoding devices corresponding to each of the plurality of task processing units 1203A to 1203N, simplifying the system configuration.
Since video codecs generally have limited memory capacity, images are often encoded in Z scan order. However, when constructing a system using a GPU with a large memory capacity, faster processing is possible if images or features are input in raster scan order rather than Z scan order and sequentially loaded into the memory of the GPU. Therefore, the present embodiment describes a system that, in the process of constructing a feature map by arranging a plurality of feature images in the prescribed scan order, can switch between the general Z scan order and the fast raster scan order. The present embodiment is applicable to an image processing system including at least one task processing unit.
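A sketch of the two orders follows, assuming a grid of equally sized blocks: raster order is plain row-major order, while Z scan (Morton) order interleaves the bits of the block coordinates. The function names are illustrative and are not part of the disclosure.

```python
def z_order_index(x: int, y: int, bits: int = 16) -> int:
    """Morton (Z scan) index of block (x, y), obtained by interleaving coordinate bits."""
    idx = 0
    for b in range(bits):
        idx |= ((x >> b) & 1) << (2 * b)      # x bits go to even positions
        idx |= ((y >> b) & 1) << (2 * b + 1)  # y bits go to odd positions
    return idx

def block_scan_order(width_blocks: int, height_blocks: int, use_z_order: bool):
    """Return block coordinates in raster order or, if requested, in Z scan order."""
    coords = [(x, y) for y in range(height_blocks) for x in range(width_blocks)]
    if use_z_order:
        coords.sort(key=lambda p: z_order_index(p[0], p[1]))
    return coords

print(block_scan_order(4, 2, False))  # raster: (0,0) (1,0) (2,0) (3,0) (0,1) (1,1) (2,1) (3,1)
print(block_scan_order(4, 2, True))   # Z scan: (0,0) (1,0) (0,1) (1,1) (2,0) (3,0) (2,1) (3,1)
```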
The encoding device 2101 creates a feature map on the basis of an input image or features. The encoding device 2101 encodes the created feature map to generate a bitstream including encoded data on the feature map. The encoding device 2101 transmits the generated bitstream to the decoding device 2102. The decoding device 2102 decodes the feature map on the basis of the received bitstream. The decoding device 2102 inputs the decoded feature map into the task processing unit 2103. The task processing unit 2103 executes the prescribed task process such as the neural network task on the basis of the input feature map, and outputs the estimation result.
As shown in
The feature map is input into the scan order setting unit 3201. As shown in
In step S4001 of
The scanning unit 3202 divides the feature map into a plurality of segments in the scan order set by the scan order setting unit 3201, and divides each segment into a plurality of feature images.
Note that in the example shown in
The scanning unit 3202 sequentially inputs the plurality of divided feature images into the entropy encoding unit 3203. The entropy encoding unit 3203 generates the bitstream by encoding each feature image with the encoding block size and arithmetically encoding the result. Arithmetic encoding is, for example, context adaptive binary arithmetic coding (CABAC). The encoding device 2101 transmits the bitstream generated by the entropy encoding unit 3203 to the decoding device 2102.
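A rough sketch of this flow is shown below under several assumptions: the feature map is a two-dimensional array, segments and encoding blocks are square tiles scanned in raster order, and entropy_encode merely serializes the samples as a stand-in for a real arithmetic coder such as CABAC, which is not implemented here.

```python
import numpy as np

def entropy_encode(block: np.ndarray) -> bytes:
    # Placeholder for CABAC or another arithmetic coder.
    return block.astype(np.uint8).tobytes()

def encode_feature_map(feature_map: np.ndarray, segment_size: int, block_size: int) -> bytes:
    """Scan the feature map segment by segment, split each segment into encoding blocks
    in raster scan order, and entropy-encode each block."""
    h, w = feature_map.shape
    stream = bytearray()
    for sy in range(0, h, segment_size):          # segments in the prescribed order
        for sx in range(0, w, segment_size):
            for by in range(sy, min(sy + segment_size, h), block_size):
                for bx in range(sx, min(sx + segment_size, w), block_size):
                    block = feature_map[by:by + block_size, bx:bx + block_size]
                    stream += entropy_encode(block)
    return bytes(stream)
```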
As shown in
Furthermore, the encoding device 2101 may be configured to reconstruct the divided feature map, input the reconstructed feature map into the task processing unit 3205, and output the estimation result by the task processing unit 3205 executing the neural network task.
In step S4002 of
For example, the plurality of feature images is arranged according to the hierarchical order of the neural network. That is, the arrangement is made in ascending order (order of size from smallest) or descending order (order of size from largest) of the hierarchy of the neural network.
The scan order setting unit 3201 sets ascending order or descending order of the scan order on the basis of the size of each of the plurality of input feature images. The reconstruction unit 3204 switches between ascending order and descending order according to the scan order set by the scan order setting unit 3201. For example, the reconstruction unit 3204 switches to ascending order when the plurality of feature images is input in order of size from smallest, and switches to descending order when the plurality of feature images is input in order of size from largest.
Alternatively, order information for setting ascending order or descending order of the prescribed scan order may be added to the bitstream header or the like, and the reconstruction unit 3204 may switch between ascending order and descending order of the scan order on the basis of the order information. The reconstruction unit 3204 inputs, into the task processing unit 3205, the feature map reconstructed by arranging the plurality of feature images in the prescribed scan order.
In step S4003 of
The task processing unit 3205 outputs a signal indicating execution results of the neural network task. The signal may include at least one of the number of detected objects, the confidence level of the detected objects, boundary information or location information on the detected objects, and the classification category of the detected objects.
Note that the configuration shown in
As shown in
In step S3001 of
As shown in
A plurality of decoding blocks or a plurality of feature images is input into the scan order setting unit 2202 from the entropy decoding unit 2201.
In step S3002 of
The plurality of feature images divided into a plurality of segments is input into the scanning unit 2203. The scanning unit 2203 constructs the feature map by arranging the plurality of feature images in the scan order set by the scan order setting unit 2202.
For example, the plurality of feature images is arranged according to the hierarchical order of the neural network. That is, the arrangement is made in ascending order (order of size from smallest) or descending order (order of size from largest) of the hierarchy of the neural network.
The scan order setting unit 2202 sets ascending order or descending order of the scan order on the basis of the size of each of the plurality of input feature images. The scanning unit 2203 switches between ascending order and descending order according to the scan order set by the scan order setting unit 2202. For example, the scanning unit 2203 switches to ascending order when the plurality of feature images is input in order of size from smallest, and switches to descending order when the plurality of feature images is input in order of size from largest. Alternatively, the order information for setting ascending order or descending order of the prescribed scan order may be decoded from the bitstream header or the like, and the scanning unit 2203 may switch between ascending order and descending order of the scan order on the basis of the order information. The scanning unit 2203 inputs, into the task processing unit 2103, the feature map constructed by arranging the plurality of feature images in the prescribed scan order.
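A minimal sketch of this switching logic follows, with assumed names (the header key "scan_order" and the list of feature image sizes are illustrative): the direction comes either from explicit order information decoded from the bitstream header or, failing that, from the sizes of the incoming feature images themselves.

```python
def choose_scan_direction(header: dict, feature_image_sizes: list) -> str:
    """Return "ascending" or "descending" for the prescribed scan order."""
    if "scan_order" in header:              # explicit order information was signalled
        return header["scan_order"]
    # Otherwise infer it from the input order: growing sizes -> ascending,
    # shrinking sizes -> descending.
    return "ascending" if feature_image_sizes[0] <= feature_image_sizes[-1] else "descending"

# Usage idea (hypothetical values):
# direction = choose_scan_direction({"scan_order": "descending"}, [64, 32, 16])
```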
Note that in the example shown in
In step S3003 of
The task processing unit 2103 outputs a signal indicating execution results of the neural network task. The signal may include at least one of the number of detected objects, the confidence level of the detected objects, boundary information or location information on the detected objects, and the classification category of the detected objects.
According to the present embodiment, the feature map can be appropriately constructed by arranging the plurality of feature images in the prescribed scan order.
The present disclosure is particularly useful for application to an image processing system including an encoder that transmits images and a decoder that receives images.
Number | Date | Country
---|---|---
63178788 | Apr 2021 | US
63178751 | Apr 2021 | US

Number | Date | Country
---|---|---
Parent PCT/JP2022/018475 | Apr 2022 | US
Child 18380253 | | US