The present invention relates to an event-driven AI inference system that performs inference on images captured by a camera only while an event is occurring.
Applying artificial intelligence (AI) inference to moving images acquired by a camera is under consideration. It is known that the amount of computation and memory required for AI inference increases as inference accuracy increases. Therefore, a method is generally used in which an encoded moving image is transmitted from a camera or the like and inference is performed on the decoded image in a server or the like.
However, there are few use cases in which inference is required for every frame of a moving image acquired by a camera. For example, a security camera needs to perform inference only when a person appears in the frame; performing inference when no person is present is wasteful. Therefore, event-driven inference, in which inference is performed only while an event is occurring, has been proposed (NPL 1).
In event-driven inference, lightweight, low-cost inference is applied to each decoded frame, and it is determined whether highly accurate inference is required. To decide whether to execute highly accurate inference, it is common to use the confidence of the predicted class (hereinafter simply "confidence"); if the confidence is high, highly accurate inference is executed. The confidence is a value indicating the degree to which a neural network (NN) model judges its own output to be correct (NPL 2).
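The confidence-gated two-layer flow described above can be sketched as follows. This is a minimal illustrative sketch: the threshold value, the toy "models", and the frame representation are assumptions, not taken from the original description.

```python
CONFIDENCE_THRESHOLD = 0.5  # assumed threshold for triggering accurate inference

def light_inference(frame):
    # Toy stand-in for a lightweight first-layer model: "detects" a person
    # when the frame's mean brightness is high (purely illustrative logic).
    score = sum(frame) / (255 * len(frame))
    return "person", score

def accurate_inference(frame):
    # Stand-in for the high-accuracy, high-cost second-layer model.
    return {"class": "person", "refined": True}

def process_frame(frame):
    _, confidence = light_inference(frame)
    # Highly accurate inference runs only when the first-layer confidence
    # exceeds the threshold; otherwise the costly second layer is skipped.
    if confidence > CONFIDENCE_THRESHOLD:
        return accurate_inference(frame)
    return None

bright = [200] * 8  # frame yielding high first-layer confidence
dark = [10] * 8     # frame yielding low first-layer confidence
```

In this sketch the second layer is invoked for `bright` but skipped for `dark`, mirroring the gating decision of step two of the event-driven scheme.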
For example, in the case of the above-mentioned security camera, frames of an image acquired by a camera 100 and encoded by an encoder circuit 101 are decoded by a decoder circuit 102 on a server side, for example, as shown in
Event-driven multi-layer inference is a technique that reduces the time required for executing highly accurate inference, and hence the total cost, by performing the first-layer inference with a low-cost model. Depending on the result of the first-layer inference, the second-layer inference may be skipped; however, stream delivery from the camera is always running, so the cost of encoding/decoding cannot be reduced.
Embodiments of the present invention have been made to solve the above problem, and an object thereof is to provide an AI inference system capable of reducing power consumption as compared with the conventional system.
An AI inference system according to embodiments of the present invention includes: an encoder circuit configured to encode an image captured by a camera; a decoder circuit configured to decode the image encoded by the encoder circuit; a first inference processing unit configured to execute first inference processing on the image decoded by the decoder circuit for each frame; a second inference processing unit configured to receive a confidence value of the first inference from the first inference processing unit and to execute second inference processing with higher accuracy than the first inference only for frames of an image whose confidence value exceeds a predetermined threshold value; and a frame rate control unit configured to reduce a frame rate of each of the encoder circuit and the decoder circuit from an initial value to a low speed value when the confidence value is equal to or less than the threshold value.
Further, in one configuration example of the AI inference system according to embodiments of the present invention, the encoder circuit and the decoder circuit are provided for each camera, and the frame rate control unit is configured to reduce the frame rate of each of the encoder circuit and the decoder circuit corresponding to a camera that outputs an image whose confidence value is equal to or less than the threshold value among a plurality of cameras from an initial value to a low speed value.
Further, in one configuration example of the AI inference system according to embodiments of the present invention, the encoder circuit and the decoder circuit are provided for each camera, and the frame rate control unit is configured to, when at least one first camera among a plurality of cameras outputs an image whose confidence value exceeds the threshold value, leave the frame rate of each of the encoder circuit and the decoder circuit corresponding to a second camera that is in a predetermined positional relationship with the first camera at the initial value, and to, when a third camera that is not in a predetermined positional relationship with the first camera outputs an image whose confidence value is equal to or less than the threshold value, reduce the frame rate of each of the encoder circuit and the decoder circuit corresponding to the third camera to the low speed value.
According to embodiments of the present invention, since the frame rate of each of the encoder circuit and the decoder circuit is reduced from the initial value to the low speed value when the confidence value of the first inference is equal to or less than the threshold value, power consumption can be reduced as compared with conventional event-driven multi-layer inference.
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
The decoder circuit 102, the first inference processing unit 103, the second inference processing unit 104, and the frame rate control unit 105 are provided inside, for example, a server device located at a position away from the monitoring camera 100.
The encoder circuit 101 encodes an image captured by the monitoring camera 100 (step S100 in
The decoder circuit 102 decodes the image encoded by the encoder circuit 101 (step S101 in
The first inference processing unit 103 executes inference processing for person detection on the decoded image for each frame of the image (step S102 in
When a confidence value output from the first inference processing unit 103 exceeds a predetermined threshold value (YES in step S103 in
The encoder circuit 101, the decoder circuit 102, the first inference processing unit 103, and the second inference processing unit 104 execute the processing of steps S100 to S104 for each frame.
On the other hand, when the confidence value output from the first inference processing unit 103 is equal to or less than the threshold value (NO in step S200 in
Subsequently, the frame rate control unit 105 accesses the respective clock generation circuits of the encoder circuit 101 and the decoder circuit 102, and rewrites the clock frequency value set in the register of the clock generation circuit to reduce the frame rate of each of the encoder circuit 101 and the decoder circuit 102 from an initial value to a predetermined low speed value (step S202 in
Then, the frame rate control unit 105 resets the encoder circuit 101 and the decoder circuit 102 to resume the stream output (step S203 in
Thus, encoding and decoding are performed at the low frame rate from the frame following the one whose confidence value became equal to or less than the threshold value.
Further, when the confidence value output from the first inference processing unit 103 exceeds the threshold value (YES in step S200), the frame rate control unit 105 temporarily stops stream output from the encoder circuit 101 and the decoder circuit 102 (step S205 in
Subsequently, the frame rate control unit 105 accesses the respective clock generation circuits of the encoder circuit 101 and the decoder circuit 102, and rewrites the clock frequency value set in the register of the clock generation circuit to restore the frame rate of each of the encoder circuit 101 and the decoder circuit 102 from the low speed value to the initial value (step S206 in
Then, the frame rate control unit 105 resets the encoder circuit 101 and the decoder circuit 102 to resume the stream output (step S207 in
Thus, when the confidence value becomes larger than the threshold value after having been equal to or less than it, encoding and decoding are performed at the normal frame rate from the frame following the one whose confidence value exceeded the threshold value. The frame rate control unit 105 executes the processing of steps S200 to S207 for each frame.
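The per-frame control flow of steps S200 to S207 can be sketched as follows. The `Codec` class is a hypothetical stand-in for real hardware access (clock generation circuit, register rewrite, reset); the frame rate values are illustrative assumptions.

```python
INITIAL_FPS = 60    # initial (normal) frame rate
LOW_SPEED_FPS = 5   # assumed predetermined low speed value
THRESHOLD = 0.5     # assumed confidence threshold

class Codec:
    """Hypothetical stand-in for an encoder or decoder circuit."""
    def __init__(self):
        self.fps = INITIAL_FPS
        self.streaming = True
    def stop_stream(self):
        self.streaming = False   # temporarily stop stream output
    def set_clock(self, fps):
        self.fps = fps           # rewrite the clock-frequency register value
    def reset(self):
        self.streaming = True    # reset the circuit and resume stream output

def control_frame_rate(confidence, encoder, decoder):
    # S200: compare the first-layer confidence against the threshold.
    target = INITIAL_FPS if confidence > THRESHOLD else LOW_SPEED_FPS
    if encoder.fps == target:
        return                   # rate already matches; nothing to do this frame
    for codec in (encoder, decoder):
        codec.stop_stream()      # S201 / S205: stop stream output
        codec.set_clock(target)  # S202 / S206: rewrite clock frequency
        codec.reset()            # S203 / S207: reset and resume stream output
```

For example, a frame with confidence 0.2 drops both circuits to the low speed value, and a later frame with confidence 0.9 restores them to the initial value, with streaming resumed in both cases.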
As described above, in the present embodiment, when the confidence value of the inference by the first inference processing unit 103 is equal to or less than the threshold value, the highly accurate inference by the second inference processing unit 104 is stopped, and at the same time, the frame rates of the encoder circuit 101 and the decoder circuit 102 are reduced. Therefore, the power consumption can be further reduced as compared with the conventional event-driven multi-layer inference.
Since the frame rate of the monitoring camera 100 is at most about 60 frames per second (FPS), there is an interval of about 16.7 milliseconds between frames. Therefore, if the processing of steps S200 to S207 is completed within this interval, no frame loss occurs. When the frame rate is lowered, the cost of data transfer in the encoder circuit 101 and the decoder circuit 102 (for example, dynamic random access memory (DRAM) access) is reduced, thereby reducing power consumption.
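The inter-frame budget cited above follows directly from the frame rate; the switch time below is a hypothetical value used only to illustrate the frame-loss check.

```python
FPS = 60                          # maximum camera frame rate
interval_ms = 1000 / FPS          # ~16.7 ms between consecutive frames
switch_time_ms = 10               # hypothetical duration of steps S200-S207
frame_loss = switch_time_ms > interval_ms  # loss occurs only if we overrun
```

With these numbers the rate switch fits inside one frame interval, so no frame is lost.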
As shown in the example of
Further, in the configuration shown in
For example, in the example shown in
Further, the frame rate control unit 105 leaves the frame rates of the corresponding encoder circuits 101-1 and 101-3 and decoder circuits 102-1 and 102-3 at their initial values regardless of the confidence of inference for the monitoring cameras 100-1 and 100-3 installed within a range of a predetermined distance from the monitoring camera 100-2.
In addition, the frame rate control unit 105 reduces to the low speed value the frame rate of each of the encoder circuit 101-4 and the decoder circuit 102-4 corresponding to the monitoring camera 100-4, which is installed outside the range of the predetermined distance from the monitoring camera 100-2 and outputs an image whose confidence value is equal to or less than the threshold value. Conversely, even for a monitoring camera installed outside that range, the frame rate control unit 105 leaves the frame rates of the corresponding encoder circuit 101 and decoder circuit 102 at their initial values if the camera outputs an image whose confidence value exceeds the threshold value.
Information indicating the positional relationship of the respective monitoring cameras 100-1 to 100-4 is set in the frame rate control unit 105 in advance.
Thus, the power consumption of the whole AI inference system can be optimized by changing the frame rate according to the positional relationship of the respective monitoring cameras 100-1 to 100-4.
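The positional-relationship rule for cameras 100-1 to 100-4 can be sketched as follows. The camera coordinates, the distance range, and the confidence values are illustrative assumptions; only the decision rule itself reflects the description above.

```python
import math

DISTANCE_RANGE = 10.0   # assumed "predetermined distance"
THRESHOLD = 0.5         # assumed confidence threshold
INITIAL_FPS, LOW_SPEED_FPS = 60, 5

# Hypothetical camera layout, set in the frame rate control unit in advance.
positions = {
    "100-1": (0, 0), "100-2": (5, 0), "100-3": (9, 3), "100-4": (40, 0),
}

def near(a, b):
    # True when camera a lies within the predetermined distance of camera b.
    return math.dist(positions[a], positions[b]) <= DISTANCE_RANGE

def decide_frame_rates(confidences):
    """Return the target frame rate for each camera's encoder/decoder pair."""
    detecting = [c for c, conf in confidences.items() if conf > THRESHOLD]
    rates = {}
    for cam, conf in confidences.items():
        if conf > THRESHOLD or any(near(cam, d) for d in detecting):
            # Detecting, or within range of a detecting camera: keep initial rate.
            rates[cam] = INITIAL_FPS
        else:
            # Low confidence and far from every detection: reduce the rate.
            rates[cam] = LOW_SPEED_FPS
    return rates
```

With only camera 100-2 exceeding the threshold, cameras 100-1 and 100-3 (within range of 100-2) keep the initial rate, while the distant low-confidence camera 100-4 is reduced to the low speed value, matching the behavior described above.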
The first inference processing unit 103, the second inference processing unit 104, and the frame rate control unit 105 described in the present embodiment can be implemented by a computer having a central processing unit (CPU), a storage device, and an interface with the outside, and a program that controls these hardware resources.
The computer includes a CPU 300, a storage device 301, and an interface device (I/F) 302. The encoder circuits 101-1 to 101-4, the decoder circuit 102, and the like are connected to the I/F 302. In such a computer, a program for implementing the method according to embodiments of the present invention is stored in the storage device 301. The CPU 300 executes the processing described in the present embodiment according to the program stored in the storage device 301.
Embodiments of the present invention can be applied to techniques for improving the efficiency of AI inference.
This application is a national phase entry of PCT Application No. PCT/JP2021/042763, filed on Nov. 22, 2021, which application is hereby incorporated herein by reference.
Filing Document | Filing Date | Country | Kind |
---|---|---|---
PCT/JP2021/042763 | 11/22/2021 | WO |