The present disclosure relates to a method of stabilizing bounding boxes for objects in a video stream. By using the method the user experience with respect to the stability of bounding boxes is improved. The disclosure further relates to an image processing system, which implements the method.
A bounding box is a type of annotation that refers to a box drawn around an object in an image or video. The bounding box is usually, but not necessarily, a rectangle defined by x (horizontal) and y (vertical) coordinates, whose edges surround the object. A bounding box is not strictly limited to the shape of a rectangle, but may take any suitable shape. The dimensions of the bounding box usually depend on the height and width of the object. Bounding boxes are often labeled with the name of the type of object that they surround, for example, ‘CAR’ for a bounding box surrounding a car. It is also common for different colors to be used for different types of objects.
A machine learning model can be trained to perform object detection and recognize certain object types. In a video stream, which comprises a sequence of image frames, the bounding boxes can either be updated for each image frame or, for example, updated for every Nth image frame. In, for example, a surveillance system comprising a camera, the bounding boxes in the video stream can be updated in substantially real time such that the user can keep track of a number of objects. If new objects enter the view of the camera or if objects leave the view, the bounding boxes are updated accordingly.
An issue with existing systems and existing methods of generating bounding boxes in video streams is that they do not generate stable bounding boxes for objects in the video stream. Although not always visible to the human eye, the video stream will typically have pixel variations between the image frames. These variations make the system compute the bounding boxes slightly differently from frame to frame, such that the position and/or size of the bounding box changes between the frames. This may be perceived by the user as a flickering effect on the bounding boxes. This undesired effect can be unpleasant for the user or can give the user the impression that the application is not working correctly.
An article, “Video object extraction and its tracking using background subtraction in complex environments”, Kumar et al., 2016, https://doi.org/10.1016/j.pisc.2016.04.064, discloses a method of studying moving blobs in the foreground and updating the background to improve tracking accuracy.
Another article, “Improving Performance of CNN Based Vehicle Detection and Tracking by Median Algorithm”, Shah et al., 2021, DOI: 10.1109/ICCE-Asia53811.2021.9641942, proposes a median-based label estimation scheme that predicts detection labels for objects in a frame using a median of history labels stored in the previous frames.
U.S. Pat. No. 10,511,846 discloses a method and apparatus for adaptive denoising of source video in a video conference application. Temporal denoising is adaptively applied to blocks of a source frame based on noise estimation and moving object detection.
The present disclosure relates to a method of stabilizing bounding boxes for objects in a video stream, the method comprising:
The inventors have realized that the measured noise level can be used to adapt the temporal filtering to achieve a positive, stabilizing effect on the bounding box in the video stream. This may be done by averaging a position of the bounding box in a given image frame over a number of preceding image frames. The preceding frames can be expressed as a number of frames or as a time window. The number of preceding frames, or the length of the time window, can be adapted to the level of noise. According to one embodiment, the number of preceding image frames is adapted such that a higher noise level implies a temporal filtering over a greater number of preceding image frames, or a longer time window, and a lower noise level implies a temporal filtering over a smaller number of preceding image frames, or a shorter time window.
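The adaptation of the window length to the measured noise level can be sketched as follows. This is a minimal illustration in Python; the thresholds `low` and `high` and the frame-count limits are hypothetical values that a real system would calibrate for its camera and its chosen noise metric.

```python
def window_length(noise_level, min_frames=2, max_frames=30, low=0.01, high=0.2):
    """Map a measured noise level to a number of preceding frames.

    Below `low`, the shortest window is used (most responsive); above
    `high`, the longest window is used (most stable). In between, the
    window length is interpolated linearly. All thresholds are
    illustrative placeholders, not values from the disclosure.
    """
    if noise_level <= low:
        return min_frames
    if noise_level >= high:
        return max_frames
    # Linear interpolation between the two extremes.
    t = (noise_level - low) / (high - low)
    return round(min_frames + t * (max_frames - min_frames))
```

Because the mapping is recomputed as the noise level is re-measured, the filtering window grows and shrinks dynamically with the conditions in the scene.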
The noise in a video is typically not fixed but varies over time and depends on the environment. When suppressing instability by averaging the position of the bounding box, the inventors have found it to be beneficial to update the length of the time window accordingly. A video that has a high level of noise is prone to produce very unstable bounding boxes, whereas a low level of noise is prone to produce less unstable bounding boxes. By dynamically and continuously updating the time window or the number of preceding frames used in the temporal filtering, it is possible to make the bounding boxes more stable while keeping them as responsive as possible in terms of how fast they react to, for example, movement.
Averaging is, generally, a stabilization technique that would be known to a person skilled in the art. Averaging over a very long time window is, however, not always useful, since it creates latency. In other words, what a user or application gains in stability of bounding boxes in a noisy video may at the same time be lost in latency. A bounding box that has too much latency may be as unpleasant for the user as flickering effects on the bounding boxes caused by noise. The present disclosure describes a method and a system that can adapt the stabilization based on the noise level, wherein latency is reduced for less noisy video sequences and wherein slightly increased latency may be tolerated for video sequences having a higher noise level as a result of increasing the stability of the bounding boxes.
The present disclosure further relates to an image processing system comprising:
The system may further comprise a display for displaying the video stream and the stabilized bounding box. The image processing system may be used in, for example, a camera-based surveillance system.
A person skilled in the art will recognize that the presently disclosed method of stabilizing bounding boxes for objects in a video stream may be performed using any embodiment of the presently disclosed image processing system, and vice versa.
Various embodiments are described hereinafter with reference to the drawings. The drawings are examples of embodiments and are intended to illustrate some of the features of the presently disclosed method and system for stabilizing bounding boxes for objects in a video stream, and are not limiting to the presently disclosed method and system.
The present disclosure relates to a method of stabilizing bounding boxes for objects in a video stream.
The step of temporally filtering the bounding box over a plurality of image frames based on the measured noise level may be done such that a position of the temporally filtered bounding box in a given image frame is a combination, such as an average, of positions of the bounding box for a number of preceding image frames.
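Such a combination of preceding positions can, in its simplest form, be a plain moving average. The sketch below assumes that boxes are given as (x, y, width, height) tuples; the class name and the representation are illustrative only.

```python
from collections import deque

class BoxAverager:
    """Average (x, y, w, h) boxes over the last n frames (plain moving average)."""

    def __init__(self, n):
        # deque with maxlen automatically discards the oldest position.
        self.history = deque(maxlen=n)

    def update(self, box):
        """Add the latest detected box and return the averaged box."""
        self.history.append(box)
        k = len(self.history)
        return tuple(sum(b[i] for b in self.history) / k for i in range(4))
```

Changing `n` between frames, for example via a noise-dependent mapping, gives the adaptive behavior discussed above.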
The step of temporally filtering the bounding box over a plurality of image frames may be based on the measured noise level and may be performed for every Nth image frame, where N is an integer greater than 1. Alternatively, the step of temporally filtering the bounding box over a plurality of image frames based on the measured noise level is performed for every image frame.
The step of temporally filtering the bounding box over a plurality of image frames may comprise temporally smoothing the bounding box. Temporal smoothing is a term that would generally be known to a person skilled in the art. It may refer to averaging the bounding box over multiple image frames to create a more stable position and/or size. The temporal filtering can apply a simple technique, such as plain averaging, but may also apply, for example, exponential smoothing and/or Kalman filtering.
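As an illustration of exponential smoothing, one of the techniques mentioned above, a single smoothing factor can blend the previously filtered box with the new detection. The factor `alpha` here is a hypothetical tuning parameter; it could itself be derived from the measured noise level.

```python
def ema_box(prev, new, alpha=0.3):
    """Exponentially smoothed (x, y, w, h) box.

    alpha near 1 -> responsive (new detection dominates);
    alpha near 0 -> stable (history dominates).
    """
    return tuple(alpha * n + (1 - alpha) * p for p, n in zip(prev, new))
```

Unlike a plain moving average, exponential smoothing needs only the previous filtered box, not a buffer of past positions.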
If the length of the time window, or the number of preceding image frames based on which the temporal filtering is done, is adapted to a noise level, a more flexible temporal filtering can be achieved.
Noise, commonly known as static, white noise, or snow, may be a dot pixel pattern that appears in a video. Noise may refer to random variations in brightness or color in the video. It may appear as a grainy or speckled texture in the video. In one embodiment the noise is visual noise in the video stream. The noise may be a dot pattern, such as a pixel pattern, superimposed on the image frames and varying, for example randomly, between the image frames. Alternatively, or in combination, the noise may comprise fluctuations of color and/or luminance and/or contrast. Video noise may be, but is not necessarily, visible to the user.
Noise can occur due to a number of factors, such as the camera's sensitivity, ISO settings or digital amplification settings. The noise may comprise internal noise, such as noise caused by electricity, heat or illumination levels, and/or compression artifacts, and/or interference noise, such as Gaussian noise and/or fixed-pattern noise and/or salt and pepper noise and/or shot noise and/or quantization and/or anisotropic noise.
Generally, a person skilled in the art would understand what noise in a video can refer to. The term shall, within the context of the present application, be construed broadly to cover any noise that can cause instability of bounding boxes in the video. It would be clear to a person skilled in the art that when there is noise in a video, the detection of objects, and in the end the computation of bounding boxes, may vary slightly.
The step of measuring a noise level for the video stream can be carried out in several ways. Typically, the noise may be expressed as a magnitude of variation of a dot pattern between the image frames. If, for example, there is a high level of similarity between image frames, not taking into account other events such as a moving object, the video can be said to have lower noise than if there is a low level of similarity. The noise level can be said to be a measure of the random variations or distortions present in the video.
One measure that can be used to express a noise level is the Signal-to-Noise Ratio (SNR), which is a measure of the ratio of the useful information in the video compared to the noise. A higher SNR indicates a lower noise level. This may be an average over time or over a number of image frames. The Peak Signal-to-Noise Ratio (PSNR) is another example of how the noise level can be expressed for a person skilled in the art. These and other techniques and standards for determining a noise level of a video would be readily available to a person skilled in the art.
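A minimal PSNR computation between two grayscale frames could look as follows. It assumes that frames are given as nested lists of pixel values and that the scene is otherwise static, so that the frame difference is dominated by noise; a real system would first compensate for genuine motion.

```python
import math

def psnr(frame_a, frame_b, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equally sized frames.

    A low PSNR between consecutive frames of a static scene indicates a
    high noise level; identical frames give infinite PSNR.
    """
    n = 0
    se = 0.0
    for row_a, row_b in zip(frame_a, frame_b):
        for a, b in zip(row_a, row_b):
            se += (a - b) ** 2
            n += 1
    mse = se / n  # mean squared error between the frames
    if mse == 0:
        return float("inf")
    return 10 * math.log10(max_val ** 2 / mse)
```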
The noise level can be measured for an entire region of the image frames, or for a region covering the bounding box, or for a region covering the bounding box and an additional region surrounding the bounding box.
Similarly, sub-regions comprising moving objects may be disregarded in the step of measuring the noise level, since the moving objects may be seen as pixel changes that are counted as noise even though the moving objects are not noise in the video.
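One way to sketch this exclusion is to average the per-pixel frame difference only outside known moving-object boxes. The representation of the difference image and the boxes below is illustrative.

```python
def region_noise(diff, boxes):
    """Mean absolute frame difference over pixels outside the given boxes.

    `diff` is a 2-D list of per-pixel absolute differences between two
    consecutive frames; `boxes` are (x, y, w, h) regions containing moving
    objects, which are skipped so that motion is not counted as noise.
    """
    total, n = 0.0, 0
    for y, row in enumerate(diff):
        for x, d in enumerate(row):
            if any(bx <= x < bx + bw and by <= y < by + bh
                   for bx, by, bw, bh in boxes):
                continue  # pixel belongs to a moving object; disregard it
            total += d
            n += 1
    return total / n if n else 0.0
```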
Object detection is a computer vision technique that involves locating and identifying objects within the image frames. A person skilled in the art would generally be familiar with such techniques and would know how to implement them.
Convolutional Neural Networks (CNN) or other machine learning-based methods have gained popularity as they are typically very accurate and fast, but there are a number of object detection techniques that do not rely on CNN or machine learning.
One example of an object detection algorithm is the Viola-Jones detection framework. In this method the image frames are scanned with a sliding window, where each region is classified as containing or not containing an object. The method uses Haar features and a cascaded classifier to detect objects.
There are various types of such object detection methods known in the art where cascades of identifiers are used to detect objects, e.g. as described in Viola, Paul, and Michael Jones, “Rapid object detection using a boosted cascade of simple features”, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), Vol. 1, IEEE, 2001. Since it is the visual features that are important for these algorithms, groups of objects that share similar visual features may be detected; examples of such groups are faces, vehicles, humans, etc. Any of these methods may be used separately or in combination to detect an object in image data. Several objects may also be detected in the same set of image data.
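The sliding-window scan and the early-rejection cascade underlying such methods can be sketched schematically as follows. The stage functions used here are toy stand-ins, not actual Haar features or trained classifier stages.

```python
def sliding_windows(width, height, win, step):
    """Yield the top-left corner of every win x win window over a frame."""
    for y in range(0, height - win + 1, step):
        for x in range(0, width - win + 1, step):
            yield x, y

def cascade_classify(window, stages):
    """Early-rejection cascade in the spirit of Viola-Jones.

    Each stage is a (score_function, threshold) pair. A window is rejected
    as soon as any stage scores below its threshold, so most windows are
    discarded cheaply; only windows passing all stages count as detections.
    """
    return all(score(window) >= threshold for score, threshold in stages)
```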
When an object has been detected, a set of identification characteristics may be created to describe the visual appearance of the detected object. Image data from a single image frame or a video sequence may be used to create the identification characteristics for the detected object. Various image and/or video analysis algorithms may be used to extract and create the identification characteristics from the image data. Examples of such image or video analysis algorithms are various algorithms for extracting features in a face, such as in Turk, Matthew A., and Alex P. Pentland, “Face recognition using eigenfaces”, Proceedings of the 1991 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '91), IEEE, 1991; gait features, such as in Lee, Lily, and W. Eric L. Grimson, “Gait analysis for recognition and classification”, Proceedings of the Fifth IEEE International Conference on Automatic Face and Gesture Recognition, IEEE, 2002; or colors, such as in U.S. Pat. No. 8,472,714 by Brogren et al.
A database may comprise a number of objects and a number of identification characteristics. In the presently disclosed method of stabilizing bounding boxes for objects in a video stream, the step of detecting an object in the image frames may comprise matching identification characteristics to identification characteristics in the database to classify an object as a certain type of object, for example, a car, a person or any other item. In one embodiment the step of detecting an object in the image frames comprises comparing the image frames against reference images in a database to match features corresponding to the object in the image frames and the reference images.
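Matching identification characteristics against a database can, in the simplest case, be a nearest-neighbour comparison of feature vectors. The feature dimensionality, the distance threshold `max_dist`, and the labels below are illustrative placeholders.

```python
def classify(features, database, max_dist=1.0):
    """Match a feature vector against labeled reference vectors.

    Returns the label of the closest reference (Euclidean distance), or
    None if nothing lies within `max_dist`. Purely illustrative; real
    systems use learned embeddings and indexed search structures.
    """
    best_label, best_d = None, float("inf")
    for label, ref in database:
        d = sum((f - r) ** 2 for f, r in zip(features, ref)) ** 0.5
        if d < best_d:
            best_label, best_d = label, d
    return best_label if best_d <= max_dist else None
```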
A Scale-Invariant Feature Transform (SIFT) is another feature-based object detection technique that uses keypoint extraction and matching to detect and track objects in a video. It works by identifying distinctive features in the image, such as edges, corners and blobs, and matching them across multiple frames.
The step of detecting an object in the image frames may comprise processing the image frames to identify predefined features, such as shapes, that are characteristic of the object.
As mentioned above, alternatively, or in combination, the step of detecting an object in the image frames may comprise applying a machine learning model, such as a neural network, trained to recognize the object. The neural network may comprise, for example, a deep learning model.
Classification of objects and/or of events may be achieved by means of a neural network. Classifying neural networks are often used in applications like character recognition, monitoring, surveillance, image analysis, natural language processing etc. There are many neural network algorithms/technologies that may be used for classifying objects, e.g. Convolutional Neural Networks, Recurrent Neural Networks, etc.
According to a non-limiting example of a setup for object detection using a neural network, the neural network is fed with labeled data. The labeled data is, for example, an image of an object to be classified, wherein the image is labeled with the correct class, i.e. the labeled data includes the image data itself and a ground truth for the image data. The image data is input to the classifier and the ground truth is sent to a loss function calculator. The classifier processes the data representing the object to be classified and generates a classification identifier. The processing in the classifier may include applying weights to values as the data is fed through the classifier. The classification identifier may be a feature vector, a classification vector, or a single value identifying a class. In the loss function calculator, the classification identifier is compared to the ground truth using, for example, a loss function. The result from the loss function is then transferred to a weight adjustment function that is configured to adjust the weights used in the classifier. When the classifier is fully trained, it may be used to perform a classification by loading the data to be classified into the classifier. The data to be classified may be in the same form as the labeled data used during training, but without the label. The classifier can then output data identifying the class determined for the input data.
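The loop described above (forward pass, loss against the ground truth, weight adjustment) can be illustrated with a minimal logistic classifier standing in for a full neural network. The learning rate, epoch count, and toy training data are hypothetical.

```python
import math

def train_classifier(samples, epochs=200, lr=0.5):
    """Train a tiny logistic classifier on (features, label) pairs.

    Each update mirrors the described setup: the classifier produces an
    output, it is compared with the ground truth via the log loss, and
    the result adjusts the weights.
    """
    dim = len(samples[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, label in samples:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # classifier output in (0, 1)
            err = p - label                  # gradient of the log loss w.r.t. z
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]  # weight adjustment
            b -= lr * err
    return w, b

def predict(model, x):
    """Classify a sample with the trained weights (1 if score > 0)."""
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
```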
The present disclosure further relates to an image processing system comprising:
The system may, but does not necessarily have to, include a display for displaying the video stream and the stabilized bounding box. The stabilization of bounding boxes for objects in a video stream may be a useful visual feature for a user viewing the display. However, in further applications the bounding boxes are not necessarily displayed but used in additional applications. The additional applications may include, for example, extracting statistics or further information from the video, such as analyzing sizes or orientations of objects. As would be realized by a person skilled in the art, neither the system nor the method has to be limited to displaying the bounding boxes on a display.
The system may further comprise peripheral components, such as one or more memory units, which may be used for storing instructions that can be executed by the processing unit. The system may further comprise any of: internal and external network interfaces, input and/or output ports, a keyboard or mouse etc.
As would be understood by a person skilled in the art, a processing unit may also be a single processor in a multi-core/multiprocessor system. Both a computing hardware accelerator and the central processing unit may be connected to a data communication infrastructure.
The system may include a memory unit, such as a random access memory (RAM) and/or a read-only memory (ROM), or any suitable type of memory. The system may further comprise a communication interface that allows software and/or data to be transferred between the system and external devices. Software and/or data transferred via the communications interface may be in any suitable form of electric, optical or RF signals. The communications interface may comprise, for example, a cable or a wireless interface.
The present disclosure further relates to a computer program having instructions which, when executed by a computing device or computing system, cause the computing device or computing system to carry out any embodiment of the presently disclosed method of stabilizing bounding boxes for objects in a video stream. The computer program may be stored on any suitable type of storage media, such as non-transitory storage media.
Number | Date | Country | Kind |
---|---|---|---|
23172013.7 | May 2023 | EP | regional |