The present disclosure relates to computer vision technology, and more particularly, to methods and devices for detecting a moving object, methods and devices for intelligent driving control, and electronic apparatuses, computer-readable storage media, and computer programs thereof.
In technical fields such as intelligent driving and security monitoring, it is necessary to sense moving objects and their moving directions. The sensed moving objects and their moving directions may be provided to a decision-maker so that the decision-maker can make a decision based on the sensed results. For example, for an intelligent driving system, in a case that a moving object (such as a person or an animal) at the side of the road is sensed as approaching the center of the road, the decision-maker can control a vehicle to slow down or even stop, to ensure the safe driving of the vehicle.
Embodiments of the present disclosure provide a technical solution for detecting a moving object.
According to a first aspect of the present disclosure, a method of detecting a moving object is provided. The method includes: acquiring depth information of pixels of an image to be processed; acquiring optical flow information between the image to be processed and a reference image, wherein the reference image and the image to be processed are two images that are collected by an image pickup device in a continuous photographing mode and have a timing sequence relationship; acquiring a three-dimensional motion field of the pixels of the image to be processed with respect to the reference image according to both the depth information and the optical flow information; and determining a moving object involved in the image to be processed according to the three-dimensional motion field.
According to a second aspect of the present disclosure, there is provided a method of intelligent driving control, including: acquiring, by an image pickup device mounted on a vehicle, a video stream of a road where the vehicle is located; performing, through the above method of detecting a moving object, moving object detection on at least one video frame of the video stream to determine a moving object involved in the video frame; and generating and outputting a control instruction for the vehicle according to the moving object.
According to a third aspect of the present disclosure, a device for detecting a moving object is provided, including: a first acquiring module, configured to acquire depth information of pixels of an image to be processed; a second acquiring module, configured to acquire optical flow information between the image to be processed and a reference image, wherein the reference image and the image to be processed are two images that are collected by an image pickup device in a continuous photographing mode and have a timing sequence relationship; a third acquiring module, configured to acquire a three-dimensional motion field of the pixels of the image to be processed with respect to the reference image according to both the depth information and the optical flow information; and a moving object determining module, configured to determine a moving object involved in the image to be processed according to the three-dimensional motion field.
According to a fourth aspect of the present disclosure, there is provided a device for intelligent driving control, including: a fourth acquiring module, configured to acquire, by an image pickup device mounted on a vehicle, a video stream of a road where the vehicle is located; the above-mentioned device for detecting a moving object, configured to perform moving object detection on at least one video frame of the video stream to determine a moving object involved in the video frame; and a control module, configured to generate and output a control instruction for the vehicle according to the moving object.
According to a fifth aspect of the present disclosure, there is provided an electronic apparatus, including: a processor, a memory, a communication interface, and a communication bus, wherein the processor, the memory, and the communication interface communicate with each other through the communication bus. The memory is configured to store at least one executable instruction, and the executable instruction causes the processor to execute the above method.
According to a sixth aspect of the present disclosure, there is provided a computer-readable storage medium having a computer program stored thereon, in a case that the computer program is executed by a processor, operations of the method according to any one of the embodiments of the present disclosure are implemented.
According to a seventh aspect of the present disclosure, there is provided a computer program, including computer instructions, which, in a case that the computer instructions are run in a processor of an apparatus, implement operations of the method according to any one of the embodiments of the present disclosure.
Based on the method of detecting a moving object, the method of intelligent driving control, the device for detecting a moving object, the device for intelligent driving control, and the electronic apparatus, computer-readable storage medium, and computer program thereof provided by the present disclosure, a three-dimensional motion field of pixels of an image to be processed with respect to a reference image can be determined according to both the depth information of the pixels of the image to be processed and the optical flow information between the image to be processed and the reference image. As the three-dimensional motion field may reflect a moving object, the moving object involved in the image to be processed may be determined according to the three-dimensional motion field. Thus, the technical solution according to the present disclosure helps to improve the accuracy of sensing a moving object, which in turn helps to improve the safety of intelligent vehicle driving.
The technical solutions according to the present disclosure will be further described in detail below through the drawings and embodiments.
The drawings constituting a part of the specification illustrate the embodiments of the present disclosure, and serve to explain the principle of the present disclosure along with the description.
With reference to the drawings, the present disclosure can be understood more clearly according to the following detailed description, in which:
Various exemplary embodiments of the present disclosure will be described in detail with reference to the accompanying drawings. It should be noted that unless specifically stated otherwise, the relative arrangement of components and steps, numerical expressions and numerical values set forth in these embodiments should not be construed as limiting the scope of the present disclosure.
Meanwhile, it should be understood that, for ease of description, the various components illustrated in the drawings are not drawn to scale. The following description of at least one exemplary embodiment is actually only illustrative, and in no way serves as any limitation to the present disclosure or its application or use. The techniques, methods, and equipment known to one of ordinary skill in the relevant arts may not be discussed in detail, but where appropriate, the techniques, methods, and equipment should be regarded as a part of the specification. It should be noted that similar reference numerals and letters designate similar items in the following drawings, and therefore, once an item is defined in one drawing, it does not need to be discussed further in subsequent drawings.
Embodiments of the present disclosure may be applied to electronic apparatuses such as terminal devices, computer systems, and servers, which may be operated with many other general-purpose or special-purpose computing system environments or configurations. Examples of well-known terminal devices, computing systems, environments, and/or configurations suitable for use with electronic apparatuses such as terminal devices, computer systems, and servers include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network personal computers, small computer systems, large computer systems, and distributed cloud computing technology environments including any of the above systems, etc.
Electronic apparatuses such as terminal devices, computer systems, and servers may be described in the general context of computer system executable instructions (such as program modules) executed by the computer system. Typically, program modules may include routines, programs, object programs, components, logic, and data structures, etc., which perform specific tasks or implement specific abstract data types. The computer system/server may be implemented in a distributed cloud computing environment. In the distributed cloud computing environment, tasks are executed by remote processing equipment connected via a communication network. And in a distributed cloud computing environment, program modules may be located on a storage medium of a local or remote computing system including a storage device.
S100: Depth information of pixels of the image to be processed is acquired.
In an optional example of the present disclosure, depth information of the pixels (such as all the pixels) of the image to be processed may be acquired through a disparity map of the image to be processed. That is, the disparity map of the image to be processed is acquired first, and then, the depth information of the pixels of the image to be processed is acquired according to the disparity map of the image to be processed.
In an optional example of the present disclosure, for clarity of description, the disparity map of the image to be processed is referred to as a first disparity map of the image to be processed hereinafter. The first disparity map in the present disclosure is intended to describe a disparity of the image to be processed. The disparity may refer to a difference between positions of a target object when the target object is observed from two points that are a certain distance apart. An example of the image to be processed is illustrated in
In an optional example of the present disclosure, the image to be processed in the present disclosure is typically a monocular image. That is, the image to be processed is typically an image collected by a monocular image pickup device. In a case that the image to be processed is a monocular image, moving object detection can be achieved in the present disclosure without a binocular image pickup device, thereby reducing the cost of moving object detection.
In an optional example of the present disclosure, the first disparity map of the image to be processed may be acquired through a successfully pre-trained convolutional neural network. For example, the image to be processed is input into the convolutional neural network for performing a disparity analysis by the convolutional neural network, and a disparity analysis result is output from the convolutional neural network, such that the first disparity map of the image to be processed may be obtained on the basis of the disparity analysis result. Acquiring the first disparity map of the image to be processed through the convolutional neural network avoids calculating the disparity between two images pixel by pixel and avoids calibrating the image pickup device. Thus, the convenience and real-time performance of obtaining the disparity map are improved.
In an optional example of the present disclosure, the convolutional neural network typically includes, but is not limited to: a plurality of convolutional layers (Conv) and a plurality of deconvolutional layers (Deconv). The convolutional neural network of the present disclosure may be divided into two parts, namely an encoding part and a decoding part. The image to be processed input into the convolutional neural network (the image to be processed as illustrated in
In an optional example of the present disclosure, low-level information and high-level information of the convolutional neural network may be fused in a manner of skip connection. For example, the output of at least one convolutional layer of the encoding part is provided to at least one deconvolutional layer of the decoding part through a skip connection. In an optional embodiment of the present disclosure, the input of each convolutional layer of the convolutional neural network typically includes: the output of the preceding layer (such as a convolutional layer or a deconvolutional layer); and the input of at least one deconvolutional layer (such as part of the deconvolutional layers or all the deconvolutional layers) of the convolutional neural network includes: an upsampled result of the output of the preceding layer and the output of the convolutional layer of the encoding part that is in skip connection with the deconvolutional layer. For example, contents designated by a solid arrow drawn from the bottom of the convolutional layer on the right side of
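As an illustrative, non-limiting sketch of the encoder-decoder structure with skip connections described above (the layer counts, channel widths, and use of PyTorch are assumptions for illustration, not the network of the present disclosure):

```python
import torch
import torch.nn as nn

class DisparityNet(nn.Module):
    """Minimal encoder-decoder sketch: Conv layers encode, Deconv layers decode,
    and a skip connection feeds an encoder feature map to the decoder."""
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.dec1 = nn.Sequential(nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU())
        # The last decoder layer receives the upsampled feature concatenated with the skip feature.
        self.dec2 = nn.ConvTranspose2d(32 + 32, 1, 4, stride=2, padding=1)

    def forward(self, x):
        f1 = self.enc1(x)            # encoder feature kept for the skip connection
        f2 = self.enc2(f1)
        d1 = self.dec1(f2)           # upsampled result of the output of the preceding layer
        d1 = torch.cat([d1, f1], 1)  # skip connection: fuse low-level and high-level information
        return self.dec2(d1)         # disparity analysis result (one-channel disparity map)
```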
In an optional example of the present disclosure, the first disparity map of the image to be processed that is obtained through the convolution neural network may further be optimally adjusted, to make the first disparity map more accurate. In an optional embodiment of the present disclosure, the first disparity map of the image to be processed may be optimally adjusted by using a horizontal mirror image (for example, the left mirror image or the right mirror image) of the image to be processed. For ease of description, the horizontal mirror image of the image to be processed is referred to as a first horizontal mirror image, and a disparity map of the first horizontal mirror image is referred to as a second disparity map hereinafter. An example of optimally adjusting the first disparity map in the present disclosure is as follows:
Step A: a horizontal mirror image of the second disparity map is obtained.
In an optional embodiment of the present disclosure, the first horizontal mirror image is intended to indicate that the mirror image is a mirror image generated by performing a horizontal mirroring (rather than vertical mirroring) on the image to be processed. For ease of description, the horizontal mirror image of the second disparity map is referred to as a second horizontal mirror image hereinafter. In an optional embodiment of the present disclosure, the second horizontal mirror image in the present disclosure refers to a mirror image generated by performing a horizontal mirroring on the second disparity map. The second horizontal mirror image is still a disparity map.
In an optional embodiment of the present disclosure, left mirroring or right mirroring may be performed on the image to be processed first (as the left mirroring result is the same as the right mirroring result, it is possible to perform either of the left mirroring and the right mirroring on the image to be processed in the present disclosure), to obtain the first horizontal mirror image; and then, the disparity map of the first horizontal mirror image is acquired; and finally, left mirroring or right mirroring is performed on the second disparity map (as the left mirroring result of the second disparity map is same as the right mirroring result, it is possible to perform either of the left mirroring and the right mirroring on the second disparity map) to acquire a second horizontal mirror image. For convenience of description, the second horizontal mirror image is referred to as the third disparity map hereinafter.
As can be seen from the above description of the present disclosure, in the case of performing the horizontal mirroring on the image to be processed, it is possible to ignore whether the image to be processed is taken as a left-eye image or as a right-eye image. That is, in the present disclosure, regardless of whether the image to be processed is taken as a left-eye image or a right-eye image, either left mirroring or right mirroring may be performed on the image to be processed, thereby acquiring the first horizontal mirror image. Similarly, in the present disclosure, in the case of performing the horizontal mirroring on the second disparity map, it need not be considered whether the left mirroring or the right mirroring should be performed on the second disparity map.
It should be noted that, in the process of training the convolutional neural network which is configured to generate the first disparity map of the image to be processed, in a case that the left-eye image of the binocular image sample is provided as input to the convolutional neural network for training, the successfully trained convolutional neural network will take an input image to be processed as the left-eye image in testing and practical applications, that is, the image to be processed in the present disclosure is taken as the left-eye image to be processed. And in a case that the right-eye image of the binocular image sample is provided as input to the convolutional neural network for training, the successfully trained convolutional neural network will take the input image to be processed as the right-eye image in testing and practical applications, that is, the image to be processed in the present disclosure is taken as the right-eye image to be processed.
In an optional embodiment of the present disclosure, the aforementioned convolutional neural network may further be configured to acquire the second disparity map. For example, the first horizontal mirror image is input into the convolutional neural network for performing a disparity analysis by the convolutional neural network, and the convolutional neural network outputs a disparity analysis result. Thus, in the present disclosure, the second disparity map may be acquired according to the output disparity analysis result.
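A minimal sketch of the mirroring pipeline described above, assuming numpy images and a hypothetical callable predict_disparity standing in for the convolutional neural network:

```python
import numpy as np

def third_disparity_map(image, predict_disparity):
    """image: H x W x 3 array; predict_disparity: returns an H x W disparity map."""
    first_disparity = predict_disparity(image)           # first disparity map
    first_mirror = np.fliplr(image)                       # first horizontal mirror image
    second_disparity = predict_disparity(first_mirror)    # second disparity map
    third_disparity = np.fliplr(second_disparity)         # second horizontal mirror image, i.e., third disparity map
    return first_disparity, third_disparity
```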
Step B: Both a weight distribution map of the disparity map (that is, the first disparity map) of the image to be processed and a weight distribution map of the second horizontal mirror image (that is, the third disparity map) are acquired.
In an optional example of the present disclosure, the weight distribution map of the first disparity map is intended to describe respective weight values of a plurality of disparity values (for example, all disparity values) of the first disparity map. The weight distribution map of the first disparity map may include but is not limited to: a first weight distribution map of the first disparity map and a second weight distribution map of the first disparity map.
In an optional embodiment of the present disclosure, the first weight distribution map of the first disparity map is a weight distribution map set uniformly for the first disparity maps of a plurality of different images to be processed; that is, the first disparity maps of the plurality of different images to be processed adopt the same first weight distribution map. Therefore, in the present disclosure, the first weight distribution map of the first disparity map may be referred to as a global weight distribution map of the first disparity map. The global weight distribution map of the first disparity map is intended to describe respective global weight values of a plurality of disparity values (such as all the disparity values) of the first disparity map.
In an optional embodiment of the present disclosure, the second weight distribution map of the first disparity map is a weight distribution map set for the first disparity map of a single image to be processed; that is, the first disparity maps of different images to be processed adopt different second weight distribution maps. Therefore, in the present disclosure, the second weight distribution map of the first disparity map is referred to as a local weight distribution map of the first disparity map. The local weight distribution map of the first disparity map is intended to describe respective local weight values of a plurality of disparity values (such as all the disparity values) of the first disparity map.
In an optional example of the present disclosure, the weight distribution map of the third disparity map is intended to describe respective weight values of a plurality of disparity values of the third disparity map. The weight distribution map of the third disparity map may include, but is not limited to: a first weight distribution map of the third disparity map and a second weight distribution map of the third disparity map.
In an optional embodiment of the present disclosure, the first weight distribution map of the third disparity map is a weight distribution map set uniformly for the third disparity maps of a plurality of different images to be processed; that is, the third disparity maps of different images to be processed adopt the same first weight distribution map. Therefore, in the present disclosure, the first weight distribution map of the third disparity map may be referred to as a global weight distribution map of the third disparity map. The global weight distribution map of the third disparity map is intended to describe respective global weight values of a plurality of disparity values (such as all disparity values) of the third disparity map.
In an optional embodiment of the present disclosure, the second weight distribution map of the third disparity map is a weight distribution map set for the third disparity map of a single image to be processed; that is, the third disparity maps of different images to be processed adopt different second weight distribution maps. Therefore, in the present disclosure, the second weight distribution map of the third disparity map may be referred to as a local weight distribution map of the third disparity map. The local weight distribution map of the third disparity map is intended to describe respective local weight values of a plurality of disparity values (such as all the disparity values) of the third disparity map.
In an optional example of the present disclosure, the first weight distribution map of the first disparity map includes: at least two horizontally juxtaposed regions with different weight values. In an optional embodiment of the present disclosure, the relationship between a weight value of a left region and a weight value of a right region typically depends on whether the image to be processed is taken as a left-eye image to be processed or a right-eye image to be processed.
For example, in a case that the image to be processed is taken as a left-eye image, for any two regions of the first weight distribution map of the first disparity map, a weight value of the right region is greater than a weight value of the left region.
For another example, when the image to be processed is taken as a right-eye image, for any two regions of the first weight distribution map of the first disparity map, a weight value of the left region is greater than that of the right region.
In an optional embodiment of the present disclosure, the first weight distribution map of the third disparity map includes: at least two horizontally juxtaposed regions with different weight values. A relationship between a weight value of the left region and a weight value of the right region typically depends on whether the image to be processed is taken as a left-eye image or a right-eye image.
For example, when the image to be processed is taken as a left-eye image, for any two regions of the first weight distribution map of the third disparity map, a weight value of the right region is greater than a weight value of a left region. In addition, any region of the first weight distribution map of the third disparity map may have a same weight value, or may have different weight values. In the case that a region of the first weight distribution map of the third disparity map has different weight values, a weight value of the left part of the region is typically not greater than a weight value of the right part of the region.
For another example, in a case that the image to be processed is taken as a right-eye image, for any two regions of the first weight distribution map of the third disparity map, a weight value of the left region is greater than that of the right region. In addition, any region of the first weight distribution map of the third disparity map may have an equal weight value or different weight values. In a case that a region of the first weight distribution map of the third disparity map has different weight values, a weight value of the right part of the region is typically not greater than a weight value of the left part of the region.
In an optional embodiment of the present disclosure, a manner of setting the second weight distribution map of the first disparity map may include the following steps:
First, horizontal mirroring (for example, left mirroring or right mirroring) is performed on the first disparity map to generate a mirror disparity map. For ease of description, the mirror disparity map is referred to as a fourth disparity map.
Secondly, for a pixel in the fourth disparity map, in a case that a disparity value of the pixel is greater than a first variable for the pixel, a weight value of the pixel for the second weight distribution map of the first disparity map of the image to be processed is set to a first value, and otherwise, set to a second value. In the present disclosure, the first value is greater than the second value. For example, the first value may be 1, and the second value may be 0.
In an optional embodiment of the present disclosure, an example of the second weight distribution map of the first disparity map is illustrated in
In an optional embodiment of the present disclosure, the first variable for the pixel may be set according to both a disparity value of a corresponding pixel in the first disparity map and a constant value greater than zero. For example, a product of the disparity value of the corresponding pixel in the first disparity map and a constant value greater than zero is taken as the first variable for the corresponding pixel in the fourth disparity map.
In an optional embodiment of the present disclosure, the second weight distribution map of the first disparity map may be expressed by the following Formula (1):
In the above Formula (1), Ll represents the second weight distribution map of the first disparity map; dflipl′ represents the disparity value of the corresponding pixel in the fourth disparity map; dl represents the disparity value of the corresponding pixel in the first disparity map; and thresh1 represents a constant value greater than zero, where thresh1 ranges from 1.1 to 1.5, such as thresh1=1.2 or thresh1=1.25, and so on.
In an optional example, the second weight distribution map of the third disparity map may be set as follows: for a pixel in the first disparity map, in a case that a disparity value of the pixel in the first disparity map is greater than a second variable for the pixel, a weight value of the pixel for the second weight distribution map of the third disparity map is set to a first value, and otherwise, set to a second value. In an optional embodiment of the present disclosure, the first value is greater than the second value. For example, the first value may be 1 and the second value may be 0.
In an optional embodiment of the present disclosure, the second variable for the pixel may be set according to both a disparity value of a corresponding pixel in the fourth disparity map and a constant value greater than zero. For example, a left/right mirroring is performed on the first disparity map to generate a mirror disparity map, i.e., a fourth disparity map, and then, a product of a disparity value of a corresponding pixel in the fourth disparity map and a constant value greater than 0, is taken as the second variable for a corresponding pixel in the first disparity map.
In an optional embodiment of the present disclosure, an example of the third disparity map generated based on the image to be processed of
In an optional embodiment of the present disclosure, the second weight distribution map of the third disparity map may be expressed by the following Formula (2):
In the above Formula (2), Ll′ represents the second weight distribution map of the third disparity map; dflipl′ represents the disparity value of the corresponding pixel in the fourth disparity map; dl′ represents the disparity value of the corresponding pixel in the first disparity map; thresh2 represents a constant value greater than zero, and a value range of thresh2 may be 1.1-1.5, for example, thresh2=1.2 or thresh2=1.25, and so on.
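The two local weight distribution maps reduce to element-wise comparisons. The following sketch is one possible reading of Formulas (1) and (2) under the definitions above; the threshold values and the function name are illustrative:

```python
import numpy as np

def local_weight_maps(d_l, thresh1=1.2, thresh2=1.2):
    """d_l: first disparity map (H x W). Returns Ll and Ll' as 0/1 maps."""
    d_flip = np.fliplr(d_l)                                   # fourth disparity map (mirror of d_l)
    L_l = (d_flip > thresh1 * d_l).astype(np.float32)         # Formula (1): 1 where mirrored disparity dominates
    L_l_prime = (d_l > thresh2 * d_flip).astype(np.float32)   # Formula (2): 1 where original disparity dominates
    return L_l, L_l_prime
```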
Step C: the first disparity map of the image to be processed is optimally adjusted according to both the weight distribution map of the first disparity map of the image to be processed and the weight distribution map of the third disparity map of the image to be processed, to acquire an optimally adjusted disparity map, which is a finally obtained disparity map of the image to be processed.
In an optional example of the present disclosure, a plurality of disparity values of the first disparity map may be adjusted with both the first weight distribution map of the first disparity map and the second weight distribution map of the first disparity map, to acquire an adjusted first disparity map; a plurality of disparity values of the third disparity map are adjusted with both the first weight distribution map of the third disparity map and the second weight distribution map of the third disparity map, to acquire an adjusted third disparity map; and then, the adjusted first disparity map and the adjusted third disparity map are combined to acquire an optimally adjusted first disparity map of the image to be processed.
In an optional embodiment of the present disclosure, an example of acquiring the optimally adjusted first disparity map of the image to be processed is as follows:
Firstly, the first weight distribution map of the first disparity map and the second weight distribution map of the first disparity map are combined to acquire a third weight distribution map. The third weight distribution map may be expressed by the following Formula (3):
Wl = Ml + Ll · 0.5    Formula (3)
In the Formula (3), Wl represents the third weight distribution map; Ml represents the first weight distribution map of the first disparity map; Ll represents the second weight distribution map of the first disparity map; and 0.5 may also be replaced with another constant value.
Secondly, the first weight distribution map of the third disparity map and the second weight distribution map of the third disparity map are combined to acquire a fourth weight distribution map. The fourth weight distribution map may be expressed by the following Formula (4):
Wl′ = Ml′ + Ll′ · 0.5    Formula (4)
In Formula (4), Wl′ represents the fourth weight distribution map; Ml′ represents the first weight distribution map of the third disparity map; Ll′ represents the second weight distribution map of the third disparity map; and 0.5 may also be replaced with another constant value.
Thirdly, a plurality of disparity values of the first disparity map are adjusted according to the third weight distribution map, to acquire an adjusted first disparity map. For example, for a disparity value of a pixel in the first disparity map, the disparity value of the pixel is replaced with a product of the disparity value of the pixel and a weight value of a pixel at a corresponding position of the third weight distribution map. After all pixels in the first disparity map are subjected to the above replacement, the adjusted first disparity map is acquired.
And next, a plurality of disparity values of the third disparity map are adjusted according to the fourth weight distribution map, to acquire an adjusted third disparity map. For example, for a disparity value of a pixel in the third disparity map, the disparity value of the pixel is replaced with a product of the disparity value of the pixel and a weight value of a pixel at a corresponding position in the fourth weight distribution map. After all the pixels of the third disparity map are subjected to the above replacement, the adjusted third disparity map is acquired.
Finally, the adjusted first disparity map and the adjusted third disparity map are combined to finally acquire a disparity map of the image to be processed (that is, the final first disparity map). The finally acquired disparity map of the image to be processed may be expressed by the following Formula (5):
dfinal = Wl · dl + Wl′ · dflipl′    Formula (5)
In the Formula (5), dfinal represents the finally acquired disparity map of the image to be processed (as illustrated in the first view on the right side of
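A hedged sketch of the optimization of Formulas (3) to (5), assuming that all maps are numpy arrays of the same size and that the global weight maps Ml and Ml′ have already been constructed:

```python
def fuse_disparity(d_l, d_flip_l, M_l, L_l, M_l_prime, L_l_prime):
    """Combine global (M) and local (L) weights, then fuse the two disparity maps."""
    W_l = M_l + 0.5 * L_l                        # Formula (3): third weight distribution map
    W_l_prime = M_l_prime + 0.5 * L_l_prime      # Formula (4): fourth weight distribution map
    d_final = W_l * d_l + W_l_prime * d_flip_l   # Formula (5): finally acquired disparity map
    return d_final
```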
It should be noted that the sequence of the two steps of combining the first weight distribution map and the second weight distribution map is not limited in the present disclosure. For example, the two combining steps may be performed simultaneously or in sequence. In addition, the sequence of adjusting the disparity values of the first disparity map and adjusting the disparity values of the third disparity map is not limited in the present disclosure. For example, the two adjusting steps can be performed simultaneously or in sequence.
In an optional embodiment of the present disclosure, in a case that the image to be processed is taken as a left-eye image, phenomena such as missing left-side disparity and the left edge of the object being blocked usually exist. These phenomena make disparity values of corresponding regions in the disparity map of the image to be processed inaccurate. Similarly, in a case that the image to be processed is taken as a right-eye image, phenomena such as missing right-side disparity and the right edge of the object being blocked usually exist. These phenomena make disparity values of corresponding regions in the disparity map of the image to be processed inaccurate. In the present disclosure, the image to be processed is left/right mirrored, the disparity map of the mirrored image is also mirrored, and the disparity map of the image to be processed is adjusted with the disparity map that has been mirrored. Thus, the phenomenon that the disparity values of the corresponding regions in the disparity map of the image to be processed are inaccurate is mitigated, which helps to improve the accuracy of moving object detection.
In an optional example of the present disclosure, in an application scenario where the image to be processed is a binocular image, a manner of acquiring the first disparity map of the image to be processed according to the present disclosure includes, but is not limited to: acquiring the first disparity map of the image to be processed through stereo matching. For example, the first disparity map of the image to be processed may be acquired through a stereo matching algorithm, such as the Block Matching (BM) algorithm, the Semi-Global Block Matching (SGBM) algorithm, the Graph Cuts (GC) algorithm, or the like. For another example, disparity processing is performed on the image to be processed by a convolutional neural network, which is configured to acquire a disparity map of a binocular image, to acquire the first disparity map of the image to be processed.
In an optional example of the present disclosure, after acquiring the first disparity map of the image to be processed, depth information of pixels of the image to be processed may be acquired through the following Formula (6):
In the above Formula (6), Depth represents a depth value of a pixel; fx is a known value representing the focal length of the image pickup device in the horizontal direction (the X-axis direction of the three-dimensional coordinate system); b is a known value representing the baseline of the binocular image samples adopted for training the convolutional neural network that is configured to acquire the disparity map, and is a calibrated parameter of the binocular image pickup device; and Disparity represents a disparity of a pixel.
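As a minimal sketch of the relation described for Formula (6), with illustrative variable names, depth may be recovered as fx · b divided by the disparity:

```python
import numpy as np

def disparity_to_depth(disparity, fx, b, eps=1e-6):
    """Depth = fx * b / Disparity, guarding against zero disparity values."""
    return fx * b / np.maximum(disparity, eps)
```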
S110: Optical flow information between the image to be processed and a reference image is acquired.
In an optional example of the present disclosure, the image to be processed and the reference image may be two images that are collected by an image pickup device in a continuous photographing mode (such as multiple continuous photographing or video recording) and have a timing sequence relationship. The time interval between capturing the two images is usually short, to ensure that most contents of the two images are the same. For example, the time interval between capturing the two images may be a time interval between two adjacent video frames. For another example, the time interval between capturing the two images may be a time interval between two adjacent photos in the continuous photographing mode of the image pickup device. In an optional embodiment of the present disclosure, the image to be processed may be a video frame (such as the current video frame) of a video collected by the image pickup device, and the reference image for the image to be processed is another video frame of the video, for example, a preceding video frame of the current video frame. In the present disclosure, the case that the reference image is a video frame subsequent to the current video frame is not excluded. In an optional embodiment of the present disclosure, the image to be processed may be one of a plurality of images collected by the image pickup device in a continuous photographing mode, and the reference image for the image to be processed may be another image among the plurality of images, such as a preceding image or a subsequent image of the image to be processed. The image to be processed and the reference image in the present disclosure may both be RGB (Red Green Blue) images or the like. The image pickup device in the present disclosure may be an image pickup device provided on a moving object, for example, an image pickup device mounted on vehicles, trains, airplanes, and other transportation.
In an optional example of the present disclosure, the reference image is typically a monocular image. That is, the reference image is usually an image collected by a monocular image pickup device. In the case that the image to be processed and the reference image are both monocular images, moving object detection can be achieved in the present disclosure without a binocular image pickup device, thereby reducing the cost for moving object detection.
In an optional example of the present disclosure, the optical flow information between the image to be processed and the reference image may be considered to be a two-dimensional motion field of the pixels between the image to be processed and the reference image, and the optical flow information does not characterize the real movement of the pixels in the three-dimensional space. In the present disclosure, in the process of acquiring the optical flow information between the image to be processed and the reference image, posture change information of the image pickup device between capturing the image to be processed and capturing the reference image is introduced; that is, the optical flow information between the image to be processed and the reference image is acquired according to the posture change information of the image pickup device between capturing the image to be processed and capturing the reference image, which eliminates interference introduced due to the posture change of the image pickup device. In the present disclosure, a manner of acquiring the optical flow information between the image to be processed and the reference image according to the posture change information of the image pickup device may include the following steps:
Step 1. Posture change information of the image pickup device between capturing the image to be processed and capturing the reference image is acquired.
In an optional embodiment of the present disclosure, the posture change information refers to the difference between the posture of the image pickup device when the image to be processed is collected and the posture when the reference image is collected. The posture change information is posture change information based on the three-dimensional space. The posture change information may include translation information of the image pickup device and rotation information of the image pickup device. The translation information of the image pickup device may include displacement amounts of the image pickup device on three coordinate axes (for example, the coordinate system illustrated in
For example, the rotation information of the image pickup device can be expressed as the following Formula (7):
In the above Formula (7), R represents the rotation information, which is a 3×3 matrix whose entries are:
R11 = cos α cos γ − cos β sin α sin γ, R12 = −cos β cos γ sin α − cos α sin γ, R13 = sin α sin β,
R21 = cos γ sin α + cos α cos β sin γ, R22 = cos α cos β cos γ − sin α sin γ, R23 = −cos α sin β,
R31 = sin β sin γ, R32 = cos γ sin β, R33 = cos β,
and the Euler angles (α, β, γ) represent the rotation angles based on Roll, Yaw and Pitch.
In an optional embodiment of the present disclosure, the posture change information of the image pickup device between capturing the image to be processed and capturing the reference image may be acquired through vision technology, for example, Simultaneous Localization And Mapping (SLAM). Further, in the present disclosure, the posture change information of the image pickup device may be acquired through an RGBD model based on the open-source ORB-SLAM system, wherein RGBD stands for Red Green Blue Depth, ORB stands for Oriented FAST and Rotated BRIEF (a descriptor), and SLAM stands for Simultaneous Localization And Mapping. For example, an image to be processed (an RGB image), a depth map of the image to be processed, and a reference image (an RGB image) are input into an RGBD model, to acquire the posture change information according to the output of the RGBD model. In addition, the posture change information may also be acquired through other manners in the present disclosure. For example, the posture change information may be obtained through GPS (Global Positioning System) and an angular velocity sensor.
In an optional embodiment of the present disclosure, the posture change information may be expressed by a homogeneous matrix of 4×4 as indicated by the following Formula (8):
In the above Formula (8), Tlc represents the posture change information of the image pickup device between capturing the image to be processed (for example, the current video frame c) and capturing the reference image (for example, the preceding video frame l of the current video frame c), such as a posture change matrix; R represents the rotation information of the image pickup device, that is, the 3×3 matrix of Formula (7); and t represents the translation information of the image pickup device, that is, a translation vector. t is expressed by three translation components tx, ty and tz, wherein tx represents the translation component in the X-axis direction, ty represents the translation component in the Y-axis direction, and tz represents the translation component in the Z-axis direction.
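A small illustrative sketch, assuming R and t are already available as numpy arrays, of assembling the 4×4 homogeneous posture change matrix described for Formula (8):

```python
import numpy as np

def pose_matrix(R, t):
    """Assemble the 4x4 posture change matrix T from a 3x3 rotation R and translation (tx, ty, tz)."""
    T = np.eye(4)
    T[:3, :3] = R   # rotation information
    T[:3, 3] = t    # translation components tx, ty, tz
    return T
```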
Step 2: a correspondence between pixel values of the pixels in the image to be processed and pixel values of the pixels in the reference image is established according to the posture change.
In an optional embodiment of the present disclosure, in a case that the image pickup device is in motion, the posture of the image pickup device when capturing the image to be processed is usually different from the posture of the image pickup device when capturing the reference image. Therefore, a three-dimensional coordinate system corresponding to the image to be processed (that is, the three-dimensional coordinate system of the image pickup device when capturing the image to be processed) is different from a three-dimensional coordinate system corresponding to the reference image (that is, the three-dimensional coordinate system of the image pickup device when capturing the reference image). In the present disclosure, in the case of establishing the correspondence, three dimensional spatial positions of the pixels may be converted first, such that the pixels of the image to be processed and the pixels of the reference image are within the same three-dimensional coordinate system.
In an optional embodiment of the present disclosure, first coordinates of the pixels (for example, all the pixels) of the image to be processed within a three-dimensional coordinate system of the image pickup device corresponding to the image to be processed are first acquired according to the acquired depth information and parameters of the image pickup device (known values). That is, in the present disclosure, the pixels of the image to be processed are first converted into a three-dimensional space, to acquire the three-dimensional coordinates of the pixels (i.e., the first coordinates). For example, in the present disclosure, the three-dimensional coordinates of a pixel of the image to be processed may be acquired through the following Formula (9):
In the above Formula (9), Z represents a depth value of the pixel, and X, Y, and Z represent the three-dimensional coordinates of the pixel (i.e., the first coordinates); fx represents the focal length of the image pickup device in the horizontal direction (the X-axis direction of the three-dimensional coordinate system); fy represents the focal length of the image pickup device in the vertical direction (the Y-axis direction of the three-dimensional coordinate system); (u, v) represents the two-dimensional coordinates of the pixel in the image to be processed; (cx, cy) represents the coordinates of the principal point of the image pickup device; and Disparity represents a disparity of the pixel.
In an optional embodiment of the present disclosure, assume that any pixel of the image to be processed is expressed as pi(ui,vi), and after a plurality of pixels are converted into the three-dimensional space, any pixel is expressed as Pi(Xi,Yi,Zi). Then, a three-dimensional space point set constituted by a plurality of pixels (such as all the pixels) in the three-dimensional space can be expressed as {Pic}. Pic represents the three-dimensional coordinates of the i-th pixel of the image to be processed, namely Pi(Xi,Yi,Zi); c represents the image to be processed; and the value range of i depends on the number of the plurality of pixels. For example, if the number of the plurality of pixels is N (N is an integer greater than 1), the value range of i may be 1 to N or 0 to N−1.
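A minimal, vectorized sketch of the back-projection of Formula (9) under the symbol definitions above (the function name and the assumption that a full depth map is available are illustrative):

```python
import numpy as np

def back_project(depth, fx, fy, cx, cy):
    """depth: H x W depth map. Returns an H x W x 3 array of first coordinates (X, Y, Z)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    Z = depth
    X = (u - cx) * Z / fx   # horizontal back-projection
    Y = (v - cy) * Z / fy   # vertical back-projection
    return np.stack([X, Y, Z], axis=-1)
```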
In an optional embodiment of the present disclosure, after acquiring the first coordinates of the plurality of pixels (such as all pixels) of the image to be processed, the first coordinates of the plurality of pixels are converted into a three-dimensional coordinate system of the image pickup device corresponding to the reference image according to the posture change, to acquire second coordinates of the plurality of pixels. For example, in the present disclosure, the second coordinates of any pixel in the image to be processed may be acquired through the following Formula (10):
Pil = Tlc · Pic    Formula (10)
In the above Formula (10), Pil represents the second coordinates of the i-th pixel in the image to be processed; Tlc represents the posture change information (such as the posture change matrix of Formula (8)) of the image pickup device between capturing the image to be processed (such as the current video frame c) and capturing the reference image (such as the preceding video frame l of the current video frame c); and Pic represents the first coordinates of the i-th pixel in the image to be processed.
In an optional embodiment of the present disclosure, after acquiring the second coordinates of the plurality of pixels in the image to be processed, projected two-dimensional coordinates of the image to be processed that has been converted into the three-dimensional coordinate system corresponding to the reference image are acquired by performing a projection on the second coordinates of the plurality of pixels based on the two-dimensional coordinate system of the two-dimensional image. For example, in the present disclosure, the projected two-dimensional coordinates may be acquired through the following Formula (11):
In the above Formula (11), (u,v) represents the projected two-dimensional coordinates of the pixels in the image to be processed; fx represents the focal length of the image pickup device in the horizontal direction (X-axis direction of the three-dimensional coordinate system); fy represents the focal length of the image pickup device in the vertical direction (the Y-axis direction of the three-dimensional coordinate system); cx,cy represents the coordinates of the principal point of the image pickup device; (X,Y,Z) represents the second coordinates of the pixel in the image to be processed.
In an optional embodiment of the present disclosure, after acquiring the projected two-dimensional coordinates of the pixels in the image to be processed, a correspondence between pixel values of the pixels in the image to be processed and pixel values of the pixels in the reference image may be established according to both the projected two-dimensional coordinates and the two-dimensional coordinates in the reference image. For any position shared by the image formed by the projected two-dimensional coordinates and the reference image, the correspondence associates the pixel value of the pixel in the image to be processed with the pixel value of the pixel in the reference image.
Step 3: A conversion is performed on the reference image according to the correspondence.
In an optional embodiment of the present disclosure, warping is performed on the reference image with the correspondence, so that the reference image is converted into the image to be processed.
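A hedged sketch combining Step 2 and Step 3: the first coordinates are transformed by the posture change matrix (Formula (10)), projected back to two-dimensional coordinates (Formula (11)), and the reference image is sampled at the projected positions with cv2.remap. The interpolation choice and all names are assumptions rather than the exact implementation of the present disclosure:

```python
import numpy as np
import cv2

def warp_reference(reference, points_c, T_lc, fx, fy, cx, cy):
    """points_c: H x W x 3 first coordinates of the image to be processed.
    T_lc: 4x4 posture change matrix. Returns the reference image warped toward
    the image to be processed."""
    h, w = points_c.shape[:2]
    P = points_c.reshape(-1, 3)
    P_l = (T_lc[:3, :3] @ P.T).T + T_lc[:3, 3]   # Formula (10): second coordinates
    u = fx * P_l[:, 0] / P_l[:, 2] + cx          # Formula (11): projected two-dimensional coordinates
    v = fy * P_l[:, 1] / P_l[:, 2] + cy
    map_u = u.reshape(h, w).astype(np.float32)
    map_v = v.reshape(h, w).astype(np.float32)
    # Sample the reference image at the projected positions (warping).
    return cv2.remap(reference, map_u, map_v, interpolation=cv2.INTER_LINEAR)
```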
Step 4. The optical flow information between the image to be processed and the reference image is calculated according to the image to be processed and the reference image that has been subject to conversion.
In an optional embodiment of the present disclosure, the optical flow information includes, but is not limited to, dense optical flow information. For example, the optical flow information is calculated for all pixels of the image. The optical flow information may be acquired through vision technology in the present disclosure. For example, the optical flow information may be acquired through OpenCV (Open Source Computer Vision Library). Further, in the present disclosure, the image to be processed and the reference image that has been subjected to conversion may be input into a model based on OpenCV, which outputs the optical flow information between the two input images, so that the optical flow information between the image to be processed and the reference image may be acquired. The algorithm adopted in the model for calculating the optical flow information includes, but is not limited to, the Gunnar Farneback algorithm.
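As one possible realization of the dense optical flow computation mentioned above, OpenCV's Farneback implementation may be applied to the warped reference image and the image to be processed (the parameter values below are common defaults, not values specified by the present disclosure):

```python
import cv2

def dense_optical_flow(warped_reference, image_to_process):
    """Returns an H x W x 2 array holding (Δu, Δv) for every pixel."""
    prev_gray = cv2.cvtColor(warped_reference, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(image_to_process, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    return flow
```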
In an optional embodiment of the present disclosure, it is assumed that the optical flow information of any pixel in the image to be processed acquired in the present disclosure may be expressed as Iof(Δu, Δv). Then, the optical flow information of the pixel typically satisfies the following Formula (12):
It(ut, vt) + Iof(Δu, Δv) = It+1(ut+1, vt+1)    Formula (12)
In the above Formula (12), It(ut,vt) represents a pixel of the reference image; It+1(ut+1,vt+1) represents a pixel at a corresponding position of the image to be processed.
In an optional embodiment of the present disclosure, the reference image that has been subjected to warping (such as the preceding video frame that has been subjected to warping), the image to be processed (such as the current video frame) and the calculated optical flow information are illustrated in
S120: a three-dimensional motion field of the image to be processed with respect to the reference image is acquired according to both the depth information and the optical flow information.
In an optional example of the present disclosure, after acquiring the depth information and the optical flow information, the three-dimensional motion field of the pixels (such as all the pixels) of the image to be processed with respect to the reference image (which may be referred to as the 3D motion field of the pixels in the image to be processed) may be acquired. The three-dimensional motion field in the present disclosure can be considered as: a three-dimensional motion field generated by scene motion in a three-dimensional space. In other words, the three-dimensional motion field of the pixels of the image to be processed may be considered as: the three-dimensional spatial displacement of the pixels of the image to be processed with respect to the reference image. The three-dimensional motion field may be represented by Scene Flow.
In an optional embodiment of the present disclosure, a scene flow Isf(ΔX, ΔY, ΔZ) of a plurality of pixels of the image to be processed may be expressed by the following Formula (13):
In the above Formula (13), (ΔX, ΔY, ΔZ) represents the displacement of any pixel of the image to be processed in the three coordinate axis directions of the three-dimensional coordinate system; ΔIdepth represents a depth value of the pixel; (Δu, Δv) represents the optical flow information of the pixel, that is, the displacement of the pixel in the two-dimensional image between the image to be processed and the reference image; fx represents the focal length of the image pickup device in the horizontal direction (the X-axis direction of the three-dimensional coordinate system); fy represents the focal length of the image pickup device in the vertical direction (the Y-axis direction of the three-dimensional coordinate system); and (cx, cy) represents the coordinates of the principal point of the image pickup device.
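The algebraic form of Formula (13) is not reproduced here; as a hedged sketch under the symbol definitions above, the per-pixel three-dimensional displacement may be approximated by back-projecting each pixel in both frames, using the optical flow to locate the matching pixel, and taking the difference of the two three-dimensional points (all names are illustrative):

```python
import numpy as np
import cv2

def scene_flow(depth_ref, depth_cur, flow, fx, fy, cx, cy):
    """Per-pixel 3D displacement (ΔX, ΔY, ΔZ) of the image to be processed w.r.t. the reference."""
    h, w = depth_cur.shape
    u, v = np.meshgrid(np.arange(w, dtype=np.float32), np.arange(h, dtype=np.float32))
    # Back-project each pixel of the image to be processed (current frame).
    Z1 = depth_cur
    X1 = (u - cx) * Z1 / fx
    Y1 = (v - cy) * Z1 / fy
    # Approximate location of the same scene point in the (warped) reference frame.
    u0 = u - flow[..., 0]
    v0 = v - flow[..., 1]
    Z0 = cv2.remap(depth_ref.astype(np.float32), u0, v0, interpolation=cv2.INTER_LINEAR)
    X0 = (u0 - cx) * Z0 / fx
    Y0 = (v0 - cy) * Z0 / fy
    return np.stack([X1 - X0, Y1 - Y0, Z1 - Z0], axis=-1)
```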
S130: A moving object involved in the image to be processed is determined according to the three-dimensional motion field.
In an optional example of the present disclosure, three-dimensional movement information of an object involved in the image to be processed may be determined according to the three-dimensional motion field. The three-dimensional movement information of the object may indicate whether the object is a moving object. In an optional embodiment of the present disclosure, the three-dimensional movement information of the pixels in the image to be processed may be acquired first according to the three-dimensional motion field; and then, a clustering is performed on the pixels according to the three-dimensional movement information of the pixels; and finally, the three-dimensional movement information of an object involved in the image to be processed is determined according to a result of the clustering, to determine a moving object involved in the image to be processed.
In an optional example of the present disclosure, the three-dimensional movement information of the pixels in the image to be processed may include, but is not limited to: three-dimensional speeds of a plurality of pixels (such as all the pixels) of the image to be processed. The speeds here are typically in the form of a vector; that is, the speed of a pixel in the present disclosure can reflect both the speed magnitude of the pixel and the speed direction of the pixel. In the present disclosure, the three-dimensional movement information of the pixels in the image to be processed can be easily acquired by means of the three-dimensional motion field.
In an optional example of the present disclosure, the three-dimensional space in the present disclosure includes: a three-dimensional space based on a three-dimensional coordinate system. The three-dimensional coordinate system may be: the three-dimensional coordinate system of the image pickup device that captures the image to be processed. The Z axis of the three-dimensional coordinate system is typically the optical axis of the image pickup device, that is, the depth direction. In an application scenario where the image pickup device is mounted on a vehicle, an example of the X axis, the Y axis, the Z axis and the origin of the three-dimensional coordinate system of the present disclosure is illustrated in
In an optional example of the present disclosure, speeds of a pixel of the image to be processed in the three coordinate axis directions of a three-dimensional coordinate system of the image pickup device corresponding to the image to be processed may be calculated according to both the three-dimensional motion field and the time difference Δt between capturing the image to be processed and capturing the reference image by the image pickup device. Further, in the present disclosure, the speed may be acquired through the following Formula (14):
In the above Formula (14), vx, vy and vz respectively represent the speeds of a pixel of the image to be processed in the three coordinate axis directions of the three-dimensional coordinate system of the image pickup device corresponding to the image to be processed; (ΔX, ΔY, ΔZ) represents displacements of the pixel of the image to be processed in the three coordinate axis directions of the three-dimensional coordinate system of the image pickup device corresponding to the image to be processed; Δt represents the time difference between capturing the image to be processed and capturing the reference image by the image pickup device.
The magnitude |v| of the above speed may be expressed by the following Formula (15):
|v| = √(vx² + vy² + vz²)  Formula (15)
The direction {right arrow over (v)} of the above speed may be expressed by the following Formula (16):
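A minimal NumPy sketch of the velocity computation of Formulas (14) to (16) follows. It assumes that Formula (16) reduces to the unit vector of the per-pixel velocity; the function name and the small epsilon guard are illustrative.

```python
import numpy as np

def pixel_velocity(dX, dY, dZ, dt):
    """Formula (14): per-axis speeds as displacement over the time difference dt;
    Formula (15): speed magnitude; and the unit direction (assumed reading of Formula (16))."""
    dX, dY, dZ = (np.asarray(a, dtype=float) for a in (dX, dY, dZ))
    vx, vy, vz = dX / dt, dY / dt, dZ / dt
    speed = np.sqrt(vx ** 2 + vy ** 2 + vz ** 2)
    direction = np.stack([vx, vy, vz], axis=-1) / np.maximum(speed, 1e-9)[..., None]
    return vx, vy, vz, speed, direction
```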
In an optional example of the present disclosure, a motion region of the image to be processed may be first determined, and a clustering is performed on the pixels of the motion region. For example, a clustering is performed on the pixels of the motion region according to three-dimensional movement information of the pixels in the motion region. For another example, a clustering is performed on pixels of the motion region according to the three-dimensional movement information of pixels in the motion region and three-dimensional positions of pixels. In an optional embodiment of the present disclosure, the motion region in the image to be processed may be determined through a motion mask. For example, in the present disclosure, a motion mask of the image to be processed may be acquired according to the three-dimensional movement information of the pixels.
In an optional embodiment of the present disclosure, the speed magnitudes of a plurality of pixels (such as all the pixels) of the image to be processed may be filtered with a preset speed threshold, to form a motion mask of the image to be processed according to the filtering result. For example, in the present disclosure, the motion mask of the image to be processed may be obtained through the following Formula (17):
In the above Formula (17), Imotion represents a pixel in the motion mask; in a case that the speed magnitude |v| of the pixel is greater than or equal to the preset speed threshold v_thresh, a value of the pixel is 1, indicating that the pixel belongs to the motion region of the image to be processed; otherwise, the value of the pixel is 0, indicating that the pixel does not belong to the motion region of the image to be processed.
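The thresholding of Formula (17) can be sketched in a few lines of NumPy; the function name is illustrative, and the threshold v_thresh is taken from the description above.

```python
import numpy as np

def motion_mask(speed, v_thresh):
    """Formula (17): a pixel whose speed magnitude |v| is greater than or equal to the
    preset threshold v_thresh is marked 1 (motion region), otherwise 0."""
    return (np.asarray(speed) >= v_thresh).astype(np.uint8)
```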
In an optional embodiment of the present disclosure, a region composed of pixels with a value of 1 in the motion mask is taken as a motion region, and the size of the motion mask is the same as the size of the image to be processed. Therefore, in the present disclosure, the motion region of the image to be processed may be determined according to the motion region of the motion mask. An example of the motion mask in the present disclosure is illustrated in
In an optional example of the present disclosure, in a case of performing a clustering according to the three-dimensional spatial positions and the movement information of the pixels in the motion region, the three-dimensional spatial positions and the movement information of the pixels in the motion region are first standardized, so that the three-dimensional coordinates of the pixels in the motion region are converted into a predetermined coordinate interval (such as [0, 1]), and the speeds of the pixels of the motion region are converted into a predetermined speed interval (such as [0, 1]). Then, a density clustering is performed with the converted three-dimensional spatial coordinates and the converted speeds, to acquire at least one class cluster.
In an optional embodiment of the present disclosure, the standardization includes, but is not limited to, min-max standardization, Z-score standardization, and the like.
For example, the min-max standardization for the three-dimensional spatial position information of the pixels of the motion region can be expressed by the following Formula (18), and the min-max standardization for the movement information of the pixels in the motion region may be expressed by the following Formula (19):
In the above Formula (18), (X,Y,Z) represents the three-dimensional spatial position information of a pixel of the motion region of the image to be processed; (X*,Y*,Z*) represents the three-dimensional spatial position information of the pixel that has been subject to standardization; (Xmin, Ymin, Zmin) represents the minimum X coordinate, the minimum Y coordinate, and the minimum Z coordinate of the three-dimensional spatial position information of all pixels of the motion region; (Xmax,Ymax,Zmax) represents the maximum X coordinate, the maximum Y coordinate, and the maximum Z coordinate of the three-dimensional spatial position information of all pixels of the motion region.
In the above Formula (19), (vx,vy,vz) represents the three-dimensional speeds of a pixel of the motion region in the three coordinate axis directions; (vx*,vy*,vz*) represents the speeds after the min-max standardization of (vx,vy,vz); (vx min,vy min,vz min) represents the minimum speeds of all pixels of the motion region in the three coordinate axis directions; (vx max,vy max,vz max) represents the maximum speeds of all pixels of the motion region in the three coordinate axis directions.
In an optional example of the present disclosure, clustering algorithms adopted for the clustering include, but are not limited to, density clustering algorithms, for example, Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and so on. Each class cluster acquired by clustering corresponds to an instance of moving object, that is, each class cluster may be regarded as a moving object involved in the image to be processed.
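The standardization of Formulas (18) and (19) followed by a density clustering can be sketched as below, using scikit-learn's DBSCAN as one concrete choice; the epsilon guard and the eps/min_samples values are illustrative assumptions, not values from the disclosure.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_motion_pixels(xyz, vel, eps=0.05, min_samples=20):
    """Min-max standardize the positions (Formula (18)) and the speeds (Formula (19)) of
    the pixels in the motion region, then run DBSCAN on the joint 6-D features.
    xyz and vel are N x 3 arrays; eps and min_samples are illustrative values only."""
    def min_max(a):
        a = np.asarray(a, dtype=float)
        lo, hi = a.min(axis=0), a.max(axis=0)
        return (a - lo) / np.maximum(hi - lo, 1e-9)
    feats = np.hstack([min_max(xyz), min_max(vel)])
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(feats)
    return labels  # -1 marks noise; each non-negative label is one class cluster
```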
In an optional example of the present disclosure, for any class cluster, the speed magnitude and the speed direction of a moving object instance corresponding to the class cluster may be determined according to speed magnitudes and speed directions of a plurality of pixels of the class cluster (for example, all the pixels). In an optional example of the present disclosure, the speed magnitude and the speed direction of the moving object instance corresponding to the class cluster may be expressed by average speed and average direction of all pixels of the class cluster. For example, the speed magnitude and the speed direction of the moving object instance corresponding to a class cluster may be expressed by the following Formula (20):
In the above Formula (20), |vo| represents the speed magnitude of the moving object instance corresponding to a class cluster obtained by clustering; |vi| represents the speed magnitude of the i-th pixel of the class cluster; n represents the number of pixels contained in the class cluster; {right arrow over (v)}o represents the speed direction of the moving object instance corresponding to the class cluster; {right arrow over (v)}i represents the speed direction of the i-th pixel of the class cluster.
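A short sketch of the per-cluster aggregation of Formula (20) follows; the re-normalization of the averaged direction is an assumption of this sketch.

```python
import numpy as np

def cluster_speed(speed_mag, direction):
    """Assumed reading of Formula (20): the instance speed is the mean of the pixel speed
    magnitudes and the instance direction is the (re-normalized) mean of the pixel unit
    directions. speed_mag has shape (n,), direction has shape (n, 3)."""
    v_o = float(np.mean(speed_mag))
    d = np.mean(np.asarray(direction, dtype=float), axis=0)
    return v_o, d / max(np.linalg.norm(d), 1e-9)
```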
In an optional example of the present disclosure, a moving object bounding box in the image to be processed for the moving object instance corresponding to the class cluster may be determined according to position information of a plurality of pixels of a class cluster (for example, all the pixels) in a two-dimensional image (i.e., two-dimensional coordinates in the image to be processed). For example, in the present disclosure, for a class cluster, the maximum column coordinate umax and the minimum column coordinate umin of all pixels of the class cluster in the image to be processed may be calculated, and the maximum row coordinate vmax and the minimum row coordinate vmin of all pixels of the class cluster may be calculated (Note: it is assumed that the origin of the image coordinate system is positioned at the upper left corner of the image). In the present disclosure, the coordinates of the acquired moving object bounding box in the image to be processed may be expressed as (umin,vmin,umax,vmax).
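The bounding box derivation reduces to per-cluster minima and maxima of the pixel image coordinates, as in the small sketch below (function and argument names are illustrative).

```python
import numpy as np

def cluster_bbox(cols, rows):
    """Moving object bounding box (umin, vmin, umax, vmax) from the column and row
    coordinates of the pixels of one class cluster; the image origin is at the upper
    left corner, as noted above."""
    u, v = np.asarray(cols), np.asarray(rows)
    return int(u.min()), int(v.min()), int(u.max()), int(v.max())
```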
In an optional embodiment of the present disclosure, an example of the determined moving object bounding box in the image to be processed is illustrated in the lower view of
In an optional example of the present disclosure, the three-dimensional position information of the moving object may also be determined according to the three-dimensional position information of a plurality of pixels of the same class cluster. The three-dimensional position information of the moving object includes, but is not limited to, coordinates of the moving object on the horizontal coordinate axis (X coordinate axis), coordinates of the moving object on the depth coordinate axis (Z coordinate axis), a height of the moving object in the vertical direction (i.e., the height of the moving object), etc.
In an embodiment of the present disclosure, the distances between all pixels of a class cluster and the image pickup device may be determined first according to the three-dimensional position information of all pixels of the same class cluster, and then the three-dimensional position information of the pixel with the minimum distance is taken as the three-dimensional position information of the moving object.
In an optional embodiment of the present disclosure, a distance between each of a plurality of pixels of a class cluster and the image pickup device is calculated through the following Formula (21), and the minimum distance is selected:
dmin = min(√(Xi² + Zi²))  Formula (21)
In the above Formula (21), dmin represents the minimum distance; Xi represents the X coordinate of the i-th pixel of a class cluster; Zi represents the Z coordinate of the i-th pixel of the class cluster.
After determining the minimum distance, the X coordinate and the Z coordinate of the pixel with the minimum distance may be taken as the three-dimensional position information of the moving object, as expressed in the following Formula (22):
OX = Xclose, OZ = Zclose  Formula (22)
In the above Formula (22), OX represents the coordinate of the moving object on the horizontal coordinate axis, that is, the X coordinate of the moving object; OZ represents the coordinate of the moving object on the depth coordinate axis (Z coordinate axis), that is, the Z coordinate of the moving object; Xclose represents the calculated X coordinate of the pixel with the minimum distance; Zclose represents the calculated Z coordinate of the pixel with the minimum distance.
In an optional embodiment of the present disclosure, the height of the moving object may be calculated through the following Formula (23):
OH = Ymax − Ymin  Formula (23)
In the above Formula (23), OH represents the height of the moving object in the three-dimensional space; Ymax represents the maximum Y coordinate of all pixels of a class cluster in the three-dimensional space; Ymin represents the minimum Y coordinate of all pixels of a class cluster in the three-dimensional space.
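Formulas (21) to (23) can be combined into one small sketch over the pixels of a class cluster; the function name is illustrative.

```python
import numpy as np

def cluster_position_and_height(xyz):
    """Formulas (21) to (23): take the X and Z coordinates of the cluster pixel closest to
    the image pickup device as the object position, and the Y extent of the cluster as the
    object height. xyz is an n x 3 array of (X, Y, Z) coordinates of one class cluster."""
    xyz = np.asarray(xyz, dtype=float)
    i = int(np.argmin(np.sqrt(xyz[:, 0] ** 2 + xyz[:, 2] ** 2)))  # Formula (21)
    o_x, o_z = xyz[i, 0], xyz[i, 2]                               # Formula (22)
    o_h = xyz[:, 1].max() - xyz[:, 1].min()                       # Formula (23)
    return o_x, o_z, o_h
```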
S1700: A monocular image sample of binocular image samples is input into a convolutional neural network to be trained.
In an optional embodiment of the present disclosure, the image sample input into the convolutional neural network may always be a left-eye image sample of binocular image samples, or may always be a right-eye image sample of binocular image samples. In the case that the image sample input into the convolutional neural network is always the left-eye image sample of the binocular image samples, the successfully trained convolutional neural network will take an input image to be processed as a left-eye image to be processed in testing or actual application scenarios. In the case that the image sample input into the convolutional neural network is always the right-eye image sample of the binocular image samples, the successfully trained convolutional neural network will take an input image to be processed as a right-eye image to be processed in testing or actual application scenarios.
S1710: Disparity analysis is performed by a convolutional neural network, and a disparity map of the left-eye image sample and a disparity map of the right-eye image sample are acquired based on output of the convolutional neural network.
S1720: A right-eye image is reconstructed according to the left-eye image sample and the disparity map of the right-eye image sample.
In an optional embodiment of the present disclosure, a manner of reconstructing the right-eye image includes, but is not limited to: performing a re-projection on the left-eye image sample and the disparity map of the right-eye image sample to acquire a reconstructed right-eye image.
S1730: A left-eye image is reconstructed according to the right-eye image sample and the disparity map of the left-eye image sample.
In an optional embodiment of the present disclosure, a manner of reconstructing the left-eye image includes, but is not limited to: performing a re-projection on the right-eye image sample and the disparity map of the left-eye image sample, to acquire a reconstructed left-eye image.
S1740: A network parameter of the convolutional neural network is adjusted according to both a difference between the reconstructed left-eye image and the left-eye image sample and a difference between the reconstructed right-eye image and the right-eye image sample.
In an optional embodiment of the present disclosure, in a case of determining the differences, an adopted loss function includes, but is not limited to: an L1 loss function, a smoothness loss function, an lr-consistency loss function, etc. In addition, in the present disclosure, in a case that the calculated loss is back propagated to adjust the network parameter of the convolutional neural network (such as a weight of a convolution kernel), the loss may be back propagated according to a gradient calculated based on the chain rule through the convolutional neural network, which helps to improve training efficiency of the convolutional neural network.
In an optional example of the present disclosure, in a case that the training for the convolutional neural network satisfies a predetermined iterative condition, the training process ends. The predetermined iterative condition in an embodiment of the present disclosure may include: a difference between the left-eye image reconstructed according to the disparity map output by the convolutional neural network and the left-eye image sample, and a difference between the right-eye image reconstructed according to the disparity map output by the convolutional neural network and the right-eye image sample, both meet a predetermined difference requirement. If the differences meet the requirement, the convolutional neural network is successfully trained this time. The predetermined iterative condition in the present disclosure may further include: the number of binocular image samples used for training the convolutional neural network reaches a predetermined number requirement, etc. In a case that the number of binocular image samples used for training the convolutional neural network reaches the predetermined number requirement but the difference between the left-eye image reconstructed according to the disparity map output by the convolutional neural network and the left-eye image sample and the difference between the right-eye image reconstructed according to the disparity map output by the convolutional neural network and the right-eye image sample do not meet the predetermined difference requirement, the convolutional neural network is not successfully trained this time.
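A compact PyTorch-style sketch of the self-supervised loop in S1700 to S1740 is given below. The hypothetical disp_net returning a left-view and a right-view disparity map for a left-eye sample, the sign of the horizontal shift, and the use of a plain L1 photometric loss (omitting the smoothness and lr-consistency terms mentioned above) are all assumptions of this sketch, not the disclosure's exact procedure.

```python
import torch
import torch.nn.functional as F

def warp_horizontally(src, disp):
    """Bilinearly sample src (N,3,H,W) at columns shifted by disp (N,1,H,W) pixels.
    The sign of the shift depends on the rectification convention and is an assumption here."""
    n, _, h, w = src.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h, device=src.device),
                            torch.linspace(-1, 1, w, device=src.device), indexing="ij")
    x = xs.unsqueeze(0) + 2.0 * disp.squeeze(1) / max(w - 1, 1)
    grid = torch.stack((x, ys.unsqueeze(0).expand(n, -1, -1)), dim=-1)
    return F.grid_sample(src, grid, mode="bilinear", padding_mode="border", align_corners=True)

def train_step(disp_net, optimizer, left, right):
    disp_l, disp_r = disp_net(left)                     # S1700/S1710: monocular input, two disparity maps
    rec_right = warp_horizontally(left, disp_r)         # S1720: reconstruct the right-eye image
    rec_left = warp_horizontally(right, -disp_l)        # S1730: reconstruct the left-eye image
    loss = F.l1_loss(rec_left, left) + F.l1_loss(rec_right, right)  # S1740: photometric differences
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```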
S1800: a video stream of the road where the vehicle is located is acquired through an image pickup device mounted on the vehicle. The image pickup device includes, but is not limited to, an RGB-based image pickup device.
S1810: moving object detection is performed on at least one video frame of the video stream to acquire a moving object involved in the video frame, for example, to acquire movement information of an object in the video frame in a three-dimensional space. For the specific implementation process of this step, please refer to the description of
S1820: a vehicle control instruction is generated and output according to the moving object involved in the video frame. For example, the vehicle control instruction is generated according to the three-dimensional movement information of the object in the video frame and is output to control the vehicle.
In an optional embodiment of the present disclosure, the generated control instructions include, but are not limited to: a speed maintaining control instruction, a speed adjusting control instruction (such as for decelerating or for accelerating), a direction maintaining control instruction, a direction adjusting control instruction (such as for turning left, for turning right, for changing to the left lane, or for changing to the right lane), a whistling instruction, a warning prompt control instruction, or a driving mode switching control instruction (such as switching to an automatic cruise driving mode).
It should be particularly noted that the moving object detection technology according to the present disclosure can be applied in the field of intelligent driving control, and can further be applied in other fields, for example, moving object detection in industrial manufacturing, indoor moving object detection (such as in supermarkets), and moving object detection in the security field, etc. The present disclosure does not limit the application scenarios of the moving object detection technology.
The first acquiring module 1900 is configured to acquire depth information of pixels of the image to be processed. In an optional embodiment of the present disclosure, the first acquiring module 1900 may include: a first sub-module and a second sub-module. The first sub-module is configured to acquire a first disparity map of the image to be processed. The second sub-module is configured to obtain the depth information of the pixels of the image to be processed according to the first disparity map of the image to be processed. In an optional embodiment of the present disclosure, the image to be processed includes: a monocular image. The first sub-module includes: a first unit, a second unit, and a third unit. The first unit is configured to input the image to be processed into a convolutional neural network, for a disparity analysis by the convolutional neural network, to acquire the first disparity map of the image to be processed based on output of the convolutional neural network. The convolutional neural network is trained by a training module with binocular image samples. The second unit is configured to acquire a second horizontal mirror image of a second disparity map of a first horizontal mirror image of the image to be processed. The first horizontal mirror image of the image to be processed is a mirror image generated through performing a horizontal mirroring on the image to be processed. The second horizontal mirror image of the second disparity map is a mirror image generated through performing a horizontal mirroring on the second disparity map. The third unit is configured to adjust disparity of the first disparity map of the image to be processed according to a weight distribution map of the first disparity map of the image to be processed and a weight distribution map of the second horizontal mirror image of the second disparity map, to finally acquire the first disparity map of the image to be processed.
In an optional embodiment of the present disclosure, the second unit may input the first horizontal mirror image of the image to be processed into the convolutional neural network for a disparity analysis by the convolutional neural network, to acquire a second disparity map of the first horizontal mirror image of the image to be processed based on output of the convolutional neural network; the second unit performs mirroring on the second disparity map of the first horizontal mirror image of the image to be processed, to obtain a second horizontal mirror image of the second disparity map of the first horizontal mirror image of the image to be processed.
In an optional embodiment of the present disclosure, the weight distribution map includes at least one of: a first weight distribution map indicating a weight distribution map uniformly set for a plurality of images to be processed, and a second weight distribution map indicating a weight distribution map set individually for each of different images to be processed. The first weight distribution map includes at least two horizontally juxtaposed regions with different weight values.
In a case that the image to be processed is taken as a left-eye image, for any two regions in the first weight distribution map of the first disparity map of the image to be processed, a weight value of a right region is greater than a weight value of a left region, and for any two regions in the first weight distribution map of the second horizontal mirror image of the second disparity map, a weight value of a right region is greater than a weight value of a left region. For at least one region of the first weight distribution map of the first disparity map of the image to be processed, a weight value of the left part of the region is not greater than a weight value of the right part of the region; for at least one region of the first weight distribution map of the second horizontal mirror image of the second disparity map, a weight value of the left part of the region is not greater than a weight value of the right part of the region.
In a case that the image to be processed is taken as a right-eye image, for any two regions in the first weight distribution map of the first disparity map of the image to be processed, a weight value of a left region is greater than a weight value of a right region. For any two regions in the first weight distribution map of the second horizontal mirror image of the second disparity map, a weight value of the left region is greater than a weight value of the right region. For at least one region of the first weight distribution map of the first disparity map of the image to be processed, a weight value of the right part of the region is not greater than a weight value of the left part of the region; and for at least one region of the first weight distribution map of the second horizontal mirror image of the second disparity map, a weight value of the right part of the region is not greater than a weight value of the left part of the region.
In an optional embodiment of the present disclosure, the third unit is further configured to set a second weight distribution map of the first disparity map of the image to be processed. For example, the third unit performs horizontal mirroring on the first disparity map of the image to be processed to generate a mirror disparity map. For a pixel of the mirror disparity map, in a case that a disparity value of the pixel is greater than a first variable for the pixel, a weight value of the pixel for the second weight distribution map of the image to be processed is set to a first value, and otherwise, set to a second value; wherein the first value is greater than the second value. The first variable for the pixel is set according to both the disparity value of the pixel in the first disparity map of the image to be processed and a constant value greater than zero.
In an optional embodiment of the present disclosure, the third unit is further configured to set a second weight distribution map of the second horizontal mirror image of the second disparity map. For example, for a pixel of the second horizontal mirror image of the second disparity map, the third unit sets a weight value of the pixel for the second weight distribution map of the second horizontal mirror image of the second disparity map to a first value in a case that a disparity value of the pixel in the first disparity map of the image to be processed is greater than a second variable for the pixel, and to a second value otherwise; wherein the first value is greater than the second value. The second variable for the pixel is set according to both a disparity value of a corresponding pixel in the horizontal mirror image of the first disparity map of the image to be processed and a constant value greater than zero.
In an optional embodiment of the present disclosure, the third unit may be further configured to: first, adjust the disparity value of the first disparity map of the image to be processed according to both the first weight distribution map of the first disparity map of the image to be processed and the second weight distribution map of the first disparity map of the image to be processed; next, the third unit adjusts a disparity value of the second horizontal mirror image of the second disparity map according to both the first weight distribution map of the second horizontal mirror image of the second disparity map and the second weight distribution map of the second horizontal mirror image of the second disparity map; and finally, the third unit combines the first disparity map that has been subject to disparity value adjustment and the second horizontal mirror image that has been subject to disparity value adjustment, to finally acquire the first disparity map of the image to be processed. For the operations performed by the first acquiring module 1900 and the sub-modules and the units thereof, reference may be made to the foregoing description of S100, which is not elaborated here.
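A small NumPy sketch of this mirror-based fusion is shown below under assumed names. The hypothetical net returns an H x W disparity map, and the simple complementary left-to-right ramp used here is only a stand-in for the first and second weight distribution maps described above (a left-eye input is assumed).

```python
import numpy as np

def fused_disparity(net, image):
    """Run the disparity network on the image and on its horizontal mirror, mirror the
    second result back, and blend the two maps with per-pixel weights (a stand-in for
    the weight distribution maps of the text)."""
    disp1 = net(image)               # first disparity map of the image to be processed
    disp2 = net(image[:, ::-1])      # second disparity map of the first horizontal mirror image
    disp2_m = disp2[:, ::-1]         # second horizontal mirror image of the second disparity map
    h, w = disp1.shape
    ramp = np.clip(np.linspace(0.0, 2.0, w), 0.0, 1.0)   # rises from 0 to 1 over the left half
    w1 = np.tile(ramp, (h, 1))                           # heavier weight toward the right
    return w1 * disp1 + (1.0 - w1) * disp2_m
```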
The second acquiring module 1910 is configured to acquire optical flow information between the image to be processed and a reference image. The reference image and the image to be processed are two images that are collected by an image pickup device in a continuous photographing mode and have a timing sequence relationship. For example, the image to be processed is a video frame of a video collected by the image pickup device, and the reference image for the image to be processed includes: a preceding video frame of the video frame.
In an optional embodiment of the present disclosure, the second acquiring module 1910 may include: a third sub-module, a fourth sub-module, a fifth sub-module, and a sixth sub-module. The third sub-module is configured to acquire posture change of the image pickup device between capturing the image to be processed and capturing the reference image; the fourth sub-module is configured to establish a correspondence between pixel values of pixels in the image to be processed and pixel values of pixels in the reference image according to the posture change; the fifth sub-module is configured to convert the reference image according to the correspondence; the sixth sub-module is configured to calculate the optical flow information between the image to be processed and the reference image according to both the image to be processed and the converted reference image. The fourth sub-module may first acquire first coordinates of the pixels of the image to be processed within a three-dimensional coordinate system of the image pickup device corresponding to the image to be processed according to both the depth information and a preset parameter of the image pickup device; next, the fourth sub-module converts the first coordinates to second coordinates within the three-dimensional coordinate system of the image pickup device corresponding to the reference image according to the posture change; and then, the fourth sub-module performs a projection on the second coordinates based on a two-dimensional coordinate system of the two-dimensional image to acquire projected two-dimensional coordinates of the image to be processed; and finally, the fourth sub-module establishes a correspondence between pixel values of pixels in the image to be processed and pixel values of pixels in the reference image according to both the projected two-dimensional coordinates of the image to be processed and two-dimensional coordinates of the reference image. For specific operations performed by the second acquiring module 1910 and by the sub-modules and the units of the second acquiring module, please refer to the foregoing description of S110, which is not elaborated here.
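As one concrete (and hypothetical) way to realize this pipeline, the sketch below back-projects the pixels with their depth, moves them with the camera pose change, re-projects them, warps the reference image accordingly, and then computes a residual optical flow with OpenCV's Farneback method. The intrinsic matrix K, the 4 x 4 pose change T_cur_to_ref, and the choice of the Farneback algorithm are assumptions of this sketch.

```python
import cv2
import numpy as np

def ego_compensated_flow(cur_gray, ref_gray, depth, K, T_cur_to_ref):
    """Back-project the pixels of the image to be processed (8-bit grayscale cur_gray) with
    their depth, move them into the reference camera frame with T_cur_to_ref, project them
    to 2D, warp the reference image (ref_gray) to the current view with that correspondence,
    and compute optical flow on the aligned pair."""
    h, w = depth.shape
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # first coordinates: 3D points in the current camera's coordinate system
    X = (u - cx) * depth / fx
    Y = (v - cy) * depth / fy
    pts = np.stack([X, Y, depth, np.ones_like(depth)], axis=-1).reshape(-1, 4)
    # second coordinates: the same points in the reference camera's coordinate system
    pts_ref = (T_cur_to_ref @ pts.T).T
    # project onto the reference image plane
    z = np.maximum(pts_ref[:, 2], 1e-6)
    map_x = (fx * pts_ref[:, 0] / z + cx).reshape(h, w).astype(np.float32)
    map_y = (fy * pts_ref[:, 1] / z + cy).reshape(h, w).astype(np.float32)
    # convert the reference image so that static pixels line up with the current image
    ref_warped = cv2.remap(ref_gray, map_x, map_y, cv2.INTER_LINEAR)
    # residual optical flow between the image to be processed and the converted reference image
    return cv2.calcOpticalFlowFarneback(ref_warped, cur_gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
```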
The third acquiring module 1920 is configured to acquire a three-dimensional motion field of the pixels of the image to be processed with respect to the reference image according to the depth information and the optical flow information. For the specific operations performed by the third acquiring module 1920, please refer to the above description of S120, which is not elaborated here.
The moving object determining module 1930 is configured to determine a moving object involved in the image to be processed according to the three-dimensional motion field. In an optional embodiment of the present disclosure, the moving object determining module may include: a seventh sub-module, an eighth sub-module, and a ninth sub-module. The seventh sub-module is configured to acquire movement information of the pixels in the image to be processed in a three-dimensional space according to the three-dimensional motion field. For example, the seventh sub-module can calculate speeds of the pixels in the image to be processed in the three coordinate axis directions of the three-dimensional coordinate system of the image pickup device corresponding to the image to be processed according to both the three-dimensional motion field and the time difference between capturing the image to be processed and capturing the reference image. The eighth sub-module is configured to perform a clustering on the pixels according to their three-dimensional movement information. For example, the eighth sub-module includes: a fourth unit, a fifth unit, and a sixth unit. The fourth unit is configured to acquire a motion mask of the image to be processed according to the three-dimensional movement information of the pixels. The three-dimensional movement information of the pixels includes the magnitudes of the three-dimensional speeds of the pixels. The fourth unit can filter the speed magnitudes of the pixels in the image to be processed with a preset speed threshold to generate the motion mask of the image to be processed. The fifth unit is configured to determine the motion region in the image to be processed according to the motion mask. The sixth unit is configured to perform a clustering on the pixels in the motion region according to the three-dimensional spatial position and the movement information of the pixels in the motion region. For example, the sixth unit can convert the three-dimensional coordinates of the pixels in the motion region into a predetermined coordinate interval; and then, the sixth unit converts the speeds of the pixels of the motion region into a predetermined speed interval; and finally, the sixth unit performs a density clustering on the pixels of the motion region according to the converted three-dimensional spatial coordinates and the converted speeds of the pixels of the motion region, to obtain at least one class cluster. The ninth sub-module is configured to determine a moving object involved in the image to be processed according to the result of the clustering. For example, for any class cluster, the ninth sub-module may determine the speed magnitude and the speed direction of the moving object according to the speed magnitudes and the speed directions of a plurality of pixels of the class cluster; wherein, a class cluster is taken as a moving object involved in the image to be processed. The ninth sub-module is further configured to determine a moving object bounding box in the image to be processed according to the spatial position of pixels belonging to a same class cluster. For the specific operations performed by the moving object determining module 1930 and by its sub-modules and its units, reference may be made to the foregoing description of S130, which is not elaborated here.
The training module is configured to input a plurality of monocular images of the binocular image samples into a convolutional neural network to be trained, for performing a disparity analysis by the convolutional neural network. Based on output of the convolutional neural network, the training module acquires a disparity map of a left-eye image sample and a disparity map of a right-eye image sample. The training module reconstructs a right-eye image according to both the left-eye image sample and the disparity map of the right-eye image sample, and reconstructs a left-eye image according to both the right-eye image sample and the disparity map of the left-eye image sample. The training module then adjusts a network parameter of the convolutional neural network according to both a difference between the reconstructed left-eye image and the left-eye image sample and a difference between the reconstructed right-eye image and the right-eye image sample. For specific operations performed by the training module, reference may be made to the above description with respect to
A device for intelligent driving control according to the present disclosure is illustrated in
Exemplary Apparatus
For the operations implemented by the foregoing instructions, reference may be made to the related descriptions in the foregoing method embodiments, and detailed descriptions are omitted here. In addition, the RAM 2103 may further store various programs and data required for the operation of the apparatus. The CPU 2101, the ROM 2102, and the RAM 2103 are connected to each other via the bus 2104.
In a case that the RAM 2103 is present, the ROM 2102 is optional. The RAM 2103 stores executable instructions, or writes executable instructions into the ROM 2102 during operation, and the executable instructions cause the CPU 2101 to implement operations of the above-mentioned method of detecting moving object or the method of intelligent driving control. An input/output (I/O) interface 2105 is also connected to the bus 2104. The communication section 2112 may be integrated, or may be configured to have a plurality of sub-modules (for example, a plurality of IB network cards), each of which is connected to the bus.
The following components are connected to the I/O interface 2105: an input component 2106 such as a keyboard, a mouse, etc.; an output component 2107 such as a cathode ray tube (CRT), a liquid crystal display (LCD), speakers, etc.; a storage section 2108 including a hard disk, and the like; and a communication section 2109 including a network interface card such as a LAN card, a modem, etc. The communication section 2109 performs communication via a network such as the Internet. A drive 2110 is further connected to the I/O interface 2105 as needed. A removable medium 2111, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, etc., is installed on the drive 2110 as needed, so that the computer program read therefrom is installed in the storage section 2108 as needed.
It should be noted that the architecture illustrated in
In particular, according to the embodiments of the present disclosure, the process described above with reference to the flowcharts can be implemented as a computer software program. For example, the embodiments of the present disclosure involve a computer program product, which includes a computer program tangibly contained on a machine-readable medium. The computer program includes program codes for implementing the operations of the method as illustrated in the flowchart. The program codes may include instructions corresponding to the steps of the method according to the present disclosure.
In such embodiments, the computer program may be downloaded from the network through the communication section 2109 and installed, and/or installed from the removable medium 2111. When the computer program is executed by the CPU 2101, the instructions for implementing the above-mentioned corresponding steps described in the present disclosure are executed.
In one or more optional implementation manners, embodiments of the present disclosure further provide a computer program product for storing computer-readable instructions, which, upon being executed, cause a computer to implement operations of the method of detecting moving object or operations of the method of intelligent driving control according to any embodiment of the present disclosure. The computer program product can be specifically implemented by hardware, software or a combination thereof. In an optional example, the computer program product may be embodied as a computer storage medium. In another optional example, the computer program product may be embodied as a software product, such as a software development kit (SDK), etc.
In one or more optional implementation manners, the embodiments of the present disclosure further provide another method of detecting moving object or another method of intelligent driving control and corresponding devices and electronic apparatus, computer storage media, computer programs, and computer program products. The method includes: transmitting, by a first device and to a second device, an instruction to detect moving object or to drive intelligently, wherein the instruction causes the second device to implement operations of the method of detecting moving object according to any one of the possible embodiments of the present disclosure or the method of intelligent driving control according to any one of the possible embodiments of the present disclosure; and receiving, by the first device and from the second device, a result of moving object detection or a result of intelligent driving control.
In some embodiments, the instruction to detect moving object or the instruction to drive intelligently may be a calling instruction, and the first device may instruct the second device to perform moving object detection or intelligent driving control by calling, and correspondingly, in response to the calling instruction, the second device may implement the steps and/or processes of the method of detecting moving object according to any one of the possible embodiments of the present disclosure or of the method of intelligent driving control according to any one of the possible embodiments of the present disclosure.
It should be understood that terms such as “first” and “second” in the embodiments of the present disclosure are only for distinguishing purposes, and should not be construed as a limit to the embodiments of the present disclosure. It should also be understood that in the present disclosure, the term “plural” may refer to two or more, and the term “at least one” can refer to one, two, or more than two. It should also be understood that any component, data, or structure mentioned in the present disclosure can typically be understood as one or more unless it is clearly defined or the context gives opposite enlightenment. It should also be understood that the description of the various embodiments of the present disclosure focuses on the differences between the various embodiments, and the same or similar parts may be referred to one another; for the sake of brevity, the details are not elaborated.
The method and apparatus, electronic apparatus, and computer-readable storage medium of the present disclosure may be implemented in many ways. For example, the method and device, electronic apparatus, and computer-readable storage medium of the present disclosure can be implemented by software, hardware, firmware or any combination of software, hardware, and firmware. The above-mentioned sequence of the steps of the method is just illustrative, and the steps of the method of the present disclosure are not limited to the sequence specifically described above, unless otherwise specified. In addition, in some embodiments, the present disclosure can further be implemented as programs recorded in a recording medium, and these programs include machine-readable instructions for implementing operations of the method according to the present disclosure. Thus, the present disclosure further covers a recording medium storing a program for executing the method according to the present disclosure.
The description of the present disclosure is given for the sake of illustration and description, rather than being exhaustive or limiting the present disclosure to the disclosed form. Many modifications and variants are obvious to those of ordinary skill in the art. The selection and description of the embodiments are to better explain the principles and practical applications of the present disclosure, and to enable those of ordinary skill in the art to understand the present disclosure and thereby design various implementations with various modifications suitable for specific purposes.
This application is a continuation application of International Patent Application No. PCT/CN2019/114611 filed with the China National Intellectual Property Administration (CNIPA) on Oct. 31, 2019, which is based on and claims the priority to and benefits of Chinese Patent Application No. 201910459420.9, entitled “METHODS, DEVICES, MEDIA, AND APPARATUSES OF DETECTING MOVING OBJECT, AND OF INTELLIGENT DRIVING CONTROL” and filed with the CNIPA on May 29, 2019. The content of all of the above-identified applications is incorporated herein by reference in their entirety.
Related U.S. Application Data: parent application PCT/CN2019/114611, filed October 2019; child application No. 17139492.