This patent application claims the benefit of document FR 17 57049 filed on Jul. 25, 2017 which is hereby incorporated by reference.
The present invention relates to an electronic device for generating, from at least one pair of successive images of a scene including a set of object(s), a depth map of the scene.
The invention also relates to a drone comprising an image sensor configured to take at least one pair of successive images of the scene and such an electronic device for generating the depth map of the scene.
The invention also relates to a method for generating, from at least one pair of successive images of a scene including a set of object(s), a depth map of the scene, the method being carried out by such an electronic generating device.
The invention also relates to a non-transitory computer-readable medium including a computer program including software instructions which, when executed by a computer, implement such a generating method.
The invention relates to the field of drones, i.e., remotely-piloted flying motorized apparatuses. The invention in particular applies to rotary-wing drones, such as quadcopters, while also being applicable to other types of drones, for example fixed-wing drones.
The invention is particularly useful when the drone is in a tracking mode in order to track a given target, such as the pilot of the drone engaging in an athletic activity; the drone must then be capable of detecting obstacles that may be located on its trajectory or nearby.
The invention offers many applications, in particular for improved obstacle detection.
For obstacle detection by a drone, a drone is known that is equipped with a remote laser detection device, or LIDAR (Light Detection and Ranging) device, or LADAR (LAser Detection and Ranging) device. Also known is a drone equipped with a camera operating on the time-of-flight (TOF) principle. To that end, the TOF camera illuminates the objects of the scene with a flash of light and calculates the time that this flash takes to make the journey between the object and the camera. Also known is a drone equipped with a stereoscopic camera, such as a SLAM (Simultaneous Localization And Mapping) camera.
When the drone is equipped with a monocular camera, the detection is more delicate, and it is then generally known to use the movement of the camera, and in particular the structure of that movement (structure from motion). Other techniques, for example SLAM, are used with non-structured movements, producing very approximate three-dimensional maps and requiring significant computation to maintain an outline of the structure of the scene and to align newly detected points with existing points.
However, such an obstacle detection with a monocular camera is not very effective.
The aim of the invention is then to propose an electronic device and an associated method that allow a more effective generation of a depth map of the scene, from at least one pair of successive images of a scene.
To that end, the invention relates to an electronic device for generating, from at least one pair of successive images of a scene including a set of object(s), a depth map of the scene, the device comprising:
an acquisition module configured to acquire at least one pair of successive images of the scene, taken by an image sensor;
a computation module configured to compute, via a neural network, at least one intermediate depth map, each intermediate map being computed for a respective acquired pair of images and having a value indicative of a depth for each object of the scene, an input variable of the neural network being the acquired pair of images and an output variable of the neural network being the intermediate map; and
a generating module configured to generate the depth map of the scene from the at least one computed intermediate map.
According to other advantageous aspects of the invention, the electronic generating device comprises one or more of the following features, considered alone or according to all technically possible combinations:
the computation module is configured to compute at least two intermediate maps for the same scene, the computed intermediate maps having respective averages of indicative depth values that are different from one intermediate map to the other;
the computation module is configured to modify an average of the indicative depth values between first and second intermediate maps, respectively computed for first and second pairs of acquired images, by selecting the second pair with a temporal deviation between the images that is modified relative to that of the first pair;
the computation module is configured to compute a merged intermediate map by obtaining a weighted sum of the computed intermediate maps, and the generating module is configured to generate the depth map from the merged intermediate map;
the generating module is configured to generate the depth map by applying a corrective scale factor to the or each computed intermediate map.
The invention also relates to a drone comprising an image sensor configured to take at least one pair of successive images of the scene including a set of object(s), and an electronic generating device configured to generate a depth map of the scene, from the at least one pair of successive images of the scene taken by the sensor, in which the electronic generating device is as defined above.
The invention also relates to a method for generating, from at least one pair of successive images of a scene including a set of object(s), a depth map of the scene, the method being carried out by such an electronic generating device, and comprising:
acquiring at least one pair of successive images of the scene, taken by an image sensor;
computing, via a neural network, at least one intermediate depth map, each intermediate map being computed for a respective acquired pair of images and having a value indicative of a depth for each object of the scene, an input variable of the neural network being the acquired pair of images and an output variable of the neural network being the intermediate map; and
generating the depth map of the scene from the at least one computed intermediate map.
The invention also relates to a non-transitory computer-readable medium including a computer program including software instructions which, when executed by a computer, implement a generating method as defined above.
These features and advantages of the invention will appear more clearly upon reading the following description, provided solely as a non-limiting example, and done in reference to the appended drawings, in which:
In the following description, the expression “substantially equal to” defines a relationship of equality to within plus or minus 10%.
In
The drone 10 is a motorized flying vehicle able to be piloted remotely, in particular via a joystick 18 equipped with a display screen 19.
The drone 10 is for example a rotary-wing drone, including at least one rotor 20. In
The drone 10 includes a transmission module 22 configured to exchange data, preferably by radio waves, with one or several pieces of electronic equipment, in particular with the joystick 18, or even with other electronic elements to transmit the image(s) acquired by the image sensor 12.
The image sensor 12 is for example a front-viewing camera making it possible to obtain an image of the scene toward which the drone 10 is oriented. Alternatively or additionally, the image sensor 12 is a vertical-viewing camera, not shown, pointing downward and configured to capture successive images of terrain flown over by the drone 10.
The image sensor 12 extends in an extension plane. The image sensor 12 for example comprises a matrix photodetector including a plurality of photosites, each photosite corresponding to a respective pixel of the image taken by the sensor 12. The extension plane then corresponds to the plane of the matrix photodetector.
The electronic generating device 14 is for example on board the drone 10, as shown in
Alternatively, the electronic generating device 14 is a separate electronic device remote from the drone 10, the electronic generating device 14 then being suitable for communicating with the drone 10, in particular with the image sensor 12, via the transmission module 22 on board the drone 10.
The electronic generating device 14 comprises an acquisition module 24 configured to acquire at least one pair of successive images It−Δt, It of the scene S, taken by the image sensor 12. The acquired successive images It−Δt, It have been taken at respective moments in time t-Δt and t, t representing the moment in time at which the last acquired image of the pair was taken and Δt representing the time deviation between the respective moments at which the two acquired images of the pair were taken.
The electronic generating device 14 comprises a computation module 26 configured to compute, via a neural network 28, at least one intermediate depth map 30, each intermediate map 30 being computed for a respective acquired pair of images It−Δt, It and having a value indicative of a depth for each object of the scene S. An input variable 32 of the neural network 28 is the acquired pair of images It−Δt, It, and an output variable 34 of the neural network 28 is the intermediate map 30, as shown in
The depth is the distance between the sensor 12 and a plane passing through the respective object, parallel to a reference plane of the sensor 12. The reference plane is a plane parallel to the extension plane of the sensor 12, such as a plane combined with the extension plane of the sensor 12. The depth is then preferably the distance between the plane of the matrix photodetector of the sensor 12 and a plane passing through the respective object, parallel to the reference plane of the sensor 12.
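By way of illustration only, the following sketch (in Python; the names and numeric values are not taken from the invention) shows the difference between this depth convention, i.e., the distance to the plane passing through the object and parallel to the reference plane of the sensor 12, and the straight-line distance to the object:

import numpy as np

def plane_depth(point_cam: np.ndarray) -> float:
    """Depth of a 3D point expressed in the camera frame (Z axis = optical axis)."""
    # The plane passing through the object and parallel to the sensor plane is at
    # Z = point_cam[2], so the depth is simply the Z coordinate of the point.
    return float(point_cam[2])

def euclidean_range(point_cam: np.ndarray) -> float:
    """Straight-line distance between the sensor and the object, for comparison."""
    return float(np.linalg.norm(point_cam))

if __name__ == "__main__":
    p = np.array([1.0, 0.5, 4.0])               # hypothetical object, camera coordinates (m)
    print(plane_depth(p), euclidean_range(p))   # 4.0 versus approximately 4.15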
The electronic generating device 14 comprises a generating module 36 configured to generate the depth map 16 of the scene S from at least one computed intermediate map 30.
In the example of
The depth map 16 of the scene S includes a set of element(s), each element being associated with an object and having a value dependent on the depth between the sensor 12 and said object. Each element of the depth map 16 is for example a pixel, and each object is the entity of the scene corresponding to the pixel of the taken image. The value dependent on the depth between the sensor 12 and said object, shown on the depth map 16, as well as on each intermediate map 30, is for example a gray level or an RGB value, typically corresponding to a percentage of a maximum depth value, this percentage then providing a correspondence with the value of the depth thus shown.
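Purely as an illustrative example of such an encoding (the maximum depth value of 20 m and the 8-bit quantization below are arbitrary choices, not values of the invention), a depth in metres may be converted to a gray level and back as follows:

import numpy as np

def depth_to_gray(depth_m: np.ndarray, z_max: float = 20.0) -> np.ndarray:
    """Encode depths (in metres) as 8-bit gray levels, 255 corresponding to z_max."""
    ratio = np.clip(depth_m / z_max, 0.0, 1.0)      # percentage of the maximum depth
    return np.round(ratio * 255.0).astype(np.uint8)

def gray_to_depth(gray: np.ndarray, z_max: float = 20.0) -> np.ndarray:
    """Recover approximate depths (in metres) from the gray levels."""
    return gray.astype(np.float32) / 255.0 * z_max

if __name__ == "__main__":
    depth = np.array([[2.0, 5.0], [10.0, 40.0]])    # 40 m is clipped to z_max
    gray = depth_to_gray(depth)
    print(gray)
    print(gray_to_depth(gray))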
The joystick 18 is known in itself and makes it possible to pilot the drone 10. In the example of
The joystick 18 comprises a radio antenna and a radio transceiver, not shown, for exchanging data by radio waves with the drone 10, both uplink and downlink.
In the example of
In an alternative that is not shown, the acquisition module 24, the computing module 26 and the generating module 36 are each made in the form of a programmable logic component, such as an FPGA (Field Programmable Gate Array), or in the form of a dedicated integrated circuit, such as an ASIC (Application-Specific Integrated Circuit).
The computing module 26 is configured to compute, via the neural network 28, the at least one intermediate depth map 30.
As an optional addition, the computing module 26 is configured to compute at least two intermediate maps 30 for the same scene S.
Also as an optional addition, the computing module 26 is further configured to modify an average of the indicative depth values between first and second intermediate maps 30, respectively computed for first and second pairs of acquired images, by selecting the second pair, also called following pair, or next pair, with a temporal deviation Δt+1 between the images that is modified relative to that Δt of the first pair, also called previous pair.
According to this optional addition, the computing module 26 is for example configured to compute an optimal movement Doptimal(t+1) for the next pair of acquired images from an average depth target value.
The optimal movement Doptimal(t+1) is for example computed using the following equations:
where E(Ẑ(t)) is the average of the values of the first intermediate map 30, i.e., the previous intermediate map from which the target movement, then the temporal deviation, is recomputed for the next pair of acquired images,
α is a dimensionless parameter linking the depth to the movement of the sensor 12;
α = Dmax/D0   (2)
where Dmax represents a maximum movement of the sensor 12 between two successive image acquisitions, and
D0 represents a reference movement used during learning of the neural network 28.
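By way of illustration only, the sketch below shows one possible way of choosing the movement for the next pair of acquired images so that the average of the next intermediate map approaches a target value. It only assumes that, for a given scene, the average of the intermediate map varies inversely with the movement of the sensor 12, consistent with the corrective scale factor of equation (15) described further below; all names and numeric values are illustrative and do not reproduce the equations of the invention.

def next_movement(d_current: float, e_current: float,
                  e_target: float, d_max: float) -> float:
    """Movement to request for the next pair of images so that the average of the
    next intermediate map approaches e_target (illustrative feedback rule)."""
    d_opt = d_current * e_current / e_target   # a larger movement lowers the map average
    return min(d_opt, d_max)                   # never exceed the maximum movement

def movement_to_time_deviation(d_opt: float, speed: float) -> float:
    """Temporal deviation between the two images, assuming a constant sensor speed."""
    return d_opt / speed

if __name__ == "__main__":
    d = next_movement(d_current=0.2, e_current=30.0, e_target=12.0, d_max=1.0)
    print(d, movement_to_time_deviation(d, speed=5.0))   # 0.5 m, 0.1 s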
The depth target average value
Also as an optional addition, the computing module 26 is configured to compute at least two intermediate maps 30 for the same scene S, the computed intermediate maps 30 having respective averages of indicative depth values that are different from one intermediate map 30 to the other. According to this optional addition, the computing module 26 is further configured to compute a merged intermediate map 45 by obtaining a weighted sum of the computed intermediate maps 30. According to this optional addition, the generating module 36 is then configured to generate the depth map 16 from the merged intermediate map 45.
According to this optional addition, the computing module 26 is for example further configured to perform a k-means partitioning (partitioning into k-averages) on a computed intermediate map 30, in order to determine n desired different respective averages for a later computation of n intermediate maps, n being an integer greater than or equal to 2.
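As a purely illustrative sketch of such a partitioning (pure NumPy, with made-up data; the number of clusters, the number of iterations and the initialization are arbitrary choices), the values of a computed intermediate map can be clustered and the n centroids used as the desired averages C1, . . . , Cn:

import numpy as np

def kmeans_1d(values: np.ndarray, n: int, iters: int = 50, seed: int = 0) -> np.ndarray:
    """Return the n centroids of a 1-D k-means partition of `values`."""
    rng = np.random.default_rng(seed)
    centroids = rng.choice(values, size=n, replace=False).astype(float)
    for _ in range(iters):
        # assign each value to its nearest centroid, then recompute the centroids
        labels = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        for i in range(n):
            if np.any(labels == i):
                centroids[i] = values[labels == i].mean()
    return np.sort(centroids)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    intermediate_map = np.concatenate([np.full(100, 3.0), np.full(100, 15.0)])
    intermediate_map = intermediate_map + rng.normal(0.0, 0.5, intermediate_map.size)
    print(kmeans_1d(intermediate_map.ravel(), n=2))   # approximately [3, 15]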
In the example of
In
Each optimal movement Di, where i is an integer index comprised between 1 and n representing the number of the corresponding respective average, or the corresponding centroid, is for example computed using the following equations:
where E(Ẑi(t)) is the average of the values of the partitioned depth map Ẑi(t) with index i,
α is the dimensionless parameter defined by the preceding equation (2).
According to
Each movement D*i is for example computed using the following equation:
D*i = D(t, Δi) = ∥∫_{t−Δi}^{t} V(τ)·dτ∥   (4)
where V is the speed of the sensor 12 between the moments in time t−Δi and t.
The speed of the sensor 12 is typically deduced from that of the drone 10, which is obtained via a measuring device or speed sensor, known in itself.
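By way of illustration, the movement D(t, Δi) of equation (4) may be approximated by discrete integration of velocity samples, as in the following sketch (the sampling period and the velocity values are made up):

import numpy as np

def movement(velocities: np.ndarray, dt: float) -> float:
    """Norm of the integrated velocity over the interval [t - Δi, t].

    velocities: array of shape (N, 3), one 3-D velocity sample per time step (m/s)
    dt:         duration of one time step (s), with N * dt = Δi
    """
    displacement = np.sum(velocities * dt, axis=0)   # component-wise integral of V(τ)
    return float(np.linalg.norm(displacement))

if __name__ == "__main__":
    v = np.tile(np.array([4.0, 0.0, 0.5]), (10, 1))  # 10 samples over Δi = 0.1 s
    print(movement(v, dt=0.01))                      # approximately 0.403 m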
In
Each partitioned depth map Ẑi(t) for example verifies the following equation:
Ẑi(t) = (D*i/D0)·NN(It−Δi, It)   (5)
where NN(It−Δi, It) represents the intermediate map 30 derived from the neural network 28 for the pair of successive images (It−Δi, It),
D*i represents the movement computed by the block INT, for example according to equation (4), and
D0 represents the reference movement used during learning of the neural network 28.
The aforementioned equation (5) is also written in the form:
Ẑi(t) = α·β(It−Δi, It)·NN(It−Δi, It)   (6)
where α is the dimensionless parameter defined by the aforementioned equation (2), D*i represents the movement computed by the block INT, and
β(It−Δi, It) = D*i/Dmax represents the ratio between said movement D*i and the maximum movement Dmax,
with NN(It−Δi, It) representing the intermediate map 30 derived from the neural network 28 for the pair of successive images (It−Δi, It).
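By way of illustration only, the rescaling of the network output by the ratio D*i/D0, as in equation (5), can be sketched as follows (the array nn_output and the numeric values are placeholders):

import numpy as np

def partitioned_depth_map(nn_output: np.ndarray, d_i: float, d0: float) -> np.ndarray:
    """Rescale the network output by the ratio between D*i and D0, as in equation (5)."""
    return (d_i / d0) * nn_output

if __name__ == "__main__":
    nn_output = np.array([[5.0, 7.5], [10.0, 12.5]])           # placeholder NN output
    print(partitioned_depth_map(nn_output, d_i=0.4, d0=0.2))   # values doubled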
In
The weighted sum is preferably a weighted average, the sum of the weights of which is equal to 1.
The weighted sum, such as the weighted average, is for example done pixel by pixel where, for each pixel of the merged intermediate map 45, a weight set is computed.
The computation of the merged intermediate map 45 for example verifies the following equations:
FUSION(t)j,k = (Σ_{i=1}^{n} ωi,j,k·Ẑi(t)j,k)/(Σ_{i=1}^{n} ωi,j,k)
ωi,j,k = ε + ƒ(β(It−Δi, It)·NN(It−Δi, It)j,k)
where the function f is defined by:
where FUSION(t) designates the merged intermediate map 45,
i is the integer index comprised between 1 and n, defined above, j and k are indices on the x-axis and y-axis defining the pixel of the map in question, and
ε, βmin and βmax are predefined parameters.
These parameters, as well as the depth target average value, are for example chosen as follows:
ε = 10⁻³; βmin = 0.1;
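Purely by way of illustration of the pixel-by-pixel weighted average described above (the weighting function used here, an in-range indicator on the normalized values, and the value βmax = 1 are illustrative assumptions, not the function ƒ of the invention):

import numpy as np

def fuse(maps: np.ndarray, normalized: np.ndarray,
         eps: float = 1e-3, beta_min: float = 0.1, beta_max: float = 1.0) -> np.ndarray:
    """Pixel-by-pixel weighted average of n maps of shape (n, H, W), weights summing to 1.

    normalized: values of β(It−Δi, It)·NN(It−Δi, It) at each pixel, same shape as `maps`,
    used here with an illustrative in-range indicator playing the role of ƒ.
    """
    in_range = (normalized >= beta_min) & (normalized <= beta_max)
    weights = eps + in_range.astype(float)                    # ω = ε + ƒ(...)
    weights = weights / weights.sum(axis=0, keepdims=True)    # weights now sum to 1
    return np.sum(weights * maps, axis=0)

if __name__ == "__main__":
    maps = np.stack([np.full((2, 2), 4.0), np.full((2, 2), 20.0)])
    normalized = np.stack([np.full((2, 2), 0.4), np.full((2, 2), 2.0)])  # 2nd map out of range
    print(fuse(maps, normalized))   # close to 4: the in-range map dominates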
One skilled in the art will note that equations (2), (5) to (7) depend on distance ratios and that the dimensionless parameter α and the partitioned depth map Ẑi(t) alternatively verify the following equations depending on speed ratios instead of distance ratios, assuming that the speed of the image sensor 12 is constant between two successive image acquisitions:
α′ = Vmax/V0   (11)
where Vmax represents a maximum speed of the sensor 12, and
V0 represents a reference speed used during learning of the neural network 28.
Ẑi(t) = (Vi/V0)·NN(It−Δi, It)   (12)
where NN(It−Δi, It) represents the intermediate map 30 derived from the neural network 28 for the pair of successive images (It−Δi, It),
Vi represents the speed of the sensor 12 during this new image acquisition, and
V0 represents the reference speed used during learning of the neural network 28.
The aforementioned equation (12) is also written in the form:
Ẑi(t) = α′·γ(It−Δi, It)·NN(It−Δi, It)   (13)
where α′ is the dimensionless parameter defined by the aforementioned equation (11), Vi represents the speed of the sensor 12 during this new image acquisition, and γ(It−Δi, It) = Vi/Vmax represents the ratio between said speed Vi and the maximum speed Vmax,
with NN(It−Δi, It) representing the intermediate map 30 derived from the neural network 28 for the pair of successive images (It−Δi, It).
The neural network 28 includes a plurality of artificial neurons 46 organized in successive layers 48, 50, 52, 54, i.e., an input layer 48 corresponding to the input variable(s) 32, an output layer 50 corresponding to the output variable(s) 34, and optional intermediate layers 52, 54, also called hidden layers and arranged between the input layer 48 and the output layer 50, as shown in
The artificial neural network 28 is in particular a convolutional neural network. The artificial neural network 28 for example includes artificial neurons 46 arranged in successive processing layers.
The artificial neural network 28 includes one or several convolution kernels. A convolution kernel analyzes a characteristic of the image to obtain, from the original image, a new characteristic of the image in a given layer, this new characteristic of the image also being called channel (also referred to as a feature map). The set of channels forms a convolutional processing layer, in fact corresponding to a volume, often called output volume, and the output volume is comparable to an intermediate image.
The artificial neural network 28 further includes one or several processing layers arranged between the convolution kernels and the output variable(s) 34.
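By way of illustration only, the following sketch, written with the PyTorch library, shows a minimal convolutional network of this general kind; it is not the architecture of the neural network 28, and the numbers of layers, channels and kernel sizes are arbitrary choices. The input is the pair of successive images stacked along the channel axis, and the output is a one-channel map of the same spatial size.

import torch
import torch.nn as nn

class PairToDepth(nn.Module):
    """Illustrative stand-in: pair of RGB images in, one-channel depth-like map out."""
    def __init__(self) -> None:
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(6, 32, kernel_size=3, padding=1),   # 2 stacked RGB images -> 32 channels
            nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1),  # hidden convolutional layer
            nn.ReLU(),
            nn.Conv2d(32, 1, kernel_size=3, padding=1),   # 1 output channel: the map
        )

    def forward(self, pair: torch.Tensor) -> torch.Tensor:
        # pair: tensor of shape (batch, 6, H, W), i.e. It−Δt and It concatenated channel-wise
        return self.layers(pair)

if __name__ == "__main__":
    net = PairToDepth()
    pair = torch.randn(1, 6, 128, 128)   # dummy pair, below the 512 x 512 limit
    print(net(pair).shape)               # torch.Size([1, 1, 128, 128])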
The learning of the neural network 28 is supervised. It then for example uses an algorithm for back-propagation of the error gradient, such as an algorithm based on minimizing an error criterion using a so-called gradient descent method.
The supervised learning of the neural network 28 is done by providing it, as input variable(s) 32, with one or several pair(s) of acquired images It−Δt, It and, as reference output variable(s) 34, with one or several corresponding intermediate map(s) 30, with the expected depth values for the acquired image pair(s) It−Δt, It provided as input variable(s) 32.
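As a purely illustrative sketch of such a supervised training step (the small stand-in network, the optimizer, the learning rate and the dummy data below are arbitrary choices, not those of the invention):

import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative stand-in for the network: pair of RGB images (6 channels) in, 1 channel out.
net = nn.Sequential(nn.Conv2d(6, 16, kernel_size=3, padding=1), nn.ReLU(),
                    nn.Conv2d(16, 1, kernel_size=3, padding=1))
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

def train_step(image_pair: torch.Tensor, reference_map: torch.Tensor) -> float:
    optimizer.zero_grad()
    prediction = net(image_pair)                  # computed intermediate map
    loss = F.mse_loss(prediction, reference_map)  # error criterion to be minimized
    loss.backward()                               # back-propagation of the error gradient
    optimizer.step()                              # gradient-descent update
    return loss.item()

if __name__ == "__main__":
    pair = torch.randn(2, 6, 64, 64)              # dummy acquired pairs of images
    reference = torch.rand(2, 1, 64, 64) * 20.0   # dummy expected depth values
    print(train_step(pair, reference))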
The learning of the neural network 28 is preferably done with a predefined temporal deviation Δ0 between two successive image acquisitions. This temporal deviation typically corresponds to the temporal period between two image acquisitions of the sensor 12 operating in video mode, or conversely to the corresponding frequency. Depending on the sensor 12, the image acquisition frequency for example varies between 25 images per second and 60 images per second, or even reaches 120 images per second. The predefined temporal deviation Δ0 is then comprised between 40 ms and 16 ms, or even equal to 8 ms.
During the learning of the neural network 28, the speed of the sensor 12 being assumed to be constant between two image acquisitions and equal to V0, also called reference speed, the predefined temporal deviation Δ0 corresponds to a predefined movement D0 of the sensor 12, also called reference movement.
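By way of a purely illustrative numerical example (the values are not taken from the invention), assuming a reference speed V0 of 5 m/s and a predefined temporal deviation Δ0 of 40 ms, the corresponding reference movement is D0 = V0·Δ0 = 5 m/s × 0.040 s = 0.20 m.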
The acquired pair of images It−Δt, It, provided as input variable 32 for the neural network 28, preferably has dimensions smaller than or equal to 512 pixels×512 pixels.
The generating module 36 is configured to generate the depth map 16 from the at least one computed intermediate map 30 or from the merged intermediate map 45, said merged intermediate map 45 in turn resulting from computed intermediate maps 30.
The generating module 36 is preferably configured to generate the depth map 16 by applying a corrective scale factor to the or each computed intermediate map 30, or to the merged intermediate map 45 if applicable. The corrective scale factor depends on a ratio between the temporal deviation Δt between the images of the acquired pair from which the intermediate map 30 has been computed and a predefined temporal deviation Δ0, used for prior learning of the neural network 28.
When the speed of the sensor 12 is further assumed to be constant between two image acquisitions, the corrective scale factor depends, similarly, on a ratio between the movement D(t,Δt) of the sensor 12 between the two image acquisitions for the acquired pair from which the intermediate map 30 has been computed and the predefined movement D0, used for the prior learning of the neural network 28.
The corrective scale factor is then equal to D(t,Δt)/D0, and the corrected depth map for example verifies the following equation:
Ẑ(t) = (D(t,Δt)/D0)·NN(It−Δt, It)   (15)
where NN(It−Δt, It) represents the intermediate map 30 derived from the neural network 28 for the pair of successive images (It−Δt, It),
D(t,Δt) represents said movement of the sensor 12 between the two image acquisitions, and
D0 represents the aforementioned reference movement.
Said movement D(t,Δt) for example verifies the following equation:
D(t, Δt) = ∥∫_{t−Δt}^{t} V(τ)·dτ∥   (16)
where V is the speed of the sensor 12 between the moments in time t−Δt and t.
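By way of illustration only, when the speed of the sensor 12 is constant between the two acquisitions, the corrective scale factor D(t,Δt)/D0 reduces to a ratio of speeds and temporal deviations; the following sketch applies it to a placeholder intermediate map (all numeric values are illustrative):

import numpy as np

def scale_factor(delta_t: float, delta_0: float, v: float, v0: float) -> float:
    """D(t, Δt)/D0 with D = V·Δ, the speed being constant between the two acquisitions."""
    return (v * delta_t) / (v0 * delta_0)

def apply_corrective_factor(nn_output: np.ndarray, delta_t: float, delta_0: float,
                            v: float, v0: float) -> np.ndarray:
    """Apply the corrective scale factor to the intermediate map, as in equation (15)."""
    return scale_factor(delta_t, delta_0, v, v0) * nn_output

if __name__ == "__main__":
    nn_output = np.full((2, 2), 8.0)   # placeholder intermediate map
    # same speed as during learning but twice the temporal deviation: factor 2
    print(apply_corrective_factor(nn_output, delta_t=0.08, delta_0=0.04, v=5.0, v0=5.0))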
The operation of the drone 10 according to the invention, in particular of its electronic generating device 14, will now be described using
During an initial step 100, the electronic generating device 14 acquires, via its acquisition module 24, at least one pair of successive images of the scene S from among the various images taken by the image sensor 12.
The electronic generating device 14 computes, during the following step 110 and via its computing module 26, in particular via its neural network 28, at least one intermediate depth map 30, the neural network 28 receiving, as previously indicated, each acquired pair of successive images in one of its input variables 32 and delivering the computed intermediate map 30 from said pair of acquired images in the respective one of its output variables 34.
As an optional addition, the electronic generating device 14 computes, via its computing module 26 and during a following optional step 120, the merged intermediate map 45 by obtaining the weighted sum of at least two intermediate maps 30 computed for the same scene S, the computed intermediate maps 30 having respective averages of indicative depth values that are different from one intermediate map to the other.
The computation of the merged intermediate map 45 with said weighted sum is for example done using the FUSION block of
To determine the different intermediate maps 30 intended to be merged, the computing module 26 further performs, according to an optional addition and for example using the unit K_m, the k-means partitioning on the intermediate map 30 previously computed, in order to determine the n desired separate respective averages for the subsequent computation of n intermediate maps 30. The n desired separate respective averages, such as the n centroids C1, . . . , Cn, are next provided to the successive units 1/
As an optional addition, the electronic generating device 14 computes, via its generating module 36 and during the following optional step 130, a corrective scale factor to be applied directly to the intermediate map 30 computed by the neural network 28, or to the merged intermediate map 45. The application of the corrective scale factor for example verifies equations (15) and (16) previously described, and makes it possible to correct the intermediate map based on any offset between the predefined temporal deviation Δ0, used for the prior learning of the neural network 28, and the temporal deviation Δt between the images of the acquired pair, from which the intermediate map 30 has been computed.
The electronic generating device 14 lastly generates, during step 140 and via its generating module 36, the depth map 16 of the scene S.
One skilled in the art will understand that when the optional steps 120 and 130 for computing the merged map 45 and respectively applying the corrective scale factor are not carried out and the electronic generating device 14 goes directly from step 110 to step 140, the depth map 16 is generated directly from the intermediate map 30 derived from the neural network 28. In other words, the depth map 16 generated by the generating module 36 is then identical to the intermediate map 30 derived from the neural network 28 of the computing module 26.
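By way of illustration only, the chaining of steps 100, 110, 130 and 140 described above may be sketched as follows (the optional fusion step 120 is omitted for brevity; network stands for any callable playing the role of the neural network 28, and all other names and values are illustrative):

import numpy as np

def generate_depth_map(network, image_prev: np.ndarray, image_curr: np.ndarray,
                       d_t: float, d0: float) -> np.ndarray:
    pair = np.concatenate([image_prev, image_curr], axis=-1)  # step 100: acquired pair
    intermediate = network(pair)                              # step 110: neural network
    corrected = (d_t / d0) * intermediate                     # step 130: corrective factor
    return corrected                                          # step 140: depth map

if __name__ == "__main__":
    fake_network = lambda pair: np.full(pair.shape[:2], 6.0)  # placeholder network
    img = np.zeros((64, 64, 3))
    print(generate_depth_map(fake_network, img, img, d_t=0.2, d0=0.2).shape)   # (64, 64)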
The electronic generating device 14 then makes it possible to provide a depth map 16 of the scene S with good precision and quickly through the use of the neural network 28. The average depth error between the depth thus estimated and the actual depth has a small value.
For example, in
This good precision of the determination of the depth map 16 by the electronic generating device 14 is also visible in
In
When, as an optional addition, the electronic generating device 14 further computes the merged intermediate map 45 by obtaining the weighted sum of at least two intermediate maps 30 computed for the same scene S, the depth map 16 thus obtained has a wider range of depth values, as illustrated in
One skilled in the art will therefore understand that the electronic generating device 14 according to the invention then allows the drone 10 to perform more effective obstacle detection.
One can then see that the electronic generating device 14 according to the invention and the associated generating method allow more effective generation of the depth map 16 of the scene S, from at least one pair of successive images It−Δt, It of the scene S.