The present invention relates to the technical field of digital holography.
It relates in particular to a method and a device for estimating a depth map associated with a digital hologram representing a scene. It also relates to an associated computer program.
Digital holography is an immersive technology that records the characteristics of a wave diffracted by an object present in a three-dimensional scene so as to reproduce a three-dimensional image of that object. The digital hologram obtained then contains all the information allowing this three-dimensional scene to be described.
However, extracting this information from the digital hologram itself in order to reconstruct the three-dimensional scene is not simple because the plane of the digital hologram does not comprise the spatial location of objects in the three-dimensional scene.
It is known in particular to use depth determination methods (known as “depth from focus”) relative to the plane of the hologram in order to determine, from the hologram, information concerning the geometry of the scene. Such a method is described, for example, in the article “Depth from focus” by Pavel Grossmann, Pattern Recognit. Lett. 5, 63-69, 1987.
In such methods, a reconstruction volume is obtained from several holographic reconstruction planes calculated at different focus distances chosen within a predefined interval. From these reconstruction planes, the associated depth is estimated by applying, on each of these planes, focusing operators in order to select, for each pixel, the depth of the reconstruction plane for which the focus is optimal.
However, such methods are slow to implement and very expensive in terms of computational resources.
In this context, the present invention proposes to improve the determination of depth values relative to the plane of a digital hologram associated with a three-dimensional scene.
More particularly, according to the invention, we propose a method for estimating a depth map associated with a digital hologram representing a scene, the method comprising steps of:
Thus, the different thumbnails processed are independent of each other because they are made up of disjoint sets of pixels. This independence then allows a faster implementation of the method. In addition, the necessary computing resources are reduced thanks to the limited number of areas (namely the different thumbnails) to analyze.
Furthermore, the use of the artificial neural network allows faster processing of all the thumbnails and a more precise determination of the focus levels associated with the pixels of the thumbnails.
Other non-limiting and advantageous characteristics of the method according to the invention, taken individually or in all technically possible combinations, are as follows:
The present invention also relates to a device for estimating a depth map associated with a digital hologram representing a scene, the device comprising:
The present invention finally relates to a computer program comprising instructions executable by a processor and designed to implement a method as introduced previously when these instructions are executed by the processor.
Of course, the different characteristics, variants and embodiments of the invention can be associated with each other in various combinations as long as they are not incompatible or exclusive of each other.
In addition, various other characteristics of the invention emerge from the appended description made with reference to the drawings which illustrate non-limiting forms of embodiment of the invention and where:
It should be noted that, in these figures, the structural and/or functional elements common to the different variants may have the same references.
The digital hologram H represents a given three-dimensional scene. This three-dimensional scene comprises, for example, one or more objects. The three-dimensional scene is defined in a reference frame (O, x, y, z).
As shown in
The digital hologram H here has, for example, a size of 1024×1024 pixels.
The device 1 for estimating a depth map C is designed to estimate the depth map C associated with the digital hologram H. For this, the device 1 comprises a processor 2 and a storage device 4. The storage device 4 is for example a hard disk or a memory.
The device 1 also comprises a set of functional modules. It comprises for example a reconstruction module 5, a decomposition module 6, a module 8 for determining a focus map Ci;j,k (or focusing map) and a module 9 for determining a depth value djs+q,ks+r.
Each one of the different modules described is for example implemented by means of computer program instructions designed to implement the module concerned when these instructions are executed by the processor 2 of the device 1 for estimating the depth map C.
However, as a variant, at least one of the aforementioned modules can be implemented by means of a dedicated electronic circuit, for example an application-specific integrated circuit.
The processor 2 is also designed to implement an artificial neural network NN, involved in the process of estimating the depth map C associated with the digital hologram H.
An example of architecture of this artificial neural network NN is shown in
Generally speaking, such an artificial neural network NN comprises a plurality of convolution layers distributed according to different levels, as explained below and represented in
In order to describe the architecture of the artificial neural network NN, we consider here that an image Ie is provided at an input of this artificial neural network NN. In practice, this image Ie is an image derived from the digital hologram H, as will be explained subsequently.
As shown in
The first part 10 is a so-called contraction part. Generally speaking, this first part 10 performs the encoder function and makes it possible to reduce the size of the image provided at an input while retaining its characteristics. For this, it comprises here four levels 12, 14, 16, 18. Each level 12, 14, 16, 18 comprises a convolution block Conv and a subsampling block D.
The convolution block Conv comprises at least one convolution layer whose kernel is a matrix of size n×n. Preferably here, each convolution block has two successive convolution layers. Here, each convolution layer has a 3×3 kernel.
Then, the convolution layer (or convolution layers if there are several) is followed by an activation function of rectified linear unit type (or ReLU for “Rectified Linear Unit”). Finally, the convolution block Conv applies, to the result obtained after application of the activation function, a so-called batch normalization. Here, this batch is constituted by the number of images provided as an input to the artificial neural network NN. During the learning step of the artificial neural network as described below, the batch size is greater than 1 (that is, at least two images are provided at the input to allow the training of the artificial neural network). In the case of the depth map estimation method as described subsequently, the batch size is, for example here, given by the number of reconstructed images Ii (see below).
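By way of non-limiting illustration, such a convolution block could be sketched as follows in Python with PyTorch (the channel counts, the padding and the placement of the activation and normalization after each of the two layers are assumptions, the text above not fixing them):

```python
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two successive 3x3 convolution layers, each followed here by a
    # ReLU activation and a batch normalization (assumed ordering).
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.BatchNorm2d(out_ch),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.BatchNorm2d(out_ch),
    )
```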
As shown in
Thus, taking the example shown in
Then, this first data X0, 0 is provided as an input to the second level 14 of the first part 10 so as to obtain, at the output thereof, a second data X1, 0. Here, this second data X1, 0 has for example dimensions reduced by half compared to the first data X0, 0.
This second data X1, 0 is provided as an input to the third level 16 of the first part 10 so as to obtain, at the output thereof, a third data X2, 0. Here, this third data X2, 0 has for example dimensions reduced by half compared to the second data X1, 0.
Then, this third data X2, 0 is provided as an input to the fourth level 18 of the first part 10 so as to obtain, at the output, a fourth data X3, 0. Here, this fourth data X3, 0 has for example dimensions reduced by half compared to the third data X2, 0.
Thus, the processing operations of the input image Ie by the first part 10 of the artificial neural network NN can be expressed in the following form:
X_{i,j} = D(Conv(X_{i-1,j}))
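Purely as an illustration, the contraction part as a whole could then be sketched as follows, reusing the conv_block sketch above (the channel counts are assumptions; the fifth data X4, 0 mentioned further below is assumed to be produced by an additional block at the bottom of the network):

```python
import torch.nn as nn

class ContractionPath(nn.Module):
    # Hypothetical encoder with the four levels 12, 14, 16, 18,
    # reusing the conv_block sketch above.
    def __init__(self, channels=(1, 64, 128, 256, 512)):
        super().__init__()
        self.blocks = nn.ModuleList(
            [conv_block(cin, cout)
             for cin, cout in zip(channels[:-1], channels[1:])]
        )
        self.down = nn.MaxPool2d(2)  # subsampling block D: dims halved

    def forward(self, x):
        outputs = []  # X[0,0] ... X[3,0], reused by the Conc blocks
        for block in self.blocks:
            x = self.down(block(x))  # X[i,0] = D(Conv(X[i-1,0]))
            outputs.append(x)
        return outputs
```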
As can be seen in
The second part 30 of the artificial neural network NN is a so-called expansion part. Generally speaking, this second part 30 performs the decoder function and makes it possible to form an image having the size of the image provided at the input and which only contains the characteristics essential to the processing.
For this, the second part 30 here comprises four levels 32, 34, 36, 38. By analogy with the first part 10, we define the first level 38 of the second part 30 as that positioned at the same level as the first level 12 of the first part 10. The second level 36 of the second part 30 is positioned at the same level as the second level 14 of the first part 10 of the artificial neural network NN. The third level 34 of the second part is positioned at the same level as the third level 16 of the first part 10 of the artificial neural network NN. Finally, the fourth level 32 of the second part 30 is positioned at the same level as the fourth level 18 of the first part 10 of the artificial neural network NN. This definition is used to match the levels of the artificial neural network processing data of the same dimensions.
Each level 32, 34, 36, 38 comprises an oversampling block U, a concatenation block Conc and a convolution block Conv (such as that introduced previously in the first part).
Each oversampling block U aims at increasing the dimensions of the data received at an input (an “upscaling” operation). For example here, the dimensions are multiplied by 2.
Following the oversampling block U, each level 32, 34, 36, 38 comprises the concatenation block Conc. The latter aims at concatenating the data obtained at the output of the oversampling block U of the level concerned with the data of the same size obtained at the output of one of the levels 12, 14, 16, 18 of the first part 10 of the artificial neural network NN. The involvement of data from the first part of the artificial neural network NN in the concatenation operation is shown in broken lines in
This concatenation block then allows the high-frequency information extracted in the first part 10 of the artificial neural network NN to be transmitted to the second part 30 as well. Without this concatenation block Conc, this information could be lost through the multiple undersampling and oversampling operations present in the artificial neural network NN.
Then, at the output of the concatenation block Conc, each level 32, 34, 36, 38 of the second part 30 comprises a convolution block Conv such as that described previously in the first part 10 of the artificial neural network NN. Here, each convolution block Conv notably comprises at least one convolution layer followed by a rectified linear unit type activation function and a batch normalization operation.
Based on the example shown in
That sixth data item X3, 1 is then provided at an input of the third level 34 of the second part 30 and, especially, at an input of the oversampling block U. At the output of that oversampling block U, a second intermediate data Xint2, which has the same dimensions as the third data X2, 0, is obtained. The second intermediate data Xint2 and the third data X2, 0 are concatenated by the concatenation block Conc. The result obtained at the output of the concatenation block Conc is provided at the input of the convolution block Conv so as to obtain, at the output, a seventh data item X2, 2.
Then, as shown in
Then, this eighth data X1, 3 is provided at an input of the first level 38 of the second part 30. The oversampling block U then makes it possible to obtain a fourth intermediate data Xint4. The latter has the same dimensions as the first data X0, 0. The fourth intermediate data Xint4 and the first data X0, 0 are then concatenated by the concatenation block Conc. The result obtained at the output of the concatenation block Conc is provided at an input of the convolution block Conv so as to obtain, at the output, a final data X0, 4. This final data X0, 4 has the same dimensions and the same resolution as the input image Ie. In practice, here, this final data X0, 4 is for example associated with a focus map (also denoted focusing map) as described below.
Thus, the processing operations of the fifth data item X4, 0 by the second part 30 of the artificial neural network NN can be expressed in the following form:
X_{i,j} = Conv(Conc[X_{i,0}; U(X_{i+1,j-1})])
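As a non-limiting sketch, one such expansion level could be written as follows, again reusing the conv_block sketch above (the nearest-neighbour upsampling mode is an assumption):

```python
import torch
import torch.nn as nn

class ExpansionLevel(nn.Module):
    # Hypothetical decoder level: oversampling block U (dimensions
    # multiplied by 2), concatenation block Conc with the same-size
    # data X[i,0] from the contraction part, then a convolution block.
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2)  # block U ("upscaling")
        self.conv = conv_block(in_ch + skip_ch, out_ch)

    def forward(self, x, skip):
        x = self.up(x)                   # U(X[i+1,j-1])
        x = torch.cat([skip, x], dim=1)  # Conc[X[i,0]; U(X[i+1,j-1])]
        return self.conv(x)              # X[i,j] = Conv(...)
```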
As shown in
The method then continues with a step E4 of reconstructing a plurality of two-dimensional images of the three-dimensional scene represented by the digital hologram H.
For this, the reconstruction module 5 is configured to reconstruct n images Ii of the scene by means of the digital hologram H, with i being an integer ranging from 1 to n.
Each reconstructed image Ii is defined in a reconstruction plane which is perpendicular to the depth axis of the digital hologram H. In other words, each reconstruction plane is perpendicular to the depth axis z. Each reconstruction plane is associated with a depth value, making it possible to associate a depth zi with each reconstructed image Ii, the index i referring to the index of the reconstructed image Ii. Each depth value defines a distance between the plane of the digital hologram and the reconstruction plane concerned.
Preferably here, the reconstruction step E4 is implemented in such a way that the depths zi associated with the reconstructed images Ii are uniformly distributed between the minimum depth zmin and the maximum depth zmax. In other words, the reconstructed images Ii are uniformly distributed along the depth axis, between the minimum depth zmin and the maximum depth zmax. Thus, the first reconstructed image I1 is spaced from the plane of the digital hologram H by the minimum depth zmin while the last reconstructed image In is spaced from the plane of the digital hologram H by the maximum depth zmax.
The reconstruction planes associated with the reconstructed images Ii are for example pairwise spaced by a distance ze. The distance ze between two successive reconstruction planes is for example of the order of 50 micrometers (μm).
Preferably, the n images obtained in reconstruction step E4 are calculated using a propagation of the angular spectrum defined by the following formula:

I_i = F^{-1}{ F{H} · exp(2iπ z_i √(1/λ² − f_x² − f_y²)) }

with F and F^{-1} corresponding to direct and inverse Fourier transforms, respectively, and fx and fy being the frequency coordinates of the digital hologram H in the Fourier domain in a first spatial direction x and in a second spatial direction y of the digital hologram, λ being the acquisition wavelength of the digital hologram H, i being the index of the reconstructed image Ii with i ranging from 1 to n and zi being the depth given in the reconstruction plane of the image Ii.
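By way of illustration, this propagation could be sketched as follows in Python with NumPy (a non-limiting sketch: the pixel pitch parameter and the suppression of evanescent components are assumptions not specified above):

```python
import numpy as np

def angular_spectrum(H, z, wavelength, pitch):
    # Angular spectrum propagation of the complex hologram H to the
    # reconstruction plane at depth z. The pixel pitch (assumed) is
    # needed to compute the frequency coordinates fx and fy.
    ny, nx = H.shape
    fx = np.fft.fftfreq(nx, d=pitch)  # frequency coordinates along x
    fy = np.fft.fftfreq(ny, d=pitch)  # frequency coordinates along y
    FX, FY = np.meshgrid(fx, fy)
    arg = 1.0 / wavelength**2 - FX**2 - FY**2
    # Propagating components only; evanescent components are set to zero.
    kernel = np.where(arg > 0,
                      np.exp(2j * np.pi * z * np.sqrt(np.abs(arg))), 0)
    return np.fft.ifft2(np.fft.fft2(H) * kernel)

# The n reconstructed images I_i then correspond to depths uniformly
# distributed between zmin and zmax, e.g.:
# images = [angular_spectrum(H, z, wavelength, pitch)
#           for z in np.linspace(zmin, zmax, n)]
```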
Each reconstructed image Ii is defined by a plurality of pixels. Preferably, the reconstructed images are formed of as many pixels as the digital hologram H. Thus, the reconstructed images Ii and the digital hologram H are of the same size. For example, in the case of a digital hologram H of size 1024×1024, each reconstructed image Ii also has a size of 1024×1024.
As shown in
Each thumbnail Ji;j,k is defined by the following formula:
J_{i;j,k} = { |I_i|(j·s : (j+1)·s, k·s : (k+1)·s) }
with sW and sH being the dimensions (respectively width and height) of the reconstructed image Ii, s being the size of the thumbnail Ji;j,k, |x1| being the notation corresponding to the modulus of the data x1 and ⌊x2⌋ the notation corresponding to the lower integer part of the number x2 (so that the indices j and k respectively range from 0 to ⌊sW/s⌋−1 and from 0 to ⌊sH/s⌋−1). The notation y1:y2 means that, for the variable concerned, the thumbnail Ji;j,k is defined between pixel y1 and pixel y2. In other words, here, the previous formula defines the thumbnail Ji;j,k, according to dimension x, between pixels js and (j+1)s of the reconstructed image Ii and, according to dimension y, between pixels ks and (k+1)s of the reconstructed image Ii.
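By way of illustration, this decomposition could be sketched as follows (a non-limiting sketch; the array axis convention and the handling of edge pixels not covered by a full thumbnail are assumptions):

```python
import numpy as np

def decompose(I, s=32):
    # Cuts the modulus |I_i| of a reconstructed image into adjacent,
    # non-overlapping s x s thumbnails J[i;j,k], following the formula
    # above (pixels beyond the last full thumbnail are not covered).
    A = np.abs(I)  # modulus of the reconstructed image I_i
    return {(j, k): A[j * s:(j + 1) * s, k * s:(k + 1) * s]
            for j in range(A.shape[0] // s)
            for k in range(A.shape[1] // s)}
```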
Each thumbnail Ji; j,k comprises a plurality of pixels. This plurality of pixels corresponds to a part of the pixels of the associated reconstructed image Ii.
Here, the thumbnails Ji; j,k are adjacent to each other. In practice, each thumbnail Ji;j,k is formed from a set of contiguous pixels of the reconstructed image Ii. Here, the sets of pixels of the reconstructed image Ii (respectively forming each one of the thumbnails Ji;j,k) are disjoint. In other words, this means that the thumbnails Ji; j,k associated with a reconstructed image Ii do not overlap with each other. Each thumbnail Ji; j,k therefore comprises pixels which do not belong to the other thumbnails associated with the same reconstructed image Ii. In other words, the thumbnails Ji; j,k associated with a reconstructed image Ii are independent of each other.
This property of independence between the thumbnails is particularly advantageous for the method according to the invention because it allows faster implementation. In addition, the necessary computing resources are less expensive thanks to the limited number of areas to analyze (namely the different thumbnails).
Since each thumbnail Ji; j,k is derived from a reconstructed image Ii associated with a depth zi, each thumbnail Ji;j,k is also associated with this same depth zi (of the three-dimensional scene).
In the case where the digital hologram H has a size of 1024×1024, each thumbnail Ji;j,k can for example have a size of 32×32.
In the case where the digital hologram H has a size of IH×IW, each thumbnail Ji;j,k has a size of (32×sH)×(32×sW), with the scale factors sH=IH/1024 and sW=IW/1024 (these scale factors being distinct from the image dimensions denoted sH and sW above).
This definition of the size of the thumbnails makes it possible to ensure a size of these thumbnails adapted to the size of the digital hologram H so as to improve the speed of implementation of the method for estimating the depth map associated with the digital hologram H.
As shown in
Here, each element of the focus map Ci;j,k corresponds to a focus level associated with the pixel concerned in the thumbnail Ji;j,k. In other words, the focusing map Ci;j,k associates a level of focus with each pixel of the thumbnail Ji;j,k concerned.
In practice, this step E8 is implemented via the artificial neural network NN. The latter receives, at an input, each one of the thumbnails Ji;j,k and provides, at the output, the focus levels (also denoted focusing levels) associated with each one of the pixels in the thumbnail Ji;j,k concerned.
More particularly, the artificial neural network NN receives, at the input, each one of the pixels of the thumbnail Ji;j,k and provides, at the output, the associated focus level (or focusing level). This focus level is for example between 0 and 1 and is equivalent to a level of sharpness associated with the pixel concerned. For example, in the case of a blurry pixel, the focus level is close to 0 while, in the case of a noticeably sharp pixel, the focus level is close to 1.
Advantageously, the use of the artificial neural network allows faster processing of all the thumbnails and a more precise determination of the focusing levels associated with the pixels of the thumbnails.
Prior to implementing the estimation method, a learning step (not shown in the figures) allows the training of the artificial neural network NN. For this, computer-generated holograms are used, for example. For these computed holograms, the exact geometry of the scene (and therefore the associated depth map) is known. A set of base images is derived from these computed holograms.
For each base image in this set, each pixel is associated with a focus level. Indeed, for each pixel of each base image, the focus level is equal to 1 if the corresponding value, in the known depth map, is equal to the depth associated with that base image. Otherwise, the focus level is 0.
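By way of illustration, the construction of these ground-truth focus levels could be sketched as follows (a non-limiting sketch):

```python
import numpy as np

def focus_labels(depth_map, z_i):
    # Ground-truth focus levels for one base image associated with the
    # depth z_i: 1 where the known depth map equals z_i, 0 elsewhere.
    return (depth_map == z_i).astype(np.float32)
```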
The training step then consists in adjusting the weights of the nodes of the different convolution layers comprised in the different convolution blocks described previously so as to minimize the error between the focusing levels obtained at the output of the artificial neural network NN (when the base images are provided at an input of this network) and those determined from the known depth map. For example, a cross-entropy loss can be used here in order to minimize the distance between the focus levels obtained at the output of the artificial neural network NN (when the base images are provided at an input of this network) and those determined from the known depth map.
In other words, the weights of the nodes of the different convolution layers are adjusted so as to converge the focus levels obtained at the output of the artificial neural network NN towards the focus levels determined from the known depth map.
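As a non-limiting sketch, one such weight-adjustment iteration could be written as follows (the optimizer, the batching and a final sigmoid keeping the network outputs in [0, 1] are assumptions):

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, base_images, labels):
    # One weight update: a binary cross-entropy between the focus
    # levels predicted by the network (assumed in [0, 1]) and the
    # labels derived from the known depth map is minimized.
    optimizer.zero_grad()
    predicted = model(base_images)
    loss = F.binary_cross_entropy(predicted, labels)
    loss.backward()   # gradients for the convolution-layer weights
    optimizer.step()  # weights adjusted to converge towards the labels
    return loss.item()
```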
In practice, the artificial neural network NN receives at the input all the thumbnails Ji;j,k associated with each reconstructed image Ii and proceeds to parallel processing of each of the thumbnails Ji;j,k.
Alternatively, the thumbnails Ji; j,k could be processed successively, one after the other.
At the end of step E8, the processor 2 therefore knows, for each thumbnail Ji;j,k, the associated focusing map Ci; j,k which lists the focusing levels obtained at the output of the artificial neural network NN associated with each pixel of the thumbnail Ji; j,k concerned. Each focusing map Ci; j,k is associated with the corresponding thumbnail Ji;j,k, and thus with the depth zi (of the three-dimensional scene).
As shown in
Indeed, during step E8, as each reconstructed image of the plurality of reconstructed images Ii corresponds to a different depth zi, a pixel of a thumbnail is associated with different focusing levels (depending on the depth of the reconstructed image Ii from which the thumbnail concerned is derived). In other words, for each pixel (associated with a depth value djs+q, ks+r of the depth map C), several focusing levels are known.
Thus, here, the processor 2 determines, for each pixel associated with the depth value djs+q, ks+r concerned, the depth for which the focusing level is the highest:
d_{js+q,ks+r} = argmax_{i=1…n}(C_{i;j,k}(q,r))
In other words, for each pixel of index (js+q, ks+r), the processor 2 determines the depth at which the focus level is highest. This depth then corresponds to the depth value djs+q, ks+r (an element of the depth map C).
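By way of illustration, this selection step could be sketched as follows (a non-limiting sketch assuming the focus maps Ci;j,k have been re-assembled into a volume covering the whole image):

```python
import numpy as np

def estimate_depth_map(focus_volume, depths):
    # focus_volume: array of shape (n, H, W) stacking, for each depth
    # z_i, the focus levels over the whole image; depths: the n values
    # z_i. For each pixel, the depth with the highest focus level is
    # selected, i.e. d = argmax_i C_i(q, r).
    best_index = np.argmax(focus_volume, axis=0)
    return np.asarray(depths)[best_index]
```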
Alternatively, the depth value could be determined using another method than determining the maximum value of the focus level. For example, an area formed by a plurality of adjacent pixels may be defined and the depth value may be determined by considering the depth for which a maximum deviation is observed from the average of the focus levels over the defined pixel area.
At the end of step E10, the depth map C is therefore estimated here from these determined depth values. Thus, each element of the depth map C comprises the depth value djs+q, ks+r associated with the pixel having the index (js+q, ks+r).
This estimated depth map C ultimately makes it possible to have spatial information in the form of a matrix of depth values representing the three-dimensional scene associated with the digital hologram H.
Of course, the method described above for a digital hologram applies in the same way to a plurality of holograms. For a plurality of digital holograms, the implementation of the method can be successive for each hologram or in parallel for the plurality of digital holograms.