The present invention relates to a device for detecting objects or features in the focal plane of a scene, based on a measurement that compresses the three-dimensional hyperspectral scene into a non-homogeneous two-dimensional image, and on processing of the obtained image to detect the features sought in the scene.
The invention finds a particularly advantageous application for embedded systems intended to detect objects or features in a scene from their shape, their texture and their luminous reflectance.
The invention can be applied to a large number of technical fields in which hyperspectral detection is sought. In a non-exhaustive manner, the invention can be used, for example, in the medical and dental field, to aid diagnosis. In the plant and mycological field, the invention can also be used to carry out phenotyping, to detect symptoms of stress or disease or to differentiate species. In the field of chemical analysis, the invention can equally be used to measure concentrations. In the field of the fight against counterfeiting, the invention can be used to discern a counterfeit.
For the purposes of the invention, a hyperspectral acquisition detection corresponds to the detection of features in the focal plane of a scene from an acquired two-dimensional image containing a representation of the spatial and spectral information of the focal plane of the scene.
Different methods of compression of the focal plane of a hyperspectral scene are described in the literature. The purpose of these methods is to acquire the focal plane of the hyperspectral scene in a single acquisition without the need to scan the focal plane of the scene in spatial or spectral dimensions.
For example, the thesis “Non-scanning imaging spectrometry”, Descour, Michael Robert, 1994, The University of Arizona, proposes a way to acquire a single two-dimensional image of the observed scene containing all the information for different wavelengths. This method, called CTIS (for “Computed Tomography Imaging Spectrometer”), proposes to capture a diffracted image of the focal plane of the scene observed by means of a diffraction grating disposed upstream of a digital sensor. This diffracted image acquired by the digital sensor takes the form of multiple projections. Each projection makes it possible to represent the focal plane of the observed scene and contains all the spectral information of the focal plane.
Another method, named CASSI (for “Coded Aperture Snapshot Spectral Imaging”), described in the thesis “Compressive spectral imaging”, D. Kittle, 2010, proposes a way of acquiring a single two-dimensional encoded image containing all spatial and spectral information. This method proposes to capture a diffracted image, by means of a diffraction prism, and encoded, by means of an encoding mask, of the focal plane of the observed scene.
These methods, although satisfactory for solving the problem of instantaneous acquisition of the focal plane of the hyperspectral scene, require complex algorithms that are expensive in computing resources in order to estimate the uncompressed hyperspectral scene. The review “Review of snapshot spectral imaging technologies,” Nathan Hagen, Michael W. Kudenov, Optical Engineering 52 (9), September 2013, presents a comparison of hyperspectral acquisition methods and the algorithmic complexities associated with each of them. He Mingyi et al., “Multi-scale 3D deep convolutional neural network for hyperspectral image classification,” 2017 IEEE International Conference on Image Processing, IEEE, Sep. 17, 2017, pp. 3904-3908, Chen Yushi et al., “Deep feature extraction and classification of hyperspectral images based on convolutional neural networks,” IEEE Transactions on Geoscience and Remote Sensing, IEEE Service Center, vol. 54, no. 10, Oct. 1, 2016, pp. 6232-6251, and Qiangqiang Yuan et al., “Hyperspectral image denoising employing a spatial-spectral deep residual convolutional neural network”, Cornell University Library, Jun. 1, 2018, are examples of such publications.
Indeed, the CTIS method requires an estimation process based on a two-dimensional matrix representing the transfer function of the diffraction optics. This matrix must be inverted to reconstruct the hyperspectral image. Since the matrix of the transfer function is not completely defined, iterative matrix inversion methods, which are expensive in computing resources, make it possible to approach the result step by step.
The CASSI method and its derivatives also require non-completely defined matrix computations, and use iterative computation methods that are expensive in computing resources in order to approach the result.
In addition, the three-dimensional hyperspectral image reconstructed by these computational methods does not contain additional spatial or spectral information with respect to the two-dimensional compressed image obtained by these acquisition methods.
The estimation by the calculation of the hyperspectral image in three dimensions is therefore not necessary for a direct detection of the features sought in the focal plane of the scene.
Image processing methods for the purpose of detecting features are widely described in the scientific literature. For example, a method based on neural networks is described in H. Bourlard and Y. Kamp, “Auto-association by multilayer perceptrons and singular value decomposition,” Biological Cybernetics, 59 (4): 291-294, 1988, ISSN 0340-1200.
New methods based on deep and convolutional neural networks are also widely used with results showing very low false detection rates. For example, such a method is described in “Stacked Autoencoders Using Low-Power Accelerated Architectures for Object Recognition in Autonomous Systems,” Neural Processing Letters, Vol. 43, no. 2, pp. 445-458, 2016, J. Maria, J. Amaro, G. Falcao, L. A. Alexander.
These methods are particularly suitable for detecting elements in a color image (generally having 3 channels—Red, Green and Blue) of a scene taking into account the characteristics of shapes, textures and colors of the feature to detect. These methods consider the image homogeneous, and convolutionally process the entire image by the same process.
The two-dimensional compressed images obtained by the CTIS and CASSI methods therefore cannot be processed using a standard deep convolutional neural network. Indeed, the image obtained by these methods is not homogeneous, and contains nonlinear features in the spectral or spatial dimensions.
The technical problem of the invention consists in directly detecting the features or objects sought from the acquisition of at least one compressed, non-homogeneous, and nonlinear two-dimensional representation containing all the spatial and spectral information of a hyperspectral scene in three dimensions.
The present invention proposes to answer this technical problem by directly detecting the desired features by means of a deep and convolutional formal neural network, whose architecture is adapted to direct detection, applied to a compressed two-dimensional image of the three-dimensional hyperspectral scene.
The three-dimensional hyperspectral image contains no more spatial and spectral information than the compressed image obtained by the CTIS or CASSI acquisition methods since the three-dimensional hyperspectral image is reconstructed from the compressed image. Thus the invention proposes to detect directly in the compressed image the desired features in the focal plane of a scene.
For this purpose, the invention relates to a device for detecting features in a hyperspectral scene.
The invention is characterized in that the device comprises a system for direct detection of features in said hyperspectral scene which integrates a deep convolutional neural network designed to detect the feature(s) sought in said hyperspectral scene from the at least one compressed image of the hyperspectral scene.
In practice, unlike the state-of-the-art of the CTIS method, the invention makes it possible to detect features in said hyperspectral scene in real time between two acquisitions of the hyperspectral focal plane of the observed scene. In doing so, it is no longer necessary to defer the processing of the compressed images and it is no longer necessary to store these compressed images after the detection. Also it is no longer necessary to reconstruct the hyperspectral image in three dimensions before applying the detection method.
In a variant, the compressed image obtained by the optical system contains the focal plane diffracted and encoded according to the coding scheme of a mask introduced into the optical path before diffraction of the scene. Thus, the neural network uses, for the direct detection of the desired features, the following information:
In practice, unlike the conventional state of the art of the CASSI method, the invention makes it possible to detect features in a hyperspectral scene in real time between two acquisitions of the hyperspectral focal plane of the observed scene. In doing so, it is no longer necessary to defer the processing of the compressed images and it is no longer necessary to store these compressed images after the detection. Also it is no longer necessary to reconstruct the hyperspectral image in three dimensions before applying the detection method.
According to one embodiment, there is provided a device for capturing an image of a hyperspectral scene and detecting features in this three-dimensional hyperspectral scene further comprising a system for acquiring the at least one compressed image of the hyperspectral scene in three dimensions.
According to one embodiment, the acquisition system comprises a compact mechanical embodiment integrable in a portable autonomous device and the detection system is included in said portable and autonomous device.
According to one embodiment, the acquisition system comprises a compact mechanical realization integrable in front of the lens of a camera of a smartphone and the detection system is included in the smartphone.
According to one embodiment, said at least one compressed image is obtained by an infrared sensor of the acquisition system. This embodiment makes it possible to obtain information that is invisible to the human eye.
According to one embodiment, said compressed image is obtained by a sensor of the acquisition system whose wavelength is between 0.001 nanometer and 10 nanometers. This embodiment makes it possible to obtain information on the X-rays present on the observed scene.
According to one embodiment, said compressed image is obtained by a sensor of the acquisition system whose wavelength is between 10,000 nanometers and 20,000 nanometers. This embodiment makes it possible to obtain information on the temperature of the observed scene.
According to one embodiment, said at least one compressed image is obtained by a sensor of the acquisition system whose wavelength is between 300 nanometers and 2000 nanometers. This embodiment makes it possible to obtain information in the domain that is visible and invisible to the human eye.
According to one embodiment, said at least one compressed image is obtained by a sensor of the acquisition system comprising:
This embodiment is particularly simple to implement and can be adapted to an existing sensor.
According to one embodiment, said at least one compressed image is obtained by a sensor of the acquisition system comprising:
This embodiment is particularly simple to implement and can be adapted to an existing sensor.
For the detection of features from said compressed image, the invention uses a deep convolutional neural network to calculate a probability of presence of the one or more features sought in said hyperspectral scene. Training said deep and convolutional neural network makes it possible to indicate the probability of presence of the features sought for each pair of x and y coordinates of said hyperspectral scene. For example, training by backpropagation of the gradient, or its derivatives, from training data can be used.
According to one embodiment, the neural network is designed to calculate a chemical concentration in said hyperspectral scene from the compressed image.
According to one embodiment, an output of the neural network is scalar or boolean.
According to one embodiment, an output layer of the neural network comprises a layer CONV(u) where u is greater than or equal to 1 and corresponds to the number of desired features.
The convolutional deep neural network for direct detection from the compressed image has an input layer structure adapted for direct detection. The invention has several architectures of the deep layers of said neural network. Among these, a self-encoding architecture as described in the document “SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation”, Vijay Badrinarayanan, Alex Kendall and Roberto Cipolla is adapted to indicate the probability of presence of the features sought for each x and y coordinates of the hyperspectral scene.
Said input layer of the neural network is adapted to the structure of the compressed image obtained by the acquisition means. Thus, the input layer is a third order tensor and has two spatial dimensions of size XMAX and YMAX, and a dimension of depth of size DMAX.
The invention uses the nonlinear relation f(xt, yt, dt)→(ximg, yimg) defined for xtϵ[0 . . . XMAX[, ytϵ[0 . . . YMAX[ and dtϵ[0 . . . DMAX[ for calculating the coordinates ximg and yimg of the pixel of said compressed image whose intensity is copied into the third order tensor of said input layer of the neural network at coordinates (xt, yt, dt).
According to one embodiment, the compressed image contains the diffractions of the hyperspectral scene obtained with diffraction filters. The compressed image obtained contains an image portion of the non-diffracted scene, as well as the projections diffracted along the axes of the different diffraction filters. The input layer of the neural network contains a copy of the chromatic representations of the hyperspectral scene of the compressed image according to the following nonlinear relationship:
with:
n=floor (M (dt−1)/DMAX);
λ=(dt−1) mod (DMAX/M);
M the number of diffractions of the compressed image;
dt between 1 and DMAX, the depth of the input layer of the neural network;
xt between 0 and XMAX, the width of the input layer of the neural network;
yt between 0 and YMAX, the length of the input layer of the neural network;
XMAX the size along the x-axis of the third order tensor of the input layer;
YMAX the size along the y-axis of the third order tensor of the input layer;
DMAX, the depth of the third order tensor of said input layer;
λsliceX, the constant of the spectral pitch of the pixel along the x-axis of said compressed image;
λsliceY, the constant of the spectral pitch of the pixel along the y axis of said compressed image;
xoffsetX(n) corresponding to the shift along the x-axis of the diffraction n;
yoffsetY(n) corresponding to the shift along the y-axis of the diffraction n.
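By way of illustration only, the decomposition of the depth index dt into a diffraction index n and a spectral index λ, as defined above, can be sketched as follows. The function name split_depth_index and the value DMAX=21 are assumptions chosen for the example; M=7 corresponds to one embodiment mentioned in this description. The sketch further assumes that DMAX is a multiple of M.

```python
def split_depth_index(dt, dmax, m):
    """Return (n, lam) for a 1-based depth index dt, following
    n = floor(M (dt - 1) / DMAX) and lam = (dt - 1) mod (DMAX / M)."""
    if not 1 <= dt <= dmax:
        raise ValueError("dt must lie between 1 and DMAX")
    if dmax % m != 0:
        # Assumption of this sketch: each diffraction contributes
        # the same whole number of spectral slices.
        raise ValueError("this sketch assumes DMAX is a multiple of M")
    n = (m * (dt - 1)) // dmax      # index of the diffraction
    lam = (dt - 1) % (dmax // m)    # spectral slice inside that diffraction
    return n, lam


# Illustrative sizes only: M = 7 diffractions, 3 spectral slices each.
pairs = [split_depth_index(dt, dmax=21, m=7) for dt in range(1, 22)]
```

With these illustrative sizes, each of the 21 depth slices of the input layer is associated with a distinct pair (n, λ), the first three slices belonging to diffraction 0, the next three to diffraction 1, and so on.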
According to one embodiment, the compressed image contains an encoded two-dimensional representation of the hyperspectral scene obtained with a mask and a prism. The obtained compressed image contains an image portion of the diffracted and encoded scene. The input layer of the neural network contains a copy of the compressed image according to the following nonlinear relationship: f(xt, yt, dt)={(ximg=xt); (yimg=yt)} (Img=MASK if dt=0; Img=CASSI if dt>0),
with:
dt between 0 and DMAX;
xt between 0 and XMAX;
yt between 0 and YMAX;
XMAX the size along the x-axis of the third order tensor of the input layer;
YMAX the size along the y-axis of the third order tensor of the input layer;
DMAX, the depth of the third order tensor of said input layer;
MASK: image of the compression mask used,
CASSI: measured compressed image,
Img: Selected image whose pixel is copied.
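As a minimal sketch of the relationship above, the input layer for the mask-and-prism case can be populated by copying the mask image into slice 0 and the measured compressed image into every other slice. The function name build_cassi_input, the nested-list image representation and the small example values are assumptions made for illustration; the depth index is taken in the half-open interval [0 . . . DMAX[.

```python
def build_cassi_input(mask, cassi, dmax):
    """Build tensor[dt][yt][xt] with ximg = xt and yimg = yt:
    slice dt = 0 is copied from MASK, slices dt > 0 from CASSI."""
    same_shape = len(mask) == len(cassi) and all(
        len(a) == len(b) for a, b in zip(mask, cassi))
    if not same_shape:
        raise ValueError("MASK and CASSI must have the same size")
    tensor = []
    for dt in range(dmax):
        img = mask if dt == 0 else cassi      # Img selection rule
        tensor.append([row[:] for row in img])  # identity pixel mapping
    return tensor


# Illustrative 2 x 2 images (hypothetical values).
mask = [[1, 0], [0, 1]]
cassi = [[5, 6], [7, 8]]
tensor = build_cassi_input(mask, cassi, dmax=3)
```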
These non-linear relationships make it possible to quickly search for the intensity of the pixels of interest in each diffraction. Indeed, some pixels can be neglected if the wavelength of the diffracted image is not significant.
The architecture of the convolutional deep neural network is composed of an encoder making it possible to search for the elementary features specific to the desired detection, followed by a decoder making it possible to generate an image of probabilities of presence of the features to be detected in said compressed image of the hyperspectral focal plane. The encoder/decoder structure makes it possible to search for the elementary and specific features of the main feature sought in said hyperspectral focal plane.
According to one embodiment, the encoder is composed of a succession of convolutional neuron layers alternating with pooling layers (decimation operator of the previous layer) to reduce the spatial dimension.
According to one embodiment, the decoder is composed of a succession of deconvolution neuron layers alternating with unpooling layers (interpolation operation of the previous layer) allowing an increase in the spatial dimension.
For example, such an encoder/decoder structure is described in “SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation”, Vijay Badrinarayanan, Alex Kendall, Roberto Cipolla.
According to one embodiment, a set of fully connected neuron layers may be positioned between the encoder and the decoder.
According to one embodiment, the convolutional neural network is designed to detect the one or more features sought in said hyperspectral scene from at least one compressed image and at least one non-diffracted standard image of the hyperspectral scene.
The invention thus makes it possible to correlate the information contained in the different diffractions of the compressed image with information contained in the non-diffracted central part of the image obtained.
The compressed image obtained by the optical system contains the focal plane of the non-diffracted scene at the center, as well as the diffracted projections along the axes of the different diffraction filters. Thus, the neural network uses, for the direct detection of the desired features, the following information of said at least one diffracted image:
The sets of standard images and compressed images are thus fused by means of said deep and convolutional formal neural network, taking into account the offsets of images taken from the different optical sources, and a direct detection of said desired features is made from information merged using this same deep and convolutional neural network.
For example, such a structure of encoders merging different images of the same focal plane is described in Eitel, A., Springenberg, J. T., Spinello, L., Riedmiller, M., and Burgard, W., “Multimodal deep learning for robust RGB-D object recognition,” in Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pages 681-687, IEEE.
The present invention utilizes different standard and compressed images of the same hyperspectral focal plane. A deep and convolutional neural network image fusion method is presented in Eitel, A., Springenberg, J. T., Spinello, L., Riedmiller, M., and Burgard, W., “Multimodal deep learning for robust RGB-D object recognition,” in Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pages 681-687, IEEE. This document presents a deep and convolutional neural network structure using two processing paths, one path per image type of the same scene, completed by layers merging the two paths; the function implemented by this deep and convolutional neural network is a classification of images. This structure as such is not adapted for the present invention because it is not adapted to two-dimensional compressed images of a three-dimensional hyperspectral focal plane, and because its function is the classification of the scene and not the detection of features in this scene.
The different diffractions of the compressed image contain significant spectral information, but each of their pixels contains a sum of the diffractions at different wavelengths. An embodiment of the invention can therefore also use the non-diffracted central part of the image, which makes it possible to search the complete image for the spatial (shape, texture, etc.) and spectral (reflectance) characteristics.
According to one embodiment, the neural network is designed to calculate a probability of presence of the one or more features sought in said hyperspectral scene from the set of said at least one compressed image and said at least one standard not-diffracted image.
According to one embodiment, said convolutional neural network is designed so as to take into account the offsets of the focal planes of the different image acquisition sensors and to integrate the homographic function making it possible to merge the information of the different sensors by taking into account the parallaxes of the different images.
According to one embodiment, there is provided a device for capturing an image of a hyperspectral scene and for detecting features in this three-dimensional hyperspectral scene furthermore comprising a system for acquiring at least one non-diffracted standard image of said hyperspectral scene.
According to one embodiment, said at least one non-diffracted standard image is obtained by an infrared sensor of the acquisition system. This embodiment makes it possible to obtain information that is invisible to the human eye.
According to one embodiment, said at least one non-diffracted standard image is obtained by a sensor whose wavelength is between 300 nanometers and 2000 nanometers. This embodiment makes it possible to obtain information in the domain that is visible and invisible to the human eye.
According to one embodiment, said at least one non-diffracted standard image and said at least one compressed image are obtained by a set of semi-transparent mirrors so as to capture the hyperspectral scene on several sensors simultaneously. This embodiment makes it possible to instantly capture identical planes.
According to one embodiment, the acquisition system comprises means for acquiring at least one compressed image of a focal plane of the hyperspectral scene.
According to one embodiment, the compressed image is non-homogeneous.
According to one embodiment, the compressed image is a two-dimensional image.
According to one embodiment, the neural network is designed to generate an image for each desired feature whose value of each pixel at the coordinates (x; y) corresponds to the probability of presence of said feature at the same coordinates of the hyperspectral scene.
According to one embodiment, the obtained compressed image contains the image portion of the non-diffracted scene in the center.
According to one embodiment, the direct detection system does not implement calculation of a hyperspectral cube of the scene for the detection of features.
According to another aspect, the invention relates to a method for detecting features in a three-dimensional hyperspectral scene, characterized in that a system for direct detection of features in said hyperspectral scene integrating a convolutional neural network detects the one or more features sought in said hyperspectral scene from at least one compressed image of the hyperspectral scene.
According to one embodiment, M=7.
According to another aspect, the invention relates to a computer program comprising instructions which, when the program is executed by a computer, cause it to implement the method.
The manner of carrying out the invention as well as the advantages which result therefrom will clearly emerge from the following embodiment, given by way of indication but without limitation, in support of the appended figures in which
By “direct”, when discussing the detection of a feature, it is thus described that the output result of the detection system is the desired feature. We exclude the cases where the output result of the detection system does not correspond to the sought feature, but only corresponds to an intermediate in the calculation of the feature. However, the output result of the direct detection system may, in addition to corresponding to the sought feature, also be used for subsequent processing. In particular, by “direct”, it is meant that the output of the feature detection system is not a hyperspectral cube of the scene which, in itself, does not constitute a feature of the scene.
By “compressed”, we refer to a two-dimensional image of a three-dimensional scene comprising spatial and spectral information of the three-dimensional scene. The spatial and spectral information of the three-dimensional scene is thus projected by means of an optical system on a two-dimensional capture surface. Such a “compressed” image may comprise one or more diffracted images of the three-dimensional scene, or parts thereof. In addition, it may also include a portion of a non-diffracted image of the scene. Thus, the term “compressed” is used because a two-dimensional representation of three-dimensional spectral information is possible. By “spectral”, we understand that we go beyond, in terms of the number of frequencies detected, a “standard” RGB image of the scene.
By “standard”, as opposed to a “compressed” image, reference is made to a non-diffracted image of the hyperspectral scene. Such an image can still be obtained by optical manipulations through reflecting mirrors or lenses.
By “non-homogeneous”, reference is made to an image whose properties are not identical throughout the image. For example, a “non-homogeneous” image may contain, at certain locations, pixels whose information essentially comprises spectral information at a certain wavelength band, as well as, in other locations, pixels whose information essentially comprises non-spectral information. Computer processing of such a “non-homogeneous” image is not possible because the properties required for its processing are not identical according to the locations in this image.
By “feature”, we refer to a characteristic of the scene—this characteristic can be spatial, spectral, correspond to a shape, a color, a texture, a spectral signature or a combination of these, and can in particular be interpreted semantically.
By “object”, reference is made to the common sense used for this term. An object detection on an image corresponds to the location and to a semantic interpretation of the presence of the object on the imaged scene. An object can be characterized by its shape, color, texture, spectral signature or a combination of these features.
As illustrated in
The structure of this optical assembly is relatively similar to that described in the scientific publication “Computed tomography imaging spectrometer: experimental calibration and reconstruction results”, published in APPLIED OPTICS, volume 34 (1995) number 22.
This optical structure makes it possible to obtain a compressed image 11, illustrated in
Alternatively, three diffraction axes may be used on the diffraction grating 24 so as to obtain a compressed image 11 with sixteen diffractions. The three diffraction axes can be equally distributed, that is to say separated from each other by an angle of 60°.
Thus, in a general way, the compressed image comprises 2R+1 diffractions if R equally spaced diffraction gratings are used, that is to say gratings separated from each other by the same angle.
Capture surfaces 26 or 46 (shown below) may correspond to a CCD sensor (for “charge-coupled device” in the English literature, i.e. a charge transfer device), a CMOS sensor (for “complementary metal-oxide-semiconductor” in the English literature, a technology for manufacturing electronic components), or any other known sensor. For example, the scientific publication “Practical Spectral Photography”, published in Eurographics, volume 31 (2012) number 2, proposes to associate this optical structure with a standard digital camera to sense the diffracted image.
Alternatively, as illustrated in
The structure of this optical assembly is relatively similar to that described in the scientific publication “Compressive Coded Aperture Spectral Imaging”, Gonzalo R. Arce, David J. Brady, Lawrence Carin, Henry Arguello, and David S. Kittle.
Alternatively, the capture surfaces 26 or 46 may correspond to the photographic acquisition device of a computer or any other portable device including a photographic acquisition arrangement, by adding the capture device 2 of the hyperspectral scene 3 in front of the photographic acquisition device.
In a variant, the acquisition system 4 may comprise a compact mechanical embodiment integrable in a portable and autonomous device and the detection system is included in said portable and autonomous device.
For example, each pixel of the compressed image 11 is coded on three colors red, green and blue and on 8 bits thus making it possible to represent 256 levels on each color.
Alternatively, the capture surfaces 26 or 46 may be a device that captures wavelengths outside the visible part of the spectrum. For example, the device 2 can integrate a sensor whose wavelength is between 0.001 nanometer and 10 nanometers, a sensor whose wavelength is between 10,000 nanometers and 20,000 nanometers, or a sensor whose wavelength is between 300 nanometers and 2000 nanometers. It can be an infrared device.
When the image 11 of the observed hyperspectral focal plane is obtained, the detection system 1 implements a neural network 12 to detect a feature in the observed scene from the information of the compressed image 11.
This neural network 12 aims to determine the probability of presence of the feature sought for each pixel located at the x and y coordinates of the hyperspectral scene 3 observed.
For this purpose, as illustrated in
The input layer 30 is populated from the pixels forming the compressed image. Thus, the input layer is a third order tensor, and has two spatial dimensions of size XMAX and YMAX, and a depth dimension of size DMAX, corresponding to the number of subsets of the compressed image copied into the input layer. The invention uses the nonlinear relation f(xt, yt, dt)→(ximg, yimg) defined for xtϵ[0 . . . XMAX[, ytϵ[0 . . . YMAX[ and dtϵ[0 . . . DMAX[ for calculating the coordinates ximg and yimg of the pixel of the compressed image whose intensity is copied to the third order tensor of said input layer of the neural network at coordinates (xt, yt, dt).
For example, in the case of a compressed image 11 obtained from the capture device of
with:
n=floor (M (dt−1)/DMAX);
n between 0 and M, the number of diffractions of the compressed image;
λ=(dt−1) mod (DMAX/M);
dt between 1 and DMAX;
xt between 0 and XMAX;
yt between 0 and YMAX;
XMAX the size along the x-axis of the third order tensor of the input layer;
YMAX the size along the y-axis of the third order tensor of the input layer;
DMAX the depth of the third order tensor of the input layer;
λsliceX, the spectral pitch constant along the x-axis of said compressed image;
λsliceY, the spectral pitch constant along the y-axis of said compressed image;
xoffsetX(n) corresponding to the offset along the x-axis of the diffraction n;
yoffsetY(n) corresponding to the offset along the y-axis of the diffraction n.
Floor is a well known truncation operator.
Mod represents the modulo mathematical operator.
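A possible way of populating the input layer from a diffracted compressed image is sketched below. The relation itself is not reproduced in the present text, so the form used here, ximg = xt + xoffsetX(n) + λ·λsliceX and yimg = yt + yoffsetY(n) + λ·λsliceY, is only an assumption inferred from the symbols defined above, and not necessarily the relation of the invention. The function name populate_input_layer, the nested-list image and all numerical values are likewise hypothetical.

```python
def populate_input_layer(image, xmax, ymax, dmax, m,
                         x_offsets, y_offsets, slice_x, slice_y):
    """Copy pixels of image[y][x] into tensor[dt - 1][yt][xt].

    ASSUMED mapping (not taken verbatim from the description):
      ximg = xt + xoffsetX(n) + lam * slice_x
      yimg = yt + yoffsetY(n) + lam * slice_y
    """
    height, width = len(image), len(image[0])
    tensor = [[[0] * xmax for _ in range(ymax)] for _ in range(dmax)]
    for dt in range(1, dmax + 1):
        n = (m * (dt - 1)) // dmax        # diffraction index
        lam = (dt - 1) % (dmax // m)      # spectral index
        for yt in range(ymax):
            for xt in range(xmax):
                ximg = xt + x_offsets[n] + lam * slice_x
                yimg = yt + y_offsets[n] + lam * slice_y
                # Pixels falling outside the image are left at 0.
                if 0 <= ximg < width and 0 <= yimg < height:
                    tensor[dt - 1][yt][xt] = image[yimg][ximg]
    return tensor


# Hypothetical 8 x 4 image whose pixel value encodes its position.
image = [[10 * y + x for x in range(8)] for y in range(4)]
layer = populate_input_layer(image, xmax=2, ymax=2, dmax=4, m=2,
                             x_offsets=[0, 4], y_offsets=[0, 2],
                             slice_x=1, slice_y=0)
```

In this toy configuration, two diffractions of two spectral slices each are read, every slice being a 2 x 2 window of the compressed image shifted by the offset of its diffraction and by the spectral pitch.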
As is particularly clearly seen in
In a variant, the invention makes it possible to correlate the information contained in the different diffractions of the diffracted image with information contained in the non-diffracted central part of the image.
According to this variant, it is possible to add an additional slice in the direction of the depth of the input layer, the neurons of which will be populated with the intensity detected in the pixels of the compressed image corresponding to the non-diffracted detection. For example, if we assign to this slice the coordinate dt=0, we can preserve the formula above for the population of the input layer for dt greater than or equal to 1, and populate the layer dt=0 in the following way:
ximg=(Imgwidth/2)−XMAX+xt;
yimg=(Imgheight/2)−YMAX+yt;
With:
Imgwidth the size of the compressed image along the x axis;
Imgheight the size of the compressed image along the y axis.
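The population of the slice dt=0 from the non-diffracted part of the image can be sketched as follows. This is a minimal illustration under stated assumptions: the image is stored as a list of rows indexed as image[y][x], the division by two is an integer division, and the function names are hypothetical. No bounds checking is performed.

```python
def central_pixel(xt, yt, img_width, img_height, XMAX, YMAX):
    """Coordinates of the compressed-image pixel copied to (xt, yt, dt=0),
    following the two formulas given in the text."""
    ximg = (img_width // 2) - XMAX + xt
    yimg = (img_height // 2) - YMAX + yt
    return ximg, yimg


def fill_central_slice(image, XMAX, YMAX):
    """Build the dt=0 slice as a YMAX x XMAX list of rows."""
    img_height, img_width = len(image), len(image[0])
    plane = []
    for yt in range(YMAX):
        row = []
        for xt in range(XMAX):
            ximg, yimg = central_pixel(xt, yt, img_width, img_height, XMAX, YMAX)
            row.append(image[yimg][ximg])
        plane.append(row)
    return plane
```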
The compressed image obtained by the optical system contains the focal plane of the non-diffracted scene at the center, as well as the diffracted projections along the axes of the different diffraction filters. Thus, the neural network uses, for the direct detection of the desired features, the following information of said at least one diffracted image:
Alternatively, in the case of a compressed image 13 obtained from the capture device of
f(xt, yt, dt)={(ximg=xt); (yimg=yt)} (Img=MASK if dt=0; Img=CASSI if dt>0),
With:
MASK: image of the compression mask used,
CASSI: measured compressed image,
Img: Selected image whose pixel is copied.
On slice 0 of the third order tensor of the input layer the image of the employed compression mask is copied.
On the other slices of the third order tensor of the input layer the compressed image of the hyperspectral scene is copied.
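The population described above for a compressed image 13 can be sketched as follows. The sketch assumes that the mask image and the measured image have the same size and are stored as lists of rows, and that the tensor is indexed as tensor[dt][yt][xt] with DMAX slices in total. The function name is hypothetical.

```python
def cassi_input_tensor(mask, cassi, DMAX):
    """Slice 0 receives the compression mask; every other slice receives
    the measured compressed image, with ximg = xt and yimg = yt."""
    if len(mask) != len(cassi) or len(mask[0]) != len(cassi[0]):
        raise ValueError("mask and compressed image must have the same size")
    tensor = []
    for dt in range(DMAX):
        img = mask if dt == 0 else cassi
        tensor.append([row[:] for row in img])  # copy, do not alias the source
    return tensor
```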
The architecture of said neural network 12, 14 is composed of a set of convolutional layers assembled linearly and alternately with layers of decimation (pooling), or interpolation (unpooling).
A convolutional layer of depth d, denoted CONV(d), is defined by d convolution kernels, each of these kernels being applied to the volume of the third-order input tensor of size xinput, yinput, dinput. The convolutional layer thus generates an output volume, a tensor of order three, having a depth d. An activation function ACT is applied to the calculated values of the output volume of this convolutional layer.
The parameters of each convolutional kernel of a convolutional layer are specified by the neural network learning procedure.
Different activation functions ACT can be used. For example, this function can be a ReLu function, defined by the following equation:
ReLu (x)=max (0, x)
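A layer CONV(d) followed by the ReLu activation can be sketched as follows. The text does not specify the kernel size, the stride or the border handling, so this sketch assumes a stride of one, zero padding that preserves the width and height, no bias term, and a tensor layout tensor[depth][y][x]. The kernel values, which the text obtains by learning, are supplied by the caller here.

```python
def relu(x):
    """ReLu(x) = max(0, x), as defined in the text."""
    return max(0.0, x)


def conv_layer(tensor, kernels, act=relu):
    """Apply len(kernels) kernels to tensor[d][y][x].
    kernels[k][d][ky][kx]; returns output[k][y][x] of depth len(kernels)."""
    d_in, height, width = len(tensor), len(tensor[0]), len(tensor[0][0])
    output = []
    for kernel in kernels:
        kh, kw = len(kernel[0]), len(kernel[0][0])
        plane = [[0.0] * width for _ in range(height)]
        for y in range(height):
            for x in range(width):
                acc = 0.0
                for d in range(d_in):
                    for ky in range(kh):
                        for kx in range(kw):
                            yy = y + ky - kh // 2
                            xx = x + kx - kw // 2
                            # Pixels outside the tensor count as zero.
                            if 0 <= yy < height and 0 <= xx < width:
                                acc += kernel[d][ky][kx] * tensor[d][yy][xx]
                plane[y][x] = act(acc)
        output.append(plane)
    return output
```

Passing d kernels yields an output tensor of depth d with the same width and height as the input, which is the behavior the text attributes to CONV(d).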
In alternation with the convolutional layers, layers of decimation (pooling), or layers of interpolation (unpooling) are inserted.
A decimation layer reduces the width and height of the third-order input tensor at each depth of said tensor. For example, a decimation layer MaxPool(2,2) selects the maximum value within a tile of 2×2 values sliding over the surface. This operation is applied to all depths of the input tensor and generates an output tensor having the same depth, a width divided by two, and a height divided by two.
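The MaxPool(2,2) decimation can be sketched as follows, assuming non-overlapping 2×2 tiles (a stride of two, which is what halves the width and height) and a tensor layout tensor[depth][y][x]. The function name is hypothetical.

```python
def max_pool_2x2(tensor):
    """Keep the maximum of each non-overlapping 2x2 tile, depth by depth."""
    out = []
    for plane in tensor:
        height, width = len(plane), len(plane[0])
        out.append([
            [max(plane[y][x], plane[y][x + 1],
                 plane[y + 1][x], plane[y + 1][x + 1])
             for x in range(0, width - 1, 2)]
            for y in range(0, height - 1, 2)
        ])
    return out
```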
An interpolation layer increases the width and height of the third-order input tensor at each depth of said tensor. For example, a MaxUnpool(2,2) interpolation layer copies the value of each input point onto a surface of 2×2 output values. This operation is applied to all depths of the input tensor and generates an output tensor with the same depth, a width multiplied by two, and a height multiplied by two.
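The interpolation described above can be sketched as follows, with a tensor layout tensor[depth][y][x] and a hypothetical function name. The sketch implements the copy behavior stated in the text. Some frameworks instead implement max-unpooling by restoring each value at the position recorded during pooling, which is a different operation.

```python
def unpool_2x2(tensor):
    """Copy each input value onto a 2x2 block of output values, depth by depth."""
    out = []
    for plane in tensor:
        upsampled = []
        for row in plane:
            wide = [value for value in row for _ in range(2)]  # double the width
            upsampled.append(wide)
            upsampled.append(list(wide))                       # double the height
        out.append(upsampled)
    return out
```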
A neural network architecture for the direct detection of features in the hyperspectral scene can be as follows:
CONV (64)
MaxPool (2,2)
CONV (64)
MaxPool (2,2)
CONV (64)
MaxPool (2,2)
CONV (64)
CONV (64)
MaxUnpool (2,2)
CONV (64)
MaxUnpool (2,2)
CONV (64)
MaxUnpool (2,2)
CONV (1)
Output
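The effect of this architecture on the tensor dimensions can be traced with the following sketch. It assumes that each CONV(d) layer preserves the width and height and sets the depth to d, and that the input width and height are multiples of eight. The names `ARCHITECTURE` and `trace_shapes` are hypothetical.

```python
ARCHITECTURE = [
    ("CONV", 64), ("MaxPool", 2),
    ("CONV", 64), ("MaxPool", 2),
    ("CONV", 64), ("MaxPool", 2),
    ("CONV", 64), ("CONV", 64),
    ("MaxUnpool", 2), ("CONV", 64),
    ("MaxUnpool", 2), ("CONV", 64),
    ("MaxUnpool", 2), ("CONV", 1),
]


def trace_shapes(width, height, depth, layers=ARCHITECTURE):
    """Return the (width, height, depth) of the tensor after each layer,
    starting with the input layer itself."""
    shapes = [(width, height, depth)]
    for kind, arg in layers:
        if kind == "CONV":
            depth = arg                                  # assumed size-preserving
        elif kind == "MaxPool":
            width, height = width // arg, height // arg
        elif kind == "MaxUnpool":
            width, height = width * arg, height * arg
        else:
            raise ValueError("unknown layer type: " + kind)
        shapes.append((width, height, depth))
    return shapes
```

Under these assumptions, the three decimation layers divide the width and height by eight, the three interpolation layers restore them, and the final CONV(1) yields one value per pixel of the input surface.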
Alternatively, the number of layers of convolution CONV(d) and decimation MaxPool (2,2) can be modified to facilitate the detection of features having higher semantic complexity. For example, a higher number of convolutional layers makes it possible to process more complex signatures of shape, texture, or spectral characteristics of the feature sought in the hyperspectral scene.
As a variant, the number of layers of deconvolution CONV (d) and interpolation MaxUnpool (2, 2) can be modified in order to facilitate the reconstruction of the output layer. For example, a higher number of deconvolution layers makes it possible to reconstruct an output with greater precision.
Alternatively, convolution layers CONV(64) may have a different depth than 64 in order to process a different number of local features. For example, a depth of 128 allows local processing of 128 different features in a complex hyperspectral scene.
Alternatively, the MaxUnpool(2, 2) interpolation layers may have a different interpolation size. For example, a MaxUnpool(4, 4) layer increases the processing dimension of the upper layer.
As a variant, the activation layers ACT of the ReLu(x) type inserted after each convolution and deconvolution may be of a different type. For example, the softplus function defined by the equation f(x)=log(1+e^x) can be used.
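The softplus function can be sketched as follows. The rewriting used in the code, max(x, 0) + log(1 + e^(−|x|)), is algebraically equal to log(1 + e^x) and is chosen here only to avoid numerical overflow for large x; that choice is an implementation detail of this sketch, not part of the text.

```python
import math


def softplus(x):
    """f(x) = log(1 + e^x), evaluated in an overflow-safe form."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

Unlike ReLu, softplus is strictly positive and smooth around zero, while approaching ReLu for inputs of large magnitude.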
Alternatively, the MaxPool(2, 2) decimation layers may have a different decimation size. For example, a MaxPool(4, 4) layer can reduce the spatial dimension more quickly and focus the semantic search of the neural network on local features.
Alternatively, fully connected layers may be inserted between the two central convolutional layers at line 6 of the description to process the detection in a higher mathematical space. For example, three fully connected layers of size 128 can be inserted.
In a variant, the dimensions of the convolutional layer CONV(64), the decimation MaxPool(2, 2) layers and the interpolation MaxUnpool(2, 2) layers can be adjusted on one or more layers, in order to adapt the neural network architecture as closely as possible to the type of features sought in the hyperspectral scene.
The weights of said neural network 12 are calculated by means of training. For example, learning through backpropagation of the gradient, or its derivatives, from training data can be used to calculate these weights.
As a variant, the neural network 12 can determine the probability of presence of several distinct features within the same observed scene. In this case, the last convolutional layer will have a depth corresponding to the number of distinct features to be detected. Thus the convolutional layer CONV (1) is replaced by a convolutional layer CONV (u), where u corresponds to the number of distinct features to be detected.
As illustrated in
The capture surface 32 (shown below) may correspond to a CCD (charge-coupled device) sensor, to a CMOS (complementary metal-oxide-semiconductor) sensor, or to any other known sensor.
The capture device 102 may further comprise an uncompressed “standard” image acquisition device comprising a converging lens 131 and a capture surface 32. The capture device 102 may further comprise a device for acquiring a compressed image as described above with reference to
In the presented example, the standard image acquisition device and the compressed image acquisition device are arranged juxtaposed with parallel optical axes, and optical beams overlapping at least partially. Thus, a portion of the hyperspectral scene is imaged by both the acquisition devices. Thus, the focal planes of the different image acquisition sensors are offset relative to each other transversely to the optical axes of these sensors.
Alternatively, a set of partially reflective mirrors is used to capture said at least one non-diffracted standard image 112 and said at least one compressed image 11, 13 of the same hyperspectral scene 3 on multiple sensors simultaneously.
Preferably, each pixel of the standard image 112 is coded in three colors (red, green and blue), each on 8 bits, thus making it possible to represent 256 levels for each color.
Alternatively, the capture surface 32 may be a device whose captured wavelengths are not in the visible range. For example, the device 2 can integrate sensors whose wavelength is between 0.001 nanometer and 10 nanometers, a sensor whose wavelength is between 10,000 nanometers and 20,000 nanometers, or a sensor whose wavelength is between 300 nanometers and 2,000 nanometers.
When the images 11, 112 or 13 of the observed hyperspectral focal plane are obtained, the detection means implements a neural network 14 to detect a feature in the observed scene from the information of the compressed images 11 and 13, and the standard image 112.
As a variant, only the compressed 11 and standard 112 images are used and processed by the neural network 14.
As a variant, only the compressed 13 and standard 112 images are used and processed by the neural network 14.
Thus, when the description refers to a set of compressed images, this means at least one compressed image.
This neural network 14 aims to determine the probability of presence of the feature sought for each pixel located at the x and y coordinates of the observed hyperspectral scene 3.
To do this, as illustrated in
As illustrated in
The above-described filling corresponds to the population of the first input (“Input1”) of the neural network, according to the architecture presented below.
For the second input (“Input2”) of the neural network, the portion of the input layer relating to the “standard” image is populated by directly copying the “standard” image into the neural network.
According to an exemplary embodiment where a compressed image 13 is also used, the third input “Input3” of the neural network is populated as described above for the compressed image 13.
A neural network architecture for the direct detection of features in the hyperspectral scene may be as follows:
CONV (64)
CONV (64)
MaxPool (2,2)
MaxPool (2,2)
CONV (64)
CONV (64)
MaxPool (2,2)
MaxPool (2,2)
MaxUnpool (2,2)
MaxUnpool (2,2)
MaxUnpool (2,2)
In this description, “Input1” corresponds to the portion of the input layer 50 populated from the compressed image 11. “Input2” corresponds to the portion of the input layer 50 populated from the standard image 112, and “Input3” corresponds to the portion of the input layer 50 populated from the compressed image 13. The line “CONV (64)” at the fifth line of the architecture operates the merger of the information.
In a variant, the information-merging line “CONV (64)” at the fifth line of the architecture may be replaced by a fully connected layer that takes as input all of the MaxPool(2, 2) outputs of the processing paths of all inputs “Input1”, “Input2” and “Input3”, and outputs a tensor of order one serving as input to the next layer “CONV (64)” presented in the sixth line of the architecture.
In particular, the fusion layer of the neural network takes into account the offsets of the focal planes of the different image acquisition sensors, and integrates the homographic function making it possible to merge the information of the different sensors by taking into account the parallaxes of the different images.
The variants presented above for the first embodiment can also be applied here.
The weights of said neural network 14 are calculated by means of training. For example, learning through backpropagation of the gradient, or its derivatives, from training data can be used to calculate these weights.
Alternatively, the neural network 14 can determine the probability of presence of several distinct features within the same observed scene. In this case, the last convolutional layer will have a depth corresponding to the number of distinct features to be detected. Thus the convolutional layer CONV(1) is replaced by a convolutional layer CONV(u), where u corresponds to the number of distinct features to be detected.
According to an alternative embodiment, as shown in
Thus, the neural network 14 uses, for the direct detection of the sought features, the information of said at least one compressed image as follows:
The invention has been presented above in various variants, in which the detected feature of the hyperspectral scene is a two-dimensional image in which the value of each pixel at coordinates x and y corresponds to the probability of presence of a feature at the same x and y coordinates of the hyperspectral focal plane of the scene 3. It is possible, however, to provide, according to embodiments of the invention, for the detection of other features. According to one example, such another feature can be obtained from the image produced by the neural network presented above. For this purpose, the neural network 12, 14 may have a subsequent layer adapted to process the image in question and determine the desired feature. According to an example, this subsequent layer may count the pixels of the image in question for which the probability is greater than a certain threshold. The result obtained is then an area (possibly divided by a standard area of the image). According to an example of application, if the image has, in each pixel, a probability of presence of a chemical compound, the result obtained can then correspond to a concentration of the chemical compound in the imaged hyperspectral scene.
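The pixel-counting layer described above can be sketched as follows. The probability image is assumed to be stored as a list of rows, the default threshold of 0.5 is an arbitrary illustrative value, and the total number of pixels is used as the standard area. The function name is hypothetical.

```python
def area_fraction(prob_map, threshold=0.5):
    """Count the pixels whose probability exceeds the threshold and
    return (pixel count, count divided by the total number of pixels)."""
    total = sum(len(row) for row in prob_map)
    above = sum(1 for row in prob_map for p in row if p > threshold)
    return above, above / total
```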
According to another example, this subsequent layer may have only one neuron whose value (real or Boolean) indicates the presence or absence of an object or a feature sought in the hyperspectral scene. This neuron will have a maximum value in case of presence of the object or feature and a minimum value in the opposite case. This neuron will be fully connected to the previous layer, and the connection weights will be calculated by means of training.
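Such a single output neuron can be sketched as follows. The text does not specify its activation function, so the sigmoid used here, which bounds the output between 0 (absence) and 1 (presence), is an assumption. The weights and bias are placeholders supplied by the caller, whereas the text obtains them by training.

```python
import math


def presence_neuron(prev_layer, weights, bias):
    """Single neuron fully connected to every value of the previous layer.
    prev_layer and weights are lists of rows with identical shapes."""
    z = bias + sum(
        w * p
        for w_row, p_row in zip(weights, prev_layer)
        for w, p in zip(w_row, p_row)
    )
    return 1.0 / (1.0 + math.exp(-z))  # assumed sigmoid activation
```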
According to a variant, it will be understood that the neural network can also be designed to determine this feature (for example to detect this concentration) without going through the determination of an image of probability of presence of the feature in each pixel.
Detection system 1
capture device 2
hyperspectral scene 3
acquisition system 4
compressed image in two dimensions 11, 13
neural network 12, 14
first convergent lens 21
opening 22
collimator 23
diffraction grating 24
second convergent lens 25
capture surface 26
input layer 30
output layer 31
capture surface 32
first convergent lens 41
mask 42
collimator 43
prism 44
second converging lens 45
capture surface 46
input layer 50
encoder 51
convolution layer or fully connected layer 52
decoder 53
sensor 101
capture device 102
focal plane 103
standard image 112
lens 131
| Number | Date | Country | Kind |
|---|---|---|---|
| 1873313 | Dec 2018 | FR | national |
| 1901202 | Feb 2019 | FR | national |
| 1905916 | Jun 2019 | FR | national |

| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/EP2019/085847 | 12/18/2019 | WO | 00 |