The present disclosure claims the benefit of priority under the Paris Convention to Chinese Patent Application No. 202310366088.8 filed on Mar. 30, 2023, which is incorporated herein by reference in its entirety.
The present disclosure relates to the technical field of image processing, and in particular to a method and apparatus for binocular depth estimation, an embedded device, and a readable storage medium.
Binocular depth estimation refers to searching for matched pixel points on obtained left and right view images, calculating a corresponding disparity, and then calculating a real physical distance (i.e., depth information) in combination with known binocular camera information. Conventional binocular depth estimation algorithms are mostly local or global search algorithms, i.e., a matching degree of pixels on the left and right view images is calculated by constructing a cost function, and the pixels with the minimum cost within a search range are selected as matching points.
However, some problems are found in these conventional algorithms in practical application. For example, a five-dimensional disparity cost space is constructed in a conventional method, so a large amount of three-dimensional convolution is required to generate the depth estimation, which leads to large memory occupation and a large calculation amount. In particular, the conventional method cannot be applied on embedded devices with low computing power, so depth estimation cannot be carried out on such devices in real time. In addition, problems such as low precision are likely to occur due to too many matching candidates.
In order to describe the technical schemes in the embodiments of the present disclosure more clearly, the following briefly introduces the accompanying drawings required for describing the embodiments. It should be understood that the accompanying drawings in the following description merely show some embodiments and should not be considered as limiting the scope. For those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present disclosure.
Components in the embodiments of the present disclosure, which are generally described and illustrated herein, may be arranged and designed in a variety of different configurations. Therefore, the following detailed description of the embodiments of the present disclosure provided in the accompanying drawings is not intended to limit the scope of the claimed present disclosure, but merely represents a selected embodiment of the present disclosure. Based on the embodiments of the present disclosure, all other embodiments obtained by a person skilled in the art without creative efforts shall fall within the protection scope of the present disclosure.
In the following, the terms ‘comprising,’ ‘including,’ ‘having,’ and their cognates, as used in various embodiments of the present disclosure, are intended to express inclusion of specific features, numbers, steps, operations, elements, components, or combinations thereof. They should not be construed to exclude the presence of one or more other features, numbers, steps, operations, elements, components, or combinations thereof, or exclude the possibility of adding one or more features, numbers, steps, operations, elements, components, or combinations thereof. Additionally, terms such as ‘first,’ ‘second,’ ‘third,’ etc., are used for descriptive purposes only and should not be interpreted as indicating or implying relative importance.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which various embodiments of the present disclosure belong. Terms, such as those defined in commonly used dictionaries, will be interpreted as having meanings consistent with their contextual meaning in the relevant technical field and will not be construed as having an idealized or overly formal meaning unless expressly defined as such in the various embodiments of the present disclosure.
Some embodiments of the present disclosure are described in detail below with reference to the accompanying drawings. In the case of no conflict, the following embodiments and features in the embodiments may be combined with each other.
Disparity is the position difference in the horizontal direction between corresponding pixel points on the left and right images, and the target of binocular depth estimation is to calculate the disparity of each pixel point on a reference image, so the precision of the depth improves along with the precision of the disparity prediction. Generally, conventional binocular depth estimation mainly includes the following four steps: matching cost calculation, cost aggregation, disparity calculation, and disparity optimization. However, in existing solutions, a disparity cost volume with a total of 5 dimensions is usually constructed for the matching cost calculation, described as a cost space with dimensions [B, C, D, H, W], where B represents Batch, C represents Channel, D represents Disparity, H represents Height, and W represents Width. It should be understood that the larger the total dimension is, the more memory the disparity cost space occupies, the more complex the subsequent steps of cost aggregation, disparity calculation, and disparity optimization become, and the larger the corresponding calculation amount is. Therefore, since an embedded device has insufficient computing power and limited memory, the above conventional method cannot be implemented on the embedded device, and binocular depth estimation is restricted on such devices.
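As a rough numerical illustration of this memory difference (a hypothetical sketch: the batch size, channel number, disparity range, image size, and data type below are assumed for illustration only and are not values taken from the present disclosure):

```python
# Rough memory comparison between a conventional 5-D cost volume [B, C, D, H, W]
# and a 4-D cost volume [B, D, H, W] whose channel number is reduced to 1.
# All sizes are illustrative assumptions.
B, C, D, H, W = 1, 32, 64, 480, 640   # batch, channels, disparities, height, width
bytes_per_value = 4                    # float32

cost_5d = B * C * D * H * W * bytes_per_value
cost_4d = B * 1 * D * H * W * bytes_per_value

print(f"5-D cost volume: {cost_5d / 2**20:.1f} MiB")  # 2400.0 MiB
print(f"4-D cost volume: {cost_4d / 2**20:.1f} MiB")  # 75.0 MiB, i.e., 1/C of the above
```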
Based on this, the present disclosure provides a lightweight method for binocular depth estimation. After feature extraction is performed on the binocular images, a disparity cost volume with a reduced dimension is constructed, attention-based feature weighting is applied to the cost volume, disparity regression is performed based on a two-dimensional convolution to obtain a prediction disparity map, and the prediction disparity map is converted into a depth map.
The method for binocular depth estimation is described below in combination with some specific embodiments.
S110, obtaining binocular images and performing feature extraction on the binocular images to obtain left and right feature mappings.
As an example, a binocular camera is installed on the embedded device to obtain binocular images (also referred to as left and right images or left and right eye images) in real time, and feature extraction is then performed on the obtained binocular images to obtain respective feature mappings of the left image and the right image. A pre-trained neural network may be used to perform the feature extraction, and its specific structure is not limited herein; for example, a lightweight neural network such as one of the MobileNet series may preferentially be used as the feature extraction model.
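The following is a minimal sketch of this feature-extraction step, assuming a PyTorch/torchvision environment; the MobileNetV2 backbone, the layer cut-off, and the tensor shapes are illustrative assumptions rather than a structure prescribed by the present disclosure.

```python
import torch
import torchvision

# Lightweight backbone used as the feature extraction model (illustrative choice);
# only the early layers are kept to keep the computation small.
backbone = torchvision.models.mobilenet_v2().features[:7]

def extract_features(left_img: torch.Tensor, right_img: torch.Tensor):
    """left_img / right_img: preprocessed image tensors of shape [B, 3, H, W]."""
    feat_left = backbone(left_img)     # [B, C, H', W'] left feature mapping
    feat_right = backbone(right_img)   # [B, C, H', W'] right feature mapping
    return feat_left, feat_right
```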
If feature extraction and depth estimation are performed directly on the original binocular images, the matching efficiency is greatly reduced when the pixels are matched, because there is no prior constraint between the binocular images and, for each pixel of the left image, a matched pixel needs to be searched over the full-image space of the right image. Thus, in the embodiments, the binocular images are preprocessed before the feature extraction is performed, so as to greatly improve the efficiency of the subsequent disparity calculation.
In some embodiments, before performing feature extraction on the binocular images, the method further includes preprocessing the obtained binocular images. It should be understood that the preprocessed binocular images are used for subsequent feature extraction. The preprocessing may include, but is not limited to, performing epipolar alignment on the original binocular images that are obtained, and performing pixel normalization processing on the aligned binocular images.
As an example, when performing the epipolar alignment, the original binocular images are rectified according to the calibrated intrinsic and extrinsic parameters of the binocular cameras, so that corresponding pixel points in the left and right images lie on the same horizontal line. In this way, the search for a matched pixel is reduced from a two-dimensional search over the whole image to a one-dimensional search along the horizontal direction.
In addition, after the epipolar alignment is performed, pixel normalization processing is also performed on the binocular images to facilitate the extraction of image features. Specifically, the pixel distribution of the left and right images is normalized from the original range of 0-255 to 0-1, the mean value of each RGB channel is subtracted from every value in that channel, and the result is then divided by the standard deviation of the channel, thereby obtaining standardized binocular images. Finally, the standardized binocular images are respectively input into the feature extraction model to obtain the left and right feature mappings.
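A possible sketch of this preprocessing, assuming an OpenCV/NumPy environment and that the calibrated intrinsics (K1, K2), distortion coefficients (d1, d2), and the rotation R and translation T between the two cameras are available; all variable names and the normalization statistics are illustrative assumptions.

```python
import cv2
import numpy as np

def preprocess(left_bgr, right_bgr, K1, d1, K2, d2, R, T):
    h, w = left_bgr.shape[:2]
    # Epipolar alignment: rectify both views so that matched points share a scanline.
    R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K1, d1, K2, d2, (w, h), R, T)
    map1x, map1y = cv2.initUndistortRectifyMap(K1, d1, R1, P1, (w, h), cv2.CV_32FC1)
    map2x, map2y = cv2.initUndistortRectifyMap(K2, d2, R2, P2, (w, h), cv2.CV_32FC1)
    left_rect = cv2.remap(left_bgr, map1x, map1y, cv2.INTER_LINEAR)
    right_rect = cv2.remap(right_bgr, map2x, map2y, cv2.INTER_LINEAR)

    # Pixel normalization: scale 0-255 to 0-1, then standardize each RGB channel.
    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)  # assumed ImageNet statistics
    std = np.array([0.229, 0.224, 0.225], dtype=np.float32)

    def normalize(img):
        rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
        return (rgb - mean) / std

    return normalize(left_rect), normalize(right_rect)
```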
S120, performing disparity construction by using the left and right feature mappings to obtain a disparity cost volume with a reduced dimension.
The cost volume is used in binocular matching to measure the similarity of two image blocks in the left and right binocular images, so as to determine whether the two image blocks match. It should be noted that, different from the conventional cost volume construction mode, when constructing the disparity cost volume in the embodiments of the present disclosure, dimension reduction on the channel dimension is realized by applying a multi-layer convolution to the feature mappings, and a loop-and-append mode is then adopted on this basis, that is, new space is continuously appended only in the disparity dimension to obtain a disparity cost volume with a total of 4 dimensions. Since an append operation on a single dimension is supported by current embedded devices, this construction is convenient to implement on an embedded platform.
As an example, the dimension reduction is implemented by combining a multi-layer convolution structure with group-wise correlation calculation, so as to construct a cost space with a channel number of 1 (C=1) and reduce the memory occupation to 1/C of the original. In addition, a two-dimensional convolution can be used for calculation in the subsequent disparity regression process, whose calculation amount is reduced to about ⅓ of that of a conventional solution using a three-dimensional convolution. This makes the solution easy to transfer to an embedded device with lower computing power.
In one embodiment, the disparity construction in step S120 includes the following steps S121 to S124.
S121, performing channel dimension reduction on a left feature mapping and a right feature mapping each having an original channel number to obtain a left feature mapping and a right feature mapping each having a first channel number.
For example, the first channel number may be C/4 or the like. Specifically, a common convolutional layer structure may be used to reduce the left feature mapping and the right feature mapping, whose original channel number is C, to C/4 respectively. The convolution layer used for channel dimension reduction may adopt any common dimension-reduction convolution structure, which is not specified herein.
S122, moving the right feature mapping having the first channel number in a horizontal direction according to a pixel step, and performing channel dimension splicing on the right feature mapping having the first channel number and the left feature mapping having the first channel number to obtain a disparity feature mapping having a second channel number.
The dimension-reduced feature mappings are subjected to a splicing operation in the channel dimension, thereby obtaining the disparity feature mapping having the second channel number. It should be understood that the left and right images already lie on the same horizontal line thanks to the previous preprocessing operations such as epipolar alignment. When the feature mappings are moved relative to each other in the horizontal direction (i.e., the width W direction of the images), the same object, which appears with a disparity between the two views, is overlapped on the channel dimension, so that different disparities can be constructed artificially.
As an example, the dimension-reduced right feature mapping may be moved in the horizontal direction by one pixel step at a time, and for each pixel step moved, the disparity increases by one. After each move, the right feature mapping is spliced with the left feature mapping having the first channel number to obtain a splicing feature mapping having the second channel number under the corresponding disparity, and finally all the splicing feature mappings are combined to obtain the above disparity feature mapping, as illustrated in the sketch below. It should be understood that the second channel number is equal to the sum of two first channel numbers; still taking the above C/4 as an example, the second channel number is C/2.
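A minimal PyTorch-style sketch of steps S121 and S122 (channel reduction followed by shift-and-splice); the shared 1x1 reduction layer, the zero padding used for the shift, and all names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DisparityFeatureBuilder(nn.Module):
    def __init__(self, in_channels: int, max_disp: int):
        super().__init__()
        # S121: reduce the original channel number C to C/4 (a shared layer is assumed;
        # separate layers for the left and right mappings are equally possible).
        self.reduce = nn.Conv2d(in_channels, in_channels // 4, kernel_size=1)
        self.max_disp = max_disp

    def forward(self, feat_left, feat_right):
        fl = self.reduce(feat_left)    # [B, C/4, H, W]
        fr = self.reduce(feat_right)   # [B, C/4, H, W]
        width = fr.shape[3]
        slices = []
        for d in range(self.max_disp):
            # S122: shift the right feature mapping d pixels along the width direction
            # (zero padding on the left), so that pixels with disparity d line up.
            fr_shifted = F.pad(fr, (d, 0))[..., :width] if d > 0 else fr
            slices.append(torch.cat([fl, fr_shifted], dim=1))  # [B, C/2, H, W] per disparity
        # Append along a new disparity dimension -> [B, C/2, D, H, W] disparity feature mapping.
        return torch.stack(slices, dim=2)
```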
S123, performing channel dimension splicing on the left feature mapping and the right feature mapping each having the first channel number and the disparity feature mapping having the second channel number to obtain a disparity feature mapping having the original channel number.
The dimension-reduced left and right feature mappings and the disparity feature mapping are spliced again, that is, a final splicing is performed to obtain a disparity feature mapping whose channel number is still C.
S124, performing channel dimension reduction on the disparity feature mapping having the original channel number to obtain a disparity feature mapping having a channel number of 1 to serve as the disparity cost volume with the reduced dimension.
In order to construct the cost volume, the original left and right feature mappings are divided into a plurality of groups along the channel dimension in the embodiments of the present disclosure, and then the i-th left feature mapping group and the corresponding i-th right feature mapping group are cross-correlated at all disparity levels, thereby obtaining an inter-group correlation map; that is, the group-wise correlation provides a better similarity measure between the left and right images.
Specifically, the left and right feature mappings having the original channel number are subjected to group-wise correlation calculation to obtain a group-wise correlation feature mapping. Then, convolution processing is performed on the left and right feature mappings having the first channel number and group-wise correlation feature mapping to obtain the disparity feature mapping having a channel number of 1.
As an example, the disparity cost volume may be represented by the following expression:

C(d, x, y) = \mathrm{Convs}_{1c}\Big( \mathrm{Concat}\big( f_l^{c/4}(x, y),\; f_r^{c/4}(x-d, y),\; \mathrm{Gwc}(d, x, y) \big) \Big)

\mathrm{Gwc}(d, x, y) = \frac{1}{N_c / N_g} \sum_{g=1}^{N_g} \big\langle f_l^{g}(x, y),\; f_r^{g}(x-d, y) \big\rangle

in the formula, C(d, x, y) represents the disparity cost volume, \mathrm{Convs}_{1c} represents a convolution operation that reduces the channel number to 1, c is the total number of channels, f_l^{c/4}(x, y) and f_r^{c/4}(x-d, y) respectively represent the left feature mapping and the right feature mapping each having the channel number of c/4, (x, y) represents the coordinates of a pixel point in the left image, (x-d, y) represents the coordinates of the pixel point in the right image whose disparity is d, \mathrm{Gwc} represents the group-wise correlation calculation, N_c/N_g represents the number of channels of each feature mapping group, where N_g is the number of divided groups, and f_l^{g}(x, y), f_r^{g}(x-d, y) respectively represent a left feature mapping group and a right feature mapping group obtained by dividing the left and right feature mappings having the channel number of c into N_g groups.
It should be understood that a cost space having a channel number of C=1, a disparity number D, and a total dimension of [B, 1, D, H, W] may be constructed by means of the above convolution and dimension reduction. Since C=1, the total dimension of the cost volume may be regarded as only [B, D, H, W], that is, the memory occupation in the channel dimension is reduced to 1/C of the original. In addition, when constructing the cost space, the embedded device uses the append operation it natively supports to continuously add new space only in the disparity dimension.
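A possible sketch of the group-wise correlation and of the \mathrm{Convs}_{1c} squeeze in the expression above, again under PyTorch assumptions; the group number N_g and the convolution sizes are illustrative, not prescribed by the disclosure.

```python
import torch
import torch.nn as nn

def groupwise_correlation(fl, fr_shifted, num_groups: int):
    """fl, fr_shifted: [B, C, H, W]; returns the [B, Ng, H, W] correlation map Gwc(d, x, y)."""
    B, C, H, W = fl.shape
    cpg = C // num_groups                                   # Nc / Ng channels per group
    return (fl.view(B, num_groups, cpg, H, W) *
            fr_shifted.view(B, num_groups, cpg, H, W)).sum(dim=2) / cpg

def make_squeeze(channels: int, num_groups: int = 8) -> nn.Module:
    # Convs_1c: squeeze the concatenated [f_l^{c/4}, f_r^{c/4}, Gwc] features
    # (C/2 + Ng channels) down to a single channel with 2-D convolutions.
    return nn.Sequential(
        nn.Conv2d(channels // 2 + num_groups, 32, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(32, 1, kernel_size=1))

# For each disparity level d, the [B, 1, H, W] output of the squeeze is appended along the
# disparity dimension (the single-dimension append operation mentioned above), which yields
# the 4-D cost volume [B, D, H, W] with a channel number of 1.
```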
S130, performing attention feature learning on the disparity cost volume to obtain an attention feature vector, and performing feature weighting on the disparity cost volume by using the attention feature vector to obtain a weighted cost volume.
In some embodiments, after the disparity cost volume is constructed, the attention mechanism is combined to improve the subsequent matching precision. As an example, attention feature extraction is performed on the disparity cost volume by using a trained convolutional network to obtain the attention feature vector, and then feature weighting is performed on the disparity cost volume through the attention feature vector to obtain an improved disparity cost volume.
Feature weighting is performed on the disparity cost volume through the attention feature vector, i.e., the disparity cost volume and the attention feature vector are subjected to an element-wise inner product operation, which is specifically expressed as: C′_cost = C_cost * W_attention. In the formula, C′_cost represents the weighted disparity cost volume, C_cost represents the disparity cost volume before weighting, and W_attention represents the attention feature vector. Since the channel number of the cost volume has been reduced to 1, the dimension of the attention feature vector is [B, D, H, W].
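A minimal sketch of this weighting step, assuming the cost volume has the shape [B, D, H, W] with D = 64 disparity levels; the structure of the attention network is an illustrative assumption.

```python
import torch
import torch.nn as nn

# Small convolutional network producing the attention feature vector W_attention;
# the disparity dimension D is treated as the channel dimension of the 2-D convolutions.
attention_net = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1),   # assumes D = 64
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.Sigmoid())

def apply_attention(cost_volume: torch.Tensor) -> torch.Tensor:
    w_attention = attention_net(cost_volume)   # [B, D, H, W] attention feature vector
    return cost_volume * w_attention           # element-wise weighting: C'_cost = C_cost * W_attention
```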
The trained convolutional network may be trained with a common network model training process. For example, a Smooth L1 loss function is used to calculate a loss when the constructed convolutional network performs attention feature learning on training samples, and the loss is used to iteratively update the parameters of the convolutional network until the attention feature vector output by the updated convolutional network meets a preset convergence condition, at which point the training is stopped. For example, the preset convergence condition may be that the error between a supervision signal for supervising the learning and the attention feature vector output by the convolutional network does not exceed an allowable error range; certainly, other conditions may also be used, which is not limited herein.
S140, performing disparity regression on the weighted cost volume based on a two-dimensional convolution to obtain a prediction disparity map.
After obtaining the disparity cost volume weighted by the attention feature vector, disparity regression is performed on the weighted cost volume to obtain a disparity map. It should be noted that, since the disparity cost volume has been reduced to 1 in the channel dimension, a two-dimensional convolutional network alone may be used to extract the disparity feature when performing the disparity regression. Compared with a conventional solution relying on a three-dimensional convolutional network, the calculation amount here is significantly reduced. In addition, since most existing model files of three-dimensional convolutional networks are saved in Python-based formats, the embedded device cannot support the format conversion of these complex model files, that is, these model files cannot be transferred directly. However, in the embodiments of the present disclosure, since the complexity of the convolutional network is reduced to two dimensions, the model file of the two-dimensional convolutional network can be converted from the Python format to the onnx format and then converted, through the development tool SDK of the embedded device, into the format supported by the embedded device, so that the model file can be used on the embedded device.
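A short sketch of the conversion path from the Python (PyTorch) model to the onnx format; the function name, file name, and input shapes are illustrative assumptions, and the further conversion from onnx to the device-specific format is performed with the embedded platform's own SDK.

```python
import torch

def export_to_onnx(model: torch.nn.Module, height: int = 480, width: int = 640) -> None:
    """Export the trained 2-D convolutional disparity network to ONNX."""
    model.eval()
    dummy_left = torch.randn(1, 3, height, width)
    dummy_right = torch.randn(1, 3, height, width)
    torch.onnx.export(
        model, (dummy_left, dummy_right), "binocular_depth.onnx",
        input_names=["left_image", "right_image"],
        output_names=["disparity"],
        opset_version=11)
```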
As an example, disparity feature extraction is performed on the weighted cost volume by using a trained two-dimensional convolutional network, a normalized exponential function (i.e., a softmax function) is used to perform regularization processing on the disparity feature to obtain the probability of each pixel at different disparity levels, and a weighted sum over the indexes corresponding to the disparity levels is then calculated according to the probabilities to obtain a disparity prediction value for each pixel, thereby generating a continuous prediction disparity map. It should be understood that the calculation of the disparity is a non-differentiable problem, so a differentiable soft-argmin method is used in the embodiments of the present disclosure to regress the disparity value of each pixel from the disparity cost volume.
Specifically, the calculation of the prediction disparity is described by the following expression:

\hat{d} = \sum_{d=0}^{D_{max}-1} d \times p_d, \qquad p_d = \frac{e^{C_d}}{\sum_{j=0}^{D_{max}-1} e^{C_j}}

in the formula, \hat{d} represents the disparity prediction value of a pixel, D_{max} represents the maximum disparity, C_d represents the value of the weighted cost volume for the pixel at the disparity level d, and p_d represents the probability that the current disparity level d is selected among all possible disparities.
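A minimal sketch of this differentiable soft-argmin regression, assuming the weighted cost volume has the shape [B, D, H, W] and that a larger value indicates a better match (otherwise the sign of the cost would be flipped before the softmax).

```python
import torch
import torch.nn.functional as F

def soft_argmin_disparity(weighted_cost: torch.Tensor) -> torch.Tensor:
    """weighted_cost: [B, D, H, W]; returns the [B, H, W] prediction disparity map."""
    num_disp = weighted_cost.shape[1]
    prob = F.softmax(weighted_cost, dim=1)                    # p_d for every pixel
    disp_levels = torch.arange(num_disp, dtype=prob.dtype,
                               device=prob.device).view(1, num_disp, 1, 1)
    return (prob * disp_levels).sum(dim=1)                    # weighted sum over disparity levels
```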
In some embodiments, in the training process of the two-dimensional convolutional network, disparity prediction learning is performed on the cost volume through the two-dimensional convolutional network to obtain a prediction disparity map, and a loss value between the prediction disparity map and a real disparity map is calculated by using a Smooth L1 loss function to update the parameters of the two-dimensional convolutional network until a final output disparity prediction result can meet corresponding precision requirements.
Specifically, the loss function L may be described as:

L = \frac{1}{N} \sum_{i=1}^{N} \mathrm{Smooth}_{L1}\big(\hat{d}_i - g_i\big), \qquad \mathrm{Smooth}_{L1}(x) = \begin{cases} 0.5x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}

in the formula, N represents the number of valid pixels, g_i represents the value of the i-th pixel in the real disparity map, and \hat{d}_i represents the value of the i-th pixel in the prediction disparity map.
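A short sketch of this loss under the assumption that invalid pixels of the real disparity map are marked with zeros (a common convention, not mandated by the disclosure).

```python
import torch
import torch.nn.functional as F

def disparity_loss(pred_disp: torch.Tensor, gt_disp: torch.Tensor) -> torch.Tensor:
    """Smooth L1 loss between the prediction and real disparity maps over valid pixels."""
    valid = gt_disp > 0
    return F.smooth_l1_loss(pred_disp[valid], gt_disp[valid])
```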
S150, performing disparity depth conversion on the prediction disparity map to obtain a depth map of the binocular images.
For example, when the disparity depth conversion is performed, the prediction disparity map is converted from the camera plane to the real world based on the focal length and the baseline distance of the binocular cameras, so as to obtain the depth map corresponding to the binocular images in the real world.
With reference to the imaging geometry of the binocular cameras, for a point P in the real world, the similar-triangle relationship gives:

Z = \frac{f \cdot b}{X_L - X_R} = \frac{f \cdot b}{d}

where f represents the focal length of the cameras, b represents the baseline distance between the left and right cameras, X_L and X_R respectively represent the distances from the projections P_L and P_R of the point P on the planes of the left and right cameras to the left edge of the respective camera plane, and d = X_L − X_R is the disparity, so that the depth information Z in the real world is obtained through conversion.
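A minimal sketch of this conversion, assuming the focal length is expressed in pixels and the baseline in meters; the small epsilon only guards against division by a zero disparity.

```python
import numpy as np

def disparity_to_depth(disparity: np.ndarray, focal_px: float, baseline_m: float) -> np.ndarray:
    """Convert a disparity map into a depth map (meters) via Z = f * b / d."""
    return focal_px * baseline_m / np.maximum(disparity, 1e-6)
```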
According to the method for binocular depth estimation provided in the embodiments of the present disclosure, the obtained binocular images are preprocessed and subjected to feature extraction, dimension reduction is adopted when the disparity cost volume is constructed so that the channel dimension of the cost volume becomes 1, and a two-dimensional convolution model is then used to realize the disparity prediction in the subsequent disparity regression process. As a result, the calculation amount of the model is small, the occupied memory is small, and the above operations can be supported by the embedded device. Therefore, the method realizes real-time operation on an embedded platform with low computing power, and the problem that existing schemes cannot be carried out on embedded devices is effectively solved.
It should be understood that the apparatus for binocular depth estimation in the embodiments of the present disclosure corresponds to the method for binocular depth estimation in the above embodiments, and the options related to the method in the above embodiments are also applicable to the apparatus embodiments, so that the description is not repeated herein.
The present disclosure further provides an embedded device, for example, the embedded device may be, but is not limited to, a terminal device built based on an embedded platform, such as an intelligent robot equipped with binocular cameras, a communication terminal, a monitoring device, etc., which is not limited herein.
As an example, the embedded device includes binocular cameras, a processor and a memory. The binocular cameras are configured for obtaining binocular images, the memory stores a computer program, and the processor executes the computer program to enable the embedded device to perform the method for binocular depth estimation or functions of the modules in the apparatus for binocular depth estimation.
The processor may be an integrated circuit chip having a signal processing capability. The processor may be a general-purpose processor, including at least one of a central processing unit (CPU), a graphics processing unit (GPU), a network processor (NP), a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), other programmable logic devices, discrete gates or transistor logic devices, and discrete hardware components. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like that is able to implement or perform the methods, steps, and logical block diagrams disclosed in the embodiments of the present disclosure.
The memory may be, but is not limited to, a random access memory (RAM), a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), etc. The memory is configured to store the computer program, and the processor correspondingly executes the computer program after receiving an execution instruction.
The present disclosure further provides a non-transitory readable storage medium, configured to store the computer program used in the embedded device.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed apparatus/device and method may also be implemented in other manners. The apparatus/device embodiments described above are merely illustrative. For example, the flowcharts and structural diagrams in the drawings illustrate the architecture, functionality, and operation of possible implementations of apparatuses, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code that includes one or more executable instructions for implementing specified logical functions. It should also be noted that, in an alternative implementation, the functions noted in the blocks may occur out of the order noted in the accompanying drawings. For example, two blocks shown in succession may, in fact, be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending upon the functions involved. It should also be noted that each block in the structural diagrams and/or flowchart, and combinations of blocks in the structural diagrams and/or flowchart, may be implemented with dedicated hardware-based systems that perform the specified functions or acts, or may be implemented with combinations of special purpose hardware and computer instructions.
In addition, the functional modules or units in the embodiments of the present disclosure may be integrated together to form an independent portion, or each of the modules may exist alone, or two or more modules may be integrated to form an independent portion.
When the functions are implemented in the form of a software functional module and sold or used as an independent product, the functions may be stored in a non-transitory computer-readable storage medium. Based on such understanding, the technical solutions of the present disclosure may be embodied in the form of a software product in essence, or the part that contributes to the prior art or a portion of the technical solution may be embodied in the form of a software product. The computer software product is stored in a non-transitory storage medium and includes several instructions for enabling a computer device (which may be a smart phone, a personal computer, a server, or a network device, etc.) to perform all or some of the processes in the methods described in the embodiments of the present disclosure. The above storage medium includes various media that can store program codes, such as a USB flash disk, a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above embodiments are merely intended for describing technical solutions of the present disclosure. However, the protection scope of the present disclosure is not limited thereto, and any person skilled in the art could easily conceive changes or substitutions within the technical scope disclosed in the present disclosure, all of which should be covered within the protection scope of the present disclosure.
Number | Date | Country | Kind |
---|---|---|---|
202310366088.8 | Mar 2023 | CN | national |