The present disclosure relates to an RGB-T (Thermal) multispectral image pedestrian detection method, and in particular, to a pedestrian target position acquisition method based on multispectral images, which is applied to the fields of unmanned driving, road condition perception, intelligent monitoring and the like.
Whether an input image or video contains pedestrians can be determined through pedestrian detection. In the field of intelligent transportation, pedestrian detection can determine the presence of pedestrians on the road, thereby providing a reference for decision-making in unmanned driving. In the field of intelligent security, pedestrian detection can further reflect the personnel situation at a security scene and remind security personnel of possible risks.
At present, pedestrian detection methods are mainly based on RGB images. Although these methods can show excellent detection performance in scenes with good lighting conditions, their performance in scenes with poor lighting conditions is unsatisfactory, which is caused by the low signal-to-noise ratio of RGB images in low-light conditions. Thermal infrared images are sensitive to human thermal radiation and are not affected by lighting conditions, and thus can provide clear human shape information around the clock. However, thermal infrared images can only provide shape information but not color and texture information, which makes pedestrian detection methods based on thermal infrared images susceptible to interference from objects whose structures are similar to pedestrians. Therefore, the multispectral pedestrian detection method came into being, which combines the advantages of RGB and thermal infrared images to achieve all-weather pedestrian detection.
The multispectral pedestrian detection method has attracted wide attention from researchers because of its robust pedestrian detection performance. “Multispectral Deep Neural Networks for Pedestrian Detection” studies the influence of multispectral features on detection results at different stages of neural networks, and it designs three fusion methods. The first is to directly concatenate RGB and thermal infrared images into four-channel images and send them to the neural network to output the detection results, and this method is called early fusion. The second is to send RGB and thermal infrared images to the neural network respectively, fuse the middle-layer features of the neural network, and complete the detection based on the fused features, and this method is called mid fusion. The third is to send RGB and thermal infrared images into two separate neural networks and fuse their detection results, and this method is called late fusion. This study found that the detection effect of mid fusion is better than that of early fusion and late fusion. Based on this finding, the mid fusion strategy is mainly adopted in subsequent research.
Although fusing multispectral features can provide complementary information for a single spectrum, simply concatenating multispectral features to achieve feature fusion cannot significantly improve the detection performance. In order to solve this problem, recent research work has proposed different feature fusion strategies. These fusion strategies can be divided into two types according to whether segmentation branches are required during feature fusion: one is fusion with segmentation branches, and the other is fusion without segmentation branches.
For the feature fusion strategy that does not require segmentation branches during fusion, “Weakly Aligned Cross-modal Learning for Multispectral Pedestrian Detection” focuses on solving the problem of weak alignment (misregistration) in multispectral data. It uses two separate neural networks to extract multispectral features and predict the displacement relationship between multispectral features, so as to realize pedestrian detection on weakly registered multispectral data.
“Improving Multispectral Pedestrian Detection by Addressing Modality Imbalance Problems” recognizes the differences between multispectral modalities and addresses the problem of information imbalance in cross-modal data.
“Spatio-contextual Deep Network-based Multimodal Pedestrian Detection for Autonomous Driving” focuses on solving the problem of spatial and contextual information aggregation in the process of multispectral feature fusion. It uses a graph attention network to fuse multispectral features, uses a conditional random field to process the spatial information of the fused features, and then uses a channel attention mechanism and a recurrent neural network to process the contextual information of the fused features.
“BAANet: Learning Bi-directional Adaptive Attention Gates for Multispectral Pedestrian Detection” found that it is difficult for RGB images to capture clear pedestrian information in low-light scenes, while in high-temperature but well-lit scenes, because the thermal radiation of pedestrians is similar to that of other objects in the environment, it is difficult for thermal infrared images to distinguish pedestrians from other objects. It proposes a two-way attention gating mechanism and a scene illumination classification network, which can adaptively use favorable spectral information under different illumination conditions.
“Learning a Dynamic Cross-modal Network for Multispectral Pedestrian Detection” dynamically combines local and non-local information when fusing multispectral features to achieve better detection performance.
“Multimodal Object Detection via Probabilistic Ensembling” uses ensemble learning to combine the detection results of multiple detectors.
For the feature fusion strategy that uses segmentation branches during fusion, “Multispectral Pedestrian Detection via Simultaneous Detection and Segmentation” introduces a new segmentation branch to the original detector, and uses a multi-task loss function to supervise the segmentation and detection networks. This method has achieved good detection performance.
“Guided Attentive Feature Fusion for Multispectral Pedestrian Detection” uses a segmentation branch for segmentation mask prediction, and guides cross-spectral and single-spectral attention based on this mask.
“Locality Guided Cross-modal Feature Aggregation and Pixel-level Fusion for Multispectral Pedestrian Detection” uses a segmentation branch to predict a segmentation mask and screens local features from complementary spectral features based on this mask, thereby enhancing feature expression at specific positions.
It has been found in recent studies that the method of using segmentation branches in the fusion process is generally superior to the method of not using segmentation branches. Therefore, the present disclosure adopts the strategy of using segmentation branches in the fusion. Different from all the above methods, the present disclosure aims to fuse multispectral features and optimize the fused features, so as to enhance the feature expression of pedestrian areas, inhibit the feature expression of background areas, and achieve a more accurate multispectral pedestrian detection effect.
In view of the problems existing in the prior art, the present disclosure provides an RGB-T multispectral pedestrian detection method based on a target aware fusion strategy, and the overall process is as follows.
An RGB-T multispectral pedestrian detection method based on a target aware fusion strategy includes the following steps: giving a pair of registered visible light RGB and thermal infrared T images, firstly, extracting multispectral features, and then fusing the extracted multispectral features in a feature space based on the target aware fusion strategy; and finally, sending the fused features into a detection head, outputting a position box of a pedestrian and a confidence score, and completing a detection process.
In the above technical solution, further, the step of fusing the extracted multispectral features in a feature space based on the target aware fusion strategy comprises two steps: firstly, aggregating the extracted multispectral features through a multispectral feature aggregation module to obtain an initially fused feature; and secondly, optimizing the initially fused feature through a multispectral feature optimization module to obtain an optimized feature.
A pedestrian detection model used to implement the above method, wherein in the training process of the detection model, in order to ensure the accuracy of the model, a correlation maximum loss function is designed in the process of multispectral feature optimization. The correlation maximum loss function consists of two parts: 1) the segmentation loss function supervises the consistency between the predicted two-dimensional confidence map and the real confidence map; and 2) the maximized information entropy loss function supervises the maximization of the correlation degree of each position in the correlation vector.
The training process of the detection model includes the following steps:
Firstly, processing the concatenated multispectral features by using a feature channel splitting mechanism to output an initially fused feature Fx:
where [Frgb, Fthermal] represents RGB image features and thermal infrared image features concatenated in the channel direction; (·; θi) represents an ith convolutional layer in a multispectral feature aggregation module, and θi represents a learnable parameter of the ith convolutional layer; and (·) represents a residual convolution module.
Then, optimizing the initially fused feature Fx in two paths, and outputting an optimized feature Fy:
where ⊕ represents a pixel-by-pixel multiplication operation; s represents a correlation vector; (·) represents a multilayer perceptron consisting of two fully connected layers; m·fxi represents a correlation operation between a predicted two-dimensional confidence map m and an ith channel feature map of the initially fused feature Fx; σ(·) represents a sigmoid operation; and θseg represents a learnable parameter of a segmentation branch.
Calculating a correlation maximum loss function according to the predicted two-dimensional confidence map m, a ground-truth two-dimensional confidence map m̃ and the correlation vector s.
where a ground-truth value of the ground-truth two-dimensional confidence map m̃ is obtained as follows: all regions within the labeling boxes corresponding to given pictures are set to 1, and other regions are set to 0; seg(·) is a segmentation loss function; neg_entropy(·) is a maximized information entropy loss function; and α represents a balance coefficient, which is used to control the weight relationship between the two loss functions.
bce(·) represents a binary cross entropy loss function; dice(·) represents a Dice loss function; ϵ is a small constant to prevent division by 0; mi and m̃i represent values of the predicted two-dimensional confidence map m and the ground-truth two-dimensional confidence map m̃ at an ith position; and si represents a value of the correlation vector s at the ith position.
The present disclosure has the following advantages:
The complementary advantages of RGB and thermal infrared image information are fully used to make up for the deficiency that the current detector is easily affected by illumination conditions, and robust all-weather pedestrian detection is realized. The multispectral feature fusion process is supervised, instead of just using the classification and regression loss of the final detector to supervise the prediction box. The output features of the target aware fusion strategy emphasize the feature expression at the target position and weaken the feature expression on irrelevant background. This feature map can distinguish the target object from irrelevant background noise more obviously, and then help the detector achieve better detection performance. The target aware fusion strategy proposed by the present disclosure is a convenient and universal multispectral feature fusion strategy, which can be used in Faster R-CNN and YOLO to improve their detection performance in multispectral pedestrian detection tasks.
The technical solution of the present disclosure will be further explained below with specific examples and the accompanying drawings.
The present disclosure provides an RGB-T multispectral pedestrian detection method based on a target aware fusion strategy, which includes the following steps:
Giving a pair of registered visible light RGB and thermal infrared T images, extracting multispectral features, fusing the extracted multispectral features in a feature space based on the target aware fusion strategy, sending the fused features into a detection head, outputting a position box of a pedestrian and a confidence score, and completing a detection process.
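For illustration only, a minimal PyTorch-style sketch of this overall pipeline is given below; the backbone, fusion module and detection head are passed in as placeholders, and all class and variable names are assumptions introduced for this sketch rather than the actual implementation of the disclosure.

```python
# Hypothetical high-level sketch of the described pipeline; module names are
# illustrative assumptions, not the actual implementation of the disclosure.
import torch.nn as nn

class MultispectralPedestrianDetector(nn.Module):
    def __init__(self, backbone_rgb, backbone_thermal, fusion_module, detection_head):
        super().__init__()
        self.backbone_rgb = backbone_rgb          # extracts RGB features
        self.backbone_thermal = backbone_thermal  # extracts thermal infrared features
        self.fusion = fusion_module               # target aware fusion strategy
        self.head = detection_head                # outputs position boxes and scores

    def forward(self, img_rgb, img_thermal):
        f_rgb = self.backbone_rgb(img_rgb)             # multispectral feature extraction
        f_thermal = self.backbone_thermal(img_thermal)
        f_fused = self.fusion(f_rgb, f_thermal)        # fusion in feature space
        boxes, scores = self.head(f_fused)             # pedestrian boxes + confidence scores
        return boxes, scores
```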
The present disclosure further provides a pedestrian detection model for executing the above method, and the training process of the detection model and the specific process of pedestrian detection based on the model will be described below.
As shown in
The training process of a multispectral pedestrian detection model based on a target aware fusion strategy is shown in
The target aware fusion module includes a multispectral feature aggregation module and a multispectral feature optimization module. The multispectral feature aggregation module uses the feature channel splitting mechanism to process the concatenated multispectral features and output an initially fused feature Fx. Specifically, the concatenated multispectral features extracted by the neural network are processed in two paths: in one path, a convolutional layer is used to compress the channel dimensions of the multispectral features; in the other path, a convolution and residual module is used to compress the channel dimensions of the multispectral features; then the features processed in the two paths are concatenated in the channel dimension, and the initially fused feature is output through a convolutional layer:
where [Frgb, Fthermal] represents RGB image features and thermal infrared image features concatenated in the channel direction; (·; θi) represents an ith convolutional layer in the multispectral feature aggregation module, θi represents a learnable parameter of the ith convolutional layer; and (·) represents a residual convolution module. This process is shown in
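A minimal sketch of this feature channel splitting mechanism, assuming PyTorch, is given below; the kernel sizes, layer widths and class names are assumptions made for the sketch and are not taken from the disclosure.

```python
# Sketch of the feature channel splitting mechanism described above; channel
# widths and kernel sizes are illustrative assumptions.
import torch
import torch.nn as nn

class ResidualConvBlock(nn.Module):
    """Simple residual convolution block (placeholder for the residual module)."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, x):
        return torch.relu(x + self.conv(x))

class MultispectralFeatureAggregation(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        # Path 1: a convolutional layer compresses the channel dimension.
        self.conv1 = nn.Conv2d(2 * in_channels, out_channels // 2, 1)
        # Path 2: a convolution followed by a residual module compresses the channels.
        self.conv2 = nn.Conv2d(2 * in_channels, out_channels // 2, 1)
        self.res = ResidualConvBlock(out_channels // 2)
        # A final convolution over the re-concatenated paths outputs F_x.
        self.conv3 = nn.Conv2d(out_channels, out_channels, 3, padding=1)

    def forward(self, f_rgb, f_thermal):
        f_cat = torch.cat([f_rgb, f_thermal], dim=1)         # [F_rgb, F_thermal]
        path1 = self.conv1(f_cat)
        path2 = self.res(self.conv2(f_cat))
        f_x = self.conv3(torch.cat([path1, path2], dim=1))   # initially fused feature F_x
        return f_x
```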
The multispectral feature optimization module optimizes the initially fused feature Fx in two paths, and outputs the optimized feature Fy, which is specifically as follows: in one path, three convolutional layers are used to process the initially fused feature into a single-channel two-dimensional confidence map, a correlation vector between the two-dimensional confidence map and each channel feature map of the initially fused feature is then calculated, and two fully connected layers are then used to process the correlation vector; in the other path, the initially fused feature is multiplied with the above correlation vector processed by the fully connected layers to obtain an optimized feature to be output:
where ⊕ represents a pixel-by-pixel multiplication operation; s represents a correlation vector; (·) represents a multilayer perceptron consisting of two fully connected layers; m·fxi represents a correlation operation between a predicted two-dimensional confidence map m and an ith channel feature map of the initially fused feature Fx; σ(·) represents a sigmoid operation; and θseg represents a learnable parameter of a segmentation branch.
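The two optimization paths can be sketched as follows, again as an assumption-laden illustration: the correlation operation m·fxi is assumed here to be a spatially averaged inner product between the confidence map and each channel map, and the layer widths are placeholders rather than values taken from the disclosure.

```python
# Sketch of the multispectral feature optimization module described above; the
# exact correlation operation and layer widths are assumptions for illustration.
import torch
import torch.nn as nn

class MultispectralFeatureOptimization(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Path 1a: three convolutional layers predict a single-channel confidence map m.
        self.seg_branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels // 2, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 2, 1, 1))
        # Path 1b: a two-layer MLP processes the correlation vector s.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels), nn.ReLU(inplace=True),
            nn.Linear(channels, channels))

    def forward(self, f_x):
        b, c, h, w = f_x.shape
        m = torch.sigmoid(self.seg_branch(f_x))               # predicted confidence map m, (B, 1, H, W)
        # Correlation between m and each channel map of F_x (assumed spatial inner product).
        s = (m * f_x).flatten(2).mean(dim=2)                   # correlation vector s, (B, C)
        weights = torch.sigmoid(self.mlp(s)).view(b, c, 1, 1)  # sigma(MLP(s))
        f_y = f_x * weights                                    # channel-wise re-weighting of F_x
        return f_y, m, s
```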
A correlation maximum loss function is calculated according to the predicted two-dimensional confidence map m, a ground-truth two-dimensional confidence map m̃ and the correlation vector s:
where a ground-truth value of the ground-truth two-dimensional confidence map m̃ is obtained as follows: all regions within the labeling boxes corresponding to given pictures are set to 1, and other regions are set to 0; seg(·) is a segmentation loss function; neg_entropy(·) is a maximized information entropy loss function; and α represents a balance coefficient, which is used to control the weight relationship between the two loss functions.
bce(·) represents a binary cross entropy loss function; dice(·) represents a Dice loss function; ϵ is a small constant to prevent division by 0; mi and m̃i represent values of the predicted two-dimensional confidence map m and the ground-truth two-dimensional confidence map m̃ at an ith position; and si represents a value of the correlation vector s at the ith position. The optimization process is shown in
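A hedged sketch of the correlation maximum loss is given below. The binary cross entropy and Dice terms follow their standard definitions; the exact form of the maximized information entropy term is not reproduced here, so the sketch normalizes the correlation vector into a distribution (using ϵ to avoid division by zero) and penalizes its negative entropy, which is one plausible reading of the description rather than the patent's exact formula.

```python
# Illustrative sketch of the correlation maximum loss; the entropy term and the
# default alpha value are assumptions, not the patent's exact formulation.
import torch
import torch.nn.functional as F

def dice_loss(m, m_gt, eps=1e-6):
    """Standard Dice loss between predicted and ground-truth confidence maps."""
    inter = (m * m_gt).sum()
    return 1.0 - (2.0 * inter + eps) / (m.sum() + m_gt.sum() + eps)

def correlation_maximum_loss(m, m_gt, s, alpha=0.1, eps=1e-6):
    # Segmentation loss: consistency between the predicted map m and the ground-truth map.
    seg = F.binary_cross_entropy(m, m_gt) + dice_loss(m, m_gt, eps)
    # Maximized information entropy loss over the correlation vector s (assumed form):
    # map s to positive values, normalize to a distribution, and penalize negative
    # entropy so that minimizing the loss maximizes the entropy of the correlations.
    s_pos = torch.sigmoid(s)
    p = s_pos / (s_pos.sum(dim=-1, keepdim=True) + eps)
    neg_entropy = (p * torch.log(p + eps)).sum(dim=-1).mean()
    # alpha balances the two loss terms (value chosen arbitrarily for this sketch).
    return seg + alpha * neg_entropy
```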
As shown in
The multispectral feature aggregation module processes the concatenated multispectral features by using a feature channel splitting mechanism to output an initially fused feature Fx:
where [Frgb, Fthermal] represents RGB image features and thermal infrared image features concatenated in the channel direction; (·; θi) represents an ith convolutional layer in the multispectral feature aggregation module, and θi represents a learnable parameter of the ith convolutional layer; and (·) represents a residual convolution module; this process is shown in
The multispectral feature optimization module optimizes the initially fused feature Fx in two paths, and outputs the optimized feature Fy:
⊕ represents a pixel-by-pixel multiplication operation; s represents a correlation vector; (·) represents a multilayer perceptron consisting of two fully connected layers; m·fxi represents a correlation operation between a predicted two-dimensional confidence map m and an ith channel feature map of the initially fused feature Fx; and σ(·) represents a sigmoid operation. This process is shown in
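Reusing the modules sketched earlier, an illustrative forward pass at detection time might look as follows; the tensor sizes are arbitrary assumptions and the detection head is omitted.

```python
# Illustrative usage of the sketched modules; feature sizes are arbitrary assumptions.
import torch

agg = MultispectralFeatureAggregation(in_channels=256, out_channels=256)
opt = MultispectralFeatureOptimization(channels=256)

f_rgb = torch.randn(1, 256, 64, 80)      # backbone features of the RGB image
f_thermal = torch.randn(1, 256, 64, 80)  # backbone features of the thermal infrared image

f_x = agg(f_rgb, f_thermal)   # initially fused feature F_x
f_y, m, s = opt(f_x)          # optimized feature F_y, confidence map m, correlation vector s
# f_y is then sent to a detection head (e.g., a Faster R-CNN or YOLO head)
# to output pedestrian position boxes and confidence scores.
```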
The present application is a continuation of International Application No. PCT/CN2023/085308, filed on Mar. 31, 2023, which claims priority to Chinese Application No. 202310319227.1, filed on Mar. 29, 2023, the contents of both of which are incorporated herein by reference in their entireties.
Related application data: parent application PCT/CN2023/085308 (WO), filed March 2023; child application U.S. application Ser. No. 18/639,914.