The present invention belongs to the technical field of semantic segmentation of remote sensing images, and in particular relates to a cross-domain remote sensing image semantic segmentation method based on iterative intra-domain adaptation and self-training.
With the continuous development of remote sensing technology, remote sensing devices such as satellites and drones can collect a large number of remote sensing images. For example, drones can capture a large number of high-spatial-resolution remote sensing images over cities and rural areas. Such massive remote sensing data provides many application opportunities, such as urban monitoring, urban management, agriculture, automatic mapping, and navigation. Among these applications, the key technology is semantic segmentation or image classification of remote sensing images.
In recent years, convolutional neural networks (CNNs) have become the most commonly used technique in semantic segmentation and image classification, and some CNN-based models have demonstrated their effectiveness in this task, such as FCN, SegNet, the U-Net series, PSPNet, and the DeepLab series. When training images and test images come from the same satellite or city, these models can all achieve good semantic segmentation results. However, when these models are used to classify remote sensing images obtained from different satellites or cities, the test results become very poor because the data distributions of images from different satellites and cities differ (domain shift). In some relevant literature, this problem is referred to as domain adaptation. In the field of remote sensing, domain shift is usually caused by different atmospheric conditions during imaging, acquisition differences (which change the spectral characteristics of objects), differences in the spectral characteristics of sensors, and/or different types of spectral bands (for example, some images may be in the red, green, and blue bands, while others may be in the near-infrared, red, and green bands).
In a typical domain adaptation problem, training images and test images are usually designated as the source domain and the target domain, respectively. A common solution to domain adaptation is to create a new semantically labeled dataset on the target domain and train a model on it. However, collecting a large number of pixel-labeled images of a target city is time-consuming and expensive, which makes this solution impractical. To reduce the workload of manual pixel labeling, some solutions already exist, such as synthesizing data from weakly supervised labels. However, these methods still have limitations, as they also require a significant amount of manual labor.
In order to improve the generalization ability of CNN-based semantic segmentation models, another commonly used method is to randomly change colors for data augmentation, such as gamma correction and image brightness conversion, which have been widely used in remote sensing. However, when there are significant differences in data distribution, the above data augmentation methods cannot achieve good results in cross-domain semantic segmentation. It is impossible to apply a model of a domain containing red, green, and blue bands to another domain containing near-infrared, red, and green bands using these simple augmentation methods. To overcome this limitation, generative adversarial networks (GANs) [I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets[C]. Proceedings of the international conference on Neural Information Processing Systems (NIPS). 2014:2672-2680] are used to generate pseudo target domain images whose data distribution is similar to that of the target domain images, and these generated pseudo target domain images can be used for training classifiers in the target domain. At the same time, some methods based on adversarial learning [Y.-H. Tsai, W.-C. Hung, S. Schulter, K. Sohn, M.-H. Yang, and M. Chandraker. Learning to adapt structured output space for semantic segmentation[C]. Proceedings of the international conference on computer vision and pattern recognition (CVPR). 2018:7472-7481] and self-training [Y. Zou, Z. Yu, B. Kumar, and J. Wang. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training[C]. Proceedings of the European conference on computer vision (ECCV). 2018:289-305] have also been proposed by researchers to solve domain adaptation problems. Although these methods have achieved good results on natural images, there are still certain problems in directly applying them to remote sensing images.
The most important point is that these methods ignore the differences in the target domain images themselves, such as significant differences in building styles, shapes and the like within the same city.
Because the target domain images differ among themselves, the segmentation quality of an inter-domain semantic segmentation model migrated from the source domain to the target domain also varies across the target domain images: relatively accurate segmentation results can be obtained on some target domain images, while the results on others are very poor. Therefore, how to perform further intra-domain adaptation on the target domain images and reduce the differences within the target domain, so that the cross-domain semantic segmentation model achieves good segmentation results on all target domain images, is an important issue in cross-domain remote sensing image semantic segmentation. Secondly, because the target domain images have no corresponding labels, a commonly used approach is self-training: the semantic segmentation results generated by a trained cross-domain semantic segmentation model are used as pseudo labels for the target domain images, and the model is then trained further with these pseudo labels to obtain a final target domain semantic segmentation model. The effectiveness of such pseudo-label-based self-training depends on the quality of the pseudo labels. When the quality of the pseudo labels is poor, training is much less effective and the semantic segmentation ability of the model is greatly weakened. Therefore, how to select images on which the model segments well as the source of pseudo labels, and how to improve the quality of the pseudo labels, are also important issues in self-training techniques.
In view of the above, the present invention provides a cross-domain remote sensing image semantic segmentation method based on iterative intra-domain adaptation and self-training, which can migrate a semantic segmentation model trained on remote sensing images in one domain to remote sensing images of other domains, and perform further intra-domain adaptation within the target domain remote sensing images, reducing the intra-domain shift of the target domain while reducing the inter-domain shift between the source domain and the target domain, thereby further improving the performance and robustness of cross-domain remote sensing image semantic segmentation models.
A cross-domain remote sensing image semantic segmentation method based on iterative intra-domain adaptation and self-training, including the following steps:
(1) using a source domain image xs, a source domain label ys, a source domain semantic segmentation model FS, and a target domain image xt to train a source-target inter-domain semantic segmentation model Finter;
Further, a specific implementation process of step (1) includes:
source-target domain image bidirectional translation network, comprising a source→target direction image translation network and a target→source direction image translation network;
Further, a calculation expression for the segmentation probability credibility St in step (2) is as follows:
wherein H and W are a length and a width of the target domain image xt, respectively, C is a number of segmentation categories in the target domain image xt, Pt(h,w,c
Further, a calculation expression for the target domain pseudo label in step (2) is as follows:
wherein (h,w) represents a category of a pixel point with coordinates (h, w) in the target domain pseudo label , Pt (h,w,c) represents a segmentation probability of a corresponding category c of the pixel point with coordinates (h, w) in the target domain image xt, μc is a segmentation probability threshold corresponding to the category c,
Pt(h,w,c
Further, a calculation expression for the segmentation probability perplexity It(h,w) is as follows:
wherein δ( ) is a function used for measuring a perplexity between segmentation probabilities of categories of pixel points.
Further, a specific implementation process of step (4) includes:
The method of the present invention is a complete cross-domain remote sensing image semantic segmentation framework, including training of source-target inter-domain domain adaptation models, generation of target domain category segmentation probabilities and pseudo labels, sorting of target domain image segmentation probability credibility scores, training of target intra-domain iterative domain adaptation models, and generation of target domain segmentation results.
The present invention proposes an iterative domain adaptation training network within a target domain. When training the iterative domain adaptation training network, the present invention uses commonly used self-training learning techniques to guide the training of target domain segmentation models by means of using the part of images with good segmentation effects and segmentation results thereof as pseudo labels, so that target domain models can also achieve good segmentation results on the part of images with poor segmentation effects.
In addition, in order to address the characteristics of complex and diverse distribution within the target domain, the present invention also proposes to divide a target domain into a plurality of sub-domains and perform iterative intra-domain adaptation training on the plurality of sub-domains; in order to divide the target domain into the plurality of sub-domains, the present invention proposes a segmentation probability credibility calculation method, which sorts and classifies target domain images according to the scores of segmentation results of target domain models, and selects the part of the target domain images with good segmentation effects and pseudo labels thereof to further optimize the target domain model.
In the process of obtaining pseudo labels, the present invention proposes a method that combines a segmentation probability threshold and a segmentation probability perplexity threshold to remove pixel points with poor segmentation results from the pseudo labels, thereby avoiding low-quality pseudo labels interfering with target domain model training.
Based on the iterative domain adaptation training framework, the present invention achieves domain adaptation training within target domains. After obtaining a migration model from a source domain to a target domain and target domain segmentation results, the iterative domain adaptation training framework adopted by the present invention performs further intra-domain adaptation training on a target domain model, to obtain a final target domain model and semantic segmentation results, thereby improving the accuracy of cross-domain remote sensing image semantic segmentation.
In order to provide a more specific description of the present invention, the following will provide a detailed explanation of the technical solution of the present invention in conjunction with accompanying drawings and specific implementations.
As shown in
(1) Using a source domain image xs, a source domain label ys, a source domain semantic segmentation model FS, and a target domain image xt to train a source-target inter-domain semantic segmentation model Finter.
In this implementation, when there is no source domain semantic segmentation model FS, it can be obtained by training using the source domain image xs and the source domain label ys. Commonly used networks such as DeepLab and U-Net can be used as the model network structure. The loss function is a cross-entropy loss with K categories, and the corresponding formula is as follows:
wherein xs is a source domain image, ys is a source domain image label, K is a number of label categories, FS is a semantic segmentation model on a source domain, and 𝟙[k=ys] is an indicator function that takes the value 1 when k = ys and 0 otherwise.
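As a minimal sketch of a K-category cross-entropy loss of the kind described above, the snippet below averages the negative log-probability of the labeled category over all pixels. The function name `cross_entropy_loss`, the (H, W, K) array layout, and the clipping constant `eps` are illustrative assumptions rather than details taken from this description.

```python
import numpy as np

def cross_entropy_loss(probs, labels, eps=1e-12):
    """Mean pixel-wise cross-entropy with K categories.

    probs:  (H, W, K) array of per-pixel category probabilities (assumed layout).
    labels: (H, W) integer array with category indices in [0, K).
    """
    k = probs.shape[-1]
    one_hot = np.eye(k)[labels]                  # indicator [k = y] for every pixel
    log_p = np.log(np.clip(probs, eps, 1.0))     # clip to avoid log(0)
    return float(-(one_hot * log_p).sum(axis=-1).mean())
```

With uniform probabilities over two categories the loss equals ln 2, and it approaches 0 as the probability of the labeled category approaches 1.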
In this implementation, Potsdam city images with building labels are taken as the source domain and cropped to a size of 512×512 pixels, with the 3 RGB channels retained. The number of images and the number of corresponding building labels are each 4000. DeepLab V3+ can be used as the model network structure, the learning rate is 10⁻⁴, the optimization algorithm is Adam, and a semantic segmentation model FS on the Potsdam domain is obtained by training for 900 epochs.
Commonly used inter-domain domain adaptation training from a source domain to a target domain is based on image conversion and adversarial learning. This embodiment illustrates a GAN-based image conversion method, but is not limited to an image conversion-based method. The image conversion-based method first requires training a bidirectional image conversion model between a source domain and a target domain. The bidirectional image conversion model includes an image translation network GS→T from a source domain image xs to a target domain image xt, an image translation network GT→S from the target domain image xt to the source domain image xs, as well as a source domain discriminator DS and a target domain discriminator DT. Training loss functions include a cycle consistency loss function, a semantic consistency loss function, a self-loss function, and an adversarial loss function.
An equation expression for the cycle consistency loss function is as follows:
wherein xs is a source domain image, xt is a target domain image, GS→T is an image translation network from the source domain image xs to the target domain image xt, GT→S is an image translation network from the target domain image xt to the source domain image xs, 𝔼 is a mathematical expectation function, and ∥·∥1 is an L1 norm.
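The following is a minimal sketch of a cycle consistency loss built from the quantities defined above, assuming the usual CycleGAN-style form in which each image is translated to the other domain and back and then compared with the original under the L1 norm. The callables `g_s2t` and `g_t2s` are hypothetical stand-ins for GS→T and GT→S, and summing absolute differences per image (rather than averaging them) is also an assumption.

```python
import numpy as np

def l1_distance(a, b):
    """L1 norm of (a - b) per image, averaged over the batch (the expectation)."""
    diff = np.abs(a - b).reshape(len(a), -1)
    return float(diff.sum(axis=1).mean())

def cycle_consistency_loss(x_s, x_t, g_s2t, g_t2s):
    """x_s, x_t: batches of source / target images, shape (N, ...).
    g_s2t, g_t2s: hypothetical callables standing in for the two translators."""
    source_cycle = g_t2s(g_s2t(x_s))   # source -> target -> source
    target_cycle = g_s2t(g_t2s(x_t))   # target -> source -> target
    return l1_distance(source_cycle, x_s) + l1_distance(target_cycle, x_t)
```

The loss is zero exactly when the two translators invert each other on the given batches, which is the property this term is meant to encourage.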
An equation expression for the semantic consistency loss function is as follows:
wherein xs is a source domain image, xt is a target domain image, GS→T is an image translation network from the source domain image xs to the target domain image xt, GT→S is an image translation network from the target domain image xt to the source domain image xs, 𝔼 is a mathematical expectation function, FT is a semantic segmentation model on a target domain, FS is a semantic segmentation model on a source domain, and KL(·∥·) is a KL divergence between two distributions.
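A semantic consistency term of this kind can be sketched as below, assuming that the segmentation of an image before translation is compared, via the KL divergence, with the segmentation of its translated version by the other domain's model. The argument order of the KL divergence, the callables `f_s`, `f_t`, `g_s2t`, `g_t2s`, and the (..., K) probability layout are all assumptions made for illustration.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Mean per-pixel KL(p || q) for probability arrays of shape (..., K)."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean())

def semantic_consistency_loss(x_s, x_t, g_s2t, g_t2s, f_s, f_t):
    """f_s, f_t: hypothetical callables returning per-pixel category probabilities."""
    source_term = kl_divergence(f_s(x_s), f_t(g_s2t(x_s)))  # source image vs. its translation
    target_term = kl_divergence(f_t(x_t), f_s(g_t2s(x_t)))  # target image vs. its translation
    return source_term + target_term
```

The term vanishes when translation leaves the predicted category distribution of every pixel unchanged.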
An equation expression for the adversarial loss function is as follows:
wherein xs is a source domain image, xt is a target domain image, GS→T is an image translation network from the source domain image xs to the target domain image xt, GT→S is an image translation network from the target domain image xt to the source domain image xs, 𝔼 is a mathematical expectation function, DS is a source domain discriminator, and DT is a target domain discriminator.
An equation expression for the self-loss function is as follows:
wherein xs is a source domain image, xt is a target domain image, GS→T is an image translation network from the source domain image xs to the target domain image xt, GT→S is an image translation network from the target domain image xt to the source domain image xs, 𝔼 is a mathematical expectation function, and ∥·∥1 is an L1 norm.
In this implementation, Potsdam city images are taken as the source domain and Vaihingen city images are taken as the target domain, with an image size of 512×512 pixels and 3 channels. The number of Potsdam city images (source domain) is 832 and the number of Vaihingen city images (target domain) is 845, and the images include buildings. The image conversion model uses a GAN, which includes an image translation network GS→T from a Potsdam image xs to a Vaihingen image xt, an image translation network GT→S from the Vaihingen image xt to the Potsdam image xs, as well as a Potsdam domain discriminator DS and a Vaihingen domain discriminator DT. The generator network structure is a 9-layer ResNet. The discriminator network structure is a 4-layer CNN. Training loss functions include a cycle consistency loss function, a semantic consistency loss function, an adversarial loss function, and a self-loss function. The learning rate is 10⁻⁴. The optimization algorithm is Adam. Training is stopped after 100 epochs. After the training is completed, a Potsdam-Vaihingen direction image translation network GS→T and 10 Vaihingen-Potsdam direction image translation networks GT→S are obtained. Then, 4000 Potsdam satellite images with 512×512 pixels and 3 channels are converted from the Potsdam domain to the Vaihingen domain using the translation network GS→T, to obtain pseudo Vaihingen images GS→T(xs). The pseudo Vaihingen (target domain) images GS→T(xs) and the Potsdam (source domain) labels ys are then used to train a pseudo Vaihingen (target domain) semantic segmentation model Finter.
Commonly used networks such as DeepLab and U-Net can be used as the model network structure. The loss function is a cross-entropy loss with K categories, and the corresponding formula is as follows:
wherein xs is a source domain image, ys is a source domain image label, K is a number of label categories, Finter is the source-target inter-domain semantic segmentation model trained on the pseudo target domain images, and 𝟙[k=ys] is an indicator function that takes the value 1 when k = ys and 0 otherwise.
In this implementation, 4000 pseudo Vaihingen domain images GS→T(xs) with 512×512 pixels and 3 channels and the source domain labels ys are used to train a semantic segmentation model Finter on the Vaihingen domain. DeepLab V3+ is used as the model network structure, the learning rate is 10⁻⁴, the optimization algorithm is Adam, and the semantic segmentation model Finter on the pseudo Vaihingen domain is obtained by training for 100 epochs.
(2) Inputting the target domain image xt into the source-target inter-domain semantic segmentation model Finter to obtain a category segmentation probability Pt of the target domain image xt, and then using the category segmentation probability Pt to calculate segmentation probability credibility St and a target domain pseudo label.
In this implementation, 500 Vaihingen domain images xt with 512×512 pixels and 3 channels are input into the source-target inter-domain semantic segmentation model Finter to obtain the category segmentation probability Pt of each target domain image xt, and the category segmentation probability Pt is used to calculate the segmentation probability credibility St and the target domain pseudo label. A calculation method for the segmentation probability credibility St is as follows:
wherein Σ represents a mathematical summation symbol, ∏ represents a mathematical product symbol, H is a length of the target domain image xt, W is a width of the target domain image xt, C is a number of classification categories of the target domain image xt, Pt is a category segmentation probability (a matrix with a size of H×W×C) obtained by inputting the target domain image xt into the semantic segmentation model Finter, Pt(h,w,c) is a category segmentation probability of a pixel point with coordinates (h, w) and category c in the category segmentation probability Pt, and ∏c Pt(h,w,c) is the product of the category segmentation probabilities over all categories c for the pixel point with coordinates (h, w).
A method for obtaining the target domain pseudo label using the category segmentation probability Pt is as follows:
wherein argmax is a function that returns the argument at which a maximum value is attained, argmax over c̃ of Pt(h,w,c̃) is the category c̃ with the highest category segmentation probability for the pixel point with coordinates (h, w) in the category segmentation probability Pt, μc is a segmentation probability threshold used for generating pseudo labels for the category c, It(h,w) is the segmentation probability perplexity of the pixel point with coordinates (h, w) in the target domain image xt, and v is a segmentation probability perplexity threshold used for generating pseudo labels.
A calculation method for the segmentation probability perplexity It(h,w) is as follows:
wherein ∏ represents a mathematical product symbol, H is a length of the target domain image xt, W is a width of the target domain image xt, C is a number of classification categories of the target domain image xt, and ∏c Pt(h,w,c) is the product of the category segmentation probabilities over all categories c for the pixel point with coordinates (h, w).
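The pseudo-label rule described above can be sketched as follows. The sketch assumes that the perplexity of a pixel is the product of its category probabilities, that a pixel keeps its most probable category only when that probability is at least μc and the perplexity is at most v, and that rejected pixels receive a hypothetical ignore value of 255. The comparison directions and the ignore value are assumptions and are not stated in this description.

```python
import numpy as np

IGNORE = 255  # hypothetical value marking pixels left out of the pseudo label

def perplexity(probs):
    """Product of the category probabilities at each pixel; probs has shape (H, W, C)."""
    return probs.prod(axis=-1)

def pseudo_labels(probs, mu, v):
    """probs: (H, W, C) category segmentation probabilities P_t.
    mu: length-C per-category probability thresholds; v: perplexity threshold."""
    mu = np.asarray(mu)
    best = probs.argmax(axis=-1)      # most probable category per pixel
    best_p = probs.max(axis=-1)       # its probability
    keep = (best_p >= mu[best]) & (perplexity(probs) <= v)
    return np.where(keep, best, IGNORE)
```

A confident pixel has one probability near 1 and the rest near 0, so its product is small; an ambiguous pixel has a larger product and is filtered out by the threshold v.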
(3) Sorting the 500 Vaihingen (target) domain images xt in descending order of their segmentation probability credibility St, and then dividing the sorted target domain images xt equally into 4 subsets of target domain images {Xt1, Xt2, Xt3, Xt4}.
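Step (3) can be sketched in a few lines, assuming the credibility scores have already been computed and are passed in as a plain sequence. The function name `split_by_credibility` is illustrative, and the function returns image indices rather than the images themselves.

```python
import numpy as np

def split_by_credibility(scores, n_subsets=4):
    """Return n_subsets index arrays, ordered from most to least credible."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")   # indices in descending score order
    return np.array_split(order, n_subsets)      # near-equal consecutive chunks
```

For 500 scores and 4 subsets, each chunk holds 125 indices, and the first chunk corresponds to Xt1, the subset with the highest credibility.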
(4) Iteratively training with the subset of Vaihingen (target) domain images Xt1 having the highest segmentation probability credibility and its corresponding subset of pseudo labels, the source-target inter-domain semantic segmentation model Finter, and the subsets of target domain images {Xt2, Xt3, Xt4}, to obtain a target intra-domain semantic segmentation model Fintra.
The intra-domain single-domain adaptation method adopted in this implementation is explained using an adversarial learning-based method, but is not limited thereto. The adversarial learning-based method requires an intra-domain semantic segmentation model Fintra and a discriminator Dintra. Training loss functions include a semantic segmentation loss function and an adversarial loss function.
An equation expression for the semantic segmentation loss function is as follows:
wherein Xi is the i-th subset of target domain images, yi is the subset of pseudo labels corresponding to Xi, K is a number of label categories, Fintra is a semantic segmentation model on a target domain, and 𝟙[k=yi] is an indicator function that takes the value 1 when k = yi and 0 otherwise.
An equation expression for the adversarial loss function is as follows:
wherein Xi is the i-th subset of target domain images, 𝔼 is a mathematical expectation function, and Dintra is a target domain discriminator.
This implementation requires three iterations of intra-domain adaptation.
In the first iteration, the subset of 125 target domain images Xt1 and its corresponding subset of pseudo labels are added to an initially empty training set Xtclean and a corresponding label set, respectively. Adversarial training is then performed using the training set Xtclean of 125 images, the corresponding label set, and the subset of 125 target domain images Xt2, with the source-target inter-domain semantic segmentation model Finter used as the initial target intra-domain semantic segmentation model Fintra(1). The segmentation model network structure is DeepLab V3+, the discriminator network structure is a 4-layer CNN, the learning rate is 10⁻⁴, and the optimization algorithm is Adam. Training is stopped after 100 epochs, and Fintra(2) is obtained.
In the second iteration, the subset of 125 target domain images Xt2 is input into the target intra-domain semantic segmentation model Fintra(2) to obtain a category segmentation probability Pt2, and a subset of pseudo labels for Xt2 is obtained according to Pt2. The subset Xt2 and its subset of pseudo labels are added to the training set Xtclean and the corresponding label set, respectively. Adversarial training is then performed using the training set Xtclean of 250 images, the corresponding label set, the subset of 125 target domain images Xt3, and the intra-domain semantic segmentation model Fintra(2). The segmentation model network structure is DeepLab V3+, the discriminator network structure is a 4-layer CNN, the learning rate is 10⁻⁴, and the optimization algorithm is Adam. Training is stopped after 100 epochs, and Fintra(3) is obtained.
In the third iteration, the subset of 125 target domain images Xt3 is input into the target intra-domain semantic segmentation model Fintra(3) to obtain a category segmentation probability Pt3, and a subset of pseudo labels for Xt3 is obtained according to Pt3. The subset Xt3 and its subset of pseudo labels are added to the training set Xtclean and the corresponding label set, respectively. Adversarial training is then performed using the training set Xtclean of 375 images, the corresponding label set, the subset of 125 target domain images Xt4, and the intra-domain semantic segmentation model Fintra(3). The segmentation model network structure is DeepLab V3+, the discriminator network structure is a 4-layer CNN, the learning rate is 10⁻⁴, and the optimization algorithm is Adam. Training is stopped after 100 epochs, and the final target intra-domain semantic segmentation model Fintra (Fintra(4)) is obtained.
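The iteration schedule described above can be sketched as a short loop. The callables `train_adversarial` and `make_pseudo_labels` are hypothetical placeholders for the adversarial training run and the pseudo-label generation; only the bookkeeping of the growing training set and the model hand-over follows the description.

```python
def iterative_intra_domain_adaptation(subsets, f_inter, train_adversarial, make_pseudo_labels):
    """subsets: [X_t1, ..., X_tn], ordered by decreasing credibility.

    train_adversarial(model, clean_images, clean_labels, unlabeled) -> model  (hypothetical)
    make_pseudo_labels(model, images) -> list of labels                       (hypothetical)
    """
    model = f_inter                                   # F_intra(1) is initialised from F_inter
    clean_images = list(subsets[0])                   # start from the most credible subset
    clean_labels = list(make_pseudo_labels(f_inter, subsets[0]))
    last = len(subsets) - 1
    for i, unlabeled in enumerate(subsets[1:], start=1):
        # Adversarial training on (clean set, next subset) yields F_intra(i + 1).
        model = train_adversarial(model, clean_images, clean_labels, unlabeled)
        if i < last:
            # Label the subset just adapted to with the new model and absorb it.
            clean_images += list(unlabeled)
            clean_labels += list(make_pseudo_labels(model, unlabeled))
    return model
```

With four subsets of 125 images, the three training runs see clean sets of 125, 250, and 375 images, matching the three iterations above.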
(5) Inputting the target domain image xt into the target intra-domain semantic segmentation model Fintra to obtain a final segmentation result map of the target domain image xt.
Table 1 shows the precision, recall, F1, and IoU indexes obtained in relevant experiments by comparing the label truth values with the results of pre-migration, histogram matching (a traditional method), a GAN-based inter-domain domain adaptation method, single intra-domain adaptation, and the iterative intra-domain adaptation strategy of the present invention.
From the above experimental results, it can be seen that compared with pre-migration, this implementation effectively improves the IoU index of semantic segmentation, with an improvement of 0.2510. Compared with simple histogram matching, the IoU index of this implementation is improved by 0.1973. Compared with inter-domain domain adaptation, the IoU index of single intra-domain adaptation is improved by 0.0296, indicating that intra-domain adaptation can reduce intra-domain differences. Compared with single intra-domain adaptation, the IoU index of iterative intra-domain adaptation is further improved by 0.0172, indicating that iterative intra-domain adaptation can further reduce intra-domain differences. Therefore, the present invention is of great help in improving the performance of cross-satellite remote sensing image semantic segmentation.
The above description of the embodiments is for the convenience of those of ordinary skill in the art to understand and apply the present invention. Those familiar with the art can obviously easily make various modifications to the above embodiments and apply the general principles explained here to other embodiments without creative labor. Therefore, the present invention is not limited to the aforementioned embodiments, and the improvements and modifications made by those skilled in the art based on the disclosure of the present invention shall be within the scope of protection of the present invention.
Number | Date | Country | Kind |
---|---|---|---|
202210402338.4 | Apr 2022 | CN | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/CN2022/090009 | 4/28/2022 | WO |