The present invention belongs to the field of medical image registration, and more specifically, relates to a building method of a multi-modal three-dimensional medical image segmentation and registration model, and an application thereof.
The incidence of soft tissue cancers has risen rapidly in recent years, and the proportion of malignant tumors among causes of death is also increasing year by year; cancer has become an important threat to human life and health. The optimal response to cancer is early diagnosis and early treatment. For example, puncture biopsy is the gold standard for prostate cancer diagnosis, and an accurate prostate puncture navigation system can effectively increase the positive rate of puncture and reduce the trauma it causes. Transrectal ultrasound is convenient, radiation-free, and real-time, and is therefore often used for prostate puncture navigation; however, it is poor at identifying tumors and prone to inaccurate target positioning. Magnetic resonance imaging has good soft tissue resolution and can distinguish between different types of prostate lesions, so there is great interest in simultaneously displaying, in an ultrasound image, the rich information contained in a three-dimensional magnetic resonance image. The key technique for this is real-time, accurate registration of three-dimensional magnetic resonance and ultrasound images.
At present, many scholars have proposed using deep learning to achieve image registration. The mainstream approach uses a deep learning registration model to directly predict a pixel-level deformation field, achieving end-to-end registration without iteration and thereby speeding up registration. Mainstream deep learning registration models are mainly based on convolutional neural networks (CNNs), and some researchers also combine a CNN with a transformer to improve performance. Most of these models use encoder-decoder frameworks and output deformation parameters or deformation fields only at the end of the model. This design does not fully exploit a characteristic of registration tasks, namely that registration proceeds progressively from low to high resolution, so registration performance is limited.
Another problem of existing deep learning registration methods is that true deformation fields are difficult to acquire for training. Early supervised registration methods train a deep learning model using annotated landmark points of particular anatomical structures or randomly generated known deformation fields, but the cost of acquiring such labels is high, and it is difficult to apply these methods effectively in actual tasks. Therefore, unsupervised methods have been proposed for medical image registration. Their principle is to first apply a predicted deformation field to a floating image, and then evaluate the difference between the registered image and the reference image via designed similarity measures, such as mean squared error and normalized cross-correlation, so as to build a loss function; the performance of these methods therefore depends on the similarity measure. Because a clear hierarchical structure can be observed in a magnetic resonance image of the prostate, whereas the internal tissue structure of the prostate is blurred in an ultrasound image, the similarity between images of the two modalities is low, and it is difficult to train registration models for magnetic resonance and ultrasound images of the prostate with existing loss functions.
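As an illustration of this generic unsupervised principle (not the method of the present invention), the following PyTorch sketch warps a floating volume with a predicted dense displacement field and scores it against the reference image with mean squared error; the function names, tensor shapes, and voxel-unit flow convention are all assumptions.

```python
# Minimal sketch of generic unsupervised registration: warp the floating
# (moving) image with a predicted displacement field, then measure similarity.
import torch
import torch.nn.functional as F

def warp(moving: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """moving: (B, 1, D, H, W); flow: (B, 3, D, H, W) displacements in voxels."""
    B, _, D, H, W = moving.shape
    # Identity sampling grid in voxel coordinates, ordered (z, y, x).
    zz, yy, xx = torch.meshgrid(
        torch.arange(D), torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((zz, yy, xx)).float().to(moving.device)
    coords = grid.unsqueeze(0) + flow                 # displaced coordinates
    # Normalize each axis to [-1, 1] as grid_sample expects.
    for i, size in enumerate((D, H, W)):
        coords[:, i] = 2.0 * coords[:, i] / (size - 1) - 1.0
    # Reorder to (B, D, H, W, 3) with last dim (x, y, z).
    coords = coords.permute(0, 2, 3, 4, 1).flip(-1)
    return F.grid_sample(moving, coords, align_corners=True)

def unsupervised_loss(fixed, moving, flow):
    # Mean squared error here; normalized cross-correlation is also common.
    return F.mse_loss(warp(moving, flow), fixed)
```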
In order to solve the problem of image registration between multi-modal images with low similarity therebetween, such as magnetic resonance images and ultrasound images, some scholars have proposed to first extract a target region via image segmentation and then perform registration. This method can build a similarity loss according to the contour of the target region, and remove interference from a background region, but errors of the segmentation model tend to affect the performance of subsequent registration. For example, training data with segmentation errors is prone to result in wrong convergence and overfitting of the registration model.
In view of the defects in the prior art and improvement requirements, provided in the present invention are a building method of a multi-modal three-dimensional medical image segmentation and registration model, and an application thereof, and the objective thereof is to improve the accuracy of image registration between multi-modal three-dimensional medical images with a low degree of similarity.
In order to achieve the above objective, according to an aspect of the present invention, provided is a building method of a multi-modal three-dimensional medical image segmentation and registration model, comprising:
Further, network architectures of the reference image segmentation model, the floating image segmentation model, and the registration model are the same, and are all multi-scale self-attention networks.
Further, during training, segmentation losses Lseg of the reference image segmentation model and the floating image segmentation model are both as follows:
Further, during training, a registration loss Lreg of the registration model is:
Further, the value of λ is 10, and the value of β is 0.1.
Further, the multi-scale self-attention network comprises: an encoding module, and cascaded n groups of decoding modules, encoding and decoding branch feature fusion modules, and self-attention modules, n being the total number of scales corresponding to the multi-scale,
Further, the encoding module comprises a convolutional layer and cascaded n groups of residual convolution modules and convolutional downsampling modules, wherein the convolutional layer is used to perform feature extraction on a received three-dimensional image to acquire the feature map fe1; and the cascaded n groups of residual convolution modules and convolutional downsampling modules are respectively used to perform feature extraction and resampling on the feature map output by the upper stage, so as to acquire n feature maps of different resolutions, all lower than the resolution of the feature map fe1;
Further, the self-attention module performs the self-attention principle calculation in the following manner:
Further provided in the present invention is a multi-modal three-dimensional medical image segmentation and registration method, which uses a segmentation and registration model built by means of the building method described above to perform the following steps:
Further provided in the present invention is a computer-readable storage medium, comprising a stored computer program, wherein when executed by a processor, the computer program controls a device where the storage medium is located to perform the building method of a multi-modal three-dimensional medical image segmentation and registration model described above and/or the multi-modal three-dimensional medical image segmentation and registration method described above.
In general, the above technical solutions proposed in the present invention can achieve the following beneficial effects:
(1) The present invention provides a novel building method of a registration model, and innovatively proposes building a segmentation and registration model; that is, the models corresponding to segmentation and registration are trained synchronously, and are collectively referred to as a segmentation and registration model in the present invention. Specifically, for segmentation and registration of medical images, it is proposed to use a multi-scale network to perform multi-scale segmentation and registration respectively. At each resolution level of the segmentation and registration processes, a corresponding deformation field or segmentation result map is generated. When the loss function is built, the segmentation loss and the registration loss are each the sum of the corresponding losses at each scale, and the parameters of all network models are adjusted on the basis of the sum of the segmentation loss and the registration loss. In this synchronous training manner, parameter adjustment of the segmentation model improves registration accuracy by improving segmentation accuracy, thereby avoiding the overfitting caused by independently training segmentation and registration, and improving the training effect of the registration model. Moreover, compared with an end-to-end registration model, the deformation field output by the finally acquired segmentation and registration model is optimized repeatedly from coarse to fine levels, so the model has higher registration accuracy. In addition, the level set energy function loss and the gradient loss are added to the segmentation loss to quickly constrain the segmentation region into a uniform, simply connected domain during early training, so as to ensure the accuracy of the early training of the registration model. Therefore, the present invention improves overall registration performance while simultaneously providing a segmentation model and a registration model.
(2) In the multi-scale self-attention network proposed in the present invention, at each resolution level in the decoding process, a self-attention module is guided by an original input to form a deformation field or a segmented image of a corresponding resolution by using input features, and the segmentation and registration performance are enhanced by introducing a cascaded update process from low resolutions to high resolutions.
(3) The registration loss employed by the method of the present invention comprises constraints on the contour and constraints on the regularization of the deformation field, thereby improving registration accuracy while reducing folding.
In order to make the purpose, technical solution, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be appreciated that the specific embodiments described here are used merely to explain the present invention and are not used to limit the present invention. In addition, the technical features involved in various embodiments of the present invention described below can be combined with one another as long as they do not constitute a conflict therebetween.
A building method of a multi-modal three-dimensional medical image segmentation and registration model includes:
S1, acquiring three-dimensional medical images of two modalities for each of a plurality of targets of the same type, wherein images of one modality are used as reference images, and images of the other modality are used as floating images; and forming a pair of images from the two three-dimensional medical images corresponding to each target, and using the pair of images as a training sample, so as to acquire a training sample set.
It should be noted that, because an image acquired in actual medical practice may include a relatively large number of regions other than the target organ, the acquired three-dimensional image is typically trimmed in the industry to improve calculation efficiency, and pixel value normalization is then performed on the pixels within the image. The processed three-dimensional image is used as the three-dimensional medical image acquired in step S1 of the present method.
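A hedged sketch of this preprocessing: trim the volume to a region of interest around the target organ, then min-max normalize the voxel values. The ROI bounds and the normalization scheme are illustrative assumptions, not the patent's specification.

```python
import numpy as np

def preprocess(volume: np.ndarray, roi: tuple) -> np.ndarray:
    """Trim to the region of interest, then normalize voxel values to [0, 1]."""
    cropped = volume[roi]
    lo, hi = cropped.min(), cropped.max()
    return (cropped - lo) / (hi - lo + 1e-8)

# Example: keep a central 96x96x96 block of a 128x128x128 volume.
vol = np.random.rand(128, 128, 128).astype(np.float32)
roi = (slice(16, 112), slice(16, 112), slice(16, 112))
print(preprocess(vol, roi).shape)   # (96, 96, 96)
```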
S2, using the training sample set to simultaneously train and optimize parameters of a reference image segmentation network, a floating image segmentation network, and a registration network on the basis of the sum of a segmentation loss and a registration loss, so as to acquire a segmentation and registration model formed by a reference image segmentation model, a floating image segmentation model, and a registration model.
It should be noted that, before training, it is typical in the industry to first determine the network architectures of the reference image segmentation model, the floating image segmentation model, and the registration model, and to initialize their parameters. Additionally, the multi-scale segmentation may proceed as follows: the original image is first resampled to acquire images of different scales (i.e., different resolutions), and a segmentation network model then performs segmentation on the target regions in the images of different scales, so as to acquire a multi-scale segmentation result. The resampling operation may be regarded as one of the functions of the segmentation network model. The deformation field consists of the displacement of each pixel point in the reference image relative to the corresponding point in the floating image.
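An illustrative sketch of the multi-scale resampling just described, producing an image pyramid before segmentation; the scale factors are an assumption chosen to match the 1/8 to 1 scales used later in the detailed description.

```python
import torch
import torch.nn.functional as F

def make_pyramid(volume: torch.Tensor, scales=(0.125, 0.25, 0.5, 1.0)):
    """volume: (B, C, D, H, W); returns resampled copies, coarsest to finest."""
    return [volume if s == 1.0 else
            F.interpolate(volume, scale_factor=s, mode="trilinear",
                          align_corners=False)
            for s in scales]
```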
The method of the present embodiment is a novel registration model building method, and building of a segmentation and registration model is innovatively proposed. That is, models corresponding to segmentation and registration are trained synchronously, and are collectively referred to as a segmentation and registration model. The method introduces a multi-scale module, and provides a loss function of both segmentation and registration to improve registration performance. The trained model can perform registration for a target region in a multi-modal medical image.
As a preferred embodiment, network architectures of the reference image segmentation model, the floating image segmentation model, and the registration model are all multi-scale self-attention networks, thereby achieving high accuracy.
The training process uses the multi-scale outputs generated by the model, the multi-modal original images, and the segmentation label corresponding to each modal image to build the loss function, and the training objective is to minimize the loss function. The original image and the corresponding segmentation label are downsampled into a plurality of different resolutions to build multi-scale losses. Specifically, the segmentation loss is the sum of the losses corresponding to the multi-scale segmentations. Depending on its loss terms, the registration loss likewise sums losses over a plurality of scales.
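A minimal sketch of this multi-scale loss construction: each per-scale prediction is compared against the label downsampled to the same resolution, and the total loss is the sum over scales. The per-scale loss function and the equal weighting are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def multiscale_loss(preds, label, loss_fn):
    """preds: list of (B, C, D, H, W) outputs, coarse to fine; label: full-res."""
    total = 0.0
    for pred in preds:
        # Downsample the label to the prediction's resolution (nearest keeps
        # the label values discrete).
        target = F.interpolate(label, size=pred.shape[2:], mode="nearest")
        total = total + loss_fn(pred, target)
    return total
```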
The first-order gradient loss may specifically be:
The level set energy function loss may specifically be:
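The exact equations are not reproduced above. As a loudly labeled assumption, the sketch below uses common formulations of the two terms: a first-order gradient loss that penalizes spatial gradients of the predicted probability map, and a Chan-Vese-style level set region energy that drives the segmentation toward a uniform, simply connected region, consistent with the stated purpose of these losses.

```python
import torch

def gradient_loss(s: torch.Tensor) -> torch.Tensor:
    """First-order gradient penalty on a soft segmentation map (B, 1, D, H, W)."""
    dz = (s[:, :, 1:] - s[:, :, :-1]).abs().mean()
    dy = (s[:, :, :, 1:] - s[:, :, :, :-1]).abs().mean()
    dx = (s[:, :, :, :, 1:] - s[:, :, :, :, :-1]).abs().mean()
    return dz + dy + dx

def level_set_loss(s: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
    """Chan-Vese-style region energy with soft membership s in [0, 1]."""
    eps = 1e-8
    c1 = (s * image).sum() / (s.sum() + eps)              # mean intensity inside
    c2 = ((1 - s) * image).sum() / ((1 - s).sum() + eps)  # mean intensity outside
    return (s * (image - c1) ** 2 + (1 - s) * (image - c2) ** 2).mean()
```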
Accordingly, as a preferred embodiment, during training, segmentation losses Lseg of the reference image segmentation model and the floating image segmentation model are both as follows:
As a preferred embodiment, during training, a registration loss Lreg of the registration model is:
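The exact formula for Lreg is not reproduced above. As an explicitly hypothetical composition consistent with the description elsewhere (contour constraints plus regularization of the deformation field, with weights λ = 10 and β = 0.1), the sketch below combines a Dice loss between the warped floating contour and the reference contour with a first-order smoothness penalty on the deformation field; which term each weight multiplies is an assumption.

```python
import torch

def dice_loss(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Soft Dice loss between two contour/segmentation maps."""
    inter = (a * b).sum()
    return 1 - 2 * inter / (a.sum() + b.sum() + 1e-8)

def smoothness(flow: torch.Tensor) -> torch.Tensor:
    """First-order penalty on a (B, 3, D, H, W) deformation field."""
    dz = (flow[:, :, 1:] - flow[:, :, :-1]).pow(2).mean()
    dy = (flow[:, :, :, 1:] - flow[:, :, :, :-1]).pow(2).mean()
    dx = (flow[:, :, :, :, 1:] - flow[:, :, :, :, :-1]).pow(2).mean()
    return dz + dy + dx

def registration_loss(warped_seg, ref_seg, flow, lam=10.0, beta=0.1):
    # Hypothetical placement of the weights: lam on the contour term, beta on
    # the smoothness regularizer, mirroring the values given above.
    return lam * dice_loss(warped_seg, ref_seg) + beta * smoothness(flow)
```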
As a preferred embodiment, as shown in the figure, the multi-scale self-attention network comprises an encoding module and cascaded n groups of decoding modules, encoding and decoding branch feature fusion modules, and self-attention modules, n being the total number of scales.
The cascaded n groups of decoding modules, encoding and decoding branch feature fusion modules, and self-attention modules are used in ascending order of the resolutions of the encoded feature maps. First, the decoding module in the first group performs feature extraction on the lowest-resolution feature map acquired by the encoding and upsamples the result to the adjacent upper resolution level of the encoding. The encoding and decoding branch feature fusion module fuses the feature map output by the decoding module in the same group with the encoded feature map of the same resolution, so as to acquire a decoded feature map. The self-attention module performs the self-attention principle calculation on the basis of the decoded feature map and an image of the same resolution acquired by resampling the combined image, so as to acquire a self-attention feature map of the corresponding resolution and a deformation field or a segmentation result of the corresponding resolution (a segmentation result if the multi-scale self-attention network is used as a segmentation model, and a deformation field if it is used as the registration model). The self-attention feature map is used as the input of the decoding module in the second group, and the above process is repeated until the self-attention module in the n-th group outputs a deformation field or a segmentation result at the resolution of the original image. For example, when the value of n is four, the deformation fields or segmentation results output by the self-attention modules in the four groups are respectively Y^{1/8}, Y^{1/4}, Y^{1/2}, and Y^{1}, where the superscripts represent the ratio of the resolution to the original image resolution.
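A structural sketch of this cascaded decode, fuse, and self-attention flow; the module classes, their interfaces, and the way the guiding image is resampled are illustrative assumptions rather than the patent's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleDecoder(nn.Module):
    def __init__(self, decoders, fusions, attentions):
        super().__init__()
        self.decoders = nn.ModuleList(decoders)      # upsampling decoders
        self.fusions = nn.ModuleList(fusions)        # enc/dec branch fusion
        self.attentions = nn.ModuleList(attentions)  # per-scale attention heads

    def forward(self, enc_feats, image):
        """enc_feats: encoder maps, coarsest first; image: the model input."""
        x, outputs = enc_feats[0], []
        for dec, fuse, attn, skip in zip(
                self.decoders, self.fusions, self.attentions, enc_feats[1:]):
            x = dec(x)                               # upsample one level
            x = fuse(x, skip)                        # fuse with encoder branch
            guide = F.interpolate(image, size=x.shape[2:], mode="trilinear",
                                  align_corners=False)
            y, x = attn(x, guide)                    # per-scale output + features
            outputs.append(y)                        # deformation field or seg map
        return outputs                               # coarse-to-fine Y_i
```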
As a preferred embodiment, the encoding module includes a convolutional layer and cascaded n groups of residual convolution modules and convolutional downsampling modules, wherein the convolutional layer is used to perform feature extraction on a received three-dimensional image to acquire the feature map fe1; the cascaded n groups of residual convolution modules and convolutional downsampling modules are respectively used to perform feature extraction and resampling on the feature map output by the upper stage, so as to acquire n feature maps of different resolutions, all lower than the resolution of the feature map fe1;
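A hedged sketch of such an encoder: one convolutional layer produces fe1, followed by n cascaded stages, each a residual convolution block plus a strided-convolution downsampler. The channel widths, normalization choice, and n = 4 are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(ch, ch, 3, padding=1), nn.InstanceNorm3d(ch), nn.ReLU(),
            nn.Conv3d(ch, ch, 3, padding=1), nn.InstanceNorm3d(ch))
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(x + self.body(x))   # residual connection

class Encoder(nn.Module):
    def __init__(self, in_ch=2, base=16, n=4):
        super().__init__()
        self.stem = nn.Conv3d(in_ch, base, 3, padding=1)   # produces f_e1
        stages, ch = [], base
        for _ in range(n):
            stages.append(nn.Sequential(
                ResBlock(ch),
                nn.Conv3d(ch, ch * 2, 3, stride=2, padding=1)))  # downsample x2
            ch *= 2
        self.stages = nn.ModuleList(stages)

    def forward(self, x):
        feats = [self.stem(x)]               # f_e1 at full resolution
        for stage in self.stages:
            feats.append(stage(feats[-1]))   # successively lower resolutions
        return feats                         # n + 1 maps, finest to coarsest
```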
As a preferred embodiment, the self-attention module performs the self-attention principle calculation in the following manner:
As shown in the figure, the decoded feature map fDi and the image acquired by resampling the input to the same resolution are mapped into three feature vector sets (i.e., feature matrices) Di, Kli, and Vli. The three feature matrices Di, Kli, and Vli generate an output vector Voi on the basis of the self-attention principle; after the output vector Voi is reconstructed into an image feature, the image feature and fDi are added to acquire f̂Di. f̂Di is subjected to layer normalization, and then an output Yi (a segmentation result si or a registration deformation field ϕi) of the current resolution and a feature fci to be transmitted to the next stage are generated via two feedforward neural networks (FFNs).
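An illustrative PyTorch sketch of this step, treating the 3-D feature maps as token sequences. Which tensor feeds the query versus the key/value projections, the head count, and the dimensions are assumptions, not the patent's specification.

```python
import torch
import torch.nn as nn

class ScaleAttention(nn.Module):
    def __init__(self, dim, out_ch):
        super().__init__()
        self.q = nn.Linear(dim, dim)   # assumed: queries from the guiding image
        self.k = nn.Linear(dim, dim)   # assumed: keys/values from f_Di
        self.v = nn.Linear(dim, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn_out = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                     nn.Linear(dim, out_ch))   # -> Y_i
        self.ffn_next = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                      nn.Linear(dim, dim))     # -> f_ci

    def forward(self, f_d, guide):
        """f_d, guide: (B, N, dim) token sequences flattened from 3-D maps."""
        v_o, _ = self.attn(self.q(guide), self.k(f_d), self.v(f_d))
        f_hat = self.norm(v_o + f_d)        # add V_oi to f_Di, then layer norm
        return self.ffn_out(f_hat), self.ffn_next(f_hat)   # Y_i, f_ci
```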
Preferably, in the encoding and decoding branch feature fusion, as shown in the figure, the feature map output by the decoding module is fused with the encoded feature map of the same resolution to acquire the decoded feature map.
A multi-modal three-dimensional medical image segmentation and registration method uses a segmentation and registration model built by using the building method described in Embodiment 1 to perform the following steps:
That is, a multi-modal three-dimensional medical image to be subjected to registration is acquired, and is trimmed according to a region of interest, and then pixel value normalization processing is performed. Then, the image is input into the trained segmentation and registration model to acquire a deformation field, thereby acquiring a registration result.
The related technical solution is the same as that in Embodiment 1, and will not be described herein again.
To better describe the effect of the present invention, the following examples are provided for verification:
On the basis of the above method, training and testing of a multi-modal three-dimensional medical image segmentation and registration model were performed as Example 1, and specifically included the following steps:
(1) Data was acquired from the Prostate-MRI-US-Biopsy data set in the public database of The Cancer Imaging Archive (TCIA). After matching, 502 pieces of data were acquired, each including a magnetic resonance image, an ultrasound image, and the corresponding prostate labels. 452 pieces of data were randomly selected for model training, and the remaining 50 pieces were used for model testing. During preprocessing, the original images were resampled to a spatial resolution of 1 mm × 1 mm × 1 mm. During training, the same random translation (panning) was applied to each pair of images, thereby enhancing data diversity.
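A minimal sketch of such pair-consistent random translation augmentation, assuming torch.roll shifts, an illustrative ±8-voxel range, and that "equal amount" means the same shift is applied to both images (and labels) of a pair; the exact scheme is not specified above.

```python
import torch

def random_pan(pair, max_shift=8):
    """pair: list of (B, C, D, H, W) tensors sharing one random translation."""
    shifts = tuple(int(torch.randint(-max_shift, max_shift + 1, (1,)))
                   for _ in range(3))
    # Apply the identical shift along the three spatial axes of every tensor.
    return [t.roll(shifts=shifts, dims=(2, 3, 4)) for t in pair]
```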
(2) By means of the above training set, the provided segmentation and registration model (referred to as MSANet for short) was trained using the loss function provided in the present invention. The optimal model parameters were then loaded and applied to the test set, and the registration result for the prostate region in each image was output. Finally, registration accuracy was quantitatively evaluated.
Further, in order to verify the method of the present invention, the following comparative examples (the comparative examples used the same data set) were designed:
Comparative Example 1: VoxelMorph (VoxelMorph: A Learning Framework for Deformable Medical Image Registration. IEEE Trans. Med. Imaging 38, 1788-1800, 2019) was used to perform a registration task with respect to a prostate region in a multi-modal three-dimensional medical image. Training was performed by using the data set, the learning rate, the number of iterations, and the optimizer parameters that were the same as those in the method of the present invention.
Comparative Example 2: TransMorph (TransMorph: Transformer for unsupervised medical image registration. Medical Image Analysis 82, 102615, 2022) was used to perform a registration task with respect to a prostate region in a multi-modal three-dimensional medical image. Training was performed by using the data set, the learning rate, the number of iterations, and the optimizer parameters that were the same as those in the method of the present invention.
Comparative Example 3: ASNet (Adversarial learning for mono- or multi-modal registration. Medical Image Analysis 58, 101545, 2019) was used to perform a registration task with respect to a prostate region in a multi-modal three-dimensional medical image. Training was performed by using the data set, the learning rate, the number of iterations, and the optimizer parameters that were the same as those in the method of the present invention.
Comparative Example 4: Attention-Reg (Cross-modal attention for multi-modal image registration. Medical Image Analysis 82, 102612, 2022) was used to perform a registration task with respect to a prostate region in a multi-modal three-dimensional medical image. Training was performed by using the data set, the learning rate, the number of iterations, and the optimizer parameters that were the same as those in the method of the present invention.
To show the advantages of the present invention, the registration performance of Example 1 was compared with those of Comparative Examples 1-4. For quantitative comparison, evaluation was performed using the Dice similarity coefficient (DSC), the 95% Hausdorff distance (HD95), and the folding ratio of the deformation field, i.e., the proportion of voxels at which the Jacobian determinant of the deformation field is non-positive (|Jϕ| ≤ 0). The definitions are as follows:
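The patent's original equations are not reproduced here; for reference, the standard formulations of these three metrics (one common variant of HD95 is shown; the patent's notation may differ) are:

```latex
\[
\mathrm{DSC}(A,B) = \frac{2\,|A \cap B|}{|A| + |B|}, \qquad
\mathrm{HD95}(A,B) = \max\Bigl\{
  \operatorname*{P95}_{a \in \partial A} \min_{b \in \partial B} \|a-b\|,\;
  \operatorname*{P95}_{b \in \partial B} \min_{a \in \partial A} \|b-a\| \Bigr\},
\]
\[
\text{folding ratio} = \frac{\bigl|\{\, x \in \Omega : \det J_\phi(x) \le 0 \,\}\bigr|}{|\Omega|},
\]
% where \partial A, \partial B are the segmentation surfaces, P95 denotes the
% 95th percentile, and J_\phi is the Jacobian of the deformation field \phi.
```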
Table 1 shows the quantitative evaluation results of the registration results of Example 1 and Comparative Examples 1-4. It can be seen that MSANet provided in the present invention achieved the greatest average DSC and the smallest average HD95, with the smallest variance in each case. Moreover, the average folding ratio of the deformation field predicted by MSANet was 0.001, with a variance of 0.0003; both were less than those in the results of Comparative Examples 1-4. The above results show that the present invention achieved more accurate registration of the prostate in three-dimensional magnetic resonance and ultrasound images.
In order to more intuitively show the advantages of the present invention, visual effect diagrams of the registration results corresponding to Example 1 and Comparative Examples 1-4, together with visual diagrams of the corresponding deformation fields, are provided in the figures.
In the above embodiments, the registration of the prostate region in the ultrasound and magnetic resonance images is used as an example to sufficiently show that the present invention achieves higher registration accuracy for multi-modal three-dimensional medical images. Embodiment 1 is based on deep learning, and can be used for registration tasks for a multi-modal three-dimensional medical image.
The above embodiments are merely examples. In addition to ultrasound and magnetic resonance images, the method of the present invention is also applicable to registration of three-dimensional medical images of other modalities, such as CT. The method is likewise applicable to medical image registration for the kidney and other regions.
A computer-readable storage medium includes a stored computer program, wherein when executed by a processor, the computer program controls a device where the storage medium is located to perform the building method of a multi-modal three-dimensional medical image segmentation and registration model according to Embodiment 1 and/or the multi-modal three-dimensional medical image segmentation and registration method according to Embodiment 2.
The related technical solution is the same as that in Embodiment 1, and will not be described herein again.
It should be easily understood by those skilled in the art that the foregoing description covers only preferred embodiments of the present invention and is not intended to limit the present invention. All modifications, equivalent replacements, and improvements within the spirit and principle of the present invention should fall within the scope of protection of the present invention.
Number | Date | Country | Kind
202410099911.8 | Jan 2024 | CN | national