The present invention belongs to the field of medical image segmentation, and more specifically, to a temporal information enhancement-based method for 3D medical image segmentation.
Automatic image segmentation technology is widely applied in the medical field and plays an important role in the development of computer-aided diagnosis and treatment systems such as clinical diagnosis, surgical navigation, etc. For example, puncture biopsy is the gold standard for prostate cancer diagnosis, and accurate puncture navigation techniques can effectively improve the detection rate of puncture biopsy and reduce trauma to patients. Real-time 3D ultrasound attracts much attention in the field of puncture navigation. Automatic segmentation of a biopsy needle in a 3D ultrasound image is the key technology for achieving navigation in puncture surgery.
Currently, many deep learning models have been proposed for image segmentation tasks, and the convolutional neural network (CNN) and the transformer have become the mainstream methods for image segmentation. In recent years, the CNN and the transformer have been combined to further improve the accuracy of segmentation networks, but the transformer module requires substantial computing and memory resources. Therefore, combinations of the CNN and the transformer mostly rely on the encoder-decoder structure, which can eliminate redundant information by encoding so as to reduce model complexity.
With the development of image processing technology, many researchers have found that the utilization of temporal information is one of the main directions for improving the accuracy of medical image segmentation. For example, when a biopsy needle moves, temporal information can provide a reference for its relative movement and shape, thereby greatly reducing the difficulty of detecting it. Thanks to the ability of the transformer to learn global feature correlations, the transformer is well suited to processing multi-frame images and is widely applied to video segmentation tasks. However, existing research mainly deals with 2D natural image sequences, and existing methods for real-time needle detection in 3D ultrasound do not achieve good results.
Another significant issue with medical image segmentation is the high cost of data annotation. Therefore, numerous attempts have been made to explore semi-supervised segmentation methods, in which only a few images in a data set are annotated while segmentation accuracy close to that achieved by annotating all data is attained. A currently popular semi-supervised segmentation strategy is consistency learning, in which the model is encouraged to produce similar outputs when a sample or a model parameter is slightly perturbed. This forces the output features of similar samples closer together while pushing those of different categories further apart, so that model performance is indirectly enhanced by using unlabeled samples. Existing consistency learning schemes are mainly implemented by setting up parallel networks or by transforming the input data; the former occupies additional computing resources, and the performance of the latter is not ideal.
For the above defects or improvement requirements in the prior art, provided in the present invention is a temporal information enhancement-based method for 3D medical image segmentation. The method uses a circle transformer to extract motion information of a target in a 3D image sequence during training, thereby helping to improve segmentation accuracy for a target region image in a 3D medical image. The method is applicable to all segmentation models based on the encoder-decoder structure. Segmentation results before and after the combination of temporal information are both constrained, thereby eliminating the dependency of the model on a temporal module and improving the segmentation accuracy of the model at no additional cost. During application, only a single frame of 3D image needs to be input, and no sequence is required as input, so that the application mode is more flexible. For unlabeled data, the method calculates a consistency loss from the output probability maps before and after the combination of temporal information, thereby facilitating improvement in model performance while requiring no additional memory.
In order to achieve the above objective, according to a first aspect of the present invention, provided is a temporal information enhancement-based method for 3D medical image segmentation, comprising:
According to a second aspect of the present invention, provided is a temporal information enhancement-based system for 3D medical image segmentation, comprising: a computer-readable storage medium and a processor,
According to a third aspect of the present invention, provided is a computer-readable storage medium storing computer instructions, the computer instructions being configured to cause a processor to perform the method according to the first aspect.
In general, compared with the prior art, the above technical solutions conceived by the present invention can achieve the following beneficial effects:
In conclusion, in the present invention, deep learning is combined with semi-supervised training of a temporal information enhancement-based segmentation model to train a 3D medical image segmentation model. During the semi-supervised training, a circle transformer module is constructed to extract temporal motion information and thereby optimize the training process. In addition, the segmentation results before and after the combination of temporal information are both supervised, so that the model is no longer temporally dependent, which improves the training effect of the segmentation model at no additional cost. A new consistency loss is proposed on the basis of the segmentation results before and after the combination of temporal information, to constrain unlabeled data and achieve semi-supervised training, thereby facilitating improvement in model performance while requiring no additional memory.
In order to clarify the purpose, technical solution, and advantages of the present invention, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be appreciated that the specific embodiments described here are used merely to explain the present invention and are not used to limit the present invention. In addition, the technical features involved in various embodiments of the present invention described below can be combined with one another as long as they do not constitute a conflict therebetween.
Provided in an embodiment of the present invention is a temporal information enhancement-based method for 3D medical image segmentation. The method uses a circle transformer module to extract temporal information to enhance the training effect of a segmentation model. The trained model can perform segmentation for a target region image in a medical image, as shown in the accompanying drawings.
Specifically, the training phase comprises:
The semi-supervised training framework for a temporal information enhancement-based segmentation model constructed and acquired in step (3) specifically includes:
Regarding the structure of the circle transformer module, the circle transformer module generates the pre-decoded feature by fusing two groups of features via two self-attention calculations with exchanged query objects. That is, in the first self-attention calculation, the static feature is treated as the Key and Value, and the dynamic feature is treated as the Query; an intermediate variable having the same size as the dynamic feature is generated by means of a weighted sum of the static features. In the second self-attention calculation, the above intermediate variable is treated as the Key and Value, and the static feature is treated as the Query. Layer normalization with a residual connection, followed by feedforward neural network processing, is performed on the result of the second self-attention calculation to acquire the pre-decoded feature, i.e., the final output of the circle transformer module. The final output is thus a twice-weighted sum of the static features. In this way, the circle transformer module performs weighted updating of the static features, under the guidance of the dynamic features and on the basis of the correlation between the dynamic and static features, while keeping the feature size unchanged.
The method provided in the present invention utilizes the circle transformer module to introduce motion information of the target between an adjacent frame and the current frame, and enables the model to pay close attention to the shape and position features of the needle during training, so that the trained model has higher segmentation accuracy and is more robust in complex environments. During calculation of the pre-decoded feature of a given frame, the static feature is denoted as fs, the dynamic feature as fd, the intermediate variable of the self-attention calculation as fm, the output of the self-attention calculation as fa, and the output of the circle transformer module as fo; LN(·) and FFN(·) denote layer normalization and the feedforward network, respectively. The calculation process is as follows:
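The original formulas are not reproduced in this text; the following reconstruction is consistent with the description above, where Attn(Q, K, V) denotes scaled dot-product attention (whether a residual is applied around the FFN is not specified, so none is assumed):

\[
f_m = \mathrm{Attn}(f_d, f_s, f_s), \qquad f_a = \mathrm{Attn}(f_s, f_m, f_m), \qquad f_o = \mathrm{FFN}\bigl(\mathrm{LN}(f_a + f_s)\bigr)
\]

Below is a minimal PyTorch sketch of this module; single-head attention, learned linear projections, and the feedforward width are assumptions not fixed by the source.

```python
import torch.nn as nn
import torch.nn.functional as F

class CircleTransformer(nn.Module):
    """Illustrative circle transformer: two attention passes with swapped
    query roles, fusing a static feature fs with a dynamic feature fd."""

    def __init__(self, dim: int, ffn_dim: int = 256):
        super().__init__()
        self.q1, self.k1, self.v1 = (nn.Linear(dim, dim) for _ in range(3))
        self.q2, self.k2, self.v2 = (nn.Linear(dim, dim) for _ in range(3))
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.GELU(),
                                 nn.Linear(ffn_dim, dim))
        self.scale = dim ** -0.5

    @staticmethod
    def attend(q, k, v, scale):
        # scaled dot-product attention over flattened spatial tokens
        return F.softmax(q @ k.transpose(-2, -1) * scale, dim=-1) @ v

    def forward(self, fs, fd):
        # fs: static feature of the current frame, shape (B, N, C)
        # fd: dynamic feature from the adjacent frame, shape (B, N, C)
        fm = self.attend(self.q1(fd), self.k1(fs), self.v1(fs), self.scale)
        fa = self.attend(self.q2(fs), self.k2(fm), self.v2(fm), self.scale)
        # residual layer normalization, then the feedforward network
        # (a residual around the FFN is another plausible choice; unspecified)
        return self.ffn(self.norm(fa + fs))  # pre-decoded feature, same size as fs
```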
The segmentation loss (Lseg) supervises the segmentation results acquired from the encoded feature and the pre-decoded feature for labeled samples, and consists of the Dice loss (LDice) and the cross-entropy loss (LCE). yi and ŷi respectively represent the label and the prediction for pixel i, and M is the total number of pixels in the sample. The segmentation loss is defined as follows:
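A standard formulation consistent with the definitions above is reconstructed below; equal weighting of the two terms is an assumption, and in practice a small smoothing constant is often added to the Dice term:

\[
L_{seg} = L_{Dice} + L_{CE}
\]
\[
L_{Dice} = 1 - \frac{2\sum_{i=1}^{M} y_i \hat{y}_i}{\sum_{i=1}^{M} y_i + \sum_{i=1}^{M} \hat{y}_i}, \qquad
L_{CE} = -\frac{1}{M}\sum_{i=1}^{M}\bigl[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\bigr]
\]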
The consistency loss (Lcons) is for unlabeled data, and maintains consistency between segmentation results acquired from the encoded feature and the pre-decoded feature. ŷ1,i and ŷ2,i respectively represent predictions for the pixel i in the segmentation results acquired from the encoded feature and the pre-decoded feature. The consistency loss is defined as follows:
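A mean squared error form, common in consistency learning, is assumed in the reconstruction below:

\[
L_{cons} = \frac{1}{M}\sum_{i=1}^{M}\left(\hat{y}_{1,i} - \hat{y}_{2,i}\right)^2
\]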
It can be understood that the segmentation loss and the consistency loss may also be constructed according to related prior art.
Preferably, in the present invention, before the semi-supervised training is performed on the segmentation model, the method further includes: cropping each image in the training set according to a region of interest, then performing pixel value normalization processing, and performing data augmentation such as rotation, translation, and flipping on the cropped images.
Specifically, cropping the training set samples, rotating, translating, and flipping the cropped images, and performing pixel value normalization can reduce computational complexity and improve the diversity of the training data.
Similarly, for an original 3D medical image to be segmented, the original image is cropped according to the region of interest, and a pixel value normalization operation is performed on the cropped image to acquire a sample to be segmented.
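As a concrete illustration of the preprocessing described above, a minimal NumPy sketch follows; the ROI bounds, the min-max normalization scheme, and the augmentation parameters are all assumptions, as the text does not fix them:

```python
import numpy as np

def preprocess(volume: np.ndarray, roi: tuple) -> np.ndarray:
    """Crop a 3D volume to the region of interest and normalize pixel values."""
    (z0, z1), (y0, y1), (x0, x1) = roi
    v = volume[z0:z1, y0:y1, x0:x1].astype(np.float32)
    # min-max normalization to [0, 1]; z-score normalization is another common choice
    return (v - v.min()) / (v.max() - v.min() + 1e-8)

def augment(volume: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Light augmentation: random flip, 90-degree rotation, and integer shift."""
    if rng.random() < 0.5:
        volume = np.flip(volume, axis=int(rng.integers(0, 3))).copy()
    if rng.random() < 0.5:
        volume = np.rot90(volume, k=int(rng.integers(1, 4)), axes=(1, 2)).copy()
    if rng.random() < 0.5:
        volume = np.roll(volume, shift=int(rng.integers(-5, 6)), axis=2)  # crude translation
    return volume
```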
That is, the above semi-supervised training framework for a temporal information enhancement-based segmentation model may be used to train any image segmentation model having the encoder-decoder structure. The trained model takes the sample to be segmented, produced by the image preprocessing function module, as input, and outputs a segmentation result for the target region in the sample. Training uses the training set samples, with the gold-standard segmentations of some samples serving as labels. The training set is acquired by collecting a 3D medical image sequence containing the target region, for which the gold-standard segmentation of the target region of some samples is known, cropping the original image sequence according to a region of interest, and performing pixel value normalization processing.
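A hedged sketch of one training step under this framework is given below; `model`, `seg_loss`, and the consistency weight `lambda_cons` are illustrative assumptions (the model is assumed to return two probability maps, decoded from the encoded feature and from the pre-decoded feature, respectively; PyTorch tensors are assumed throughout):

```python
def train_step(model, optimizer, frames, labels, is_labeled, lambda_cons=0.1):
    """One semi-supervised step: labeled samples are supervised on both
    outputs (before and after temporal fusion); unlabeled samples only
    contribute the consistency loss between the two outputs."""
    optimizer.zero_grad()
    p_enc, p_pre = model(frames)  # probability maps before/after the circle transformer
    if is_labeled:
        # seg_loss is the Dice + cross-entropy loss defined above (assumed helper)
        loss = seg_loss(p_enc, labels) + seg_loss(p_pre, labels)
    else:
        loss = lambda_cons * ((p_enc - p_pre) ** 2).mean()  # Lcons (MSE form)
    loss.backward()
    optimizer.step()
    return loss.item()
```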
A medical image sequence with a total of n frames is used as an example.
On the basis of the above method, an actual 3D medical image segmentation model was trained. Specifically, the following steps are included:
Further, in order to verify the method of the present invention, the following comparative examples (the comparative examples used the same data set) were designed:
TriANet proposed in the document (Triple attention network for video segmentation, Neurocomputing 417, 202-211, 2020) was used to implement the segmentation task of the biopsy needle region in the 3D medical image sequence. Training was performed by using the same data set, learning rate, number of iterations, and optimizer parameters as those in the method of the present invention.
VisTR proposed in the document (End-to-end video instance segmentation with transformers, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8737-8746, 2021) was used to implement the segmentation task of the biopsy needle region in the 3D medical image sequence. Training was performed by using the same data set, learning rate, number of iterations, and optimizer parameters as those in the method of the present invention.
IFC proposed in the document (Video instance segmentation using inter-frame communication transformers, in: Proceedings of the 35th Conference on Neural Information Processing Systems, 2021) was used to implement the segmentation task of the biopsy needle region in the 3D medical image sequence. Training was performed by using the same data set, learning rate, number of iterations, and optimizer parameters as those in the method of the present invention.
AOT proposed in the document (Associating objects with transformers for video object segmentation, in: Proceedings of the 35th Conference on Neural Information Processing Systems, 2021) was used to implement the segmentation task of the biopsy needle region in the 3D medical image sequence. Training was performed by using the same data set, learning rate, number of iterations, and optimizer parameters as those in the method of the present invention.
DeAOT proposed in the document (Decoupling features in hierarchical propagation for video object segmentation, in: Proceedings of the 36th Conference on Neural Information Processing Systems, 2022) was used to implement the segmentation task of the biopsy needle region in the 3D medical image sequence. Training was performed by using the same data set, learning rate, number of iterations, and optimizer parameters as those in the method of the present invention.
To show the advantages of the present invention, the segmentation effect of Example 1 was compared with those of Comparative Examples 1-5. In the quantitative comparison, the Dice similarity coefficient (DSC), the needle tip positioning error (Etip), the needle length error (Elen), and the needle angle error (Eang) were used for evaluation. The DSC was defined as follows:
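\[
DSC = \frac{2\,TP}{2\,TP + FP + FN}
\]

This is the standard definition, consistent with the variables explained below.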
where TP, FP, and FN respectively represented the number of true positives, the number of false positives, and the number of false negatives in the predicted pixel classification results in comparison to the gold standard. For Etip, Elen, and Eang, respective needle tip positions, lengths, and angles were extracted from the segmentation results and labels by means of a linear fitting algorithm, and then the mean square error (MSE) of each parameter was calculated.
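An illustrative sketch of this metric extraction follows (a least-squares line fit via SVD; which end of the fitted line is taken as the tip is an assumption):

```python
import numpy as np

def needle_params(mask: np.ndarray):
    """Fit a line to the foreground voxels of a binary segmentation and
    derive the needle tip, length, and direction from it."""
    pts = np.argwhere(mask > 0).astype(np.float64)  # (K, 3) voxel coordinates
    center = pts.mean(axis=0)
    # principal direction via SVD = least-squares line fit through the points
    _, _, vt = np.linalg.svd(pts - center, full_matrices=False)
    direction = vt[0]
    t = (pts - center) @ direction       # scalar positions along the fitted line
    tip = center + t.max() * direction   # assumption: the max-projection end is the tip
    length = t.max() - t.min()
    return tip, length, direction
```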
Table 1 lists the four quantitative evaluation metrics of the segmentation results of Example 1 and Comparative Examples 1-5, together with the running speed and parameter counts. As can be seen from the table, compared with Comparative Examples 1-5, Example 1 shows obvious improvement in all four quantitative evaluation metrics: DSC, Etip, Elen, and Eang. In addition, the parameter count of Example 1 is only 0.87 M, and the computation speed reaches 55 frames per second (FPS), thereby achieving real-time segmentation of a biopsy needle in a 3D ultrasound image more quickly and accurately.
In order to more intuitively show the advantages of the present invention, visual effect diagrams of the segmented images corresponding to Example 1 and Comparative Examples 1-5 are provided as panels (a) to (h) in the accompanying drawings.
The above embodiment in which needle detection is used as an example sufficiently shows that the present invention facilitates improvement in detection accuracy of a biopsy needle in a 3D medical image. The method and system are based on deep learning, can be used to train an encoder-decoder structure-based segmentation model, and in particular can be used for a segmentation task for a biopsy needle region of a 3D medical image.
The above embodiment is merely an example. In addition to the biopsy needle, the method and system of the present invention may also be used to perform segmentation on 3D medical images having other target regions such as surgical instruments, organs, etc., that move over time.
Provided in an embodiment of the present invention is a temporal information enhancement-based system for 3D medical image segmentation, including: a computer-readable storage medium and a processor,
Provided in an embodiment of the present invention is a computer-readable storage medium storing computer instructions, the computer instructions being configured to cause a processor to perform the method according to any one of the above embodiments.
It can be easily understood by a person skilled in the art that the foregoing description covers only preferred embodiments of the present invention and is not intended to limit the present invention. Any modifications, equivalent replacements, improvements, and so on made within the spirit and principle of the present invention should be included in the scope of protection of the present invention.
Number | Date | Country | Kind
--- | --- | --- | ---
202311391101.1 | Oct 2023 | CN | national