Deep learning has shown great promise in medical image analysis, yet it has largely relied on supervised learning with relatively large amounts of labeled training data. Labeling medical images, especially volumetric images, can be expertise-dependent, labor-intensive, and time-consuming, which has motivated remarkable progress in label-efficient learning. In particular, to enlarge the effective training set with less human intervention, Self-Supervised Learning (SSL) approaches have demonstrated their effectiveness by first pretraining on a large amount of unlabeled data and then finetuning on small-scale labeled datasets to improve downstream task performance.
SSL-based pretraining avoids the need for human annotations by learning representations with supervision signals generated from the data itself, and has become an important label-efficient solution for volumetric medical image representation learning.
Recently, the Vision Transformer (ViT) has inspired an increasing number of SSL approaches due to its high scalability and generalization ability. For example, Masked Autoencoder (MAE) and SimMIM have achieved high performance on various natural image analysis tasks by learning transferable representations via reconstructing the original input in pixel space from a highly masked version, based on the ViT structure. However, medical images such as Computed Tomography (CT) scans are often volumetric and large in size. Existing methods such as Swin-UNETR and MAE3D therefore break the original volume scans down into smaller sub-volumes (e.g., a 96×96×96 sub-volume from a 512×512×128 CT scan) to reduce the computational cost of ViT.
However, the use of local cropping strategies for medical images poses two significant challenges. First, these strategies focus on reconstructing information from the masked local sub-volumes and neglect the global context of the patient as a whole. Global representations have been shown to play a crucial role in Self-Supervised Learning. For medical images, a global view of the volumetric data contains rich clinical context about the patient, such as the status of other organs, which is useful for further analysis and provides important clinical insights. Second, there is no guarantee that the learned representations will be stable to the input distortion caused by masking, particularly when the diverse local sub-volumes represent only a small portion of the original input; this can lead to slow convergence and low training efficiency. Pretraining solely with strong augmentation, such as a local view masked at a high ratio, is a challenging pretext task because it may distort the image structure and result in slow convergence. Instead, weak augmentation can serve as a more reliable “anchor” for the strong augmentations.
A first aspect of the present disclosure provides a system for training a neural network model, comprising: an image acquisition device configured to obtain volumetric image data; and a computing system in communication with the image acquisition device, wherein the computing system is configured to: obtain the volumetric image data from the image acquisition device; generate, based on the volumetric image data, images of a plurality of views, the plurality of views comprising a global complete view and one or more masked views; generate, by a Global-Local Masked AutoEncoder (GL-MAE) encoder system, representations corresponding to the global complete view and the one or more masked views; generate, by the GL-MAE encoder system, one or more reconstructed images corresponding to the one or more masked views; evaluate, by the GL-MAE encoder system, in a shared representation space, consistency between the representations of the global complete view and the one or more masked views; compute, by the GL-MAE encoder system, one or more losses based on the one or more reconstructed images and the results from the consistency evaluation; and update, by the GL-MAE encoder system, one or more parameters in the neural network model based on the one or more losses.
According to an implementation of the first aspect, the one or more masked views comprises a global masked view and a local masked view, and the one or more reconstructed images comprises reconstructed global masked view images and reconstructed local masked view images. The one or more processors are further configured to: evaluate first consistency between the representations of the global complete view and the global masked view; and evaluate second consistency between the representations of the global complete view and the local masked view.
According to an implementation of the first aspect, the one or more processors are further configured to: project the representations of the global complete view and the one or more masked views to the shared representation space.
According to an implementation of the first aspect, the one or more losses comprises one or more reconstruction losses based on the one or more reconstructed images and one or more consistency losses based on the evaluation results.
According to an implementation of the first aspect, the one or more processors are further configured to: encode, using a first encoder, global complete view images among the received images to generate first representations corresponding to the global complete view images; and encode, using a second encoder, masked view images among the received images to generate second representations corresponding to the masked view images, wherein the first encoder is obtained based on the second encoder.
According to an implementation of the first aspect, parameters in the first encoder are updated using a momentum factor that is dynamically computed based on learnable parameters in the second encoder.
A second aspect of the present disclosure provides a method for training a neural network model, comprising: receiving, by a Global-Local Masked AutoEncoder (GL-MAE) encoder system, images of a plurality of views, the plurality of views comprising a global complete view and one or more masked views; generating, by the GL-MAE encoder system, representations corresponding to the global complete view and the one or more masked views; generating, by the GL-MAE encoder system, one or more reconstructed images corresponding to the one or more masked views; evaluating, by the GL-MAE encoder system, in a shared representation space, consistency between the representations of the global complete view and the one or more masked views; computing, by the GL-MAE encoder system, one or more losses based on the one or more reconstructed images and the results from the consistency evaluation; and updating, by the GL-MAE encoder system, one or more parameters in the neural network model based on the one or more losses.
According to an implementation of the first aspect, the one or more masked views comprises a global masked view and a local masked view, and the one or more reconstructed images comprises reconstructed global masked view images and reconstructed local masked view images.
According to an implementation of the first aspect, evaluating, in the shared representation space, consistency between the representations of the global complete view and the one or more masked views further comprises: evaluating first consistency between the representations of the global complete view and the global masked view; and evaluating second consistency between the representations of the global complete view and the local masked view.
According to an implementation of the first aspect, the method further comprises: projecting the representations of the global complete view and the one or more masked views to the shared representation space.
According to an implementation of the first aspect, the one or more losses comprises one or more reconstruction losses based on the one or more reconstructed images and one or more consistency losses based on the evaluation results.
According to an implementation of the first aspect, weight of each loss of the one or more losses is tunable.
According to an implementation of the first aspect, generating the representations corresponding to the global complete view and the one or more masked views further comprises: encoding, using a first encoder, global complete view images among the received images to generate first representations corresponding to the global complete view images; and encoding, using a second encoder, masked view images among the received images to generate second representations corresponding to the masked view images, wherein the first encoder is obtained based on the second encoder.
According to an implementation of the first aspect, parameters in the first encoder are updated using a momentum factor that is dynamically computed based on learnable parameters in the second encoder.
According to an implementation of the first aspect, the method further comprises: receiving a plurality of volumetric medical images; obtaining the images of the plurality of views by applying at least one of cropping, scaling, and downsampling; and obtaining images of the one or more masked views by applying masks with a predefined ratio.
According to an implementation of the first aspect, the method further comprises: performing, using the neural network model, segmentation on an input image to identify one or more regions of interest.
A third aspect of the present disclosure provides a non-transitory computer-readable medium, having computer-executable instructions stored thereon, for training a neural network model, the computer-executable instructions, when executed by one or more processors, causing the one or more processors to carry out: receiving images of a plurality of views, the plurality of views comprising a global complete view and one or more masked views; generating representations corresponding to the global complete view and the one or more masked views; generating one or more reconstructed images corresponding to the one or more masked views; evaluating in a shared representation space, consistency between the representations of the global complete view and the one or more masked views; computing one or more losses based on the one or more reconstructed images and the results from the consistency evaluation; and updating one or more parameters in the neural network model based on the one or more losses.
According to an implementation of the first aspect, the one or more masked views comprises a global masked view and a local masked view, and wherein the one or more reconstructed images comprises reconstructed global masked view images and reconstructed local masked view images.
According to an implementation of the first aspect, evaluating, in the shared representation space, consistency between the representations of the global complete view and the one or more masked views further comprises: evaluating first consistency between the representations of the global complete view and the global masked view; and evaluating second consistency between the representations of the global complete view and the local masked view.
According to an implementation of the first aspect, the one or more processors further carry out: projecting the representations of the global complete view and the one or more masked views to the shared representation space.
The present systems and methods for image processing are described in detail below with reference to the attached drawing figures.
Masked autoencoder (MAE) is a promising self-supervised pretraining technique that may improve the representation learning of a neural network without human intervention. However, applying MAE directly to volumetric medical images poses two challenges: (i) a lack of global information that is crucial for understanding the clinical context of the holistic data, and (ii) no guarantee of stabilizing the representations learned from randomly masked inputs. To address these limitations, the present disclosure presents the Global-Local Masked AutoEncoder (GL-MAE), an efficient and effective self-supervised pretraining strategy. In addition to reconstructing masked local views, GL-MAE incorporates global context learning by reconstructing masked global views. Furthermore, a complete global view is integrated as an anchor to guide the reconstruction and stabilize the learning process through global-to-global consistency learning and global-to-local consistency learning. Finetuning results on multiple datasets demonstrate the superiority of this method over other state-of-the-art self-supervised algorithms, highlighting its effectiveness on versatile volumetric medical image segmentation tasks, even when annotations are scarce.
In some embodiments, GL-MAE is used for learning representations of volumetric medical data. GL-MAE reconstructs input volumes from both global and local views of the data. It also employs consistency learning to enhance the semantic correspondence by using unmasked global sub-volumes to guide the learning of multiple masked local and global views. By introducing the global information into the MAE-based SSL pretraining, GL-MAE achieved superior performance in the downstream volumetric medical image segmentation tasks compared with state-of-the-art techniques, such as MAE3D and Swin-UNETR.
In some embodiments, both global and local views of an input volumetric image are obtained by applying image transforms, such as cropping and downsampling. On the one hand, the global view covers a large region with rich information but low spatial resolution, which may miss details of small organs or tumors. On the other hand, the local views are rich in details and have high spatial resolution, but cover only a small fraction of the input volume. To leverage both sources of information, an MAE is used to simultaneously reconstruct both the masked global and local images, enabling learning from both the global context and the local details of the data. A global view of the image encompasses a more comprehensive area of the data and could help learn representations that are invariant to different views of the same object. Such view-invariant representations are beneficial in medical image analysis. To encourage the learning of view-invariant representations, global-guided consistency learning is utilized, where the representation of an unmasked global view is used to guide learning robust representations of the masked global and local views. Finally, as the global view covers most of the local views, it serves as an “anchor” for the masked local views to learn the global-to-local consistency.
Components of a network environment may communicate with each other via a network(s) 110, which may be wired, wireless, or both. By way of example, the network 110 may include one or more Wide Area Networks (WANs), one or more Local Area Networks (LANs), one or more public networks such as the Internet, and/or one or more private networks. Where the network includes a wireless telecommunications network, components such as a base station, a communications tower, or access points (as well as other components) may provide wireless connectivity.
Compatible network environments may include one or more peer-to-peer network environments—in which case a server may not be included in a network environment—and one or more client-server network environments—in which case one or more servers may be included in a network environment. In peer-to-peer network environments, functionality described herein with respect to a server(s) may be implemented on any number of client devices.
In at least one embodiment, a network environment may include one or more cloud-based network environments, a distributed computing environment, a combination thereof, etc. A cloud-based network environment may include a framework layer, a job scheduler, a resource manager, and a distributed file system implemented on one or more of servers, which may include one or more core network servers and/or edge servers. A framework layer may include a framework to support software of a software layer and/or one or more application(s) of an application layer. The software or application(s) may respectively include web-based service software or applications. In some embodiments, one or more of the client devices may use the web-based service software or applications (e.g., by accessing the service software and/or applications via one or more application programming interfaces (APIs)). The framework layer may be, but is not limited to, a type of free and open-source software web application framework such as that may use a distributed file system for large-scale data processing (e.g., “big data”).
A cloud-based network environment may provide cloud computing and/or cloud storage that carries out any combination of computing and/or data storage functions described herein (or one or more portions thereof). Any of these various functions may be distributed over multiple locations from central or core servers (e.g., of one or more data centers that may be distributed across a state, a region, a country, the globe, etc.). A cloud-based network environment may be private (e.g., limited to a single organization), may be public (e.g., available to many organizations), and/or a combination thereof (e.g., a hybrid cloud environment).
By way of example and not limitation, a client device 120 may be embodied as a Personal Computer (PC), a laptop computer, a mobile device, a smartphone, a tablet computer, or any other suitable device.
As shown in
The communication interface 170 may be configured to communicate information between the computing system 140 and other devices or systems, such as the client device(s) 120, the server(s) 130, or any other suitable device(s)/system(s) in the network environment 100 as shown in
In some examples, a display may be integrated as part of the computing system 140 or may be provided as a separate device communicatively coupled to the computing system 140. The display may include a display device such as a Liquid Crystal Display (“LCD”), a Light Emitting Diode Display (“LED”), a plasma display, or any other type of display, and provide a Graphical User Interface (“GUI”) presented on the display for user input and data depiction. In some instances, the display may be integrated as part of the communication interface 170.
In some embodiments, an ML/AI model according to exemplary embodiments of the present disclosure may be extended to any suitable type of neural network (NN) model. A NN model includes multiple layers of interconnected nodes (e.g., perceptrons, neurons, etc.) that can be trained with enormous amounts of input data to quickly solve complex problems with high accuracy. The first layer in the NN model, which receives input to the NN model, is referred to as an input layer. The last layer in the NN model, which produces outputs of the NN model, is referred to as an output layer. Any layer between the input layer and the output layer of the NN model is referred to as a hidden layer. The parameters/weights related to the NN model may be stored in a memory (e.g., a memory 160 in a computing system 140) in the form of a data structure.
At block 210, the computing system 140 obtains a dataset as input data. The dataset may be of various types, such as image data. For example, it may include volumetric medical images obtained from Computed Tomography (CT) scans, training data from data pools, augmented data from various sources, and more.
At block 220, the computing system 140 trains a NN model using the input dataset. Various techniques may be employed to train the NN model. As an illustrative example, without limiting the scope of the present disclosure, the computing system 140 may perform one or more of the following operations to train the NN model.
First, the computing system 140 initializes the NN model with random and/or default weights and biases. The weights and/or biases are parameters that may be adjusted during training. Second, the computing system 140 passes the input dataset through the model to obtain predictions. This step may be referred to as a forward pass, which involves applying the current model parameters (e.g., weights and/or biases) to the input dataset. Third, as depicted in block 230, the computing system 140 computes one or more losses. For example, the computing system 140 applies one or more loss functions to quantify the difference between predicted and actual/target values, for example, by comparing the model's predictions to the actual/target values in the dataset. The computing system 140 backpropagates the computed loss(es) to update one or more parameters in the model. For example, the computing system 140 calculates the gradients of the loss with respect to the model parameters, thereby determining the contribution of each or some of the parameters to the error. With that, the computing system 140 adjusts one or more parameters in the model based on the computed gradients. In some instances, the computing system 140 applies suitable optimization algorithms (e.g., gradient descent) to adjust the one or more parameters in the model, aiming to minimize the loss and improve the model's performance. It will be appreciated by one skilled in the art that other techniques may be employed to utilize the computed losses for updating one or more parameters in the model. The computing system 140 performs multiple iterations or epochs to train the model. Each iteration may process a new or existing batch of data (e.g., an input dataset) and update the model parameters, gradually improving the model's ability to make accurate predictions.
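By way of example and not limitation, the following is a minimal sketch of such a training iteration (forward pass, loss computation, backpropagation, and parameter update). The function, model, and data-loader names are illustrative assumptions and do not represent the disclosed implementation.

```python
import torch

def train_one_epoch(model, loader, loss_fn, optimizer, device="cuda"):
    """Illustrative training loop: forward pass, loss, backpropagation, update."""
    model.train()
    for inputs, targets in loader:                   # one batch of the input dataset
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()                        # clear gradients from the previous step
        predictions = model(inputs)                  # forward pass with current parameters
        loss = loss_fn(predictions, targets)         # difference between predictions and targets
        loss.backward()                              # backpropagate to compute gradients
        optimizer.step()                             # adjust parameters (e.g., gradient descent)
```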
At block 240, the computing system 140 determines whether the trained model is converged. If not, the processing system may continue to another iteration (e.g., to perform any of blocks 210, 220, 230, and/or 240). For example, the computing system 140 may stop training when the model reaches satisfactory performance or after a predefined number of iterations or epochs.
In some examples, convergence may be determined by observing when the loss(es) of the model, measured by the loss function(s), stops decreasing significantly or starts increasing. For example, if the loss reaches a plateau or begins to rise, it suggests that the model may have converged or is overfitting. Convergence criteria may involve setting a threshold for the loss value or employing a patience parameter, which may indicate the number of iterations/epochs with no improvement before stopping the training.
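By way of example and not limitation, the sketch below illustrates one such convergence check based on a patience parameter; the threshold values and function name are illustrative assumptions.

```python
def should_stop(val_losses, patience=10, min_delta=1e-4):
    """Return True when the loss has plateaued or risen for `patience` epochs."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])        # best loss seen before the patience window
    best_recent = min(val_losses[-patience:])        # best loss within the patience window
    return best_recent > best_before - min_delta     # no significant improvement -> stop training
```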
At block 250, the computing system 140 outputs a model, based on determining the convergence of the model.
As shown in
The consistency learning module 350 is configured to learn global-to-global consistency 360 and global-to-local consistency 370 based on global guidance. In an embodiment, the global-guided consistency learning is introduced between the representations of unmasked global views (e.g., the global complete views 320) and the reconstructed masked views, which encourages invariant and robust feature learning.
In this configuration, in addition to reconstructing masked local views, GL-MAE 300 incorporates global context learning by reconstructing masked global views. Additionally, GL-MAE 300 integrates the complete global view 320 as an anchor to guide the reconstruction and stabilize the learning process through global-to-global consistency learning 360 and global-to-local consistency learning 370. Finetuning results on multiple datasets demonstrate the superiority of this method over other state-of-the-art self-supervised algorithms, highlighting its effectiveness on versatile volumetric medical image segmentation tasks, even when annotations are scarce.
As depicted in
In this example, the inputs to GL-MAE are various views obtained from volumetric medical images 402. The various views include a global view 410, a masked global view 412, and a masked local view 414. In some embodiments, the input image data (e.g., the volumetric medical images 402) may be transformed, such as through cropping, scaling, rotating, etc., and/or augmented. For example, the global view 410 and the masked global view 412 may be obtained based on cropped image 404, while the masked local view 414 may be obtained based on cropped images 406. In one embodiment, a sub-volume of the volumetric medical images 402 may be obtained through downsampling.
Masked views, such as the masked global view 412 and the masked local view 414, are generated by applying random or predefined masks (e.g., 416) to individual images. In some embodiments, each input image, such as a global view image or a local view image, may first be segmented into multiple patches of a predefined size, for example with a predefined number of pixels. Subsequently, masks may be applied to one or more patches in the respective image. Masks may take various forms depending on the application. For example, a mask may be defined as a predetermined number of pixels with default or fixed values, thereby overriding the visual information from the original image.
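By way of example and not limitation, a minimal sketch of this patch-and-mask step for a volumetric image is shown below. The patch size, mask ratio, and function names are illustrative assumptions, and the patch dimensions are assumed to divide the volume evenly.

```python
import torch

def patchify_and_mask(volume, patch_size=16, mask_ratio=0.75):
    """Split a (C, H, W, D) volume into cubic patches and randomly mask a ratio of them."""
    c, h, w, d = volume.shape
    p = patch_size  # assumes h, w, d are each divisible by p
    patches = volume.reshape(c, h // p, p, w // p, p, d // p, p)
    patches = patches.permute(1, 3, 5, 0, 2, 4, 6).reshape(-1, c * p ** 3)  # (num_patches, C*p^3)
    num_patches = patches.shape[0]
    num_masked = int(num_patches * mask_ratio)
    perm = torch.randperm(num_patches)               # random placement of the masks
    masked_idx, visible_idx = perm[:num_masked], perm[num_masked:]
    return patches[visible_idx], visible_idx, masked_idx
```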
GL-MAE encodes the global view 410, masked global view 412, and masked local view 414 to generate corresponding representations (e.g., 410a, 412a, and 414a, respectively). Some or all of the patches from the input views may be encoded by GL-MAE. In some embodiments, only the patches to which no masks have been applied (i.e., the visible patches) are encoded. The representations 410a, 412a, and 414a may be projected to a shared space (e.g., a shared representation space), enabling GL-MAE to learn the consistency between the representations.
Based on the representations 410a, 412a, and 414a, GL-MAE evaluates global-guided consistency 420 (indicated by dashed arrows 422 and 424) and performs image reconstructions 430 (indicated by dashed arrows 432 and 434). For example, the consistency learning module 350 in the GL-MAE 300 as shown in
The outputs from the GL-MAE include results from global-guided consistency learning 420 and reconstructed images (e.g., reconstructed global images 442 and/or reconstructed local images 444) from the image reconstructions 430. In some embodiments, the results from the global-guided consistency learning 420 may include one or more losses computed based on the global-to-global consistency learning 360 and/or the global-to-local consistency learning 370.
As an overview of the framework 500, the initial step involves generating various views based on the input image data (e.g., volumetric medical images 502). One or more image transformations are performed on the images 502, such as cropping and downsampling. For example, cropped images 504 are used for generating global complete views (or global views in dashed box 510) and global masked views (in dashed box 520), and cropped images 506 are used for generating local masked views (in dashed box 530).
The next step is to process the various views through separate paths by using GL-MAE. The various views are processed through different paths, which may be performed in any suitable order, for example, in parallel, in series, or a combination thereof.
The global views 510 are processed by an encoder 512 to generate corresponding representations 510a (e.g., in dashed box 514). The global masked views 520 are processed by an encoder 522 to generate corresponding representations 520a (e.g., in dashed box 524). Additionally, a decoder 526 is utilized to generate reconstructed global views 520b (e.g., in dashed box 528). The local masked views 530 are processed by an encoder 532 to generate corresponding representations 530a (e.g., in dashed box 534). Additionally, a decoder 536 is utilized to generate reconstructed local views 530b (e.g., in dashed box 538). The encoders 522 and 532 may be substantially the same, and the decoders 526 and 536 may be substantially the same.
GL-MAE projects the representations 510a, 520a, and 530a to a shared space to evaluate global-guided consistency (in dashed box 550) between the representations 510a, 520a, and 530a.
The aforementioned processes will be further elaborated upon with exemplary equations and implementations hereinafter. It will be appreciated by one skilled in the art that these equations and implementations are merely examples and do not limit the scope of the present disclosure.
MAE with Global and Local Reconstruction
In this example, the volumetric medical images 502 include a volume (x) of images obtained from an unlabeled CT dataset (D). The unlabeled CT dataset may include raw images from CT scanning, augmented images, and/or other types of image data. The dataset (D) is represented by D={x1, . . . , xN}, where N is the total number of volumes. The N volumes may be used as batches of datasets for training a NN model implementing GL-MAE. As shown in
In an example, for global views (e.g., in 510 and 520), images were scaled by a random ratio from the range of [0.5, 1], cropped, and then resized into [160, 160, 160] pixels in the height, width, and depth dimensions. For local views (e.g., in 530), images were scaled by a random ratio from the range of [0.25, 0.5], cropped, and resized into [96, 96, 96]. Finally, all image intensities were normalized from [−1000, 1000] to [0, 1]. p and q were set to 2 and 8, respectively.
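By way of example and not limitation, the following sketch illustrates the view-generation transforms described above (random-scale crop, resize to a fixed cubic size, and intensity normalization). The sampling strategy and function names are illustrative assumptions rather than the disclosed implementation.

```python
import random
import torch
import torch.nn.functional as F

def random_view(volume, scale_range, out_size):
    """Crop a random sub-volume at a random scale and resize it to a cubic out_size."""
    _, h, w, d = volume.shape                        # volume: (C, H, W, D)
    s = random.uniform(*scale_range)                 # e.g., [0.5, 1] for global, [0.25, 0.5] for local
    ch, cw, cd = int(h * s), int(w * s), int(d * s)
    x0 = random.randint(0, h - ch)
    y0 = random.randint(0, w - cw)
    z0 = random.randint(0, d - cd)
    crop = volume[:, x0:x0 + ch, y0:y0 + cw, z0:z0 + cd]
    crop = F.interpolate(crop[None].float(), size=(out_size,) * 3,
                         mode="trilinear", align_corners=False)[0]
    return crop

def normalize_ct(volume, lo=-1000.0, hi=1000.0):
    """Clip CT intensities to [lo, hi] and rescale them to [0, 1]."""
    return (volume.clamp(lo, hi) - lo) / (hi - lo)
```

Under these assumptions, a global view would correspond to random_view(volume, (0.5, 1.0), 160) and a local view to random_view(volume, (0.25, 0.5), 96), each followed by normalize_ct.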
To obtain the masked volumes (e.g., in 520 and 530) for local and global reconstruction, all local views 530 in the set Vl and all global views 510 in the set Vg are individually tokenized into patches, and a volume masking transform tm is then applied with a predefined mask ratio. The masked patches serve as invisible patches (562 in the legend 560), while the remaining patches serve as visible patches (564 in the legend 560) for a learnable encoder s(⋅). By applying tm to each vgi and vli respectively, the visible patches form sets of masked local and global sub-volumes, represented by {tilde over (V)}l={{tilde over (v)}li}, i∈[1, q] and {tilde over (V)}g={{tilde over (v)}gi}, i∈[1, p], respectively.
The encoder s(⋅) (e.g., the encoder 522/532) is used to map the input volumes (in 520/530) to a representation space. A position embedding is added to each visible patch and then combined with a class token for generating the volume representation (e.g., 520a or 530a). The visible patches of the local and global views vli and vgi are fed into s(⋅) to generate {tilde over (Z)}l and {tilde over (Z)}g by:
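The displayed equation is not reproduced above; by way of example and not limitation, one plausible form consistent with the surrounding description is:

\tilde{Z}_g = s(\tilde{V}_g), \qquad \tilde{Z}_l = s(\tilde{V}_l),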
where Z consists of two-part embeddings, including the output of the class token and patch tokens, denoted as:
where {tilde over (Z)}gcls and {tilde over (Z)}lcls are the outputs of the class tokens, while {tilde over (Z)}gp and {tilde over (Z)}lp are the outputs of the visible patch tokens. In this example, position embeddings are interpolated before being added to the visible tokens. {tilde over (Z)}gp and {tilde over (Z)}lp are in low resolution, which is beneficial for downstream tasks involving dense prediction. This process is repeated q times for the local sub-volumes and p times for the global sub-volumes, and the resulting patch representations are then used for reconstruction.
A momentum encoder m(⋅) (e.g., the encoder 512) generates a mean representation Zc (510a in dashed box 514) of the unmasked global view vg as:
The parameters of the momentum encoder 512 are updated using a momentum factor that is dynamically computed based on the parameters of the learnable encoder 522/532, as follows:
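The update equation is not reproduced above; by way of example and not limitation, a standard exponential-moving-average update consistent with this description (applied to the encoder parameters, and stated here as an assumption) is:

m^{(t)} = \mu \, m^{(t-1)} + (1 - \mu)\, s^{(t)},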
where m(t)(⋅) and s(t)(⋅) represent the momentum encoder and the encoder at the tth iteration, respectively, and μ is a momentum coefficient updated with a cosine scheduler.
A decoder (e.g., the decoder 526/536) is used to reconstruct the invisible patches from the representation of the visible patches. The input of the decoder includes the encoded visible patches and mask tokens. As shown in
The decoder output (yl, yg) is reshaped to form reconstructed volumes (e.g., 520b in dashed box 528 and 530b in dashed box 538). A reconstruction loss is computed using one or more loss functions, and any suitable loss function may be used. For example, the Mean Square Error may be used as the reconstruction loss function and applied to the local and global masked sub-volumes.
For local masked sub-volumes {tilde over (V)}l, the local reconstruction loss is defined as:
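The loss equation is not reproduced above; by way of example and not limitation, a mean-squared-error form consistent with the description below is:

\mathcal{L}_{local} = \frac{1}{P \cdot H \cdot W \cdot D} \sum_{P} \sum_{h, w, d} \big( v_l^{\,h,w,d} - y_l^{\,h,w,d} \big)^2,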
where h, w, d denote voxel indices in the representations, P represents the number of patches for local views, and H, W, D refer to the height, width, and depth of each sub-volume, respectively. The reconstruction loss is computed as the sum of squared differences between the reconstruction target (vlh,w,d) and the reconstructed representations (ylh,w,d) by pixel values.
For global masked sub-volumes {tilde over (V)}g, the global reconstruction loss is defined as:
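Again by way of example and not limitation, an analogous mean-squared-error form for the global masked sub-volumes is:

\mathcal{L}_{global} = \frac{1}{P \cdot H \cdot W \cdot D} \sum_{P} \sum_{h, w, d} \big( v_g^{\,h,w,d} - y_g^{\,h,w,d} \big)^2,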
where P represents the number of patches for global views. Since the global view has a larger input size than local views, position embedding may be interpolated before being added to the visible tokens. This process enables the reconstruction of the masked volumes at both the local and global views, facilitating the learning of rich information from both the local details and global information.
Global-guided consistency learning (e.g., block 350 in
To perform consistency learning, the representations Zc of the complete global views, the masked global views {tilde over (Z)}gcls, and the masked local views {tilde over (Z)}lcls are first projected into a shared space. In some embodiments, the projection is performed by projection layers Ps(⋅) (denoted as 544/546 in
where Ec represents the embedding of the complete global views, {tilde over (E)}g represents the embedding of the masked global views, and {tilde over (E)}l represents the embedding of the masked local views. The dimension of Ec, {tilde over (E)}g, and {tilde over (E)}l is represented by K, which is a predefined value, e.g., 512. Each type of embedding E in the shared space is normalized before computing the loss function. The embedding Ei is normalized as follows:
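The normalization equation is not reproduced above; by way of example and not limitation, one plausible temperature-scaled softmax form is:

P(E)_i = \frac{\exp(E_i / t)}{\sum_{k=1}^{K} \exp(E_k / t)},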
where t represents the temperature used to control the entropy of the distribution. The training of GL-MAE may aim to align the distribution of the embedding representation of the global complete view Ec with that of the masked global view {tilde over (E)}g, and the distribution of Ec with that of the masked local view {tilde over (E)}l, by minimizing:
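The displayed objective is not reproduced above; by way of example and not limitation, one plausible form is:

H\!\big(E_c, \tilde{E}_g\big) + H\!\big(E_c, \tilde{E}_l\big),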
where H(x, y)=−x log y is the cross-entropy loss.
For global unmasked sub-volumes Vg and global masked sub-volumes {tilde over (V)}g, a global-to-global consistency loss function is formulated as:
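The loss equation is not reproduced above; by way of example and not limitation, one plausible form consistent with the description below is:

\mathcal{L}_{g2g} = \frac{1}{|V_g| \cdot |\tilde{V}_g|} \sum_{v_g \in V_g} \sum_{\tilde{v}_g \in \tilde{V}_g} H\!\big(E_c, \tilde{E}_g\big),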
where the operator |⋅| computes the number of volumes in the respective set. {tilde over (E)}g learns consistency guided by the global context embedding Ec.
For global unmasked sub-volumes Vg and local masked volumes Vl, a global-to-local consistency loss function is formulated as:
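Again by way of example and not limitation, one plausible form of this loss is:

\mathcal{L}_{g2l} = \frac{1}{|V_g| \cdot |V_l|} \sum_{v_g \in V_g} \sum_{v_l \in V_l} H\!\big(E_c, \tilde{E}_l\big),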
The local embedding {tilde over (E)}l learns consistency guided by the global context embedding Ec during the pretraining process.
The overall loss function is represented by:
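The combined loss is not reproduced above; by way of example and not limitation, one plausible combination of the four terms (with the local reconstruction term left unweighted, which is an assumption) is:

\mathcal{L} = \mathcal{L}_{local} + \beta_1 \mathcal{L}_{global} + \beta_2 \mathcal{L}_{g2g} + \beta_3 \mathcal{L}_{g2l},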
where β1, β2, and β3 are hyper-parameters, which are used to balance the relative contributions of these four loss terms. The values of β1, β2, and/or β3 may be tuned according to various usage scenarios. In some embodiments, β1, β2, and β3 are set to 1.0 in experiments empirically.
At block 610, the computing system receives images of various views based on input image data. The various views include a global complete view, and one or more masked views. In an embodiment, the one or more masked views includes a global masked view and a local masked view.
At block 620, the computing system generates representations corresponding to the global complete view and the one or more masked views.
At block 630, the computing system generates one or more reconstructed images corresponding to the one or more masked views.
At block 640, the computing system evaluates, in a shared representation space, consistency between the representations of the global complete view and the one or more masked views. For example, the computing system projects the representations of the global complete view and the one or more masked views to the shared representation space for evaluation. In some embodiments, the computing system evaluates first consistency between the representations of the global complete view and the global masked view, and second consistency between the representations of the global complete view and the local masked view.
At block 650, the computing system computes one or more losses based on the one or more reconstructed images and the results from the consistency evaluation.
When blocks 610-650 are performed during training of a NN model implementing GL-MAE, the computing system updates one or more parameters in the NN model based on the one or more losses (e.g., as indicated in block 660).
During inference, the computing system generates outputs based on the one or more reconstructed images and/or the evaluation results, (e.g., as indicated in dashed block 670).
The trained NN model with GL-MAE may be applied to various tasks, including image segmentation and disease classification. For image segmentation, the NN model with GL-MAE effectively segments CT scans or Magnetic Resonance Imaging (MRI) scans by identifying areas of interest such as diseased organs and tumor regions. In disease classification, the NN model with GL-MAE provides accurate diagnostic results based on CT scans or MRI scans, including the diagnosis of Covid-19.
In some embodiments, the computing system is communicatively connected to one or more image acquisition devices. The image acquisition device(s) is configured to obtain volumetric image data and transmit the data to the computing system for data processing, for example, through the communication interface 170. The image acquisition devices may include various cameras, a CT scanner, or an MRI scanner.
The following provides exemplary pseudocodes for implementing one or more processes discussed above. For example, Algorithm 1 may be used for generating various views as input to GL-MAE.
Algorithm 2 may be used for implementing GL-MAE in pretraining.
(Algorithm 2 is only partially reproduced here. The recoverable steps are as follows: the views [g_xs, l_xs, g_volume, l_volume] are computed by the view-generation equations above; the features [m_g_features, s_g_feature, s_l_feature] are computed by the momentum encoder and the learnable encoder; a total loss is formed as the sum of two terms, one of which is weighted by a coefficient α; and the total loss is backpropagated via a .backward( ) call.)
Illustrative experimental results are described in detail below to demonstrate efficacy and advantages of the present disclosure. Additional details and advantages relating to exemplary embodiments of the present disclosure are discussed by Zhuang et al. in “ADVANCING VOLUMETRIC MEDICAL IMAGE SEGMENTATION VIA GLOBAL-LOCAL MASKED AUTOENCODER,” (available at arXiv: 2306.08913), which is hereby incorporated by reference in its entirety.
The SSL pretraining experiments were carried out on the Beyond the Cranial Vault (BTCV) abdomen challenge dataset (Landman et al. 2015) and the TCIA Covid19 dataset (An et al. 2020). For downstream tasks, experiments were mainly conducted on the BTCV dataset to follow previous work (Chen et al. 2023; Tang et al. 2022). To assess the model's generalization on Computed Tomography (CT) datasets, its effectiveness was also evaluated on MM-WHS (Zhuang 2018), Medical Segmentation Decathlon (MSD) Task 09 Spleen, and The COVID-19-20 Lung CT Lesion Segmentation Challenge dataset (Covid-19-20 dataset) (Roth et al. 2022). The model was further transferred to Brain Tumor Segmentation (BraTS) (Simpson et al. 2019) for assessing its cross-modality generalization ability. All datasets used were collected from open sources and can be obtained via the cited papers. Dice Score (%) was used as the evaluation metric following (Chen et al. 2023; Tang et al. 2022).
ViT is used as a transformer-based backbone. Both ViT-Tiny (ViT-T) and ViT-Base (ViT-B) were used for the experiments. The pretraining phase was conducted for 1600 epochs for ViT-T and ViT-B without specification, with an initial learning rate of 1e-2, employing AdamW (Kingma and Ba 2014) as an optimizer and a batch size of 256 on four 3090Ti for 3 days. For global views (e.g., vg in 510 and 520 in
UNETR (Hatamizadeh et al. 2022) is adopted as the segmentation framework, and finetuning is performed on the BTCV dataset. For linear evaluation, which freezes the encoder parameters and finetunes only the segmentation decoder head, the model was finetuned for 3000 epochs using an initial learning rate of 1e-2 and trained on a single 3090Ti GPU with a batch size of 4. For end-to-end segmentation, the model was trained on four 3090Ti GPUs for 3000 epochs, with a batch size of 4, using an initial learning rate of 3e-4.
End-to-end and linear evaluation. To assess the effectiveness of this method, following (Tang et al. 2022; Chen et al. 2023; He et al. 2023), end-to-end segmentation experiments were conducted on three datasets.
As shown in Table 1 (700), GL-MAE (10th row) outperformed the Supervised baseline (3rd row) by a large margin (82.33% vs 79.61%, 95.72% vs 94.20%, and 88.88% vs 83.85%) with the full training dataset, indicating that GL-MAE benefits the model from the unlabeled dataset. MAE3D is a recently proposed competitive SSL strategy in medical image analysis that has shown superiority (9th row), particularly on dense prediction tasks such as segmentation. GL-MAE (10th row) outperformed MAE3D (9th row), which further confirms its effectiveness. SegResNet and 3D U-Net are supervised models with competitive performance. Swin-UNETR uses the 3D Swin-Transformer as the backbone, while GLSV was designed for cardiac CT images. All SSL methods (4th-10th rows) achieved better performance than the supervised baseline, while GL-MAE (10th row) achieved the best performance over the three datasets even when using only 25% and 50% of the annotations of the training datasets. This indicates the superior generalization ability of GL-MAE.
GL-MAE showed consistent performance when using a more lightweight transformer, ViT-T, which requires less computational resources and can be trained and inferred faster.
As shown in Table 2 (800), GL-MAE outperformed other methods by a large margin in terms of average Dice score, Normalized surface dice, and Hausdorff distance metric in both linear and end-to-end segmentation evaluation settings. This demonstrates the versatility of GL-MAE when adapting to a lightweight backbone, which is necessary in certain situations such as surgical robots.
Generalization on unseen datasets. MM-WHS is a small-scale cardiac organ dataset that was not involved in the pretraining. The experimental findings demonstrated that GL-MAE significantly improved the average Dice score compared with MAE3D, indicating its strong generalization capabilities. Furthermore, there were substantial enhancements in the performance on the aorta, LV, and RV, which share analogous structural features with the training data. This suggests that GL-MAE can exploit the structural consistency between organs across varying datasets and generalize effectively to novel, unseen datasets.
COVID-19 lesion segmentation. CT scans are commonly used in diagnosing COVID-19, yet there is a shortage of annotated data. GL-MAE has been shown to improve COVID-19 lesion segmentation performance, indicating that it can capture valuable knowledge from unlabeled CT datasets to improve disease diagnosis and demonstrating the versatility of the proposed method in practical clinical settings.
GL-MAE may be applied to other downstream tasks for other applications, such as processing Magnetic Resonance Imaging (MRI) datasets.
Ablation study. The framework 500 as demonstrated in
ViT-B and ViT-T were both considered as the backbone of the framework 500. The first row represents the supervised baseline without any of the related strategies. In the 2nd row, instead of reconstructing a single local view in each iteration, GL-MAE reconstructs q local views per iteration, thereby learning richer representations and exhibiting better performance in most cases. In the 3rd row, reconstruction of the global patches was added, further improving performance since the model can learn the global context as well as the local details. In the 4th row, global-to-global consistency was introduced to learn representations that are more robust to the distortion caused by masking and to capture the critical information, leading to further performance improvement. The last row adds global-to-local consistency, which aims to capture the relationship between different parts of the images and their main semantics. The final objective loss function achieved the best performance across several datasets with various settings, demonstrating the importance of global information for volumetric data.
Label-efficient finetuning. An evaluation of GL-MAE was conducted under a semi-supervised learning scheme with ViT-T as the backbone. The experimental results suggested that GL-MAE can improve the dice score even when the amount of annotated training data is limited. Transformer-based models are prone to over-fitting the limited labeled data due to their dense connections, while GL-MAE can reduce the necessity for labeled data and effectively enhance performance even in low-annotated learning scenarios.
Convergence comparison. GL-MAE exhibits faster convergence and superior performance compared to MAE3D. This suggests that pretraining with global complete views and masked views can help stabilize the training process, resulting in faster convergence and more powerful representation. By utilizing global complete views as an “anchor” for local and global masked views, the model establishes a stronger relationship through global-guided consistency. The integration of global context information and different scale reconstructions enhances the overall performance of GL-MAE and contributes to its superior results.
Analysis of mask ratio. The impact of the mask ratio for the reconstruction of GL-MAE was studied. The findings showed that the mask ratio may be adjusted to balance between preserving important information and providing enough diversity for the model to learn robust representations.
Scaling to larger data. Experimental results demonstrated the ability of GL-MAE to scale to larger amounts of data and improve performance on downstream tasks.
Visualization. Studies showed that GL-MAE can improve the completeness of segmentation results (e.g., patches in the reconstructed global and local images). For example, some experiments showed that the results of segmentation using GL-MAE were better than those using MAE3D and Swin-UNETR, particularly in terms of completeness for larger organs.
It is noted that the techniques described herein may be embodied in executable instructions stored in a non-transitory computer readable medium for use by or in connection with a processor-based instruction execution machine, system, apparatus, or device. It will be appreciated by those skilled in the art that, for some embodiments, various types of computer-readable media can be included for storing data. As used herein, a “computer-readable medium” includes one or more of any suitable media for storing the executable instructions of a computer program such that the instruction execution machine, system, apparatus, or device may read (or fetch) the instructions from the computer-readable medium and execute the instructions for carrying out the described embodiments. Suitable storage formats include one or more of an electronic, magnetic, optical, and electromagnetic format. A non-exhaustive list of conventional exemplary computer-readable media includes: a portable computer diskette; a random-access memory (RAM); a read-only memory (ROM); an erasable programmable read only memory (EPROM); a flash memory device; and optical storage devices, including a portable compact disc (CD), a portable digital video disc (DVD), and the like.
It should be understood that the arrangement of components illustrated in the attached Figures are for illustrative purposes and that other arrangements are possible. For example, one or more of the elements described herein may be realized, in whole or in part, as an electronic hardware component. The elements may be implemented in software, hardware, or a combination of software and hardware. Moreover, some or all of these other elements may be combined, some may be omitted altogether, and additional components may be added while still achieving the functionality described herein. Thus, the subject matter described herein may be embodied in many different variations, and all such variations are contemplated to be within the scope of the claims.
To facilitate an understanding of the subject matter described herein, many aspects are described in terms of sequences of actions. It will be recognized by those skilled in the art that the various actions may be performed by specialized circuits or circuitry, by program instructions being executed by one or more processors, or by a combination of both. The description herein of any sequence of actions is not intended to imply that the specific order described for performing that sequence must be followed. All methods described herein may be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context.
The use of the terms “a” and “an” and “the” and similar references in the context of describing the subject matter (particularly in the context of the following claims) are to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context. The use of the term “at least one” followed by a list of one or more items (for example, “at least one of A and B”) is to be construed to mean one item selected from the listed items (A or B) or any combination of two or more of the listed items (A and B), unless otherwise indicated herein or clearly contradicted by context. Furthermore, the foregoing description is for the purpose of illustration only, and not for the purpose of limitation, as the scope of protection sought is defined by the claims as set forth hereinafter together with any equivalents thereof. The use of any and all examples, or exemplary language (e.g., “such as”) provided herein, is intended merely to better illustrate the subject matter and does not pose a limitation on the scope of the subject matter unless otherwise claimed. The use of the term “based on” and other like phrases indicating a condition for bringing about a result, both in the claims and in the written description, is not intended to foreclose any other conditions that bring about that result. No language in the specification should be construed as indicating any non-claimed element as essential to the practice of the invention as claimed.
This application claims the benefit of U.S. Provisional Application No. 63/512,905, filed Jul. 10, 2023, the entirety of which is incorporated herein by reference.