ADVANCING VOLUMETRIC MEDICAL IMAGE SEGMENTATION VIA GLOBAL-LOCAL MASKED AUTOENCODER

Information

  • Patent Application
  • Publication Number: 20250022580
  • Date Filed: June 12, 2024
  • Date Published: January 16, 2025
Abstract
A system is provided for training a neural network model. The system comprises one or more image acquisition devices configured to obtain volumetric image data; and one or more processors configured to: generate images of a plurality of views, which comprise a global complete view and one or more masked views; generate, by a Global-Local Masked AutoEncoder (GL-MAE) encoder system, representations corresponding to the global complete view and the one or more masked views; generate one or more reconstructed images corresponding to the one or more masked views; evaluate, in a shared representation space, consistency between the representations of the global complete view and the one or more masked views; compute one or more losses based on the one or more reconstructed images and the results of the consistency evaluation; and update one or more parameters in the neural network model based on the one or more losses.
Description
BACKGROUND

Deep learning has shown great promise in medical image analysis, yet it typically relies on supervised learning with a relatively large amount of labeled training data. However, labeling medical images, especially volumetric images, can be expertise-dependent, labor-intensive, and time-consuming, which has motivated remarkable progress in label-efficient learning. In particular, to enlarge the training set while requiring less human intervention, Self-Supervised Learning (SSL) approaches have demonstrated their effectiveness by first pretraining on a large amount of unlabeled data and then finetuning on small-scale labeled datasets to improve task performance on those datasets.


SSL-based pretraining removes the need for human annotations by learning representations with supervision signals generated from the data itself, and has become an important label-efficient solution for volumetric medical image representation learning.


Recently, the Vision Transformer (ViT) has inspired an increasing number of SSL approaches due to its high scalability and generalization ability. For example, Masked Autoencoder (MAE) and SimMIM have achieved high performance in various natural image analysis tasks by learning transferable representations via reconstructing the original input at the pixel level from its highly masked version based on the ViT structure. However, medical images such as Computed Tomography (CT) scans are often volumetric and large in size. Existing methods like Swin-UNETR and MAE3D have proposed breaking down the original volume scans into smaller sub-volumes (e.g., a 96×96×96 sub-volume from a 512×512×128 CT scan) to reduce the computation cost of ViT.


However, the use of local cropping strategies for medical images poses two significant challenges. Firstly, these strategies focus on reconstructing information from the masked local sub-volumes, neglecting the global context information of the patient as a whole. Global representation has been shown to play a crucial role in Self-Supervised Learning. For medical images, a global view of the volumetric data contains rich clinical contexts of the patient, such as the status of other organs, which is useful for further analysis and provides important clinical insights. Secondly, there is no guarantee that the learned representations will be stable to the input distortion caused by masking, particularly when the diverse local sub-volumes only represent a small portion of the original input. This can lead to slow convergence and low efficiency during training. Pretraining solely with strong augmentation, such as a local view masked with a high ratio, is considered a challenging pretext task, as it may distort the image structure and result in slow convergence. Instead, weak augmentation can be seen as a more reliable “anchor” to the strong augmentations.


SUMMARY

A first aspect of the present disclosure provides a system for training a neural network model, comprising: an image acquisition device configured to obtain volumetric image data; and a computing system in communication with the image acquisition device, wherein the computing system is configured to: obtain the volumetric image data from the image acquisition device; generate, based on the volumetric image data, images of a plurality of views, the plurality of views comprising a global complete view and one or more masked views; generate, by a Global-Local Masked AutoEncoder (GL-MAE) encoder system, representations corresponding to the global complete view and the one or more masked views; generate, by the GL-MAE encoder system, one or more reconstructed images corresponding to the one or more masked views; evaluate, by the GL-MAE encoder system, in a shared representation space, consistency between the representations of the global complete view and the one or more masked views; compute, by the GL-MAE encoder system, one or more losses based on the one or more reconstructed images and the results from the consistency evaluation; and update, by the GL-MAE encoder system, one or more parameters in the neural network model based on the one or more losses.


According to an implementation of the first aspect, the one or more masked views comprises a global masked view and a local masked view, and the one or more reconstructed images comprises reconstructed global masked view images and reconstructed local masked view images. The one or more processors are further configured to: evaluate first consistency between the representations of the global complete view and the global masked view; and evaluate second consistency between the representations of the global complete view and the local masked view.


According to an implementation of the first aspect, the one or more processors are further configured to: project the representations of the global complete view and the one or more masked views to the shared representation space.


According to an implementation of the first aspect, the one or more losses comprises one or more reconstruction losses based on the one or more reconstructed images and one or more consistency losses based on the evaluation results.


According to an implementation of the first aspect, the one or more processors are further configured to: encode, using a first encoder, global complete view images among the received images to generate first representations corresponding to the global complete view images; and encode, using a second encoder, masked view images among the received images to generate second representations corresponding to the masked view images, wherein the first encoder is obtained based on the second encoder.


According to an implementation of the first aspect, parameters in the first encoder are updated using a momentum factor that is dynamically computed based on learnable parameters in the second encoder.


A second aspect of the present disclosure provides a method for training a neural network model, comprising: receiving, by a Global-Local Masked AutoEncoder (GL-MAE) encoder system, images of a plurality of views, the plurality of views comprising a global complete view and one or more masked views; generating, by the GL-MAE encoder system, representations corresponding to the global complete view and the one or more masked views; generating, by the GL-MAE encoder system, one or more reconstructed images corresponding to the one or more masked views; evaluating, by the GL-MAE encoder system, in a shared representation space, consistency between the representations of the global complete view and the one or more masked views; computing, by the GL-MAE encoder system, one or more losses based on the one or more reconstructed images and the results from the consistency evaluation; and updating, by the GL-MAE encoder system, one or more parameters in the neural network model based on the one or more losses.


According to an implementation of the first aspect, the one or more masked views comprises a global masked view and a local masked view, and the one or more reconstructed images comprises reconstructed global masked view images and reconstructed local masked view images.


According to an implementation of the first aspect, evaluating, in the shared representation space, consistency between the representations of the global complete view and the one or more masked views further comprises: evaluating first consistency between the representations of the global complete view and the global masked view; and evaluating second consistency between the representations of the global complete view and the local masked view.


According to an implementation of the first aspect, the method further comprises: projecting the representations of the global complete view and the one or more masked views to the shared representation space.


According to an implementation of the first aspect, the one or more losses comprises one or more reconstruction losses based on the one or more reconstructed images and one or more consistency losses based on the evaluation results.


According to an implementation of the first aspect, weight of each loss of the one or more losses is tunable.


According to an implementation of the first aspect, generating the representations corresponding to the global complete view and the one or more masked views further comprises: encoding, using a first encoder, global complete view images among the received images to generate first representations corresponding to the global complete view images; and encoding, using a second encoder, masked view images among the received images to generate second representations corresponding to the masked view images, wherein the first encoder is obtained based on the second encoder.


According to an implementation of the first aspect, parameters in the first encoder are updated using a momentum factor that is dynamically computed based on learnable parameters in the second encoder.


According to an implementation of the first aspect, the method further comprises: receiving a plurality of volumetric medical images; obtaining the images of the plurality of views by applying at least one of cropping, scaling, and downsampling; and obtaining images of the one or more masked views by applying masks with a predefined ratio.


According to an implementation of the first aspect, the method further comprises: performing, using the neural network model, segmentation on an input image to identify one or more regions of interest.


A third aspect of the present disclosure provides a non-transitory computer-readable medium, having computer-executable instructions stored thereon, for training a neural network model, the computer-executable instructions, when executed by one or more processors, causing the one or more processors to carry out: receiving images of a plurality of views, the plurality of views comprising a global complete view and one or more masked views; generating representations corresponding to the global complete view and the one or more masked views; generating one or more reconstructed images corresponding to the one or more masked views; evaluating in a shared representation space, consistency between the representations of the global complete view and the one or more masked views; computing one or more losses based on the one or more reconstructed images and the results from the consistency evaluation; and updating one or more parameters in the neural network model based on the one or more losses.


According to an implementation of the first aspect, the one or more masked views comprises a global masked view and a local masked view, and wherein the one or more reconstructed images comprises reconstructed global masked view images and reconstructed local masked view images.


According to an implementation of the first aspect, evaluating, in the shared representation space, consistency between the representations of the global complete view and the one or more masked views further comprises: evaluating first consistency between the representations of the global complete view and the global masked view; and evaluating second consistency between the representations of the global complete view and the local masked view.


According to an implementation of the first aspect, the one or more processors further carry out: projecting the representations of the global complete view and the one or more masked views to the shared representation space.





BRIEF DESCRIPTION OF THE DRAWINGS

The present systems and methods for image processing are described in detail below with reference to the attached drawing figures, wherein:



FIG. 1A illustrates an exemplary network environment, in accordance with one or more embodiments in the present disclosure.



FIG. 1B is a block diagram of an exemplary computing system configured to implement various functions, in accordance with one or more embodiments in the present disclosure.



FIG. 2 is a flowchart illustrating an example process for training a neural network (NN) model, in accordance with one or more embodiments of the present disclosure.



FIG. 3 is a block diagram of an exemplary Global-Local Masked AutoEncoder (GL-MAE) scheme, in accordance with one or more embodiments of the present disclosure.



FIG. 4 illustrates an exemplary flow diagram implementing GL-MAE, in accordance with one or more embodiments of the present disclosure.



FIG. 5 illustrates an exemplary framework 500 of GL-MAE, in accordance with one or more embodiments of the present disclosure.



FIG. 6 is a block diagram of an exemplary process implementing GL-MAE, in accordance with one or more embodiments of the present disclosure.



FIG. 7 presents Table 1 that shows a comparison of performances between various methods.



FIG. 8 presents Table 2 that shows a comparison of performances between various methods.



FIG. 9 presents Table 3 that showcases ablation study results.





DETAILED DESCRIPTION

Masked autoencoder (MAE) is a promising self-supervised pretraining technique that may improve the representation learning of a neural network without human intervention. However, applying MAE directly to volumetric medical images poses two challenges: (i) a lack of global information that is crucial for understanding the clinical context of the holistic data, and (ii) no guarantee of stabilizing the representations learned from randomly masked inputs. To address these limitations, the present disclosure presents the Global-Local Masked AutoEncoder (GL-MAE), an efficient and effective self-supervised pretraining strategy. In addition to reconstructing masked local views, GL-MAE incorporates global context learning by reconstructing masked global views. Furthermore, a complete global view is integrated as an anchor to guide the reconstruction and stabilize the learning process through global-to-global consistency learning and global-to-local consistency learning. Finetuning results on multiple datasets demonstrate the superiority of this method over other state-of-the-art self-supervised algorithms, highlighting its effectiveness on versatile volumetric medical image segmentation tasks, even when annotations are scarce.


In some embodiments, GL-MAE is used for learning representations of volumetric medical data. GL-MAE reconstructs input volumes from both global and local views of the data. It also employs consistency learning to enhance semantic correspondence by using unmasked global sub-volumes to guide the learning of multiple masked local and global views. By introducing global information into MAE-based SSL pretraining, GL-MAE achieved superior performance in downstream volumetric medical image segmentation tasks compared with state-of-the-art techniques, such as MAE3D and Swin-UNETR.


In some embodiments, both global and local views of an input volumetric image are obtained by applying image transforms, such as cropping and downsampling. On the one hand, the global view covers a large region with rich information but low spatial resolution, which may miss details of small organs or tumors. On the other hand, the local views are in high spatial resolution and rich in detail, but take up only a small fraction of the input volume. To leverage both sources of information, an MAE is used to simultaneously reconstruct both the masked global and local images, enabling learning from both the global context and the local details of the data. A global view of the image encompasses a more comprehensive area of the data and could help learn representations that are invariant to different views of the same object. Such view-invariant representations are beneficial in medical image analysis. To encourage the learning of view-invariant representations, global-guided consistency learning is utilized, where the representation of an unmasked global view is used to guide the learning of robust representations of the masked global and local views. Finally, as the global view covers most of the local views, it serves as an “anchor” for the masked local views to learn global-to-local consistency.



FIG. 1A illustrates an exemplary network environment 100, in accordance with one or more examples in the present disclosure. A neural network tasked for representation learning may be implemented in the exemplary network environment 100. Network environments suitable for use in implementing embodiments of the disclosure may include one or more client devices 120, servers 130, and/or other device types.


Components of a network environment may communicate with each other via a network(s) 110, which may be wired, wireless, or both. By way of example, the network 110 may include one or more Wide Area Networks (WANs), one or more Local Area Networks (LANs), one or more public networks such as the Internet, and/or one or more private networks. Where the network includes a wireless telecommunications network, components such as a base station, a communications tower, or access points (as well as other components) may provide wireless connectivity.


Compatible network environments may include one or more peer-to-peer network environments—in which case a server may not be included in a network environment—and one or more client-server network environments—in which case one or more servers may be included in a network environment. In peer-to-peer network environments, functionality described herein with respect to a server(s) may be implemented on any number of client devices.


In at least one embodiment, a network environment may include one or more cloud-based network environments, a distributed computing environment, a combination thereof, etc. A cloud-based network environment may include a framework layer, a job scheduler, a resource manager, and a distributed file system implemented on one or more servers, which may include one or more core network servers and/or edge servers. A framework layer may include a framework to support software of a software layer and/or one or more application(s) of an application layer. The software or application(s) may respectively include web-based service software or applications. In some embodiments, one or more of the client devices may use the web-based service software or applications (e.g., by accessing the service software and/or applications via one or more application programming interfaces (APIs)). The framework layer may be, but is not limited to, a type of free and open-source software web application framework, such as one that may use a distributed file system for large-scale data processing (e.g., “big data”).


A cloud-based network environment may provide cloud computing and/or cloud storage that carries out any combination of computing and/or data storage functions described herein (or one or more portions thereof). Any of these various functions may be distributed over multiple locations from central or core servers (e.g., of one or more data centers that may be distributed across a state, a region, a country, the globe, etc.). A cloud-based network environment may be private (e.g., limited to a single organization), may be public (e.g., available to many organizations), and/or a combination thereof (e.g., a hybrid cloud environment).


By way of example and not limitation, a client device 120 may be embodied as a Personal Computer (PC), a laptop computer, a mobile device, a smartphone, a tablet computer, or any other suitable device.



FIG. 1B is a block diagram of an exemplary computing system 140 configured to implement various functions in accordance with one or more embodiments in the present disclosure. In some examples, the computing system 140 may be embodied as a client device 120 or a server 130 as shown in FIG. 1A. In other examples, the computing system 140 may include any combination of the components in the network environment 100 as shown in FIG. 1A to carry out various functions/processes disclosed herein. In yet other examples, the computing system 140 may include a cloud storage (e.g., a network attached storage (NAS) connected in the network environment 100) storing a training dataset, and one or more client devices 120 and/or one or more servers 130 to retrieve the training dataset from the cloud storage and train a machine learning (ML) or artificial intelligence (AI) model implemented thereon.


As shown in FIG. 1B, the computing system 140 may include one or more processors 150, a communication interface 170, and a memory 160. The one or more processors 150, communication interface 170, and memory 160 may be communicatively coupled to a bus 180 to enable communication therebetween. The processor(s) 150 may be configured to perform the operations in accordance with the instructions stored in memory 160. The processor(s) 150 may include any appropriate type of general-purpose or special-purpose microprocessor, such as central processing unit (CPU), graphic processing unit (GPU), parallel processing unit (PPU), etc. The memory 160 may be configured to store computer-readable instructions that, when executed by the processor(s) 150, can cause the processor(s) 150 to perform various operations disclosed herein. The memory 160 may be any non-transitory type of mass storage, such as volatile or non-volatile, magnetic, semiconductor-based, tape-based, optical, removable, non-removable, or other type of storage device or tangible computer-readable medium including, but not limited to, a read-only memory (“ROM”), a flash memory, a dynamic random-access memory (“RAM”), and/or a static RAM.


The communication interface 170 may be configured to communicate information between the computing system 140 and other devices or systems, such as the client device(s) 120, the server(s) 130, or any other suitable device(s)/system(s) in the network environment 100 as shown in FIG. 1A. For example, the communication interface 170 may include an integrated services digital network (“ISDN”) card, a cable modem, a satellite modem, or a modem to provide a data communication connection. As another example, the communication interface 170 may include a local area network (“LAN”) card to provide a data communication connection to a compatible LAN. As a further example, the communication interface 170 may include a high-speed network adapter such as a fiber optic network adapter, a 10G Ethernet adapter, or the like. Wireless links can also be implemented by the communication interface 170. In such an implementation, the communication interface 170 can send and receive electrical, electromagnetic, or optical signals that carry digital data streams representing various types of information via a network. The network can typically include a cellular communication network, a Wireless Local Area Network (“WLAN”), a Wide Area Network (“WAN”), or the like. In some variations, the communication interface 170 may include various I/O devices such as a keyboard, a mouse, a touchpad, a touch screen, a microphone, a camera, a biosensor, etc. A user may input data to the computing system 140 (e.g., a terminal device) through the communication interface 170.


In some examples, a display may be integrated as part of the computing system 140 or may be provided as a separate device communicatively coupled to the computing system 140. The display may include a display device such as a Liquid Crystal Display (“LCD”), a Light Emitting Diode Display (“LED”), a plasma display, or any other type of display, and provide a Graphical User Interface (“GUI”) presented on the display for user input and data depiction. In some instances, the display may be integrated as part of the communication interface 170.


In some embodiments, an ML/AI model according to exemplary embodiments of the present disclosure may be extended to any suitable type of neural network (NN) model. An NN model includes multiple layers of interconnected nodes (e.g., perceptrons, neurons, etc.) that can be trained with enormous amounts of input data to quickly solve complex problems with high accuracy. The first layer in the NN model, which receives input to the NN model, is referred to as the input layer. The last layer in the NN model, which produces outputs of the NN model, is referred to as the output layer. Any layer between the input layer and the output layer of the NN model is referred to as a hidden layer. The parameters/weights related to the NN model may be stored in a memory (e.g., a memory 160 in a computing system 140) in the form of a data structure.



FIG. 2 is a flowchart illustrating an example process 200 for training an NN model, in accordance with one or more embodiments of the present disclosure. The NN model may be implemented in a computing system 140 as shown in FIG. 1B operating in the network environment 100 as depicted in FIG. 1A. The computing system 140 may include one or more client devices 120, one or more servers 130, or any combination thereof in the network environment 100 as shown in FIG. 1A. One or more processor(s) 150 in the computing system 140 may execute instructions stored in the memory 160 to perform the process 200. The process 200 may be performed alone or in combination with other processes described in the present disclosure. It will be appreciated by one skilled in the art that the process 200 may be performed in any suitable environment and blocks in the process 200 may be performed in any suitable order.


At block 210, the computing system 140 obtains a dataset as input data. The dataset may be of various types, such as image data. For example, it may include volumetric medical images obtained from Computed Tomography (CT) scans, training data from data pools, augmented data from various sources, and more.


At block 220, the computing system 140 trains a NN model using the input dataset. Various techniques may be employed to train the NN model. As an illustrative example, without limiting the scope of the present disclosure, the computing system 140 may perform one or more of the following operations to train the NN model.


First, the computing system 140 initializes the NN model with random and/or default weights and biases. The weights and/or biases are parameters that may be adjusted during training. Second, the computing system 140 passes the input dataset through the model to obtain predictions. This step may be referred to as a forward pass, which involves applying the current model parameters (e.g., weights and/or biases) to the input dataset. Third, as depicted in block 230, the computing system 140 computes one or more losses. For example, the computing system 140 applies one or more loss functions to quantify the difference between predicted and actual/target values, for example, by comparing the model's predictions to the actual/target values in the dataset. The computing system 140 backpropagates the computed loss(es) to update one or more parameters in the model. For example, the computing system 140 calculates the gradients of the loss with respect to the model parameters, thereby determining the contribution of each or some of the parameters to the error. With that, the computing system 140 adjusts one or more parameters in the model based on the computed gradients. In some instances, the computing system 140 applies suitable optimization algorithms (e.g., gradient descent) to adjust the one or more parameters in the model, aiming to minimize the loss and improve the model's performance. It will be appreciated by one skilled in the art that other techniques may be employed to utilize the computed losses for updating one or more parameters in the model. The computing system 140 performs multiple iterations or epochs to train the model. Each iteration may process a new or existing batch of data (e.g., an input dataset) and update the model parameters, gradually improving the model's ability to make accurate predictions.
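
As an illustrative sketch only (not a definitive implementation of the disclosure), the forward pass, loss computation, backpropagation, and parameter update described above may look as follows in PyTorch; the model, dataloader, loss function, and optimizer are placeholders supplied by the caller.

import torch

def train_one_epoch(model, loader, loss_fn, optimizer, device="cpu"):
    """One pass over a dataset: forward pass, loss, backpropagation, parameter update."""
    model.train()
    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        predictions = model(inputs)           # forward pass with current parameters
        loss = loss_fn(predictions, targets)  # quantify the prediction error
        loss.backward()                       # gradients of the loss w.r.t. the parameters
        optimizer.step()                      # gradient-descent-style parameter adjustment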


At block 240, the computing system 140 determines whether the trained model has converged. If not, the computing system 140 may continue to another iteration (e.g., to perform any of blocks 210, 220, 230, and/or 240). For example, the computing system 140 may stop training when the model reaches satisfactory performance or after a predefined number of iterations or epochs.


In some examples, convergence may be determined by observing when the loss(es) of the model, measured by the loss function(s), stops decreasing significantly or starts increasing. For example, if the loss reaches a plateau or begins to rise, it suggests that the model may have converged or is overfitting. Convergence criteria may involve setting a threshold for the loss value or employing a patience parameter, which may indicate the number of iterations/epochs with no improvement before stopping the training.
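
A minimal sketch of a patience-based stopping criterion, assuming a per-epoch loss history is available; the patience and threshold values are illustrative, not values prescribed by the present disclosure.

def should_stop(loss_history, patience=10, min_delta=1e-4):
    """Stop when the loss has not improved by at least min_delta for `patience` epochs."""
    if len(loss_history) <= patience:
        return False
    best_recent = min(loss_history[-patience:])
    best_before = min(loss_history[:-patience])
    return best_before - best_recent < min_delta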


At block 250, the computing system 140 outputs a model, based on determining the convergence of the model.



FIG. 3 is a block diagram of an exemplary GL-MAE scheme 300, in accordance with one or more embodiments of the present disclosure. GL-MAE 300 may be implemented in a computing system 140 as shown in FIG. 1B operating in the network environment 100. The computing system 140 may include one or more client devices 120, one or more servers 130, or any combination thereof in the network environment 100 as shown in FIG. 1A. One or more processor(s) 150 in the computing system 140 may execute instructions stored in the memory 160 to execute any of the blocks in GL-MAE 300. GL-MAE 300 may be executed alone or in combination with other processes described in the present disclosure. It will be appreciated by one skilled in the art that GL-MAE 300 may be executed in any suitable environment and blocks in GL-MAE 300 may be executed in any suitable order.


As shown in FIG. 3, GL-MAE 300 includes various functional modules, including a masked autoencoder 310 and a consistency learning module 350. The masked autoencoder (MAE) 310 is configured to generate representations for various views (e.g., global complete views 320, masked global views, and masked local views) and to perform global reconstruction 330 and local reconstruction 340 to obtain reconstructed global views and local views, respectively, based on the input image data. This enables the GL-MAE model 300 to learn global context as well as local details. The MAE 310 may include one or more encoders and/or one or more decoders. For example, one encoder may be used to encode the masked global and local views, while another encoder may be used to encode the global complete views 320. The one or more encoders in the MAE 310 may be substantially the same or related. For example, the encoder used for encoding the global complete views 320 may be derived from the encoder used for encoding the masked global and local views. Similarly, one or more decoders may be employed for reconstructing the global and local views from the masked global and local views, where the one or more decoders may be substantially the same or related.


The consistency learning module 350 is configured to learn global-to-global consistency 360 and global-to-local consistency 370 based on global guidance. In an embodiment, the global-guided consistency learning is introduced between the representations of unmasked global views (e.g., the global complete views 320) and the reconstructed masked views, which encourages invariant and robust feature learning.


In this configuration, in addition to reconstructing masked local views, GL-MAE 300 incorporates global context learning by reconstructing masked global views. Additionally, GL-MAE 300 integrates the complete global view 320 as an anchor to guide the reconstruction and stabilize the learning process through global-to-global consistency learning 360 and global-to-local consistency learning 370. Finetuning results on multiple datasets demonstrate the superiority of this method over other state-of-the-art self-supervised algorithms, highlighting its effectiveness on versatile volumetric medical image segmentation tasks, even when annotations are scarce.



FIG. 4 illustrates an exemplary flow diagram 400 implementing GL-MAE, in accordance with one or more embodiments of the present disclosure. The flow diagram 400 may be executed on a computing system 140 as shown in FIG. 1B operating in the network environment 100. The computing system 140 may include one or more client devices 120, one or more servers 130, or any combination thereof in the network environment 100 as shown in FIG. 1A. One or more processor(s) 150 in the computing system 140 may execute instructions stored in the memory 160 to perform any of the processes as demonstrated in the flow diagram 400. The flow diagram 400 may be executed alone or in combination with other processes described in the present disclosure. It will be appreciated by one skilled in the art that the flow diagram 400 may be executed in any suitable environment and processes in the flow diagram 400 may be performed in any suitable order.


As depicted in FIG. 4, the flow diagram 400 illustrates exemplary inputs to and outputs from GL-MAE, as well as exemplary processes undergone by GL-MAE.


In this example, the inputs to GL-MAE are various views obtained from volumetric medical images 402. The various views include a global view 410, a masked global view 412, and a masked local view 414. In some embodiments, the input image data (e.g., the volumetric medical images 402) may be transformed, such as through cropping, scaling, rotating, etc., and/or augmented. For example, the global view 410 and the masked global view 412 may be obtained based on cropped image 404, while the masked local view 414 may be obtained based on cropped images 406. In one embodiment, a sub-volume of the volumetric medical images 402 may be obtained through downsampling.


Masked views, such as the masked global view 412 and the masked local view 414, are generated by applying random or predefined masks (e.g., 416) to individual images. In some embodiments, each input image, such as a global view image or a local view image, may first be segmented into multiple patches of a predefined size, for example with a predefined number of pixels. Subsequently, one or more patches in the respective image may be masked. Masks may take various forms depending on the application. For example, a mask may be defined as a predetermined number of pixels with default or fixed values, thereby overriding the visual information from the original image.
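
The patch masking described above can be sketched as follows; the cubic patch size and mask ratio are illustrative assumptions (the disclosure only requires a predefined patch size and masking ratio), and masked patches are overridden with a fixed value of zero.

import torch

def mask_volume_patches(volume, patch_size=16, mask_ratio=0.75):
    """Split a (C, H, W, D) volume into cubic patches and hide a random subset.

    Returns the masked volume and a boolean mask (True = patch hidden)."""
    C, H, W, D = volume.shape
    n_h, n_w, n_d = H // patch_size, W // patch_size, D // patch_size
    num_patches = n_h * n_w * n_d
    num_masked = int(num_patches * mask_ratio)
    masked_idx = torch.randperm(num_patches)[:num_masked]
    mask = torch.zeros(num_patches, dtype=torch.bool)
    mask[masked_idx] = True
    masked = volume.clone()
    for idx in masked_idx.tolist():
        i, rem = divmod(idx, n_w * n_d)
        j, k = divmod(rem, n_d)
        masked[:, i*patch_size:(i+1)*patch_size,
                  j*patch_size:(j+1)*patch_size,
                  k*patch_size:(k+1)*patch_size] = 0.0  # override with a fixed value
    return masked, mask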


GL-MAE encodes the global view 410, masked global view 412, and masked local view 414 to generate corresponding representations (e.g., 410a, 412a, and 414a, respectively). Some or all of the patches from the input views may be encoded by GL-MAE. In some embodiments, only the patches that have not been masked are encoded. The representations 410a, 412a, and 414a may be projected to a shared space (e.g., a shared representation space), enabling GL-MAE to learn the consistency between the representations.


Based on the representations 410a, 412a, and 414a, GL-MAE evaluates global-guided consistency 420 (indicated by dashed arrows 422 and 424) and performs image reconstructions 430 (indicated by dashed arrows 432 and 434). For example, the consistency learning module 350 in the GL-MAE 300 as shown in FIG. 3 performs global-to-global consistency learning 360 based on the representations 410a and 412a corresponding to the global view 410 and masked global view 412, respectively. Additionally and/or alternatively, the consistency learning module 350 in the GL-MAE 300 as shown in FIG. 3 performs global-to-local consistency learning 370 based on the representations 410a and 414a corresponding to the global view 410 and masked local view 414, respectively. Additionally, the masked autoencoder 310 in the GL-MAE 300 as shown in FIG. 3 performs global reconstruction 330 and local reconstruction 340 based on the representations 412a and 414a corresponding to the masked global view 412 and masked local view 414, respectively.


The outputs from the GL-MAE include results from global-guided consistency learning 420 and reconstructed images (e.g., reconstructed global images 442 and/or reconstructed local images 444) from the image reconstructions 430. In some embodiments, the results from the global-guided consistency learning 420 may include one or more losses computed based on the global-to-global consistency learning 360 and/or the global-to-local consistency learning 370.



FIG. 5 illustrates an exemplary framework 500 of GL-MAE, in accordance with one or more embodiments of the present disclosure. The framework 500 may be implemented in a computing system 140 as shown in FIG. 1B operating in the network environment 100. The computing system 140 may include one or more client devices 120, one or more servers 130, or any combination thereof in the network environment 100 as shown in FIG. 1A. One or more processor(s) 150 in the computing system 140 may execute instructions stored in the memory 160 to perform any of the processes as demonstrated in the framework 500. The one or more processor(s) 150 in the computing system 140 may perform the processes outlined in the framework 500 to facilitate the flow diagram 400 as depicted in FIG. 4. The framework 500 may be implemented alone or in combination with other processes described in the present disclosure. It will be appreciated by one skilled in the art that the framework 500 may be implemented in any suitable environment and processes in the framework 500 may be performed in any suitable order.


As an overview of the framework 500, the initial step involves generating various views based on the input image data (e.g., volumetric medical images 502). One or more image transformations are performed on the images 502, such as cropping and downsampling. For example, cropped images 504 are used for generating global complete views (or global views in dashed box 510) and global masked views (in dashed box 520), and cropped images 506 are used for generating local masked views (in dashed box 530).


The next step is to process the various views through separate paths using GL-MAE. The different paths may be executed in any suitable order, for example, in parallel, in series, or a combination thereof.


The global views 510 are processed by an encoder 512 to generate corresponding representations 510a (e.g., in dashed box 514). The global masked views 520 are processed by an encoder 522 to generate corresponding representations 520a (e.g., in dashed box 524). Additionally, a decoder 526 is utilized to generate reconstructed global views 520b (e.g., in dashed box 528). The local masked views 530 are processed by an encoder 532 to generate corresponding representations 530a (e.g., in dashed box 534). Additionally, a decoder 536 is utilized to generate reconstructed local views 530b (e.g., in dashed box 538). The encoders 522 and 532 may be substantially the same, and the decoders 526 and 536 may be substantially the same.


GL-MAE projects the representations 510a, 520a, and 530a to a shared space to evaluate global-guided consistency (in dashed box 550) between the representations 510a, 520a, and 530a.


The aforementioned processes will be further elaborated upon with exemplary equations and implementations hereinafter. It will be appreciated by one skilled in the art that these equations and implementations are merely examples and do not limit the scope of the present disclosure.


MAE with Global and Local Reconstruction


In this example, the volumetric medical images 502 include a volume (x) of images obtained from an unlabeled CT dataset (D). The unlabeled CT dataset may include raw images from CT scanning, augmented images, and/or other types of image data. The dataset is represented by $D = \{x_1, \ldots, x_N\}$, where N is the total number of volumes. The N volumes may be used as batches of datasets for training an NN model implementing GL-MAE. As shown in FIG. 5, a volume $x \in \mathbb{R}^{C \times H \times W \times D}$ is randomly sampled from D, where C, H, W, and D stand for the Channel, Height, Width, and Depth dimensions. The volume x is augmented into a small-scale sub-volume $v_l \in \mathbb{R}^{C \times H \times W \times D}$ using image transformation $t_1$ and a large-scale sub-volume $v_g \in \mathbb{R}^{C \times H \times W \times D}$ using image transformation $t_2$. The sub-volume $v_l$ comprises local view images that provide detailed, high-resolution information about texture and boundaries, but lack a global view of the input data, such as the status of other organs. The sub-volume $v_l$ corresponds to a local view set $V_l = \{v_l^i, i \in [1, q]\}$. On the other hand, the sub-volume $v_g$ comprises global view images that provide a global view set $V_g = \{v_g^i, i \in [1, p]\}$. Here p and q are integers representing the number of views in the respective view sets.


In an example, for global views (e.g., in 510 and 520), images were scaled by a random ratio in the range [0.5, 1], cropped, and then resized to [160, 160, 160] pixels in the height, width, and depth dimensions. For local views (e.g., in 530), images were scaled by a random ratio in the range [0.25, 0.5], cropped, and resized to [96, 96, 96]. Finally, all image intensities were normalized from [−1000, 1000] to [0, 1]. p and q were set to 2 and 8, respectively.
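
A minimal sketch of the intensity normalization and resizing steps for the example values above (random scaled cropping is omitted here; the disclosure's own pseudocode for view generation appears in Algorithm 1 below). The function name and the use of trilinear interpolation are assumptions for illustration.

import torch
import torch.nn.functional as F

def normalize_and_resize(volume, out_size=(160, 160, 160), hu_min=-1000.0, hu_max=1000.0):
    """Clip CT intensities to [hu_min, hu_max], rescale them to [0, 1], and resize.

    `volume` is a (C, H, W, D) tensor; out_size would be 160^3 for global views
    and 96^3 for local views in the example above."""
    vol = volume.clamp(hu_min, hu_max)
    vol = (vol - hu_min) / (hu_max - hu_min)
    # trilinear resizing expects a (N, C, H, W, D) batch
    vol = F.interpolate(vol.unsqueeze(0), size=out_size, mode="trilinear", align_corners=False)
    return vol.squeeze(0)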


To obtain the masked volumes (e.g., in 520 and 530) for local and global reconstruction, all local views 530 in the set $V_l$ and all global views 510 in the set $V_g$ are individually tokenized into patches, and a volume masking transform $t_m$ with a predefined ratio is then applied. The masked patches serve as invisible patches (562 in the legend 560), while the remaining patches serve as visible patches (564 in the legend 560) for a learnable encoder $s(\cdot)$. By applying $t_m$ to each $v_g^i$ and $v_l^i$, the visible patches form sets of masked local and global sub-volumes, represented by $\tilde{V}_l = \{\tilde{v}_l^i\}, i \in [1, q]$ and $\tilde{V}_g = \{\tilde{v}_g^i\}, i \in [1, p]$, respectively.


The encoder $s(\cdot)$ (e.g., the encoder 522/532) is used to map the input volumes (in 520/530) to a representation space. A position embedding is added to each visible patch and then combined with a class token for generating the volume representation (e.g., 520a or 530a). The visible patches of the local and global views $\tilde{v}_l^i$ and $\tilde{v}_g^i$ are fed into $s(\cdot)$ to generate $\tilde{Z}_l$ and $\tilde{Z}_g$ by:












$$\tilde{Z}_l = s(\tilde{v}_l)$$  (Eq. 1a)

$$\tilde{Z}_g = s(\tilde{v}_g)$$  (Eq. 1b)







where $\tilde{Z}$ consists of two parts, namely the output of the class token and the outputs of the patch tokens, denoted as:












$$\tilde{Z}_l \triangleq [\tilde{Z}_l^{cls}; \tilde{Z}_l^{p}]$$  (Eq. 1c)

$$\tilde{Z}_g \triangleq [\tilde{Z}_g^{cls}; \tilde{Z}_g^{p}]$$  (Eq. 1d)







where $\tilde{Z}_g^{cls}$ and $\tilde{Z}_l^{cls}$ are the outputs of the class tokens, while $\tilde{Z}_g^{p}$ and $\tilde{Z}_l^{p}$ are the outputs of the visible patch tokens. In this example, position embeddings are interpolated before being added to the visible tokens. $\tilde{Z}_g^{p}$ and $\tilde{Z}_l^{p}$ are in low resolution, which is beneficial for downstream tasks involving dense prediction. This process is repeated q times for the local sub-volumes and p times for the global sub-volumes, whose patches are then used for reconstruction.


A momentum encoder $m(\cdot)$ (e.g., the encoder 512) generates a mean representation $Z_c$ (510a in dashed box 514) of the unmasked global view $v_g$ as:











$$Z_c = m(v_g), \quad v_g \in V_g$$  (Eq. 1e)







The parameters of the momentum encoder 512 are updated using a momentum factor that is dynamically computed based on the parameters of the learnable encoder 522/532, as follows:












$$m^{(t)}(\cdot) \leftarrow \mu\, s^{(t)}(\cdot) + (1 - \mu)\, m^{(t-1)}(\cdot)$$  (Eq. 1f)







where $m^{(t)}(\cdot)$ and $s^{(t)}(\cdot)$ represent the momentum encoder and the encoder at the t-th iteration, respectively, and $\mu$ is a momentum coefficient updated with a cosine scheduler.
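
Eq. 1f is an exponential-moving-average update applied parameter-wise; a minimal sketch is given below. The cosine schedule for $\mu$ is only an illustration of "a momentum coefficient updated with a cosine scheduler", and the base/final values are assumptions rather than values fixed by the disclosure.

import math
import torch

@torch.no_grad()
def momentum_update(momentum_encoder, encoder, mu):
    """m^(t) <- mu * s^(t) + (1 - mu) * m^(t-1), applied to each parameter (Eq. 1f)."""
    for p_m, p_s in zip(momentum_encoder.parameters(), encoder.parameters()):
        p_m.data.mul_(1.0 - mu).add_(p_s.data, alpha=mu)

def cosine_momentum(step, total_steps, mu_base=0.004, mu_final=0.0):
    """Example cosine schedule that decays mu from mu_base to mu_final (assumed values)."""
    cos = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return mu_final + (mu_base - mu_final) * cos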


A decoder $\mathcal{D}(\cdot)$ (e.g., the decoder 526/536) is used to reconstruct the invisible patches from the representations of the visible patches. The input of the decoder $\mathcal{D}(\cdot)$ includes the encoded visible patches and mask tokens. As shown in FIG. 5, the mask token is learnable and indicates the missing patches to predict. Position embeddings are added to all tokens to convey location information. The output $(y_l, y_g)$ of the decoder $\mathcal{D}(\cdot)$ is derived by:











$$y_l = \mathcal{D}(\tilde{Z}_l^{p})$$  (Eq. 2a)

$$y_g = \mathcal{D}(\tilde{Z}_g^{p})$$  (Eq. 2b)







The decoder output $(y_l, y_g)$ is reshaped to form the reconstructed volumes (e.g., 520b in dashed box 528 and 530b in dashed box 538). A reconstruction loss is computed using one or more loss functions, and any suitable loss functions may be used. For example, the Mean Square Error may be used as the reconstruction loss function and applied to the local and global masked sub-volumes.


Local Reconstruction

For the masked local sub-volumes $\tilde{V}_l$, the local reconstruction loss $\mathcal{R}_l$ is defined as:












$$\mathcal{R}_l = \frac{1}{|\tilde{V}_l| \times HWD} \sum_{h=1}^{H} \sum_{w=1}^{W} \sum_{d=1}^{D} \Bigg( \sum_{v_l \in \tilde{V}_l} \big( y_l^{h,w,d} - v_l^{h,w,d} \big)^{2}\, P \Bigg)$$  (Eq. 3)







where h, w, d denote voxel indices in the representations, P represents the number of patches for local views, and H, W, D refer to the height, width, and depth of each sub-volume, respectively. The reconstruction loss is computed as the sum of squared differences between the voxel values of the reconstruction target $(v_l^{h,w,d})$ and the reconstructed representation $(y_l^{h,w,d})$.


Global Reconstruction

For the masked global sub-volumes $\tilde{V}_g$, the global reconstruction loss $\mathcal{R}_g$ is defined as:












$$\mathcal{R}_g = \frac{1}{|\tilde{V}_g| \times HWD} \sum_{h=1}^{H} \sum_{w=1}^{W} \sum_{d=1}^{D} \Bigg( \sum_{v_g \in \tilde{V}_g} \big( y_g^{h,w,d} - v_g^{h,w,d} \big)^{2}\, P \Bigg)$$  (Eq. 4)







where P represents the number of patches for global views. Since the global view has a larger input size than the local views, position embeddings may be interpolated before being added to the visible tokens. This process enables the reconstruction of the masked volumes at both the local and global scales, facilitating the learning of rich information from both the local details and the global context.
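
Eqs. 3 and 4 amount to a mean-squared error between the reconstructed and original voxels of the masked sub-volumes. A minimal sketch follows, assuming the decoder outputs have already been reshaped to match the targets and that, as in standard MAE practice, the error may optionally be restricted to the hidden voxels via a mask; this is an illustration rather than the disclosure's exact computation.

import torch

def reconstruction_loss(reconstruction, target, voxel_mask=None):
    """Mean squared error over sub-volumes, optionally restricted to masked voxels.

    reconstruction, target: (N, C, H, W, D) tensors; voxel_mask: broadcastable
    binary tensor marking the voxels that were hidden from the encoder."""
    sq_err = (reconstruction - target) ** 2
    if voxel_mask is not None:
        return (sq_err * voxel_mask).sum() / voxel_mask.sum().clamp(min=1)
    return sq_err.mean()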


Global-Guided Consistency

Global-guided consistency learning (e.g., block 350 in FIG. 3) includes two components: global-to-global consistency (e.g., block 360 in FIG. 3) and global-to-local consistency (e.g., block 370 in FIG. 3). The global-to-global consistency enforces consistency between the representations of the unmasked global view $v_g$ (in 510) and the masked global views $\tilde{v}_g$ (in 520), promoting the learning of features that are robust to the distortion caused by masking and accelerating training convergence. Since the global view contains richer context and covers most of the local views, it may be used as an “anchor” to guide the representation learning of the local views. The global-to-local consistency is used for capturing information about the relationships between different parts of an image and its main semantics. It enforces consistency between the representations learned from the masked local views (in 530) and the mean representation of the unmasked global view (in 510). As discussed earlier, the parameters of the momentum encoder 512 are updated using a momentum factor that is dynamically computed based on the parameters of the learnable encoder 522/532.


To perform consistency learning, the representations of the complete global views $Z_c^{cls}$, the masked global views $\tilde{Z}_g^{cls}$, and the masked local views $\tilde{Z}_l^{cls}$ are first projected into a shared space. In some embodiments, the projection is performed by projection layers $\mathcal{P}_s(\cdot)$ (denoted as 544/546 in FIG. 5) and $\mathcal{P}_m(\cdot)$ (denoted as 542 in FIG. 5), which are connected to the outputs of the encoders $s(\cdot)$ 522/532 and $m(\cdot)$ 512, respectively. In further embodiments, the projections are implemented using multiple fully-connected layers followed by an activation function. For example, a Gaussian Error Linear Unit (GELU) activation function may be used. After projection, the embeddings of the complete global views, the masked global views, and the masked local views are obtained as:











$$E_c = \mathcal{P}_m(Z_c^{cls})$$  (Eq. 5a)

$$\tilde{E}_g = \mathcal{P}_s(\tilde{Z}_g^{cls})$$  (Eq. 5b)

$$\tilde{E}_l = \mathcal{P}_s(\tilde{Z}_l^{cls})$$  (Eq. 5c)







where $E_c$ represents the embedding of the complete global views, $\tilde{E}_g$ represents the embedding of the masked global views, and $\tilde{E}_l$ represents the embedding of the masked local views. The dimension of $E_c$, $\tilde{E}_g$, and $\tilde{E}_l$ is represented by K, which is a predefined value, e.g., 512. Each type of embedding E in the shared space is normalized before computing the loss function. The embedding $E_i$ is normalized as follows:











$$\Gamma(E_i) = \frac{\exp(E_i / t)}{\sum_{k=1}^{K} \exp(E_k / t)}$$  (Eq. 6)







where t represents the temperature used to control the entropy of the distribution. The training of GL-MAE may aim to align the distributions of the embedding representations of the global complete view $E_c$ and the masked global view $\tilde{E}_g$, as well as the distributions of the global complete view $E_c$ and the masked local view $\tilde{E}_l$, by minimizing:











$$H\big(\Gamma(E_c), \Gamma(\tilde{E}_g)\big) + H\big(\Gamma(E_c), \Gamma(\tilde{E}_l)\big)$$  (Eq. 7)







where $H(x, y) = -x \log y$ is the cross-entropy loss.
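
A minimal sketch of Eqs. 5-7: a projection head of fully-connected layers with GELU (the hidden size is an assumption), the temperature-scaled softmax of Eq. 6, and the cross-entropy $H(x, y) = -x \log y$ between the anchor and a masked-view embedding. Detaching the anchor distribution so that gradients flow only through the masked views is a common practice for such anchors and is an assumption here, not a detail stated by the disclosure.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Fully-connected layers with GELU, projecting class-token embeddings into
    the shared K-dimensional space (K = 512 in the example above)."""
    def __init__(self, in_dim, hidden_dim=2048, out_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, out_dim))

    def forward(self, z_cls):
        return self.net(z_cls)

def consistency_loss(anchor_emb, masked_emb, temperature=0.1):
    """H(Gamma(E_c), Gamma(E_masked)) with temperature-scaled softmax (Eqs. 6-7)."""
    p_anchor = F.softmax(anchor_emb.detach() / temperature, dim=-1)  # anchor guides, no gradient
    log_p_masked = F.log_softmax(masked_emb / temperature, dim=-1)
    return -(p_anchor * log_p_masked).sum(dim=-1).mean()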


Global-to-Global Consistency

For the unmasked global sub-volumes $V_g$ and the masked global sub-volumes $\tilde{V}_g$, a global-to-global consistency loss function is formulated as:












$$\mathcal{C}_{gg} = \frac{1}{|V_g| \cdot |\tilde{V}_g|} \Bigg\{ \sum_{v_g \in V_g} \sum_{\tilde{v}_g \in \tilde{V}_g} H\big(\Gamma(E_c), \Gamma(\tilde{E}_g)\big) \Bigg\}$$  (Eq. 8)







where the operator $|\cdot|$ computes the number of volumes in the respective set. $\tilde{E}_g$ learns consistency guided by the global context embedding $E_c$.


Global-to-Local Consistency

For the unmasked global sub-volumes $V_g$ and the masked local sub-volumes $\tilde{V}_l$, a global-to-local consistency loss function is formulated as:











$$\mathcal{C}_{gl} = \frac{1}{|V_g| \cdot |\tilde{V}_l|} \Bigg\{ \sum_{v_g \in V_g} \sum_{\tilde{v}_l \in \tilde{V}_l} H\big(\Gamma(E_c), \Gamma(\tilde{E}_l)\big) \Bigg\}$$  (Eq. 9)







The local embedding $\tilde{E}_l$ learns consistency guided by the global context embedding $E_c$ during the pretraining process.
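
Eqs. 8 and 9 average the pairwise consistency term over the sets of views; a self-contained sketch follows, assuming the projected embeddings are provided as lists of tensors (p anchors, p masked global views, q masked local views). The temperature and the helper names are illustrative.

import torch
import torch.nn.functional as F

def pairwise_ce(anchor_emb, masked_emb, temperature=0.1):
    """Cross-entropy between temperature-scaled softmax distributions (Eqs. 6-7)."""
    p = F.softmax(anchor_emb.detach() / temperature, dim=-1)
    return -(p * F.log_softmax(masked_emb / temperature, dim=-1)).sum(dim=-1).mean()

def global_guided_consistency(anchor_embs, masked_global_embs, masked_local_embs):
    """C_gg and C_gl: cross-entropy averaged over every pair of unmasked global
    anchor and masked global / local view (Eqs. 8 and 9)."""
    c_gg = sum(pairwise_ce(e_c, e_g) for e_c in anchor_embs for e_g in masked_global_embs)
    c_gl = sum(pairwise_ce(e_c, e_l) for e_c in anchor_embs for e_l in masked_local_embs)
    c_gg = c_gg / (len(anchor_embs) * len(masked_global_embs))
    c_gl = c_gl / (len(anchor_embs) * len(masked_local_embs))
    return c_gg, c_gl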


Overall Loss Function

The overall loss function is represented by:











$$\mathcal{L} = \mathcal{R}_l + \beta_1 \mathcal{R}_g + \beta_2 \mathcal{C}_{gg} + \beta_3 \mathcal{C}_{gl}$$  (Eq. 10)







where $\beta_1$, $\beta_2$, and $\beta_3$ are hyper-parameters used to balance the relative contributions of the four loss terms. The values of $\beta_1$, $\beta_2$, and/or $\beta_3$ may be tuned according to various usage scenarios. In some embodiments, $\beta_1$, $\beta_2$, and $\beta_3$ are empirically set to 1.0 in experiments.
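
For completeness, Eq. 10 reduces to a weighted sum of the four terms; in this sketch the default weights of 1.0 mirror the empirical setting mentioned above.

def overall_loss(r_l, r_g, c_gg, c_gl, beta1=1.0, beta2=1.0, beta3=1.0):
    """Eq. 10: weighted sum of the reconstruction and consistency losses."""
    return r_l + beta1 * r_g + beta2 * c_gg + beta3 * c_gl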



FIG. 6 is a block diagram of an exemplary process 600 implementing GL-MAE, in accordance with one or more embodiments of the present disclosure. The process 600 may be performed by a computing system 140 as shown in FIG. 1B operating in the network environment 100. GL-MAE is implemented on the computing system 140. The computing system 140 may include one or more client devices 120, one or more servers 130, or any combination thereof in the network environment 100 as shown in FIG. 1A. One or more processor(s) 150 in the computing system 140 may execute instructions stored in the memory 160 to perform any of the blocks as demonstrated in the process 600. The one or more processor(s) 150 in the computing system 140 may perform the process 600 to facilitate the flow diagram 400 as depicted in FIG. 4 and/or the framework 500 as demonstrated in FIG. 5. The process 600 may be performed alone or in combination with other processes described in the present disclosure. It will be appreciated by one skilled in the art that the process 600 may be performed in any suitable environment and blocks in the process 600 may be performed in any suitable order.


At block 610, the computing system receives images of various views based on input image data. The various views include a global complete view, and one or more masked views. In an embodiment, the one or more masked views includes a global masked view and a local masked view.


At block 620, the computing system generates representations corresponding to the global complete view and the one or more masked views.


At block 630, the computing system generates one or more reconstructed images corresponding to the one or more masked views.


At block 640, the computing system evaluates, in a shared representation space, consistency between the representations of the global complete view and the one or more masked views. For example, the computing system projects the representations of the global complete view and the one or more masked views to the shared representation space for evaluation. In some embodiments, the computing system evaluates first consistency between the representations of the global complete view and the global masked view, and second consistency between the representations of the global complete view and the local masked view.


At block 650, the computing system computes one or more losses based on the one or more reconstructed images and the results from the consistency evaluation.


When blocks 610-650 are performed during training of a NN model implementing GL-MAE, the computing system updates one or more parameters in the NN model based on the one or more losses (e.g., as indicated in block 660).


During inference, the computing system generates outputs based on the one or more reconstructed images and/or the evaluation results (e.g., as indicated in dashed block 670).


The trained NN model with GL-MAE may be applied to various tasks, including image segmentation and disease classification. For image segmentation, the NN model with GL-MAE effectively segments CT scans or Magnetic Resonance Imaging (MRI) scans by identifying areas of interest such as diseased organs and tumor regions. In disease classification, the NN model with GL-MAE provides accurate diagnostic results based on CT scans or MRI scans, including the diagnosis of Covid-19.


In some embodiments, the computing system is communicatively connected to one or more image acquisition devices. The image acquisition device(s) is configured to obtain volumetric image data and transmit the data to the computing system for data processing, for example, through the communication interface 170. The image acquisition devices may include various cameras, a CT scanner, or an MRI scanner.


The following provides exemplary pseudocodes for implementing one or more processes discussed above. For example, Algorithm 1 may be used for generating various views as input to GL-MAE.


Algorithm 1: Pseudocode of data augmentation in PyTorch-like style.

# Input: volume, global_scale_factor, local_scale_factor,
#        local_crop_numbers, global_size, local_size
# Output: crops
# volume: a sample loaded from the 3D CT dataset used for pretraining,
#         already processed by spacing, ScaleIntensity, etc.
# global_scale_factor, local_scale_factor: factors to scale the image.
# local_crop_numbers: number of cropped local sub-volumes, e.g., 8.
# global_size, local_size: sizes of the global and local sub-volumes.

# Define transforms for generating local and global sub-volumes
local_transform = Compose([
    RandScaleCropd(local_scale_factor), Resized(local_size)])
global_transform = Compose([
    RandScaleCropd(global_scale_factor), Resized(global_size)])

crops = []  # Process of generating the expected sub-volumes
crops.append([global_transform(volume), global_transform(volume)])
for _ in range(local_crop_numbers):
    crops.append(local_transform(volume))

Algorithm 2 may be used for implementing GL-MAE in pretraining.


Algorithm 2: Pseudocode of the proposed SSL pretraining in PyTorch-like style.

# Input: loader/encoder/decoder/m_encoder/proj_m/proj_s and hyper-parameters, etc.
# Output: encoder
# loader: 3D CT dataloader for pretraining
# encoder/decoder/m_encoder/proj_m/proj_s: models used for
#         pretraining / reconstruction / projection
# m: momentum factor; mask_ratio: masking ratio

# Initialization; the encoder is applied with the mask ratio,
# the momentum encoder without it
m_encoder.params = encoder.params
for epoch in range(n_epochs):
    for xs in loader:
        # Augment with the data augmentation transform defined in Algorithm 1
        g_xs, l_xs = aug(xs)
        # Obtain masked global and local features from the encoder with mask_ratio
        g_features, l_features = encoder.forward([g_xs, l_xs], mask_ratio)
        # Reconstruct the global and local volumes with the decoder
        g_volume, l_volume = decoder.forward([g_features, l_features])
        # Project the masked-view features into the shared representation space
        s_g_feature, s_l_feature = proj_s.forward([g_features, l_features])
        # Obtain the global complete-view feature from the momentum encoder
        m_g_features = m_encoder.forward(g_xs, mask_ratio=0)
        m_g_features = proj_m.forward(m_g_features)
        # Optimize with the following losses
        loss_rec = L_rec(g_xs, l_xs, g_volume, l_volume)          # Equations 3 and 4
        loss_con = L_con(m_g_features, s_g_feature, s_l_feature)  # Equations 8 and 9
        loss = loss_rec + alpha * loss_con
        loss.backward()
        update(encoder, decoder, proj_m, proj_s)
        # Momentum update of the momentum encoder
        m_encoder.params = (1 - m) * m_encoder.params + m * encoder.params






Illustrative experimental results are described in detail below to demonstrate efficacy and advantages of the present disclosure. Additional details and advantages relating to exemplary embodiments of the present disclosure are discussed by Zhuang et al. in “ADVANCING VOLUMETRIC MEDICAL IMAGE SEGMENTATION VIA GLOBAL-LOCAL MASKED AUTOENCODER,” (available at arXiv: 2306.08913), which is hereby incorporated by reference in its entirety.


Datasets and Evaluation Metrics

The SSL pretraining experiments were carried out on the Beyond the Cranial Vault (BTCV) abdomen challenge dataset (Landman et al. 2015) and the TCIA Covid19 dataset (An et al. 2020). For downstream tasks, experiments were mainly conducted on the BTCV dataset, following previous work (Chen et al. 2023; Tang et al. 2022). To assess the model's generalization on Computed Tomography (CT) datasets, its effectiveness was also evaluated on MM-WHS (Zhuang 2018), Medical Segmentation Decathlon (MSD) Task 09 Spleen, and the COVID-19-20 Lung CT Lesion Segmentation Challenge dataset (Covid-19-20 dataset) (Roth et al. 2022). The model was further transferred to Brain Tumor Segmentation (BraTS) (Simpson et al. 2019) to assess its cross-modality generalization ability. All datasets used were collected from open sources and can be obtained via the cited papers. The Dice score (%) was used as the evaluation metric, following (Chen et al. 2023; Tang et al. 2022).


Pretraining Setting

ViT is used as the transformer-based backbone; both ViT-Tiny (ViT-T) and ViT-Base (ViT-B) were used in the experiments. Unless otherwise specified, the pretraining phase was conducted for 1600 epochs for ViT-T and ViT-B, with an initial learning rate of 1e-2, the AdamW optimizer (Kingma and Ba 2014), and a batch size of 256, on four 3090Ti GPUs for 3 days. For global views (e.g., vg in 510 and 520 in FIG. 5), images were scaled by a random ratio from the range [0.5, 1], cropped, and then resized to [160, 160, 160], while for local views (e.g., vl in 530 in FIG. 5), images were scaled by a random ratio from the range [0.25, 0.5], cropped, and resized to [96, 96, 96]. Finally, all image intensities were normalized from [−1000, 1000] to [0, 1]. p and q were set to 2 and 8, respectively.
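

A minimal sketch of the intensity preprocessing and view configuration described above is given below. The clipping window of [−1000, 1000] and the target range [0, 1], as well as the view sizes and scale ranges, come from the text; the tensor layout and the use of plain PyTorch rather than a specific transform library are assumptions.

import torch

def normalize_ct(volume: torch.Tensor,
                 lo: float = -1000.0, hi: float = 1000.0) -> torch.Tensor:
    """Clip CT intensities to [lo, hi] and rescale linearly to [0, 1]."""
    volume = volume.clamp(lo, hi)
    return (volume - lo) / (hi - lo)

# Global views: random scale in [0.5, 1.0], then crop and resize to 160^3.
# Local views:  random scale in [0.25, 0.5], then crop and resize to 96^3.
GLOBAL_SCALE_RANGE, GLOBAL_SIZE = (0.5, 1.0), (160, 160, 160)
LOCAL_SCALE_RANGE, LOCAL_SIZE = (0.25, 0.5), (96, 96, 96)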


Finetuning Setting

UNETR (Hatamizadeh et al. 2022) is adopted as the segmentation framework, and finetuning is performed on the BTCV dataset. For linear evaluation, which freezes the encoder parameters and finetunes only the segmentation decoder head, the model was finetuned for 3000 epochs with an initial learning rate of 1e-2, trained on a single 3090Ti GPU with a batch size of 4. For end-to-end segmentation, the model was trained on four 3090Ti GPUs for 3000 epochs, with a batch size of 4 and an initial learning rate of 3e-4.
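

Linear evaluation as described here amounts to freezing the pretrained encoder and optimizing only the decoder head. A sketch of that setup follows; the attribute names model.encoder and model.decoder, and the optimizer hyper-parameters other than the 1e-2 learning rate, are illustrative assumptions.

import torch

def build_linear_eval_optimizer(model, lr: float = 1e-2):
    """Freeze the pretrained encoder; train only the segmentation decoder head."""
    for p in model.encoder.parameters():
        p.requires_grad = False          # encoder weights stay fixed
    head_params = [p for p in model.decoder.parameters() if p.requires_grad]
    return torch.optim.AdamW(head_params, lr=lr)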


Experiment Results on Downstream Tasks

End-to-end and linear evaluation. To assess the effectiveness of this method, end-to-end segmentation experiments were conducted on three datasets, following (Tang et al. 2022; Chen et al. 2023; He et al. 2023). FIG. 7 presents Table 1 (700), which compares the performance of various methods. SL represents Supervised Learning.


As shown in Table 1 (700), GL-MAE (10th row) outperformed the supervised baseline (3rd row) by a large margin (82.33% vs. 79.61%, 95.72% vs. 94.20%, and 88.88% vs. 83.85%) with the full training dataset, indicating that GL-MAE enables the model to benefit from the unlabeled dataset. MAE3D is a recently proposed, competitive SSL strategy in medical image analysis that has shown superiority (9th row), particularly on dense prediction tasks such as segmentation. GL-MAE (10th row) outperformed MAE3D (9th row), which further confirms its effectiveness. SegResNet and 3D U-Net are supervised models with competitive performance. Swin-UNETR uses the 3D Swin Transformer as its backbone, while GLSV was designed for cardiac CT images. All SSL methods (4th-10th rows) achieved better performance than the supervised baseline, while GL-MAE (10th row) achieved the best performance over the three datasets even when using only 25% and 50% of the annotations in the training datasets. This indicates the superior generalization ability of GL-MAE.


GL-MAE showed consistent performance when using a more lightweight transformer, ViT-T, which requires fewer computational resources and supports faster training and inference. FIG. 8 presents Table 2 (800), which compares the performance of various methods.


As shown in Table 2 (800), GL-MAE outperformed other methods by a large margin in terms of average Dice score, Normalized Surface Dice, and Hausdorff distance in both the linear and end-to-end segmentation evaluation settings. This demonstrates the versatility of GL-MAE when adapted to a lightweight backbone, which is necessary in certain situations such as surgical robots.


Generalization on unseen datasets. MM-WHS is a small-scale cardiac organ dataset that was not involved in the pretraining. The experimental findings demonstrated that GL-MAE significantly improved the average Dice score compared with MAE3D, indicating strong generalization capabilities. Furthermore, there were substantial improvements in performance on the aorta, LV, and RV, which share analogous structural features with the training data. This suggests that GL-MAE can exploit the structural consistency between organs across varying datasets and generalize effectively to novel, unseen datasets.


COVID-19 lesion segmentation. CT scans are commonly used in diagnosing COVID-19, yet there is a shortage of annotated data. GL-MAE has been shown to improve COVID-19 lesion segmentation performance, indicating that it can capture valuable knowledge from unlabeled CT datasets to improve disease diagnosis and demonstrating the versatility of the proposed method in practical clinical settings.


GL-MAE may be applied to other downstream tasks for other applications, such as processing Magnetic Resonance Imaging (MRI) datasets.


Analysis of the Framework

Ablation study. The framework 500 as demonstrated in FIG. 5 was studied. To better understand the impact of each loss term (e.g., in Equation 10), an ablation study of the global and local terms for both reconstruction and global-guided consistency learning was conducted. FIG. 9 presents Table 3 (900), which showcases the ablation studies conducted on the BTCV validation dataset under both linear and end-to-end segmentation settings, as well as on MM-WHS under end-to-end segmentation.


ViT-B and ViT-T were both considered as the backbone of the framework 500. The first row represents the supervised baseline without any of the related strategies. In the 2nd row, instead of reconstructing a single local patch per iteration, the GL-MAE method reconstructs the local patches q times in each iteration, thereby learning richer representations and exhibiting better performance most of the time. In the 3rd row, reconstruction of the global patches was added, further improving performance since the model can learn the global context as well as the local details. In the 4th row, the global-to-global consistency was introduced so that the learned representation is more robust to the distortion caused by masking and captures the critical information, leading to further performance improvement. The last row adds the global-local consistency, which aims to capture the relationship between different parts of the images and their main semantics. The final objective achieved the best performance across several datasets with various settings, demonstrating the importance of global information for volumetric data.
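

Read together, the ablation rows build up toward a full objective that can be summarized as below. The symbols are a paraphrase of the four ablated terms rather than a verbatim restatement of Equation 10, with $\alpha$ denoting the consistency weight as in Algorithm 2:

$\mathcal{L} = \left(\mathcal{L}_{rec}^{local} + \mathcal{L}_{rec}^{global}\right) + \alpha \left(\mathcal{L}_{con}^{g\rightarrow g} + \mathcal{L}_{con}^{g\rightarrow l}\right)$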


Label-efficient finetuning. An evaluation of GL-MAE was conducted under a semi-supervised learning scheme with ViT-T as the backbone. The experimental results suggested that GL-MAE can improve the Dice score even when the amount of annotated training data is limited. Transformer-based models are prone to overfitting on limited labeled data due to their dense connections, while GL-MAE can reduce the need for labeled data and effectively enhance performance even in low-annotation learning scenarios.


Convergence comparison. GL-MAE exhibits faster convergence and superior performance compared to MAE3D. This suggests that pretraining with global complete views and masked views can help stabilize the training process, resulting in faster convergence and more powerful representation. By utilizing global complete views as an “anchor” for local and global masked views, the model establishes a stronger relationship through global-guided consistency. The integration of global context information and different scale reconstructions enhances the overall performance of GL-MAE and contributes to its superior results.


Analysis of mask ratio. The impact of the mask ratio for the reconstruction of GL-MAE was studied. The findings showed that the mask ratio may be adjusted to balance between preserving important information and providing enough diversity for the model to learn robust representations.
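

For reference, a minimal sketch of random patch masking at a configurable ratio is given below. The per-sample random permutation of patch tokens follows the standard MAE recipe and is an assumption here, not a verbatim restatement of this disclosure.

import torch

def random_masking(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """Keep a random subset of patch tokens; return the kept tokens and a
    binary mask (1 = masked) usable by the reconstruction loss.

    tokens: (B, N, C) patch embeddings of a sub-volume (assumed layout).
    """
    B, N, C = tokens.shape
    n_keep = max(1, int(N * (1.0 - mask_ratio)))
    noise = torch.rand(B, N, device=tokens.device)   # per-sample random scores
    ids_shuffle = noise.argsort(dim=1)               # random permutation of tokens
    ids_keep = ids_shuffle[:, :n_keep]
    kept = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, C))
    mask = torch.ones(B, N, device=tokens.device)
    mask.scatter_(1, ids_keep, 0.0)                  # 0 = visible, 1 = masked
    return kept, mask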


Scaling to larger data. Experimental results demonstrated the ability of GL-MAE to scale to larger amounts of data and improve performance on downstream tasks.


Visualization. Studies showed that GL-MAE can improve the completeness of segmentation results (e.g., the patches in the reconstructed global and local images). For example, some experiments showed that the segmentation results obtained with GL-MAE were better than those obtained with MAE3D and Swin-UNETR, particularly in terms of completeness for larger organs.


It is noted that the techniques described herein may be embodied in executable instructions stored in a non-transitory computer-readable medium for use by or in connection with a processor-based instruction execution machine, system, apparatus, or device. It will be appreciated by those skilled in the art that, for some embodiments, various types of computer-readable media can be included for storing data. As used herein, a “computer-readable medium” includes one or more of any suitable media for storing the executable instructions of a computer program such that the instruction execution machine, system, apparatus, or device may read (or fetch) the instructions from the computer-readable medium and execute the instructions for carrying out the described embodiments. Suitable storage formats include one or more of an electronic, magnetic, optical, and electromagnetic format. A non-exhaustive list of conventional exemplary computer-readable media includes: a portable computer diskette; a random-access memory (RAM); a read-only memory (ROM); an erasable programmable read-only memory (EPROM); a flash memory device; and optical storage devices, including a portable compact disc (CD), a portable digital video disc (DVD), and the like.


It should be understood that the arrangement of components illustrated in the attached Figures is for illustrative purposes and that other arrangements are possible. For example, one or more of the elements described herein may be realized, in whole or in part, as an electronic hardware component. The elements may be implemented in software, hardware, or a combination of software and hardware. Moreover, some or all of these elements may be combined, some may be omitted altogether, and additional components may be added while still achieving the functionality described herein. Thus, the subject matter described herein may be embodied in many different variations, and all such variations are contemplated to be within the scope of the claims.


To facilitate an understanding of the subject matter described herein, many aspects are described in terms of sequences of actions. It will be recognized by those skilled in the art that the various actions may be performed by specialized circuits or circuitry, by program instructions being executed by one or more processors, or by a combination of both. The description herein of any sequence of actions is not intended to imply that the specific order described for performing that sequence must be followed. All methods described herein may be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context.


The use of the terms “a” and “an” and “the” and similar references in the context of describing the subject matter (particularly in the context of the following claims) are to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context. The use of the term “at least one” followed by a list of one or more items (for example, “at least one of A and B”) is to be construed to mean one item selected from the listed items (A or B) or any combination of two or more of the listed items (A and B), unless otherwise indicated herein or clearly contradicted by context. Furthermore, the foregoing description is for the purpose of illustration only, and not for the purpose of limitation, as the scope of protection sought is defined by the claims as set forth hereinafter together with any equivalents thereof. The use of any and all examples, or exemplary language (e.g., “such as”) provided herein, is intended merely to better illustrate the subject matter and does not pose a limitation on the scope of the subject matter unless otherwise claimed. The use of the term “based on” and other like phrases indicating a condition for bringing about a result, both in the claims and in the written description, is not intended to foreclose any other conditions that bring about that result. No language in the specification should be construed as indicating any non-claimed element as essential to the practice of the invention as claimed.

Claims
  • 1. A system for training a neural network model, comprising: an image acquisition device configured to obtain volumetric image data; and a computing system in communication with the image acquisition device, wherein the computing system is configured to: obtain the volumetric image data from the image acquisition device; generate, based on the volumetric image data, images of a plurality of views, the plurality of views comprising a global complete view and one or more masked views; generate, by a Global-Local Masked AutoEncoder (GL-MAE) encoder system, representations corresponding to the global complete view and the one or more masked views; generate, by the GL-MAE encoder system, one or more reconstructed images corresponding to the one or more masked views; evaluate, by the GL-MAE encoder system, in a shared representation space, consistency between the representations of the global complete view and the one or more masked views; compute, by the GL-MAE encoder system, one or more losses based on the one or more reconstructed images and the results from the consistency evaluation; and update, by the GL-MAE encoder system, one or more parameters in the neural network model based on the one or more losses.
  • 2. The system of claim 1, wherein the one or more masked views comprises a global masked view and a local masked view, and wherein the one or more reconstructed images comprises reconstructed global masked view images and reconstructed local masked view images, wherein the one or more processors are further configured to: evaluate first consistency between the representations of the global complete view and the global masked view; and evaluate second consistency between the representations of the global complete view and the local masked view.
  • 3. The system of claim 1, wherein the one or more processors are further configured to: project the representations of the global complete view and the one or more masked views to the shared representation space.
  • 4. The system of claim 1, wherein the one or more losses comprises one or more reconstruction losses based on the one or more reconstructed images and one or more consistency losses based on the evaluation results.
  • 5. The system of claim 1, wherein the one or more processors are further configured to: encode, using a first encoder, global complete view images among the received images to generate first representations corresponding to the global complete view images; and encode, using a second encoder, masked view images among the received images to generate second representations corresponding to the masked view images, wherein the first encoder is obtained based on the second encoder.
  • 6. The system of claim 5, wherein parameters in the first encoder are updated using a momentum factor that is dynamically computed based on learnable parameters in the second encoder.
  • 7. A method for training a neural network model, comprising: receiving, by a Global-Local Masked AutoEncoder (GL-MAE) encoder system, images of a plurality of views, the plurality of views comprising a global complete view and one or more masked views; generating, by the GL-MAE encoder system, representations corresponding to the global complete view and the one or more masked views; generating, by the GL-MAE encoder system, one or more reconstructed images corresponding to the one or more masked views; evaluating, by the GL-MAE encoder system, in a shared representation space, consistency between the representations of the global complete view and the one or more masked views; computing, by the GL-MAE encoder system, one or more losses based on the one or more reconstructed images and the results from the consistency evaluation; and updating, by the GL-MAE encoder system, one or more parameters in the neural network model based on the one or more losses.
  • 8. The method of claim 7, wherein the one or more masked views comprises a global masked view and a local masked view, and wherein the one or more reconstructed images comprises reconstructed global masked view images and reconstructed local masked view images.
  • 9. The method of claim 8, wherein evaluating, in the shared representation space, consistency between the representations of the global complete view and the one or more masked views further comprises: evaluating first consistency between the representations of the global complete view and the global masked view; and evaluating second consistency between the representations of the global complete view and the local masked view.
  • 10. The method of claim 7, further comprising: projecting the representations of the global complete view and the one or more masked views to the shared representation space.
  • 11. The method of claim 7, wherein the one or more losses comprises one or more reconstruction losses based on the one or more reconstructed images and one or more consistency losses based on the evaluation results.
  • 12. The method of claim 11, wherein weight of each loss of the one or more losses is tunable.
  • 13. The method of claim 7, wherein generating the representations corresponding to the global complete view and the one or more masked views further comprises: encoding, using a first encoder, global complete view images among the received images to generate first representations corresponding to the global complete view images; and encoding, using a second encoder, masked view images among the received images to generate second representations corresponding to the masked view images, wherein the first encoder is obtained based on the second encoder.
  • 14. The method of claim 13, wherein parameters in the first encoder are updated using a momentum factor that is dynamically computed based on learnable parameters in the second encoder.
  • 15. The method of claim 7, further comprising: receiving a plurality of volumetric medical images; obtaining the images of the plurality of views by applying at least one of cropping, scaling, and downsampling; and obtaining images of the one or more masked views by applying masks with a predefined ratio.
  • 16. The method of claim 7, further comprising: performing, using the neural network model, segmentation on an input image to identify one or more regions of interest.
  • 17. A non-transitory computer-readable medium, having computer-executable instructions stored thereon, for training a neural network model, the computer-executable instructions, when executed by one or more processors, causing the one or more processors to carry out: receiving images of a plurality of views, the plurality of views comprising a global complete view and one or more masked views; generating representations corresponding to the global complete view and the one or more masked views; generating one or more reconstructed images corresponding to the one or more masked views; evaluating, in a shared representation space, consistency between the representations of the global complete view and the one or more masked views; computing one or more losses based on the one or more reconstructed images and the results from the consistency evaluation; and updating one or more parameters in the neural network model based on the one or more losses.
  • 18. The non-transitory computer-readable medium of claim 17, wherein the one or more masked views comprises a global masked view and a local masked view, and wherein the one or more reconstructed images comprises reconstructed global masked view images and reconstructed local masked view images.
  • 19. The non-transitory computer-readable medium of claim 18, wherein evaluating, in the shared representation space, consistency between the representations of the global complete view and the one or more masked views further comprises: evaluating first consistency between the representations of the global complete view and the global masked view; and evaluating second consistency between the representations of the global complete view and the local masked view.
  • 20. The non-transitory computer-readable medium of claim 17, wherein the one or more processors further carry out: projecting the representations of the global complete view and the one or more masked views to the shared representation space.
CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Application No. 63/512,905, filed Jul. 10, 2023, the entirety of which is incorporated herein by reference.

Provisional Applications (1)
Number Date Country
63512905 Jul 2023 US