The present disclosure generally relates to artificial intelligence (AI) applications in medical imaging; and more particularly to a self-supervised learning approach for patch embedding of anatomical consistency applicable to, e.g., medical image analysis.
Labeling medical images is tedious, laborious, and time-consuming and demands specialty-oriented expertise. Most AI-driven image analysis methods have been developed for photographic images, and directly adopting these to medical images may not achieve optimal results because medical images are markedly different from photographic images. Photographic images, like those in ImageNet, are object-centric, where dominant objects (e.g., dogs and cats) are located at the center with backgrounds of large variation. Naturally, these AI methods developed for photographic images mostly learn from foreground objects. By contrast, medical images acquired with the same imaging protocol have similar anatomical structures, and imaging diagnosis requires not only analyzing “foreground” objects: diseases (abnormalities) but also understanding “background” anatomical structures; furthermore, diseases are often small and obscured in “background” anatomical structures.
It is with these observations in mind, among others, that various aspects of the present disclosure were conceived and developed.
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
Corresponding reference characters indicate corresponding elements among the views of the drawings. The headings used in the figures do not limit the scope of the claims.
The present disclosure relates to a self-supervised learning (SSL) model implemented by at least one processor and trained for patch embedding of anatomical consistency (for, e.g., chest radiography).
More specifically, SSL approaches have recently shown substantial success in learning visual representations from unannotated images. Compared with photographic images, medical images acquired with the same imaging protocol exhibit high consistency in anatomy. To exploit this anatomical consistency, a novel SSL approach, dubbed PEAC (patch embedding of anatomical consistency), is described herein for medical image analysis. In example implementations of the model, it is proposed to learn global and local consistencies via stable grid-based matching, to transfer pre-trained PEAC models to diverse downstream tasks, and to extensively demonstrate that (1) PEAC achieves significantly better performance than existing state-of-the-art fully-supervised and self-supervised methods, and (2) PEAC effectively captures the anatomical structure consistency between patients of different genders and weights and between different views of the same patient, which enhances the interpretability of the proposed methods for medical image analysis.
In some embodiments the present disclosure describes a system for medical image analysis comprising:
In some embodiments, the model, associated with a self-supervised learning (SSL) framework that defines a Student-Teacher model, may extract features of two crops simultaneously.
In some embodiments, the SSL framework includes an image augmentation and restoration module that aims to restore image crops from two augmentations: shuffled patches and added noise.
In some embodiments, the SSL framework includes a global module that aims to force the model to learn coarse-grained global features of the two crops.
In some embodiments, the SSL framework includes a local module that aims to force the model to learn fine-grained local features from overlapped patches.
In some embodiments, under the SSL framework the model learns coarse-grained, fine-grained and contextualized high-level anatomical structure features.
In some embodiments, prior to input to the model, the plurality of medical images is pre-processed by grid-wise cropping to obtain two crops x, x′ ∈ R^(C×H×W), where C is the number of channels and (H, W) are the crops' spatial dimensions.
In some embodiments, the two crops are input to the Student and Teacher encoders f_θs and f_θt, respectively.
In some embodiments, average pooling operators ⊕: R^(D×H×W) → R^D are performed on the local features, and the pooled representations are denoted y_s⊕ and y_t⊕ ∈ R^D.
In some embodiments, the processor applying the model in view of the plurality of medical images matches the anatomical structures across different patients.
In some embodiments, the processor applying the model in view of the plurality of medical images matches the anatomical structures across different views of the same patient.
In some embodiments, the model calculates the consistency loss based on the absolute positions of overlapping image patches of the plurality of medical images.
In some embodiments, the model as trained:
In some embodiments, the model integrates the first crop with the second crop to learn consistent contextualized embedding for coarse-grained global anatomical structures.
In some embodiments, analogous regions of the plurality of medical images are captured by the first and second crops so that global embedding consistency encourages extraction of features of similar local regions.
In some embodiments, the model learns fine-grained and precise anatomical structures from local patch embeddings of overlapped parts.
In some embodiments, the model defines a network that considers both global and local features of medical images at the same time.
In some embodiments, the model localizes arbitrary anatomical structures across views of the same patient and across patients of different genders and weights and of health and disease.
In some embodiments, the present disclosure describes a method comprising:
In some embodiments, the present disclosure describes a non-transitory, computer-readable medium storing instructions encoded thereon that, when executed by one or more processors, cause the one or more processors to perform operations to:
Self-supervised learning (SSL) pretrains generic source models without using expert annotation, allowing the pretrained generic source models to be quickly fine-tuned into high-performance application-specific target models and minimizing annotation cost. This paradigm is particularly attractive in medical imaging because labeling medical images is tedious, laborious, and time-consuming and demands specialty-oriented expertise. However, most existing SSL methods were developed for photographic images, and directly adopting these SSL methods to medical images may not achieve optimal results because medical images are markedly different from photographic images. Photographic images, like those in ImageNet, are object-centric, where dominant objects (e.g., dogs and cats) are located at the center with backgrounds of large variation. Naturally, these SSL methods developed for photographic images mostly learn from foreground objects. By contrast, medical images acquired with the same imaging protocol have similar anatomical structures, and imaging diagnosis requires not only analyzing “foreground” objects: diseases (abnormalities) but also understanding “background” anatomical structures; furthermore, diseases are often small and obscured in “background” anatomical structures. In the present disclosure, exemplary embodiments are illustrated with chest X-rays because the chest contains several critical organs prone to a number of diseases associated with significant healthcare costs, and chest X-rays are one of the most frequently used modalities in imaging the chest. Referring to
As illustrated in
To answer this question, a novel SSL framework is presented, called PEAC (patch embedding of anatomical consistency), configured to exploit global and local patterns in health and disease.
As illustrated in
Several features of the PEAC framework and example model implementations include, but are not limited to:
Global features describe the overall appearance of the image. Most recent methods for global feature learning are designed to ensure that the extracted global features are consistent across different views. The methods to achieve this include contrastive and non-contrastive learning methods. Contrastive methods bring representations of different views of the same image closer and spread representations of views from different images apart. Non-contrastive methods rely on keeping the informational content of the representations consistent through either explicit regularization or architecture design, such as a Siamese architecture. In contrast to global methods, local features describe information that is specific to smaller regions of the image. In local feature learning methods, a contrastive or consistency loss can be applied directly at the pixel level, feature map level, or image region level, which forces consistency between pixels at similar locations, between groups of pixels, and between large regions that overlap in different views of an image. However, at present, the vast majority of methods that use local features calculate embedding consistency or contrastive learning loss based on the relative positions of the features, such as the feature vectors of semantically closest patches or spatially nearest-neighbor patches. In contrast, the PEAC method described herein calculates the consistency loss based on the absolute positions of overlapping image patches shown in
As depicted in the Examples below, in certain exemplary embodiments, methods of the present disclosure are illustrated with chest X-rays because the chest contains several critical organs prone to a number of diseases associated with significant healthcare costs, and chest X-rays are one of the most frequently used modalities in imaging the chest. It will be appreciated that, although the general methods depict analysis of chest X-rays, they can be applied to other medical images.
One goal of a method associated with Patch Embedding of Anatomical Consistency (PEAC) is to learn global and local anatomical structures underneath medical images. In medical images, there are many local patterns, such as the spinous processes, clavicles, mainstem bronchi, hemidiaphragms, and the osseous structures of the thorax. The analogous regions can be captured by the two global crops shown in
As shown in
Before being input to the model, seed images are pre-processed by grid-wise cropping to get two crops x, x′ ∈ R^(C×H×W), where C is the number of channels and (H, W) are the crops' spatial dimensions, as shown in
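For illustration only, below is a minimal PyTorch sketch of such grid-wise cropping. It assumes crop corners snap to the patch grid so that patches of the two crops coincide exactly; the function name, signature, and snapping policy are illustrative assumptions rather than the disclosed implementation.

```python
import random
import torch

def grid_wise_crop_pair(image: torch.Tensor, crop_hw: int = 224,
                        patch: int = 16, min_overlap: float = 0.5):
    """Sample two crops whose corners snap to a patch-size grid so that
    image patches of the two crops coincide exactly (hypothetical helper)."""
    _, H, W = image.shape
    g = patch                                            # grid step = patch resolution
    max_shift = int((1.0 - min_overlap) * crop_hw) // g  # max shift, in grid units
    # First crop: a random grid-aligned top-left corner.
    y0 = random.randint(0, (H - crop_hw) // g) * g
    x0 = random.randint(0, (W - crop_hw) // g) * g
    # Second crop: shifted by a whole number of patches, keeping the
    # 50%-100% overlap rate described in the text.
    dy = random.randint(-min(y0 // g, max_shift),
                        min((H - crop_hw - y0) // g, max_shift)) * g
    dx = random.randint(-min(x0 // g, max_shift),
                        min((W - crop_hw - x0) // g, max_shift)) * g
    crop_a = image[:, y0:y0 + crop_hw, x0:x0 + crop_hw]
    crop_b = image[:, y0 + dy:y0 + dy + crop_hw, x0 + dx:x0 + dx + crop_hw]
    return crop_a, crop_b, (dy // g, dx // g)            # offset in whole patches
```

Because the shift is a whole number of patches, the offset returned here directly yields the overlapping patch indexes used by the local consistency loss sketched further below.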
The loss from Eq. 1 is symmetrized by separately feeding x to the Teacher encoder and x′ to the Student encoder to compute a mirrored term; the total global consistency loss L_G is the sum of the two terms.
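For illustration, a minimal sketch of this symmetrized global consistency computation follows. It assumes the pooled Student and Teacher embeddings are compared with a negative-cosine distance and that gradients are stopped through the Teacher; the distance choice and helper names are assumptions.

```python
import torch
import torch.nn.functional as F

def global_consistency(y_s: torch.Tensor, y_t: torch.Tensor) -> torch.Tensor:
    """One term of Eq. 1 (sketch): distance between the pooled Student
    embedding y_s⊕ and the pooled Teacher embedding y_t⊕, both (B, D).
    The negative-cosine distance is an assumption."""
    return 2 - 2 * F.cosine_similarity(y_s, y_t.detach(), dim=-1).mean()

def symmetrized_global_loss(f_s, f_t, pool, x, x_prime):
    """L_G: the Eq. 1 term plus its mirror, obtained by swapping which
    crop feeds the Student encoder f_s and the Teacher encoder f_t."""
    loss = global_consistency(pool(f_s(x)), pool(f_t(x_prime)))
    loss = loss + global_consistency(pool(f_s(x_prime)), pool(f_t(x)))
    return loss
```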
As the encoders are associated with a Vision Transformer network, each crop is divided into a sequence of N non-overlapping image patches P = (p_1, p_2, . . . , p_N), where N = HW/m^2 and m is the patch resolution. The encoder of the Student-Teacher model extracts local features s, t ∈ R^(D×N) from the two crops x, x′. One can denote s_k and t_k ∈ R^D the feature vectors at position k ∈ [1, . . . , N] in their corresponding feature maps. Since the image patches are randomly sampled from an image grid with an overlap rate of 50%-100%, the overlapping image patches O_m, O_n are defined for x and x′ respectively, where m ∈ [m_1, . . . , m_z] and n ∈ [n_1, . . . , n_z] are the patch indexes of the overlapping region and z is the number of overlapping patches. Patch order distortion changes the positions of the overlapping patches and is accounted for when matching them, while patch appearance distortion has no impact on the matching. To align the output of the Student and Teacher networks regarding local features, the following local patch embedding consistency loss function is defined in Eq. 2.
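For illustration, the following sketch pairs the overlapping patch indexes m_1..m_z and n_1..n_z from the known grid offset and applies a consistency loss in the spirit of Eq. 2; the index helper and the squared-error distance are assumptions.

```python
import torch
import torch.nn.functional as F

def overlap_indexes(dy: int, dx: int, grid: int = 14):
    """Patch indexes (m_1..m_z for crop x, n_1..n_z for crop x') of the
    overlapping region when x' is offset from x by (dy, dx) whole patches
    on a grid x grid layout (hypothetical helper)."""
    rows = range(max(dy, 0), grid + min(dy, 0))
    cols = range(max(dx, 0), grid + min(dx, 0))
    idx_m = torch.tensor([r * grid + c for r in rows for c in cols])
    idx_n = idx_m - dy * grid - dx           # same tissue location in x'
    return idx_m, idx_n

def local_consistency(s: torch.Tensor, t: torch.Tensor,
                      idx_m: torch.Tensor, idx_n: torch.Tensor) -> torch.Tensor:
    """Eq. 2 (sketch): align Student embeddings s_m with Teacher
    embeddings t_n over the z overlapping positions. s, t are (B, D, N)
    local feature maps; the squared-error distance is an assumption."""
    p_s = s[:, :, idx_m]                     # (B, D, z) Student overlap patches
    p_t = t[:, :, idx_n].detach()            # stop-gradient through the Teacher
    return F.mse_loss(p_s, p_t)
```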
As indicated, p_m^(θs) and p_n^(θt) denote the Student and Teacher patch embeddings at the overlapping positions m and n. In addition, a patch order classification loss L_oc and a patch appearance restoration loss L_ar can be calculated for patch order distortion and patch appearance distortion in the Student branch, where n is the number of patches for each image, O_j represents the order ground truth, Ô_j represents the network's patch order prediction, and p_j and p_j^a represent the image's original appearance and the reconstruction prediction, respectively.
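For illustration, minimal sketches of the two Student-branch losses follow, assuming L_oc is a cross-entropy over the n possible original positions of each shuffled patch and L_ar is an L2 reconstruction distance; both distance choices are assumptions.

```python
import torch
import torch.nn.functional as F

def order_classification_loss(order_logits: torch.Tensor,
                              order_gt: torch.Tensor) -> torch.Tensor:
    """L_oc (sketch): cross-entropy between the predicted original
    position of each shuffled patch (order_logits: (B, n, n)) and the
    order ground truth (order_gt: (B, n) of long indexes)."""
    B, n, _ = order_logits.shape
    return F.cross_entropy(order_logits.reshape(B * n, n),
                           order_gt.reshape(B * n))

def appearance_restoration_loss(p_restored: torch.Tensor,
                                p_original: torch.Tensor) -> torch.Tensor:
    """L_ar (sketch): distance between the reconstruction prediction
    p_j^a and the original patch appearance p_j; L2 is assumed here."""
    return F.mse_loss(p_restored, p_original)
```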
Finally, the total loss is defined in Eq. 3 as the combination (e.g., a weighted sum) of the four losses: the patch order classification loss L_oc, the patch appearance restoration loss L_ar, the global consistency loss L_G(θ_s, θ_t), and the local consistency loss L_L(θ_s, θ_t), where θ_s and θ_t denote the parameters of the Student and Teacher encoders, respectively.
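For illustration, a short sketch of such a combined objective, with equal weights assumed:

```python
def peac_total_loss(L_oc, L_ar, L_G, L_L, lam=(1.0, 1.0, 1.0, 1.0)):
    """Eq. 3 (sketch): total pretraining objective as a weighted sum of
    the four losses; the equal weighting coefficients are assumptions."""
    return lam[0] * L_oc + lam[1] * L_ar + lam[2] * L_G + lam[3] * L_L
```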
Pretraining Settings. PEAC can be pretrained with Swin-B as the backbone on the unlabeled ChestX-ray14 dataset. The PEAC and PEAC−1 models can utilize Swin-B as the backbone, pretrained on an image size of 448 and fine-tuned on 448 and 224, respectively. PEAC−3 adopts ViT-B as the backbone, pretrained and fine-tuned on an image size of 224. As for the prediction heads in the Student branch, two single linear layers can be used for the classification (patch order) and restoration (patch appearance) tasks, and two 3-layer MLPs for the expanders of local and global features. The augmentations used in the Student branch include a 50% probability of patch appearance distortion and a 50% probability of shuffling patches. More details are described below.
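For illustration, a sketch of the described heads and expanders follows; the hidden widths, normalization layers, and the grayscale 16×16 patch size are assumptions.

```python
import torch.nn as nn

def expander(dim_in: int = 1024, dim_hidden: int = 2048,
             dim_out: int = 2048) -> nn.Module:
    """3-layer MLP expander applied to local/global features before the
    consistency losses; widths and normalization are assumptions."""
    return nn.Sequential(
        nn.Linear(dim_in, dim_hidden), nn.BatchNorm1d(dim_hidden), nn.ReLU(inplace=True),
        nn.Linear(dim_hidden, dim_hidden), nn.BatchNorm1d(dim_hidden), nn.ReLU(inplace=True),
        nn.Linear(dim_hidden, dim_out))

# Single linear prediction heads in the Student branch (Swin-B feature
# width 1024; 196 shufflable positions; grayscale 16x16 patches assumed):
order_head = nn.Linear(1024, 196)        # patch order classification
restore_head = nn.Linear(1024, 16 * 16)  # patch appearance restoration
```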
Target Tasks and Datasets. The effectiveness of the inventive method is validated by evaluating it on four downstream datasets: ChestX-ray14, CheXpert, NIH Shenzhen CXR, and RSNA Pneumonia. These are 2D X-ray medical-image datasets for multi-label or multi-class classification. When fine-tuning on downstream tasks, the final prediction layer is removed and only the parameters of the encoder are used. A randomly initialized linear classification head is added for each downstream dataset, and the whole network is fine-tuned for 150 epochs. Details for datasets and target tasks are provided below.
PEAC was implemented on ViT-B and Swin-B for their notable scalability, global receptive fields, and interpretability. Both PEACs were trained on ChestX-ray14 by amalgamating the official training and validation splits. In PEAC ViT-B, input images of size 224×224 lead to 196 (14×14) shufflable patches, while in PEAC Swin-B the same input size results in 49 (7×7) shufflable patches due to the Swin hierarchical architecture. To learn the same contextual relationship as in PEAC ViT-B, PEAC Swin-B was pretrained with images of size 448×448, but the tissue (physical) size covered by the images remains unchanged, resulting in the same 196 (14×14) shufflable patches in terms of the (physical) tissue size.
In PEAC, a multi-class linear layer is designated for patch order classification (Eq. 1), and a single convolutional block is employed for patch appearance restoration (Eq. 2). The global and local consistency branches utilize two 3-layer MLPs as expanders before computing consistency losses. When training PEAC, one can use a learning rate of 0.1 and a momentum of 0.9 for the SGD optimizer, a warmup period of 5 epochs, and a batch size of 8. The teacher model is updated after each iteration via EMA with an updating parameter of 0.999. Four Nvidia RTX3090 GPUs were used for training PEAC models with images of size 224×224 for 300 epochs, but the number of epochs was reduced to 150 when the image size was 448×448.
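For illustration, the per-iteration EMA teacher update can be sketched as follows; only the momentum value 0.999 is taken from the description above.

```python
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module,
               momentum: float = 0.999) -> None:
    """Teacher update after each iteration via EMA of the Student weights,
    using the stated updating parameter of 0.999."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

# Typical usage: the Teacher starts as a deep copy of the Student
# (e.g., copy.deepcopy) with gradients disabled, then ema_update() is
# called once per training iteration.
```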
The PEAC models were evaluated by finetuning on four classification target tasks:
The pre-trained PEAC models are adapted to each target task through fine-tuning of all parameters. For these tasks, a randomly initialized linear layer is appended to the output of the classification (CLS) token in PEAC ViT-B models. Due to the inherent structural divergence from the ViT-B model, the PEAC Swin-B models do not feature a CLS token; instead, an average pooling layer is applied to the final-layer feature maps, which are subsequently fed into the randomly initialized linear layer. The model's performance is evaluated using AUC (area under the ROC curve) for the multi-label classification tasks (ChestX-ray14, CheXpert, NIH Shenzhen CXR), while accuracy is employed for the multi-class classification task (RSNA Pneumonia). Fine-tuning experiments follow an optimization protocol using the AdamW optimizer, integrating a cosine learning rate schedule, a linear warmup spanning 20 epochs out of a total of 150, and a maximum learning rate of 0.0005. Batch sizes are tailored to image size, with 32 and 128 used for image sizes of 448 and 224, respectively. Each experiment was performed on a single Nvidia RTX3090 24G GPU.
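For illustration, the following sketch mirrors this transfer recipe for PEAC Swin-B (average pooling plus a randomly initialized linear head, and AdamW with a 20-epoch linear warmup into a cosine schedule); composing the warmup and decay via SequentialLR, and the backbone output shape, are assumptions.

```python
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

class SwinFineTuner(nn.Module):
    """Average pooling over the final-layer feature map (no CLS token),
    followed by a randomly initialized linear classification layer."""
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone                      # pretrained PEAC encoder
        self.head = nn.Linear(feat_dim, num_classes)  # randomly initialized

    def forward(self, x):
        feats = self.backbone(x)          # assumed (B, N, D) final feature map
        return self.head(feats.mean(dim=1))           # average pooling

def make_optimizer(model: nn.Module, steps_per_epoch: int,
                   epochs: int = 150, warmup_epochs: int = 20,
                   max_lr: float = 5e-4):
    """AdamW with linear warmup over 20 of 150 epochs, then cosine decay."""
    opt = AdamW(model.parameters(), lr=max_lr)
    warmup_steps = warmup_epochs * steps_per_epoch
    sched = SequentialLR(
        opt,
        schedulers=[LinearLR(opt, start_factor=1e-3, total_iters=warmup_steps),
                    CosineAnnealingLR(opt, T_max=(epochs - warmup_epochs) * steps_per_epoch)],
        milestones=[warmup_steps])
    return opt, sched
```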
4.7.1. Ablation Studies: PEAC Versions and their Performance
Our PEAC model involves four losses: the patch order classification loss L_oc, the patch appearance restoration loss L_ar, the global consistency loss L_G(θ_s, θ_t), and the local consistency loss L_L(θ_s, θ_t).
Some ingredients were removed from the official PEAC implementation, and the results (Table 4) show the effectiveness of all loss functions. The POPAR versions involve OD (patch order distortion) and AD (patch appearance distortion), and their losses include the patch order classification loss L_oc and the patch appearance restoration loss L_ar; the PEAC versions additionally involve the global consistency loss L_G(θ_s, θ_t) and the local consistency loss L_L(θ_s, θ_t).
Our pretraining and fine-tuning settings include two resolutions, 448×448 and 224×224. The downgraded versions PEAC-2 contain 49 pretraining shuffled patches and are pretrained and fine-tuned on 224-size images, while the downgraded versions PEAC-1 include 196 shuffled patches and are pretrained on 448 and fine-tuned on 224-size images. The performance of the official implementation PEAC (pretrained and fine-tuned on 448 images) is the best. To accelerate the training process, only two versions, PEAC and PEAC(o,a,g), were pretrained on 448 images.
The local consistency loss was added to several baseline methods, VICRegL and SimMIM, as shown in Table 5. In the instance of VICRegL, ConvNeXt serves as the backbone, with the subsequent addition of local consistency loss precipitating notable enhancements in performance across all three target tasks. The SimMIM methodology employs Swin-B as its backbone, with the sequential addition of global and local consistency losses leading to marked improvements in performance. Moreover, the removal of local consistency loss from the PEAC method corresponds to a decline in performance across the target classification tasks. This evidence underscores the efficacy of the proposed grid-matched local consistency loss.
Corresponding to Section 4.3 (3), the experiments in Table 6 demonstrate that using a Teacher-Student model with global embedding consistency can boost one-branch methods. Experiments were conducted based on SimMIM and the inventive method described herein, both of which use a Swin-B backbone, pretrained on ChestX-ray14 at an image resolution of 224 and fine-tuned at the same resolution. When adding a teacher branch to SimMIM to compute the global embedding consistency loss, the classification performance of SimMIM(g) on the three target tasks is significantly improved. Importantly, the input images of the two branches are the two global views, which are grid-wise cropped using the subject method; the student branch in SimMIM(g) receives masked patches as in SimMIM, while the teacher branch receives no augmentations of the input images. The teacher branch was also added to the one-branch methods POPARod−2 and POPAR−2 for computing the global consistency loss. The downstream performance of the two-branch Teacher-Student models PEAC(o,g)−2 and PEAC(o,a,g)−2 is much better than that of the one-branch methods POPARod−2 and POPAR−2.
To investigate how the subject method promotes sensing of local anatomy, small local patches were matched across X-ray views of two different patients and across different views of the same patient.
The PEAC method was also used to match anatomical structures from a patient with no finding (disease) to patients of different weights, different genders, and different health statuses as shown in
Referring to
The instructions 104 may be implemented as code and/or machine-executable instructions executable by the processor 102 that may represent one or more of a procedure, a function, a subprogram, a program, a routine, a subroutine, a module, an object, a software package, a class, or any combination of instructions, data structures, or program statements, and the like. In other words, one or more of the features for processing described herein may be implemented by hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof. When implemented in software, firmware, middleware, or microcode, the program code or code segments to perform the necessary tasks (e.g., a computer-program product) may be stored in a computer-readable or machine-readable medium, and the processor 102 performs the tasks defined by the code. In some embodiments, the processor 102 is a processing element of a cloud such that the instructions 104 may be implemented via a cloud-based web application.
In some examples, the processor 102 accesses input data from a device 108 (e.g., an end-user device) in operable communication with a display 110. An end-user, via a user interface 112 rendered on the display 110, can provide input elements 114 to the processor 102 and receive output elements for executing functionality herein. In addition, examples of the system 100 can include a database 118 for storing datasets, images, and other input data as described herein.
The PEAC (patch embedding of anatomical consistency) model described herein relates to a novel self-supervised learning (SSL) framework or scheme for improving consistency in learning visual representations of anatomical structures in medical images and includes a new method, called stable grid-based matching, for ensuring global and local consistency in anatomy. Through extensive experiments, the effectiveness of the scheme was demonstrated (compared to existing state-of-the-art methods). By accurately identifying the features of each common region across patients of different genders and weights and across different views of the same patients, PEAC exhibited a heightened potential for enhanced AI in medical image analysis.
Example features include:
While a number of embodiments of the invention have been described, it is apparent that the basic examples may be altered to provide other embodiments that utilize the methods of this disclosure. Therefore, it will be appreciated that the scope of this invention is to be defined by the appended claims rather than by the specific embodiments that have been represented by way of example.
This is a non-provisional application that claims benefit to U.S. Provisional Application Ser. No. 63/580,946, filed on Sep. 6, 2023, which is herein incorporated by reference in its entirety.
This invention was made with government support under R01 HL128785 awarded by the National Institutes of Health. The government has certain rights in the invention.