The technical field generally relates to deep neural networks and their use to create biopsy-free, virtually stained images from images obtained of living tissue. The system and method have particular application to reflectance confocal microscopy (RCM) but may be extended to other imaging modalities such as multiphoton microscopy (MPM) and others. More particularly, the technical field relates to systems and methods that utilize trained deep neural networks to rapidly transform in vivo optical biopsy images (e.g., RCM images) of unstained skin into virtually-stained images.
Microscopic evaluation of histologically processed and chemically stained tissue is the gold standard for the diagnosis of a wide variety of medical diseases. Advances in medical imaging techniques, including magnetic resonance imaging, computed tomography, and ultrasound, have transformed medical practice over the past several decades, decreasing the need for invasive biopsies and exploratory surgeries. Similar advances in imaging technologies to aid in the diagnosis of skin disease non-invasively have been slower to progress.
Skin cancers represent the most common type of cancer diagnosed in the world. Basal cell carcinoma (BCC) comprises 80% of the 5.4 million skin cancers seen in the United States annually. Melanoma represents a small percentage of overall skin cancers but represents the leading cause of death from skin cancer and is among the deadliest cancers when identified at advanced stages. Invasive biopsies to differentiate BCC from benign skin neoplasms and melanoma from benign melanocytic nevi represent a large percentage of the biopsies performed globally. Over 8.2 million skin biopsies are performed to diagnose over 2 million skin cancers annually in the Medicare population alone, resulting in countless unnecessary biopsies and scars at a high financial burden. In addition, the process of biopsy, histological tissue processing, delivery to pathologists, and diagnostic assessment requires one day to several weeks for a patient to receive a final diagnosis, resulting in lag time between the initial assessment and definitive treatment. Thus, non-invasive imaging presents an opportunity to prevent unnecessary skin biopsies while improving early detection of skin cancer.
The ancillary optical imaging tool most commonly used by dermatologists is the dermatoscope, which magnifies skin lesions and uses polarized light to assess superficial features of skin disease and triage lesions with ambiguous features for tissue biopsy. While dermatoscopes can reduce biopsies in dermatology, their use requires proper training to improve the sensitivity of detecting skin cancers over clinical inspection alone. More advanced optical technologies have been developed for non-invasive imaging of skin cancers, including reflectance confocal microscopy (RCM), optical coherence tomography (OCT), multiphoton microscopy (MPM), and Raman spectroscopy, among others. Of these optical imaging technologies, only RCM and MPM provide cellular-level resolution similar to tissue histology and allow for better correlation of image outputs to histology due to their ability to discern cellular-level details.
RCM imaging detects backscattered photons that produce a grayscale image of tissue based on contrast from relative variations in refractive indices and the sizes of organelles and microstructures. Currently, RCM can be considered the most clinically-validated optical imaging technology, with strong evidence supporting its use by dermatologists to discriminate benign from malignant lesions with high sensitivity and specificity. Importantly, several obstacles remain for accurate interpretation of RCM images, which requires extensive training for novice readers. While the black and white contrast images can be used to distinguish types of cells and microstructural detail, in vivo RCM does not show nuclear features of skin cells in a fashion similar to traditional microscopic evaluation of tissue histology. Multimodal ex vivo fluorescence and RCM can produce digitally-colorized images with nuclear morphology using fluorescent contrast agents. However, these agents are not used in vivo with a reflectance-based confocal microscopy system. Without nuclear contrast agents, nuclear features critical for assessing cytologic atypia are not discernable. Further, the grayscale image outputs and horizontal imaging axis of confocal technologies pose additional challenges for diagnosticians who are accustomed to interpreting tissue pathology with nuclear morphology in the vertical plane. Combined, these visualization-based limitations, in comparison to standard-of-care biopsy and histopathology, pose barriers to wide adoption of RCM.
On the other hand, hematoxylin and eosin (H&E) staining of tissue sections on microscopy slides represents the most common visualization format used by dermatologists and pathologists to evaluate skin pathology. Thus, conversion of images obtained by non-invasive skin imaging and diagnostic devices to an H&E-like format may improve the ability to diagnose pathological skin conditions by providing a virtual “optical biopsy” with cellular resolution and in an easy-to-interpret visualization format.
Deep learning represents a promising approach for computationally-assisted diagnosis using images of skin. Deep neural networks trained to classify skin photographs and/or dermoscopy images have successfully discriminated benign from malignant neoplasms at an accuracy similar to trained dermatologists. Algorithms based on deep neural networks can help pathologists identify important regions of disease, including microscopic tumor nodules, neoplasms, fibrosis, and inflammation, and even allow prediction of molecular pathways and mutations based on histopathological features. Researchers have also used deep neural networks to perform semantic segmentation of different textural patterns in RCM mosaic images of melanocytic skin lesions as a potential diagnostic aid for clinicians. Deep learning-based approaches have also enabled the development of algorithms to learn image transformations between different microscopy modalities to digitally enhance pathologic interpretation. For instance, using unstained, autofluorescence images of label-free tissue sections, a deep neural network can virtually stain images of the slides, digitally matching the brightfield microscopy images of the same samples stained with standard histochemical stains such as H&E, Jones, Masson's Trichrome and periodic acid-Schiff (PAS), without the need for histochemical processing of tissue. These virtually-stained images were found to be statistically indiscernible to pathologists when compared in a blinded fashion to images of the chemically stained slides. Deep learning-enabled virtual staining of unstained tissue has been successfully applied to other types of label-free microscopic imaging modalities including, e.g., quantitative phase imaging and two-photon excitation with fluorescence lifetime imaging, but has not been used to obtain in vivo virtual histology.
In one embodiment, a deep learning-based virtual tissue staining system and method is disclosed that rapidly performs in vivo virtual histology of unstained skin. In the training phase of this framework, RCM images of excised skin tissue, with and without acetic acid nuclear contrast staining, were used to train a deep convolutional neural network (CNN) using structurally-conditioned generative adversarial networks (GAN), together with attention gate modules that process the three-dimensional (3D) spatial structure of tissue using 3D convolutions. First, time-lapse RCM image stacks were acquired of ex vivo skin tissue specimens during the acetic acid staining process to label cell nuclei. Using this 3D data, label-free, unstained image stacks were accurately registered to the corresponding acetic acid-stained 3D image stacks, which provided a high degree of spatial supervision for the neural network to map 3D features in label-free RCM images to their histological counterparts. Once trained, this virtual staining framework was able to rapidly transform in vivo RCM images into virtually stained, 3D microscopic images of normal skin, BCC, and pigmented melanocytic nevi with H&E-like color contrast. When compared to traditional histochemically-processed and stained tissue sections, this digital technique demonstrates morphological features similar to those observed in H&E histology. In vivo virtual staining of unprocessed skin through non-invasive imaging technologies such as RCM would be transformative for rapid and accurate diagnosis of malignant skin neoplasms, while also reducing unnecessary skin biopsies.
In another embodiment, a method of using in vivo reflectance confocal microscopy (RCM) images of unstained tissue to generate digitally histological-stained microscopic images of tissue is disclosed. The method includes providing a first trained, deep neural network that is executed by image processing software, wherein the first trained, deep neural network receives as input(s) a plurality of in vivo RCM images of tissue and outputs a digitally acetic acid-stained image that is substantially equivalent to an image of actual acetic acid-stained tissue; and providing a second trained, deep neural network that is executed by image processing software, wherein the second trained, deep neural network receives as input(s) a plurality of in vivo RCM images of tissue and/or the corresponding digitally acetic acid-stained images from the first trained, deep neural network and outputs digitally histological-stained images that are substantially equivalent to the images achieved by actual histological staining of tissue. A plurality of in vivo RCM images of the tissue are obtained and are input to the first trained, deep neural network to obtain digitally acetic acid-stained images of the tissue. The plurality of in vivo RCM images and/or the corresponding digitally acetic acid-stained images are input to the second trained, deep neural network, wherein the second trained, deep neural network outputs the digitally histological-stained microscopic images of the tissue.
In another embodiment, a system is disclosed for generating digitally histological-stained microscopic images from in vivo reflectance confocal microscopy (RCM) images of unstained tissue. The system includes a computing device having image processing software executed thereon or thereby, the image processing software comprising (1) a first trained, deep neural network, wherein the first trained, deep neural network receives as input(s) a plurality of in vivo RCM images of unstained tissue and outputs digitally acetic acid-stained images that are substantially equivalent to the images of the actual acetic acid-stained tissue; and/or (2) a second trained, deep neural network, wherein the second trained, deep neural network receives as input(s) a plurality of in vivo RCM images of unstained tissue and/or the corresponding digitally acetic acid-stained images from the first trained, deep neural network and outputs digitally histological-stained images that are substantially equivalent to the images achieved by actual histological staining of tissue.
In another embodiment, a method of using in vivo images of unstained tissue to generate digitally histological-stained microscopic images of tissue is disclosed. The method includes providing a first trained, deep neural network that is executed by image processing software, wherein the first trained, deep neural network receives as input(s) a plurality of in vivo images of unstained tissue and outputs a digitally acetic acid-stained image of the tissue that is substantially equivalent to the image of the actual acetic acid-stained tissue; and providing a second trained, deep neural network that is executed by image processing software, wherein the second trained, deep neural network receives as input(s) a plurality of in vivo images of tissue and/or the corresponding digitally acetic acid-stained images from the first trained, deep neural network and outputs digitally histological-stained images that are substantially equivalent to the images achieved by actual histological staining of tissue. A plurality of in vivo images of the tissue are obtained and input to the first trained, deep neural network to obtain digitally acetic acid-stained images of the tissue. The plurality of in vivo images and/or the corresponding digitally acetic acid-stained images are input to the second trained, deep neural network, wherein the second trained, deep neural network outputs the digitally histological-stained microscopic images of the tissue.
As explained herein, the in vivo reflectance confocal microscopy (RCM) images 20 of tissue 50 are obtained from tissue 50 that is not stained or labeled. The tissue 50 may include skin tissue, cervical tissue, mucosal tissue, epithelial tissue, and the like. The in vivo reflectance confocal microscopy (RCM) images 20 (or other microscopy images 20) preferably comprise a plurality of such images 20 that are obtained from a microscope 110. For example, for RCM images 20, these are obtained with a RCM microscope 110 or other device for obtaining RCM images 20. For example, the plurality of images 20 may include an image stack of separate images focused at different depths within the tissue 50. The RCM microscope 110 may include different types of RCM microscopes 110 including stand-alone, bench-top, and portable devices.
In another embodiment, the system 2 is used to process images 20 of unstained tissue 50 obtained using a different type of microscope 110 used to obtain optical biopsy images of unstained tissue 50. This includes, for example, images 20 obtained from one or more of the following microscopes or imagers 110: a multiphoton microscope (MPM), a fluorescence confocal microscope/imager, a fluorescence microscope/imager, a fluorescence lifetime microscope (FLIM), a structured illumination microscope/imager, a hyperspectral microscope/imager, a Raman microscope/imager, and a polarization microscope/imager.
The system 2 includes a computing device 100 that contains one or more processors 102 therein and image processing software 104 that incorporates a first trained, deep neural network 10 (e.g., a convolutional neural network as explained herein in one or more embodiments) and a second trained, deep neural network 12 (e.g., a convolutional neural network as explained herein in one or more embodiments). As explained herein, the first deep neural network 10 is trained, in one embodiment, with matched acetic acid-stained images or image patches and their corresponding reflectance confocal microscopy (RCM) images or image patches of unstained ex vivo tissue samples, wherein the first trained deep neural network 10 outputs images 36 that are digitally stained that are substantially equivalent to images of the actual acetic acid-stained tissue (i.e., chemically stained). Of course, in embodiments that use a non-RCM microscope 110 (e.g., MPM, fluorescence confocal microscopy, structured illumination microscopy, and polarization microscopy), the corresponding images or image patches would be obtained with the same imaging modality for training.
These images 36 that are output from the first trained, deep neural network 10 are thus virtually or digitally stained with acetic acid in response to the training. The second deep neural network 12 is trained, in one embodiment, with matched ground truth pseudo-H&E images that were mathematically generated (or images of actual histologically stained tissue) and acetic acid-stained images or image patches and their corresponding reflectance confocal microscopy (RCM) images or image patches of unstained tissue samples (other stains may also be trained in a similar manner). Once trained, the first trained, deep neural network 10 receives a plurality of in vivo RCM images 20 of the unstained tissue 50 to obtain images 36 digitally stained with acetic acid. The second trained, deep neural network 12 receives a plurality of in vivo RCM images 20 and/or the corresponding digitally stained images 36 with acetic acid from the first trained, deep neural network 10, wherein the second trained, deep neural network 12 outputs digitally stained microscopic images 40 of the tissue 50 that are substantially equivalent to the images achieved by the actual histological staining of tissue (e.g., H&E in one embodiment). In some embodiments, the second trained, deep neural network 12 receives just the corresponding digitally stained images 36 output from the first trained, deep neural network 10 and uses these to output digitally stained images 40 that are substantially equivalent to the images achieved by actual histological staining of tissue (e.g., H&E stain). The digitally stained microscopic images 40 may include a specific region of interest (ROI) of the tissue 50. The images 40 may also form a larger area or mosaic that is formed through digital stitching of images using the image processing software 104. The images 40 may also be used to create a three-dimensional image or volume. Likewise, the images 40 may be used to show a particular plane (e.g., a horizontal or vertical plane of the tissue 50).
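By way of illustration only, the two-stage inference flow described above can be sketched in a few lines of Python/TensorFlow. The model files, helper names, tensor shapes, and the stack depth of seven images are assumptions made for this sketch and do not represent the exact implementation of the networks 10, 12:

```python
# Illustrative sketch of the two-stage virtual staining inference (not the exact implementation).
import numpy as np
import tensorflow as tf

def virtual_histology(rcm_stack: np.ndarray,
                      vs_aa: tf.keras.Model,
                      vs_he: tf.keras.Model) -> np.ndarray:
    """rcm_stack: label-free in vivo RCM images 20, shape (H, W, 7), intensities in [0, 1]."""
    x = rcm_stack[np.newaxis, ..., np.newaxis]                   # add batch/channel axes for 3D convolutions
    aa_virtual = vs_aa.predict(x)[0, ..., 0]                     # stage 1: digitally acetic acid-stained image 36
    center = rcm_stack[..., rcm_stack.shape[-1] // 2]            # label-free frame at the target depth
    he_in = np.stack([center, aa_virtual], axis=-1)[np.newaxis]  # 2-channel input to the second network
    return vs_he.predict(he_in)[0]                               # stage 2: digitally H&E-stained image 40 (RGB)

# Hypothetical usage (file names are placeholders):
# vs_aa = tf.keras.models.load_model("vs_aa_model")   # first trained network 10
# vs_he = tf.keras.models.load_model("vs_he_model")   # second trained network 12
# he_image = virtual_histology(rcm_stack, vs_aa, vs_he)
```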
The computing device 100 may include, as explained herein, a personal computer, laptop, mobile computing device, remote server, or the like, although other computing devices may be used (e.g., devices that incorporate one or more graphics processing units (GPUs) or other application specific integrated circuits (ASICs)). GPUs or ASICs can be used to accelerate training as well as final image output. The computing device 100 may be associated with or connected to a monitor or display 106 that is used to display the digitally stained microscopic images 40. The display 106 may also be used to display the grayscale RCM images. The user may be able to see both simultaneously or toggle between views. The display 106 may be used to display a Graphical User Interface (GUI) that is used by the user to display and view the digitally stained microscopic images 40 (or RCM or other microscopy images 20). In one embodiment, the user may be able to trigger or toggle manually between digitally stained microscopic images 40 and grayscale RCM images 20 using, for example, the GUI. In one preferred embodiment, the trained, deep neural network 10 is a Convolutional Neural Network (CNN). In some embodiments, real-time digitally stained microscopic images 40 are generated, which may be displayed to the user on the display 106.
For example, in one preferred embodiment as described herein, the trained, deep neural networks 10, 12 are trained using a GAN model. In a GAN-trained deep neural network 10, 12, two models are used for training. A generative model captures the data distribution while a second model estimates the probability that a sample came from the training data rather than from the generative model. Details regarding GANs may be found in Goodfellow et al., Generative Adversarial Nets., Advances in Neural Information Processing Systems, 27, pp. 2672-2680 (2014), which is incorporated by reference herein. Network training of the deep neural networks 10, 12 (e.g., GAN) may be performed on the same or a different computing device 100. For example, in one embodiment a personal computer may be used to train the networks 10, 12, although such training may take a considerable amount of time. To accelerate this training process, one or more dedicated GPUs may be used for training. As explained herein, such training and testing was performed on GPUs obtained from a commercially available graphics card. Once the deep neural networks 10, 12 have been trained, they may be used or executed on the same or a different computing device 100, which may include one with fewer computational resources than used for the training process (although GPUs may also be integrated into execution of the trained deep neural networks 10, 12).
The image processing software 104 can be implemented using Python and TensorFlow although other software packages and platforms may be used. MATLAB may be used for image registration algorithms as explained herein. The trained deep neural networks 10, 12 are not limited to a particular software platform or programming language and the trained deep neural networks 10, 12 may be executed using any number of commercially available software languages or platforms. The image processing software 104 that incorporates or runs in coordination with the trained, deep neural networks 10, 12 may be run in a local environment or a remote cloud-type environment. In some embodiments, some functionality of the image processing software 104 may run in one particular language or platform (e.g., image registration) while the trained deep neural networks 10, 12 may run in another particular language or platform. Nonetheless, both operations are carried out by image processing software 104.
Training of virtual staining networks for in vivo histology of unstained skin. Traditional biopsy requires cleansing and local anesthesia of the skin, followed by surgical removal, histological processing, and examination by a physician trained in histopathological assessment, typically using H&E staining. Through the combination of its two components, i.e., hematoxylin and eosin, this staining method stains cell nuclei blue and the extracellular matrix and cytoplasm pink, so that clear nuclear contrast can be achieved to reveal the distribution of cells, providing the foundation for evaluating the general layout of the skin tissue structure. As described herein, a new approach using deep learning-enabled transformation of label-free RCM images 20 into H&E-like output images 40 is shown, without the removal of tissue or a biopsy.
Here, acetic acid was used to quickly stain the ex vivo skin tissue during RCM imaging, providing nuclear contrast to serve as the ground truth. The training experiments were performed accordingly, acquiring time-lapse RCM videos during the acetic acid staining process, through which 3D image sequences were obtained with feature positions traceable before and after the acetic acid staining. Based on these sequences, a rough registration of the images 20 before and after staining was initially performed, followed by two additional rounds of deep learning-based fine image registration to obtain accurately registered image stacks (see
These registered image stacks 20 were then used for the training of the acetic acid virtual staining network 10 named VSAA, where attention gate modules and 3D convolutions are employed to enable the network to better process 3D spatial structure of tissue (see
Virtual staining of RCM image stacks of normal skin samples ex vivo. Staining of skin blocks with acetic acid allowed the visualization of nuclei from excised tissue at the dermal-epidermal junction and superficial dermis in normal skin samples. Using these images as the ground truth (for comparison only), it was first tested whether the RCM images 20 of unstained tissue can be transformed into H&E-like images 40 using the deep learning-based virtual histology method. The data, summarized in
Next, the prediction performance of the model was evaluated through a series of quantitative analyses. To do so, the acetic acid virtual staining results were first generated for the entire ex vivo testing set, which contains 199 ex vivo RCM images collected from 6 different unstained skin samples from 6 patients. Segmentation was performed on both the virtual histology images of normal skin samples and their ground truth images to identify the individual nuclei in these images. Using the overlap between the segmented nuclear features of the acetic acid virtual staining images and those in the actual acetic acid-stained ground truth images as a criterion, each nucleus in these images was classified into the categories of true positive (TP), false positive (FP) and false negative (FN), and the sensitivity and precision values of the prediction results were quantified (see the Methods for details). It was found that the virtual staining results achieved ˜80% sensitivity and ˜70% precision for nuclei prediction on the ex vivo testing image set. Then, using the same segmentation results, the nuclear morphological features in the acetic acid virtual staining and ground truth images were further assessed. Five morphological metrics, including nuclear size, eccentricity, compactness, contrast, and concentration, were measured for this analysis (see Methods for details). As shown in
Virtual staining of RCM image stacks of melanocytic nevi and basal cell carcinoma ex vivo. To determine whether the method can be used to assess skin pathology, features seen in common skin neoplasms were imaged. Melanocytes are found at the dermal-epidermal junction in normal skin and increase in number and location in both benign and malignant melanocytic neoplasms. For the approach to be successful, it needs to incorporate pigmented melanocytes in order to be useful for interpretation of benign and malignant melanocytic neoplasms (nevi and melanoma, respectively). Melanin provides strong endogenous contrast in melanocytes during RCM imaging without acetic acid staining. This allows melanocytes to appear as bright cells in standard RCM images 20 due to the high refractive index of melanin. Specimens with normal proportions of melanocytes (
Unlike melanocytes, basaloid cells that comprise tumor islands in BCC appear as dark areas in RCM images 20. This appearance is due to the high nuclear-to-cytoplasmic ratio seen in malignant cells and the fact that nuclei do not demonstrate contrast on RCM imaging. Further, mucin present within and surrounding basaloid islands in BCC further limits the visualization of tumor islands due to a low reflectance signal. Since many skin biopsies are performed to rule out BCC, it was next determined whether acetic acid staining can provide ground truth for skin samples containing BCC. A 50% acetic acid concentration allowed sufficient penetration through the mucin layer to stain the nuclei of BCC. Discarded, approximately 2 mm-thick Mohs surgical specimens diagnosed as BCC were used, and RCM imaging was performed without and with acetic acid staining (the latter formed the ground truth). As illustrated in the bottom row of
Virtual staining of mosaic RCM images ex vivo. Mosaic images are formed by multiple individual RCM images 20 scanned over a large area at the same depth to provide a larger field-of-view of the tissue 50 to be examined for interpretation and diagnosis. To demonstrate virtual staining of mosaic RCM images, ex vivo RCM images 20 of BCC in a tissue specimen 50 obtained from a Mohs surgery procedure were converted to virtual histology. Through visual inspection, the virtual histology image 40 shown in
Virtual staining of in vivo image stacks and mosaic RCM images. Next, it was tested whether RCM images 20 of unstained skin obtained in vivo can give accurate histological information using the trained neural network. In vivo RCM images 20 of lesions that are suspicious for BCC were compared to (1) histology from the same lesion obtained following biopsy or Mohs section histology and (2) images obtained ex vivo with acetic acid staining (ground truth). As summarized in
It was also examined whether the virtual staining method can be applied to mosaic in vivo RCM images, despite the fact that the networks 10, 12 were not trained on a full set of mosaic images. These mosaic RCM images 20 are important because they are often used in clinical settings to extend the field-of-view for interpretation and are required for the reimbursement of the RCM imaging procedure. The results reported in
Finally, the inference speed of the trained deep network models 10, 12 was tested using RCM image stacks, demonstrating the feasibility of real-time virtual staining operation (see Methods for details). For example, using eight Tesla A100 GPUs to perform virtual staining through the VSAA 10 and VSHE 12 networks, the inference time for an image size of 896×896 pixels was reduced to ˜0.0173 sec and ˜0.0046 sec, respectively. Considering that the frame rate of the RCM device used is ˜9 frames per second (˜0.111 sec/image), this demonstrated virtual staining speed is sufficient for real-time operation in clinical settings.
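For context, the combined inference time through the two networks is approximately 0.0173 sec + 0.0046 sec ≈ 0.022 sec per 896×896-pixel frame, which is well below the ˜0.111 sec frame period of the RCM device.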
Here, a deep neural network-based approach was applied to perform virtual staining in RCM images of label-free normal skin, BCC, and melanocytic nevi. Grayscale RCM images were also transformed into pseudo-H&E virtually stained images 40 that resembled H&E staining, the visualization format most commonly used by pathologists to assess biopsies of histochemically-stained tissue on microscopy slides.
In the virtual staining inference, a 3D image stack (stack of images 20) was used as the input of the GAN model. An ablation study was conducted to demonstrate that using 3D RCM image stacks, composed of seven (7) adjacent images, is indeed necessary for preserving the quality of the acetic acid virtual staining results. For this comparative analysis, the input of the network VSAA 10 was changed to only one RCM image 20 of unstained skin tissue that was located at the same depth as the actual acetic acid-stained target (ground truth image). Then, a new VSAA 10 was trained without having a major change to its structure, except that the first three 3D convolutions were changed to 2D (see
Using the presented virtual staining framework, good concordance was shown between virtual histology and common histologic features in the stratum spinosum, dermal-epidermal junction, and superficial dermis, areas of skin most commonly involved in pathological conditions. Virtually stained RCM images 40 of BCC show analogous histological features including nodules of basaloid cells with peripheral palisading, mucin, and retraction artifact. These same features are used to diagnose BCC from skin biopsies by pathologists using H&E histology. It was further demonstrated that the virtual staining network successfully inferred pigmented melanocytes in benign melanocytic nevi (see
While the results demonstrate success in obtaining histology-quality images in vivo without the need for invasive biopsies, several limitations remain for future investigation and improvement. First, a limited volume of training data was used, primarily composed of nodular BCC containing round tumor nodules. When applied to another type of BCC from the blind testing set, containing infiltrative, strand-like tumor islands of BCC with focal keratinization, this resulted in a form of artifact composed of dark blue/purple streaks of basaloid cells similar to the cords/strands seen in the microscopic image of frozen section histology from this sample, but with lower resolution (see
Another limitation of the virtual histology framework is that not all nuclei were placed with perfect fidelity in the transformed, virtually stained images 40. In the quantitative analysis for prediction of nuclei, there remained a positional misalignment between the network inputs and the corresponding ground truth images. This resulted in relatively imprecise learning of the image-to-image transformation for virtual staining, and therefore can be thought of as “weakly paired” supervision. To mitigate this misalignment error in the training image acquisition (time-lapsed RCM imaging process), one can reduce the number of RCM images 20 in a stack in order to decrease the time interval between successive RCM stacks. This may help capture more continuous 3D training image sequences to improve the initial registration of the ground truth images with respect to the input images. One can also further improve the learning-based image registration algorithm (see
All in all, the described virtual histology approach can allow diagnosticians to see the overall histological features, and obtain in vivo “gestalt diagnosis”, as pathologists do when they examine histology slides at low magnification. Ground truth histology was also collected from the same specimen used for RCM imaging (see
Deep learning-enabled in vivo virtual histology is disclosed to transform RCM images 20 into virtually-stained images 40 for normal skin, BCC, and melanocytic nevi. Future studies will evaluate the utility of the approach across multiple types of skin neoplasms and other non-invasive imaging modalities towards the goal of optical biopsy enhancement for non-invasive skin diagnosis.
Following informed consent (Advarra IRB, Pro00037282), 43 patients had RCM images 20 captured during regularly scheduled visits. RCM images 20 were captured with the VivaScope 1500 System (Caliber I.D., Rochester, NY), by a board-certified dermatologist trained in RCM imaging and analysis. RCM imaging was performed through an objective lens-to-skin contact device that consists of a disposable optically clear window. Of course, the systems and methods herein are not limited to a particular make or model of microscope (e.g., portable RCM microscopes or imagers 110 may also be used). The window was applied to the skin over a drop of mineral oil and used throughout the imaging procedure. The adhesive window was attached to the skin with a medical grade adhesive (3M Inc., St. Paul, MN). Ultrasound gel (Aquasonic 100, Parker Laboratories, Inc.) was used as an immersion fluid, between the window and the objective lens. Approximately three RCM mosaic scans and two z-stacks were captured stepwise at 1.52 μm or 4.56 μm increments of both normal skin and skin lesions suspicious for BCC. Upon completion of RCM imaging, patients were managed as per standard-of-care practices. In several cases, skin lesions that were imaged in vivo were subsequently biopsied or excised using standard techniques and the excised tissue was subjected to ex vivo RCM imaging and/or diagnostic tissue biopsy. Tissue diagnosis was confirmed by a board-certified dermatopathologist.
The final in vivo blind testing dataset that was used to present the in vivo results reported herein was composed of 979 896×896-pixel RCM images 20 collected in vivo without any acetic acid-stained ground truth. Histopathologic confirmation was obtained for all skin lesions/tumors but was not provided for in vivo RCM images of normal skin.
Discarded skin tissue specimens 50 from Mohs surgery tissue blocks, with and without residual BCC tumor, were retrieved for ex vivo RCM imaging with an IRB exemption determination (Quorum/Advarra, QR #: 33993). Frozen blocks were thawed, and the specimens were thoroughly rinsed in normal saline. Small samples of intact skin stratum corneum, epidermis, and superficial dermis were trimmed from the tissue specimens. The skin sample length and width varied depending on the size of the discarded Mohs specimen. The adipose and subcutaneous tissue was trimmed from the superficial skin layers, such that skin samples from the stratum corneum to the superficial dermis were approximately 2 mm thick. The trimmed skin samples 50 were placed flat onto an optically clear polycarbonate imaging window with the stratum corneum side down and placed in a tissue block made from 4% agarose (Agarose LE, Benchmark Scientific). The agarose solution was brought to a boiling point and approximately 0.1 mL-0.3 mL was pipetted over the trimmed skin sample and imaging window until the entire sample was covered by the agarose solution. The agarose solution was given 10 min to cool to room temperature, hardening into a malleable mold that encapsulated the skin tissue sample flat against the imaging window. A 2 mm curette was used to channel a small opening in the agarose mold to access the center of the skin tissue sample while the perimeter of the sample remained embedded in the agarose mold.
The imaging window with the agarose-molded skin tissue 50 was attached to the RCM device (VivaScope 1500, Caliber I.D., Rochester, NY), which operates at a frame rate of 9 frames/sec. Ultrasound gel (Aquasonic 100, Parker Laboratories, Inc.) was used as an immersion fluid between the window and the objective lens. The optical head of the RCM device was inverted. Image z-stacks containing 40 images each were captured stepwise with 1.52 μm increments to a total depth of 60.8 μm. 10-20 consecutive image stacks were captured in a continuous time-lapse fashion over the same tissue area. Areas with features of interest (e.g., epidermis, dermal-epidermal junction, and superficial dermis, etc.) were selected before imaging. The first image stack captured RCM images 20 of label-free skin tissue. After completion of the first image stack, 1-2 drops of 50% acetic acid solution (Fisher Scientific) were added to the small opening in the agarose mold with access to the center of the skin tissue sample. While 5% acetic acid is sufficient to stain the nuclei of normal skin tissue, a higher concentration was required to penetrate the mucin that often surrounds islands of BCC tumor, and thus a standard 50% solution was added to all tissue. RCM time-lapse imaging continued until the acetic acid penetrated the area of interest and stained cell nuclei throughout the depth of the image stack. Before and after time-lapse imaging, RCM mosaics (Vivablocks) of the skin tissue sample were also captured at one or several depths. After ex vivo RCM imaging, samples were either fixed in 10% neutral buffered formalin (Thermo Fisher Scientific, Waltham, MA) for histopathology or safely discarded.
The final ex vivo training, validation and testing datasets that were used to train the deep networks 10, 12 and perform quantitative analysis of its blind inference results were composed of 1185, 137 and 199 896×896-pixel ex vivo RCM images of unstained skin lesions and their corresponding acetic acid-stained ground truth, which were obtained from 26, 4 and 6 patients, respectively.
Accurate alignment of the training image pairs is of critical importance for the virtual staining deep neural networks 10, 12 to learn the correct structural feature mapping from the unstained tissue images to their stained counterparts. The principle of the image registration method relies on the spatial and temporal consistency of the time-lapse volumetric image stack captured using RCM during the staining process of the ex vivo tissue samples. In other words, the raw data cover essentially 4-dimensional space, where the three dimensions represent the volumetric images of the tissue and the fourth dimension (time) records the whole staining process of the tissue, i.e., from the unstained state to the stained state, as a function of time.
At this stage, it is noteworthy that small shifts and distortions between the two sets of initially-registered images can still exist and lead to errors during the learning process. To mitigate this, these image pairs were further aligned at a sub-pixel level through the second part of the registration process. In this part, the coarsely registered image pairs were individually fed into a convolutional neural network A (operation 210), whose structure is similar to the generator network reported in
Despite its utility, the calculation of multi-scale correlation can frequently produce abnormal values in the DVFs, which result in unsmooth distortions in the registered images from time to time. To mitigate this problem, another round of soft training (operation 230) of a separate network A′ (similar to A) was applied, along with a second fine registration step (operation 240), to further improve the registration accuracy. Unlike the first fine registration, this second fine registration step (operation 240) was performed based on the DVF generated by a learning-based algorithm, where a deep convolutional neural network B is trained to learn the smooth, accurate DVF between two input images. The training details of this network B are reported in
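As a simple illustration of how a displacement vector field (DVF) is applied once it has been estimated, the following sketch warps a 2D image by per-pixel displacements. The array layout and interpolation settings are assumptions for this sketch; the actual registration code is reported herein to be implemented in MATLAB:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def warp_with_dvf(image: np.ndarray, dvf: np.ndarray) -> np.ndarray:
    """Warp a 2D image by a dense displacement vector field (DVF).

    image: (H, W) grayscale frame.
    dvf:   (H, W, 2) per-pixel displacements (dy, dx), in pixels.
    """
    h, w = image.shape
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    sample_rows = rows + dvf[..., 0]   # where each output pixel samples from in the source image
    sample_cols = cols + dvf[..., 1]
    return map_coordinates(image, [sample_rows, sample_cols], order=1, mode="nearest")
```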
A pix2pix GAN framework was used as the generative model of the acetic acid virtual staining network (VSAA) 10, which includes the training of (1) a generator network for learning the statistical transformation between the unstained input stacks of RCM images 20 and the corresponding acetic acid-stained tissue images, and (2) a discriminator network for learning how to discriminate between a true RCM image of actual acetic acid-stained skin tissue and the generator network's output, i.e., the corresponding virtually stained (acetic acid) tissue image 36. The merit of using this pix2pix GAN framework stems from two aspects. First, it retains the structural distance penalty of a regular deep convolutional network, so that the predicted virtually stained tissue images can converge to be similar to their corresponding ground truth in overall structural features. Second, as a GAN framework, it introduces a competition mechanism by training the two aforementioned networks in parallel. Due to the continuous enhancement of the discrimination ability of the discriminator network during the training process, the generator must also continuously generate more realistic images to deceive the discriminator, which gradually impels the feature distribution of the high-frequency details of the generated images to conform to the target image domain. Ultimately, the desired result of this training process is a generator that transforms an unstained input RCM image 20 stack into an acetic acid virtually stained tissue image 36 that is indistinguishable from the actual acetic acid-stained RCM image of the same sample at the corresponding depth within the tissue. To achieve this, following the GAN scheme introduced above, the loss functions of the generator and discriminator networks were devised as follows:
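Equations (1) and (2) are not reproduced here; one plausible reconstruction, consistent with the description that follows (a structural term, a total-variation term weighted by α, an adversarial term weighted by λ for the generator, and a least-squares-style classification objective for the discriminator), is:

$$\ell_{generator}=L_{structural}\{G(I_{input\_stack}),\,I_{target}\}+\alpha\,TV\{G(I_{input\_stack})\}+\lambda\,\big(1-D(G(I_{input\_stack}))\big)^{2}\qquad(1)$$

$$\ell_{discriminator}=D\big(G(I_{input\_stack})\big)^{2}+\big(1-D(I_{target})\big)^{2}\qquad(2)$$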
where G(⋅) represents the output of the generator network, D(⋅) represents the output probabilistic score of the discriminator network, Itarget denotes the image of the actual acetic acid-stained tissue used as ground truth, and Iinput_stack denotes the (unstained) input RCM image stack. The generator loss function Eq. (1) aims to balance the pixel-wise structural error of the generator network's output image with respect to its ground truth target, the total variation (TV) of the output image, and the discriminator network's prediction of the generator network's output, using the regularization coefficients (α, λ) that are empirically set as (0.02, 15). Specifically, the structural error term Lstructural takes the form of the reversed Huber (or "BerHu") error, which blends the traditional mean squared error and mean absolute error using a certain threshold as the boundary. The reversed Huber error between 2D images a and b is defined as:
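The reversed Huber expression itself is not reproduced above; its standard form, using the threshold δ described next, is:

$$L_{structural}\{a,b\}=\sum_{m,n}\begin{cases}\left|a_{m,n}-b_{m,n}\right|, & \left|a_{m,n}-b_{m,n}\right|\le\delta\\[4pt]\dfrac{\left(a_{m,n}-b_{m,n}\right)^{2}+\delta^{2}}{2\delta}, & \left|a_{m,n}-b_{m,n}\right|>\delta\end{cases}$$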
where m, n are the coordinates on the images, and δ is a threshold hyperparameter that is empirically set as 20% of the standard deviation of the normalized ground truth image Itarget. The third term of Eq. (1) penalizes the generator to produce outputs that are more realistic to the discriminator by pushing the discriminator's response toward 1 (real, like an actual acetic acid-stained tissue image), which increases the authenticity of the generated images. The discriminator loss function Eq. (2) attempts to achieve the correct classification between the network's output and its ground truth by minimizing the score of the generated image toward 0 (classified as a virtually stained tissue image) and maximizing the score of the actual acetic acid-stained tissue image toward 1 (real, classified as an actual acetic acid-stained tissue image). Within this adversarial learning scheme, spectral normalization was applied in the implementation of the discriminator network to improve its training stability.
For the generator network, as shown in
As depicted in
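The exact generator architecture is shown in the referenced figures; purely as an illustrative sketch of the kind of attention gate module and 3D convolutions described herein (layer widths, names, and the assumption that the skip and gating tensors share spatial dimensions are choices made only for this example):

```python
import tensorflow as tf
from tensorflow.keras import layers

def attention_gate_3d(skip, gating, inter_channels: int):
    """Generic attention gate: the gating signal suppresses irrelevant regions
    of a 3D-convolutional skip connection before it is passed upstream."""
    theta = layers.Conv3D(inter_channels, kernel_size=1)(skip)          # project skip features
    phi = layers.Conv3D(inter_channels, kernel_size=1)(gating)          # project gating features
    attn = layers.Activation("relu")(layers.Add()([theta, phi]))
    attn = layers.Conv3D(1, kernel_size=1, activation="sigmoid")(attn)  # per-voxel weights in [0, 1]
    return layers.Multiply()([skip, attn])                              # gated skip connection

# Hypothetical usage with a 256x256x7 single-channel input volume:
x = tf.keras.Input(shape=(256, 256, 7, 1))
features = layers.Conv3D(16, kernel_size=3, padding="same", activation="relu")(x)
gate = layers.Conv3D(16, kernel_size=3, padding="same", activation="relu")(features)
gated = attention_gate_3d(features, gate, inter_channels=8)
model = tf.keras.Model(x, gated)
```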
During the training of this GAN framework, the input image stacks and the registered target images were randomly cropped to patch sizes of 256×256×7 and 256×256, respectively, using a batch size of 12. Before feeding the input images into the network, data augmentation was also applied, including random image rotation, flipping, and mild elastic deformations. The learnable parameters were updated through the training stage of the deep network using an Adam optimizer with a learning rate of 1×10−4 for the generator network and 1×10−5 for the discriminator network. Also, at the beginning of the training, there are 12 iterations of the generator network for each iteration of the discriminator network, to avoid mode collapse that could follow from overfitting of the discriminator network to the targets. As the training evolves, the number of iterations (tGperD) of the generator network per iteration of the discriminator network linearly decreases, which is given by
where tD denotes the total number of iterations of the discriminator, and ⌊⋅⌋ represents the floor function. Typically, tD reaches ˜40,000 iterations when the generator network converges. A typical plot of the loss functions during the GAN training is shown in
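For reference, the stated training hyperparameters correspond to the following minimal configuration (a sketch only; the full training loop, loss wiring, and the adaptive generator/discriminator iteration schedule are not reproduced here):

```python
import tensorflow as tf

# Hyperparameters stated above; the models and data pipeline are assumed to exist elsewhere.
INPUT_PATCH = (256, 256, 7)    # randomly cropped, unstained RCM input stacks
TARGET_PATCH = (256, 256)      # registered acetic acid-stained target patches
BATCH_SIZE = 12

generator_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)
discriminator_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)
```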
For the pseudo-H&E virtual staining of the actual and virtual acetic acid-stained tissue images, an earlier approach was modified, in which epi-fluorescence images were used to synthesize pseudo-color images with H&E contrast. The principle of the pseudo-H&E virtual staining relies on the characteristic of H&E staining that nuclei and cytoplasm are stained blue and pink, respectively. In this system and method, an unstained input image collected by RCM (Iinput) and its corresponding actual acetic acid-stained tissue image (Itarget) are subtracted in pixel intensities to extract the foreground component Iforeground that mainly contains the nuclear features:
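Equation (5) is not reproduced above; a plausible form consistent with the stated coefficients (clipping to non-negative values is an assumption of this reconstruction) is:

$$I_{foreground}=\max\big(1.2\,I_{target}-0.8\,I_{input},\;0\big)\qquad(5)$$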
Note that Itarget and Iinput are initially normalized to (0, 1), and all the operations in Eq. (5) are pixel-wise performed on the 2D images. The selection of the coefficients 1.2 and 0.8 here is empirical. The background component that contains other spatial features including cytoplasm is defined by simply using the unstained input images Iinput. Following this separation of the foreground and background components, a pseudo-H&E acetic acid-stained tissue image Ianalytical-HE,target is analytically computed by colorizing and blending these two components based on a rendering approach, which models transillumination absorption using the Beer-Lambert law:
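Equation (6) is likewise not reproduced; a Beer-Lambert rendering consistent with the description, applied per color channel k ∈ {R, G, B} with the background component Ibackground = Iinput, would read:

$$I_{analytical\text{-}HE,\,target}^{(k)}=\exp\!\big(-\beta_{hematoxylin}^{(k)}\,I_{foreground}\big)\,\exp\!\big(-\beta_{eosin}^{(k)}\,I_{background}\big)\qquad(6)$$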
where βhematoxylin and βeosin are the 3-element weight vectors corresponding to the R, G and B channels that help to mimic the real color of hematoxylin and eosin, respectively. In the experiments disclosed herein, the values of the elements in βhematoxylin and βeosin are empirically chosen as [0.84, 1.2, 0.36]T and [0.2, 2, 0.8]T, respectively. Similarly, a pseudo-H&E acetic acid virtually stained tissue image Ianalytical-HE,output can also be computed by replacing Itarget with an acetic acid virtually stained tissue image Ioutput in Eq. (5).
This analytical approach (Eq. 6) works well on most of the actual and virtual acetic acid-stained tissue images to create H&E color contrast. However, when it comes to images that contain melanocytes, which appear dark brown in H&E staining, this algorithm fails to generate the correct color at the positions of these melanocytes. Considering that the brown color (representing melanin) cannot be generated through a pixel-wise linear combination of the images Iinput and Itarget or Ioutput, a learning-based approach was introduced to perform the correct pseudo-H&E virtual staining (VSHE), which can incorporate inpainting of the missing brown features by using the spatial information content of the images. For training purposes, manual labeling of melanocytes was performed to create training data for this learning-based approach. In order to reduce the labor of this manual labeling, the initial distribution of melanin in a certain field of view was first estimated through an empirical formula:
where ⋅ denotes pixel-wise multiplication, and Ith represents a threshold that is selected as 0.2 based on empirical evidence. The constitution of this formula is based on the observation that melanin has strong reflectance in both the unstained/label-free and the actual acetic acid-stained tissue RCM images, namely Iinput and Itarget, respectively. Then, these initial estimations are further cleaned up through a manual labeling process performed with the assistance of a board-certified dermatopathologist, resulting in Imelanin,labeled. This manual labeling process as part of the training forms the core task that will be learned and executed by the learning-based scheme (
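Equation (8) is not reproduced above; one plausible form, extending the Beer-Lambert rendering of Eq. (6) with a third, melanin (brown) absorption term based on Imelanin,labeled (the symbol for the resulting ground truth image is introduced here only for this reconstruction), is:

$$I_{GT\text{-}HE}^{(k)}=\exp\!\big(-\beta_{hematoxylin}^{(k)}\,I_{foreground}\big)\,\exp\!\big(-\beta_{eosin}^{(k)}\,I_{background}\big)\,\exp\!\big(-\beta_{brown}^{(k)}\,I_{melanin,labeled}\big)\qquad(8)$$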
where the value of βbrown is empirically chosen as [0.12, 0.24, 0.28]T in order to correctly render the brown color of the melanin. Using Eq. (8), the ground truth images were obtained for the learning-based virtual staining approach to perform the corrected pseudo-H&E virtual staining. Using the ex vivo training set, the pseudo-H&E virtual staining network VSHE 12 was trained to transform the distribution of the input and actual acetic acid-stained tissue images, i.e., Iinput and Itarget, into Ianalytical-HE,target. The architecture of the network VSHE 12 is identical to the ones used in the registration process, except that the input and output of the network VSHE 12 have 2 and 3 channels, respectively (
Eq. (9) was used to create all the pseudo-H&E virtually stained tissue images. To exemplify the effectiveness of this learning-based pseudo-H&E virtual staining approach, in
CellProfiler was used to conduct morphological analysis of the results. After loading the actual acetic acid-stained tissue images and virtually stained (acetic acid) tissue images using CellProfiler, cell segmentation and profile measurement were performed to quantitatively evaluate the quality of the predicted images when compared with the corresponding ground truth images. In CellProfiler, the typical diameter of objects to detect (i.e., nuclei) was set to 10-25 pixel units and objects that were outside the diameter range or touching the border of each image were discarded. An adaptive thresholding strategy was applied using minimum cross-entropy with a smoothing scale of 6 and a correction factor of 1.05. The size of the adaptive window was set to 50. “Shape” and “Propagate” methods were selected to distinguish the clumped objects and draw dividing lines between clumped objects, respectively. Following this step, the function module “IdentifyPrimaryObjects” was introduced to segment the nuclei in a slice-by-slice manner. Accordingly, well-segmented nuclei images were obtained containing positional and morphological information associated with each detected nuclear object.
For the analysis of the nuclear prediction performance of the model, the function module "ExpandOrShrinkObjects" was first employed to slightly expand the detected nuclei by, e.g., 4 pixels (˜2 μm), so that image registration and nuclei tracking-related issues across different sets of images could be mitigated. Then the function module "RelateObjects" was used to assign a relationship between the objects of virtually stained nuclei and the actual acetic acid-stained ground truth, and "FilterObjects" was used to retain only the virtually stained nuclei objects that present overlap with their acetic acid-stained ground truth, which were marked as true positives (TP). Similarly, false positives (FP) and false negatives (FN) were marked based on the virtually stained nuclei objects that have no overlap with their ground truth, and the actual acetic acid-stained nuclei objects that have no overlap with the corresponding virtually stained nuclei objects, respectively. Note that a true negative (TN) count is not calculated in this case, since one cannot define a nuclear object that is absent from both the virtually stained and ground truth images. Next, the numbers of TP, FP and FN events were counted, denoted as nTP, nFP and nFN, respectively, and the Sensitivity and Precision values were accordingly computed, defined as:
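With nTP, nFP and nFN as defined above, these two metrics take the standard forms:

$$\mathrm{Sensitivity}=\frac{n_{TP}}{n_{TP}+n_{FN}},\qquad \mathrm{Precision}=\frac{n_{TP}}{n_{TP}+n_{FP}}$$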
For the nuclear morphological analysis, the function module “MeasureObjectSizeShape” was utilized to compute the nuclei area (“AreaShape_Area”, the number of pixels in one nucleus), compactness (“AreaShape_Compactness”, the mean squared distance of the nucleus's pixels from the centroid divided by the area of the nucleus), and eccentricity (“AreaShape_Eccentricity”, the ratio of the distance between the foci of the effective ellipse that has the same second-moments as the segmented region and its major axis length). “MeasureObjectIntensity” module was employed afterward to compute the nuclei reflectance (“Intensity_IntegratedIntensity_Cell”, the sum of the pixel intensities within a nucleus). The function module “MeasureTexture” was utilized to compute the contrast of the field of view (“Texture_Contrast_Cell”, a measure of local variation in an image). For image similarity analysis, the Pearson Correlation Coefficient (PCC) was calculated for each image pair of the virtual histology results and the corresponding ground truth image based on the following formula:
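The PCC formula itself is not reproduced above; written only in terms of the mean operator E(⋅) explained next, the standard expression is:

$$PCC=\frac{E\big(I_{output}\,I_{target}\big)-E\big(I_{output}\big)\,E\big(I_{target}\big)}{\sqrt{E\big(I_{output}^{2}\big)-E\big(I_{output}\big)^{2}}\;\sqrt{E\big(I_{target}^{2}\big)-E\big(I_{target}\big)^{2}}}$$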
where Ioutput and Itarget represent the predicted (virtually-stained) and ground truth images, respectively, and E(⋅) denotes the mean value calculation. For all the violin plots presented above, the violin plot function in the Seaborn Python library was used to visualize the conformance between the prediction and ground truth images.
The deep neural networks 10, 12 used herein were implemented and trained using Python (v3.6.5) and TensorFlow (v1.15.0, Google Inc.). All the image registration algorithms were implemented with MATLAB r2019a. For the training of the models, a desktop computer was used with dual GTX 1080 Ti graphics processing units (GPUs, Nvidia Inc.), an Intel® Core™ i7-8700 central processing unit (CPU, Intel Inc.) and 64 GB of RAM, running the Windows 10 operating system (Microsoft Inc.). The typical training time of the convolutional neural networks used in the registration process and the pseudo-H&E virtual staining network (i.e., networks A, A′, B, and VSHE 12) is ˜24 hours when using a single GPU. For the acetic acid virtual staining network (i.e., VSAA 10), the typical training time using a single GPU is ˜72 hours. Once the VSAA and VSHE networks 10, 12 are trained, using the same computer with two GTX 1080 Ti GPUs, one can execute the model inference at a speed of ˜0.2632 sec and ˜0.0818 sec for an image size of 896×896 pixels, respectively. Using a more powerful machine with eight Tesla A100 GPUs, the virtual staining speed can be substantially increased to ˜0.0173 sec and ˜0.0046 sec per image (896×896 pixels), for the VSAA and VSHE networks, respectively.
The specific procedures of the pyramid elastic registration algorithm are detailed in the following pseudo-code set forth in Table 1:
For performing this elastic registration, the values of α, β, ε and N0 are empirically set as 1.4, 50, 0.5 and 3, respectively. The detailed procedures of calculating the 2D shift map SM based on the 2D cross-correlation map CCM can be summarized as:
Calculate the normalized cross-correlation map nCCM, which is defined as
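The nCCM definition is not reproduced above; one plausible reconstruction, consistent with the use of the CCM extrema and the PCC described below, is:

$$nCCM=\frac{CCM-\min(CCM)}{\max(CCM)-\min(CCM)}\cdot PCC$$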
where CCM is the cross-correlation map, defined as
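The CCM can be written in its standard zero-mean cross-correlation form (the shift coordinates (u, v) are notation introduced only for this reconstruction):

$$CCM(u,v)=\sum_{m,n}\big(f(m,n)-\bar{f}\big)\big(g(m+u,\,n+v)-\bar{g}\big)$$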
where f and g represent two images, and ā refers to the two-dimensional mean operator of an image, a. The locations of the maximum and minimum values of CCM indicate the most likely and the most unlikely (respectively) relative shifts of the images. PCC refers to the Pearson correlation coefficient of the two images.
The normalized cross-correlation map nCCM is then fit to a 2D Gaussian function, which is defined as:
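The 2D Gaussian is not reproduced above; a standard parameterization consistent with the symbols explained next (the widths σx and σy are additional fitting parameters assumed for this reconstruction) is:

$$G(x,y)=A\exp\!\left(-\frac{(x-x_{0})^{2}}{2\sigma_{x}^{2}}-\frac{(y-y_{0})^{2}}{2\sigma_{y}^{2}}\right)$$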
where x0 and y0 represent the lateral position of the peak that indicates the shift amount between the two images along the x and y directions, respectively, and A represents the similarity of the two images, f and g.
While embodiments of the present invention have been shown and described, various modifications may be made without departing from the scope of the present invention. For example, in certain embodiments, the functionality of the first and second deep neural networks 10, 12 may be combined into a single deep neural network. The invention, therefore, should not be limited, except to the following claims, and their equivalents.
This application claims priority to U.S. Provisional Patent Application No. 63/219,785 filed on Jul. 8, 2021, which is hereby incorporated by reference in its entirety. Priority is claimed pursuant to 35 U.S.C. § 119 and any other applicable statute.
This invention was made with government support under Grant Number 1926371, awarded by the National Science Foundation. The government has certain rights in the invention. This work was supported by the U.S. Department of Veterans Affairs, and the federal government has certain rights in the invention.
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/US2022/035609 | 6/29/2022 | WO |
Number | Date | Country
---|---|---
63219785 | Jul 2021 | US