The present invention is generally directed to retinal layer segmentation in OCT data. More specifically, it is directed to deep learning approaches to retinal layer segmentation and to the creation of augmented training samples.
Optical coherence tomography (OCT) is a non-invasive imaging technique that uses light waves to penetrate tissue and produce image information at different depths within the tissue, such as an eye. Generally, an OCT system is an interferometric imaging system based on detecting the interference of a reference beam and backscattered light from a sample illuminated by an OCT beam. Each scattering profile in the depth direction (e.g., z-axis or axial direction) may be reconstructed individually into an axial scan, or A-scan. Cross-sectional slice images (e.g., two-dimensional (2D) cross-sectional scans, or B-scans) and volume images (e.g., 3D cube scans, or C-scans or volume scans) may be built up from multiple A-scans acquired as the OCT beam is scanned/moved through a set of transverse (e.g., x-axis and/or y-axis) locations on the sample. When applied to the retina of an eye, OCT generally provides structural data that, for example, permits one to view, at least in part, distinctive tissue layers and vascular structures of the retina. OCT angiography (OCTA) expands the functionality of an OCT system to also identify (e.g., render in image format) the presence, or lack, of blood flow in retinal tissue. For example, OCTA may identify blood flow by identifying differences over time (e.g., contrast differences) in multiple OCT scans of the same retinal region, and designating differences in the scans that meet predefined criteria as blood flow.
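The difference-over-time principle may be sketched with a simplified, illustrative example (the variance-based flow metric, threshold value, and simulated data below are assumptions for illustration only, not the OCTA algorithm of any particular instrument):

```python
import numpy as np

rng = np.random.default_rng(0)

# Four repeated OCT B-scans (depth x width) of the same retinal region.
scans = np.full((4, 64, 64), 100.0) + rng.normal(0.0, 1.0, (4, 64, 64))

# Static tissue is nearly identical across repeats; a flow region
# decorrelates over time, simulated here as an alternating fluctuation.
for k in range(4):
    scans[k, 30:34, 30:34] += 30.0 * (-1) ** k

# Designate voxels whose difference over time (here, temporal variance)
# meets a predefined criterion as blood flow.
temporal_variance = scans.var(axis=0)
flow_mask = temporal_variance > 50.0
```

In this sketch, only the 4×4 fluctuating region is flagged as flow, while the static background stays below the criterion.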
An OCT system also permits construction of a planar (2D), frontal view (e.g., en face) image of a select portion of a tissue volume (e.g., a target tissue slab (sub-volume) or target tissue layer(s), such as the retina of an eye). Examples of other 2D representations (e.g., 2D maps) of ophthalmic data provided by an OCT system may include layer thickness maps and retinal curvature maps. For example, to generate layer thickness maps, an OCT system may combine en face images, 2D vasculature maps of the retina, with multilayer segmentation data. Thickness maps may be based, at least in part, on measured thickness difference between retinal layer boundaries. Vasculature maps and OCT en face images may be generated, for example, by projecting onto a 2D surface a sub-volume (e.g., tissue slab) defined between two selected layer-boundaries. The projection may use the sub-volume's mean, sum, percentile, or other data aggregation method between the selected two layer-boundaries. Thus, the creation of these 2D representations of a 3D volume (or sub-volume) data often relies on the effectiveness of automated (multi) retinal layer segmentation algorithm(s) to identify the retinal layers (or layer-boundaries) upon which the 2D representations are based/defined.
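For example, the projection of a tissue slab between two selected layer boundaries onto a 2D en face image may be sketched as follows (the array shapes, boundary values, and choice of a mean projection are illustrative assumptions):

```python
import numpy as np

# Hypothetical OCT volume with dimensions (depth z, width x, slow axis y).
volume = np.arange(8 * 4 * 3, dtype=float).reshape(8, 4, 3)

# Axial positions of two selected layer boundaries per (x, y) location
# (constant here for simplicity; in practice they come from segmentation).
top = np.full((4, 3), 2, dtype=int)
bottom = np.full((4, 3), 6, dtype=int)

# Project the sub-volume (slab) between the two boundaries onto a 2D
# surface using the slab's mean along the axial direction.
en_face = np.empty((4, 3))
for ix in range(4):
    for iy in range(3):
        en_face[ix, iy] = volume[top[ix, iy]:bottom[ix, iy], ix, iy].mean()
```

Replacing `.mean()` with `.sum()` or a percentile yields the other data aggregation methods mentioned above.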
It is an object of the present invention to provide a practical deep learning solution to retinal layer segmentation of OCT data that outperforms traditional knowledge-based algorithms in terms of execution time.
It is another object of the present invention that the deep learning model be fast and accurate, generalize well to unseen data (data dissimilar to training input/output samples) with various pathological structures, and be robust to high levels of noise.
It is a further object of the present invention to provide a deep learning model that generalizes well to scans from different instruments (e.g., CARL ZEISS's CIRRUS™ PLEXELITE™, and other OCT systems/instruments) and different scan patterns.
It is still another object of the present invention that the deep learning model require minimal effort to work on new scans with different signal and noise characteristics.
The above objects are met in a method/system that provides a set of augmentation methods to generate a rich and diverse set of labeled data, uses a minimal and efficient network structure, provides proper pre-training and training procedures, and provides a loss function specific to the multi-retinal layer segmentation (MLS) problem.
An embodiment of the present invention provides a method and system for segmenting one or more target retinal layers from an optical coherence tomography (OCT) volume scan of an eye. The method/system may include acquiring the OCT volume scan (e.g., C-scan), such as by using an OCT system. Alternatively, the OCT volume scan may be acquired/collected from a data store of previously obtained OCT volume scans. Alternatively, the acquired OCT data may be a B-scan. The acquired OCT volume is then submitted, optionally in B-scan portions, to a deep learning machine model having a self-attention mechanism that differentially weighs the importance (or priorities) of different regions of each B-scan based on the regions' relationship to the one or more target retinal layers. This may be done by enhancing (e.g., weighing more heavily) regions of each B-scan associated with the one or more target retinal layers and deemphasizing (weighing less heavily) regions not associated with the target retinal layers. The deep learning machine model maintains the data density of the width dimension of each B-scan, but reduces the data density of the depth dimension of each B-scan based on the number of target retinal layers. In this manner, the amount of data of each B-scan along the axial direction that needs to be analyzed is reduced to only those portions pertinent to finding the one or more target retinal layers.
That is, each B-scan is made up of multiple adjacent A-scans, and the self-attention mechanism enhances one or more Layer-of-Interest (LOI) regions respectively associated with the one or more target retinal layers within each A-scan, for example, based on topology information. In this manner, all the adjacent A-scans of a B-scan can be processed in parallel without placing an excessive computing cost on the system. For example, if L is the number of target retinal layers to be segmented, then the present deep learning machine model may make L×W predictions per B-scan, with each of the L rows of predictions being of size 1×W and representing a Layer-of-Interest (LOI).
Each B-scan comprises multiple adjacent A-scans. Optionally, the deep learning machine model is based on a neural network that includes a Linear Projection layer that converts the depth dimension of all A-scans (irrespective of their respective axial dimension size) to a common, fixed depth dimension smaller than their original depth dimension. For example, the depth dimension of each A-scan may be reduced at least by a factor of 100. In an embodiment of the present invention, the neural network includes a transformer encoder, and the converted A-scans are input to the transformer encoder. This transformer encoder may include multiple transformer layers. In embodiments, the output of the transformer encoder is projected to a prediction layer by a second Linear Projection layer, and the prediction layer provides segmentation information of the one or more target retinal layers to an output layer that outputs the predictions on a per A-scan basis in parallel.
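The Linear Projection step may be sketched as follows (the projection matrix is randomly initialized here purely for illustration; in practice its weights are learned during training, and the sizes shown are example values):

```python
import numpy as np

rng = np.random.default_rng(0)

# One B-scan: w adjacent A-scans, each with an axial depth of 1024 samples.
w, depth, d_model = 512, 1024, 128
b_scan = rng.normal(size=(w, depth))

# Linear Projection layer: a single matrix (randomly initialized here for
# illustration) maps every A-scan, whatever its axial size, to a common,
# fixed 128-dimensional representation.
projection = rng.normal(size=(depth, d_model)) / np.sqrt(depth)
tokens = b_scan @ projection   # all w A-scans projected in parallel
```

Because the projection is a single matrix multiply over the whole B-scan, every A-scan is converted in parallel.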
Optionally, the output from the self-attention mechanism is processed to produce predictions of the segmentation of the one or more target retinal layers and to associate confidence maps with each of the predicted segmentations of the one or more target retinal layers. Optionally, the predicted segmentations of the one or more target retinal layers are of the form 2×w, where w is the width of a submitted B-scan.
The prediction may include, per target retinal layer, a center prediction termed “center” and a heights prediction termed “heights”, and the output upper layer boundary ymin and lower layer boundary ymax per segmented target retinal layer are computed from “center” and “heights” using two hyperparameters, h1 and h2, that define the thickness prediction of the target retinal layer. Here, h1 and h2 may be determined experimentally.
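As an illustrative sketch only (the exponential decoding form and all numeric values below are assumptions, not the claimed formula), the two boundaries may be recovered from “center” and “heights” as:

```python
import numpy as np

# Hypothetical per-A-scan predictions for one target retinal layer.
center = np.array([120.0, 121.5, 119.0])   # predicted layer center (pixels)
heights = np.array([0.8, 1.0, 0.9])        # predicted log-scaled thickness

h1, h2 = 10.0, 2.3  # experimentally determined hyperparameters

# Assumed decoding: layer thickness grows exponentially with the raw
# "heights" prediction, scaled by the hyperparameters.
thickness = h1 * np.exp(h2 * heights)
y_min = center - thickness / 2.0  # upper layer boundary
y_max = center + thickness / 2.0  # lower layer boundary
```

Under this assumed form, since h2 ≈ ln(10), the “heights” prediction behaves approximately as a base-10 logarithm of the layer thickness.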
The above objects are also met in a method or system for segmenting one or more target retinal layers from an optical coherence tomography (OCT) scan of an eye that includes: acquiring the OCT scan (including at least one B-scan); and submitting the OCT scan in B-scan segments to a deep learning machine model based on a neural network whose training set includes augmented training samples. Creation of the augmented training samples may include: collecting high-resolution raw spectral data using an OCT system; constructing primary high-resolution OCT image data from the collected high-resolution raw spectral data; defining ground truth layer segmentation label data from the high-resolution OCT image data; amending the raw spectral data and generating secondary OCT image data therefrom; and using the secondary OCT image data as an augmented training input sample and the ground truth layer segmentation label data as part of a training output target sample in the training of the neural network.
In this approach, the primary high-resolution OCT image data and the secondary OCT image data provide structural data. Also, the acquired OCT scan may be a volume scan comprising a plurality of these B-scans.
The raw spectral data may be amended by degrading the raw spectral data. Alternatively, or in addition, the raw spectral data may be amended by applying local warping and changes in reflectivity to simulate at least one of a plurality of pathologies. Also, the raw spectral data may be amended by sampling noise data from a store of OCT noise scans and applying the sampled noise data to the raw spectral data.
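A simplified sketch of spectral-domain amendment follows (the 1D Fourier model of OCT reconstruction and the additive degradation used here are illustrative simplifications of real OCT processing):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simplified 1D model of one A-scan: the structural A-scan is the
# magnitude of the Fourier transform of the raw spectral fringe.
n = 1024
k = np.arange(n)
fringe = np.cos(2.0 * np.pi * 100 * k / n)   # a reflector at depth bin 100

primary = np.abs(np.fft.fft(fringe))[: n // 2]    # primary image data

# Amend the raw spectral data, e.g., by degrading it with additive noise
# (the noise could instead be sampled from a store of OCT noise scans).
amended = fringe + rng.normal(0.0, 0.5, n)
secondary = np.abs(np.fft.fft(amended))[: n // 2]  # secondary image data
```

Because the underlying structure (the reflector at depth bin 100) is unchanged, ground truth labels defined from the primary image remain valid for the secondary, degraded image.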
Optionally, the ground truth layer segmentation label data may be defined by submitting the primary high-resolution OCT image data to an automated Multi retinal Layer Segmentation utility.
Other objects and attainments together with a fuller understanding of the invention will become apparent and appreciated by referring to the following description and claims taken in conjunction with the accompanying drawings.
Several publications may be cited or referred to herein to facilitate the understanding of the present invention. All publications cited or referred to herein are hereby incorporated herein in their entirety by reference.
The embodiments disclosed herein are only examples, and the scope of this disclosure is not limited to them. Any embodiment feature mentioned in one claim category, e.g., system, can be claimed in another claim category, e.g., method, as well. The dependencies or references back in the attached claims are chosen for formal reasons only. However, any subject matter resulting from a deliberate reference back to any previous claims can be claimed as well, so that any combination of claims and the features thereof are disclosed and can be claimed regardless of the dependencies chosen in the attached claims.
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
The subject matter of the present disclosure is particularly pointed out and distinctly claimed in the concluding portion of the specification. A more complete understanding of the present disclosure, however, may best be obtained by referring to the following detailed description and claims in connection with the following drawings. While the drawings illustrate various embodiments employing the principles described herein, the drawings do not limit the scope of the claims.
In the drawings wherein like reference symbols/characters refer to like parts:
The following detailed description of various embodiments herein makes reference to the accompanying drawings, which show various embodiments by way of illustration. While these various embodiments are described in sufficient detail to enable those skilled in the art to practice the disclosure, it should be understood that other embodiments may be realized and that changes may be made without departing from the scope of the disclosure. Thus, the detailed description herein is presented for purposes of illustration only and not for limitation. Furthermore, any reference to singular includes plural embodiments, and any reference to more than one component or step may include a singular embodiment or step. Also, any reference to attached, fixed, connected, or the like may include permanent, removable, temporary, partial, full, or any other possible attachment option. Additionally, any reference to without contact (or similar phrases) may also include reduced contact or minimal contact. It should also be understood that unless specifically stated otherwise, references to “a,” “an” or “the” may include one or more than one, and that reference to an item in the singular may also include the item in the plural. Further, all ranges may include upper and lower values, and all ranges and ratio limits disclosed herein may be combined.
Analysis of the thickness of various retinal layers in an OCT image provides valuable clinical insight and is useful for monitoring eye health. Segmenting different layers, such as the Inner Limiting Membrane (ILM) or the Retinal Pigment Epithelium (RPE), in an OCT image not only has several clinical applications but also helps with algorithm development. An example of the fovea and some retinal layers is provided in
Methods for extracting various retinal layer information (e.g., layer boundary or layer thickness information) from an OCT (e.g., structure) image (and creating a map of this information, e.g., thickness maps) rely on traditional knowledge-based algorithms or methods that semantically segment the OCT image/data using machine learning techniques. While knowledge-based algorithms can be efficient, they often require data-specific hand tuning which can make scaling these methods to accommodate new data types difficult, or impractical.
In addition to knowledge-based approaches, deep learning approaches have also been considered. Deep learning approaches, e.g., artificial intelligence (AI), are attractive since there exists the possibility that they may make use of data overlooked by more traditional knowledge-based approaches, have the potential of being easier to develop and/or train than knowledge-based approaches (if sufficient training data is available), and their resultant trained models can sometimes be faster than knowledge-based approaches. Previous attempts at using deep learning for retinal layer segmentation in OCT data, however, have faced various difficulties that have limited their practical implementation. For example, one difficulty is that deep learning models (e.g., based on neural networks) typically require a large library of training samples, and it is often expensive and impractical to collect such a large library of labeled training samples.
Nonetheless, the study, diagnosis, and monitoring of retinal disease benefit greatly from modeling anatomical structures in OCT images. An (automated) Multi retinal Layer Segmentation (MLS) utility/method/tool/application is a crucial component, and often an early step, in such analysis pipelines. However, various retinal diseases make developing automated algorithms for this task challenging. Automated segmentation methods may be divided into two categories: knowledge-based (classical) and learning-based (e.g., deep learning) methods.
Knowledge-based (classical) algorithms have been developed to estimate the boundaries of several retinal layers from B-scan images. These methods are usually based on some assumptions about the input images and anatomy, and comprise several hand-designed steps to extract useful features from input images. They are usually slow and may not work on a new set of data that does not satisfy the algorithm's assumptions. Extreme variations in morphology and alignment of layers due to retinal diseases can cause issues for these methods. Compared to machine learning-based methods, their primary advantage is that they do not require human-labeled ground-truth data.
Machine learning methods, especially deep learning ones, have been shown to handle many limitations of classical models. These models are trained in a supervised fashion by providing B-scans or a set of B-scans as input and ground-truth segmentation masks/images (or boundaries) as training target data. The need for a diverse set of training data (e.g., to provide examples of different scan types, scanning systems, imaging artifacts, pathology examples, etc.) with accurate ground-truth (e.g., labels) limits the ability of such models. The generalizability of deep learning methods to unseen data is also an issue for poorly designed networks and limited training data.
Layer segmentation is generally addressed by one of two methods: (1) using a knowledge-based algorithm, which typically involves using a graph based set up to enforce prior knowledge about the layers; or (2) a hybrid deep learning based set up that involves using deep learning (e.g., a neural network) for making dense predictions (predicting for every pixel, as in semantic segmentation) and then using knowledge-based methods (e.g., in post-processing) to extract the layer(s) of interest from these images (e.g., the dense predictions output from the deep learning module).
The limitations of method (1) may include that developing a new knowledge-based algorithm or modifying an existing algorithm (e.g., when signal quality changes, such as due to a change in OCT device characteristics) requires more time and effort than simply retraining a deep learning (neural network) model with new training data samples. Also, the accuracy of the predictions is not always informed by the general context of the image, and this lack of representational knowledge can make these methods less accurate.
A limitation of using a two-step approach, e.g., method (2), may be runtime performance. The cost of making dense predictions (e.g., on a pixel-by-pixel basis) is much higher than predicting the layers directly in a deep learning approach, and the accuracy of the algorithm/model is also highly dependent on the hand-tuned, knowledge-based, post-processing method. These hand-tuned methods face similar drawbacks to those of method (1).
In various embodiments, attempts to use deep learning to address obstacles in the application of retinal layer segmentation to OCT data may include:
A challenge of retinal layer segmentation (or other retinal layer analysis) of OCT data/images using a data-centric/deep learning-based technique revolves around posing the problem as a multi-stage set up. These stages typically involve segmenting the layers of the OCT image in a semantic or dense-prediction manner and later applying several post-processing methods to extract the region of interest (ROI) using hand-tuned knowledge-based algorithms.
If this process were wrapped in an end-to-end, deep learning-based method, it would imply:
The present invention addresses several difficulties associated with a deep learning approach to retinal layer analysis. Some problems/objectives that the present invention addresses include:
In the present embodiment, the layer segmentation problem is posed as a regression problem. A key feature of the present approach is that for an image of size H×W, the present embodiments make L×W predictions per image, where each row prediction of size 1×W represents a layer of interest and there are L such layers.
Two main end-to-end methods of solving this problem are put forth:
1: Layer Detection: The layer segmentation problem is posed as an object localization problem. The network architecture employed is modified to never lose resolution over the image width while aggregating features over the (image) height (axial) dimension. This architecture is made to be fully convolutional. After extracting these A-scan-wise features, another convolutional layer is used for making the layer predictions (e.g., extracting layer boundary information). The predictions of this network are assumed to be of the form: (center location, log(layer thickness)). Optionally, exponential notation is used for layer thickness prediction. Finally, the top and bottom layer boundaries are calculated using these predicted coordinates.
2: Transformers: A second architecture draws inspiration from the use of transformers in neural networks, which have shown state of the art performance in natural language processing (NLP) applications. Generally, a transformer is a deep learning model that adopts the mechanism of self-attention, differentially weighting the significance of each part of the input data. More specifically, the present invention takes inspiration from Bidirectional Encoder Representations from Transformers (BERT) architectures, as described, for example, in “Bert: Pre-training of deep bidirectional transformers for language understanding” by Devlin, Jacob, et al., arXiv preprint arXiv:1810.04805 (2018), herein incorporated in its entirety by reference. The present invention, however, reposes the problem to manage the computation complexity of OCT data/image tasks.
The final attention outputs are used to regress (e.g., predict) the output layers and the confidence maps (e.g., associate the output layers with confidence maps/values). The present neural network (or deep learning machine model) produces clinically acceptable outputs while remaining fast (e.g., faster than previously achieved) for real world deployment (e.g., having speeds suitable for practical applications in the field).
The present network is trained on data where the ground truth outputs are created from an automated MLS (multi-retinal layer segmentation) algorithm, and the MLS segmentation confidence metric is used to weigh the loss gradients generated for each A-scan, making this whole set up semi-supervised. The present network is trained using this MLS-confidence-weighted L1 loss (e.g., loss function) for the layer positions and a Binary Cross Entropy loss for the confidence metric.
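The two loss terms may be sketched as follows (the array values and names are illustrative; an actual implementation would use a deep learning framework's differentiable tensors):

```python
import numpy as np

# Hypothetical per-A-scan quantities for one B-scan of width 4.
pred_boundary = np.array([100.0, 102.0, 98.0, 101.0])   # network prediction
gt_boundary = np.array([101.0, 101.0, 99.0, 130.0])     # from automated MLS
mls_confidence = np.array([1.0, 0.9, 1.0, 0.0])         # MLS confidence metric

# Confidence-weighted L1 loss: low-confidence (unreliable) ground-truth
# A-scans contribute little or nothing to the loss.
l1_loss = np.mean(mls_confidence * np.abs(pred_boundary - gt_boundary))

# Binary Cross Entropy loss for the network's predicted confidence metric,
# with the MLS confidence as the target.
pred_conf = np.array([0.9, 0.8, 0.95, 0.1])
eps = 1e-7
bce_loss = -np.mean(mls_confidence * np.log(pred_conf + eps)
                    + (1.0 - mls_confidence) * np.log(1.0 - pred_conf + eps))
```

Note that the A-scan with zero MLS confidence (here the last one, whose ground truth is an outlier) contributes nothing to the L1 term.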
Some advantages of the present end-to-end, Layer Detection network are:
Some advantages of the present transformer-based network are:
The present problem of retinal layer detection is solved by using a custom architecture for this problem. The architecture consists of feature extraction and layer regression/prediction halves/parts. Five features of the present invention are:
1. The feature extraction half of a neural network in accordance with the present invention may consist of seven blocks/layers, including:
A depiction of the network intermediate output shapes/blocks/layers is shown in
As shown, the predictions are of the shape/dimension 2×w, where w is the image width (e.g., the number of A-scans in a B-scan or C-scan input image). The 1st dimension of this prediction is named ‘center’ and the 2nd dimension is named ‘heights’. The output is then computed from ‘center’ and ‘heights’ using hyperparameters h1 and h2, which influence the thickness/height prediction of the network. These parameters may be determined experimentally; for the present work, h1=10 and h2=2.3. Parameters ymin and ymax may refer to the upper and lower boundary of the target layer(s), respectively.
The present implementation makes use of the 1D Dice coefficient (or Sørensen-Dice coefficient, or another known similarity coefficient, e.g., a statistic used to gauge the similarity of two samples). Finally, the outputs of the network are trained to maximize the 1D Dice coefficient, which measures the overlap of each predicted layer with its ground truth at every A-scan location and can be stated as:
Dice = 2·max(0, min(yp_max, yg_max) − max(yp_min, yg_min)) / ((yp_max − yp_min) + (yg_max − yg_min))
where yp_min and yp_max are the predicted upper (top) and lower (bottom) layer boundaries, and yg_min and yg_max are the corresponding ground truth layer boundaries. This is an example of estimating two layer boundaries, such as the ILM and the outer boundary of the RNFL. Here, ymin corresponds to the ILM and ymax to the outer boundary of the RNFL for the predicted data point (subscript “p”).
The training therefore involves regressing (predicting/identifying) the top and the bottom of the (layer) regions of interest directly by using this new formulation.
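The per-A-scan overlap measure may be sketched as follows (the function and variable names are illustrative; the formula is the standard Sørensen-Dice coefficient applied to 1D intervals):

```python
def dice_1d(yp_min, yp_max, yg_min, yg_max):
    """Sørensen-Dice coefficient of two 1D intervals, i.e., per-A-scan
    layer extents (e.g., from the ILM down to the outer RNFL boundary)."""
    # Length of the overlap between the predicted and ground-truth intervals.
    intersection = max(0.0, min(yp_max, yg_max) - max(yp_min, yg_min))
    # Sum of the two interval lengths.
    total = (yp_max - yp_min) + (yg_max - yg_min)
    return 2.0 * intersection / total if total > 0 else 0.0
```

For instance, a predicted interval [100, 110] against a ground truth of [105, 115] overlaps by 5 pixels and yields a Dice score of 0.5.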
The problem of retinal layer segmentation is solved by using a Transformer Architecture in the following manner. The architecture consists of:
3. Linear Projection Layers
4. Transformer Layers
5. Output regression with activation and scaling
The initial (e.g., first) Linear Projection layer (e.g., Linear Projection Layer-1 in
In various embodiments, one reason the present embodiment can use transformers without adversely affecting the computational complexity is that the problem is reformulated. The computational cost of running a transformer on a sequence of ‘n’ inputs of ‘d’ dimensions is O(n²d). Therefore, for images of size h and w, the computational complexity is O((hw)²d) = O(h²w²d). In a naive set up, one can let each pixel be a 1-dimensional feature, so that d=1 and the complexity is O(h²w²).
In the present set up, each input A-scan (of, for example, 1024 or another number of samples/pixels) is projected by the first Linear Projection layer (Linear Projection Layer-1) to a 128-dimensional feature (d=128), e.g., having a feature height/depth of 128. This effectively translates an input A-scan of any size into a representative 128-dimensional vector having a format suitable for (e.g., expected by) the Transformer encoder. Now there are only ‘w’ sequences (A-scans) of 128 dimensions each, and the computational complexity thereby becomes O(w²d), which is far lower than the cost incurred when transformers are used for vision tasks in the traditional manner. This set up is made possible by the special nature of the OCT A-scan data.
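The reduction in computational cost can be checked with a quick illustrative calculation (the image size and feature dimension below are example values):

```python
# Self-attention on a sequence of n tokens of dimension d costs O(n^2 * d).
h, w, d = 1024, 512, 128   # example image height, width, feature dimension

# Naive setup: every pixel is its own 1-dimensional token -> O((h*w)^2).
naive_cost = (h * w) ** 2

# Present setup: each of the w A-scans is one d-dimensional token
# -> O(w^2 * d), independent of the axial depth h.
ascan_cost = w ** 2 * d

ratio = naive_cost / ascan_cost
```

For these example values, restricting the sequence to one token per A-scan reduces the attention cost by a factor of 8192 relative to the naive per-pixel setup.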
In the present exemplary embodiment, a transformer encoder known as BERT (Bidirectional Encoder Representations from Transformers) is used, as described, for example, in “An image is worth 16×16 words: Transformers for image recognition at scale” by Dosovitskiy, Alexey, et al., arXiv preprint arXiv:2010.11929 (2020), herein incorporated in its entirety by reference. It is to be understood, however, that other transformer-based architectures known in the art may be used.
Here, the input to the transformer is the 128-dimensional vector described earlier. The output of this network is also a 128-dimensional vector, which has been refined. These vectors are then projected directly to layer predictions by a second fully connected linear layer (e.g., Linear Projection Layer-2, 24, in
A block diagram of the present workflow described above is illustrated in
The present exemplary network is trained with the L1 (or ℓ1) loss function, as known in the art. In the present embodiment, since the ground truth data is also generated in an automated manner by a knowledge-based algorithm with a confidence metric, this confidence information is used to further improve the training process. The L1 loss for every A-scan is weighted by the confidence metric generated by the knowledge-based algorithm. In this manner, a lower weight (e.g., close to, or substantially equal to, 0) is assigned to loss generated at unreliable (e.g., low confidence) ground truth A-scans, and a higher weight (e.g., close to, or substantially equal to, 1) is assigned to loss generated for ground truths that are reliable (corresponding to high confidence positions of ground truth A-scans).
Transformers do not appear to have previously been used for OCT data in this manner. The present approach makes the whole set up very efficient by reducing computational complexity and producing state of the art results. Splitting up an OCT image as a sequence of A-scans can be used in two creative ways, as described above, a main benefit being a significant computational speed increase without losing accuracy. An object detection-like network and a transformer network can both be trained using this principle, and both perform well, with the transformer network outperforming all other methods. The whole network is trained in a semi-supervised manner by using confidence metrics from the knowledge-based algorithm to weigh the losses for the deep learning-based method.
A difficulty with creating a deep learning machine model based on a neural network can be obtaining enough training sets, or training pairs (e.g., a training input data sample and a corresponding training target output sample). This requirement can be partially addressed by data augmentation, which tries to generate new training samples from existing training sample data. However, current data augmentation methods for medical applications, such as OCT image data, have traditionally been limited. Herein is presented a novel method of data augmentation for OCT applications.
Some classical methods of retinal layer segmentation can be replaced with machine learning (ML) and/or deep learning (DL) methods. Still, some methods try to take advantage of both approaches. Mishra, Z. et al., (2020), in “Automated retinal layer segmentation using graph-based algorithm incorporating deep-learning-derived information”, Scientific Reports, 10(1), 1-8, incorporated herein in its entirety by reference, describes a method that takes probability maps generated through a fully convolutional neural network and applies a shortest-path algorithm to them to estimate final segmentation masks.
Most DL methods are based on a U-Net structure and try to predict a layer assignment for each pixel. Two examples of methods based on a U-Net structure are disclosed in De Fauw et al., (2018), “Clinically applicable deep learning for diagnosis and referral in retinal disease”, Nature medicine, 24(9), 1342-1350, and in Yadav, S. K. et al., (2021), “Deep Learning based Intraretinal Layer Segmentation using Cascaded Compressed U-Net”, medRxiv, both herein incorporated in their entirety by reference. U-Net is an Encoder-Decoder network that might be suitable for pixel-level prediction tasks such as semantic segmentation, but it is slow and inaccurate for layer boundary detection. To handle this task, some methods have tried to develop different network architectures, including fully convolutional networks, as described by Anoop, B. N. et al., (2020), in “Stack generalized deep ensemble learning for retinal layer segmentation in optical coherence tomography images,” Biocybernetics and Biomedical Engineering, 40(4), 1343-1358, herein incorporated in its entirety by reference. In “Clinically applicable deep learning for diagnosis and referral in retinal disease,” Nature medicine, (2018), 24(9), 1342-1350, herein incorporated in its entirety by reference, De Fauw et al. describe a classification-from-segmentation framework. Given any scan cube, the method first performs 3D layer segmentation using a 3D U-Net network and uses its output to perform the final diagnosis and referral tasks. While the segmentation module generates masks for 15 different anatomies, pathologies, and image artifacts, the method does not focus on generating accurate B-scan level layer segmentation. Most of these prior art techniques use general data augmentation methods used in a typical deep learning framework. While those augmentations are necessary, they are not sufficient.
There have been many attempts to use Generative Adversarial Networks (GAN) to augment OCT data for better training of ML models, as described, for example, in “Data augmentation for patch-based OCT choroid-retinal segmentation using generative adversarial networks”, 2021, Neural Computing and Applications, 33(13), 7393-7408, by Kugelman, J. et al., herein incorporated in its entirety by reference. Training such GANs is complex, time-consuming, and, most importantly, there is no generative method capable of generating ground-truth layer masks in addition to the OCT images.
Herein is proposed an augmentation method to make the machine model generalize well to different noise levels and complex disease cases. The present approach performs spectral-domain (e.g., raw OCT data) augmentation, time-domain global augmentations, and local morphing. The present neural network architecture may be a minimal fully-convolutional network, but other networks are possible and contemplated in the present invention. A training regime in accordance with this embodiment/approach may include pre-training the machine model on a vast corpus of unlabeled data to generate a strong representation of the input B-scans, and fine-tuning the machine model on ground truth samples acquired from existing classical methods. The present approach provides for effectively using data from a knowledge-based model to train a deep learning machine model. The present approach may use confidence values generated alongside the retinal layers by a classical method to filter out weak training data, as discussed above. Confidence values may also be incorporated in the cost function of the deep learning machine model. The values may be added as regularization terms directly to the cost function.
Regarding data augmentation, the described local warping technique has not been used by any other method on OCT data. The present spectral-domain (e.g., raw OCT data) method of data augmentation is also unique to the data generation and workflow pipeline. As for the present deep learning method(s), the cost function is deemed novel. The below combination of neural network structure and self-supervised approach enables enhancement of the operation of retinal layer segmentation from OCT images. The above-described novel deep learning machine models may also be used with the present novel features.
A: Global Time-Domain Augmentation
This approach applies extensive global augmentations (e.g., affine transforms, added noise, and brightness and contrast adjustments) to the training data to create additional training samples, making the trained machine models more robust and increasing their generalizability.
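The global time-domain augmentations just listed may be sketched as follows (a minimal numpy illustration, not the claimed implementation; the function name, parameter values, and B-scan dimensions are hypothetical):

```python
import numpy as np

def global_augment(bscan, rng, max_shift=8, noise_sigma=0.02,
                   brightness=0.1, contrast=0.2):
    """Apply simple global augmentations to a B-scan with values in
    [0, 1]: an axial shift (a minimal affine transform), a brightness/
    contrast adjustment, and additive Gaussian noise."""
    out = bscan.copy()
    # Affine component: shift the whole B-scan axially by a random amount.
    shift = int(rng.integers(-max_shift, max_shift + 1))
    out = np.roll(out, shift, axis=0)
    # Photometric components: contrast (gain) and brightness (offset).
    gain = 1.0 + rng.uniform(-contrast, contrast)
    bias = rng.uniform(-brightness, brightness)
    out = gain * out + bias
    # Additive Gaussian noise.
    out = out + rng.normal(0.0, noise_sigma, size=out.shape)
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
bscan = rng.random((128, 64))   # hypothetical B-scan, depth x A-scans
aug = global_augment(bscan, rng)
print(aug.shape)                # (128, 64)
```

Note that when such a geometric augmentation is applied to a labeled training sample, the same shift would also be applied to the associated layer annotations.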
B: Local Warping
Two significant limitations of existing data that restrict the generalizability of any deep learning machine model trained on this data are: (1) collecting data with all possible retinal diseases and variations is challenging; and (2) even if one can collect such data, the classical MLS tools will have difficulties generating reliable ground-truth retinal segmentation.
While global augmentations, as listed immediately above, will make deep learning models invariant to global geometric or photometric variations, they will not address local variations, such as can occur due to various diseases. Herein is presented a local warping method to deal with this issue. Local warping aims to simulate various pathologies, which can also include changes to OCT reflectivity (e.g., reflectivity changes that correspond to a specific pathology or pathologies). This approach thus provides local warping (changes in shape) and reflectivity changes (e.g., changes in intensity) to simulate different deformations/image artifacts characteristic of one or more specific diseases. For example, Age-Related Macular Degeneration (AMD) is a main concern, but other diseases, such as diabetic retinopathy (DR), glaucoma, vitreoretinal interface (VRI), and a combination of other diseases and deformations will be considered in the simulation (e.g., using the present data augmentation approach). This method will help the DL models learn the shape of the pathologies even without perfect simulations of them. This method differs from GANs since it provides complete control over generated images and corresponding layer boundaries, which is impossible in generative models (e.g., GANs).
The present disease simulation includes two steps: the shape of the retina is morphed in specific locations, and then the intensities are adjusted, or vice versa. This approach may be applied to ground truth examples that have already been labeled. Applying local warping on data with ground-truth layer annotations significantly increases the diversity of the data, since the retinal layer labels accurately carry forward to the warped/amended data, and no new/additional segmentation labels are needed for the morphed image. The method has complete control over the warping parameters and can apply the same augmentation to both B-scans and the associated layer annotations. For illustration purposes,
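The shape-morphing step may be sketched as follows (a hedged numpy illustration; the Gaussian displacement profile, function name, and dimensions are assumptions, and the intensity-adjustment step is omitted). The key property is that exactly the same displacement is applied to the B-scan and to its layer annotations:

```python
import numpy as np

def local_warp(bscan, boundaries, center, width, amplitude):
    """Warp a B-scan and its layer boundaries with the same smooth,
    local axial displacement (a Gaussian bump in the lateral
    direction), simulating a drusen-like deformation. `boundaries`
    is a (num_layers, num_ascans) array of boundary depths in pixels."""
    depth, n_ascans = bscan.shape
    x = np.arange(n_ascans)
    # Per-A-scan axial displacement, largest at `center`.
    disp = amplitude * np.exp(-0.5 * ((x - center) / width) ** 2)
    warped = np.empty_like(bscan)
    for j in range(n_ascans):
        s = int(round(disp[j]))   # shift this A-scan axially
        col = bscan[:, j]
        if s > 0:                 # pad at the top, clamping edge values
            warped[:, j] = np.concatenate([np.full(s, col[0]), col[:depth - s]])
        elif s < 0:
            warped[:, j] = np.concatenate([col[-s:], np.full(-s, col[-1])])
        else:
            warped[:, j] = col
    # The same displacement carries the labels forward, so no new
    # segmentation labels are needed for the morphed image.
    return warped, boundaries + np.round(disp)

rng = np.random.default_rng(2)
bscan = rng.random((128, 64))
boundaries = np.tile([[40.0], [80.0]], (1, 64))   # two flat layers
warped, wb = local_warp(bscan, boundaries, center=32, width=8, amplitude=10.0)
print(wb[0, 32] - boundaries[0, 32])              # 10.0 at the bump's center
```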
C1: Spectral Domain Augmentations
The present approach provides OCT data augmentation in the spectral domain to deal with low axial resolution data. The present approach amends raw OCT data to define additional training samples, as opposed to applying data augmentation to OCT imaged data (e.g., after applying a Fourier transform to the raw OCT data).
Layer segmentation on low axial resolution data is essential for low-cost OCT devices. While adjusting a classical method to work on low-resolution images is challenging, deep learning machine models could be trained to work on super low-resolution images. To train a robust model to work on images with a low axial resolution, the present approach may include the following four features:
1A: Collect spectral data with high-resolution and reconstruct OCT cubes (volume scans);
2A: Apply existing MLS models that perform well on high-resolution data to generate gold-standard layer segmentation (e.g., from the reconstructed high-resolution OCT cubes);
3A: Degrade the spectral data to lower resolution with various degrees and reconstruct low-resolution OCT cubes, e.g., low-resolution images, (from the degraded spectral (or raw OCT) data);
4A: Use these low-resolution images and the gold-standard layer annotation from step (2A) as input to train a neural network and define the deep learning machine model.
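Step 3A, degrading the spectral data, may be sketched as follows (a simplified numpy illustration under the assumption that axial resolution scales inversely with spectral bandwidth; the function name, truncation scheme, and dimensions are hypothetical):

```python
import numpy as np

def degrade_axial_resolution(spectra, keep_fraction):
    """Simulate a lower-resolution acquisition from raw spectral data
    by keeping only a central fraction of the spectral samples, then
    reconstructing the A-scans from the reduced spectrum."""
    n_k = spectra.shape[-1]
    keep = int(n_k * keep_fraction)
    start = (n_k - keep) // 2
    truncated = spectra[..., start:start + keep]
    # Reconstruct depth profiles (magnitude of the Fourier transform).
    return np.abs(np.fft.fft(truncated, axis=-1))

rng = np.random.default_rng(3)
spectra = rng.random((64, 2048))          # 64 A-scans of raw spectral data
low_res = degrade_axial_resolution(spectra, 0.25)
print(low_res.shape)                      # (64, 512): fewer axial samples
```

The low-resolution images so produced are then paired, per step 4A, with the gold-standard annotation generated from the original high-resolution reconstruction.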
C2: Augmentation in the Spectral Domain to Handle Noise
Adding noise in the spectral domain (raw OCT data) is a unique augmentation in our method. It is challenging to simulate noise observed on OCT data only in the time (e.g., image) domain. Therefore, the present approach takes advantage of having access to the raw OCT signal to do so. This process may include:
1B: Store noise scans during acquisition;
2B: Reconstruct OCT image;
3B: Apply existing MLS models that perform well on low-noise data to generate gold-standard layer segmentation;
4B: Sample noise from stored noise scans and apply to the raw signal and reconstruct the resultant noisy images;
5B: Use these noisy images as input, together with the gold-standard layer annotation from step (3B), to train the neural network and define the deep learning machine model.
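Steps 4B and 5B's noise application may be sketched as follows (a numpy illustration; the additive-noise model, function name, and dimensions are assumptions for illustration only):

```python
import numpy as np

def augment_with_noise_scans(raw_spectra, noise_scans, rng, scale=1.0):
    """Add measured noise (the noise scans stored at acquisition in
    step 1B) to clean raw spectra, then reconstruct the noisy image."""
    # Sample one stored noise scan per A-scan, with replacement.
    idx = rng.integers(0, noise_scans.shape[0], size=raw_spectra.shape[0])
    noisy_spectra = raw_spectra + scale * noise_scans[idx]
    # Reconstruct the image (magnitude of the Fourier transform).
    return np.abs(np.fft.fft(noisy_spectra, axis=-1))

rng = np.random.default_rng(4)
raw = rng.random((100, 1024))                          # clean raw spectra
stored_noise = 0.01 * rng.standard_normal((20, 1024))  # stored noise scans
noisy_image = augment_with_noise_scans(raw, stored_noise, rng)
print(noisy_image.shape)                               # (100, 1024)
```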
Additional data augmentations that may be made in the spectral domain include:
1C: Shift and shear B-scans axially in the spectral domain;
2C: Change contrast in the spectral domain; changing contrast in the time domain (e.g., multiplication of an image by a scalar) is equivalent to multiplication by a constant in the spectral domain.
3C: Adjust the brightness of the signal. In the time domain, this may be achieved by adding a scalar to the image data, but in the spectral domain this may be achieved by adding a constant to the zero-frequency component.
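These spectral-domain/time-domain equivalences follow from the linearity of the Fourier transform and may be verified on a synthetic one-dimensional signal (a numpy sketch on complex data, before magnitudes are taken; values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
image = rng.random(256)              # one reconstructed A-scan (synthetic)
spectrum = np.fft.fft(image)

# Contrast: multiplying the spectrum by a scalar multiplies the image.
contrast = np.fft.ifft(1.5 * spectrum).real
assert np.allclose(contrast, 1.5 * image)

# Brightness: adding to the zero-frequency bin offsets the whole image.
bright = spectrum.copy()
bright[0] += 0.2 * image.size        # the DC bin carries N times the mean
assert np.allclose(np.fft.ifft(bright).real, image + 0.2)

# Axial shift: a linear phase ramp in the spectral domain.
shift = 5
ramp = np.exp(-2j * np.pi * shift * np.fft.fftfreq(image.size))
shifted = np.fft.ifft(spectrum * ramp).real
assert np.allclose(shifted, np.roll(image, shift))
```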
Exemplary Network Architecture
The present data augmentation methods may be used with the above-described neural network architectures, or with any other NN architectures. For illustration purposes, herein is presented another NN approach.
Training Procedure
The network is pre-trained using a self-supervised approach (e.g., SimCLR, a framework for contrastive learning of visual representations) on data without ground-truth segmentation.
Data Collection
Manual labeling of data for this task is tedious. Therefore, the present approach uses the segmentation output of an existing knowledge-based (MLS) algorithm to train the present deep learning network. Consequently, the ground truth samples (e.g., training samples) resulting from the MLS may not be flawless. Thus, a question that arises is how to train a deep learning machine model that performs better than its classical teacher (e.g., better than the outputs from the MLS). To answer this question, two methods are developed to alleviate imperfections in the ground-truth data:
1E: Filter out samples in which the classical method has extremely low confidence;
2E: Incorporate confidence values generated by the classical method (alongside the layer segmentation) in the cost function of the deep learning machine model during training.
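Step 1E may be sketched as follows (a minimal Python illustration; the threshold value and names are hypothetical):

```python
import numpy as np

def filter_by_confidence(samples, confidences, threshold=0.2):
    """Drop training samples for which the classical MLS method
    reported extremely low confidence."""
    keep = confidences >= threshold
    return [s for s, k in zip(samples, keep) if k]

samples = ["b_scan_0", "b_scan_1", "b_scan_2"]
confidences = np.array([0.9, 0.05, 0.4])   # per-sample MLS confidence
print(filter_by_confidence(samples, confidences))
# ['b_scan_0', 'b_scan_2']
```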
Cost Function
The cost function used to train the primary machine model comprises two main terms: a data term and a regularization term. While the data term may be a simple L1 or L2 loss, various components form the regularization term:
1F: Confidence values of the ground-truth data generated by classical model;
2F: Smoothing terms;
3F: Terms used to enforce a physical limitation of layer boundaries (e.g., a distance transform of the binary image calculated from the ground-truth layer positions painted in a blank image);
4F: Cost images used by the classical MLS, where cost images are defined as a weighted average of one or more processed segmentation regions or B-scans that are used for segmentation (examples of processed segmentation regions are the axial gradient, gradient magnitude, filtered intensity image, filtered inverted intensity image, etc.).
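A minimal sketch of such a cost may combine a confidence-weighted L1 data term (per 1F) with a smoothness regularizer (per 2F); the physical-limitation and cost-image terms are omitted here, and the weighting scheme and names are illustrative assumptions, not the claimed cost function:

```python
import numpy as np

def segmentation_cost(pred, target, confidence, smooth_weight=0.1):
    """Confidence-weighted L1 data term plus a regularizer that
    penalizes abrupt boundary jumps between neighboring A-scans.
    `pred` and `target` hold one boundary's depth per A-scan."""
    data_term = np.mean(confidence * np.abs(pred - target))
    smooth_term = np.mean(np.abs(np.diff(pred)))
    return data_term + smooth_weight * smooth_term

target = np.full(64, 40.0)            # ground-truth boundary depths
confidence = np.ones(64)              # per-position MLS confidence
print(segmentation_cost(target, target, confidence))        # 0.0
print(segmentation_cost(target + 1.0, target, confidence))  # 1.0
```

Down-weighting positions where the classical teacher was unsure lets the model learn from imperfect ground truth without being penalized for disagreeing with it there.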
Hereinafter is provided a description of various hardware and architectures suitable for the present invention.
Fundus Imaging System
Two categories of imaging systems used to image the fundus are flood illumination imaging systems (or flood illumination imagers) and scan illumination imaging systems (or scan imagers). Flood illumination imagers flood an entire field of view (FOV) of interest of a specimen with light at the same time, such as by use of a flash lamp, and capture a full-frame image of the specimen (e.g., the fundus) with a full-frame camera (e.g., a camera having a two-dimensional (2D) photo sensor array of sufficient size to capture the desired FOV, as a whole). For example, a flood illumination fundus imager would flood the fundus of an eye with light, and capture a full-frame image of the fundus in a single image capture sequence of the camera. A scan imager provides a scan beam scanned across a subject, e.g., an eye, and the scan beam is imaged at different scan positions as it is scanned across the subject, creating a series of image-segments that may be reconstructed, e.g., montaged, to create a composite image of the desired FOV. The scan beam could be a point, a line, or a two-dimensional area such as a slit or broad line. Examples of fundus imagers are provided in U.S. Pat. Nos. 8,967,806 and 8,998,411.
From the scanner LnScn, the illumination beam passes through one or more optics, a scanning lens SL and an ophthalmic or ocular lens OL, that allow for the pupil of the eye E to be imaged to an image pupil of the system. Generally, the scan lens SL receives a scanning illumination beam from the scanner LnScn at any of multiple scan angles (incident angles), and produces scanning line beam SB with a substantially flat surface focal plane (e.g., a collimated light path). Ophthalmic lens OL may then focus the scanning line beam SB onto an object to be imaged. In the present example, ophthalmic lens OL focuses the scanning line beam SB onto the fundus F (or retina) of eye E to image the fundus. In this manner, scanning line beam SB creates a traversing scan line that travels across the fundus F. One possible configuration for these optics is a Kepler type telescope wherein the distance between the two lenses is selected to create an approximately telecentric intermediate fundus image (4-f configuration). The ophthalmic lens OL could be a single lens, an achromatic lens, or an arrangement of different lenses. All lenses could be refractive, diffractive, reflective or hybrid as known to one skilled in the art. The focal length(s) of the ophthalmic lens OL, scan lens SL and the size and/or form of the pupil splitting mirror SM and scanner LnScn could be different depending on the desired field of view (FOV), and so an arrangement in which multiple components can be switched in and out of the beam path, for example by using a flip-in optic, a motorized wheel, or a detachable optical element, depending on the field of view can be envisioned. Since the field of view change results in a different beam size on the pupil, the pupil splitting can also be changed in conjunction with the change to the FOV. For example, a 45° to 60° field of view is a typical, or standard, FOV for fundus cameras. Higher fields of view, e.g., a widefield FOV, of 60°-120°, or more, may also be feasible.
A widefield FOV may be desired for a combination of the Broad-Line Fundus Imager (BLFI) with another imaging modality, such as optical coherence tomography (OCT). The upper limit for the field of view may be determined by the accessible working distance in combination with the physiological conditions around the human eye. Because a typical human retina has a FOV of 140° horizontal and 80°-100° vertical, it may be desirable to have an asymmetrical field of view for the highest possible FOV on the system.
The scanning line beam SB passes through the pupil Ppl of the eye E and is directed towards the retinal, or fundus, surface F. The scanner LnScn1 adjusts the location of the light on the retina, or fundus, F such that a range of transverse locations on the eye E are illuminated. Reflected or scattered light (or emitted light in the case of fluorescence imaging) is directed back along a path similar to that of the illumination to define a collection beam CB on a detection path to camera Cmr.
In the “scan-descan” configuration of the present, exemplary slit scanning ophthalmic system SLO-1, light returning from the eye E is “descanned” by scanner LnScn on its way to pupil splitting mirror SM. That is, scanner LnScn scans the illumination beam from pupil splitting mirror SM to define the scanning illumination beam SB across eye E, but since scanner LnScn also receives returning light from eye E at the same scan position, scanner LnScn has the effect of descanning the returning light (e.g., cancelling the scanning action) to define a non-scanning (e.g., steady or stationary) collection beam from scanner LnScn to pupil splitting mirror SM, which folds the collection beam toward camera Cmr. At the pupil splitting mirror SM, the reflected light (or emitted light in the case of fluorescence imaging) is separated from the illumination light onto the detection path directed towards camera Cmr, which may be a digital camera having a photo sensor to capture an image. An imaging (e.g., objective) lens ImgL may be positioned in the detection path to image the fundus to the camera Cmr. As is the case for objective lens ObjL, imaging lens ImgL may be any type of lens known in the art (e.g., refractive, diffractive, reflective or hybrid lens). Additional operational details, in particular, ways to reduce artifacts in images, are described in PCT Publication No. WO2016/124644, the contents of which are herein incorporated in their entirety by reference. The camera Cmr captures the received image, e.g., it creates an image file, which can be further processed by one or more (electronic) processors or computing devices (e.g., the computer system of
In the present example, the camera Cmr is connected to a processor (e.g., processing module) Proc and a display (e.g., displaying module, computer screen, electronic screen, etc.) Dspl, both of which can be part of the image system itself, or may be part of separate, dedicated processing and/or displaying unit(s), such as a computer system wherein data is passed from the camera Cmr to the computer system over a cable or a computer network, including wireless networks. The display and processor can be an all-in-one unit. The display can be a traditional electronic display/screen or of the touch screen type and can include a user interface for displaying information to and receiving information from an instrument operator, or user. The user can interact with the display using any type of user input device as known in the art including, but not limited to, mouse, knobs, buttons, pointer, and touch screen.
It may be desirable for a patient's gaze to remain fixed while imaging is carried out. One way to achieve this is to provide a fixation target that the patient can be directed to stare at. Fixation targets can be internal or external to the instrument depending on what area of the eye is to be imaged. One embodiment of an internal fixation target is shown in
Slit-scanning ophthalmoscope systems can operate in different imaging modes depending on the light source and wavelength selective filtering elements employed. True color reflectance imaging (imaging similar to that observed by the clinician when examining the eye using a hand-held or slit lamp ophthalmoscope) can be achieved when imaging the eye with a sequence of colored LEDs (red, blue, and green). Images of each color can be built up in steps with each LED turned on at each scanning position, or each color image can be taken in its entirety separately. The three color images can be combined to display the true color image, or they can be displayed individually to highlight different features of the retina. The red channel best highlights the choroid, the green channel highlights the retina, and the blue channel highlights the anterior retinal layers. Also, light at specific frequencies (e.g., individual colored LEDs or lasers) can excite different fluorophores in the eye (e.g., autofluorescence) and the resulting fluorescence can be detected by filtering out the excitation wavelength.
The fundus imaging system can also provide an infrared reflectance image, such as by using an infrared laser (or other infrared light source). The infrared (IR) mode is advantageous because the eye is not sensitive to the IR wavelengths. This may permit a user to continuously take images without disturbing the eye (e.g., in a preview/alignment mode) to aid the user during alignment of the instrument. Also, the IR wavelengths have increased penetration through tissue and may provide improved visualization of choroidal structures. In addition, fluorescein angiography (FA) and indocyanine green (ICG) angiography imaging can be done by collecting images after a fluorescent dye has been injected into the bloodstream. For example, in FA (and/or ICG) a series of time-lapse images may be captured after injecting a light-reactive dye (e.g., fluorescent dye) into a subject's bloodstream. It is noted that care must be taken since the fluorescent dye may lead to a life-threatening allergic reaction in a portion of the population. High contrast, greyscale images are captured using specific light frequencies selected to excite the dye. As the dye flows through the eye, many parts of the eye are made to glow brightly (e.g., fluoresce), making it possible to discern the progress of the dye, and hence the blood flow, through the eye.
Optical Coherence Tomography Imaging System
Generally, optical coherence tomography (OCT) uses low-coherence light to produce two-dimensional (2D) and three-dimensional (3D) internal views of biological tissue. OCT enables in vivo imaging of retinal structures. OCT angiography (OCTA) produces flow information, such as vascular flow from within the retina. Examples of OCT systems are provided in U.S. Pat. Nos. 6,741,359 and 9,706,915, and examples of OCTA systems may be found in U.S. Pat. Nos. 9,700,206 and 9,759,544, all of which are herein incorporated in their entirety by reference. An exemplary OCT/OCTA system is provided herein.
Irrespective of the type of beam used, light scattered from the sample (e.g., sample light) is collected. In the present example, scattered light returning from the sample is collected into the same optical fiber Fbr1 used to route the light for illumination. Reference light derived from the same light source LtSrc1 travels a separate path, in this case involving optical fiber Fbr2 and retro-reflector RR1 with an adjustable optical delay. Those skilled in the art will recognize that a transmissive reference path can also be used and that the adjustable delay could be placed in the sample or reference arm of the interferometer. Collected sample light is combined with reference light, for example, in a fiber coupler Cplr1, to form light interference in an OCT light detector Dtctr1 (e.g., photodetector array, digital camera, etc.). Although a single fiber port is shown going to the detector Dtctr1, those skilled in the art will recognize that various designs of interferometers can be used for balanced or unbalanced detection of the interference signal. The output from the detector Dtctr1 is supplied to a processor (e.g., internal or external computing device) Cmp1 that converts the observed interference into depth information of the sample. The depth information may be stored in a memory associated with the processor Cmp1 and/or displayed on a display (e.g., computer/electronic display/screen) Scn1. The processing and storing functions may be localized within the OCT instrument, or functions may be offloaded onto (e.g., performed on) an external processor (e.g., an external computing device), to which the collected data may be transferred. An example of a computing device (or computer system) is shown in
The sample and reference arms in the interferometer could consist of bulk-optics, fiber-optics, or hybrid bulk-optic systems and could have different architectures such as Michelson, Mach-Zehnder or common-path based designs as known by those skilled in the art. Light beam as used herein should be interpreted as any carefully directed light path. Instead of mechanically scanning the beam, a field of light can illuminate a one- or two-dimensional area of the retina to generate the OCT data (see for example, U.S. Pat. No. 9,332,902; D. Hillmann et al., “Holoscopy—Holographic Optical Coherence Tomography,” Optics Letters, 36(13):2390 (2011); Y. Nakamura et al., “High-Speed Three Dimensional Human Retinal Imaging by Line Field Spectral Domain Optical Coherence Tomography,” Optics Express, 15(12):7103 (2007); Blazkiewicz et al., “Signal-To-Noise Ratio Study of Full-Field Fourier-Domain Optical Coherence Tomography,” Applied Optics, 44(36):7722 (2005)). In time-domain systems, the reference arm needs to have a tunable optical delay to generate interference. Balanced detection systems are typically used in TD-OCT and SS-OCT systems, while spectrometers are used at the detection port for SD-OCT systems. The invention described herein could be applied to any type of OCT system. Various aspects of the invention could apply to any type of OCT system or other types of ophthalmic diagnostic systems and/or multiple ophthalmic diagnostic systems including but not limited to fundus imaging systems, visual field test devices, and scanning laser polarimeters.
In Fourier Domain optical coherence tomography (FD-OCT), each measurement is the real-valued spectral interferogram (Sj(k)). The real-valued spectral data typically goes through several post-processing steps including background subtraction, dispersion correction, etc. The Fourier transform of the processed interferogram results in a complex valued OCT signal output Aj(z)=|Aj|e^(iφj). The absolute value of this complex OCT signal, |Aj|, reveals the profile of scattering intensities at different path lengths, and therefore scattering as a function of depth (z-direction) in the sample. Similarly, the phase, φj, can also be extracted from the complex valued OCT signal. The profile of scattering as a function of depth is called an axial scan (A-scan). A set of A-scans measured at neighboring locations in the sample produces a cross-sectional image (tomogram or B-scan) of the sample. A collection of B-scans collected at different transverse locations on the sample makes up a data volume or cube. For a particular volume of data, the term fast axis refers to the scan direction along a single B-scan whereas slow axis refers to the axis along which multiple B-scans are collected. The term “cluster scan” may refer to a single unit or block of data generated by repeated acquisitions at the same (or substantially the same) location (or region) to analyze motion contrast, which may identify blood flow. A cluster scan can consist of multiple A-scans or B-scans collected with relatively short time separations at about the same location(s) on the sample. Since the scans in a cluster scan are of the same region, static structures remain relatively unchanged from scan to scan within the cluster scan, whereas motion contrast between the scans that meets predefined criteria may be identified as blood flow.
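The FD-OCT processing chain just described may be illustrated on a synthetic interferogram with a single reflector (a numpy sketch; the background model and dimensions are illustrative):

```python
import numpy as np

n_k = 1024                               # spectral samples per A-scan
k = np.arange(n_k)
depth_bin = 40                           # synthetic reflector depth
background = np.ones(n_k)                # DC background term
interferogram = background + 0.5 * np.cos(2 * np.pi * depth_bin * k / n_k)

processed = interferogram - background   # background subtraction
a_complex = np.fft.fft(processed)        # complex OCT signal Aj(z)
a_scan = np.abs(a_complex)               # scattering magnitude |Aj|
phase = np.angle(a_complex)              # phase, for phase-based methods

print(int(np.argmax(a_scan[: n_k // 2])))  # 40: the reflector's depth bin
```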
A variety of ways to create B-scans are known in the art including but not limited to: along the horizontal or x-direction, along the vertical or y-direction, along the diagonal of x and y, or in a circular or spiral pattern. B-scans may be in the x-z dimensions but may be any cross-sectional image that includes the z-dimension. An example OCT B-scan image of a normal retina of a human eye is illustrated in
In OCT Angiography, or Functional OCT, analysis algorithms may be applied to OCT data collected at the same, or about the same, sample locations on a sample at different times (e.g., a cluster scan) to analyze motion or flow (see for example US Patent Publication Nos. 2005/0171438, 2012/0307014, 2010/0027857, 2012/0277579 and U.S. Pat. No. 6,549,801, which are herein incorporated in their entirety by reference). An OCT system may use any one of a number of OCT angiography processing algorithms (e.g., motion contrast algorithms) to identify blood flow. For example, motion contrast algorithms can be applied to the intensity information derived from the image data (intensity-based algorithm), the phase information from the image data (phase-based algorithm), or the complex image data (complex-based algorithm). An en face image is a 2D projection of 3D OCT data (e.g., by averaging the intensity of each individual A-scan, such that each A-scan defines a pixel in the 2D projection). Similarly, an en face vasculature image is an image displaying motion contrast signal in which the data dimension corresponding to depth (e.g., z-direction along an A-scan) is displayed as a single representative value (e.g., a pixel in a 2D projection image), typically by summing or integrating all or an isolated portion of the data (see for example U.S. Pat. No. 7,301,644 herein incorporated in its entirety by reference). OCT systems that provide an angiography imaging functionality may be termed OCT angiography (OCTA) systems.
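The en face projection described above may be sketched as follows (a numpy illustration; the slab boundaries and aggregation choice are illustrative):

```python
import numpy as np

def en_face(volume, z_top, z_bottom, method="mean"):
    """Project the slab of an OCT volume (z, x, y) lying between two
    boundary depths onto a 2D en face image, aggregating each A-scan's
    slab values into a single representative pixel."""
    slab = volume[z_top:z_bottom]
    return slab.mean(axis=0) if method == "mean" else slab.sum(axis=0)

volume = np.random.default_rng(5).random((64, 32, 32))  # z, x, y
projection = en_face(volume, z_top=10, z_bottom=20)
print(projection.shape)   # (32, 32): one pixel per A-scan
```

In practice the slab boundaries would come from the layer segmentation discussed throughout, rather than fixed depth indices.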
Neural Networks
The present invention may use a neural network (NN) machine learning (ML) model. For the sake of completeness, a general discussion of neural networks is provided herein. The present invention may use any of the below-described neural network architectures, singularly or in combination. A neural network, or neural net, is a (nodal) network of interconnected neurons, where each neuron represents a node in the network. Groups of neurons may be arranged in layers, with the outputs of one layer feeding forward to a next layer in a multilayer perceptron (MLP) arrangement. MLP may be understood to be a feedforward neural network model that maps a set of input data onto a set of output data.
Typically, each neuron (or node) produces a single output fed forward to neurons in the layer immediately following it. But each neuron in a hidden layer may receive multiple inputs, either from the input layer or from the outputs of neurons in an immediately preceding hidden layer. Each node may apply a function to its inputs to produce an output for that node. Nodes in hidden layers (e.g., learning layers) may apply the same function to their respective input(s) to produce their respective output(s). Some nodes, however, such as the nodes in input layer InL, receive only one input and may be passive, meaning they simply relay the values of their single input to their output(s), e.g., they provide a copy of their input to their output(s), as illustratively shown by dotted arrows within the nodes of input layer InL.
For illustration purposes,
The neural net learns (e.g., is trained to determine) appropriate weight values to achieve a desired output for an input during a training, or learning, stage. Before the neural net is trained, each weight may be individually assigned an initial (e.g., random and optionally non-zero) value, e.g., a random-number seed. Various methods of assigning initial weights are known in the art. The weights are then trained (optimized) so that for a training vector input, the neural network produces an output close to a desired (predetermined) training vector output. For example, the weights may be incrementally adjusted in thousands of iterative cycles by a technique termed back-propagation. In each cycle of back-propagation, a training input (e.g., vector input or training input image/sample) is fed forward through the neural network to determine its actual output (e.g., vector output). An error for each output neuron, or output node, is then calculated based on the actual neuron output and a target training output for that neuron (e.g., a training output image/sample corresponding to the present training input image/sample). One then propagates back through the neural network (in a direction from the output layer back to the input layer) updating the weights based on how much effect each weight has on the overall error so the output of the neural network moves closer to the desired training output. This cycle is then repeated until the actual output of the neural network is within an acceptable error range of the desired training output for the training input. As it would be understood, each training input may require many back-propagation iterations before achieving a desired error range. Typically, an epoch refers to one back-propagation iteration (e.g., one forward pass and one backward pass) of all the training samples, such that training a neural network may require many epochs.
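The back-propagation cycle described above can be reduced to a few lines for a single linear neuron (a toy numpy illustration, not a practical network; the learning rate and iteration count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=3)              # randomly initialized weights
x = np.array([0.5, -1.0, 2.0])      # one training input vector
target = 1.0                        # desired training output

for _ in range(200):                # iterative back-propagation cycles
    y = w @ x                       # forward pass: actual output
    error = y - target              # error vs. the training target
    w -= 0.05 * error * x           # update each weight by its effect

print(abs(w @ x - target) < 1e-6)   # True: output within error range
```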
Generally, the larger the training set, the better the performance of the trained ML model, so various data augmentation methods may increase the size of the training set. For example, when the training set includes pairs of corresponding training input images and training output images, the training images may be divided into multiple corresponding image segments (or patches). Corresponding patches from a training input image and training output image may be paired to define multiple training patch pairs from one input/output image pair, which enlarges the training set. Training on large training sets, however, places high demands on computing resources, e.g., memory and data processing resources. Computing demands may be reduced by dividing a large training set into multiple mini-batches, where the mini-batch size defines the number of training samples in one forward/backward pass. Here, one epoch may include multiple mini-batches. Another issue is the possibility of a NN overfitting a training set such that its capacity to generalize from a specific input to a different input is reduced. Issues of overfitting may be mitigated by creating an ensemble of neural networks or by randomly dropping out nodes within a neural network during training, which effectively removes the dropped nodes from the neural network. Various dropout regularization methods, such as inverse dropout, are known in the art.
It is noted that the operation of a trained NN machine model is not a straight-forward algorithm of operational/analyzing steps. When a trained NN machine model receives an input, the input is not analyzed in the traditional sense. Rather, irrespective of the subject or nature of the input (e.g., a vector defining a live image/scan or a vector defining some other entity, such as a demographic description or a record of activity), the input will be subjected to the same predefined architectural construct of the trained neural network (e.g., the same nodal/layer arrangement, trained weight and bias values, predefined convolution/deconvolution operations, activation functions, pooling operations, etc.), and it may not be clear how the trained network's architectural construct produces its output. The values of the trained weights and biases are not deterministic and depend upon many factors, such as the time the neural network is given for training (e.g., the number of epochs in training), the random starting values of the weights before training starts, the computer architecture of the machine on which the NN is trained, the choice of training samples, the distribution of the training samples among multiple mini-batches, the choice of activation function(s), the choice of error function(s) that modify the weights, and even whether training is interrupted on one machine (e.g., having a first computer architecture) and completed on another machine (e.g., having a different computer architecture). The reasons a trained ML model reaches certain outputs are not clear, and much research is ongoing to determine the factors on which a ML model bases its outputs. Thus, the processing of a neural network on live data cannot be reduced to a simple algorithm of steps. Rather, its operation depends upon its training architecture, training sample sets, training sequence, and various circumstances in the training of the ML model.
Construction of a NN machine learning model may include a learning (or training) stage and a classification (or operational) stage. In the learning stage, the neural network may be trained for a specific purpose and may be provided with a set of training examples, including training (sample) inputs and training (sample) outputs, and optionally including a set of validation examples to test the progress of the training. During this learning process, various weights associated with nodes and node-interconnections in the neural network are incrementally adjusted to reduce an error between an actual output of the neural network and the desired training output. In this manner, a multi-layer feed-forward neural network (such as discussed above) may be made capable of approximating any measurable function to any desired accuracy. The result of the learning stage is a (neural network) machine learning (ML) model that has been learned (e.g., trained). In the operational stage, a set of test inputs (or live inputs) may be submitted to the learned (trained) ML model, which may apply what it has learned to produce an output prediction based on the test inputs.
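The learning and operational stages described above may be illustrated with a toy model. This is a hedged sketch only: a single linear unit trained by gradient descent, not a multi-layer network, chosen so the incremental weight adjustment that reduces the error between actual and desired outputs is visible in a few lines.

```python
import numpy as np

# Toy "learning stage": incrementally adjust a weight and bias to reduce the
# error between the model's actual output and the desired training output.
# (Illustrative only -- a real NN has many layers and nonlinear activations.)
X = np.array([0.0, 1.0, 2.0, 3.0])
Y = 2.0 * X + 1.0          # desired training outputs (the "ground truth")

w, b, lr = 0.0, 0.0, 0.05  # random/zero starting weights, learning rate
for epoch in range(500):
    pred = w * X + b                  # actual output of the model
    err = pred - Y                    # error vs. the desired training output
    w -= lr * (err * X).mean()        # incremental adjustment of the weight
    b -= lr * err.mean()              # incremental adjustment of the bias

# "Operational stage": apply the learned (trained) model to a live input.
live_input = 5.0
prediction = w * live_input + b       # approximately 2*5 + 1 = 11
```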
Like regular neural networks, Convolutional Neural Networks (CNNs) have been successfully applied to many computer vision problems. Training a CNN generally requires a large training dataset. The U-Net architecture is based on CNNs and can generally be trained on a smaller training dataset than conventional CNNs.
The contracting path is similar to an encoder, and generally captures context (or feature) information by feature maps. In the present example, each encoding module in the contracting path may include two or more convolutional layers, illustratively indicated by an asterisk symbol “*”, which may be followed by a max pooling layer (e.g., DownSampling layer). For example, input image U-in is illustratively shown to undergo two convolution layers, each with 32 feature maps. As would be understood, each convolution kernel produces a feature map (e.g., the output from a convolution operation with a kernel is an image typically termed a “feature map”). For example, input U-in undergoes a first convolution that applies 32 convolution kernels (not shown) to produce an output consisting of 32 respective feature maps. However, as is known in the art, the number of feature maps produced by a convolution operation may be adjusted (up or down). For example, the number of feature maps may be reduced by averaging groups of feature maps, by dropping some feature maps, or by another known method of feature map reduction. In the present example, this first convolution is followed by a second convolution whose output is limited to 32 feature maps. Another way to envision feature maps is to think of the output of a convolution layer as a 3D image whose 2D dimension is given by the listed X-Y planar pixel dimension (e.g., 128×128 pixels), and whose depth is given by the number of feature maps (e.g., 32 planar images deep). Following this analogy, the output of the second convolution (e.g., the output of the first encoding module in the contracting path) may be described as a 128×128×32 image. The output from the second convolution then undergoes a pooling operation, which reduces the 2D dimension of each feature map (e.g., the X and Y dimensions may each be reduced by half). The pooling operation may be embodied within the DownSampling operation, as indicated by a downward arrow.
Several pooling methods, such as max pooling, are known in the art and the specific pooling method is not critical to the present invention. The number of feature maps may double at each pooling, starting with 32 feature maps in the first encoding module (or block), 64 in the second encoding module, and so on. The contracting path thus forms a convolutional network consisting of multiple encoding modules (or stages or blocks). As is typical of convolutional networks, each encoding module may provide at least one convolution stage followed by an activation function (e.g., a rectified linear unit (ReLU) or sigmoid layer), not shown, and a max pooling operation. Generally, an activation function introduces non-linearity into a layer (e.g., to help avoid overfitting issues), receives the results of a layer, and determines whether to “activate” the output (e.g., determines whether the value of a node meets predefined criteria to have an output forwarded to a next layer/node). The contracting path generally reduces spatial information while increasing feature information.
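The feature-map and pooling shapes discussed above may be sketched as follows. This is an illustrative NumPy sketch only; a 32×32 image with 8 kernels is used here for brevity in place of the 128×128 image and 32 kernels of the example above, and the naive loop stands in for an optimized convolution.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d_same(image, kernels):
    """3x3 convolution with zero padding so the X-Y size is preserved.
    `kernels` has shape (num_kernels, 3, 3); each kernel yields one feature map,
    so the output is a 3D stack: (height, width, num_feature_maps)."""
    padded = np.pad(image, 1)
    h, w = image.shape
    out = np.empty((h, w, len(kernels)))
    for k, kern in enumerate(kernels):
        for y in range(h):
            for x in range(w):
                out[y, x, k] = np.sum(padded[y:y+3, x:x+3] * kern)
    return out

def max_pool(fmaps, size=2):
    """2x2 max pooling: halves each feature map's X and Y dimensions while
    keeping the number of feature maps unchanged."""
    h, w, c = fmaps.shape
    return (fmaps[:h // size * size, :w // size * size, :]
            .reshape(h // size, size, w // size, size, c)
            .max(axis=(1, 3)))

image = rng.random((32, 32))
feature_maps = conv2d_same(image, rng.random((8, 3, 3)))  # 32x32x8 "3D image"
pooled = max_pool(feature_maps)                           # 16x16x8 after pooling
```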
The expanding path is similar to a decoder, and may provide localization and spatial information for the results of the contracting path, despite the down sampling and any max-pooling performed in the contracting stage. The expanding path includes multiple decoding modules, where each decoding module concatenates its current up-converted input with the output of a corresponding encoding module. In this manner, feature and spatial information are combined in the expanding path through a sequence of up-convolutions (e.g., UpSampling or transpose convolutions or deconvolutions) and concatenations with high-resolution features from the contracting path (e.g., via CC1 to CC4). Thus, the output of a deconvolution layer is concatenated with the corresponding (optionally cropped) feature map from the contracting path, followed by two convolutional layers and activation function (with optional batch normalization).
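One decoding step of the expanding path may be sketched as follows. This is an illustrative NumPy sketch under stated assumptions: nearest-neighbour up-sampling stands in for a learned transpose convolution, the sizes are arbitrary examples, and the optional cropping and follow-on convolutions are omitted.

```python
import numpy as np

def upsample2x(fmaps):
    """Nearest-neighbour up-sampling: doubles each feature map's X-Y size
    (a simple stand-in for a learned transpose convolution/deconvolution)."""
    return fmaps.repeat(2, axis=0).repeat(2, axis=1)

def decode_step(decoder_in, encoder_skip):
    """One decoding module: up-convert the current input, then concatenate it
    channel-wise with the corresponding encoder output (the skip connection,
    e.g., CC1 to CC4), combining feature and spatial information."""
    up = upsample2x(decoder_in)
    return np.concatenate([up, encoder_skip], axis=2)

rng = np.random.default_rng(0)
decoder_in = rng.random((64, 64, 64))      # coarse, feature-rich decoder input
encoder_skip = rng.random((128, 128, 32))  # high-resolution encoder features
merged = decode_step(decoder_in, encoder_skip)  # 128x128x(64+32)
```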
The output from the last expanding module in the expanding path may be fed to another processing/training block or layer, such as a classifier block, that may be trained along with the U-Net architecture. Alternatively, or in addition, the output of the last upsampling block (at the end of the expanding path) may be submitted to another convolution (e.g., an output convolution) operation, as indicated by a dotted arrow, before producing its output U-out. The kernel size of the output convolution may be selected to reduce the dimensions of the last upsampling block to a desired size. For example, the neural network may have multiple features per pixel right before reaching the output convolution, which may provide a 1×1 convolution operation to combine these multiple features into a single output value per pixel, on a pixel-by-pixel level.
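The 1×1 output convolution described above may be sketched as follows. This is an illustrative NumPy sketch only; the weights would in practice be learned during training, and the sizes are example values.

```python
import numpy as np

def output_conv_1x1(fmaps, weights):
    """1x1 convolution: at each pixel, linearly combine the C per-pixel feature
    values into a single output value using one weight per feature map."""
    # tensordot contracts the feature axis, leaving one value per pixel.
    return np.tensordot(fmaps, weights, axes=([2], [0]))

rng = np.random.default_rng(0)
fmaps = rng.random((128, 128, 32))     # 32 features per pixel before U-out
weights = rng.random(32)               # one 1x1 kernel -> one output channel
out = output_conv_1x1(fmaps, weights)  # 128x128, single value per pixel
```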
Computing Device/System
In some embodiments, the computer system may include a processor Cpnt1, memory Cpnt2, storage Cpnt3, an input/output (I/O) interface Cpnt4, a communication interface Cpnt5, and a bus Cpnt6. The computer system may optionally also include a display Cpnt7, such as a computer monitor or screen.
Processor Cpnt1 includes hardware for executing instructions, such as those making up a computer program. For example, processor Cpnt1 may be a central processing unit (CPU) or a general-purpose computing on graphics processing unit (GPGPU). Processor Cpnt1 may retrieve (or fetch) the instructions from an internal register, an internal cache, memory Cpnt2, or storage Cpnt3; decode and execute the instructions; and write one or more results to an internal register, an internal cache, memory Cpnt2, or storage Cpnt3. In particular embodiments, processor Cpnt1 may include one or more internal caches for data, instructions, or addresses. For example, processor Cpnt1 may include one or more instruction caches and one or more data caches (e.g., to hold data tables). Instructions in the instruction caches may be copies of instructions in memory Cpnt2 or storage Cpnt3, and the instruction caches may speed up retrieval of those instructions by processor Cpnt1. Processor Cpnt1 may include any suitable number of internal registers, and may include one or more arithmetic logic units (ALUs). Processor Cpnt1 may be a multi-core processor, or the computer system may include one or more processors Cpnt1. Although this disclosure describes and illustrates a particular processor, this disclosure contemplates any suitable processor.
Memory Cpnt2 may include main memory for storing instructions for processor Cpnt1 to execute or to hold interim data during processing. For example, the computer system may load instructions or data (e.g., data tables) from storage Cpnt3 or from another source (such as another computer system) to memory Cpnt2. Processor Cpnt1 may load the instructions and data from memory Cpnt2 to one or more internal registers or internal caches. To execute the instructions, processor Cpnt1 may retrieve and decode the instructions from the internal register or internal cache. During or after execution of the instructions, processor Cpnt1 may write one or more results (which may be intermediate or final results) to the internal register, internal cache, memory Cpnt2, or storage Cpnt3. Bus Cpnt6 may include one or more memory buses (which may each include an address bus and a data bus) and may couple processor Cpnt1 to memory Cpnt2 and/or storage Cpnt3. Optionally, one or more memory management units (MMUs) may help with data transfers between processor Cpnt1 and memory Cpnt2. Memory Cpnt2 (which may be fast, volatile memory) may include random access memory (RAM), such as dynamic RAM (DRAM) or static RAM (SRAM). Storage Cpnt3 may include long-term or mass storage for data or instructions. Storage Cpnt3 may be internal or external to the computer system, and may include one or more of a disk drive (e.g., a hard-disk drive, HDD, or a solid-state drive, SSD), flash memory, ROM, EPROM, an optical disc, a magneto-optical disc, magnetic tape, a Universal Serial Bus (USB)-accessible drive, or another type of non-volatile memory.
I/O interface Cpnt4 may be software, hardware, or a combination of both, and include one or more interfaces (e.g., serial or parallel communication ports) for communication with I/O devices, which may enable communication with a person (e.g., user). For example, I/O devices may include a keyboard, keypad, microphone, monitor, mouse, printer, scanner, speaker, still camera, stylus, tablet, touch screen, trackball, video camera, another suitable I/O device, or a combination of two or more of these.
Communication interface Cpnt5 may provide network interfaces for communication with other systems or networks. Communication interface Cpnt5 may include a Bluetooth interface or another type of packet-based communication interface. For example, communication interface Cpnt5 may include a network interface controller (NIC) and/or a wireless NIC or a wireless adapter for communicating with a wireless network. Communication interface Cpnt5 may provide communication with a WI-FI network, an ad hoc network, a personal area network (PAN), a wireless PAN (e.g., a Bluetooth WPAN), a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), a cellular telephone network (such as a Global System for Mobile Communications (GSM) network), the Internet, or a combination of two or more of these.
Bus Cpnt6 may provide a communication link between the above-mentioned components of the computing system. For example, bus Cpnt6 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a front-side bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand bus, a low-pin-count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a serial advanced technology attachment (SATA) bus, a Video Electronics Standards Association local (VLB) bus, or other suitable bus or a combination of two or more of these.
In various exemplary embodiments, a method for segmenting one or more target retinal layers from an optical coherence tomography (OCT) volume scan of an eye is provided. The method includes acquiring, by an OCT system, the OCT volume scan, which includes a plurality of B-scans, and then submitting, by the OCT system, one or more B-scans to a deep learning machine model configured with a self-attention mechanism that differentially weighs priority levels of different regions of each B-scan based on each region's relationship to the one or more target retinal layers, enhancing regions of each B-scan associated with the one or more target retinal layers and deemphasizing regions not associated with the one or more target retinal layers. The deep learning machine model is configured to maintain a data density of a width dimension of each B-scan, and to reduce the data density of a depth dimension of each B-scan based on the number of the one or more target retinal layers to be segmented. Each B-scan comprises a plurality of adjacent A-scans, and the self-attention mechanism enhances one or more Layer-of-Interest (LOI) regions corresponding with the one or more target retinal layers within each A-scan based on topology information. The plurality of adjacent A-scans are processed in parallel.
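The general idea of attention-based enhancement of LOI regions, applied to all A-scans of a B-scan in parallel, may be illustrated with a toy sketch. This is a hypothetical illustration only, not the claimed model: a fixed per-depth score stands in for learned attention, and the sizes are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_weight_ascans(bscan, loi_score):
    """Toy attention weighting: scale each depth position of every A-scan by a
    softmax over a per-depth 'layer-of-interest' score, enhancing LOI regions
    and de-emphasizing the rest. All adjacent A-scans are processed in parallel
    as one (width x depth) array via broadcasting."""
    weights = softmax(loi_score)           # (depth,) attention weights
    return bscan * weights[np.newaxis, :]  # broadcast over all A-scans at once

rng = np.random.default_rng(0)
bscan = rng.random((512, 1024))   # 512 adjacent A-scans, 1024 depth samples
loi_score = np.zeros(1024)
loi_score[300:400] = 4.0          # pretend these depths hold the target layers
weighted = attention_weight_ascans(bscan, loi_score)
```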
In various exemplary embodiments, a method for segmenting one or more target retinal layers from an optical coherence tomography (OCT) scan of an eye is provided. The method includes acquiring, by an OCT system, the OCT scan, including at least one B-scan; and submitting, by the OCT system, one or more of said at least one B-scan to a deep learning machine model based on a neural network trained with a training set that includes augmented training samples. The creation of the augmented training samples includes: collecting, by a processor, high-resolution raw spectral data; constructing, by the processor, primary high-resolution OCT image data from the collected high-resolution raw spectral data; defining, by the processor, ground truth layer segmentation label data from the primary high-resolution OCT image data; amending, by the processor, the raw spectral data and generating secondary OCT image data; and using, by the processor, the secondary OCT image data as an augmented training sample and the ground truth layer segmentation label data as part of a training output target sample in the training set of the neural network. The primary high-resolution OCT image data and the secondary OCT image data provide structural data. An acquired OCT scan is a volume scan comprising a plurality of B-scans. Amending the raw spectral data comprises degrading the raw spectral data, and amending the raw spectral data also comprises applying local wrapping and changes in reflectivity to simulate at least one pathology of a plurality of pathologies, or accessing sample noise data from a store of OCT noise scans and applying the sampled noise data to the raw spectral data. The ground truth layer segmentation label data is defined by submission of the primary high-resolution OCT image data to an automated Multi retinal Layer Segmentation utility.
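The spectral-domain augmentation idea above may be illustrated with a toy one-A-scan sketch. This is a hypothetical illustration under stated assumptions, not the claimed method: an A-scan is modeled as the magnitude of the Fourier transform of a simulated spectral fringe, "amending" is reduced to adding Gaussian noise to the raw spectrum, and the simple thresholding stands in for an automated segmentation utility.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate raw spectral data: two reflectors produce two cosine fringes.
n = 2048
k = np.arange(n)
depth_reflectors = [(200, 1.0), (260, 0.6)]   # (depth bin, reflectivity)
spectrum = sum(a * np.cos(2 * np.pi * z * k / n) for z, a in depth_reflectors)

# Primary high-resolution OCT image data (one A-scan) via Fourier transform.
primary = np.abs(np.fft.rfft(spectrum))

# Stand-in "ground truth" labels derived from the clean primary data.
labels = primary > 0.5 * primary.max()

# Amend (degrade) the raw spectral data, then reconstruct a secondary A-scan;
# the labels from the primary data remain valid for this degraded sample.
noisy_spectrum = spectrum + rng.normal(0, 0.5, n)
secondary = np.abs(np.fft.rfft(noisy_spectrum))
```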
Although this disclosure describes and illustrates a particular computer system having a particular number of particular components in a particular arrangement, this disclosure contemplates any suitable computer system having any suitable number of any suitable components in any suitable arrangement.
Herein, a computer-readable non-transitory storage medium or media may include one or more semiconductor-based or other integrated circuits (ICs) (such, as field-programmable gate arrays (FPGAs) or application-specific ICs (ASICs)), hard disk drives (HDDs), hybrid hard drives (HHDs), optical discs, optical disc drives (ODDs), magneto-optical discs, magneto-optical drives, floppy diskettes, floppy disk drives (FDDs), magnetic tapes, solid-state drives (SSDs), RAM-drives, SECURE DIGITAL cards or drives, any other suitable computer-readable non-transitory storage media, or any suitable combination of two or more of these, where appropriate. A computer-readable non-transitory storage medium may be volatile, non-volatile, or a combination of volatile and non-volatile, where appropriate.
While the invention has been described in conjunction with several specific embodiments, it is evident to those skilled in the art that many further alternatives, modifications, and variations will be apparent in light of the foregoing description. Thus, the invention described herein is intended to embrace all such alternatives, modifications, applications and variations as may fall within the spirit and scope of the appended claims.
This application claims the benefit of priority under 35 U.S.C. 120 to U.S. Provisional Application Ser. No. 63/292,194 entitled “END TO END DEEP LEARNING BASED OCT MULTI RETINAL LAYER SEGMENTATION”, filed on Dec. 21, 2021, the entire contents of which are incorporated by reference for all purposes.
Number | Date | Country
---|---|---
63292194 | Dec 2021 | US