The present disclosure relates, in general terms, to methods and systems for image segmentation. It is particularly, though not exclusively, applicable to segmentation of medical images, such as volumetric image data from optical coherence tomography.
An essential task in the clinical use of medical images is segmentation, which is the task of detecting and identifying regions of interest in an acquired scan and subsequently quantifying them for signs of pathology. Segmentation is important as it allows quantitative measurement of biomarkers, such as tissue characteristics, for the detection and monitoring of abnormalities, which has applications in disease screening and management.
Various methods have been reported for medical image segmentation. Two-dimensional approaches to medical image segmentation have been successfully demonstrated in a range of applications. Many architectures for volumetric segmentation have been developed based on the U-Net architecture, and a similar approach was proposed in developing V-Net. However, these approaches usually require the whole volumetric image to be considered, which is computationally expensive and imposes extensive memory requirements.
Recently, recurrent networks have been gaining popularity due to their sequential treatment of volumetric medical images. One known application of recurrent networks uses a combination of Fully Convolutional Networks (FCN) and Recurrent Neural Networks (RNN). However, this still requires the whole medical scan to be available. Addressing this issue, several approaches have been proposed that combine the sequential property of spatial context in medical images with two-dimensional segmentation. However, these methods are computationally heavy and are prone to the memory leakage problems that are common in recurrent networks.
A network called D-UNet learns the spatial context of adjacent slices during the encoding stage using three-dimensional convolution, and combines this with two-dimensional segmentation that treats the adjacent slices as different channels. The Globally Guided Progressive Fusion Network (GGPF-Net) learns spatial context in a similar manner. These methods have outperformed sequential and three-dimensional FCN approaches in both segmentation performance and computational efficiency. However, the spatial context learnt by such architectures depends on the quality of the labels available.
It would be desirable to overcome or alleviate at least one of the above-described problems, or at least to provide a useful alternative.
Disclosed herein is a method of segmenting a volumetric image comprising a plurality of slices, the method comprising: processing a plurality of slices that are adjacent to a target slice of the volumetric image using a reconstruction deep neural network (DNN) trained to reconstruct the target slice from the adjacent slices; and processing the target slice using a segmentation DNN to generate a segmentation of the target slice; wherein the reconstruction DNN shares spatial information of the adjacent slices with the segmentation DNN.
In some embodiments, the reconstruction DNN comprises a convolutional feature extractor for generating first feature data from the adjacent slices, and a reconstruction downsampler for generating first reduced-dimension feature data from the first feature data at one or more scales.
In some embodiments, the reconstruction DNN comprises a reconstruction upsampler for transforming the first reduced-dimension feature data to first upsampled data having the same dimensions as the first feature data.
In some embodiments, the reconstruction DNN comprises one or more dimension reduction layers for applying a dimension reduction mechanism to the first feature data and/or to the first reduced-dimension feature data.
In some embodiments, the dimension reduction mechanism comprises: converting three-dimensional feature data into two-dimensional feature data.
In some embodiments, layers of the reconstruction downsampler are connected to layers of the reconstruction upsampler via respective ones of the dimension reduction layers by concatenation.
In some embodiments, the segmentation DNN comprises a convolutional feature extractor for generating second feature data from the target slice, and a segmentation downsampler for generating second reduced-dimension feature data from the second feature data at one or more scales.
In some embodiments, the segmentation DNN comprises a segmentation upsampler for transforming the second reduced-dimension feature data to second upsampled data having the same dimensions as the second feature data.
In some embodiments, layers of the segmentation downsampler are connected to layers of the segmentation upsampler.
In some embodiments, the reconstruction DNN is configured to share spatial information with the segmentation DNN by element-wise addition of output of layers of the reconstruction upsampler to output of layers of the segmentation upsampler.
In some embodiments, the loss function of the segmentation DNN is the 2D Intersection over Union (IoU) loss function.
In some embodiments, the volumetric image is a 3D medical image.
In some embodiments, the 3D medical image is a 3D optical coherence tomography (OCT) image.
In some embodiments, the 3D OCT image is a retinal image, and wherein the target slice corresponds to a layer of the choroid.
In some embodiments, the method is repeated for a plurality of target slices, and wherein the method further comprises generating a choroidal thickness map from segmentation of the plurality of target slices.
Also disclosed herein is a system for segmentation of a volumetric image comprising a plurality of slices, the system comprising: at least one processor; and computer-readable storage having stored thereon instructions for causing the at least one processor to carry out the disclosed method.
Further disclosed herein is a non-transitory computer-readable storage having instructions stored thereon for causing at least one processor to carry out the disclosed method.
Embodiments will now be described, by way of non-limiting example, with reference to the drawings in which:
The present disclosure relates to a computationally efficient and accurate segmentation approach, robust to interstitial variations, for the segmentation of volumetric medical images. In the present disclosure, a novel multi-task learning architecture capable of fully automated three-dimensional segmentation of volumetric medical image data is proposed. The proposed architecture incorporates both reconstruction and segmentation tasks: simultaneous reconstruction and segmentation extracts intra-slice features that are used directly for segmentation. In particular, the multi-task learning architecture aggregates the spatial context in adjacent cross-sectional slices to reconstruct a central slice, learning the spatial information between the adjacent slices. Soft parameter sharing between the reconstruction and segmentation tasks may be used to channel this spatial information; it aggregates the spatial features more explicitly by directly learning the correlation between the adjacent slices and the slice that will be segmented.
Spatial context learnt by the proposed reconstruction mechanism may be fused using a U-Net-based architecture. In the present disclosure, the proposed U-Net-based architecture is referred to as the Spatial Aggregated Network (SA-Net) due to its aggregation of spatial information. SA-Net learns the spatial information between adjacent cross-sections to reconstruct a selected cross-section. SA-Net is a convolutional neural network based on a fully convolutional network, and its architecture can be modified and extended to work with fewer training images and to yield more precise segmentations. The main idea of the proposed U-Net-based architecture is to supplement a contracting network with successive layers in which pooling operations are replaced by upsampling operators, so that these layers increase the resolution of the output; a successive convolutional layer can then learn to assemble a precise output based on this information. In the proposed SA-Net, there are a large number of feature channels in the upsampling part, which allows the network to propagate context information to higher-resolution layers. It will be appreciated that incorporating spatial information from corresponding adjacent slices enables the proposed SA-Net architecture to explicitly integrate spatial correspondences. In general, the present disclosure does not require the whole volumetric image to be considered, thus avoiding costly computation and extensive memory requirements, and, unlike recurrent networks, the proposed approach is not prone to memory leakage problems.
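Purely by way of illustration, the following is a minimal sketch of how such a two-branch, soft-parameter-sharing architecture might be assembled in TensorFlow 2 (the framework used in the experiments below). All filter counts, the number of scales, the slice-axis reduction and the single fusion point are simplifying assumptions, not the exact configuration of the disclosed SA-Net.

```python
# Minimal SA-Net-style sketch: a 3D reconstruction branch over adjacent
# slices and a 2D segmentation branch over the target slice, fused by
# element-wise addition. All sizes are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, Model

def soft_iou_loss(y_true, y_pred, eps=1e-6):
    # Standard soft IoU loss (assumed form; see the equation given later).
    inter = tf.reduce_sum(y_true * y_pred)
    union = tf.reduce_sum(y_true + y_pred - y_true * y_pred)
    return 1.0 - inter / (union + eps)

def build_sa_net_sketch(h=256, w=128, n_adjacent=4):
    # Reconstruction branch: adjacent slices stacked as a 3D volume.
    adj = layers.Input((n_adjacent, h, w, 1), name="adjacent_slices")
    r = layers.Conv3D(16, 3, padding="same", activation="relu")(adj)
    r = layers.MaxPooling3D((1, 2, 2))(r)            # pool in-plane only
    r = layers.Conv3D(32, 3, padding="same", activation="relu")(r)
    # Collapse the slice axis to obtain 2D spatial features.
    r2d = layers.Lambda(lambda t: tf.reduce_mean(t, axis=1))(r)
    recon = layers.Conv2D(1, 1, name="reconstruction")(
        layers.UpSampling2D(2)(r2d))

    # Segmentation branch: the single target slice.
    tgt = layers.Input((h, w, 1), name="target_slice")
    s = layers.Conv2D(16, 3, padding="same", activation="relu")(tgt)
    s = layers.MaxPooling2D(2)(s)
    s = layers.Conv2D(32, 3, padding="same", activation="relu")(s)
    # Soft parameter sharing: element-wise addition of reconstruction
    # features into the segmentation path at the matching scale.
    s = layers.Add()([s, r2d])
    s = layers.UpSampling2D(2)(s)
    s = layers.Conv2D(16, 3, padding="same", activation="relu")(s)
    seg = layers.Conv2D(1, 1, activation="sigmoid", name="segmentation")(s)
    return Model([adj, tgt], [recon, seg])

model = build_sa_net_sketch()
model.compile(optimizer="adam",
              loss={"reconstruction": "mse", "segmentation": soft_iou_loss})
```

Training such a model against both the central slice (reconstruction target) and its ground-truth mask (segmentation target) realizes the simultaneous multi-task learning described above.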
As shown in
Detailed connections between the segmentation DNN 108 and reconstruction DNN 112 are shown in
In the reconstruction DNN 112, explicit spatial information from the adjacent slices 114 may be extracted, in particular by using a series of 3D convolutions. It will be appreciated that the reconstruction DNN 112 can be divided into downsampling and upsampling parts. In some embodiments, the reconstruction DNN 112 comprises a convolutional feature extractor 202 for generating first feature data from the adjacent slices 114. The adjacent slices 114 are then downsampled and the convolutions are repeated to extract multi-scale representations of spatial context. During the downsampling process, rich spatial information is exploited from the adjacent slices 114 by using 3D convolution and max pooling layers. In one embodiment as shown in
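As a concrete illustration of one such downsampling stage, the block below applies two 3D convolutions followed by in-plane max pooling, keeping the pre-pooling features for the skip (concatenation) connections. The number of convolutions per stage and the in-plane-only pooling are assumptions consistent with the description above.

```python
# One assumed 3D downsampling stage of the reconstruction branch.
from tensorflow.keras import layers

def recon_down_block(x, filters):
    # Two 3D convolutions extract spatial context across adjacent slices.
    x = layers.Conv3D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv3D(filters, 3, padding="same", activation="relu")(x)
    skip = x                                # kept for skip/concatenation
    # Max pooling halves the in-plane resolution; the slice axis is kept.
    x = layers.MaxPooling3D((1, 2, 2))(x)
    return x, skip
```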
After the downsampling stage, convolutional upsampling is performed at different levels to ensure consistent representation of information across scales, and the upsampled features are concatenated with the residuals at the same scale. In some embodiments, the reconstruction DNN 112 comprises a reconstruction upsampler 206 for transforming the first reduced-dimension feature data generated by the reconstruction downsampler 204 into first upsampled data having the same dimensions as the first feature data at one or more scales. After upsampling, a final 2D convolution is performed and the loss between the output and the ground truth (i.e., the I_i slice) is calculated. Embodiments of the present disclosure use the mean squared error to calculate the similarity distance between the predicted output y_pred and the ground truth y_true.
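The mean squared error equation is not reproduced in this excerpt; its standard form, assumed here with N denoting the number of pixels in the slice, is

$$\mathcal{L}_{\mathrm{rec}} = \frac{1}{N} \sum_{j=1}^{N} \bigl( y_{\mathrm{true}}^{(j)} - y_{\mathrm{pred}}^{(j)} \bigr)^2$$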
Other similarity or dissimilarity measures, such as the structural similarity (SSIM) index, may also be used.
In some embodiments, the reconstruction DNN 112 may further comprise one or more dimension reduction layers for applying a dimension reduction mechanism (DRM) 210 to the first feature data generated by the convolutional feature extractor 202. The reconstruction DNN 112 may also comprise one or more dimension reduction layers for applying another DRM 212 to the first reduced-dimension feature data generated by the reconstruction downsampler 204. In the present disclosure, the DRMs 210 and 212 are used for a more efficient representation of the information. In particular, to reduce the number of parameters introduced by the 3D convolution layers, the present disclosure incorporates the DRMs 210 and 212 in the bottleneck block to convert 3D information into two-dimensional (2D) information. In some embodiments, the 2D information generated by the DRMs 212 and 210 is then upsampled using 2D convolution layers in reconstruction upsamplers 206 and 208, respectively.
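The excerpt states only that the DRM converts 3D feature maps into 2D ones. One plausible realization, assumed here purely for illustration, is a 3D convolution whose kernel spans the entire slice axis, collapsing it to a single plane:

```python
# Hypothetical dimension reduction mechanism (DRM) sketch: collapse the
# slice axis of a (batch, depth, H, W, C) tensor into a 2D feature map.
import tensorflow as tf
from tensorflow.keras import layers

def drm(x3d, filters):
    depth = x3d.shape[1]                          # slice (adjacent) axis
    # A 3D convolution spanning the whole slice axis reduces it to 1.
    x = layers.Conv3D(filters, (depth, 1, 1), padding="valid",
                      activation="relu")(x3d)
    # Drop the singleton slice axis, leaving an (H, W, filters) 2D map.
    return layers.Lambda(lambda t: tf.squeeze(t, axis=1))(x)
```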
As shown in
In some embodiments, as illustrated in
In the segmentation DNN 108, explicit spatial information from the target slice 104 may be extracted (see
After the downsampling stage, convolutional upsampling is performed at different levels to ensure consistent representation of information across scales, and the upsampled features are concatenated with the residuals at the same scale. In some embodiments, the segmentation DNN 108 comprises segmentation upsamplers 218 and 220 for transforming the second reduced-dimension feature data into second upsampled data having the same dimensions as the second feature data. In the upsampling part, high-resolution features from the downsampling stage are concatenated with the low-resolution features. At the end of each upsampling block, which consists of one 2D upsampling layer and two 2D convolution layers, the knowledge of the inter-slice features from the reconstruction branch is fused: the high-resolution 2D volumetric features are added element-wise to the 2D intra-slice features, incorporating the inter-correlation between slices (a sketch of one such block is given below).
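The sketch below shows one such upsampling block under the same illustrative assumptions as earlier; recon_feat denotes the reconstruction-branch features at the matching scale and is assumed to already have the same channel count.

```python
# Assumed segmentation upsampling block: one 2D upsampling layer and two
# 2D convolutions, with the encoder skip concatenated and the
# reconstruction-branch features added element-wise at the block's end.
from tensorflow.keras import layers

def seg_up_block(x, skip, recon_feat, filters):
    x = layers.UpSampling2D(2)(x)
    x = layers.Concatenate()([x, skip])     # high-res encoder features
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    # Inter-slice fusion: element-wise addition of reconstruction features.
    return layers.Add()([x, recon_feat])
```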
As shown in
As shown in
In embodiments as illustrated in
The loss function of the segmentation DNN may be the 2D Intersection over Union (IoU) loss function. The upsampling part of the segmentation DNN ends with a one-by-one (1×1) 2D convolution and a sigmoid activation function. The present disclosure uses the 2D IoU loss function to maximize the intersection region between the prediction and the ground truth.
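The IoU loss equation itself is not reproduced in this excerpt. Its standard soft (differentiable) form, assumed here with the sums running over all pixels j of the predicted mask y_pred and ground-truth mask y_true, is

$$\mathcal{L}_{\mathrm{IoU}} = 1 - \frac{\sum_{j} y_{\mathrm{true}}^{(j)}\, y_{\mathrm{pred}}^{(j)}}{\sum_{j} \bigl( y_{\mathrm{true}}^{(j)} + y_{\mathrm{pred}}^{(j)} - y_{\mathrm{true}}^{(j)}\, y_{\mathrm{pred}}^{(j)} \bigr)}$$

which decreases towards zero as the predicted and ground-truth regions overlap more completely.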
In some embodiments, the volumetric image is a 3D medical image. In particular, the proposed SA-Net could potentially be applied for the segmentation and detection of structures in medical imaging modalities that acquire 3D volumetric data, which include but are not limited to Optical Coherence Tomography, Computed Tomography and Magnetic Resonance Imaging. The 3D medical image may be a 3D optical coherence tomography (OCT) image. Said OCT refers to a relatively recent medical imaging approach which enables high resolution depth-resolved imaging of structures below the surface of the retina. This allows visualization of sub-retinal changes which were not observable using fundus photography. The utility of OCT imaging has led to its widespread adoption in many clinical practices and has even replaced fundus photography as the main form of ophthalmic imaging for some practices. In the present disclosure, the 3D OCT image may be a retinal image, and the target slice may correspond to a layer of the choroid.
Also disclosed herein is a system for segmentation of a volumetric image comprising a plurality of slices, comprising at least one processor; and computer-readable storage having stored thereon instructions for causing the at least one processor to carry out the disclosed method.
After pre-processing, embodiments of the present disclosure use a five-fold cross-validation strategy to train and evaluate the proposed model. To avoid training bias and the risk of overfitting, it was ensured that all images from the same eye were placed in the same fold. This avoids a scenario in which the testing and training partitions contain different images from the same eye. The overall experimental result is then obtained by averaging over the validation sets of all folds. The architecture was developed using Python version 3.7.4 and TensorFlow version 2.0. Experiments were conducted on a workstation with an NVIDIA RTX 2080 Ti GPU and 64 GB of RAM.
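One way to implement such eye-grouped folds (an assumption, as the disclosure does not name a library) is scikit-learn's GroupKFold; the scan and eye counts below are hypothetical.

```python
# Eye-grouped five-fold cross-validation sketch using scikit-learn.
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical example: 80 scans from 40 eyes (two scans per eye).
scans = np.arange(80)
eye_ids = np.repeat(np.arange(40), 2)   # scan index -> source eye
gkf = GroupKFold(n_splits=5)
for fold, (tr, va) in enumerate(gkf.split(scans, groups=eye_ids)):
    # Grouping guarantees no eye appears in both partitions.
    assert not set(eye_ids[tr]) & set(eye_ids[va])
    print(f"fold {fold}: {len(tr)} train / {len(va)} validation scans")
```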
The feature extractor 202/204, for example as illustrated in
The proposed SA-Net for volumetric segmentation of the choroid was evaluated. The choroid, the vascular layer of the eye, is of clinical interest as it provides upwards of 60% of the blood supply to the retina. Variations in the choroid have been linked to many ocular conditions, including age-related macular degeneration and diabetic retinopathy. Until recently, OCT imaging of the choroid was challenging: the choroid is obscured by the highly scattering retinal pigment epithelium, and its visibility was highly limited with spectral-domain OCT systems operating in the 800 nm range. However, the adoption of swept-source lasers operating at 1000 nm into OCT systems has provided a window of opportunity for choroidal analysis owing to reduced scattering.
The proposed SA-Net was evaluated on two OCT data sets. The first data set is composed of 40 high-myopia eyes acquired using a commercial swept-source OCT (SS-OCT) system, the DRI OCT Triton (Topcon Corp., Japan), with a 1050 nm wavelength, a scanning speed of 100,000 A-scans/sec and a 7 mm × 7 mm scanning protocol centred at the macula. Each eye volume in the Triton data set contains 256 slices with dimensions 256 × 128. A separate data set was obtained by acquiring scans from nine normal eyes using the PLEX Elite 9000 SS-OCT system (Carl Zeiss Meditec, Jena, Germany) operating at wavelengths between 1040 nm and 1060 nm, with a scanning speed of 100,000 A-scans/sec and a 15 mm × 9 mm scanning protocol. Each eye volume in the PLEX data set contains 834 slices with dimensions 512 × 500. Pre-processing is performed beforehand to limit the field of view of the acquired scans to the macula region and to resize the dimensions to 256 × 128. The network receives the target slice for segmentation together with the adjacent slices as inputs for reconstruction. Slices at the ends of the volume are padded by averaging the target slice with the available adjacent slices, as sketched below.
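The neighbourhood size is not given in this excerpt; the sketch below assumes k neighbours on each side and implements the stated rule of substituting, for neighbours that fall outside the volume, the average of the target slice and its available adjacent slices.

```python
# Assemble the reconstruction input stack for one target slice, padding
# out-of-range neighbours per the rule described above (k is assumed).
import numpy as np

def adjacent_stack(volume, i, k=2):
    # volume: array of shape (n_slices, H, W); i: target slice index.
    n = len(volume)
    available = [volume[j] for j in range(i - k, i + k + 1)
                 if 0 <= j < n and j != i]
    # Boundary filler: average of the target and its available neighbours.
    pad = np.mean([volume[i]] + available, axis=0)
    stack = [volume[j] if 0 <= j < n else pad
             for j in range(i - k, i + k + 1) if j != i]
    return np.stack(stack)                # (2k, H, W) reconstruction input
```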
The segmentation result was evaluated volumetrically by calculating the IoU, Dice score and accuracy over a volume with respect to the ground-truth segmentation. The inter-slice correlation was assessed by measuring the quality of the choroidal thickness map generated from the choroidal segmentation. The method was repeated for a plurality of target slices, and further comprised generating a choroidal thickness map from the segmentation of the plurality of target slices. In particular, the choroidal thickness map was obtained by stacking the choroidal thickness obtained from each slice. The generated map was evaluated by calculating the structural similarity (SSIM) index, which assesses the similarity of the predicted thickness map and the ground-truth thickness map. Given two images x and y with the same dimensions, the SSIM is given by

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$$

where µ_x, µ_y, σ_x², σ_y² and σ_xy are the average of x, the average of y, the variance of x, the variance of y, and the covariance of x and y, respectively, while c_1 = (0.001DR)² and c_2 = (0.003DR)². In the present disclosure, DR, the dynamic range, is given by:
Table 1 shows a comparison of results on the Triton data set between the proposed SA-Net and other segmentation approaches, namely 3D U-Net, BC U-Net and GGPF-Net. The results show that the SA-Net architecture successfully outperformed the other architectures for volumetric segmentation, demonstrating that learning the adjacent spatial features explicitly through reconstruction enables more precise 3D volumetric segmentation.
Table 2 shows the results for the PLEX data set, where the proposed architecture achieved similar results. It is also important to note that the network complexity and computational power needed for the present architecture are much lower than those needed for BC U-Net, resulting in faster training and inference times.
Table 3 shows a detailed comparison between the proposed SA-Net and state-of-the-art networks. Spatial information can provide useful context for volumetric segmentation, and incorporating spatial information from corresponding adjacent slices enables the SA-Net architecture to explicitly integrate spatial correspondences. SA-Net was compared with other recent approaches for segmenting the choroid in volumetric OCT images from two different commercial devices, and it was demonstrated that SA-Net outperformed the other approaches in segmentation accuracy and in the quality of the generated choroidal thickness map, with lower computational requirements. The results show that SA-Net can be used for efficient and accurate segmentation of OCT data, as well as potentially other volumetric medical images.
Also disclosed herein is a non-transitory computer-readable storage having instructions stored thereon for causing at least one processor to carry out the disclosed method.
As shown, the mobile computer device 700 includes the following components in electronic communication via a bus 706: a display 302; non-volatile data storage 704; random access memory (RAM) 708; N processing components 710; and a transceiver component 712.
Although the components depicted in
The display 302 generally operates to provide a presentation of content to a user, and may be realized by any of a variety of displays (e.g., CRT, LCD, HDMI, micro-projector and OLED displays).
In general, the non-volatile data storage 704 (also referred to as non-volatile memory) functions to store (e.g., persistently store) data and executable code. The system architecture may be implemented in memory 704, or by instructions stored in memory 704.
In some embodiments, for example, the non-volatile memory 704 includes bootloader code, modem software, operating system code, file system code, and code to facilitate the implementation of components that are well known to those of ordinary skill in the art and that are not depicted or described, for simplicity.
In many implementations, the non-volatile memory 704 is realized by flash memory (e.g., NAND or ONENAND memory), but it is certainly contemplated that other memory types may be utilized as well. Although it may be possible to execute the code from the non-volatile memory 704, the executable code in the non-volatile memory 704 is typically loaded into RAM 708 and executed by one or more of the N processing components 710.
The N processing components 710, in connection with the RAM 708, generally operate to execute the instructions stored in the non-volatile memory 704. As one of ordinary skill in the art will appreciate, the N processing components 710 may include a video processor, a modem processor, a DSP, a graphics processing unit (GPU), and other processing components.
The transceiver component 712 includes N transceiver chains, which may be used for communicating with external devices via wireless networks. Each of the N transceiver chains may represent a transceiver associated with a particular communication scheme. For example, each transceiver may correspond to protocols that are specific to local area networks, cellular networks (e.g., a CDMA network, a GPRS network, a UMTS network), and other types of communication networks.
It should be recognized that
It will be appreciated that embodiments of the present disclosure provide a novel segmentation architecture that is capable of fully automated three-dimensional segmentation of volumetric medical image data. This architecture encompasses the following key novel aspects. First, soft parameter sharing aggregates the spatial features more explicitly by directly learning the correlation between adjacent slices and the slice that will be segmented. In addition, simultaneous reconstruction and segmentation extracts intra-slice features which are directly used for segmentation. Further, automated generation of a volumetric choroidal representation enables 3D visualization of the choroid. Last but not least, generation of full-field choroidal thickness maps enables en face analysis of thickness variations in the choroid across the retina.
It will be appreciated that many further modifications and permutations of various aspects of the described embodiments are possible. Accordingly, the described aspects are intended to embrace all such alterations, modifications, and variations that fall within the spirit and scope of the appended claims.
Throughout this specification and the claims which follow, unless the context requires otherwise, the word “comprise”, and variations such as “comprises” and “comprising”, will be understood to imply the inclusion of a stated integer or step or group of integers or steps but not the exclusion of any other integer or step or group of integers or steps.
The reference in this specification to any prior publication (or information derived from it), or to any matter which is known, is not, and should not be taken as an acknowledgment or admission or any form of suggestion that that prior publication (or information derived from it) or known matter forms part of the common general knowledge in the field of endeavour to which this specification relates.
Number | Date | Country | Kind |
---|---|---|---
10202008522X | Sep 2020 | SG | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---
PCT/SG2021/050530 | 9/2/2021 | WO |