Recently, deep learning has shown promising results in automating the segmentation of various medical images1,2. However, training of these deep learning algorithms requires large sets of training data from expert annotators. As such, using coregistration (spatial alignment) as a way to transfer one annotated mask or landmark across an entire image set is a valuable tool for reducing the number of manual labels required in a purely deep learning setting. Coregistration can also be used to spatially align annotated landmarks or masks from one image onto another and warp images into a common reference frame to ease manual or automated comparison.
Traditional coregistration methods iteratively optimize an objective function on each new pair of images to be coregistered, which is a computationally expensive process and can take hours to complete on a given image volume. Deep learning-based coregistration is capable of calculating the deformation without iteratively optimizing an objective function. When coupled with a graphics processing unit (GPU), this results in a significantly reduced computational cost for computing the registration.
Traditional coregistration methods calculate displacement vector fields across all image pairings through a variety of iterative methods such as elastic-type modeling3, statistical parametric mapping4, and free-form deformation with b-splines5.
Frameworks for using deep convolutional neural networks (CNNs) to perform variants of coregistration on medical imaging are beginning to emerge. The majority of these methods focus on creating deformation fields that minimize the difference between a pair of images. Hu et al., in particular, proposed a weakly supervised method for registering magnetic resonance (MR) images onto intraoperative transrectal ultrasound prostate images6. Their method learns both an affine transformation for global alignment of one image onto another and dense deformation fields (DDFs) of one image onto another. However, the method described in Hu et al. requires anatomical landmark points for training the model, the collection of which is time-consuming and expensive. Balakrishnan et al. proposed a fully unsupervised CNN for coregistration of 3D MRI brain datasets in which the loss function is based purely on the raw image data7. The approach of Balakrishnan et al. only learns the DDF of two images and accounts for affine transformations by feeding the DDF through a spatial transformation layer.
System Overview
The implementation described herein is a novel framework for unsupervised coregistration using CNNs, which is referred to herein as DeformationNet. DeformationNet takes a fully unsupervised approach to image coregistration. Advantageously, DeformationNet also explicitly stabilizes images or transfers contour masks across images. For the architecture of DeformationNet, global alignment is learned via affine deformations in addition to the DDF, and an unsupervised loss function is maintained. The use of an unsupervised loss function obviates the need for explicit human-derived annotations on the data, which is advantageous since acquisition of those annotations is one of the major challenges for supervised and semi-supervised CNNs. DeformationNet is also unique in that, in at least some implementations, it applies an additional spatial transformation layer at the end of each transformation step, which provides the ability to “fine-tune” the previously predicted transformation so that the network might correct previous transformation errors.
Training
One implementation of the training phase of the DeformationNet system is shown in FIG. 1 and includes two training steps:
1. Training a Global Network to learn global image alignment via an affine matrix for warping an inputted target image onto an inputted source image coordinate system (102, 103, and 105); and
2. Training a Local Network to learn a DDF for warping localized features of an inputted target image onto an inputted source image (105 and 106).
In at least some implementations, each pair of source and target images from a medical images database (101) represents two cardiac MR images from the same patient and possibly the same study. These cardiac MR series may include but are not limited to: Delayed Enhancement short axis (SAX) images, Perfusion SAX images, SSFP SAX images, T1/T2/T2* mapping SAX images, etc.
Creating an Affine Transformation Matrix for Mapping Target Image Coordinates onto Source Image Coordinates (102, 103, and 104)
An affine transformation matrix with N or more affine transformation parameters, where N is an integer greater than or equal to 0, is learned via a Global Network (104) wherein the input is a pair of images that includes a source image (103) and a target image (102). The learned affine transformation parameters are defined as those parameters which, when applied to the target image, align the target image with the source image. In at least some implementations, the target image is resized to match the size of the source image before the affine matrix is learned.
In at least some implementations, the Global Network (104) is a regression network. A version of the Global Network (104) includes 32 initial convolutional filters. At least some implementations downsample using strides in the convolutional layers, with 2 convolutional layers of kernel size 3, a batch normalization layer with a momentum rate, a dropout layer, and a ReLU nonlinearity layer before each downsampling operation. In at least some implementations, the last layer of the Global Network (104) is a dense layer mapping to the desired number of affine parameters.
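By way of illustrative example only, a minimal PyTorch sketch of such a regression network is shown below. The class name GlobalNet, the two-channel input (source and target stacked as channels), the number of downsampling stages, and the dropout rate are assumptions introduced for illustration and are not specified by the description above.

    import torch
    import torch.nn as nn

    class GlobalNet(nn.Module):
        """Regresses affine parameters from a stacked (source, target) image pair."""
        def __init__(self, n_params=6, filters=32, n_down=4):
            super().__init__()
            layers, in_ch = [], 2  # source and target concatenated as channels (assumption)
            for _ in range(n_down):
                for _ in range(2):  # 2 convolutional layers of kernel size 3 before each downsample
                    layers += [nn.Conv2d(in_ch, filters, 3, padding=1),
                               nn.BatchNorm2d(filters),  # momentum rate left at the library default
                               nn.Dropout2d(0.2),        # dropout rate is an assumption
                               nn.ReLU()]
                    in_ch = filters
                # downsampling via a strided convolution
                layers.append(nn.Conv2d(filters, filters, 3, stride=2, padding=1))
            self.features = nn.Sequential(*layers)
            self.pool = nn.AdaptiveAvgPool2d(1)
            self.fc = nn.Linear(filters, n_params)  # dense layer mapping to affine parameters

        def forward(self, source, target):
            x = torch.cat([source, target], dim=1)
            return self.fc(self.pool(self.features(x)).flatten(1))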
In at least some implementations, the affine parameter outputs of the Global Network (104) are used as input to an affine spatial transformation layer that is bounded by different scaling factors for rotation, scaling, and zooming. The scaling factors control the amount of affine deformation that can be applied to the target image. In at least some implementations, the affine spatial transformation matrix output by the affine spatial transformation layer is regularized, for example in the form of a bending energy loss function. A gradient energy loss function for regularization of the affine spatial transformation matrix may also be used. This regularization further prevents the learned affine spatial transformation matrix from generating unrealistically large transformations.
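As one plausible realization of such a bounded layer, the sketch below tanh-bounds the raw parameters around the identity transform; the function name apply_bounded_affine and the single shared bound (rather than separate factors per rotation, scaling, and zooming) are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def apply_bounded_affine(target, raw_params, bound=0.1):
        """Warp `target` with an affine matrix bounded around the identity.

        raw_params: (N, 6) unconstrained Global Network outputs.
        bound: maximum deviation of each affine parameter (assumption).
        """
        n = raw_params.shape[0]
        identity = torch.tensor([[1., 0., 0.], [0., 1., 0.]],
                                device=raw_params.device).expand(n, 2, 3)
        theta = identity + bound * torch.tanh(raw_params).view(n, 2, 3)
        grid = F.affine_grid(theta, target.shape, align_corners=False)
        return F.grid_sample(target, grid, align_corners=False), theta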
Creating a DDF for Warping a Transformed Target Image to Match a Source Image (106)
In at least some implementations, a DDF is learned via a Local Network (106) wherein the input is a pair that includes a source image (103) and a target image (102). In some implementations, the target image (102) has first been warped onto the source image coordinates via the affine transformation matrix learned in the Global Network (104), providing a warped target image (105) to be input into the Local Network (106).
In at least some implementations, the Local Network (106) is a neural network architecture that includes a downsampling path followed by an upsampling path. A version of such a Local Network includes 32 initial convolutional filters and skip connections between the corresponding downsampling and upsampling layers. At least some implementations downsample using strides in the convolutional layers, with 2 convolutional layers of kernel size 3, a batch normalization layer with a momentum rate, a dropout layer, and a ReLU nonlinearity layer before each downsampling or upsampling operation. This upsampling allows the DDF to be the same size as the inputted source and target images, provided that padding was used.
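A minimal PyTorch sketch of such an encoder-decoder with skip connections follows; the class name LocalNet, the depth of two downsampling stages, the channel doubling, and the transposed-convolution upsampling are illustrative assumptions rather than disclosed details.

    import torch
    import torch.nn as nn

    class LocalNet(nn.Module):
        """U-Net-style network predicting a 2-channel (dx, dy) DDF."""
        def __init__(self, ch=32):
            super().__init__()
            def block(i, o):
                return nn.Sequential(
                    nn.Conv2d(i, o, 3, padding=1), nn.BatchNorm2d(o),
                    nn.Dropout2d(0.2), nn.ReLU(),
                    nn.Conv2d(o, o, 3, padding=1), nn.BatchNorm2d(o),
                    nn.Dropout2d(0.2), nn.ReLU())
            self.enc1, self.down1 = block(2, ch), nn.Conv2d(ch, ch, 3, stride=2, padding=1)
            self.enc2, self.down2 = block(ch, 2 * ch), nn.Conv2d(2 * ch, 2 * ch, 3, stride=2, padding=1)
            self.bottom = block(2 * ch, 2 * ch)
            self.up2, self.dec2 = nn.ConvTranspose2d(2 * ch, 2 * ch, 2, stride=2), block(4 * ch, 2 * ch)
            self.up1, self.dec1 = nn.ConvTranspose2d(2 * ch, ch, 2, stride=2), block(2 * ch, ch)
            self.out = nn.Conv2d(ch, 2, 3, padding=1)  # dense displacement field

        def forward(self, source, warped_target):
            x = torch.cat([source, warped_target], dim=1)
            e1 = self.enc1(x)
            e2 = self.enc2(self.down1(e1))
            b = self.bottom(self.down2(e2))
            d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # skip connection
            d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
            return self.out(d1)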
In at least some implementations, the learned DDF output of the Local Network (106) goes through a freeform similarity spatial transformation layer. As an example, this freeform similarity spatial transformation layer can include affine transformations, dense freeform deformation field warpings5, or both. If affine transformations are used, they may be scaled to control the amount of deformation that can be applied to the target images. In at least some implementations, the DDF is also regularized, for example in the form of a bending energy loss function5. A gradient energy loss function may also be used to regularize the DDF. This regularization prevents the learned DDF from generating deformations that are unrealistically large.
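For illustration, one plausible dense warping layer and bending energy penalty are sketched below, assuming the DDF is expressed in the normalized [-1, 1] grid coordinates used by PyTorch's grid_sample; the function names warp_with_ddf and bending_energy are hypothetical.

    import torch
    import torch.nn.functional as F

    def warp_with_ddf(image, ddf):
        """Warp `image` by a dense displacement field of shape (N, 2, H, W)."""
        n, _, h, w = image.shape
        ys = torch.linspace(-1, 1, h, device=image.device)
        xs = torch.linspace(-1, 1, w, device=image.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        base = torch.stack([gx, gy], dim=-1).unsqueeze(0)  # identity grid, (1, H, W, 2)
        grid = base + ddf.permute(0, 2, 3, 1)              # add per-pixel displacements
        return F.grid_sample(image, grid, align_corners=False)

    def bending_energy(ddf):
        """Penalize squared second derivatives of the displacement field."""
        d2x = ddf[:, :, :, 2:] - 2 * ddf[:, :, :, 1:-1] + ddf[:, :, :, :-2]
        d2y = ddf[:, :, 2:, :] - 2 * ddf[:, :, 1:-1, :] + ddf[:, :, :-2, :]
        dxy = (ddf[:, :, 1:, 1:] - ddf[:, :, 1:, :-1]
               - ddf[:, :, :-1, 1:] + ddf[:, :, :-1, :-1])
        return d2x.pow(2).mean() + d2y.pow(2).mean() + 2 * dxy.pow(2).mean()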
In at least some implementations, the CNN models may be updated via backpropagation with an Adam optimizer and a mutual information loss function between the source image and the target image that has been warped by the DDF (i.e., warped target image 105). The Adam optimizer adapts its learning rate during training using both the first and second moments of the backpropagated gradients. Other non-limiting examples of optimizers that may be used include stochastic gradient descent, minibatch gradient descent, Adagrad, and root mean square propagation (RMSProp). Other non-limiting examples of loss functions include root mean squared error, L2 loss, L2 loss with center weighting, and cross-correlation loss7 between the source image and the target image after the DDF has been applied. These loss functions depend only on the raw input data and what DeformationNet learns from that raw data.
Advantageously, the absence of any dependence on explicit hand-annotations allows for this system to be fully unsupervised.
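For illustration, a single unsupervised training step might look like the following sketch, which reuses the hypothetical GlobalNet, LocalNet, apply_bounded_affine, warp_with_ddf, and bending_energy helpers from the sketches above, and which substitutes a normalized cross-correlation loss (one of the alternatives listed above) for mutual information, since mutual information is more involved to implement differentiably. The learning rate, the regularization weight, and the data loader are assumptions.

    import torch

    def ncc_loss(a, b, eps=1e-8):
        """1 minus global normalized cross-correlation between two image batches."""
        a = a - a.mean(dim=(2, 3), keepdim=True)
        b = b - b.mean(dim=(2, 3), keepdim=True)
        num = (a * b).sum(dim=(2, 3))
        den = torch.sqrt((a * a).sum(dim=(2, 3)) * (b * b).sum(dim=(2, 3)) + eps)
        return 1.0 - (num / den).mean()

    global_net, local_net = GlobalNet(), LocalNet()
    optimizer = torch.optim.Adam(
        list(global_net.parameters()) + list(local_net.parameters()), lr=1e-4)

    for source, target in loader:  # `loader` is assumed to yield image pairs from the database (101)
        raw = global_net(source, target)                          # affine parameters
        warped_target, theta = apply_bounded_affine(target, raw)  # warped target image (105)
        ddf = local_net(source, warped_target)                    # dense displacement field
        moved = warp_with_ddf(warped_target, ddf)
        loss = ncc_loss(source, moved) + 0.5 * bending_energy(ddf)  # weight 0.5 is an assumption
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Note that nothing in this loop consumes an annotation; the loss is computed from the images alone, which is what permits fully unsupervised training.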
Storing Weights of Trained Networks (108)
Weights of the trained Global Network (104) and Local Network (106) can be stored in storage devices including hard disks and solid state drives to be used later for image stabilization or segmentation mask transferring.
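As a brief sketch continuing the hypothetical networks above, the trained weights could be persisted and restored as follows; the file name and dictionary keys are illustrative.

    import torch

    # Persist both networks' weights (108) for later inference.
    torch.save({"global": global_net.state_dict(),
                "local": local_net.state_dict()}, "deformationnet_weights.pt")

    # Later, restore the weights for image stabilization or mask transfer.
    state = torch.load("deformationnet_weights.pt")
    global_net.load_state_dict(state["global"])
    local_net.load_state_dict(state["local"])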
Image pairings that may be used for image stabilization inference include but are not limited to: images from the same slice of a cardiac MR image volume but captured at different time points; images from the same time point of a cardiac MR image volume but different slices; any two images from the same MR image volume; images from distinct MR image volumes; images from other medical imaging that involves a time series, such as breast, liver, or prostate DCE-MRI (dynamic contrast-enhanced MRI); or images from fluoroscopy imaging.
Overview of Inference Steps
Segmentation Mask Selection
Implementations of attaining the segmentation masks (304) shown in FIG. 3 are described below with reference to FIG. 4.
The CNN (402) may not be able to accurately predict segmentations for every image, so it may be important to choose images with good quality segmentation masks as the target image for step (303).
The general actions of one possible heuristic implementation described above are illustrated in the following example pseudocode:
1. for each image in the set of 2D images:
2.     compute a quality score for the image's predicted segmentation mask
3. select the image whose mask has the highest quality score
In at least some implementations, the image with the segmentation probability mask corresponding to the highest quality score across the group of 2D images will be treated as the single target image (407) and some or all of the other images will be treated as source images to which the target image's segmentation mask (304) will be warped.
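A minimal inference sketch for transferring the selected target mask onto a source image, reusing the hypothetical helpers defined above, might look as follows; nearest-neighbor sampling is assumed so that warping preserves discrete mask labels.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def transfer_mask(source, target, mask, global_net, local_net):
        """Warp the target image's segmentation mask (304) onto a source image."""
        raw = global_net(source, target)
        warped_target, theta = apply_bounded_affine(target, raw)
        ddf = local_net(source, warped_target)

        # Apply the same affine transform to the mask, with nearest-neighbor
        # sampling so that label values are preserved.
        grid = F.affine_grid(theta, mask.shape, align_corners=False)
        mask = F.grid_sample(mask.float(), grid, mode="nearest", align_corners=False)

        # Then apply the dense displacement field.
        n, _, h, w = mask.shape
        ys = torch.linspace(-1, 1, h, device=mask.device)
        xs = torch.linspace(-1, 1, w, device=mask.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        base = torch.stack([gx, gy], dim=-1).unsqueeze(0)
        return F.grid_sample(mask, base + ddf.permute(0, 2, 3, 1),
                             mode="nearest", align_corners=False)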
The processor-based device 604 may include one or more processors 606, a system memory 608 and a system bus 610 that couples various system components including the system memory 608 to the processor(s) 606. The processor-based device 604 will at times be referred to in the singular herein, but this is not intended to limit the implementations to a single system, since in certain implementations there will be more than one system or other networked computing device involved. Non-limiting examples of commercially available systems include ARM processors from a variety of manufacturers, Core microprocessors from Intel Corporation, U.S.A., PowerPC microprocessors from IBM, Sparc microprocessors from Sun Microsystems, Inc., PA-RISC series microprocessors from Hewlett-Packard Company, and 68xxx series microprocessors from Motorola Corporation.
The processor(s) 606 may be any logic processing unit, such as one or more central processing units (CPUs), microprocessors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), etc. Unless described otherwise, the construction and operation of the various blocks shown in FIG. 6 are of conventional design and need not be described in further detail, as they will be understood by those skilled in the relevant art.
The system bus 610 can employ any known bus structures or architectures, including a memory bus with memory controller, a peripheral bus, and a local bus. The system memory 608 includes read-only memory ("ROM") 612 and random access memory ("RAM") 614. A basic input/output system ("BIOS") 616, which can form part of the ROM 612, contains basic routines that help transfer information between elements within the processor-based device 604, such as during start-up. Some implementations may employ separate buses for data, instructions and power.
The processor-based device 604 may also include one or more solid state memories, for instance Flash memory or solid state drive (SSD) 618, which provides nonvolatile storage of computer-readable instructions, data structures, program modules and other data for the processor-based device 604. Although not depicted, the processor-based device 604 can employ other nontransitory computer- or processor-readable media, for example a hard disk drive, an optical disk drive, or memory card media drive.
Program modules can be stored in the system memory 608, such as an operating system 630, one or more application programs 632, other programs or modules 634, drivers 636 and program data 638.
The application programs 632 may, for example, include panning/scrolling 632a. Such panning/scrolling logic may include, but is not limited to, logic that determines when and/or where a pointer (e.g., finger, stylus, cursor) enters a user interface element that includes a region having a central portion and at least one margin. Such panning/scrolling logic may also include, but is not limited to, logic that determines a direction and a rate at which at least one element of the user interface element should appear to move, and causes updating of a display to cause the at least one element to appear to move in the determined direction at the determined rate. The panning/scrolling logic 632a may, for example, be stored as one or more executable instructions. The panning/scrolling logic 632a may include processor and/or machine executable logic or instructions to generate user interface objects using data that characterizes movement of a pointer, for example data from a touch-sensitive display or from a computer mouse or trackball, or other user interface device.
The system memory 608 may also include communications programs 640, for example a server and/or a Web client or browser for permitting the processor-based device 604 to access and exchange data with other systems such as user computing systems, Web sites on the Internet, corporate intranets, or other networks as described below. The communications programs 640 in the depicted implementation are markup language based, such as Hypertext Markup Language (HTML), Extensible Markup Language (XML) or Wireless Markup Language (WML), and operate with markup languages that use syntactically delimited characters added to the data of a document to represent the structure of the document. A number of servers and/or Web clients or browsers are commercially available, such as those from Mozilla Corporation of California and Microsoft of Washington.
While shown in FIG. 6 as being stored in the system memory 608, the operating system 630, application programs 632, other programs/modules 634, drivers 636 and program data 638 can be stored elsewhere, for example on other nontransitory computer- or processor-readable media.
A user can enter commands and information via a pointer, for example through input devices such as a touch screen 648 via a finger 644a, stylus 644b, or via a computer mouse or trackball 644c which controls a cursor. Other input devices can include a microphone, joystick, game pad, tablet, scanner, biometric scanning device, etc. These and other input devices (i.e., "I/O devices") are connected to the processor(s) 606 through an interface 646 such as a touch-screen controller and/or a universal serial bus ("USB") interface that couples user input to the system bus 610, although other interfaces such as a parallel port, a game port, a wireless interface or a serial port may be used. The touch screen 648 can be coupled to the system bus 610 via a video interface 650, such as a video adapter, to receive image data or image information for display via the touch screen 648. Although not shown, the processor-based device 604 can include other output devices, such as speakers, vibrator, haptic actuator, etc.
The processor-based device 604 may operate in a networked environment using one or more logical connections to communicate with one or more remote computers, servers and/or devices via one or more communications channels, for example, one or more networks 614a, 614b. These logical connections may facilitate any known method of permitting computers to communicate, such as through one or more LANs and/or WANs, such as the Internet, and/or cellular communications networks. Such networking environments are well known in wired and wireless enterprise-wide computer networks, intranets, extranets, the Internet, and other types of communication networks including telecommunications networks, cellular networks, paging networks, and other mobile networks.
When used in a networking environment, the processor-based device 604 may include one or more wired or wireless communications interfaces 614a, 614b (e.g., cellular radios, WI-FI radios, Bluetooth radios) for establishing communications over the network, for instance the Internet 614a or cellular network.
In a networked environment, program modules, application programs, or data, or portions thereof, can be stored in a server computing system (not shown). Those skilled in the relevant art will recognize that the network connections shown in FIG. 6 are only some examples of ways of establishing communications between computers, and other connections may be used, including wireless connections.
For convenience, the processor(s) 606, system memory 608, and network and communications interfaces 614a, 614b are illustrated as communicably coupled to each other via the system bus 610, thereby providing connectivity between the above-described components. In alternative implementations of the processor-based device 604, the above-described components may be communicably coupled in a different manner than illustrated in FIG. 6.
The various implementations described above can be combined to provide further implementations. To the extent that they are not inconsistent with the specific teachings and definitions herein, all of the U.S. patents, U.S. patent application publications, U.S. patent applications, foreign patents, foreign patent applications and non-patent publications referred to in this specification and/or listed in the Application Data Sheet, including but not limited to U.S. Provisional Patent Application No. 61/571,908 filed Jul. 7, 2011; U.S. Pat. No. 9,513,357 issued Dec. 6, 2016; U.S. patent application Ser. No. 15/363,683 filed Nov. 29, 2016; U.S. Provisional Patent Application No. 61/928,702 filed Jan. 17, 2014; U.S. patent application Ser. No. 15/112,130 filed Jul. 15, 2016; U.S. Provisional Patent Application No. 62/260,565 filed Nov. 20, 2015; 62/415,203 filed Oct. 31, 2016; U.S. Provisional Patent Application No. 62/415,666 filed Nov. 1, 2016; U.S. Provisional Patent Application No. 62/451,482 filed Jan. 27, 2017; U.S. Provisional Patent Application No. 62/501,613 filed May 4, 2017; U.S. Provisional Patent Application No. 62/512,610 filed May 30, 2017; U.S. patent application Ser. No. 15/879,732 filed Jan. 25, 2018; U.S. patent application Ser. No. 15/879,742 filed Jan. 25, 2018; U.S. Provisional Patent Application No. 62/589,825 filed Nov. 22, 2017; U.S. Provisional Patent Application No. 62/589,805 filed Nov. 22, 2017; U.S. Provisional Patent Application No. 62/589,772 filed Nov. 22, 2017; U.S. Provisional Patent Application No. 62/589,872 filed Nov. 22, 2017; U.S. Provisional Patent Application No. 62/589,876 filed Nov. 22, 2017; U.S. Provisional Patent Application No. 62/589,766 filed Nov. 22, 2017; U.S. Provisional Patent Application No. 62/589,833 filed Nov. 22, 2017; U.S. Provisional Patent Application No. 62/589,838 filed Nov. 22, 2017; PCT Application No. PCT/US2018/015222 filed Jan. 25, 2018; PCT Application No. PCT/US2018/030963 filed May 3, 2018; U.S. patent application Ser. No. 15/779,445 filed May 25, 2018; U.S. patent application Ser. No. 15/779,447 filed May 25, 2018; U.S. patent application Ser. No. 15/779,448 filed May 25, 2018; PCT Application No. PCT/US2018/035192 filed May 30, 2018 and U.S. Provisional Patent Application No. 62/683,461 filed Jun. 11, 2018 are incorporated herein by reference, in their entirety. Aspects of the implementations can be modified, if necessary, to employ systems, circuits and concepts of the various patents, applications and publications to provide yet further implementations.
This application claims the benefit of priority to U.S. Provisional Application No. 62/722,663, filed Aug. 24, 2018, which application is hereby incorporated by reference in its entirety.
These and other changes can be made to the embodiments in light of the above-detailed description. In general, in the following claims, the terms used should not be construed to limit the claims to the specific embodiments disclosed in the specification and the claims, but should be construed to include all possible embodiments along with the full scope of equivalents to which such claims are entitled. Accordingly, the claims are not limited by the disclosure.
1. Norman, B., Pedoia, V. & Majumdar, S. Use of 2D U-Net Convolutional Neural Networks for Automated Cartilage and Meniscus Segmentation of Knee MR Imaging Data to Determine Relaxometry and Morphometry. Radiology 288, 177-185 (2018).
2. Lieman-Sifry, J., Le, M., Lau, F., Sall, S. & Golden, D. FastVentricle: Cardiac Segmentation with ENet. in Functional Imaging and Modelling of the Heart 127-138 (Springer International Publishing, 2017).
3. Shen, D. & Davatzikos, C. HAMMER: hierarchical attribute matching mechanism for elastic registration. IEEE Trans. Med. Imaging 21, 1421-1439 (2002).
4. Ashburner, J. & Friston, K. J. Voxel-Based Morphometry—The Methods. Neuroimage 11, 805-821 (2000).
5. Rueckert, D. et al. Nonrigid registration using free-form deformations: application to breast MR images. IEEE Trans. Med. Imaging 18, 712-721 (1999).
6. Hu, Y. et al. Label-driven weakly-supervised learning for multimodal deformable image registration. in 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018) 1070-1074 (2018).
7. Balakrishnan, G., Zhao, A., Sabuncu, M. R., Guttag, J. & Dalca, A. V. An Unsupervised Learning Model for Deformable Medical Image Registration. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 9252-9260 (2018).
8. Lin, C.-H. & Lucey, S. Inverse Compositional Spatial Transformer Networks. arXiv [cs.CV] (2016).
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/US2019/047552 | 8/21/2019 | WO | 00
Number | Date | Country
---|---|---
62722663 | Aug 2018 | US