The invention relates to the field of image synthesis and image processing, and more specifically to a method and a device for generating an augmented (i.e., adjusted) 3D representation of an object, for example an area of a human body.
In the context of interventional procedures, doctors use medical imaging data to perform a diagnostic act, a therapeutic act or a treatment response evaluation. For example, interventional radiology combines a radiological imaging technique (using X-rays) with an invasive procedure for diagnostic and/or therapeutic purposes. The intervention is guided and controlled by the radiological image. This is also the case in the domain of radiotherapy, where radiological planning data are combined with online imaging data to deliver, adapt and guide treatment.
Usually, a first series of images of a patient's body area is taken before the intervention, for example approximately one week before the intervention. This first series of images, called “planning data” or “planning images”, is used by the doctor to prepare the intervention, in particular by visualizing the body area on which he must intervene, by locating the diseased tissues, and by determining the areas and tissues that are to be preserved. During the intervention, the doctor is guided by a second series of images, called “interventional images” or “interventional data”, which represent the patient's body area and are obtained in “near” real time.
Planning images should be as accurate as possible, and are generally high or very high resolution images, possibly 3D images, obtained using computationally expensive image processing methods. Due to the absence of acquisition constraints, these data are usually multi-modal, exploiting the complementarity of information across imaging modalities (computed tomography, magnetic resonance imaging, positron emission tomography, etc.), and are associated with the highest possible level of resolution. Planning images may for example be obtained with a 3D computed tomography scan (or “3D CT scan”) technique. Conversely, interventional images must be obtained quickly and regularly during the intervention, to have real-time information on the area of the body to be treated. For this reason, interventional images are usually lower resolution images, and are generally 2D images, possibly acquired in multiple planes. For example, interventional images may be obtained with a 2D Magnetic Resonance (MR) Imaging technique.
Other types of images may be used. For example, in the context of guided breast or prostate biopsies, the planning images may be CT and/or MR images and the interventional images may be ultrasound images. In the context of cardiac or brain surgeries, the planning images may be CT and/or MR images and the interventional images may be angiography images. In the context of radiotherapy (RT), the planning images may be CT images and the interventional images may be X-rays and/or 2D-MR images. More generally, in the context of radiation therapy, planning data can be 3D CT or MR images and treatment guidance data can be sparse 2D MR acquisitions or low resolution 3D MR sequences.
During the intervention, the physician has to make the link between the interventional images that he receives in real time and the planning images (i.e., to “map” the two series of images), to better adapt the treatment or the diagnosis. Indeed, mapping 3D planning data into the 2D interventional images enables the overlay of each image's specific information and is thus essential for successful treatment. However, performing a mapping of the 2D images obtained during the intervention on the 3D planning images is not always easy, in particular due to the position and movement of the patient during the intervention, but also to the difference in resolution between images. The time interval between the 3D planning sequence and the 2D treatment implementation could lead to important anatomical changes associated either with tissue collapse due to the surgical operation or to normal physiological changes of the human body.
In this context, methods have emerged in recent years to register a second series of images (usually low resolution and/or 2D images obtained during an interventional procedure) with a first series of planning images (usually high resolution and/or 3D images previously obtained). When the planning images represent 3D volumes and the interventional images are 2D frames, these methods are called “Slice-to-Volume (SV) registration”. Image registration consists in aligning and combining data to obtain one unique coordinate system. SV registration consists in determining a slice from a given 3D volume that corresponds to an input 2D image. For example, an SV registration method makes it possible to determine which slice of the 3D volume resulting from the planning data an interventional 2D image corresponds to. When several slices of the 3D volume corresponding to several input 2D images are determined, the method is called “Multi-Slice-to-Volume (MSV) registration”.
Some popular clinical settings in this context of Slice-to-Volume registration refer to ultrasound towards CT/MR (guided breast or prostate biopsies), angiography to CT/MR for cardiac and brain surgeries, or X-rays/2D MR to CT for radiotherapy (RT). Matching radiology and digital pathology is another domain where the same challenges are to be addressed due to the partial correspondences between the recovered specimen and prior 3D imaging acquisition.
In this respect, one has to face three main challenges: (i) anatomical changes and tissue deformation, (ii) sparse, partial-view and low-quality data during treatment implementation due to limitations in acquisition time and in the nature of the signal, and (iii) differences in the nature/modality of the images to be aligned/registered.
Early literature addressing these challenges mostly focuses on solving the registration problem, with the objective of determining a mapping between the interventional data and the planning 3D images by covering mostly plane selection and in-plane deformations. Traditional Deformable Image Registration (DIR) methods iteratively align two volumes through optimization of an energy function. Recently, learning-based methods have become widespread due to their impressive performance and their ability to infer transformation parameters for any input volume pair from a single model, first in a supervised setting, and increasingly without labels.
However, the existing solutions have drawbacks, because they require complex calculations on too many data which make them difficult to use in an interventional context, are not scalable enough with respect to the sparsity and the nature of interventional signal, fail to deal with severe changes in the nature of the images to be registered, require excessive computing resources and computational time that prohibits their use in a real clinical setting or do not give sufficiently accurate results.
The present invention aims to adjust a reference 3D representation of an object (e.g. an area of the human body on which a doctor must intervene) compliant with an imaging modality, called “source imaging modality”, so that the adjusted 3D representation coincides at least partially with 2D interventional images compliant with another imaging modality, called “target imaging modality”. In other words, the 3D representation compliant with the source imaging modality is registered onto the 2D images compliant with the target imaging modality. For example, the source imaging modality may be a computed tomography scan and the reference 3D representation compliant with the source imaging modality may be a planning 3D CT scan. The target imaging modality may be Magnetic Resonance Imaging and the 2D interventional images compliant with the target imaging modality may be interventional 2D MR images. Sections of the adjusted 3D representation are therefore registered with the 2D interventional images, which allows the doctor to have a 3D representation of the area of the human body on which he must intervene which is as faithful as possible to the real area at the time of the intervention.
To achieve this aim, the invention proposes a compositional solution that relies simultaneously on two learning models: one concerns the automatic translation from one imaging modality to another, and the other the elastic registration of 3D representations to the partial 2D interventional images. The translation model creates a bijective model able to transform interventional 2D images compliant with the target imaging modality into similar 2D images compliant with the source imaging modality. These similar 2D images compliant with the source imaging modality can therefore be used to build a 3D sparse representation of the object compliant with the source imaging modality. The 3D reference representation may then be deformed and adjusted to the obtained 3D sparse representation via the registration model. Joint learning of the two learning blocks may advantageously be performed in a completely unsupervised way, and makes it possible to obtain a more accurate adjusted representation of the object than existing methods, because each learning block benefits from the other. In particular, registration is done in a faster and more precise way because the two input representations are compliant with the same imaging modality. The translation block benefits from a backpropagation of the loss function of the registration block, which allows faster convergence and a more efficient overall learning model.
A first aspect of the invention therefore relates to a method implemented by computer for training a machine learning architecture for generating an adjusted 3D representation of an object based on a reference 3D representation of said object compliant with a source imaging modality and a plurality of images of at least part of said object compliant with a target imaging modality, so that 2D sections of the adjusted 3D representation are at least partially registered with images of the plurality of images, the machine learning architecture comprising a first machine learning architecture and a second machine learning architecture. The method may comprise:
By “object”, it is meant any element whose representation is obtained using an imaging technique. For example, the object may be a human body part.
The term “imaging modality” refers to the imaging technique used for obtaining 3D or 2D images. For example, the imaging modality may designate the imaging technology (for example, in the field of medical imaging, an imaging technology among: x-ray imaging, scanner, scintigraphy, positron emission tomography, ultrasound, magnetic resonance imaging, etc.), but also specific characteristics of the imaging technique used (e.g., resolution).
By “image or representation compliant with the imaging modality”, it is meant an image or a representation that has been obtained or synthetized according to the same imaging acquisition principles, or that could have been obtained by using said imaging modality. In other words, the wording also includes images or representations that have been generated by a processing technique and which simulate images or representations obtained by the imaging modality.
By “source imaging modality”, it is meant the imaging modality associated with the 3D representation to register. The wording “target imaging modality” refers to an imaging modality distinct from the source imaging modality, and designates the imaging modality associated with the 2D images onto which the 3D representation shall be registered. The terms “target” and “source” are commonly used in the field of image registration, but they could be replaced by any other denominations.
By “adjusted 3D representation”, it is meant the input 3D representation of the object (i.e. the “reference 3D representation”) once it has been deformed to be registered on a set of images (the “plurality of images”). For example, the reference 3D representation may be a planning 3D CT scan of a body area of a subject and the plurality of images may be a set of interventional MR images, angiography images or cone beam CT images. In such cases, registration is necessary to cope with the deformation of the subject, which may be due for example to breathing, movements, anatomical changes, etc.
The challenges to be met here are on the one hand the fact that the 3D representation must be registered from a plurality of 2D images, and on the other hand the fact that the 3D representation and the set of 2D images do not come from the same imaging modality and the underlying signal differences make the establishment of meaningful anatomical correspondences challenging.
By “2D section of a 3D representation”, it is meant a 2D representation of a section (or a “slice”) of the 3D representation. For example, the 3D representation may comprise a voxel grid along 3 axes (x,y,z). A 3D section of the 3D representation may correspond to the subset of voxels (x,y) when one of the coordinates (z) is fixed. The corresponding 2D section may then be defined as the 2D projection of the 3D section on a plane (x,y). For example, a respective pixel (x,y) may be associated with each voxel (x,y) of the 3D section and its value may be set to the value of the corresponding voxel.
In the context of CT scan, the sections may advantageously be cross sections, but the sections may be taken along any axis (longitudinal, sagittal, or any other arbitrary axis orientation of interest).
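As a purely illustrative sketch of this definition, the following Python code extracts a 2D section from a voxel grid by fixing one coordinate. The function name `extract_section`, the use of NumPy, the 0-based indexing and the restriction to the three grid axes are assumptions made for the example; oblique orientations would require resampling, which is not shown.

```python
import numpy as np


def extract_section(volume, index, axis=2):
    """Return the 2D section of a 3D voxel grid at a fixed coordinate.

    Hypothetical helper: `axis` selects which coordinate is fixed
    (0 for x, 1 for y, 2 for z) and `index` is 0-based.
    """
    if volume.ndim != 3:
        raise ValueError("expected a 3D voxel grid")
    if not 0 <= index < volume.shape[axis]:
        raise IndexError("section index out of range")
    # Fixing one coordinate leaves a 2D grid whose pixels copy the voxel values.
    return np.take(volume, index, axis=axis).copy()


# Toy voxel grid of size (X, Y, Z) = (2, 3, 4).
volume = np.arange(2 * 3 * 4, dtype=np.float32).reshape(2, 3, 4)
cross_section = extract_section(volume, index=1, axis=2)  # fixed z, shape (2, 3)
other_section = extract_section(volume, index=0, axis=0)  # fixed x, shape (3, 4)
```

Each pixel of `cross_section` holds the value of the voxel with the same (x,y) coordinates in the fixed-z section.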
By “sparse 3D representation” it is meant a 3D representation of which some sections correspond to images of the object (non-empty sections), and other sections are empty or filled with predefined values (for example, null values). This sparse 3D representation has the same size (along the 3 axes x, y, z) as the reference 3D representation. The non-empty sections may be filled with corresponding 2D images (e.g., the simulated source images), while the empty sections may be filled with zero values. To fill the non-empty sections, it is possible, for example, to associate to each voxel of a non-empty section a value corresponding to a respective pixel of a 2D image.
Therefore, the sparse 3D representations of the second set of training data are constructed so that sections of said sparse 3D representations are filled according to the pixel values of the simulated source images.
In a practical example, a full 3D CT image (i.e., the reference 3D source representation) is provided as input, because this modality is acquired with 3D techniques so as to obtain a full 3D representation of the organ. Such an image can be obtained for the planning data because time is available: the acquisition does not take place on the day of treatment but a few days before. On the treatment day, on the other hand, there is less time for image acquisition, and the imaging machines in the interventional room are not the same due to size constraints. On the treatment day, it is therefore only possible to acquire a partial view of the organ: either a single 2D image is acquired or, if the organ must be seen at different locations, a small set of 2D images separated by empty space (non-acquired data) is acquired. Once these input 2D images (i.e., the plurality of reference 2D target images) are obtained, the term “sparse 3D representation” simply designates a 3D volume which also takes these empty slices/spaces into account. This “sparse 3D representation” has the same size as the full 3D planning data (i.e., the reference 3D source representation) since the organ has the same size; however, less data is available. After the first machine learning module, the same idea is applied, but the modality has changed.
In other words, a 3D volume is artificially built from a set of 2D images, taking into account the real size of the organ, by inserting empty slices between the real acquisitions/simulations.
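The construction of a sparse 3D representation described above can be sketched as follows. This is a minimal example under stated assumptions: the helper name `build_sparse_volume`, the 0-based section positions `z_indices` and the returned per-section mask are hypothetical and introduced only for illustration.

```python
import numpy as np


def build_sparse_volume(images, z_indices, shape, fill_value=0.0):
    """Place 2D images at given z positions of an otherwise empty 3D grid.

    Hypothetical helper: `shape` is the (X, Y, Z) size of the reference 3D
    representation, so that both volumes have the same size.
    """
    sparse = np.full(shape, fill_value, dtype=np.float32)
    non_empty = np.zeros(shape[2], dtype=bool)  # marks the filled sections
    for image, z in zip(images, z_indices):
        if image.shape != tuple(shape[:2]):
            raise ValueError("image size must match the (X, Y) size of the grid")
        # Each voxel of the section takes the value of the corresponding pixel.
        sparse[:, :, z] = image
        non_empty[z] = True
    return sparse, non_empty


# Two toy 2D images inserted at sections 1 and 4 of a (2, 2, 6) grid.
images = [np.ones((2, 2), dtype=np.float32), 2 * np.ones((2, 2), dtype=np.float32)]
sparse, non_empty = build_sparse_volume(images, z_indices=[1, 4], shape=(2, 2, 6))
```

The sections between the inserted images keep the fill value, which corresponds to the non-acquired data.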
By “deformation field”, also called “displacement field”, it is meant a set of data (voxel-wise 3D vector displacement) characterizing a transformation to be applied to a 3D representation to perform registration. This deformation field may advantageously concatenate two components: a first component corresponding to a linear (rigid, similarity, affine, etc.) transformation, and a second component corresponding to a non-rigid (or “elastic”) transformation.
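To make the two components concrete, the sketch below builds a voxel-wise displacement field as the sum of a linear (here affine) displacement and an elastic displacement. Combining the two components by addition is an assumption made for this example, since the text does not specify how they are combined; the function name `affine_displacement` is likewise hypothetical.

```python
import numpy as np


def affine_displacement(shape, matrix, translation):
    """Voxel-wise displacement induced by the affine map p -> matrix @ p + translation."""
    axes = [np.arange(n, dtype=np.float64) for n in shape]
    # grid[x, y, z] holds the coordinates (x, y, z) of each voxel.
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)
    mapped = grid @ matrix.T + translation
    return mapped - grid  # shape (X, Y, Z, 3)


shape = (2, 3, 4)
# Linear component: identity matrix plus a translation of one voxel along x.
linear = affine_displacement(shape, np.eye(3), np.array([1.0, 0.0, 0.0]))
# Elastic component: here a toy constant shift of half a voxel along z.
elastic = np.zeros(shape + (3,))
elastic[..., 2] = 0.5
# Assumption: the total deformation field is the sum of both components.
field = linear + elastic
```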
The first learned model generates, from one image corresponding to a given imaging modality, a corresponding image that “synthetizes” the image that would have been obtained of the same object by using another imaging acquisition principle than the one that has been used. In other words, the overall appearance (including edges) is the same between the input image and the simulated image, but the simulated image “looks like” an image obtained by the other imaging modality.
The second learned model registers one 3D representation with another 3D representation. In the context of the present invention, this other 3D representation is sparse. Therefore, this registration is performed on non-empty (or non-zero) intersections between the deformed 3D planning representation and the interventional 2D sparse signal in terms of image correspondences and is constrained to be regular and smooth on the whole 3D domain. More specifically, as detailed hereinafter, registration is performed on all sections of the sparse representation (empty and non-empty), but to assess the quality of registration (measurable via the loss function associated with the second ML architecture), only the non-empty sections are considered. In addition, a regularization component of the loss function may be added, to impose a continuous deformation field, and thus a smoothing between the successive sections. Therefore, two neighbor voxels (according to any direction x, y or z) must have “close” displacements applied to them. This makes it possible to respect a certain similarity between the empty sections and the non-empty sections, and to obtain a coherent deformation field, even for the empty voxels.
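The following sketch illustrates one possible form of such a loss: a similarity term evaluated on the non-empty sections only, plus a regularization term penalizing differences between the displacements of neighboring voxels. The mean squared error, the finite-difference penalty, the weight `smooth_weight` and the function name `registration_loss` are all assumptions for the example, not the loss actually used.

```python
import numpy as np


def registration_loss(warped, sparse, non_empty, field, smooth_weight=0.1):
    """Masked similarity plus smoothness of the displacement field (toy version).

    warped, sparse: (X, Y, Z) volumes; non_empty: boolean mask of length Z;
    field: (X, Y, Z, 3) voxel-wise displacements.
    """
    if non_empty.any():
        # Only the non-empty sections contribute to the similarity term.
        diff = (warped - sparse)[:, :, non_empty]
        similarity = float(np.mean(diff ** 2))
    else:
        similarity = 0.0
    # Neighboring voxels along x, y and z should have close displacements.
    smoothness = sum(
        float(np.mean(np.diff(field, axis=axis) ** 2)) for axis in range(3)
    )
    return similarity + smooth_weight * smoothness


sparse = np.zeros((2, 2, 4))
sparse[:, :, 1] = 1.0                       # single non-empty section
non_empty = np.array([False, True, False, False])
warped = np.full((2, 2, 4), 3.0)            # toy deformed reference volume
field = np.zeros((2, 2, 4, 3))              # constant field: no smoothness penalty
loss = registration_loss(warped, sparse, non_empty, field)
```

Because the regularization term is computed over the whole grid, it also constrains the displacements of voxels located in empty sections.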
According to the invention, the machine learning (ML) architecture comprises a first ML architecture for performing the translation from one imaging modality to another, and a second ML architecture for performing the registration between two 3D representations. The first and the second ML architectures are jointly trained. Therefore, each ML architecture benefits from the other. More specifically, the second ML architecture takes advantage of the fact that the input 3D representations correspond to the same imaging modality, which improves the quality of the registration. The first ML architecture takes advantage of the backpropagation of the error associated with the second ML architecture, which improves performance of translation from one imaging modality to another. The two ML architectures may be trained independently, used as pre-trained models and then re-trained jointly, or trained simultaneously according to an end-to-end principle.
In one or several embodiments, the first machine learning architecture may be associated with a first loss and the second machine learning architecture may be associated with a second loss. The joint training of the first machine learning architecture and the second machine learning architecture may comprise a joint minimization of the first loss and the second loss, weighted according to the importance of the two tasks. Advantageously, this end-to-end architecture forces the signal to be shared across the two models and allows everything to be trained at once. It leads to a concurrent/joint training of both tasks, the translation (i.e., first machine learning architecture) and the registration (i.e., second machine learning architecture). This has the advantage of improving performance, as each task benefits the other one. The concurrent nature of the present approach creates mutual benefit for both tasks: image translation is naturally eased by explicit handling of out-of-plane deformations while registration benefits from bringing multimodal signals into the same domain.
In one or several embodiments, the first machine learning architecture and/or the second machine learning architecture may be trained in an unsupervised manner.
In one or several embodiments, each of the 3D training source representations may represent a part of a body of a respective subject among a plurality of subjects. The training target images may comprise partial or full sets of training target images, each set of training target images being associated with a respective subject of the plurality of subjects and representing successive sections of the part of the body of said respective subject.
In other words, for each subject of a plurality of subjects, there are one 3D training source representation and a set of training target images. This representation and these images are used for training the ML architecture according to the above method.
In one or several embodiments, the first machine learning architecture may be adapted to synthetize, from an image compliant with one imaging modality among the source imaging modality and the target imaging modality, a corresponding image compliant with the other imaging modality. The first machine learning architecture may further be trained based on pairs of images, each pair of images comprising:
For example, for each pair of images, the new image may be obtained by:
In one or several embodiments, the first loss may be function of:
In one or several embodiments, the target imaging modality may be magnetic resonance imaging and the source imaging modality may be computed tomography imaging.
In one or several embodiments, the first machine learning architecture may comprise a cycle generative adversarial network, GAN.
In one or several embodiments, the second machine learning architecture may comprise a convolutional neural network, CNN.
Another aspect of the invention relates to a method implemented by computer for generating an adjusted 3D representation of an object based on a reference 3D source representation of said object compliant with a source imaging modality and a plurality of reference target images of at least part of said object compliant with a target imaging modality. The method may comprise:
In one or several embodiments, the reference target images may represent successive sections of at least part of the object. For example, the reference target images may be successive images of a body area along a longitudinal axis.
Yet another aspect of the invention relates to a device comprising a processor configured to carry out any of the above methods. The device may also comprise an input interface to receive data and an output interface to output data.
Yet another aspect of the invention relates to a computer program product comprising instructions which, when the program is executed by a computer, cause the computer to carry out any of the above methods.
The present disclosure further pertains to a non-transitory program storage device, readable by a computer, tangibly embodying a program of instructions executable by the computer to perform a method for generating an adjusted 3D representation or a method for training, compliant with the present disclosure.
Such a non-transitory program storage device can be, without limitation, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor device, or any suitable combination of the foregoing. It is to be appreciated that the following, while providing more specific examples, is merely an illustrative and not exhaustive listing as readily appreciated by one of ordinary skill in the art: a portable computer diskette, a hard disk, a ROM, an EPROM (Erasable Programmable ROM) or a Flash memory, a portable CD-ROM (Compact-Disc ROM).
The present invention is illustrated by way of example, and not by way of limitation, by the figures in which:
Expressions such as “comprise”, “include”, “incorporate”, “contain”, “is” and “have” are to be construed in a non-exclusive manner when interpreting the description and its associated claims, namely construed to allow for other items or components which are not explicitly defined also to be present.
The terms “adapted”, “augmented” and “configured” are used in the present disclosure as broadly encompassing initial configuration, later adaptation or complementation of the present device, or any combination thereof alike, whether effected through material or software means (including firmware).
The term “processor” should not be construed to be restricted to hardware capable of executing software, and refers in a general way to a processing device, which can for example include a computer, a microprocessor, an integrated circuit, or a programmable logic device (PLD). The processor may also encompass one or more Graphics Processing Units (GPU), whether exploited for computer graphics and image processing or other functions. Additionally, the instructions and/or data enabling to perform associated and/or resulting functionalities may be stored on any processor-readable medium such as, e.g., an integrated circuit, a hard disk, a CD (Compact Disc), an optical disc such as a DVD (Digital Versatile Disc), a RAM (Random-Access Memory) or a ROM (Read-Only Memory). Instructions may be notably stored in hardware, software, firmware or in any combination thereof.
Without loss of generality, it is considered in the following that the source imaging modality is Computed Tomography (CT) scan imaging and the target imaging modality is Magnetic Resonance (MR) imaging. The reference 3D representations are therefore 3D CT scan volumes representing a whole area of a human body of a subject and the plurality of images compliant with the target imaging modality are 2D MR images of random/arbitrary views partially representing said area of the human body. By “partially”, it is meant that the MR images represent areas spaced sufficiently far apart that a 3D reconstruction of the subject's body area is not possible from the images as they are. For example, MR images represent 2D cross sections of the area spaced several tens of millimeters apart, e.g. 1 to 2 centimeters. Possibly, the MR images themselves only represent a part of the corresponding cross-section of the associated 3D representation. For example, the 3D CT scan can cover a complete area of the body, e.g. the whole abdomen, and the 2D MR images can represent only part of the abdomen, e.g. the stomach. Of course, the invention extends to other imaging modalities.
In
First, the 3D CT scans 110 may be processed (step 115) to obtain, for each 3D CT scan 110, a corresponding group of 2D CT images 120. For example, for a given 3D CT scan 110 of a subject, the corresponding group of 2D CT images 120 may comprise successive sections (e.g. cross sections) of said 3D CT scan. For example, these sections may be obtained randomly from the 3D CT scan 110. For instance, the 3D CT scan 110 may comprise a grid of voxels with respective dimensions XCT, YCT, ZCT along three directions (x, y, z). To obtain one 2D CT image 120, it is possible to randomly draw an integer k between 1 and ZCT, to obtain the section of voxels corresponding to z=k and to define the 2D CT image as a 2D grid of pixels having the same values as the obtained section of voxels. Other processing techniques are possible. Such techniques are known by the person skilled in the art and are not further detailed. Advantageously, for each subject, the number of 2D CT images 120 extracted from the 3D CT scan 110 is equal to the number of 2D MR images 130.
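The random drawing of sections described for step 115 can be sketched as follows. This is only an illustration: the function name `sample_ct_sections` and the fixed random seed are hypothetical, and the index k is 0-based here (0 to ZCT-1) whereas the text counts from 1 to ZCT.

```python
import numpy as np


def sample_ct_sections(ct_volume, n_sections, rng):
    """Draw `n_sections` random z indices and return the matching 2D sections."""
    z_size = ct_volume.shape[2]
    # Uniform draw of k in [0, z_size); the same k may be drawn more than once.
    ks = rng.integers(0, z_size, size=n_sections).tolist()
    # Each 2D image is a grid of pixels copying the voxels of section z = k.
    sections = [ct_volume[:, :, k].copy() for k in ks]
    return sections, ks


rng = np.random.default_rng(0)        # seed chosen only for reproducibility
ct_volume = rng.random((4, 5, 6))     # toy stand-in for a 3D CT scan
n_mr_images = 3                       # as many CT sections as 2D MR images
sections, ks = sample_ct_sections(ct_volume, n_mr_images, rng)
```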
The training 2D MR images 130 and the obtained 2D CT images 120 are used to train (step 135) a first machine learning (ML) architecture so as to obtain a first learned model adapted to perform a translation from the target imaging modality to the source imaging modality (e.g. MR-to-CT translation), i.e. to generate, from an image compliant with the target imaging modality, a corresponding image compliant with the source imaging modality. The first learned model makes it possible to “transform” a 2D MR image into a corresponding 2D CT scan image, called “2D pseudo-CT image”. In other words, the obtained 2D pseudo-CT image simulates the image one would have obtained using a CT scan imaging technique instead of the MR imaging technique. The training (step 135) of the first ML architecture is further detailed below, with reference to
The first ML architecture outputs a set of 2D pseudo-CT images 140 respectively associated with the set of training 2D MR images 130. These 2D pseudo-CT images 140 are then processed (step 145) to obtain sparse 3D pseudo-CT representations 150.
In other words, at step 145, for each plurality of 2D pseudo-CT images 140 corresponding to the same subject, a corresponding 3D pseudo-CT representation 150 is constructed. The 2D pseudo-CT images 140 correspond to sections (e.g. cross sections) of the constructed 3D pseudo-CT representation 150. The 3D pseudo-CT representation 150 is therefore “sparse”, since volumes between two sections are empty. In embodiments, these empty volumes may be filled by zero values. For example, the 3D pseudo-CT representation 150 may be a regular grid of voxels in 3D space. The values of the voxels of one section of the 3D pseudo-CT representation 150 may be set according to the values of the pixels of the corresponding 2D pseudo-CT image 140. The values of the voxels of the 3D pseudo-CT representation 150 that do not correspond to a 2D pseudo-CT image 140 may be set to zero or to any other reference value.
The obtained 3D pseudo-CT representations 150 are used, together with the original 3D CT scans 110, to train a second ML architecture (step 155) so as to obtain a second learned model adapted to perform a registration between an original 3D CT scan 110 and a corresponding 3D pseudo-CT representation 150, i.e. to determine a transformation (e.g. a deformation field) to apply to the original 3D CT scan 110 to align certain points of interest (e.g. points corresponding to edges of non-empty sections, i.e. sections that do not correspond to empty sections of the associated 3D pseudo-CT representation 150) with the corresponding points of the 3D pseudo-CT representation 150. The training (step 155) of the second ML architecture is further detailed below, with reference to
The transformation determined at step 155 is then applied (step 165) to the original 3D CT scans 110 to obtain transformed 3D CT scans 170 at least partially registered with corresponding 3D pseudo-CT representations 150.
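As an illustration of step 165, the sketch below applies a voxel-wise displacement field to a volume. It assumes a backward (“pull”) convention, in which each output voxel reads the input voxel located at its own position plus the displacement, together with nearest-neighbor sampling and clamping at the borders. These choices and the function name `warp_volume` are assumptions for the example; the text does not specify the interpolation scheme.

```python
import numpy as np


def warp_volume(volume, field):
    """Warp a (X, Y, Z) volume with a (X, Y, Z, 3) displacement field.

    Toy nearest-neighbor version: output[p] = volume[round(p + field[p])],
    with source coordinates clamped to the grid.
    """
    axes = [np.arange(n) for n in volume.shape]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)
    source = np.rint(grid + field).astype(int)
    for axis, size in enumerate(volume.shape):
        source[..., axis] = np.clip(source[..., axis], 0, size - 1)
    return volume[source[..., 0], source[..., 1], source[..., 2]]


volume = np.arange(2 * 2 * 3, dtype=np.float32).reshape(2, 2, 3)
field = np.zeros(volume.shape + (3,))
field[..., 2] = 1.0                   # every voxel reads its neighbor at z + 1
warped = warp_volume(volume, field)
```

In practice a differentiable interpolation (for example trilinear) would typically be preferred so that the registration loss can be backpropagated through the warping step.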
To sum up, training data 120 and 130 used to train the first ML architecture comprise images obtained from a plurality of subjects. More specifically, for each subject, training data comprises 2D MR images 130 and 2D CT scan images 120 (obtained by slicing the 3D CT scan 110 of the subject). From the 2D MR images 130 of one subject, corresponding 2D pseudo-CT images 140 are obtained as outputs of the first ML architecture (step 140). These 2D pseudo-CT images 140 simulate the images that would have been obtained from the subject using CT scan imaging instead of MR imaging. These 2D pseudo-CT images 140 are then used to construct a 3D pseudo-CT representation 150 of which certain sections correspond to the 2D pseudo-CT images 140. This 3D pseudo-CT representation 150 is sparse, since only a few sections are available. Then, the 3D CT scan 110 of the subject is deformed, or “adjusted”, (steps 155 and 165) to be at least partially registered with the sparse 3D pseudo-CT representation 150 obtained for the subject.
It is noted that the first and the second ML architectures are not trained independently, but jointly. In other words, the first ML architecture is associated with a first loss, the second ML architecture is associated with a second loss, and the first loss and the second loss are jointly minimized to train the first and the second ML architectures. Training data comprises pairs of data (3D CT scan 110 and set of 2D MR images 130) obtained from a plurality of subjects. For each subject, a sparse 3D pseudo-CT representation 150 is generated, via the first ML architecture, from the respective set of 2D MR images 130, and this 3D pseudo-CT representation 150 is forwarded to the second ML architecture to warp the original 3D CT scan 110.
In other words, if the first ML architecture (dedicated to translation) is associated with a first loss Ltrans and the second ML architecture (dedicated to registration) is associated with a second loss Lreg, the total loss Ltotal for the whole ML architecture (translation and registration) may be a linear combination of Ltrans and Lreg. For example, Ltotal=Ltrans+Lreg. The first and the second loss are jointly minimized during training of the complete ML architecture, to minimize Ltotal, providing an end-to-end, concurrent (or cooperative) training of the global machine learning architecture.
As detailed above, in one or several embodiments, the original training set comprises a plurality of 3D CT scans 110 and a plurality of training 2D MR images 130.
As represented in
The training device 200 may further comprise a 2D-to-3D post-processing module 245 to obtain, from the obtained 2D pseudo-CT images, corresponding sparse 3D pseudo-CT representations (element 150 of
The training device 200 may further comprise a second ML module 255 comprising the second ML architecture. This second ML module 255 may receive as input training data comprising the sparse 3D pseudo-CT representations (element 150 of
As represented in
For a given 2D MR image IMR of a set of 2D MR images 130 obtained from a subject, the generator GCT may produce a 2D pseudo-CT image ÎCT=GCT(IMR), which may be forwarded into the discriminator DCT along with a “true” 2D CT image ICT randomly sampled from the 3D CT scan associated with the set of 2D MR images 130, i.e. the 3D CT scan of the same subject. A distance between the true 2D CT image ICT and the generated 2D pseudo-CT image ÎCT may be calculated. For example, this distance may be an L1 distance denoted as Ladv(CT).
Similarly, for a given 2D CT image ICT of a set of 2D CT images 120 obtained from a subject, the generator GMR may produce a 2D pseudo-MR image ÎMR=GMR(ICT), which may be forwarded into the discriminator DMR along with a “true” 2D MR image IMR randomly sampled from the set of 2D MR images 130 obtained from the same subject. A distance between the true 2D MR image IMR and the generated 2D pseudo-MR image ÎMR may be calculated. For example, this distance may be an L1 distance denoted as Ladv(MR).
The paths corresponding to the two above paragraphs are represented in solid lines in
To ensure that the generated 2D pseudo-CT image ÎCT is concordant with IMR on a pixel-basis, the generator GMR may be asked to reconstruct IMR from ÎCT by minimizing, for instance, their L1 distance, denoted as Lrec(MR). In other words, the generator GMR may produce a 2D pseudo-MR image ÎMR from the 2D pseudo-CT image ÎCT generated by GCT, and the distance Lrec(MR) between the true 2D MR image IMR and the 2D pseudo-MR image ÎMR may be determined.
Similarly, to ensure that the generated 2D pseudo-MR image ÎMR is concordant with ICT on a pixel-basis, the generator GCT may be asked to reconstruct ICT from ÎMR by minimizing, for instance, their L1 distance, denoted as Lrec(CT). In other words, the generator GCT may produce a 2D pseudo-CT image ĪCT from the 2D pseudo-MR image ÎMR generated by GMR, and the distance Lrec(CT) between the true 2D CT image ICT and the 2D pseudo-CT image ĪCT may be determined.
The paths corresponding to the two above paragraphs are represented in dashed lines in
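By way of illustration only, the two reconstruction distances Lrec(MR) and Lrec(CT) described above may be sketched as follows. The generators are replaced here by hypothetical stand-in functions (simple invertible intensity maps), since the actual generator networks are not detailed at this point; the function names are likewise assumptions.

```python
import numpy as np

def l1_distance(a, b):
    """Mean absolute difference between two images."""
    return float(np.mean(np.abs(a - b)))

def cycle_reconstruction_losses(i_mr, i_ct, g_ct, g_mr):
    """Return (Lrec(MR), Lrec(CT)) for one MR image and one CT image."""
    pseudo_ct = g_ct(i_mr)    # pseudo-CT generated from the true MR image
    pseudo_mr = g_mr(i_ct)    # pseudo-MR generated from the true CT image
    rec_mr = g_mr(pseudo_ct)  # MR image reconstructed from the pseudo-CT
    rec_ct = g_ct(pseudo_mr)  # CT image reconstructed from the pseudo-MR
    return l1_distance(i_mr, rec_mr), l1_distance(i_ct, rec_ct)

# Hypothetical stand-in generators: exact inverses of each other.
g_ct = lambda x: 2.0 * x + 1.0
g_mr = lambda x: (x - 1.0) / 2.0

rng = np.random.default_rng(0)
i_mr = rng.random((8, 8))
i_ct = rng.random((8, 8))
l_rec_mr, l_rec_ct = cycle_reconstruction_losses(i_mr, i_ct, g_ct, g_mr)

# A generator pair that is not mutually inverse yields a non-zero loss.
l_rec_mr_bad, _ = cycle_reconstruction_losses(i_mr, i_ct, g_ct, lambda x: x)
```

With mutually inverse stand-ins both distances are numerically zero, which is the situation the reconstruction losses push the real generators toward.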
In embodiments, the first ML architecture may also take advantage of the structure-consistency Modality Independent Neighbourhood Descriptor (MIND) loss described in the article of Heinrich et al.: “MIND: Modality independent neighbourhood descriptor for multi-modal deformable registration”, Medical Image Analysis, 16(7):1423-1435, October 2012.
This loss, denoted as LMIND assesses structural similarity around each voxel v regardless of the intensity distribution. By comparing IMR and ÎCT=GCT(IMR), it helps in both transferring style between modalities and enforcing tissue consistency. Such loss for generation of 2D pseudo-CT images may be formulated as:
Similarly, a loss LMIND-CT may be applied for generation of 2D pseudo-MR images:
The total loss of the first ML architecture may therefore be:
$$L_{trans} = L_{adv} + \lambda_{rec} L_{rec} + \lambda_{Id} L_{Id} + \lambda_{MIND} L_{MIND}$$
where λrec, λId and λMIND are weights balancing each loss.
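By way of illustration only, the weighted sum defining the translation loss, and its combination with the registration loss into Ltotal, may be written as two small functions. The function names are hypothetical; the default weights are those reported in the experiments section of this description (λrec=λId=10, λMIND=5), and the input loss values in the example are arbitrary.

```python
def translation_loss(l_adv, l_rec, l_id, l_mind,
                     lam_rec=10.0, lam_id=10.0, lam_mind=5.0):
    """Weighted sum of the adversarial, reconstruction, identity and MIND terms."""
    return l_adv + lam_rec * l_rec + lam_id * l_id + lam_mind * l_mind

def total_loss(l_trans, l_reg):
    """Example linear combination used for joint training: Ltotal = Ltrans + Lreg."""
    return l_trans + l_reg

# Arbitrary example values for the individual terms.
l_trans = translation_loss(l_adv=0.5, l_rec=0.02, l_id=0.01, l_mind=0.04)
l_total = total_loss(l_trans, l_reg=0.25)
```

Because a single scalar is minimized, gradients of the registration term can also reach the parameters of the translation architecture during joint training.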
As detailed above, each 2D pseudo-CT image ĪCT is generated by GCT from a respective 2D pseudo-MR image ÎMR which is itself generated by GMR from a respective “true” 2D CT image ICT. Therefore, from the first ML architecture of
These corresponding pairs of MR and/or CT images (reference 310 of
Referring again to
An aim of the second machine learning architecture is to allow a mapping of the 3D CT scan of a subject onto the generated sparse 3D pseudo-CT scan. It is noted that the 3D CT scan and the sparse 3D pseudo-CT scan have the same size (typically, they are represented by respective matrices of pixels having the same size). In the following, the 3D CT scan (element 110 of
The second ML architecture may typically comprise a network that produces, for each voxel of the 3D CT scan, a 3D displacement field (or “deformation field”) Rf,m to be applied to said voxel of the 3D CT scan to match the 3D CT scan with the generated sparse 3D pseudo-CT scan. This 3D displacement field may comprise, for each voxel v, a respective triplet of values, each value representing a respective component in a 3D coordinate system.
The second ML architecture may comprise a Convolutional Neural Network (CNN) trained with pairs of 3D volumes, each pair comprising a 3D CT scan and a corresponding sparse 3D pseudo-CT scan. CNNs are well-known tools for the person skilled in the art to perform registration; therefore, the architecture of the CNN is not further detailed in this specification.
The CNN may output the displacement field, which may comprise two components: an affine transformation matrix 510 and a deformable transformation matrix 520.
The affine transformation matrix 510 represents a linear transformation to apply to the 3D CT scan (for example so that a region of interest of the 3D CT scan is globally at the same “place” as the same region of interest in the sparse 3D pseudo-CT scan). Such linear transformation may be decomposed into three basic transformations: rotation, dilation and translation, which may be represented in 3D by 12 parameters. As represented in
The deformable transformation matrix 520 may represent the non-rigid transformation to apply to each voxel of the 3D CT scan. It may be a matrix of the same size as the 3D CT scan and the sparse 3D pseudo-CT scan, each element of the matrix comprising 3 values to respectively apply to each component of the concerned voxel.
The final 3D displacement field Rf,m to be applied to said voxel of the 3D CT scan is therefore obtained from the affine transformation matrix 510 and the deformable transformation matrix 520.
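By way of illustration only, one possible way of obtaining a dense displacement field from the two components may be sketched as follows. The description does not specify how the affine and deformable components are combined; the sketch assumes, as a simplification, that the displacement induced by the 12-parameter affine map is simply added to the per-voxel deformable offsets. The function name and the `(z, y, x)` coordinate convention are hypothetical.

```python
import numpy as np

def displacement_field(affine_3x4, deformable):
    """Combine an affine map and per-voxel offsets into a (D, H, W, 3) field.

    affine_3x4: the 12 affine parameters as a (3, 4) matrix [A | t].
    deformable: (D, H, W, 3) array of per-voxel offsets.
    Assumption: the two contributions are added.
    """
    depth, height, width = deformable.shape[:3]
    # Voxel coordinates, one (z, y, x) triplet per voxel.
    grid = np.stack(
        np.meshgrid(np.arange(depth), np.arange(height), np.arange(width),
                    indexing="ij"),
        axis=-1,
    ).astype(np.float64)
    linear, translation = affine_3x4[:, :3], affine_3x4[:, 3]
    # Position of each voxel after the affine map, minus its original position.
    affine_displacement = grid @ linear.T + translation - grid
    return affine_displacement + deformable

identity = np.hstack([np.eye(3), np.zeros((3, 1))])
shift = identity.copy()
shift[:, 3] = [1.0, 0.0, -2.0]  # pure translation

no_motion = displacement_field(identity, np.zeros((2, 3, 4, 3)))
field = displacement_field(shift, np.zeros((2, 3, 4, 3)))
```

An identity affine matrix with zero deformable offsets leaves every voxel in place, while a pure translation yields the same triplet of values at every voxel.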
Advantageously, the training of the second ML architecture is fully unsupervised. This training may be performed by maximizing the similarity between f and Rf,m(m).
Since f is sparse (i.e., it contains sections filled with zero values), a classical volumetric loss is not optimal, as it would introduce noise into the gradient of pixels registered onto empty slices by forcing such pixels to be equal to 0. To circumvent this issue, it is possible to mask the pixel-wise loss for empty slices as follows:
wherein fi designates a section (i.e., a slice) of f and ∥.∥2 designates the Euclidean distance. In other words, the loss is computed by computing the mean square error (MSE) only on non-zero sections of f. Other embodiments are possible. For example, the Normalized Cross-Correlation (NCC) loss may be used instead of the MSE.
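By way of illustration only, a mean square error restricted to the non-empty sections of f may be sketched as follows. The exact normalization of the loss is not reproduced here; averaging over all pixels of the non-empty sections, the function name and the `(z, y, x)` axis ordering are assumptions made for this sketch.

```python
import numpy as np

def masked_slice_mse(fixed_sparse, warped):
    """MSE computed only on the non-empty sections of the sparse volume f."""
    # One boolean flag per section along z: True if the section holds data.
    non_empty = np.any(fixed_sparse != 0, axis=(1, 2))
    if not non_empty.any():
        return 0.0
    diff = fixed_sparse[non_empty] - warped[non_empty]
    return float(np.mean(diff ** 2))

# Toy volumes: f has data in sections 0 and 3 only.
f = np.zeros((6, 4, 4))
f[0] = 1.0
f[3] = 2.0
warped_m = np.ones((6, 4, 4))  # stands in for the warped 3D CT scan

masked = masked_slice_mse(f, warped_m)
plain = float(np.mean((f - warped_m) ** 2))  # unmasked volumetric MSE
```

In this example the unmasked loss is dominated by the four empty sections, whereas the masked loss only reflects the mismatch on the two sections that actually contain pseudo-CT data.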
In addition, because the loss is masked for CT slices within the original sparse 3D pseudo-CT scan, there may be additional degrees of freedom for the registration. To bypass this issue, it is possible, in one or several embodiments, to add a 3D regularization loss Lregu3D that penalizes large local variations of the displacement field by minimizing the mean square error of the spatial gradient of Rf,m as proposed in the article of Balakrishnan et al., “VoxelMorph: A Learning Framework for Deformable Medical Image Registration”, IEEE Transactions on Medical Imaging, 38(8):1788-1800, August 2019.
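By way of illustration only, a regularization term penalizing large local variations of the displacement field may be sketched with forward finite differences. The use of `np.diff` as an approximation of the spatial gradient, the averaging over the three axes and the function name are assumptions of this sketch, not details taken from the description or from the cited article.

```python
import numpy as np

def smoothness_loss(field):
    """Mean squared finite-difference gradient of a (D, H, W, 3) field."""
    total = 0.0
    for axis in range(3):  # differences along z, y and x
        grad = np.diff(field, axis=axis)
        total += float(np.mean(grad ** 2))
    return total / 3.0

# A constant field has no local variation.
constant = np.ones((4, 4, 4, 3))
loss_constant = smoothness_loss(constant)

# A field whose first component grows by 1 per section along z.
ramp = np.zeros((4, 4, 4, 3))
ramp[..., 0] = np.arange(4).reshape(4, 1, 1)
loss_ramp = smoothness_loss(ramp)
```

A uniform translation is therefore not penalized, while any displacement that changes from one voxel to the next contributes to the loss.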
Together, Lsim2D and Lregu3D make it possible to obtain a final transformation that is precise and smooth, with a realistic mapping, while enforcing the continuity of inter-slice displacements with respect to the surrounding non-empty slices. The total loss for the second ML architecture may therefore be formulated as follows:
$$L_{reg}(f, m, R) = \lambda_{sim} L_{sim2D}(f, R_{f,m}(m)) + \lambda_{regu} L_{regu3D}(R_{f,m})$$
Once the operational global trained model 260 is obtained (i.e., after the training phase represented in
In one or several embodiments, a 3D CT scan (also referred to as “reference 3D source representation”) of an area of a subject body may be received at step 615. A plurality of 2D MR images (also referred to as “operational 2D target images”) of at least part of the same area of the same subject body may be received at step 625. Also, the operational (global) trained model 260 may be received at step 635. It is noted that steps 615, 625 and 635 may be performed in parallel or successively, in any order.
The 3D CT scan and the plurality of 2D MR images are then used as inputs of the operational trained model to determine, at step 645, a deformed 3D CT scan adjusted to the plurality of 2D MR images.
The device 700 may comprise an input interface for receiving the 3D CT scan 710 of the area of the subject body, the plurality of 2D MR images 720 of at least part of the same area of the same subject body, and the operational trained model 260. The device 700 may also comprise a processing module 710 for determining, by using the 3D CT scan 710 and the plurality of 2D MR images 720 as inputs of the operational trained model 260, a deformed 3D CT scan 720 adjusted to the plurality of 2D MR images.
In these embodiments, the device 800 comprises a computer, this computer comprising a memory 801 to store program instructions loadable into a circuit and adapted to cause a circuit 802 to carry out steps of the methods of
The circuit 802 may be for instance:
The computer may also comprise an input interface 803 for the reception of training data and/or the operational data of a subject and an output interface 804 to provide the trained operational model and/or the deformed 3D CT scan of a subject.
To ease the interaction with the computer, a screen 805 and a keyboard 806 may be provided and connected to the computer circuit 802.
Furthermore, the flow charts represented in
Experiments and Results
The performance of the method for determining an adjusted 3D representation of an object from a plurality of 2D images of at least part of said object has been assessed with pelvis 3D CT scans and 2D MR images. More precisely, two private clinical datasets for patients undergoing radiotherapy (RT) have been used. The first dataset comprises 451 pairs between the planning CT and the 0.35 T TrueFISP MR sequences of treatment delivery. The second dataset involves 217 pairs between the planning CT and the 1.5 T T2 MR sequences. 0.35 T TrueFISP MR and 1.5 T T2 MR correspond to different image qualities (the resolution of 0.35 T TrueFISP MR being lower than the resolution of 1.5 T T2 MR). Such a gap in texture resolution supports the ability of the method to handle a variety of study cases. The ratio between the 3D planning CT and the multi-slice treatment MR in terms of slices was 10:1 (i.e. the MR images correspond on average to 1 slice out of 10 slices of the 3D CT).
Both datasets were preprocessed independently with normalization, resampling and cropping to get 256×256×96 (x, y, z) volumes with an (x, y) resolution of 1 mm2 and a z resolution of 3 mm. For each volume and modality, 8 anatomical structures were manually segmented by internal experts, when applicable: anal canal, bladder, left/right femoral head, rectum, penile bulb, seminal vesicle and prostate. Each dataset was separated into three groups for training (60%), validation (20%) and testing (20%). Loss weights were set as follows to have balanced leverage between each component: λrec=λId=10, λMIND=λsim=5 and λregu=0.1.
Performance of the MR-to-CT translation part of the method has been assessed against two prior art methods, namely conventional CycleGAN and CycleGAN with a MIND compensation in the loss, by using two metrics quantifying the quality of image reconstruction: SSIM (for “Structural SIMilarity”), which assesses the structural degradation of a reconstructed image compared to the original one, and FID (for “Fréchet inception distance”), which measures the distance between distributions of generated and original sets as latent vectors. Also, results were compared between the new method without MIND compensation and with MIND compensation.
The results for the MR-to-CT translation part are provided below:
As shown in the above table, the new method with MIND compensation outperforms the CycleGAN algorithms, with or without MIND. The performance of the new method without MIND compensation is similar to that of CycleGAN with MIND compensation. These results suggest that jointly learning the first (translation) ML architecture and the second (registration) ML architecture improves performance compared with training each architecture independently of the other. In other words, the concurrent registration module improves the translation module by giving feedback on the reconstruction error through backpropagation.
Performance of the registration part of the new method has been assessed against several methods of the prior art (SyN (ANTs) affine, SyN deformable, VoxelMorph (SSIM), VoxelMorph (MIND)), by using two metrics quantifying the quality of image segmentation: Dice score and Hausdorff distance between masks of 2D MR images and masks of the deformed 3D CT representation. The Hausdorff distance is a non-negative real value which represents the highest of all the distances from a point in one set to the closest point in the other set, and is minimized (i.e., close to 0) when registration is successful. Dice score is a real value between 0 and 1 which measures the level of intersection between the two segmentation masks, and is maximized (i.e., close to 1) when registration is successful.
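By way of illustration only, the two metrics may be computed on binary masks as sketched below. The function names are hypothetical, distances are expressed in pixel units (the physical pixel spacing is ignored), and the brute-force Hausdorff computation assumes that both masks are non-empty and small enough for all pairwise distances to fit in memory.

```python
import numpy as np

def dice_score(mask_a, mask_b):
    """Dice overlap between two binary masks (1.0 means identical)."""
    a, b = mask_a.astype(bool), mask_b.astype(bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # two empty masks are treated as identical
    return float(2.0 * np.logical_and(a, b).sum() / denom)

def hausdorff_distance(mask_a, mask_b):
    """Symmetric Hausdorff distance between two non-empty binary masks."""
    pts_a = np.argwhere(mask_a)
    pts_b = np.argwhere(mask_b)
    # Pairwise Euclidean distances between every point of A and of B.
    dists = np.linalg.norm(pts_a[:, None, :] - pts_b[None, :, :], axis=-1)
    return float(max(dists.min(axis=1).max(), dists.min(axis=0).max()))

# Two 3x3 squares shifted by one pixel along the column axis.
mask_a = np.zeros((8, 8), dtype=bool)
mask_b = np.zeros((8, 8), dtype=bool)
mask_a[2:5, 2:5] = True
mask_b[2:5, 3:6] = True

dice = dice_score(mask_a, mask_b)
hausdorff = hausdorff_distance(mask_a, mask_b)
```

For these two squares, six of the nine pixels of each mask overlap and the farthest unmatched pixel lies one pixel away from the other mask.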
Also, results were compared between different embodiments of the registration method: when only affine transformation is considered for the displacement field (denoted “affine” in the second table below), without considering MIND compensation (denoted “no MIND” in the second table below), without the first ML architecture (denoted “no end-to-end” in the second table below)—i.e., when the second ML architecture is trained independently from the first ML architecture. Finally, results were compared between the method with NCC-based second loss and the method with MSE-based second loss.
Results are provided in the second table below:
The end-to-end approach reaches the best performance among all approaches, which confirms that the concurrent approach (i.e. the joint training of the two ML architectures) outperforms independent trainings.
A person skilled in the art will readily appreciate that various parameters disclosed in the description may be modified and that various embodiments disclosed may be combined without departing from the scope of the invention. Of course, the present invention is not limited to the embodiments described above as examples. It extends to other variants. For example, the first and/or the second trained models may be used independently from each other. For instance, the first trained model may be used for performing an image translation from a first imaging modality to another modality, and the second trained model may be used for performing registration between any two 3D volumes. Also, even if the second trained model has been presented above in the context of CT images (i.e., the “source imaging modality”), it can be adapted to perform registration between MR volumes (i.e. the “target imaging modality”). Also, even if the present invention has been more specifically described in the context of medical imaging, it may be applied to any imaging field, for example in geophotography or in astrophotography.
Number | Date | Country | Kind |
---|---|---|---|
22194697.3 | Sep 2022 | EP | regional |