This application claims priority from Japanese Application No. 2023-052392, filed on Mar. 28, 2023, the entire disclosure of which is incorporated herein by reference.
The present disclosure relates to an information processing apparatus, an information processing method, and an information processing program.
In the related art, a tubular structure, such as a large intestine or a bronchus, of a patient is observed or treated by using an endoscope. An endoscopic image is an image in which a color and texture inside the tubular structure are clearly expressed by an imaging element, such as a charge coupled device (CCD), and represents the inside of the tubular structure as a two-dimensional image. Therefore, it is difficult to understand which position inside the tubular structure is represented by the endoscopic image. In particular, since the bronchus has a small diameter and a narrow visual field, it is difficult to reach a target position with a distal end of the endoscope.
Therefore, various methods have been proposed, which navigate a route to the target point in the tubular structure by using a virtual endoscopic image generated based on a three-dimensional image acquired by tomography using modalities, such as a computed tomography (CT) apparatus and a magnetic resonance imaging (MRI) apparatus. For example, JP5718537B proposes a method of acquiring route information indicating a route of a tubular structure based on a three-dimensional image, generating a large number of virtual endoscopic images along the route based on the three-dimensional image, and performing matching between the virtual endoscopic image and a real endoscopic image, to specify a distal end position of an endoscope.
In addition, for example, Jake Sganga, David Eng, Chauncey Graetzel, David B. Camarillo. “Offsetnet: Deep learning for localization in the lung using rendered images.” In Proceedings of IEEE International Conference on Robotics and Automation (ICRA), pp. 5046-5052, 2019. proposes a method of identifying a real viewpoint, that is, a position of an endoscope, by estimating a viewpoint difference between a real endoscopic image and a virtual endoscopic image by using a learning model. The learning model in Jake Sganga, David Eng, Chauncey Graetzel, David B. Camarillo. “Offsetnet: Deep learning for localization in the lung using rendered images.” In Proceedings of IEEE International Conference on Robotics and Automation (ICRA), pp. 5046-5052, 2019. is trained using, as training data, a combination of the real viewpoints obtained from the real endoscopic image and an electromagnetic sensor, a combination of the virtual endoscopic image and the viewpoint thereof, and the viewpoint difference thereof. In addition, it is also described that, in order to reinforce the training data, a combination of the viewpoints specified based on an image obtained by transforming the virtual endoscopic image into a real endoscopic image style and the three-dimensional image is used as the training data, instead of the combination of the real viewpoints obtained by the real endoscopic image and the electromagnetic sensor.
In addition, as another method of navigating the route to the target point in the tubular structure, for example, Mehmet Turan, Yasin Almalioglu, Helder Araujo, Ender Konukoglu, Metin Sitti. “Deep EndoVO: A recurrent convolutional neural network (RCNN) based visual odometry approach for endoscopic capsule robots.” In Proceedings of IEEE Computer Vision and Pattern Recognition (CVPR), 2017. proposes a method of estimating a viewpoint difference between frames of a moving image acquired from an endoscope capsule, by using an estimated depth.
In addition, for example, Faisal Mahmood, Richard Chen, Nicholas J. Durr. “Unsupervised Reverse Domain Adaptation for Synthetic Medical Images via Adversarial Training.” In Proceedings of IEEE Transactions on Medical Imaging (Volume: 37, Issue: 12), 2018. describes that a large intestine endoscopic image is transformed into a virtual endoscopic style image by using a generative adversarial network (GAN), and the depth is estimated based on the virtual endoscopic image. In Faisal Mahmood, Richard Chen, Nicholas J. Durr. “Unsupervised Reverse Domain Adaptation for Synthetic Medical Images via Adversarial Training.” In Proceedings of IEEE Transactions on Medical Imaging (Volume: 37, Issue: 12), 2018., a constraint is applied so that an image change amount in a pixel unit is reduced before and after the transformation.
In addition, for example, Mali Shen, Yun Gu, Ning Liu, Guang-Zhong Yang. “Context-Aware Depth and Pose Estimation for Bronchoscopic Navigation.” In Proceedings of IEEE Robotics and Automation Letters (Volume: 4, Issue: 2), 2019. describes that a real bronchoscope image is transformed into a depth image by using a GAN, and a depth is estimated. In Mali Shen, Yun Gu, Ning Liu, Guang-Zhong Yang. “Context-Aware Depth and Pose Estimation for Bronchoscopic Navigation.” In Proceedings of IEEE Robotics and Automation Letters (Volume: 4, Issue: 2), 2019., the transformed depth image is transformed into a bronchoscope image again, and a constraint is applied so that the bronchoscope image is matched with the original real bronchoscope image, thereby preserving a bronchus structure in the image. In addition, Mali Shen, Yun Gu, Ning Liu, Guang-Zhong Yang. “Context-Aware Depth and Pose Estimation for Bronchoscopic Navigation.” In Proceedings of IEEE Robotics and Automation Letters (Volume: 4, Issue: 2), 2019. describes that ground truth data of the depth image is automatically extracted from CT.
Meanwhile, the real endoscopic image may include noise that is not included in the virtual endoscopic image. For example, since the endoscope is inserted into the tubular structure, body fluids or the like may adhere to a lens of the endoscope distal end and the lens may be fogged. In addition, for example, an object that cannot be captured by tomography may appear in the real endoscopic image. In addition, for example, fine texture such as gloss and a blood vessel that is generated on an interior wall of the real tubular structure may be omitted in the virtual endoscopic image generated based on the three-dimensional image.
In a case in which noise is included in the real endoscopic image, the viewpoint difference between the real endoscopic image and the virtual endoscopic image cannot be estimated, and a real position of the endoscope cannot be specified. That is, there may be a case in which the route to the target point in the tubular structure cannot be accurately navigated.
The present disclosure provides an information processing apparatus, an information processing method, and an information processing program capable of accurately estimating a viewpoint difference between a real image and a virtual image of an endoscope.
A first aspect of the present disclosure relates to an information processing apparatus comprising: at least one processor, in which the processor acquires an input image including any one of a real image, which is captured by an endoscope inserted into a tubular structure of a subject and which represents an interior wall of the tubular structure, or a virtual image, which is generated based on a three-dimensional image of the subject and which represents, in a pseudo manner, the interior wall of the tubular structure as viewed from a virtual viewpoint predetermined in the three-dimensional image, uses a transformation model, which is trained to transform any one of the input real image or virtual image into an image style of the other of the input real image or virtual image, to transform the input image into a transformed image having an image style that is not included in the input image, acquires an input depth image, which represents a distance for each pixel from a viewpoint of the input image to the interior wall of the tubular structure, and a transformed depth image, which represents a distance for each pixel from a viewpoint of the transformed image to the interior wall of the tubular structure, and trains the transformation model by using a loss function including a degree of similarity between the input depth image and the transformed depth image.
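As a non-limiting illustration of the loss function described above, the following Python sketch treats the mean absolute difference for each pixel between the input depth image and the transformed depth image as the degree of similarity, and adds it, with a weight, to another loss term. The use of the mean absolute difference, the weight `lam`, the presence of an adversarial loss term, and the function names are assumptions made for illustration and are not specified in the present disclosure.

```python
def depth_similarity_loss(input_depth, transformed_depth):
    """Mean absolute per-pixel difference between two depth images.

    Both images are nested lists (rows of pixel values) of the same shape.
    A value of 0.0 means the two depth images are identical.
    """
    if len(input_depth) != len(transformed_depth):
        raise ValueError("depth images must have the same number of rows")
    total = 0.0
    count = 0
    for row_in, row_tr in zip(input_depth, transformed_depth):
        if len(row_in) != len(row_tr):
            raise ValueError("depth image rows must have the same length")
        for d_in, d_tr in zip(row_in, row_tr):
            total += abs(d_in - d_tr)
            count += 1
    if count == 0:
        raise ValueError("depth images must not be empty")
    return total / count


def total_loss(adversarial_loss, input_depth, transformed_depth, lam=1.0):
    """Assumed combination: another loss term plus a weighted depth term."""
    return adversarial_loss + lam * depth_similarity_loss(
        input_depth, transformed_depth
    )
```

In this sketch, a transformation that changes only the image style and leaves the structure of the tubular structure unchanged yields two nearly identical depth images, so the depth term stays small.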
In the first aspect, the transformation model may be obtained by a generative adversarial network that has been trained by using training data including a real image for training, which is captured by the endoscope inserted into the tubular structure of the subject and which represents the interior wall of the tubular structure, and a virtual image for training, which is generated based on a three-dimensional image of a subject that is the same as or different from the subject of the real image for training and which represents, in a pseudo manner, the interior wall of the tubular structure as viewed from the virtual viewpoint predetermined in the three-dimensional image.
In the first aspect, the input image may include the real image, the transformation model may be a learning model trained to receive input of the real image, transform the input real image into a virtual image style, and output the transformed real image, and the processor may transform the real image into a transformed image having the virtual image style by using the transformation model.
In the first aspect, the input image may include the virtual image, the transformation model may be a learning model trained to receive input of the virtual image, transform the input virtual image into a real image style, and output the transformed virtual image, and the processor may transform the virtual image into a transformed image having the real image style by using the transformation model.
In the first aspect, the processor may use a depth image generation model that has been trained in advance to receive input of an image, which represents the interior wall of the tubular structure, and output a depth image, which represents a distance for each pixel from the viewpoint of the input image to the interior wall of the tubular structure, to generate the input depth image based on the input image and the transformed depth image based on the transformed image.
In the first aspect, the depth image generation model may be a model that has been trained in advance through supervised learning using training data including a combination of a virtual image for training, which is generated based on the three-dimensional image of the subject and which represents, in a pseudo manner, the interior wall of the tubular structure as viewed from the virtual viewpoint predetermined in the three-dimensional image, and a depth image for training, which is generated based on distance information from the virtual viewpoint in the three-dimensional image to the interior wall of the tubular structure and which represents a distance for each pixel from the virtual viewpoint to the interior wall of the tubular structure.
In the first aspect, the depth image generation model may be a model that has been trained in advance through supervised learning using training data including a combination of an image obtained by transforming a virtual image for training, which is generated based on the three-dimensional image of the subject and which represents, in a pseudo manner, the interior wall of the tubular structure as viewed from the virtual viewpoint predetermined in the three-dimensional image, into a real image style by using a transformation model that has been trained in advance to receive input of the virtual image, transform the input virtual image into the real image style, and output the transformed virtual image, and a depth image for training, which is generated based on distance information from the virtual viewpoint in the three-dimensional image to the interior wall of the tubular structure and which represents a distance for each pixel from the virtual viewpoint to the interior wall of the tubular structure.
In the first aspect, the depth image generation model may be a model that has been trained through supervised learning using training data including a combination of a real image for training, which is captured by the endoscope inserted into the tubular structure of the subject and which represents the interior wall of the tubular structure, and a depth image for training, which is generated based on distance information from a viewpoint corresponding to a viewpoint at which the real image for training is captured, in the three-dimensional image of the subject, to the interior wall of the tubular structure and which represents a distance for each pixel from the viewpoint corresponding to the viewpoint at which the real image for training is captured to the interior wall of the tubular structure.
In the first aspect, the depth image generation model may be a model that has been trained in advance through supervised learning using training data including a combination of a real image for training, which is captured by the endoscope inserted into the tubular structure of the subject and which represents the interior wall of the tubular structure, and a depth image for training, which is generated based on an actual measurement value of a distance from a viewpoint at which the real image for training is captured to the interior wall of the tubular structure, the actual measurement value being obtained by a distance-measuring sensor mounted on the endoscope, and which represents a distance for each pixel from the viewpoint at which the real image for training is captured to the interior wall of the tubular structure.
A second aspect of the present disclosure relates to an information processing method including: acquiring an input image including any one of a real image, which is captured by an endoscope inserted into a tubular structure of a subject and which represents an interior wall of the tubular structure, or a virtual image, which is generated based on a three-dimensional image of the subject and which represents, in a pseudo manner, the interior wall of the tubular structure as viewed from a virtual viewpoint predetermined in the three-dimensional image; using a transformation model, which is trained to transform any one of the input real image or virtual image into an image style of the other of the input real image or virtual image, to transform the input image into a transformed image having an image style that is not included in the input image; acquiring an input depth image, which represents a distance for each pixel from a viewpoint of the input image to the interior wall of the tubular structure, and a transformed depth image, which represents a distance for each pixel from a viewpoint of the transformed image to the interior wall of the tubular structure; and training the transformation model by using a loss function including a degree of similarity between the input depth image and the transformed depth image.
A third aspect of the present disclosure relates to an information processing program for causing a computer to execute a process including: acquiring an input image including any one of a real image, which is captured by an endoscope inserted into a tubular structure of a subject and which represents an interior wall of the tubular structure, or a virtual image, which is generated based on a three-dimensional image of the subject and which represents, in a pseudo manner, the interior wall of the tubular structure as viewed from a virtual viewpoint predetermined in the three-dimensional image; using a transformation model, which is trained to transform any one of the input real image or virtual image into an image style of the other of the input real image or virtual image, to transform the input image into a transformed image having an image style that is not included in the input image; acquiring an input depth image, which represents a distance for each pixel from a viewpoint of the input image to the interior wall of the tubular structure, and a transformed depth image, which represents a distance for each pixel from a viewpoint of the transformed image to the interior wall of the tubular structure; and training the transformation model by using a loss function including a degree of similarity between the input depth image and the transformed depth image.
According to the aspects described above, the information processing apparatus, the information processing method, and the information processing program according to the present disclosure can accurately estimate the viewpoint difference between the real image and the virtual image of the endoscope.
Hereinafter, description of an embodiment of the present disclosure will be made with reference to the accompanying drawings.
The endoscope apparatus 3 comprises an endoscope 31 that images an inside of a tubular structure of a subject, a processor device 32 that generates an image of the inside of the tubular structure based on a signal obtained by the imaging, and the like. Examples of the tubular structure include a bronchus, a large intestine, and a small intestine.
In the endoscope 31, an insertion part to be inserted into the tubular structure of the subject is attached continuously with an operating part 3A. The endoscope 31 is connected to the processor device 32 via a universal cord that is attachably and detachably connected to the processor device 32. The operating part 3A includes various buttons for giving an instruction for an operation such that a distal end 3B of the insertion part is bent in an up-down direction and a left-right direction within a predetermined angle range, or operating a puncture needle attached to the distal end of the endoscope 31 to collect a tissue sample. The endoscope 31 is, for example, a bronchoscope, a large intestine endoscope, a small intestine endoscope, a laparoscope, or a thoracoscope.
In the present embodiment, the endoscope 31 is a flexible endoscope for the bronchus, and is inserted into the bronchus of the subject. Then, light guided by an optical fiber from a light source device (not shown) provided in the processor device 32 is applied from the distal end 3B of the insertion part of the endoscope 31, and the image of the inside of the bronchus of the subject is acquired by an imaging optical system of the endoscope 31. It should be noted that, in order to facilitate the description, the distal end 3B of the insertion part of the endoscope 31 will be referred to as an endoscope distal end 3B in the following description.
The processor device 32 transforms an imaging signal captured by the endoscope 31 into a digital image signal, and corrects an image quality by digital signal processing, such as white balance adjustment and shading correction, to generate a real image T0. That is, the real image T0 is an image that is captured by the endoscope inserted into the tubular structure (bronchus) of the subject and represents the interior wall of the tubular structure (bronchus). The real image T0 is a color moving image displayed at a predetermined sampling rate, such as 30 fps, and one frame of the moving image is the real image T0. The real image T0 is sequentially transmitted to, for example, the image storage server 5 and the information processing apparatus 10.
The three-dimensional image capturing apparatus 4 is an apparatus that images an examination target part of the subject to generate a three-dimensional image representing the part, and is, specifically, a CT apparatus, an MRI apparatus, a positron emission tomography (PET) apparatus, an ultrasound diagnostic apparatus, or the like that captures an image by a method other than a method of inserting an endoscope 31 into a tubular structure to capture the tubular structure. The three-dimensional image generated by the three-dimensional image capturing apparatus 4 is transmitted to and stored in the image storage server 5. In the present embodiment, the three-dimensional image capturing apparatus 4 generates a three-dimensional image V0 obtained by imaging a chest including the bronchus. It should be noted that, in the present embodiment, the three-dimensional image capturing apparatus 4 is the CT apparatus, but the present disclosure is not limited to this.
The image storage server 5 is a computer that stores and manages various data, and comprises a large-capacity external storage device and software for database management. The image storage server 5 communicates with another device via the network 8, and transmits and receives image data and the like to and from the other device. Specifically, the image data, such as the real image T0 acquired by the endoscope apparatus 3, the three-dimensional image V0 generated by the three-dimensional image capturing apparatus 4, and a virtual image K0 generated by the information processing apparatus 10, are acquired via the network 8, stored in a recording medium, such as a large-capacity external storage device, and managed. It should be noted that the real image T0 is a moving image. Therefore, it is preferable that the real image T0 is transmitted to the information processing apparatus 10 without passing through the image storage server 5. It should be noted that the storage format of the image data and the communication between the devices via the network 8 are based on a protocol, such as Digital Imaging and Communications in Medicine (DICOM).
Meanwhile, the real image T0 of the endoscope is an image in which the color, texture, and the like of the inside of the tubular structure are clearly represented, but represents the inside of the tubular structure as a two-dimensional image. Therefore, it is difficult to understand which position inside the tubular structure is represented by the real image T0. In particular, since the bronchus has a small diameter and a narrow visual field, it is difficult to reach a target position with the endoscope distal end 3B.
Therefore, the information processing apparatus 10 according to the present embodiment helps a user understand which position inside the tubular structure is represented by the real image T0, based on the real image T0 obtained by the endoscope apparatus 3 and the three-dimensional image V0 obtained by the three-dimensional image capturing apparatus 4. Specifically, the information processing apparatus 10 performs viewpoint difference estimation processing of estimating a viewpoint difference ΔL between a virtual viewpoint that is virtually set in the three-dimensional image V0 and a viewpoint of the real image T0. Hereinafter, description of an example of the information processing apparatus 10 according to the present embodiment will be made.
First, description of an example of a hardware configuration of the information processing apparatus 10 will be made with reference to
The storage unit 22 is realized by, for example, a storage medium, such as a hard disk drive (HDD), a solid state drive (SSD), and a flash memory. The storage unit 22 stores an information processing program 27 of the information processing apparatus 10. The CPU 21 reads out the information processing program 27 from the storage unit 22, deploys the read-out program into the memory 23, and executes the deployed information processing program 27. The CPU 21 is an example of a processor according to the present disclosure. The storage unit 22 stores a transformation model M1, a depth image generation model M2, and a viewpoint difference estimation model M3. As the information processing apparatus 10, for example, a personal computer, a server computer, a smartphone, a tablet terminal, a wearable terminal, or the like can be applied as appropriate.
Next, description of an example of a functional configuration of the information processing apparatus 10 will be made with reference to
The acquisition unit 11 acquires the real image T0 captured by the endoscope 31 disposed at a predetermined viewpoint position in the bronchus from the endoscope apparatus 3. It should be noted that the acquisition unit 11 may perform data compression processing on the real image T0 in order to reduce processing amounts in various types of processing to be described later. For example, the acquisition unit 11 may perform binarization based on a brightness value on the real image T0. In the following description, in a case in which the real image is simply referred to as the “real image T0”, the real image T0 on which the data compression processing is performed is included.
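As a non-limiting illustration of the binarization based on the brightness value mentioned above, the following Python sketch converts each color pixel into a brightness value and compares it with a threshold value. The brightness weights (the commonly used ITU-R BT.601 luma coefficients), the threshold value of 128, and the function names are assumptions made for illustration; the present disclosure does not specify how the brightness value or the threshold value is determined.

```python
def brightness(rgb):
    """Brightness of one (R, G, B) pixel using BT.601 luma weights (assumed)."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def binarize(color_image, threshold=128.0):
    """Return a binary image: 1 where brightness >= threshold, else 0.

    color_image is a nested list: rows of (R, G, B) tuples in the range 0-255.
    """
    return [
        [1 if brightness(px) >= threshold else 0 for px in row]
        for row in color_image
    ]
```

Each pixel is reduced from three 8-bit channels to a single bit, which is one way in which the processing amount in later processing could be reduced.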
In addition, the acquisition unit 11 acquires the three-dimensional image V0 of the subject captured by the three-dimensional image capturing apparatus 4. As described above, the three-dimensional image V0 is obtained, for example, by performing CT imaging on the chest including the bronchus, and consists of a plurality of tomographic images T1 to Tm (m is 2 or more) (see
In addition, the acquisition unit 11 may acquire the virtual image K0 that is generated based on the three-dimensional image V0 and represents, in a pseudo manner, the interior wall of the tubular structure as viewed from a virtual viewpoint P10 predetermined in the three-dimensional image V0. Hereinafter, specific description of the acquisition method of the virtual image K0 based on the three-dimensional image V0 will be made.
First, description of a method using surface rendering will be made as an example of the acquisition method of the virtual image K0 based on the three-dimensional image V0. As shown by a broken line in
Next, the acquisition unit 11 sets each position at a predetermined interval along the route 40 as the viewpoint. These viewpoints are examples of a “virtual viewpoint predetermined in the three-dimensional image” according to the present disclosure.
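As a non-limiting illustration of setting the viewpoints at a predetermined interval along the route, the following Python sketch walks along a route given as a sequence of three-dimensional points and emits a viewpoint every `interval` units of path length. The representation of the route as a polyline, the use of linear interpolation between the points, and the function name are assumptions made for illustration.

```python
import math


def sample_viewpoints(route, interval):
    """Return points spaced `interval` apart (in path length) along a polyline.

    route is a list of (x, y, z) points; the first point is always returned.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    if not route:
        return []
    viewpoints = [tuple(route[0])]
    dist_to_next = interval  # path length remaining until the next viewpoint
    for p, q in zip(route, route[1:]):
        seg_len = math.dist(p, q)
        pos = 0.0  # distance already travelled along the current segment
        while seg_len - pos >= dist_to_next:
            pos += dist_to_next
            t = pos / seg_len
            viewpoints.append(tuple(a + t * (b - a) for a, b in zip(p, q)))
            dist_to_next = interval
        # Carry the unused remainder of this segment over to the next one.
        dist_to_next -= seg_len - pos
    return viewpoints
```

For example, a straight route of length 10 sampled with an interval of 2.5 gives five viewpoints including both end points.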
Further, the acquisition unit 11 generates a projection image by performing the central projection of projecting the three-dimensional image V0 on a plurality of visual lines that extend radially in an insertion direction of the endoscope distal end 3B (that is, a direction toward a distal end of the bronchus) from each set viewpoint to a predetermined projection surface. This projection image is the virtual image K0 that is virtually generated as though the imaging is performed at the position of the endoscope distal end 3B. The acquisition unit 11 generates the virtual image K0 for each set viewpoint.
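As a non-limiting illustration of the visual lines used in the central projection, the following Python sketch computes, for each pixel of the projection surface, a unit direction vector that extends from the viewpoint through the center of the pixel. The pinhole-type geometry, the viewpoint coordinate system in which the Z axis points in the insertion direction, the horizontal angle of view parameter, and the function name are assumptions made for illustration; the sampling of the three-dimensional image V0 along each visual line is omitted.

```python
import math


def visual_line_directions(width, height, angle_of_view_deg):
    """Unit direction vectors (x, y, z), one per pixel, for central projection.

    The viewpoint is the origin and +Z is the assumed insertion direction.
    angle_of_view_deg is the full horizontal angle of view in degrees.
    """
    if not 0.0 < angle_of_view_deg < 180.0:
        raise ValueError("angle of view must be between 0 and 180 degrees")
    half_angle = math.radians(angle_of_view_deg) / 2.0
    # Distance (in pixel units) from the viewpoint to the projection surface.
    focal = (width / 2.0) / math.tan(half_angle)
    rays = []
    for v in range(height):
        row = []
        for u in range(width):
            # Offset of the pixel center from the center of the surface.
            x = u + 0.5 - width / 2.0
            y = v + 0.5 - height / 2.0
            norm = math.sqrt(x * x + y * y + focal * focal)
            row.append((x / norm, y / norm, focal / norm))
        rays.append(row)
    return rays
```

A wider angle of view shortens `focal` and spreads the visual lines further from the Z axis, while the visual line through the center of the projection surface always points along the Z axis.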
It should be noted that the acquisition unit 11 need only generate the virtual image K0 along at least the route 40, and can also generate, of course, the virtual image K0 at positions not along the route 40. For example, the acquisition unit 11 may set each position in a region of substantially the entire bronchus as the viewpoint and generate the virtual image K0 at each position. Moreover, the acquisition unit 11 may select, from among the virtual images K0 generated in the region of substantially the entire bronchus, the virtual images K0 along the route 40.
The method using the surface rendering has been described above. However, instead of the method using the surface rendering, for example, a known volume rendering method or the like may be used to generate the virtual image K0. In the method using the volume rendering, the virtual image K0 as viewed from any virtual viewpoint set in the three-dimensional image V0 can be generated based on the pixel value, the CT value, and the like specified from the three-dimensional image V0. In this case, unlike the method using the surface rendering, the bronchus image B0 does not need to be generated, and the virtual image K0 can be directly generated from the three-dimensional image V0 as shown by a solid line in
It should be noted that, in both methods, an angle of view (range of visual line) of the virtual image K0 and the center (center in the projection direction) of the visual field are set in advance by input or the like from the user. In addition, a plurality of virtual images K0 at the viewpoints generated by the acquisition unit 11 are stored in, for example, the storage unit 22, the image storage server 5, and the like. It should be noted that, in the present embodiment, since the three-dimensional image is a CT image, the virtual image K0 may be a monochrome image generated based on the CT value that forms the CT image. In addition, the virtual image K0 may be a monochrome image colored in a pseudo manner. Hereinafter, in
Viewpoint difference estimation processing
Description of the viewpoint difference estimation processing according to the present embodiment will be made with reference to
As shown in
This transformation processing is performed to remove noise that is included in the real image T0 and is not included in the virtual image K0. Examples of such noise include noise caused by body fluids or the like adhering to a lens of the endoscope distal end 3B and fogging the lens, noise that cannot be captured by tomography but appears in the real image T0, and fine texture such as gloss and a blood vessel generated on the interior wall of the tubular structure. The “virtual image style” is an expression form in which these kinds of noise peculiar to the real image are removed, and is a so-called computer graphics (CG) style expression form. In the real-to-virtual transformed image TK0, it is desired that the noise included in the real image T0 is removed and the structure of the tubular structure in the real image T0 is not changed.
The generation unit 13 generates a first depth image D1 representing a distance for each pixel from the viewpoint P11 of the real image T0 to the interior wall of the tubular structure by using the depth image generation model M2. Similarly, the generation unit 13 generates a second depth image D2 representing a distance for each pixel from the virtual viewpoint P10 to the interior wall of the tubular structure by using the depth image generation model M2. The depth image generation model M2 is a machine learning model that has been trained in advance to receive input of an image representing the interior wall of the tubular structure and output a depth image representing a distance for each pixel from the viewpoint of the input image to the interior wall of the tubular structure.
The “depth image” represents a distance from the viewpoint position by the pixel value thereof. For example, in a case in which the pixel value is decreased as the distance from the viewpoint position is increased, the image is darker as the distance from the viewpoint position is increased. By setting a correlation between the pixel value of the depth image and the distance in advance, the distance from the viewpoint position can be calculated based on the pixel value of the depth image. It should be noted that the “distance from the viewpoint position” represented by the pixel value of the depth image is not limited to the distance from the viewpoint position itself, and may be represented by various values corresponding to the distance from the viewpoint position. For example, a viewpoint coordinate system having the viewpoint as an origin is set, the projection surface of the depth image is set in an XY plane, and a direction perpendicular to the projection surface (XY plane) is set as a Z direction. In this case, the “distance from the viewpoint position” may be simply represented by a distance in a Z-axis direction from the viewpoint (origin) (that is, a coordinate in the Z-axis direction). For example, the “distance from the viewpoint position” may be represented as a distance from the projection surface of the depth image.
In addition, the correlation between the pixel value of the depth image and the distance is not limited to the proportional relationship that the pixel value is decreased as the distance from the viewpoint position is increased, and may be determined by, for example, an inverse proportion or logarithmic proportion relationship. In addition, for example, although the pixel value is generally represented by 256 gradations of 8 bits in integers of 0 to 255 in many cases, the present disclosure is not limited to this, and the pixel value of the depth image may be represented by any value such as a negative number or a decimal number. The same applies to each of the depth images for training described later.
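As a non-limiting illustration of the correlation between the pixel value of the depth image and the distance, the following Python sketch shows one decreasing linear mapping together with its inverse, and one inverse-proportion mapping. The parameters `max_distance`, `min_distance`, and `levels`, the clamping of out-of-range distances, and the function names are assumptions made for illustration.

```python
def encode_linear(distance, max_distance, levels=255):
    """Pixel value decreases linearly as the distance increases.

    Distance 0 maps to `levels` (brightest); max_distance maps to 0 (darkest).
    Distances outside [0, max_distance] are clamped (assumed behavior).
    """
    d = min(max(distance, 0.0), max_distance)
    return round(levels * (1.0 - d / max_distance))


def decode_linear(pixel, max_distance, levels=255):
    """Recover the distance from a pixel value produced by encode_linear."""
    return max_distance * (1.0 - pixel / levels)


def encode_inverse(distance, min_distance, levels=255):
    """Pixel value inversely proportional to the distance.

    Distances at or below min_distance map to `levels`.
    """
    d = max(distance, min_distance)
    return round(levels * min_distance / d)
```

With 256 gradations, the linear mapping recovers the distance only to within one quantization step of `max_distance / levels`, whereas the inverse-proportion mapping assigns more gradations to positions near the viewpoint.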
Specifically, the generation unit 13 inputs the real-to-virtual transformed image TK0 transformed from the real image T0 by the transformation unit 12 to the depth image generation model M2, to obtain the first depth image D1 at the viewpoint P11 of the real image T0. That is, it can be said that the first depth image D1 according to the present embodiment is an image generated based on the pixel value of the real-to-virtual transformed image TK0 obtained by transforming the real image T0 into the virtual image style.
In addition, the generation unit 13 obtains the second depth image D2 at the virtual viewpoint P10 by inputting the virtual image K0, which is generated from the three-dimensional image V0 acquired by the acquisition unit 11, at the virtual viewpoint P10, to the depth image generation model M2. That is, it can be said that the second depth image D2 according to the present embodiment is an image generated based on the pixel value of the virtual image K0.
The estimation unit 14 uses at least one of the real image T0 at the viewpoint P11 or the first depth image D1 and at least one of the virtual image K0 at the virtual viewpoint P10 or the second depth image D2, to estimate the viewpoint difference ΔL between the viewpoint P11 of the real image T0 and the virtual viewpoint P10. It should be noted that, in the estimation of the viewpoint difference ΔL, at least one of the first depth image D1 or the second depth image D2 is used.
As described above, a distance for each pixel from the viewpoint position to the interior wall of the tubular structure can be calculated based on the pixel value of the depth image. Therefore, since at least one of the first depth image D1 or the second depth image D2 is used, the viewpoint difference ΔL in the tubular structure having a three-dimensional structure can be estimated with higher accuracy than in a case in which the viewpoint difference ΔL is estimated by using only the real image T0, which is a two-dimensional image, and/or the virtual image K0.
Specifically, the estimation unit 14 estimates the viewpoint difference ΔL between the viewpoint P11 of the real image T0 and the virtual viewpoint P10 by using the viewpoint difference estimation model M3. The viewpoint difference estimation model M3 is, for example, a machine learning model that has been trained in advance to receive input including at least one of the first depth image or the second depth image and to output the viewpoint difference based on that input. In the example in
In the example in
In addition, the estimation unit 14 may correct the position of the virtual viewpoint P10 so that the estimated viewpoint difference ΔL is reduced. For example, as shown in
It should be noted that the viewpoint difference ΔL may include a posture difference in addition to the misregistration amount. For example, the viewpoint difference ΔL may be represented by a displacement vector, a combination of the angle and the distance, an Euler angle, a rotation vector, or the like. Further, for example, the viewpoint difference ΔL may be represented by a relative posture between the virtual viewpoint P10 and the viewpoint P11 of the real image T0.
As described above, the viewpoint difference estimation processing is performed. It should be noted that, as described above, the real image T0 constitutes one frame of the moving image. Therefore, the viewpoint difference estimation processing is repeatedly performed for each of the real images T0 sequentially acquired in the moving image.
Next, a plurality of training methods for the transformation model M1 used in the viewpoint difference estimation processing will be described. The training unit 15 trains the transformation model M1 by using at least one of the following methods.
The GAN of
The transformation model M1 corresponds to the generator in the GAN and is a machine learning model, such as a neural network, which transforms the input real image for training into an expression form close to the virtual image for training, which is the ground truth data, and outputs the transformed input image. A discriminator M1D is a machine learning model, such as a neural network, which discriminates whether the input image is the virtual image for training that is the ground truth (real) data or the real-to-virtual transformed image for training that is the fake data.
The training unit 15 generates a real-to-virtual transformed image for training LTK0 having an expression form close to the virtual image for training LK0, which is the ground truth data, by inputting the real image for training LT0 to the transformation model M1 (generator). In addition, the training unit 15 inputs any one of the real-to-virtual transformed image for training LTK0 generated by the transformation model M1 or the virtual image for training LK0, which is the ground truth data, to the discriminator M1D, to obtain the discrimination result. Then, the training unit 15 feeds back the discrimination result by the discriminator M1D and information indicating whether or not the discrimination result is correct to the transformation model M1 (generator). In this way, the transformation model M1 and the discriminator M1D are trained in a mutually interactive manner.
Although the first training method using the GAN can also perform training using the non-pair data, a constraint in the GAN is only that an expression form is the virtual image style. Therefore, even in a case in which the expression form is appropriately transformed before and after the transformation by the transformation model M1, there is a possibility that a problem occurs in which the bronchus structure in the image is inappropriately transformed.
In the second training method using the CycleGAN, a constraint is applied so that the data obtained by reverse transformation is returned to the original state. Therefore, it is possible to train the transformation model M1 while searching for pseudo pair data in which the bronchus structures are also similar in addition to the expression form. As a result, it is possible to generate the transformation model M1 in which the bronchus structure in the image is less likely to change, and it is possible to improve the accuracy of the transformation.
The CycleGAN in
The training unit 15 generates a real-to-virtual-to-real transformed image for training LTKT0, which has a structure and an expression form close to those of the real image for training LT0 that is the original input data, by inputting the real-to-virtual transformed image for training LTK0 output from the transformation model M1 to the reverse transformation model M1R. In addition, the training unit 15 trains the transformation model M1 by using a loss function Loss1 including a degree of similarity between the real image for training LT0, which is the original input data, and the real-to-virtual-to-real transformed image for training LTKT0. The degree of similarity between the real image for training LT0 and the real-to-virtual-to-real transformed image for training LTKT0 is represented by, for example, a squared error for each pixel of each image, a squared error after normalizing the pixel value of each image, or a cosine similarity. In addition, for example, the degree of similarity is not limited to the degree of similarity between the pixel values themselves, and may be represented by a degree of similarity between the reciprocals of the pixel values or between the logarithms of the pixel values.
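The similarity measures named above can be sketched as follows, assuming that each image is given as a flat list of pixel values of equal length. The function names and the use of a mean over the pixels are assumptions for illustration; the present disclosure does not prescribe a specific formula for Loss1.

```python
import math


def squared_error(a, b):
    """Mean of the squared per-pixel differences between two images."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def normalize(img):
    """Rescale pixel values to [0, 1]; a constant image maps to all zeros."""
    lo, hi = min(img), max(img)
    if hi == lo:
        return [0.0] * len(img)
    return [(v - lo) / (hi - lo) for v in img]


def normalized_squared_error(a, b):
    """Squared error computed after normalizing the pixel values of each image."""
    return squared_error(normalize(a), normalize(b))


def cosine_similarity(a, b):
    """Cosine of the angle between two images viewed as vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0  # undefined for an all-zero image; 0.0 is a convention here
    return dot / (na * nb)
```

A squared error is smaller for more similar images, whereas a cosine similarity is larger, so a loss built on the latter would typically use a form such as one minus the cosine similarity.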
Specifically, first, the training unit 15 acquires an input image including any one of the real image for training or the virtual image for training. The “input image” here means an image that is input to the transformation model in the subsequent stage. In addition, the training unit 15 transforms the input image into a transformed image having an image style that is not included in the input image by using a transformation model trained to transform any one of the input real image for training or virtual image for training into an image style of the other of the input real image for training or virtual image for training.
In the example in
In addition, the training unit 15 acquires an input depth image Dt representing a distance for each pixel from the viewpoint of the input image (real image for training LT0) to the interior wall of the tubular structure. Specifically, the training unit 15 generates the input depth image Dt by inputting the input image (real image for training LT0) to the depth image generation model M2.
Similarly, the training unit 15 acquires a transformed depth image Dtk representing a distance for each pixel from the viewpoint of the transformed image (real-to-virtual transformed image for training LTK0) to the interior wall of the tubular structure. Specifically, the training unit 15 generates the transformed depth image Dtk by inputting the transformed image (real-to-virtual transformed image for training LTK0) to the depth image generation model M2.
The training unit 15 trains the transformation model M1 by using the loss function Loss2 including a degree of similarity between the input depth image Dt and the transformed depth image Dtk. That is, the training unit 15 trains the transformation model M1 so that the transformed depth image Dtk approximates the input depth image Dt. The degree of similarity between the input depth image Dt and the transformed depth image Dtk is represented by, for example, a squared error for each pixel of each image, a squared error after normalizing the depth (pixel value) of each image, or a cosine similarity. In addition, for example, the degree of similarity is not limited to the degree of similarity between the pixel values themselves, and may be represented by a degree of similarity between the reciprocals of the pixel values or between the logarithms of the pixel values.
As described above, with the training method of the transformation model M1 using the loss function Loss2 including the degree of similarity between the depth images, the transformation model M1 can be trained while searching for the pseudo pair data by applying a constraint so that the depth images before and after the transformation approximate each other. Therefore, it is possible to generate the transformation model M1 in which the bronchus structure in the image is less likely to change, and it is possible to improve the accuracy of the transformation.
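The data flow of this depth-based constraint can be sketched as follows. The transformation model and the depth image generation model are passed in as plain callables, and the toy stand-ins at the bottom are hypothetical and exist only so that the sketch can be run; they are not the models of the present disclosure. A squared error is used as one assumed choice of the degree of similarity.

```python
def depth_consistency_loss(input_image, transformation_model, depth_model):
    """Squared error between the depth images before and after the transformation."""
    transformed = transformation_model(input_image)  # corresponds to LTK0
    d_t = depth_model(input_image)                   # input depth image Dt
    d_tk = depth_model(transformed)                  # transformed depth image Dtk
    return sum((a - b) ** 2 for a, b in zip(d_t, d_tk)) / len(d_t)


# --- Toy stand-ins (hypothetical, for illustration only) ---

def toy_depth_model(img):
    """Brighter pixel = nearer; invariant to overall brightness and contrast."""
    lo, hi = min(img), max(img)
    return [1.0 - (v - lo) / (hi - lo) for v in img]


def style_only_transform(img):
    """Changes brightness and contrast but keeps the spatial structure."""
    return [0.5 * v + 10.0 for v in img]


def structure_breaking_transform(img):
    """Rearranges pixels, so the depicted structure changes."""
    return list(reversed(img))
```

With the toy image [0, 50, 100, 200], the style-only transformation yields a loss of zero, whereas the structure-breaking transformation yields a positive loss, which is the behavior the constraint is meant to penalize.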
It should be noted that, from a viewpoint of ease of preparation of the training data, it is desired that the depth image generation model M2 is trained based on the virtual image generated based on the three-dimensional image (details will be described later). The inventors have found that a good depth image can be obtained even in a case in which the real image is input to the depth image generation model M2 that has been trained based on the virtual image. Therefore, in the present method, the input depth image Dt of the real image for training LT0 is obtained by repurposing the depth image generation model M2 that has been trained based on the virtual image. With such a form, it is possible to save the time and effort of creating a depth image generation model that has been trained for the real image.
In the first to third training methods, the method of the unsupervised learning using the non-pair data as the training data has been described. On the other hand, supervised learning can also be applied by specifying the real position of the endoscope distal end 3B in the tubular structure by the electromagnetic sensor or the like and using the pair data of the real image and the virtual image having the same viewpoint. In this case, the training unit 15 performs supervised learning on the transformation model M1 by using the training data including a combination of the real image for training and the virtual image for training that are specified to have the same viewpoint.
It should be noted that, in general, it is difficult to prepare a large amount of such pair data. Therefore, the training unit 15 may train the transformation model M1 by combining the first to fourth training methods as appropriate.
So far, the form has been described in which the real image is transformed into the real-to-virtual transformed image having the virtual image style, but the technique of the present disclosure can also be applied to a form in which the virtual image is transformed into a virtual-to-real transformed image having the real image style. That is, it is also possible to repurpose the first to fourth training methods related to the training of the transformation model M1, to generate the reverse transformation model M1R trained to receive input of the virtual image, transform the input virtual image into the real image style, and output the transformed virtual image.
The reverse transformation model M1R can be generated by interchanging the real image for training LT0 and the virtual image for training LK0 in the first to fourth training methods related to the training of the transformation model M1. That is, the input image to the reverse transformation model M1R in a training phase includes the virtual image for training LK0.
Next, a plurality of training methods for the depth image generation model M2 used in the viewpoint difference estimation processing will be described. The training unit 15 trains the depth image generation model M2 by using at least one of the following methods. Hereinafter, since the depth image generation models M2 obtained by the training methods receive input of different types of data, the depth image generation models M2 will be separately referred to as depth image generation models M2A to M2C.
Here, the virtual image for training LK0 is an image, which is generated based on the three-dimensional image V0 of the subject and which represents, in a pseudo manner, the interior wall of the tubular structure as viewed from a virtual viewpoint P4 predetermined in the three-dimensional image V0. The depth image for training LD0 is an image, which is generated based on distance information from the virtual viewpoint P4 in the three-dimensional image V0 to the interior wall of the tubular structure and which represents a distance for each pixel from the virtual viewpoint P4 to the interior wall of the tubular structure, and is the ground truth data. The virtual viewpoint P4 is an example of a fourth viewpoint according to the present disclosure.
The distance information indicates, for example, a distance from the virtual viewpoint P4 in the three-dimensional image V0 to a point at which an opacity is equal to or greater than a predetermined value. In a case in which the opacity is equal to or greater than the predetermined value in the three-dimensional image V0, it is considered that the portion corresponds to the interior wall of the tubular structure. In addition, for example, the distance information may indicate a distance from the virtual viewpoint P4 to a surface of the bronchus image B0 in a case in which the bronchus image B0 is generated based on the three-dimensional image V0 and then the virtual image for training LK0 is generated (in a case in which the virtual image for training LK0 is generated by the surface rendering).
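As a minimal sketch of the opacity-based distance information, the following Python example marches along a ray from a viewpoint and returns the distance to the first sample at which the opacity is equal to or greater than a threshold. The opacity is supplied as a callable, and the cylindrical toy lumen, the step size, and all names are assumptions for illustration; an actual implementation would sample the three-dimensional image V0 instead.

```python
import math


def distance_to_wall(opacity, origin, direction, threshold=0.5,
                     step=0.5, max_dist=100.0):
    """Distance along a ray to the first sample whose opacity >= threshold.

    Returns None if no such sample is found within max_dist.
    """
    norm = math.sqrt(sum(c * c for c in direction))
    if norm == 0.0:
        raise ValueError("direction must be a non-zero vector")
    unit = tuple(c / norm for c in direction)
    t = 0.0
    while t <= max_dist:
        point = tuple(o + t * u for o, u in zip(origin, unit))
        if opacity(*point) >= threshold:
            return t
        t += step
    return None


def toy_lumen_opacity(x, y, z):
    """Hypothetical volume: a hollow tube of radius 5 running along the z axis."""
    return 1.0 if math.hypot(x, y) >= 5.0 else 0.0
```

Evaluating this function for the ray through each pixel of the projection surface would give one distance per pixel, from which a depth image for training could be formed. The accuracy is limited by the step size in this simple sketch.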
The depth image generation model M2A is a machine learning model, such as a neural network, which transforms the input virtual image for training into the depth image and outputs the depth image. The training unit 15 generates a depth image Dk0 having an expression form close to the depth image for training LD0, which is the ground truth data, by inputting the virtual image for training LK0 to the depth image generation model M2A.
In addition, the training unit 15 trains the depth image generation model M2A by using a loss function Loss3 including a degree of similarity between the depth image Dk0 generated by the depth image generation model M2A and the depth image for training LD0 which is the ground truth data. The degree of similarity between the depth image Dk0 and the depth image for training LD0 is represented by, for example, a squared error for each pixel of each image, a squared error after normalizing the pixel value of each image, or a cosine similarity. In addition, for example, the degree of similarity is not limited to the degree of similarity between the pixel values themselves, and may be represented by a degree of similarity between the reciprocals of the pixel values or between the logarithms of the pixel values.
As described above, for the depth image generation model M2A obtained by the first training method, the pair data for training (virtual image for training LK0 and depth image for training LD0) can be prepared from the three-dimensional image V0 alone. Therefore, it is easy to prepare the data for training as compared with other methods.
Here, the virtual image for training LK0 is an image, which is generated based on the three-dimensional image V0 of the subject and which represents, in a pseudo manner, the interior wall of the tubular structure as viewed from a virtual viewpoint P5 predetermined in the three-dimensional image V0. As described above, the reverse transformation model M1R is a machine learning model that has been trained in advance to receive input of the virtual image, transform the input virtual image into the real image style, and output the transformed virtual image. The depth image for training LD0 is an image, which is generated based on distance information from the virtual viewpoint P5 in the three-dimensional image V0 to the interior wall of the tubular structure and which represents a distance for each pixel from the virtual viewpoint P5 to the interior wall of the tubular structure, and is the ground truth data. The virtual viewpoint P5 is an example of a fifth viewpoint according to the present disclosure. Since the distance information is the same as the distance information in the first training method, the description thereof will be omitted.
The depth image generation model M2B is a machine learning model, such as a neural network, which transforms the input virtual-to-real transformed image for training into the depth image and outputs the depth image. The training unit 15 generates a depth image Dkt0 having an expression form close to the depth image for training LD0, which is the ground truth data, by inputting the virtual-to-real transformed image for training LKT0 to the depth image generation model M2B.
In addition, the training unit 15 trains the depth image generation model M2B by using a loss function Loss4 including a degree of similarity between the depth image Dkt0 generated by the depth image generation model M2B and the depth image for training LD0 which is the ground truth data. The degree of similarity between the depth image Dkt0 and the depth image for training LD0 is represented by, for example, a squared error for each pixel of each image, a squared error after normalizing the pixel value of each image, or a cosine similarity. In addition, for example, the degree of similarity is not limited to the degree of similarity between the pixel values themselves, and may be represented by a degree of similarity between the reciprocals of the pixel values or between the logarithms of the pixel values.
The depth image generation model M2B obtained as described above requires the additional time and effort of transforming the virtual image for training LK0 into the virtual-to-real transformed image for training LKT0 as compared with the first training method. On the other hand, in the operational phase, the accuracy can be maintained even in a case in which the real image T0 is used as the input to the depth image generation model M2B.
Here, the depth image for training LD0 is generated based on the distance information from the viewpoint corresponding to the viewpoint at which the real image for training LT0 is captured, in the three-dimensional image V0 of the subject, to the interior wall of the tubular structure, and is used as the ground truth data. The viewpoint corresponding to the viewpoint at which the real image for training LT0 is captured, in the three-dimensional image V0, that is, the real position of the endoscope distal end 3B in the inside of the tubular structure can be specified by, for example, the electromagnetic sensor or the like provided in the endoscope distal end 3B that captures the real image for training LT0. Since the distance information is the same as the distance information in the first training method, the description thereof will be omitted.
The depth image generation model M2C is a machine learning model, such as a neural network, which transforms the input real image for training into the depth image and outputs the depth image. The training unit 15 generates a depth image Dt0 having an expression form close to the depth image for training LD0, which is the ground truth data, by inputting the real image for training LT0 to the depth image generation model M2C.
In addition, the training unit 15 trains the depth image generation model M2C by using a loss function Loss5 including a degree of similarity between the depth image Dt0 generated by the depth image generation model M2C and the depth image for training LD0 which is the ground truth data. The degree of similarity between the depth image Dt0 and the depth image for training LD0 is represented by, for example, a squared error for each pixel of each image, a squared error after normalizing the pixel value of each image, or a cosine similarity. In addition, for example, the degree of similarity is not limited to the degree of similarity between the pixel values themselves, and may be represented by a degree of similarity between the reciprocals of the pixel values or between the logarithms of the pixel values.
For the depth image generation model M2C obtained as described above, the viewpoint in the three-dimensional image V0 corresponding to the viewpoint at which the real image for training LT0 is captured, that is, the real position of the endoscope distal end 3B in the inside of the tubular structure, needs to be specified for the training data, and thus it is difficult to prepare the training data. On the other hand, in the operational phase, higher accuracy can be maintained even in a case in which the real image T0 is used as the input to the depth image generation model M2C.
In the first to third training methods, it has been described that the depth image for training LD0 is generated based on the distance information derived based on the three-dimensional image V0, but the present disclosure is not limited to this. The depth image for training LD0 may be generated, for example, based on an actual measurement value of the distance from the viewpoint at which the real image for training LT0 is captured to the interior wall of the tubular structure, the actual measurement value being obtained by a distance-measuring sensor mounted on the endoscope distal end 3B or the like. As the distance-measuring sensor, for example, various depth sensors, such as a time of flight (ToF) camera, can be used.
Although such data is difficult to prepare, the depth image for training LD0 based on the actual measurement value obtained by the distance-measuring sensor is more accurate. Therefore, the accuracy of each of the depth image generation models M2A to M2C can be improved.
Next, description of a training method of the viewpoint difference estimation model M3 used in the viewpoint difference estimation processing of
Therefore, the viewpoint difference estimation model M3 according to the present embodiment is trained through supervised learning using training data including a combination of at least one of a first virtual image for training as viewed from the virtual viewpoint P1 predetermined in the three-dimensional image V0 or a first depth image for training representing a distance for each pixel from the virtual viewpoint P1 to the interior wall of the tubular structure, at least one of a second virtual image for training as viewed from the virtual viewpoint P2 that is predetermined in the three-dimensional image V0 and is different from the virtual viewpoint P1 or a second depth image for training representing a distance for each pixel from the virtual viewpoint P2 to the interior wall of the tubular structure, and a viewpoint difference ΔL0 between the virtual viewpoint P1 and the virtual viewpoint P2. Here, each of the first virtual image for training and the second virtual image for training is generated based on the three-dimensional image V0 of the subject. The virtual viewpoint P1 is an example of a first viewpoint according to the present disclosure. The virtual viewpoint P2 is an example of a second viewpoint according to the present disclosure.
For example, the viewpoint difference estimation model M3 in the example of
Specifically, the training unit 15 first generates a depth image DP by inputting the virtual image for training LKP generated based on the three-dimensional image V0 of the subject to the depth image generation model M2A. That is, the depth image DP is an image representing the distance for each pixel from the virtual viewpoint P1 to the interior wall of the tubular structure. Similarly, the training unit 15 generates a depth image DQ by inputting the virtual image for training LKQ generated based on the three-dimensional image V0 of the subject to the depth image generation model M2A. That is, the depth image DQ is an image representing the distance for each pixel from the virtual viewpoint P2 to the interior wall of the tubular structure.
Thereafter, the training unit 15 inputs the virtual images for training LKP and LKQ and the depth images DP and DQ to the viewpoint difference estimation model M3, to obtain the estimated viewpoint difference ΔL. In addition, the training unit 15 trains the viewpoint difference estimation model M3 by using a loss function Loss6 including a degree of similarity between the estimated viewpoint difference ΔL estimated by the viewpoint difference estimation model M3 and the viewpoint difference ΔL0 between the virtual viewpoint P1 and the virtual viewpoint P2, which are the ground truth data.
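The preparation of one training sample and the evaluation of Loss6 can be sketched as follows. The renderer and the depth image generation model are passed in as callables, the viewpoint difference is simplified to a displacement vector with the assumed convention ΔL0 = P2 − P1, and a sum of squared errors is used as one assumed form of the degree of similarity. The toy stand-ins are hypothetical and serve only to make the sketch runnable.

```python
def make_training_sample(render, depth_of, p1, p2):
    """Build the inputs and the ground truth for one pair of virtual viewpoints.

    render(p)   -> virtual image for training as viewed from viewpoint p
    depth_of(k) -> depth image generated from virtual image k
    """
    lkp, lkq = render(p1), render(p2)          # LKP and LKQ
    dp, dq = depth_of(lkp), depth_of(lkq)      # DP and DQ
    delta_l0 = tuple(b - a for a, b in zip(p1, p2))  # ground truth, exact
    return {"inputs": (lkp, lkq, dp, dq), "target": delta_l0}


def loss6(estimated, target):
    """Sum of squared errors between the estimated and true viewpoint difference."""
    return sum((e - t) ** 2 for e, t in zip(estimated, target))


# --- Toy stand-ins (hypothetical, for illustration only) ---

def toy_render(p):
    """A four-pixel 'image' whose brightness depends on the z coordinate."""
    return [100.0 - p[2]] * 4


def toy_depth(img):
    """A 'depth image' derived directly from brightness."""
    return [255.0 - v for v in img]
```

Because both viewpoints are chosen in the three-dimensional image V0, the target is known exactly without any sensor measurement.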
Although the example of the viewpoint difference estimation processing has been described above, the technique of the present disclosure is not limited to this, and the various Examples shown below are also included. All of the following Examples have in common that the viewpoint difference estimation processing of estimating the viewpoint difference ΔL between the viewpoint P11 of the real image T0 and the virtual viewpoint P10 is performed. On the other hand, the combination and the contents of the transformation model M1, the reverse transformation model M1R, the depth image generation models M2A to M2C, and the viewpoint difference estimation model M3 used in the viewpoint difference estimation processing differ from those in the viewpoint difference estimation processing described above. Hereinafter, description of various Examples will be made with reference to
First, description of Examples 1-1 to 1-4 will be made. In these Examples, training is performed in the training phase by using two virtual images for training LKP and LKQ, and the viewpoint difference ΔL is estimated in the operational phase based on two depth images D1 and D2. In this case, there is an advantage that the ground truth data is accurate and easy to prepare. In Examples 1-1 to 1-4, the virtual image for training LKP is an example of a first virtual image for training according to the present disclosure, and the virtual image for training LKQ is an example of a second virtual image for training according to the present disclosure. In addition, the depth image DP is an example of a first depth image for training according to the present disclosure, and the depth image DQ is an example of a second depth image for training according to the present disclosure.
In the operational phase in the present example, the transformation unit 12 transforms the real image T0 into the real-to-virtual transformed image TK0 by inputting the real image T0 to the transformation model M1. The generation unit 13 generates the first depth image D1 by inputting the real-to-virtual transformed image TK0 to the depth image generation model M2A. In addition, the generation unit 13 generates the second depth image D2 by inputting the virtual image K0 to the depth image generation model M2A. The estimation unit 14 estimates the viewpoint difference ΔL by inputting the first depth image D1 and the second depth image D2 to the viewpoint difference estimation model M3.
In the operational phase in the present example, the generation unit 13 generates the first depth image D1 by inputting the real image T0 to the depth image generation model M2B or M2C. In addition, the generation unit 13 generates the second depth image D2 by inputting the virtual image K0 to the depth image generation model M2A. The estimation unit 14 estimates the viewpoint difference ΔL by inputting the first depth image D1 and the second depth image D2 to the viewpoint difference estimation model M3.
In the operational phase in the present example, the transformation unit 12 transforms the real image T0 into the real-to-virtual transformed image TK0 by inputting the real image T0 to the transformation model M1. The generation unit 13 generates the first depth image D1 by inputting the real-to-virtual transformed image TK0 to the depth image generation model M2A. In addition, the generation unit 13 generates the second depth image D2 by inputting the virtual image K0 to the depth image generation model M2A. The estimation unit 14 estimates the viewpoint difference ΔL by inputting the first depth image D1, the second depth image D2, the real-to-virtual transformed image TK0, and the virtual image K0 to the viewpoint difference estimation model M3.
In the operational phase in the present example, the transformation unit 12 transforms the real image T0 into the real-to-virtual transformed image TK0 by inputting the real image T0 to the transformation model M1. The generation unit 13 generates the first depth image D1 by inputting the real-to-virtual transformed image TK0 to the depth image generation model M2A. In addition, the generation unit 13 generates the second depth image D2 by inputting the virtual image K0 to the depth image generation model M2A. The estimation unit 14 estimates the viewpoint difference ΔL by inputting the first depth image D1, the second depth image D2, the real image T0, and the virtual image K0 to the viewpoint difference estimation model M3.
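The four operational-phase flows described above, in which both depth images D1 and D2 are used, differ only in how the first depth image D1 is obtained and in which images accompany the depth images. The following Python sketch expresses that wiring with the models passed in as callables. The function name, the parameter names, and the string-tagging stand-ins are hypothetical and are used only to make the data flow visible.

```python
def estimate_viewpoint_difference(t0, k0, m1, m2a, m3,
                                  m2_real=None, extra="none"):
    """Wire the models together for one real image T0 and one virtual image K0.

    m2_real: if given (standing in for M2B or M2C), D1 is generated directly
             from T0; otherwise T0 is first transformed by M1 and then fed to M2A.
    extra:   "none"        -> M3 receives D1 and D2
             "transformed" -> M3 additionally receives TK0 and K0
             "real"        -> M3 additionally receives T0 and K0
    """
    tk0 = None
    if m2_real is not None:
        d1 = m2_real(t0)
    else:
        tk0 = m1(t0)
        d1 = m2a(tk0)
    d2 = m2a(k0)
    inputs = [d1, d2]
    if extra == "transformed":
        if tk0 is None:
            tk0 = m1(t0)
        inputs += [tk0, k0]
    elif extra == "real":
        inputs += [t0, k0]
    elif extra != "none":
        raise ValueError("unknown extra: " + extra)
    return m3(*inputs)


# --- Stand-ins that only record which model saw which input (hypothetical) ---

def tag_m1(x):
    return "M1(" + x + ")"


def tag_m2a(x):
    return "M2A(" + x + ")"


def tag_m2b(x):
    return "M2B(" + x + ")"


def tag_m3(*inputs):
    return inputs
```

Calling the function with the tagging stand-ins returns the tuple of inputs that the viewpoint difference estimation model would receive, which makes it easy to check a given configuration against the corresponding description.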
Next, description of Examples 2-1 to 2-5 will be made. In these Examples, training is performed in the training phase by using two virtual images for training LKP and LKQ, and the viewpoint difference ΔL is estimated in the operational phase based on one depth image D1 or D2. In this case, there is an advantage that the ground truth data is accurate and easy to prepare. In Examples 2-1 to 2-5, the virtual image for training LKP is an example of a first virtual image for training according to the present disclosure, and the virtual image for training LKQ is an example of a second virtual image for training according to the present disclosure. In addition, the depth image DP is an example of a first depth image for training according to the present disclosure, and the depth image DQ is an example of a second depth image for training according to the present disclosure.
In the operational phase in the present example, the transformation unit 12 transforms the real image T0 into the real-to-virtual transformed image TK0 by inputting the real image T0 to the transformation model M1. The generation unit 13 generates the first depth image D1 by inputting the real-to-virtual transformed image TK0 to the depth image generation model M2A. The estimation unit 14 estimates the viewpoint difference ΔL by inputting the first depth image D1 and the virtual image K0 to the viewpoint difference estimation model M3.
In the operational phase in the present example, the transformation unit 12 transforms the real image T0 into the real-to-virtual transformed image TK0 by inputting the real image T0 to the transformation model M1. The generation unit 13 generates the second depth image D2 by inputting the virtual image K0 to the depth image generation model M2A. The estimation unit 14 estimates the viewpoint difference ΔL by inputting the real-to-virtual transformed image TK0 and the second depth image D2 to the viewpoint difference estimation model M3.
In the operational phase in the present example, the generation unit 13 generates the first depth image D1 by inputting the real image T0 to the depth image generation model M2B or M2C. That is, the first depth image D1 may be generated based on the pixel value of the real image T0. The estimation unit 14 estimates the viewpoint difference ΔL by inputting the first depth image D1 and the virtual image K0 to the viewpoint difference estimation model M3.
In the operational phase in the present example, the transformation unit 12 transforms the real image T0 into the real-to-virtual transformed image TK0 by inputting the real image T0 to the transformation model M1. The generation unit 13 generates the first depth image D1 by inputting the real-to-virtual transformed image TK0 to the depth image generation model M2A. The estimation unit 14 estimates the viewpoint difference ΔL by inputting the first depth image D1, the real-to-virtual transformed image TK0, and the virtual image K0 to the viewpoint difference estimation model M3.
In the operational phase in the present example, the transformation unit 12 transforms the real image T0 into the real-to-virtual transformed image TK0 by inputting the real image T0 to the transformation model M1. The generation unit 13 generates the first depth image D1 by inputting the real-to-virtual transformed image TK0 to the depth image generation model M2A. The estimation unit 14 estimates the viewpoint difference ΔL by inputting the first depth image D1, the real image T0, and the virtual image K0 to the viewpoint difference estimation model M3.
Next, Examples 3-1 to 3-3 will be described. In these Examples, training is performed in the training phase by using one virtual image for training LK0 and one real image for training LT0, and the viewpoint difference ΔL is estimated in the operational phase based on two depth images D1 and D2. Since training is performed by using the real image for training LT0, which is the same kind of image as the input (real image T0) in the operational phase, it is difficult to prepare the training data, but the accuracy of the viewpoint difference estimation is improved. In Examples 3-1 to 3-3, the depth image Dt0 is an example of a depth image for training according to the present disclosure, and the depth image Dk0 is an example of a virtual depth image for training according to the present disclosure.
In the operational phase in the present example, the generation unit 13 generates the first depth image D1 by inputting the real image T0 to the depth image generation model M2B or M2C. In addition, the generation unit 13 generates the second depth image D2 by inputting the virtual image K0 to the depth image generation model M2A. The estimation unit 14 estimates the viewpoint difference ΔL by inputting the first depth image D1 and the second depth image D2 to the viewpoint difference estimation model M3.
In the operational phase in the present example, the generation unit 13 generates the first depth image D1 by inputting the real image T0 to the depth image generation model M2B or M2C. In addition, the generation unit 13 generates the second depth image D2 by inputting the virtual image K0 to the depth image generation model M2A. The estimation unit 14 estimates the viewpoint difference ΔL by inputting the first depth image D1, the second depth image D2, the real image T0, and the virtual image K0 to the viewpoint difference estimation model M3.
In the operational phase in the present example, the transformation unit 12 transforms the real image T0 into the real-to-virtual transformed image TK0 by inputting the real image T0 to the transformation model M1. The generation unit 13 generates the first depth image D1 by inputting the real-to-virtual transformed image TK0 to the depth image generation model M2A. In addition, the generation unit 13 generates the second depth image D2 by inputting the virtual image K0 to the depth image generation model M2A. The estimation unit 14 estimates the viewpoint difference ΔL by inputting the first depth image D1, the second depth image D2, the real image T0, and the virtual image K0 to the viewpoint difference estimation model M3.
Next, Examples 4-1 to 4-4 will be described. In these Examples, training is performed in the training phase by using one virtual image for training LK0 and one real image for training LT0, and the viewpoint difference ΔL is estimated in the operational phase based on one depth image D1 or D2. Since training is performed by using the real image for training LT0, which is the same kind of image as the input (real image T0) in the operational phase, it is difficult to prepare the training data, but the accuracy of the viewpoint difference estimation is improved. In Examples 4-1 to 4-4, the depth image Dt0 is an example of a depth image for training according to the present disclosure, and the depth image Dk0 is an example of a virtual depth image for training according to the present disclosure.
In the operational phase in the present example, the generation unit 13 generates the second depth image D2 by inputting the virtual image K0 to the depth image generation model M2A. The estimation unit 14 estimates the viewpoint difference ΔL by inputting the second depth image D2 and the real image T0 to the viewpoint difference estimation model M3.
In the operational phase in the present example, the generation unit 13 generates the first depth image D1 by inputting the real image T0 to the depth image generation model M2B or M2C. The estimation unit 14 estimates the viewpoint difference ΔL by inputting the first depth image D1 and the virtual image K0 to the viewpoint difference estimation model M3.
In the operational phase in the present example, the transformation unit 12 transforms the real image T0 into the real-to-virtual transformed image TK0 by inputting the real image T0 to the transformation model M1. The generation unit 13 generates the first depth image D1 by inputting the real-to-virtual transformed image TK0 to the depth image generation model M2A. The estimation unit 14 estimates the viewpoint difference ΔL by inputting the first depth image D1 and the virtual image K0 to the viewpoint difference estimation model M3.
In the operational phase in the present example, the generation unit 13 generates the second depth image D2 by inputting the virtual image K0 to the depth image generation model M2A. The estimation unit 14 estimates the viewpoint difference ΔL by inputting the second depth image D2, the real image T0, and the virtual image K0 to the viewpoint difference estimation model M3.
Next, Examples 5-1 to 5-3 will be described. In these Examples, training is performed in the training phase by using two real images for training LTP and LTQ, and the viewpoint difference ΔL is estimated in the operational phase based on two depth images D1 and D2.
In the operational phase in the present example, the generation unit 13 generates the first depth image D1 by inputting the real image T0 to the depth image generation model M2B or M2C. The transformation unit 12 transforms the virtual image K0 into the virtual-to-real transformed image KT0 by inputting the virtual image K0 to the reverse transformation model M1R. In addition, the generation unit 13 generates the second depth image D2 by inputting the virtual-to-real transformed image KT0 to the depth image generation model M2B or M2C. The estimation unit 14 estimates the viewpoint difference ΔL by inputting the first depth image D1 and the second depth image D2 to the viewpoint difference estimation model M3.
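The routing in this example, in which the virtual image rather than the real image is style-transformed, can be traced with the following sketch. The tracing stand-ins trace_m1r, trace_m2, and trace_m3 are hypothetical and simply tag their inputs so that the path taken by T0 and K0 is visible; they are not trained models.

```python
def estimate_with_reverse_transform(real_image, virtual_image, m1r, m2_real, m3):
    """D1 = M2(T0); KT0 = M1R(K0); D2 = M2(KT0); viewpoint difference = M3(D1, D2)."""
    d1 = m2_real(real_image)      # first depth image D1 from the real image T0
    kt0 = m1r(virtual_image)      # virtual-to-real transformed image KT0
    d2 = m2_real(kt0)             # second depth image D2 from KT0
    return m3(d1, d2)

# Tracing stand-ins (assumptions): each one only labels what it received.
def trace_m1r(img):
    return ("real-style", img)

def trace_m2(img):
    return ("depth", img)

def trace_m3(d1, d2):
    return ("delta_L", d1, d2)

result = estimate_with_reverse_transform("T0", "K0", trace_m1r, trace_m2, trace_m3)
```

Both depth images come from the same real-style depth model here, which is why K0 is first converted to KT0.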
In the operational phase in the present example, the generation unit 13 generates the first depth image D1 by inputting the real image T0 to the depth image generation model M2B or M2C. In addition, the generation unit 13 generates the second depth image D2 by inputting the virtual image K0 to the depth image generation model M2A. The estimation unit 14 estimates the viewpoint difference ΔL by inputting the first depth image D1 and the second depth image D2 to the viewpoint difference estimation model M3.
In the operational phase in the present example, the generation unit 13 generates the first depth image D1 by inputting the real image T0 to the depth image generation model M2B or M2C. The transformation unit 12 transforms the virtual image K0 into the virtual-to-real transformed image KT0 by inputting the virtual image K0 to the reverse transformation model M1R. In addition, the generation unit 13 generates the second depth image D2 by inputting the virtual-to-real transformed image KT0 to the depth image generation model M2B or M2C. The estimation unit 14 estimates the viewpoint difference ΔL by inputting the first depth image D1, the second depth image D2, the real image T0, and the virtual-to-real transformed image KT0 to the viewpoint difference estimation model M3.
Next, Examples 6-1 and 6-2 will be described. In these Examples, training is performed in the training phase by using two real images for training LTP and LTQ, and the viewpoint difference ΔL is estimated in the operational phase based on one depth image.
In the operational phase in the present example, the transformation unit 12 transforms the virtual image K0 into the virtual-to-real transformed image KT0 by inputting the virtual image K0 to the reverse transformation model M1R. In addition, the generation unit 13 generates the second depth image D2 by inputting the virtual-to-real transformed image KT0 to the depth image generation model M2B or M2C. The estimation unit 14 estimates the viewpoint difference ΔL by inputting the second depth image D2 and the real image T0 to the viewpoint difference estimation model M3.
In the operational phase in the present example, the transformation unit 12 transforms the virtual image K0 into the virtual-to-real transformed image KT0 by inputting the virtual image K0 to the reverse transformation model M1R. In addition, the generation unit 13 generates the second depth image D2 by inputting the virtual-to-real transformed image KT0 to the depth image generation model M2B or M2C. The estimation unit 14 estimates the viewpoint difference ΔL by inputting the second depth image D2, the real image T0, and the virtual-to-real transformed image KT0 to the viewpoint difference estimation model M3.
As described above with reference to Examples, the viewpoint difference estimation model M3 according to the present embodiment may be a model that has been trained through supervised learning using training data including a combination of at least one of the real image for training, which is captured by the endoscope inserted into the tubular structure of the subject and which represents the interior wall of the tubular structure, or the depth image for training, which represents the distance for each pixel from the viewpoint of the real image for training to the interior wall of the tubular structure, at least one of the virtual image for training, which is generated based on the three-dimensional image V0 of the subject and which represents, in a pseudo manner, the interior wall of the tubular structure as viewed from the virtual viewpoint P3 predetermined in the three-dimensional image, or the virtual depth image for training, which represents the distance for each pixel from the virtual viewpoint P3 to the interior wall of the tubular structure, and the viewpoint difference ΔL0 between the viewpoint of the real image for training and the virtual viewpoint P3. The virtual viewpoint P3 is an example of a third viewpoint according to the present disclosure.
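The composition of one training sample described above can be illustrated with a small container type. This is only a sketch under assumptions: the class name ViewpointTrainingSample, its field names, and the use of NumPy arrays are hypothetical and are not taken from the present disclosure.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np

@dataclass
class ViewpointTrainingSample:
    """One supervised sample: real side, virtual side, and ground-truth difference."""
    viewpoint_difference: np.ndarray            # ground truth (delta L0)
    real_image: Optional[np.ndarray] = None     # real image for training
    real_depth: Optional[np.ndarray] = None     # depth image for training
    virtual_image: Optional[np.ndarray] = None  # virtual image for training
    virtual_depth: Optional[np.ndarray] = None  # virtual depth image for training

    def __post_init__(self):
        # At least one of the real image or its depth image must be present.
        if self.real_image is None and self.real_depth is None:
            raise ValueError("need the real image for training or its depth image")
        # At least one of the virtual image or its depth image must be present.
        if self.virtual_image is None and self.virtual_depth is None:
            raise ValueError("need the virtual image for training or its depth image")

sample = ViewpointTrainingSample(
    viewpoint_difference=np.zeros(3),
    real_image=np.zeros((4, 4, 3)),
    virtual_depth=np.zeros((4, 4)),
)
```

The two checks mirror the two "at least one of" conditions in the paragraph above; which of the four images is actually supplied depends on the Example.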
It should be noted that, in the embodiment described above, the form has been described in which the second depth image D2 based on the virtual image K0 used to estimate the viewpoint difference ΔL is generated by using the depth image generation model M2 based on the pixel value of the virtual image K0, but the present disclosure is not limited to this. The second depth image based on the virtual image K0 may be generated based on the distance information from the virtual viewpoint P10 in the three-dimensional image V0 to the interior wall of the tubular structure. With such a form, it is possible to obtain a more accurate second depth image D2 including accurate scale information. In this case, the generation of the virtual image K0 by the acquisition unit 11 may be omitted.
Here, the distance information indicates, for example, a distance from the virtual viewpoint P10 in the three-dimensional image V0 to a point at which the opacity is equal to or greater than the predetermined value. In a case in which the opacity is equal to or greater than the predetermined value in the three-dimensional image V0, it is considered that the portion corresponds to the interior wall of the tubular structure. In addition, for example, in a case in which the bronchus image B0 is generated based on the three-dimensional image V0, the distance information may indicate a distance from the virtual viewpoint P10 to a surface of the bronchus image B0.
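As a rough illustration of the distance information described above, the following sketch marches along one line of sight through an opacity volume and returns the distance to the first voxel whose opacity is equal to or greater than a threshold value. The function name depth_along_ray, the step size, the nearest-voxel sampling, the voxel-unit distances, and the synthetic straight-tube volume are all assumptions and are not part of the present disclosure.

```python
import numpy as np

def depth_along_ray(opacity, origin, direction, threshold=0.5, step=0.5, max_dist=200.0):
    """Distance (in voxels) from origin to the first voxel with opacity >= threshold."""
    origin = np.asarray(origin, dtype=float)
    direction = np.asarray(direction, dtype=float)
    direction = direction / np.linalg.norm(direction)
    t = 0.0
    while t <= max_dist:
        idx = np.round(origin + t * direction).astype(int)  # nearest voxel
        if np.any(idx < 0) or np.any(idx >= opacity.shape):
            return np.inf                                   # ray left the volume
        if opacity[tuple(idx)] >= threshold:
            return t                                        # reached the interior wall
        t += step
    return np.inf

# Synthetic volume: a straight tube along the z axis, lumen radius 5 voxels.
x, y = np.meshgrid(np.arange(32), np.arange(32), indexing="ij")
wall = (np.hypot(x - 16, y - 16) >= 5).astype(float)
opacity = np.repeat(wall[:, :, None], 32, axis=2)

side_depth = depth_along_ray(opacity, (16, 16, 16), (1, 0, 0))  # toward the wall
axis_depth = depth_along_ray(opacity, (16, 16, 16), (0, 0, 1))  # along the lumen
```

Repeating this for one ray per pixel of the virtual camera would give a depth image with a known scale; a ray that never meets the wall inside the volume is reported as infinite.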
Next, description of actions of the information processing apparatus 10 according to the present embodiment will be made with reference to
In step S10, the acquisition unit 11 acquires the real image T0 captured by the endoscope 31 disposed at the predetermined viewpoint position in the bronchus. In step S12, the acquisition unit 11 acquires the virtual image K0 which is generated based on the three-dimensional image of the subject and which represents, in a pseudo manner, the interior wall of the tubular structure as viewed from the virtual viewpoint P10 predetermined in the three-dimensional image.
In step S14, the transformation unit 12, the generation unit 13, and the estimation unit 14 perform the viewpoint difference estimation processing of estimating the viewpoint difference ΔL between the viewpoint P11 of the real image T0 acquired in step S10 and the virtual viewpoint P10. In step S16, the controller 16 performs control of displaying the estimated position Pt of the endoscope distal end 3B estimated based on the viewpoint difference ΔL estimated in step S14 on the display 24, and terminates the present information processing.
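Steps S10 to S16 can be summarized in the following sketch. It assumes, purely for illustration, that the viewpoint difference is a three-component translation from the virtual viewpoint P10 to the viewpoint P11 expressed in the coordinates of the three-dimensional image, so that the estimated position Pt is obtained by simple addition; the function names and the callables passed in are hypothetical.

```python
import numpy as np

def estimated_tip_position(virtual_viewpoint, delta_l):
    """Estimated position Pt, assuming delta L is a translation from P10 to P11."""
    return np.asarray(virtual_viewpoint, dtype=float) + np.asarray(delta_l, dtype=float)

def run_navigation_step(acquire_real, acquire_virtual, estimate, display):
    t0 = acquire_real()                         # step S10: real image T0
    k0, p10 = acquire_virtual()                 # step S12: virtual image K0 and viewpoint P10
    delta_l = estimate(t0, k0)                  # step S14: viewpoint difference
    pt = estimated_tip_position(p10, delta_l)
    display(pt)                                 # step S16: show estimated position Pt
    return pt

# Usage with stand-in callables (assumptions, not the disclosed units).
shown = []
pt = run_navigation_step(
    acquire_real=lambda: "T0",
    acquire_virtual=lambda: ("K0", [10.0, 20.0, 30.0]),
    estimate=lambda t0, k0: [1.0, -2.0, 0.5],
    display=shown.append,
)
```

If the viewpoint difference also carries an orientation component, the addition would need to be replaced by a pose composition.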
Next, description of the training processing of the transformation model M1 by the training unit 15 according to the present embodiment will be made with reference to
In step S30, the training unit 15 acquires the input image including any one of the real image T0, which is captured by the endoscope and which represents the interior wall of the tubular structure, or the virtual image K0, which is generated based on the three-dimensional image V0 of the subject and which represents, in a pseudo manner, the interior wall of the tubular structure as viewed from the virtual viewpoint P10 predetermined in the three-dimensional image V0. In step S32, the training unit 15 transforms the input image acquired in step S30 into the transformed image having the image style that is not included in the input image.
In step S34, the training unit 15 acquires the depth image (input depth image) of the input image acquired in step S30 and the depth image (transformed depth image) of the transformed image transformed in step S32, by using the depth image generation model M2. In step S36, the training unit 15 trains the transformation model M1 by using the loss function including the degree of similarity between the input depth image and the transformed depth image acquired in step S34. In a case in which step S36 is completed, the transformation model training processing is terminated.
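The loss used in step S36 can be sketched as a style term plus a depth consistency term. The present disclosure states only that the loss function includes the degree of similarity between the two depth images; the mean absolute difference used here, the weight value, and the function names are assumptions chosen for illustration.

```python
import numpy as np

def depth_similarity_loss(input_depth, transformed_depth):
    """Mean absolute difference; zero when the two depth images are identical."""
    return float(np.mean(np.abs(input_depth - transformed_depth)))

def transformation_loss(style_loss, input_depth, transformed_depth, weight=10.0):
    """Total loss: a style term (for example, adversarial) plus weighted depth consistency."""
    return style_loss + weight * depth_similarity_loss(input_depth, transformed_depth)

input_depth = np.ones((4, 4))               # depth image of the input image
same_structure = input_depth.copy()         # transformation preserved the geometry
changed_structure = input_depth + 0.5       # transformation altered the geometry

loss_preserved = transformation_loss(2.0, input_depth, same_structure)
loss_changed = transformation_loss(2.0, input_depth, changed_structure)
```

A transformation that alters the apparent geometry is penalized through the second term, which is the constraint that keeps the depth image unchanged before and after the transformation.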
As described above, one aspect of the present disclosure relates to the information processing apparatus 10 comprising: at least one processor, in which the processor uses at least one of the real image, which is captured by the endoscope inserted into the tubular structure of the subject and which represents the interior wall of the tubular structure, or the first depth image, which represents the distance for each pixel from the viewpoint of the real image to the interior wall of the tubular structure, and at least one of the virtual image, which is generated based on the three-dimensional image of the subject and which represents, in a pseudo manner, the interior wall of the tubular structure as viewed from the virtual viewpoint predetermined in the three-dimensional image, or the second depth image, which represents the distance for each pixel from the virtual viewpoint to the interior wall of the tubular structure, to perform the viewpoint difference estimation processing of estimating the viewpoint difference between the viewpoint of the real image and the virtual viewpoint, and uses at least one of the first depth image or the second depth image in the viewpoint difference estimation processing.
That is, with the information processing apparatus 10 according to one aspect of the present disclosure, the viewpoint difference ΔL between the viewpoint P11 of the real image T0 and the virtual viewpoint P10 is estimated by using at least one of the first depth image D1 or the second depth image D2. Therefore, the viewpoint difference ΔL between the viewpoint P11 of the real image T0 of the endoscope and the virtual viewpoint P10 that is virtually set can be accurately estimated.
In addition, one aspect of the present disclosure relates to the information processing apparatus 10 comprising: at least one processor, in which the processor acquires the input image including any one of the real image, which is captured by the endoscope inserted into the tubular structure of the subject and which represents the interior wall of the tubular structure, or the virtual image, which is generated based on the three-dimensional image of the subject and which represents, in a pseudo manner, the interior wall of the tubular structure as viewed from the virtual viewpoint predetermined in the three-dimensional image, uses the transformation model, which is trained to transform any one of the input real image or virtual image into the image style of the other of the input real image or virtual image, to transform the input image into the transformed image having the image style that is not included in the input image, acquires the input depth image, which represents the distance for each pixel from the viewpoint of the input image to the interior wall of the tubular structure, and the transformed depth image, which represents the distance for each pixel from the viewpoint of the transformed image to the interior wall of the tubular structure, and trains the transformation model by using the loss function including the degree of similarity between the input depth image and the transformed depth image.
That is, with the information processing apparatus 10 according to one aspect of the present disclosure, for example, it is possible to apply, to the transformation model M1 that transforms the real image T0 into the transformed image having the virtual image style, a constraint so that the depth image does not change before and after the transformation. Therefore, it is possible to prevent the bronchus structure and the like included in the image from changing before and after transformation and to obtain a highly accurate transformed image. As a result, since the highly accurate transformed image can be input to the machine learning model used in the endoscope navigation, such as the depth image generation model M2 and the viewpoint difference estimation model M3, which are trained by using the transformed image, it is possible to contribute to the effect of accurately estimating the viewpoint difference ΔL between the viewpoint P11 of the real image T0 of the endoscope and the virtual viewpoint P10 that is virtually set.
In addition, in each embodiment, for example, as hardware structures of processing units that execute various types of processing, such as the acquisition unit 11, the transformation unit 12, the generation unit 13, the estimation unit 14, the training unit 15, and the controller 16, various processors shown below can be used. As described above, in addition to the CPU that is a general-purpose processor that executes software (program) to function as various processing units, the various processors include a programmable logic device (PLD) that is a processor of which a circuit configuration can be changed after manufacture, such as a field programmable gate array (FPGA), and a dedicated electric circuit that is a processor having a circuit configuration that is designed for exclusive use in order to execute specific processing, such as an application specific integrated circuit (ASIC).
One processing unit may be configured by using one of the various processors or may be configured by using a combination of two or more processors of the same type or different types (for example, a combination of a plurality of FPGAs or a combination of a CPU and an FPGA). In addition, a plurality of the processing units may be configured by one processor.
A first example of the configuration in which the plurality of processing units are configured by using one processor is a form in which one processor is configured by using a combination of one or more CPUs and the software and this processor functions as the plurality of processing units, as represented by computers, such as a client and a server. A second example is a form that uses a processor in which the functions of the entire system including the plurality of processing units are realized by a single integrated circuit (IC) chip, as represented by a system on chip (SoC) or the like. In this way, as the hardware structure, the various processing units are configured by using one or more of the various processors described above.
Further, the hardware structure of these various processors is, more specifically, an electric circuit (circuitry) in which circuit elements such as semiconductor elements are combined.
In addition, in the embodiment described above, an aspect has been described in which the information processing program 27 of the information processing apparatus 10 is stored in the storage unit 22 in advance, but the present disclosure is not limited to this. The information processing program 27 may be provided in a form of being recorded in a recording medium, such as a compact disc read only memory (CD-ROM), a digital versatile disc read only memory (DVD-ROM), or a universal serial bus (USB) memory. In addition, a form may be adopted in which the information processing program 27 is downloaded from an external apparatus via the network. Further, the technique of the present disclosure extends, in addition to the program itself, to a storage medium that stores the program in a non-transitory manner.
In the technique of the present disclosure, the embodiments and the examples described above can be combined as appropriate. The above-described contents and the above-shown contents are detailed description for parts according to the technique of the present disclosure, and are merely examples of the technique of the present disclosure. For example, the above description related to the configuration, the function, the action, and the effect is the description related to the examples of the configuration, the function, the action, and the effect of the parts according to the technique of the present disclosure. Accordingly, it goes without saying that unnecessary parts may be deleted, new elements may be added, or replacements may be made with respect to the above-described contents and the above-shown contents within a range that does not deviate from the gist of the technique of the present disclosure.
Regarding the embodiment described above, the following supplementary notes are further disclosed.
An information processing apparatus comprising: at least one processor, in which the processor acquires an input image including any one of a real image, which is captured by an endoscope inserted into a tubular structure of a subject and which represents an interior wall of the tubular structure, or a virtual image, which is generated based on a three-dimensional image of the subject and which represents, in a pseudo manner, the interior wall of the tubular structure as viewed from a virtual viewpoint predetermined in the three-dimensional image, uses a transformation model, which is trained to transform any one of the input real image or virtual image into an image style of the other of the input real image or virtual image, to transform the input image into a transformed image having an image style that is not included in the input image, acquires an input depth image, which represents a distance for each pixel from a viewpoint of the input image to the interior wall of the tubular structure, and a transformed depth image, which represents a distance for each pixel from a viewpoint of the transformed image to the interior wall of the tubular structure, and trains the transformation model by using a loss function including a degree of similarity between the input depth image and the transformed depth image.
The information processing apparatus according to supplementary note 1, in which the transformation model is obtained by a generative adversarial network that has been trained by using training data including a real image for training, which is captured by the endoscope inserted into the tubular structure of the subject and which represents the interior wall of the tubular structure, and a virtual image for training, which is generated based on a three-dimensional image of a subject that is the same as or different from the subject of the real image for training and which represents, in a pseudo manner, the interior wall of the tubular structure as viewed from the virtual viewpoint predetermined in the three-dimensional image.
The information processing apparatus according to supplementary note 1 or 2, in which the input image includes the real image, the transformation model is a learning model trained to receive input of the real image, transform the input real image into a virtual image style, and output the transformed real image, and the processor transforms the real image into a transformed image having the virtual image style by using the transformation model.
The information processing apparatus according to supplementary note 1 or 2, in which the input image includes the virtual image, the transformation model is a learning model trained to receive input of the virtual image, transform the input virtual image into a real image style, and output the transformed virtual image, and the processor transforms the virtual image into a transformed image having the real image style by using the transformation model.
The information processing apparatus according to any one of supplementary notes 1 to 4, in which the processor uses a depth image generation model that has been trained in advance to receive input of an image, which represents the interior wall of the tubular structure, and output a depth image, which represents a distance for each pixel from the viewpoint of the input image to the interior wall of the tubular structure, to generate the input depth image based on the input image and the transformed depth image based on the transformed image.
The information processing apparatus according to supplementary note 5, in which the depth image generation model is a model that has been trained in advance through supervised learning using training data including a combination of a virtual image for training, which is generated based on the three-dimensional image of the subject and which represents, in a pseudo manner, the interior wall of the tubular structure as viewed from the virtual viewpoint predetermined in the three-dimensional image, and a depth image for training, which is generated based on distance information from the virtual viewpoint in the three-dimensional image to the interior wall of the tubular structure and which represents a distance for each pixel from the virtual viewpoint to the interior wall of the tubular structure.
The information processing apparatus according to supplementary note 5, in which the depth image generation model is a model that has been trained in advance through supervised learning using training data including a combination of an image obtained by transforming a virtual image for training, which is generated based on the three-dimensional image of the subject and which represents, in a pseudo manner, the interior wall of the tubular structure as viewed from the virtual viewpoint predetermined in the three-dimensional image, into a real image style by using a transformation model that has been trained in advance to receive input of the virtual image, transform the input virtual image into the real image style, and output the transformed virtual image, and a depth image for training, which is generated based on distance information from the virtual viewpoint in the three-dimensional image to the interior wall of the tubular structure and which represents a distance for each pixel from the virtual viewpoint to the interior wall of the tubular structure.
The information processing apparatus according to supplementary note 5, in which the depth image generation model is a model that has been trained through supervised learning using training data including a combination of a real image for training, which is captured by the endoscope inserted into the tubular structure of the subject and which represents the interior wall of the tubular structure, and a depth image for training, which is generated based on distance information from a viewpoint corresponding to a viewpoint at which the real image for training is captured, in the three-dimensional image of the subject, to the interior wall of the tubular structure and which represents a distance for each pixel from the viewpoint corresponding to the viewpoint at which the real image for training is captured to the interior wall of the tubular structure.
The information processing apparatus according to supplementary note 5, in which the depth image generation model is a model that has been trained in advance through supervised learning using training data including a combination of a real image for training, which is captured by the endoscope inserted into the tubular structure of the subject and which represents the interior wall of the tubular structure, and a depth image for training, which is generated based on an actual measurement value of a distance from a viewpoint at which the real image for training is captured to the interior wall of the tubular structure, the actual measurement value being obtained by a distance-measuring sensor mounted on the endoscope, and which represents a distance for each pixel from the viewpoint at which the real image for training is captured to the interior wall of the tubular structure.
An information processing method including: acquiring an input image including any one of a real image, which is captured by an endoscope inserted into a tubular structure of a subject and which represents an interior wall of the tubular structure, or a virtual image, which is generated based on a three-dimensional image of the subject and which represents, in a pseudo manner, the interior wall of the tubular structure as viewed from a virtual viewpoint predetermined in the three-dimensional image; using a transformation model, which is trained to transform any one of the input real image or virtual image into an image style of the other of the input real image or virtual image, to transform the input image into a transformed image having an image style that is not included in the input image; acquiring an input depth image, which represents a distance for each pixel from a viewpoint of the input image to the interior wall of the tubular structure, and a transformed depth image, which represents a distance for each pixel from a viewpoint of the transformed image to the interior wall of the tubular structure; and training the transformation model by using a loss function including a degree of similarity between the input depth image and the transformed depth image.
An information processing program for causing a computer to execute a process including: acquiring an input image including any one of a real image, which is captured by an endoscope inserted into a tubular structure of a subject and which represents an interior wall of the tubular structure, or a virtual image, which is generated based on a three-dimensional image of the subject and which represents, in a pseudo manner, the interior wall of the tubular structure as viewed from a virtual viewpoint predetermined in the three-dimensional image; using a transformation model, which is trained to transform any one of the input real image or virtual image into an image style of the other of the input real image or virtual image, to transform the input image into a transformed image having an image style that is not included in the input image; acquiring an input depth image, which represents a distance for each pixel from a viewpoint of the input image to the interior wall of the tubular structure, and a transformed depth image, which represents a distance for each pixel from a viewpoint of the transformed image to the interior wall of the tubular structure; and training the transformation model by using a loss function including a degree of similarity between the input depth image and the transformed depth image.
Number | Date | Country | Kind |
---|---|---|---|
2023-052392 | Mar 2023 | JP | national |