The present disclosure generally relates to estimating dimensions (e.g., depth) in a video or image of a surgical site based on images captured by an endoscope. The endoscope may be used in a surgical robotic system having one or more modular arm carts each of which supports a robotic arm, and a surgical console for controlling the carts and their respective arms. The endoscope may be held by one of the robotic arms allowing for viewing of the surgical site.
Surgical robotic systems are currently being used in minimally invasive medical procedures. Some surgical robotic systems include a surgical console controlling a surgical robotic arm and a surgical instrument having an end effector (e.g., forceps or grasping instrument) coupled to and actuated by the robotic arm. In operation, the robotic arm is moved to a position over a patient and then guides the surgical instrument into a small incision via a surgical port or a natural orifice of a patient to position the end effector at a work site within the patient's body.
As minimally invasive surgery and surgical robotics advance, there is a strong desire to incorporate artificial intelligence and analytics into the surgical procedure to decrease risk and improve patient outcomes. This includes using artificial intelligence to analyze video data. Given video data from a stereoscopic endoscope, there are a number of existing techniques to generate a depth map or a 3D point cloud. However, there are benefits and drawbacks to each of the conventional techniques. Thus, there is a need to produce accurate depth maps.
In many procedures including robot-assisted surgery, there is a strong demand to develop artificial intelligence capabilities that will assist the surgeon and improve patient outcomes. The terms “artificial intelligence,” “data models,” or “machine learning” may include, but are not limited to, neural networks, convolutional neural networks (CNN), recurrent neural networks (RNN), generative adversarial networks (GAN), Transformers, Bayesian Regression, Naive Bayes, nearest neighbors, least squares, means, and support vector regression, among other data science and artificial science techniques.
The present disclosure provides a method for generating a depth map or a 3D point cloud using recent analytical and deep learning-based algorithms operating on a stereoscopic endoscope video stream. Deep learning-based algorithms are capable of providing dense depth estimates (i.e., a depth value is associated with every pixel) and approximating depth even for feature-less regions of the image through context. However, these algorithms are more susceptible to errors due to inferring depth only from context as well as to training/test mismatch when the input is sufficiently different from the training data. The accuracy of some algorithms may also be harder to verify due to the black-box nature of many neural networks.
In various embodiments, the neural network may include a temporal convolutional network, with one or more fully connected layers, or a feed forward network. In various embodiments, training of the neural network may happen on a separate system, e.g., graphics processing unit (“GPU”) workstations, high-performance computing clusters, etc., and the trained algorithm would then be deployed on the video processing device.
Analytical, i.e., classical, reconstruction algorithms operate on optical principles, making the algorithms analyzable and trustworthy and not susceptible to training/test mismatch. Suitable analytical reconstruction algorithms include dense stereo reconstruction techniques and dense matching between two stereoscopic camera views to reconstruct the 3D scene. However, such algorithms may sometimes match features incorrectly due to their inherently local nature. Specifically, most of these classical algorithms only integrate information from a limited local neighborhood and do not take the larger context into account. Some algorithms also fail to provide dense estimates and only provide estimates at certain pixels that correspond to key points and features (i.e., sparse estimates). These algorithms also struggle in smooth feature-less regions of the image. Some analytical reconstruction algorithms produce dense results by matching two stereoscopic camera views to reconstruct the 3D scene.
The present disclosure combines several depth-mapping techniques to produce a depth map that is more reliable than one produced from any single algorithm alone. The algorithm according to the present disclosure uses a calibrated stereoscopic endoscope. Initially, the left and right images are rectified. Next, an algorithm estimates the disparity (left-to-right difference at each pixel location). Two of the above-described algorithms were used: (1) an analytical approach that frames disparity as a global variational optimization, and (2) a machine learning-based approach that uses a convolutional neural network and loss function designed for 3D reconstruction. The disparity in pixels is then converted to a depth in meters via triangulation and is then converted to an equivalent 3D point cloud of the surgical scene. The resulting depth map may be used in a plurality of applications that rely on a detailed 3D representation of the surgical scene. A non-exhaustive, representative enumeration of these applications includes: placing “virtual walls” around critical structures; automated instrument control, such as suturing; registering a pre-operative 3D model with intra-operative endoscopic video; fusing a separate imaging modality, for example, ultrasound, with endoscope video; creating, updating, and tracking a non-rigid SLAM model of the surgical scene to aid in situational awareness during surgery. A virtual wall acts as a movement limit beyond which the robotic arm and the instrument attached thereto are not permitted to move. Thus, any inputs that would result in movement of the robotic arm and/or the instrument beyond the virtual wall are ignored.
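The conversion from disparity to depth and then to a 3D point cloud may be illustrated by the following minimal sketch. It assumes a rectified stereo pair, a pinhole camera model, and the standard triangulation relation Z = f·B/d. The function names, the focal length `focal_px`, the baseline `baseline_m`, and the principal point (`cx`, `cy`) are hypothetical and are not taken from the present disclosure.

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline_m, min_disp=1e-6):
    """Triangulate depth in meters from disparity in pixels: Z = f * B / d."""
    disparity_px = np.asarray(disparity_px, dtype=np.float64)
    depth = np.full(disparity_px.shape, np.nan)
    valid = disparity_px > min_disp  # zero or negative disparity has no finite depth
    depth[valid] = focal_px * baseline_m / disparity_px[valid]
    return depth

def depth_to_point_cloud(depth_m, focal_px, cx, cy):
    """Back-project every pixel (u, v) with depth Z to a 3D point (X, Y, Z)."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth_m / focal_px
    y = (v - cy) * depth_m / focal_px
    # One row per pixel, in row-major order; invalid pixels carry NaN coordinates.
    return np.stack([x, y, depth_m], axis=-1).reshape(-1, 3)
```

Pixels without a valid disparity are kept as NaN rather than dropped, so that the point cloud stays aligned with the pixel grid.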
Classical analytical reconstruction algorithms may be prone to inaccuracy and artifacts at locations with high depth discontinuities, yet they may provide reasonable bounds on plausible depth estimates when larger spatial windows or coarser process scales (i.e., lower resolutions) are considered.
When tested on novel “unfamiliar” scenes, which may be visually dissimilar to the training dataset, machine learning algorithms may fail to generalize well, resulting in disparity maps with multi-modal error distributions. In this case, one mode may correspond to typical or expected minor deviations from ground truth due to a combination of factors (e.g., uncompensated lens distortion, resolution limitations, feature ambiguity). Additional modes in the error distribution may correspond to gross disparity failures stemming from more consequential machine learning shortcomings (e.g., limitations in datasets, overfitting, etc.).
In scenes with favorable conditions (e.g., distinct textures, strong features, etc.) the deep learning algorithms may produce more accurate results on the whole. A heuristic fusion scheme could be crafted to exploit these complementary tendencies: accurate but less robust machine learning models paired with more robust but less accurate output from classical methods. In cases where there is agreement between classical and deep learning methods, preference may be given to deep learning disparity values. In cases where they disagree, the more robust disparity from the classical algorithm may be preferred. At strong depth discontinuities, where disagreement is expected, the deep learning values may be opportunistically selected if they are deemed plausible based on bounds established from the classical algorithm.
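One possible form of such a heuristic fusion is sketched below. The function name, the agreement tolerance `agree_tol`, the discontinuity mask, and the per-pixel bounds `lower` and `upper` are hypothetical; the disclosure does not specify how agreement is measured or how the bounds are derived from the classical algorithm.

```python
import numpy as np

def fuse_disparities(dl, classical, discontinuity, lower, upper, agree_tol=1.0):
    """Per-pixel fusion of deep learning (dl) and classical disparity maps.

    - Where the two maps agree within agree_tol pixels, keep the dl value.
    - At flagged depth discontinuities, keep the dl value if it lies inside
      the [lower, upper] bounds derived from the classical algorithm.
    - Everywhere else, fall back to the classical value.
    """
    dl = np.asarray(dl, dtype=np.float64)
    classical = np.asarray(classical, dtype=np.float64)
    agree = np.abs(dl - classical) <= agree_tol
    plausible = (dl >= lower) & (dl <= upper)
    use_dl = agree | (np.asarray(discontinuity, dtype=bool) & plausible)
    return np.where(use_dl, dl, classical)
```

The tolerance and the bounds are tuning choices that would need to be set empirically for a given endoscope and scene type.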
The present disclosure may also employ a fusion scheme providing additional smoothness constraint to avoid discontinuities resulting from the final hybrid disparity map. Furthermore, the fusion scheme may also be achieved through a machine learning process, where multiple complementary depth maps are considered as inputs to a deep learning model. These complementary depth maps may also be created by training multiple models on distinct complementary datasets.
According to one embodiment of the present disclosure, a surgical robotic system is disclosed. The surgical robotic system includes an endoscopic camera configured to output a stereoscopic video stream. The system also includes a video processing unit coupled to the endoscopic camera, the video processing unit configured to process the stereoscopic video stream using a first algorithm to obtain a first depth map. The video processing unit is also configured to process the stereoscopic video stream using a second algorithm to obtain a second depth map. The video processing unit is further configured to compare the first depth map to the second depth map to determine accuracy of the first depth map.
Implementations of the above embodiment may include one or more of the following features. According to one aspect of the above embodiment, the first algorithm may be a deep learning image processing algorithm. The second algorithm may be an analytical reconstruction algorithm. The deep learning image processing algorithm may be adjusted based on the second depth map.
According to another embodiment of the present disclosure, a method for processing video data of a surgical scene is disclosed. The method includes outputting a stereoscopic video stream from an endoscopic camera to a video processing unit; processing the stereoscopic video stream using a first algorithm to obtain a first depth map; processing the stereoscopic video stream using a second algorithm to obtain a second depth map; and comparing the first depth map to the second depth map to determine accuracy of the first depth map.
According to one aspect of the above embodiment, the method further includes generating a virtual wall based on the first depth map; and limiting movement of a robotic arm based on the virtual wall. The first algorithm may be a deep learning image processing algorithm. The second algorithm may be an analytical reconstruction algorithm. The method may further include adjusting the deep learning image processing algorithm based on the second depth map.
Processing the stereoscopic video stream using the second algorithm may further include receiving sensor feedback from at least one torque sensor corresponding to physical contact by a robotic instrument.
Various embodiments of the present disclosure are described herein with reference to the drawings wherein:
Embodiments of the presently disclosed surgical robotic system are described in detail with reference to the drawings, in which like reference numerals designate identical or corresponding elements in each of the several views. As used herein the term “distal” refers to the portion of the surgical robotic system and/or the surgical instrument coupled thereto that is closer to the patient, while the term “proximal” refers to the portion that is farther from the patient.
The term “application” may include a computer program designed to perform functions, tasks, or activities for the benefit of a user. Application may refer to, for example, software running locally or remotely, as a standalone program or in a web browser, or other software which would be understood by one skilled in the art to be an application. An application may run on a controller, or on a user device, including, for example, a mobile device, a personal computer, or a server system.
As will be described in detail below, the present disclosure is directed to a surgical robotic system, which includes a surgical console, a control tower, and one or more movable carts having a surgical robotic arm coupled to a setup arm. The surgical console receives user input through one or more interface devices, which are interpreted by the control tower as movement commands for moving the surgical robotic arm. The surgical robotic arm includes a controller, which is configured to process the movement command and to generate a torque command for activating one or more actuators of the robotic arm, which would, in turn, move the robotic arm in response to the movement command.
With reference to
The surgical instrument 50 is configured for use during minimally invasive surgical procedures. In embodiments, the surgical instrument 50 may be configured for open surgical procedures. In embodiments, the surgical instrument 50 may be an endoscope, such as an endoscopic camera 51, configured to provide a video feed for the user. In further embodiments, the surgical instrument 50 may be an electrosurgical forceps configured to seal tissue by compressing tissue between jaw members and applying electrosurgical current thereto. In yet further embodiments, the surgical instrument 50 may be a surgical stapler including a pair of jaws configured to grasp and clamp tissue while deploying a plurality of tissue fasteners, e.g., staples, and cutting stapled tissue.
One of the robotic arms 40 may include the endoscopic camera 51 configured to capture video of the surgical site. The endoscopic camera 51 may be a stereoscopic endoscope configured to capture two side-by-side (i.e., left and right) images of the surgical site to produce a video stream of the surgical scene. The endoscopic camera 51 is coupled to a video processing device 56, which may be disposed within the control tower 20. The video processing device 56 may be any computing device as described below configured to receive the video feed from the endoscopic camera 51, perform the image processing based on the depth estimating algorithms of the present disclosure, and output the processed video stream.
The surgical console 30 includes a first display 32, which displays a video feed of the surgical site provided by camera 51 of the surgical instrument 50 disposed on the robotic arms 40, and a second display 34, which displays a user interface for controlling the surgical robotic system 10. The first and second displays 32 and 34 are touchscreens allowing for displaying various graphical user inputs.
The surgical console 30 also includes a plurality of user interface devices, such as foot pedals 36 and a pair of handle controllers 38a and 38b, which are used by a user to remotely control the robotic arms 40. The surgical console 30 further includes an armrest 33 used to support the clinician's arms while operating the handle controllers 38a and 38b.
The control tower 20 includes a display 23, which may be a touchscreen, and outputs the graphical user interfaces (GUIs). The control tower 20 also acts as an interface between the surgical console 30 and one or more robotic arms 40. In particular, the control tower 20 is configured to control the robotic arms 40, such as to move the robotic arms 40 and the corresponding surgical instrument 50, based on a set of programmable instructions and/or input commands from the surgical console 30, in such a way that the robotic arms 40 and the surgical instrument 50 execute a desired movement sequence in response to input from the foot pedals 36 and the handle controllers 38a and 38b.
Each of the control tower 20, the surgical console 30, and the robotic arm 40 includes a respective computer 21, 31, 41. The computers 21, 31, 41 are interconnected to each other using any suitable communication network based on wired or wireless communication protocols. The term “network,” whether plural or singular, as used herein, denotes a data network, including, but not limited to, the Internet, Intranet, a wide area network, or a local area network, and without limitation as to the full scope of the definition of communication networks as encompassed by the present disclosure. Suitable protocols include, but are not limited to, transmission control protocol/internet protocol (TCP/IP), user datagram protocol/internet protocol (UDP/IP), and/or datagram congestion control protocol (DCCP). Wireless communication may be achieved via one or more wireless configurations, e.g., radio frequency, optical, Wi-Fi, Bluetooth (an open wireless protocol for exchanging data over short distances, using short length radio waves, from fixed and mobile devices, creating personal area networks (PANs)), ZigBee® (a specification for a suite of high level communication protocols using small, low-power digital radios based on the IEEE 802.15.4-2003 standard for wireless personal area networks (WPANs)).
The computers 21, 31, 41 may include any suitable processor (not shown) operably connected to a memory (not shown), which may include one or more of volatile, non-volatile, magnetic, optical, or electrical media, such as read-only memory (ROM), random access memory (RAM), electrically-erasable programmable ROM (EEPROM), non-volatile RAM (NVRAM), or flash memory. The processor may be any suitable processor (e.g., control circuit) adapted to perform the operations, calculations, and/or set of instructions described in the present disclosure including, but not limited to, a hardware processor, a field programmable gate array (FPGA), a digital signal processor (DSP), a central processing unit (CPU), a microprocessor, and combinations thereof. Those skilled in the art will appreciate that the processor may be substituted for by using any logic processor (e.g., control circuit) adapted to execute algorithms, calculations, and/or set of instructions described herein.
With reference to
The setup arm 62 includes a first link 62a, a second link 62b, and a third link 62c, which provide for lateral maneuverability of the robotic arm 40. The links 62a, 62b, 62c are interconnected at joints 63a and 63b, each of which may include an actuator (not shown) for rotating the links 62a and 62b relative to each other and the link 62c. In particular, the links 62a, 62b, 62c are movable in their corresponding lateral planes that are parallel to each other, thereby allowing for extension of the robotic arm 40 relative to the patient (e.g., surgical table). In embodiments, the robotic arm 40 may be coupled to the surgical table (not shown). The setup arm 62 includes controls 65 for adjusting movement of the links 62a, 62b, 62c as well as the lift 61.
The third link 62c includes a rotatable base 64 having two degrees of freedom. In particular, the rotatable base 64 includes a first actuator 64a and a second actuator 64b. The first actuator 64a is rotatable about a first stationary arm axis which is perpendicular to a plane defined by the third link 62c and the second actuator 64b is rotatable about a second stationary arm axis which is transverse to the first stationary arm axis. The first and second actuators 64a and 64b allow for full three-dimensional orientation of the robotic arm 40.
The actuator 48b of the joint 44b is coupled to the joint 44c via the belt 45a, and the joint 44c is in turn coupled to the joint 46c via the belt 45b. Joint 44c may include a transfer case coupling the belts 45a and 45b, such that the actuator 48b is configured to rotate each of the links 42b, 42c and the holder 46 relative to each other. More specifically, links 42b, 42c, and the holder 46 are passively coupled to the actuator 48b which enforces rotation about a pivot point “P” which lies at an intersection of the first axis defined by the link 42a and the second axis defined by the holder 46. Thus, the actuator 48b controls the angle θ between the first and second axes allowing for orientation of the surgical instrument 50. Due to the interlinking of the links 42a, 42b, 42c, and the holder 46 via the belts 45a and 45b, the angles between the links 42a, 42b, 42c, and the holder 46 are also adjusted in order to achieve the desired angle θ. In embodiments, some or all of the joints 44a, 44b, 44c may include an actuator to obviate the need for mechanical linkages.
The joints 44a and 44b include an actuator 48a and 48b configured to drive the joints 44a, 44b, 44c relative to each other through a series of belts 45a and 45b or other mechanical linkages such as a drive rod, a cable, or a lever and the like. In particular, the actuator 48a is configured to rotate the robotic arm 40 about a longitudinal axis defined by the link 42a.
With reference to
The robotic arm 40 also includes a plurality of manual override buttons 53 (
With reference to
The computer 41 includes a plurality of controllers, namely, a main cart controller 41a, a setup arm controller 41b, a robotic arm controller 41c, and an instrument drive unit (IDU) controller 41d. The main cart controller 41a receives and processes joint commands from the controller 21a of the computer 21 and communicates them to the setup arm controller 41b, the robotic arm controller 41c, and the IDU controller 41d. The main cart controller 41a also manages instrument exchanges and the overall state of the movable cart 60, the robotic arm 40, and the IDU 52. The main cart controller 41a also communicates actual joint angles back to the controller 21a.
The setup arm controller 41b controls each of joints 63a and 63b, and the rotatable base 64 of the setup arm 62 and calculates desired motor movement commands (e.g., motor torque) for the pitch axis and controls the brakes. The robotic arm controller 41c controls each joint 44a and 44b of the robotic arm 40 and calculates desired motor torques required for gravity compensation, friction compensation, and closed loop position control of the robotic arm 40. The robotic arm controller 41c calculates a movement command based on the calculated torque. The calculated motor commands are then communicated to one or more of the actuators 48a and 48b in the robotic arm 40. The actual joint positions are then transmitted by the actuators 48a and 48b back to the robotic arm controller 41c.
The IDU controller 41d receives desired joint angles for the surgical instrument 50, such as wrist and jaw angles, and computes desired currents for the motors in the IDU 52. The IDU controller 41d calculates actual angles based on the motor positions and transmits the actual angles back to the main cart controller 41a.
The robotic arm 40 is controlled in response to a pose of the handle controller controlling the robotic arm 40, e.g., the handle controller 38a, which is transformed into a desired pose of the robotic arm 40 through a hand-eye transform function executed by the controller 21a. The hand-eye function, as well as other functions described herein, is/are embodied in software executable by the controller 21a or any other suitable controller described herein. The pose of the handle controller 38a may be embodied as a coordinate position and roll-pitch-yaw (“RPY”) orientation relative to a coordinate reference frame, which is fixed to the surgical console 30. The desired pose of the instrument 50 is relative to a fixed frame on the robotic arm 40. The pose of the handle controller 38a is then scaled by a scaling function executed by the controller 21a. In embodiments, the coordinate position is scaled down and the orientation is scaled up by the scaling function. In addition, the controller 21a also executes a clutching function, which disengages the handle controller 38a from the robotic arm 40. In particular, the controller 21a stops transmitting movement commands from the handle controller 38a to the robotic arm 40 if certain movement limits or other thresholds are exceeded and in essence acts like a virtual clutch mechanism, e.g., limits mechanical input from affecting mechanical output.
The desired pose of the robotic arm 40 is based on the pose of the handle controller 38a and is then processed by an inverse kinematics function executed by the controller 21a. The inverse kinematics function calculates angles for the joints 44a, 44b, 44c of the robotic arm 40 that achieve the scaled and adjusted pose input by the handle controller 38a. The calculated angles are then passed to the robotic arm controller 41c, which includes a joint axis controller having a proportional-derivative (PD) controller, a friction estimator module, a gravity compensator module, and a two-sided saturation block, which is configured to limit the commanded torque of the motors of the joints 44a, 44b, 44c.
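A single-joint sketch of a PD controller followed by a two-sided saturation block is given below for illustration only. The gains `kp` and `kd`, the compensation terms, and the torque limit `tau_max` are hypothetical parameters, and the way the friction and gravity terms are summed with the PD output is an assumption rather than a description of the joint axis controller of the disclosure.

```python
def commanded_torque(q_des, q, q_dot, kp, kd,
                     gravity_tau=0.0, friction_tau=0.0, tau_max=1.0):
    """Return a saturated torque command for one joint.

    q_des, q : desired and measured joint angle (rad)
    q_dot    : measured joint velocity (rad/s); desired velocity is taken as 0
    """
    # PD term plus feed-forward compensation (assumed additive).
    tau = kp * (q_des - q) - kd * q_dot + gravity_tau + friction_tau
    # Two-sided saturation: clamp to [-tau_max, +tau_max].
    return max(-tau_max, min(tau_max, tau))
```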
The video processing device 56 is configured to process the video feed from the endoscopic camera 51 and to output a processed video stream on the first display 32 of the surgical console 30 and/or the display 23 of the control tower 20. According to one embodiment, the video processing device 56 is configured to execute two image processing algorithms, namely an analytical reconstruction algorithm and a deep learning algorithm. In particular, the video processing device 56 uses an analytical reconstruction algorithm as a cross-check/validation of the deep learning algorithm. Both algorithms would be running in real time, processing the same endoscope images. The deep learning algorithm would produce a dense depth map as shown in
According to another embodiment of the present disclosure, rather than validating the output of the deep learning algorithm, the video processing device 56 utilizes the data from the analytical reconstruction algorithm to correct the deep learning algorithm in real time. If the deep learning and analytical reconstruction algorithms produce disagreeing depth estimates for certain key points, the dense deep learning algorithm output may be locally scaled, averaged, or spatially warped by adjusting its parameters to better match the analytical reconstruction algorithm, which may be more reliable for those key points. It may also be possible to incorporate “correction inputs” into the deep learning network itself to accommodate some of these corrections.
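The scaling variant of this correction may be sketched as follows, under the assumption that a single multiplicative factor is estimated for a region from the key points it contains. The function name and the use of the median ratio are illustrative choices; averaging and spatial warping, which are also mentioned above, are not shown.

```python
import numpy as np

def scale_to_keypoints(dl_depth, keypoints_rc, analytic_depths):
    """Scale a dense deep learning depth map to match sparse analytical depths.

    dl_depth        : (H, W) dense depth map in meters
    keypoints_rc    : (K, 2) integer array of (row, col) key point locations
    analytic_depths : (K,) analytical depth at each key point, in meters
    """
    keypoints_rc = np.asarray(keypoints_rc, dtype=int)
    rows, cols = keypoints_rc[:, 0], keypoints_rc[:, 1]
    ratios = np.asarray(analytic_depths, dtype=np.float64) / dl_depth[rows, cols]
    # The median ratio is robust to a few badly matched key points.
    scale = float(np.median(ratios))
    return dl_depth * scale, scale
```

Applying the function to each image tile separately would give a locally varying scale.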
In further embodiments, other algorithms may be used to check depth map plausibility, to rule out strange or unexpected depth maps. A neural network could be trained for this purpose. Other simpler algorithms may also be used to detect sudden unexpected depth jumps in tissue-like regions that are expected to be smooth. Such algorithms could identify regions of anomalous depth maps to assess reliability.
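A simple detector of this kind may be sketched as a threshold on the depth difference between neighboring pixels. The function name and the 5 mm default for `max_jump_m` are hypothetical; restricting the check to tissue-like regions would require a separate segmentation step that is not shown.

```python
import numpy as np

def depth_jump_mask(depth_m, max_jump_m=0.005):
    """Flag pixels on either side of a neighbor-to-neighbor depth jump."""
    depth_m = np.asarray(depth_m, dtype=np.float64)
    mask = np.zeros(depth_m.shape, dtype=bool)
    jump_x = np.abs(np.diff(depth_m, axis=1)) > max_jump_m  # horizontal neighbors
    jump_y = np.abs(np.diff(depth_m, axis=0)) > max_jump_m  # vertical neighbors
    mask[:, :-1] |= jump_x
    mask[:, 1:] |= jump_x
    mask[:-1, :] |= jump_y
    mask[1:, :] |= jump_y
    return mask
```

The fraction of flagged pixels inside a region expected to be smooth could then serve as a crude reliability score for that region.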
According to yet another embodiment, the video processing device 56 may receive physical parameter data from the instrument 50 and the robotic arm 40 holding the instrument 50. In particular, robotic “touch”, e.g., recorded as environmental torque by torque sensors of the robotic arm 40, may be used to refine or validate the depth map. The robotic arm 40 is calibrated to a known hand-eye matrix (i.e., the relationship between the 3D position of the robotic arm 40 and where the instrument 50 held by the robotic arm 40 appears on the screen is known). Thus, when the instrument 50 is touching or grasping tissue or another object in the surgical scene, this contact is inferred via force or torque sensors. Touch may also be determined visually based on deformation of the tissue. Touch implies that the depth of the instrument tip is approximately equal to the depth of the surgical scene, allowing the position of the instrument 50, which is known from the robotic arm 40 torque sensors, to be used as a proxy for depth in that location. These position estimates may be used as a cross-check or refinement for the optically-estimated depth.
The generated depth map may be combined with other 3D data such as various imaging scans (e.g., CAT scans, MRI, ultrasound, etc.). Such 3D data may be overlaid on the depth map and may be used to identify critical structures. The depth map may then be used by the computer 21 to generate virtual walls around critical structures, which would prevent movement of the instrument 50 beyond the virtual walls, thus limiting operating space of the robotic arms 40. In addition, the depth map may be used to adjust the color in the base color based on the change in angle from the depth map.
Depth mapping may also be used for estimation of axial distortion (e.g., image elongation/shrink in the depth direction). In conventional endoscopy, the aspect ratio of the objects being observed is unknown in the axial and transverse planes. Thus, depth mapping may be used to correct images in post-processing with respect to aspect ratios and other imaging distortions.
With reference to
In cases where there is agreement between classical and deep learning methods, preference is given to deep learning disparity values. In cases where the algorithms disagree, the more robust disparity from the classical algorithm is preferred. At strong depth discontinuities, where disagreement is expected, the deep learning values are selected if they are deemed plausible based on bounds established from the classical algorithm.
A fusion scheme method of
In further embodiments, multiple different networks may be used, each optimized or trained to do well at estimating depth for certain content, e.g., some networks may be good at depth on tools, others at depth for background tissue, and others at depth in the presence of smoke. Heuristics (or even other neural networks) could be used to select which networks to use and how to “weigh” the outputs in generating a depth map.
The method of
Confidence may also be used in blending of the surgical video from the endoscopic camera 51 and the depth map. The resolution of the depth map may degrade as the distance of the objects to the endoscopic camera 51 increases. Therefore, the blending of the video and the depth map may be adjusted by weighting the depth map more heavily at closer distances and less heavily at farther distances.
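Distance-dependent blending of this kind may be sketched as a per-pixel alpha blend in which the depth map weight falls off linearly with distance. The linear falloff, the `near_m` and `far_m` limits, and `max_weight` are assumptions chosen for illustration; the disclosure does not prescribe a particular weighting function.

```python
import numpy as np

def blend_video_with_depth(video_rgb, depth_rgb, depth_m,
                           near_m=0.02, far_m=0.15, max_weight=0.6):
    """Blend a color-coded depth image into the video, favoring near pixels.

    video_rgb, depth_rgb : (H, W, 3) float images in [0, 1]
    depth_m              : (H, W) depth in meters (NaN where unknown)
    """
    t = np.clip((depth_m - near_m) / (far_m - near_m), 0.0, 1.0)
    weight = max_weight * (1.0 - t)          # max_weight at near_m, 0 at far_m
    weight = np.where(np.isfinite(weight), weight, 0.0)  # unknown depth: video only
    weight = weight[..., None]               # broadcast over the color channels
    return (1.0 - weight) * video_rgb + weight * depth_rgb
```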
With reference to
In the areas of low local visual information, the disparity may be more reliably measured by the first image processing algorithm, i.e., the deep learning-based algorithm, which can integrate larger context. These areas of low visual information may be detected by local measures that compute the amount of high frequency content using a sliding window method. Examples of phenomena that generate areas of low visual information include shadows, specular reflection, smoke, etc.
During a typical surgical procedure, there are times when the surgical site exhibits long periods of smoke and blood in the images captured through the stereo endoscope. When events such as excessive blood or smoke are detected in a surgical scene, the first algorithm, i.e., a deep learning algorithm that has been trained on images exhibiting the presence of blood and smoke, may be used. These conditions may be detected by separate image processing means that trigger on the presence of blood and smoke in the scene. These image processing triggers may be further verified by inputs from the control software based on activations of cutting and energy tools.
In one of the embodiments, at step 200, a pixel-wise mask for the areas of low local visual information is generated. A further morphological dilation step 202 may be employed to extend the mask representing these areas of low local visual information. The first algorithm is used to generate the depth estimation for these low local visual information regions at step 204, where the integration of global context is more important. The second algorithm is then used at step 206 to generate depth estimation for the remaining pixels in the image where high local visual information is observed.
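Steps 200 through 206 may be sketched as follows. The sketch uses windowed gradient energy as a stand-in for the measure of high frequency content, a square structuring element for the dilation, and hypothetical function names and thresholds; none of these specifics are taken from the disclosure. The window size is assumed to be odd so that the mask has the same shape as the image.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def low_information_mask(gray, window=5, threshold=1e-3):
    """Step 200: mark pixels whose local gradient energy is below threshold."""
    gray = np.asarray(gray, dtype=np.float64)
    gy, gx = np.gradient(gray)
    energy = gx ** 2 + gy ** 2
    pad = window // 2
    padded = np.pad(energy, pad, mode="edge")
    local = sliding_window_view(padded, (window, window)).mean(axis=(-2, -1))
    return local < threshold

def dilate(mask, radius=1):
    """Step 202: morphological dilation with a (2*radius+1)-square element."""
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    size = 2 * radius + 1
    return sliding_window_view(padded, (size, size)).any(axis=(-2, -1))

def fuse_by_mask(depth_first, depth_second, mask):
    """Steps 204-206: first algorithm inside the mask, second elsewhere."""
    return np.where(mask, depth_first, depth_second)
```

In practice a library routine for dilation would likely be used; the pure NumPy version is shown only to keep the sketch self-contained.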
In the embodiments where a classical algorithm is used as one of the two algorithms, a plurality of pixels labeled as “inconsistent pixels” are routinely observed in the output of the classical algorithm. A classical algorithm, working in the local pixel neighborhood, marks the pixels that cannot be reliably matched between the left and right stereo images as “inconsistent pixels.” An exemplary scenario where large patches of inconsistent pixels appear is the case of large low contrast areas in the surgical scene. Areas of the surgical scene that are larger than the disparity comparison block size of the classical algorithm and that have no high-contrast spots or unique edges often get marked as inconsistent pixels. A possible solution to this problem may use semi-global block matching (SGBM) algorithms. However, the classical methods from the SGBM family of algorithms increase the processing cost by a factor of 10× to 20× and still suffer from this problem in the case of large low contrast areas with gradual depth changes. In these embodiments, the absent depth values of inconsistent pixels, from the classical algorithm, may be filled in with the depth estimation values from the first algorithm, where the first algorithm is able to combine global context and to compute a correct depth estimate in the large areas of low contrast.
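Filling in the inconsistent pixels may be sketched as below, assuming that the classical algorithm marks them with NaN. That marker convention and the function name are assumptions; a real stereo matcher may use a different sentinel value for unmatched pixels.

```python
import numpy as np

def fill_inconsistent(classical_depth, dl_depth):
    """Replace inconsistent (non-finite) classical depths with deep learning depths.

    Returns the filled map and the boolean mask of pixels that were filled.
    """
    classical_depth = np.asarray(classical_depth, dtype=np.float64)
    inconsistent = ~np.isfinite(classical_depth)
    filled = np.where(inconsistent, dl_depth, classical_depth)
    return filled, inconsistent
```

Returning the mask alongside the filled map lets downstream steps treat the filled pixels with lower confidence if desired.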
The classical and machine learning algorithms of the present disclosure are configured to generate distinct disparity maps. The video processing device 56 may then combine these disparity maps to generate the final consistent disparity map and depth estimation image. The final disparity map may be represented as a point cloud showing the depth value for each pixel. A desirable application of this depth information is displaying 3D visual information on a 2D screen such as the second display 34. In further embodiments, a pre-operative imaging model registered with the intra-operative stereo endoscope image may also be displayed on the second display 34. Since a colorless and texture-less point cloud shown on the second display 34 fails to convey visually coherent information from the surgical scene in a way that is instantaneously useful to the surgeon intra-operatively, the final disparity map may be used to provide color/texture for each pixel in the point cloud generated from this method. The color/texture value for each pixel may be a combination of the color/texture values according to the first algorithm and the second algorithm.
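As a non-limiting sketch, a disparity map may be converted to a colored point cloud using the standard rectified-stereo relation Z = f·B/d, where f is the focal length in pixels, B is the stereo baseline, and d is the disparity. The parameter names (focal_px, baseline_mm, cx, cy) are assumptions, and the sketch simply takes the color of each point from a reference image, which is a simplification of the combination of color/texture values described above.

```python
def disparity_to_colored_points(disparity, color, focal_px, baseline_mm, cx, cy):
    """Back-project each pixel with a valid disparity to (x, y, z, rgb).

    Assumes a rectified pinhole stereo pair; units of x, y, z follow baseline_mm.
    """
    points = []
    for v, (disp_row, color_row) in enumerate(zip(disparity, color)):
        for u, (d, rgb) in enumerate(zip(disp_row, color_row)):
            if d <= 0:
                continue  # no valid match for this pixel
            z = focal_px * baseline_mm / d   # depth along the optical axis
            x = (u - cx) * z / focal_px      # lateral offset from the principal point
            y = (v - cy) * z / focal_px
            points.append((x, y, z, rgb))
    return points


if __name__ == "__main__":
    disparity = [[8, 0]]                     # second pixel has no valid disparity
    color = [[(255, 0, 0), (0, 255, 0)]]     # reference image colors
    cloud = disparity_to_colored_points(
        disparity, color, focal_px=100, baseline_mm=4, cx=0, cy=0
    )
    print(cloud)  # [(0.0, 0.0, 50.0, (255, 0, 0))]
```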
In some surgical procedures, an initial phase commonly known as “first look” may be employed, wherein the surgeon moves the endoscopic camera 51 around the surgical scene to get a better understanding of the patient-specific internal anatomy and plan a route to the organ for surgery. During this phase, the stereo endoscopic camera 51 is moved and panned around to get a better look at the surgical site. The main challenge for the stereo algorithms in reliably estimating depth during this phase is the excessive motion artifacts caused by motion blur.
In one embodiment of this disclosure, a method for depth estimation in blurry images due to motion artifacts is shown in
At step 300, a first depth estimate for a set of points between successive frames is calculated by the video processing device 56 using the best algorithm for depth estimation. The depth estimates from multiple key-points in successive images are combined to generate a first global change in depth estimate between successive image frames. At step 302, a second global change in depth estimate is computed from kinematics at a suitable sampling rate aligned with the video frame acquisition rate. At step 304, a difference between the first global change in depth estimate from the imaging modality and the second global change in depth estimate from the kinematics modality is computed. The difference is then compared to a threshold. If the difference is larger than the threshold, then at step 306 the video processing device 56 switches over to the depth estimation algorithm that is more suitable to scenes with excessive motion. The video processing device 56 thus uses the estimate of inter-frame endoscope movement, based on kinematics or other means, to ensure that the most suitable depth estimation algorithm is being used.
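The comparison of steps 300 through 306 may be sketched as follows. The sketch assumes that the per-key-point depth changes are combined with a median and that the kinematics-based change is supplied as a single number; the disclosure does not specify either choice, and the function names, units, and algorithm labels are hypothetical.

```python
from statistics import median


def global_depth_change(depths_prev, depths_curr):
    """Step 300: combine per-key-point depth changes into one global change."""
    return median(curr - prev for prev, curr in zip(depths_prev, depths_curr))


def select_algorithm(image_change, kinematic_change, threshold):
    """Steps 304/306: switch algorithms when the two modalities disagree."""
    difference = abs(image_change - kinematic_change)
    return "motion_robust" if difference > threshold else "default"


if __name__ == "__main__":
    prev_mm = [50, 52, 48]   # key-point depths in the previous frame
    curr_mm = [45, 47, 44]   # same key-points in the current frame
    image_change = global_depth_change(prev_mm, curr_mm)
    print(image_change)                                        # -5
    print(select_algorithm(image_change, -12, threshold=3))    # motion_robust
    print(select_algorithm(image_change, -5.5, threshold=3))   # default
```

A median is used here only because it tolerates a few badly matched key-points; a mean or a robust fit could be substituted.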
The present disclosure also provides for tissue specific modeling (liver, stomach, lung, etc.) based on temporal conditions and imposing explicit model-based constraints in domain specific depth mapping applications. Due to basic physiologic constraints, there is a high degree of similarity in the structure of the surgical field observed from multiple patients undergoing the same surgical procedure. In the surgical domain, bio-mechanical models, designed through analytical and data driven techniques, may be used to inform and improve depth maps created from both classical and machine learning modules. For example, stereo endoscope views of soft tissue scenes often contain extreme saturation, specular reflection, and numerous other confounding factors that prevent reliable depth estimates. Human organs are often homogeneous in texture and appearance, especially when viewed at the coarser scale associated with the wider field of view used for early-stage surgical planning. In combination, these factors can make accurate depth estimation challenging. Use of explicit bio-mechanical models allows accumulation of both local and global cues when creating a final depth map. A deformable bio-mechanical organ model may be used to address deficiencies in an initial depth map. Thus, if the initial depth map is sparse and/or noisy, the estimated intraoperative surface may contain enough unique structure to drive a deformable organ registration step. After alignment, the surface of the aligned organ may be used to refine and complete the initial depth map, using various strategies.
Machine learning models with a complex representational capacity, such as deep learning models, could potentially learn a similar implicit bio-mechanical constraint if provided with enough diverse and domain specific data. However, there are several advantages to using an explicit model. While it may be possible to learn an implicit model “inside” a CNN (i.e., in latent space), it will most likely be data limited and difficult to control or tune. In the case of real training data, it may be challenging to acquire enough samples. Synthetic images may be used to generate larger and more diverse datasets containing more variations in organ pose, but this path will likely introduce some degree of “unrealism” that may hinder generalization of the learned model itself. An explicit model may be designed and verified independently (e.g., through finite element methods) and combined with one or more independent depth estimation modules through a real time fusion strategy. Thus, any of the disclosed depth mapping methods may be based on organ tissue specific models.
To a large extent, the quality of depth maps resulting from stereo reconstruction modules is directly proportional to the quality of the input images. In addition to all the common attributes of typical high quality monocular images (e.g., focus, uniform illumination, etc.), stereo image pairs must contain unique visual structure (or features) in both left and right images, since it is this common structure that is exploited to determine the pixelwise correspondence that is encoded in the output disparity maps. If this common structure is not visible, then stereo reconstruction processing will suffer. In human tissue, different structures are revealed by different wavelengths of light. For example, visible wavelengths will primarily reveal surface structure, whereas near-infrared wavelengths may reveal slightly deeper structures.
Near infrared (NIR) techniques can therefore produce images that are complementary to standard visible wavelength imaging. In cases where the tissue surface is homogeneous in visible wavelength imaging, resulting in poor stereo reconstructions, near-infrared sensors may reveal sufficient common structure (e.g., subdermal micro-vasculature) to improve depth maps in these scenarios. This information can be combined in both late-stage fusion (i.e., wavelength specific depth maps) and early-stage fusion (i.e., multi-spectral input images) strategies. In some cases, depth maps resulting from NIR imaging may deviate slightly from the true surface, and fusion strategies can use prior physiologic knowledge to account for this. Thus, one of the image processing algorithms used in the methods of the present disclosure may also include NIR light and image sources, such that in addition to the visible spectrum depth maps, depth maps may also be computed from NIR images to supplement or verify depth maps generated using machine learning and/or classical algorithms from visible spectrum images.
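One possible late-stage fusion strategy may be sketched as a confidence-weighted average of the wavelength specific depth maps. The per-pixel confidence maps and the fixed correction nir_offset_mm, which stands in for the prior physiologic knowledge that NIR structure lies slightly below the true surface, are assumptions made for illustration and are not specified in the disclosure.

```python
def late_fuse(depth_vis, conf_vis, depth_nir, conf_nir, nir_offset_mm=0.0):
    """Fuse visible and NIR depth maps pixel by pixel.

    NIR depths are shifted toward the camera by nir_offset_mm before averaging.
    Pixels with zero total confidence are returned as None.
    """
    fused = []
    for dv_row, cv_row, dn_row, cn_row in zip(depth_vis, conf_vis, depth_nir, conf_nir):
        row = []
        for dv, cv, dn, cn in zip(dv_row, cv_row, dn_row, cn_row):
            total = cv + cn
            if total == 0:
                row.append(None)  # neither modality produced a usable estimate
            else:
                row.append((cv * dv + cn * (dn - nir_offset_mm)) / total)
        fused.append(row)
    return fused


if __name__ == "__main__":
    depth_vis = [[50.0, 50.0]]
    conf_vis = [[1.0, 0.0]]    # visible stereo failed on the second pixel
    depth_nir = [[51.0, 61.0]]
    conf_nir = [[1.0, 1.0]]
    print(late_fuse(depth_vis, conf_vis, depth_nir, conf_nir, nir_offset_mm=1.0))
```

Where the visible reconstruction has no confidence, the fused value falls back entirely on the offset-corrected NIR depth.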
In addition to NIR imaging, other lighting and illumination may be used to enhance depth mapping. Spot and/or gradient or colored lighting may be used to enhance contours of the tissue surfaces. If the illumination spot does not appear in the expected location, then a second algorithm may be used to confirm/verify the mismatch. Since the location of the light source is known, the position may then be used to triangulate distances to orthogonal tissue surfaces. The illumination spot causes specular reflection on a tissue surface that is closer to the endoscope and has a surface normal in the direction of the endoscope camera, which may be used in depth mapping.
The system 10 may use the endoscope instrument interface, i.e., communication between the controller 21a and the video processing device 56, to query the illumination level in real time. The illumination level determines the amount of specular reflection to be expected in the image. With reference to
It will be understood that various modifications may be made to the embodiments disclosed herein. In embodiments, the sensors may be disposed on any suitable portion of the robotic arm. Therefore, the above description should not be construed as limiting, but merely as exemplifications of various embodiments. Those skilled in the art will envision other modifications within the scope and spirit of the claims appended thereto.
The present application claims the benefit of and priority to U.S. Provisional Application No. 63/175,285, filed on Apr. 15, 2021. The entire disclosure of the foregoing application is incorporated by reference herein.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2022/024953 | 4/15/2022 | WO | |

Number | Date | Country |
---|---|---|
63175285 | Apr 2021 | US |