Examples described herein relate generally to machine learning models and neural networks, and more specifically to systems and methods for keypoint detection and tracking-by-prediction for multiple surgical instruments.
The subject matter discussed in the background section should not be assumed to be prior art merely as a result of its mention in the background section. Similarly, a problem mentioned in the background section or associated with the subject matter of the background section should not be assumed to have been previously recognized in the prior art. The subject matter in the background section merely represents different approaches, which in and of themselves may also be inventions.
Robust visual tracking for multiple agents empowers many autonomous applications. Despite this importance, attempts to build reliable trackers suffer from the high complexity of decoupling spatial-temporal correlations using a neural network. For some other applications, such as robotic-assisted surgery, tracking the pose of one or more instruments benefits from a high-fidelity model. Compared to a general Multiple Object Tracking (MOT) scenario, surgical instruments used during surgery usually interact with each other. The dynamic modeling of instrument interactions for surgery-specific actions such as dissection, retraction, and suturing directly affects the performance of tracking.
Based on modeling the social behavior of crowd pedestrians, researchers introduced the social long short-term memory (“LSTM”) network as a kernel that describes the dynamics of pedestrians' interactions, where the latent motions represented by the hidden states of the LSTMs are shared through a “social-pooling” mechanism. In the social-pooling design, individual pedestrians are not treated as isolated entities but are grouped together at the pooling step based on defined “neighborhood” relations. Soft attention is also utilized to establish the relative influence among the pedestrians. The attention model calculates a weight matrix that assigns unequal importance to the neighboring pedestrians, which increases the flexibility of the model to understand crowd behavior based on spatial interactions. However, social-LSTM models each pedestrian equally using an LSTM. It may not be applicable to a complex entity with an obvious hierarchical structure.
Various features may improve keypoint detection and tracking-by-prediction for multiple surgical instruments. The following presents a simplified summary of various examples described herein and is not intended to identify key or critical elements or to delineate the scope of the claims.
Consistent with some examples, a method for detecting a location of a plurality of keypoints of a surgical instrument is provided. The method includes receiving, at a first neural network model, a video input including a plurality of video frame images of a surgical procedure. The method further includes generating, using the first neural network model, a first output image including a first output location of each keypoint of the plurality of keypoints annotated on a first output image of the surgical instrument. The method further includes receiving, at a second neural network model, the first output image generated by the first neural network model. The method further includes receiving, at the second neural network model, historic keypoint trajectory data including a historic trajectory for each keypoint of the plurality of keypoints. The method further includes determining, using the second neural network model, a trajectory for each keypoint of the plurality of keypoints. The method further includes generating, using the second neural network model, a second output image including a second output location of each keypoint of the plurality of keypoints annotated on a second output image of the surgical instrument.
Consistent with other examples, a system for detecting a location of a plurality of keypoints of a surgical instrument is provided. The system includes a memory configured to store a first neural network model and a second neural network model. The system further includes a processor coupled to the memory. The processor is configured to receive, at the first neural network model, a video input including a plurality of video frame images of a surgical procedure. The processor is further configured to generate, using the first neural network model, a first output image including a first output location of each keypoint of the plurality of keypoints annotated on a first output image of the surgical instrument. The processor is further configured to receive, at the second neural network model, the first output image generated by the first neural network model. The processor is further configured to receive, at the second neural network model, historic keypoint trajectory data including a historic trajectory for each keypoint of the plurality of keypoints. The processor is further configured to determine, using the second neural network model, a trajectory for each keypoint of the plurality of keypoints. The processor is further configured to generate, using the second neural network model, a second output image including a second output location of each keypoint of the plurality of keypoints annotated on a second output image of the surgical instrument.
Consistent with other examples, a non-transitory machine-readable medium having stored thereon machine-readable instructions executable to cause a machine to perform operations that detect a location of a plurality of keypoints of a surgical instrument is provided. The operations include receiving, at a first neural network model, a video input including a plurality of video frame images of a surgical procedure. The operations further include generating, using the first neural network model, a first output image including a first output location of each keypoint of the plurality of keypoints annotated on a first output image of the surgical instrument. The operations further include receiving, at a second neural network model, the first output image generated by the first neural network model. The operations further include receiving, at the second neural network model, historic keypoint trajectory data including a historic trajectory for each keypoint of the plurality of keypoints. The operations further include determining, using the second neural network model, a trajectory for each keypoint of the plurality of keypoints. The operations further include generating, using the second neural network model, a second output image including a second output location of each keypoint of the plurality of keypoints annotated on a second output image of the surgical instrument.
Other examples include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of any one or more methods described below.
It is to be understood that both the foregoing general description and the following detailed description are illustrative and explanatory in nature and are intended to provide an understanding of the various examples described herein without limiting the scope of the various examples described herein. In that regard, additional aspects, features, and advantages of the various examples described herein will be apparent to one skilled in the art from the following detailed description.
Various examples described herein and their advantages are described in the detailed description that follows. It should be appreciated that like reference numerals are used to identify like elements illustrated in one or more of the figures for purposes of illustrating but not limiting the various examples described herein.
The theory of graph neural networks inspires advanced modeling using graph representations for unstructured data. The examples described herein illustrate a Spatial-Temporal Graph hierarchy, where the spatial and temporal interactions among surgical instruments are encoded, respectively. The spatial interactions at one time-step are captured by the graph attention scheme, which models over all the surgical instruments in the clinical operation. After assigning different importance to the keypoints, an extra LSTM is used to capture the temporal correlations of interactions between the surgical instruments. By aggregating all the spatial and temporal interactions among all the keypoints and instrument entities, the future trajectories are generated by a sequence-to-sequence (seq2seq) translation. To model the diverse motion patterns, the intra-instrument keypoints are modeled at the lower level, and the inter-instrument interactions are modeled by defining root nodes and connecting them in a higher-level graph topology.
In some examples, besides modeling the interactions of multiple instruments, object association is another aspect that may additionally or alternatively affect the tracking performance. Conventional models utilize trajectories only to understand the dynamics of tracked objects in the past, but one-directional temporal encoding ignores the future movement. This results in inconsistent tracking association because localization error accumulates over time. Tracking-by-prediction is a sophisticated scheme that bridges the gap. Instead of relying only on temporal correlation from historic traces, the tracked objects in the next frame are associated with targets by considering their predicted short- and long-term motions. The bi-directional continuity enforces the smoothness and correction of the tracking associations in potentially ambiguous scenarios.
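As a rough illustration of how an attention scheme can assign unequal importance to neighboring nodes at one time-step, the sketch below weights neighbor features by a softmax over similarity scores. It is an assumption-laden stand-in: plain dot-product scoring replaces whatever learned scoring the graph attention scheme uses, and the names `softmax` and `attend` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(node, neighbours):
    """Aggregate neighbour features, weighted by softmax of dot-product scores.

    Dot-product scoring is an assumption made for brevity; a trained graph
    attention layer would use learned parameters instead.
    """
    scores = [sum(a * b for a, b in zip(node, nb)) for nb in neighbours]
    weights = softmax(scores)
    aggregated = [
        sum(w * nb[i] for w, nb in zip(weights, neighbours))
        for i in range(len(node))
    ]
    return weights, aggregated

weights, aggregated = attend([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0]])
```

The neighbor whose features align with the node receives the larger weight, which is the "unequal importance" behavior the passage describes.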
The following description will further describe surgical instrument tracking-by-prediction with an inductive hierarchical spatial-temporal graph network. In some examples, a graph hierarchy is generated to represent the spatial-temporal complexity embedded in the multiple surgical instrument tracking problem. The model explicitly extends the graph to predict trajectories of multiple surgical tools in the surgery. The model explicitly encodes the spatial-temporal correlation with emphasis on the instruments' interaction. Additionally or alternatively, the tracking may be re-framed by aggregating the predictions to mitigate any data association inaccuracy.
Localizing instrument keypoints and tool parts in video-assisted surgeries is an attractive and open problem in computer vision. A working algorithm may be useful in computer-aided interventions in the operating room with, for example, a robotically-assisted surgical system. Knowing the location of tool parts may help virtually augment visual faculty of surgeons, assess skills of novice surgeons, and increase autonomy of surgical robots. Additionally, knowing the location of tool parts may assist with generating haptic feedback, evaluating user skills, and automating endoscopic motion based on the tracked location of the tools.
The method 300 will be described with continuing reference to
At a process 310, video input 410 (e.g., the video input 110) is received at the spatial neural network model 420. As discussed above, the video input 410 may include one or more video frame images. In some examples, the spatial neural network model 420 may include an hourglass module 425. For example, a multiple-layered neural network may be used as a spatial information encoder.
In some examples, the spatial neural network model 420 may include a cascaded hourglass network structure. Scale-invariant capability may be needed in the spatial information encoding for keypoint localization. To enhance such a capability, multiple hourglass networks may be stacked together in some examples to hierarchically process the video input 410. The stacked feature maps may be aggregated to incorporate multiple-scale spatial modeling. At the last layer of the spatial neural network model 420, the output feature vector may be regressed to infer the location of the heatmap. A duplicate of this output may be used to infer the tags of the parts at the same time. Then, clustering may be conducted using the estimated tags and a distance metric to group the individual Gaussian maps into an instrument.
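The tag-based grouping step can be sketched as follows. This is a minimal illustration, assuming each detected part carries a single scalar tag and that a greedy, threshold-based assignment is acceptable; the `Detection` type, the part labels, and `tag_threshold` are hypothetical and are not taken from the description above.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    part: str    # keypoint type, e.g. "tip" or "wrist" (hypothetical labels)
    x: float
    y: float
    tag: float   # scalar tag inferred alongside the heatmap (assumed form)

def group_by_tag(detections, tag_threshold=0.5):
    """Greedily assign each detection to the group whose mean tag is closest."""
    groups = []  # each group is a list of Detection objects (one instrument)
    for det in detections:
        best, best_dist = None, tag_threshold
        for group in groups:
            mean_tag = sum(d.tag for d in group) / len(group)
            dist = abs(det.tag - mean_tag)
            # A group may hold at most one keypoint of each part type.
            if dist < best_dist and all(d.part != det.part for d in group):
                best, best_dist = group, dist
        if best is None:
            groups.append([det])   # no close group: start a new instrument
        else:
            best.append(det)
    return groups

detections = [
    Detection("tip", 10, 10, 0.10),
    Detection("tip", 50, 50, 2.00),
    Detection("wrist", 12, 20, 0.15),
    Detection("wrist", 52, 60, 1.90),
]
groups = group_by_tag(detections)
```

Here the distance metric is simply the absolute tag difference; a different metric, or one that also uses image-space distance, could be substituted.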
The spatial neural network model 420 may output a heatmap 430. A heatmap, such as the heatmap 430, may include a graphical representation of data with values, such as locations of one or more keypoints, depicted by color. The heatmap 430 includes one or more icons 432 that illustrate estimated locations of the keypoints for the surgical instrument(s). In some examples, the keypoints are estimated using a pre-defined Gaussian distribution. The spatial neural network model 420 may learn a regression model that maps the input video frame image 410 into the heatmap 430. The icons 432 may vary in brightness depending on the confidence with which each icon represents an accurate location of a keypoint. For example, the pixel value of an icon may increase in brightness as the confidence with which the icon represents an accurate location of a keypoint increases.
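To make the Gaussian heatmap representation concrete, the sketch below renders a heatmap in which each keypoint appears as a Gaussian blob whose peak value is 1. The function name `gaussian_heatmap`, the `sigma` value, and the choice to merge overlapping blobs with a maximum are illustrative assumptions.

```python
import math

def gaussian_heatmap(height, width, keypoints, sigma=2.0):
    """Render keypoints (x, y) as Gaussian blobs on a height-by-width grid."""
    heatmap = [[0.0] * width for _ in range(height)]
    for kx, ky in keypoints:
        for y in range(height):
            for x in range(width):
                sq_dist = (x - kx) ** 2 + (y - ky) ** 2
                value = math.exp(-sq_dist / (2.0 * sigma ** 2))
                # Keep the brightest contribution where blobs overlap.
                heatmap[y][x] = max(heatmap[y][x], value)
    return heatmap

hm = gaussian_heatmap(16, 16, [(4, 5), (11, 9)])
```

Pixel brightness falls off with distance from the keypoint, so a target of this form gives the regression model a smooth signal around each location.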
In some examples, the spatial neural network model 420 may additionally or alternatively output an output image 440. The output image 440 includes icons 442 that indicate an output location of one or more keypoints of one or more surgical instruments. The output image 440 may be generated based on the heatmap 430. For example, the spatial neural network model 420 may utilize the estimated keypoint locations in the heatmap 430 to identify a final location of the keypoint(s). The spatial neural network model 420 may then cause the icons 442 to be overlaid on the output image 440 to identify the final locations of the keypoint(s).
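One simple way to turn a heatmap into final keypoint locations is to keep local maxima above a confidence threshold. The sketch below assumes that approach; `decode_keypoints` and the `threshold` value are hypothetical, and the description above does not specify the actual decoding rule.

```python
def decode_keypoints(heatmap, threshold=0.5):
    """Return (x, y, confidence) for each strict local maximum above threshold.

    Flat plateaus produce no peak under this strict comparison; that is a
    simplification of this sketch.
    """
    height, width = len(heatmap), len(heatmap[0])
    peaks = []
    for y in range(height):
        for x in range(width):
            value = heatmap[y][x]
            if value < threshold:
                continue
            neighbours = [
                heatmap[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (nx, ny) != (x, y)
            ]
            if all(value > n for n in neighbours):
                peaks.append((x, y, value))
    # Most confident detections first.
    return sorted(peaks, key=lambda p: -p[2])

grid = [[0.0] * 5 for _ in range(5)]
grid[1][2] = 0.9   # strong peak at x=2, y=1
grid[3][4] = 0.6   # weaker peak at x=4, y=3
grid[3][0] = 0.3   # below threshold, ignored
peaks = decode_keypoints(grid)
```

The returned confidence could drive the brightness of the overlaid icons.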
With reference to
Keypoint localization techniques that use only a spatial neural network model may result in an output image with keypoint locations that may not accurately match the actual location of the keypoint(s) of the surgical instrument(s). False positive keypoint identifications and/or failing to detect a keypoint may occur. The keypoint localization process of only a spatial neural network model may be improved by interconnecting the spatial neural network model with a temporal neural network model to refine the results output by the spatial neural network model. In some examples, utilizing the temporal encoding of keypoint sequences may improve the performance of keypoint localization. The temporal neural network model 530 may learn from historic keypoint trajectories (e.g., the historic trajectories 520A, 520B, 520C) and may determine the future dynamics of the tracked keypoint(s). Given the historic trajectories of the keypoint(s), the temporal neural network model 530 may efficiently learn the dynamics of the keypoint(s) and may determine the future locations of the keypoint(s). In one example, the temporal neural network model 530 may be a layered neural network using concatenated long short-term memory as the basic units. The temporal neural network model 530 may be trained off-line using ground-truth trajectories of the surgical instrument.
In some examples, the output image 440 generated by the spatial neural network model 420 may be received by the temporal neural network model 530 along with the historic trajectories 510. The temporal neural network model 530 may then iteratively match the current observation of the keypoint locations in the current output image 440 to the historic trajectories 510. In one exemplary process, the data association and iterative matching may be performed in the following manner. It is to be understood that the algorithms described below and/or the order in which the algorithms below are presented may be varied depending on specific circumstances in which the temporal neural network model 530 is used.
For historical traces $T_{-n}, T_{1-n}, \ldots, T_{-1}$, the predictor $P$ may be applied to process this n-step information and predict the next m-step possible locations where the object is going to move.
The current observation $Ob_0$ (from the output image 440 of the spatial neural network model 420) may be assigned to associate with the historical trajectories 510. Again, the same predictor $P$ may be utilized to process n-step information with the associated observation and predict the next m-step possible locations of the object:
The prediction 540, which is a prediction based only on the historic trajectories 510, may include a segment overlapped with the prediction 550, which is a prediction based on the combination of the historic trajectories 510 and the output image 440; the overlap runs from time t=1 to t=m−1. The prediction segment 540 (eq (1)) may be defined as $\hat{T}$ and the prediction segment 550 (eq (2)) may be defined as $\tilde{T}_h$. The final association may be determined by minimizing the metrics over all the predictions:
where $D$ is a defined distance metric to calculate the distance between $\hat{T}$ and $\tilde{T}_h$. One example of a distance metric is a Mahalanobis distance. Based on the optimization results, the prediction with the lowest distance error may be chosen for associating the locations of the keypoint(s) in the output image 440 with the historic trajectories 510. Using the above process, the temporal neural network model 530 may determine a future trajectory for each keypoint at a process 350.
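The equations referenced as (1) to (3) are not reproduced in this text. One reading that is consistent with the surrounding description is given below, writing the history-only prediction as $\hat{T}$ and the observation-conditioned prediction as $\tilde{T}_h$. The time indexing is inferred from the stated overlap from t=1 to t=m−1, and treating $h$ as an index over candidate associations of the current observation is an assumption, so this should be taken as an interpretive sketch rather than the original formulas.

```latex
\begin{align}
\hat{T}_{0}, \ldots, \hat{T}_{m-1} &= P\left(T_{-n}, T_{1-n}, \ldots, T_{-1}\right) \tag{1} \\
\tilde{T}_{h,1}, \ldots, \tilde{T}_{h,m} &= P\left(T_{1-n}, \ldots, T_{-1}, Ob_{0}^{(h)}\right) \tag{2} \\
h^{*} &= \arg\min_{h} \sum_{t=1}^{m-1} D\left(\hat{T}_{t}, \tilde{T}_{h,t}\right) \tag{3}
\end{align}
```

Under this reading, both predictions consume n steps of input, and the m−1 shared time steps are where the distance $D$ is evaluated.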
In some examples, when the input includes frames n=5, the output generates frames m=5. The predictions 540 may include predictions using only the historic trajectories 510. The predictions 550 may include predictions with one association of the current locations of the keypoint(s) in the output image 440 with the historic trajectories 510. The predictions 540, 550 have four frames overlapped, as shown in
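The association by overlapping predictions can be sketched in a few lines. This is an illustrative stand-in rather than the described model: a constant-velocity extrapolation takes the place of the learned predictor, the Mahalanobis distance uses an assumed diagonal covariance, and the names `predict`, `mahalanobis_diag`, and `associate` are hypothetical.

```python
import math

def predict(history, m):
    """Stand-in predictor P: constant-velocity extrapolation for m steps."""
    (x0, y0), (x1, y1) = history[-2], history[-1]
    vx, vy = x1 - x0, y1 - y0
    return [(x1 + vx * k, y1 + vy * k) for k in range(1, m + 1)]

def mahalanobis_diag(p, q, var=(1.0, 1.0)):
    """Mahalanobis distance for a diagonal covariance (assumed variances)."""
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(p, q, var)))

def associate(history, observations, m=5):
    """Pick the observation whose prediction best agrees with the history-only one."""
    pred_hist = predict(history, m)            # covers t = 0 .. m-1
    best_index, best_cost = None, float("inf")
    for h, ob in enumerate(observations):
        extended = history[1:] + [ob]          # still n steps, now ending at t = 0
        pred_obs = predict(extended, m)        # covers t = 1 .. m
        # Overlap is t = 1 .. m-1: pred_hist[1:] against pred_obs[:-1].
        cost = sum(
            mahalanobis_diag(a, b)
            for a, b in zip(pred_hist[1:], pred_obs[:-1])
        )
        if cost < best_cost:
            best_index, best_cost = h, cost
    return best_index, best_cost

history = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]   # n = 5 past locations
observations = [(5.1, 0.0), (4.0, 3.0)]              # candidate detections at t = 0
index, cost = associate(history, observations)
```

With n=5 and m=5, the two predictions share four frames, and the candidate that continues the existing motion is selected.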
After merging the current locations of the keypoint(s) in the output image 440 with the historic trajectories 510, a smoothness filter may be applied to the predicted trajectories 550 to further reduce any noise from the raw detection. At a process 360, the temporal neural network model 530 may generate a refined output image 560, which includes icons 562 illustrating a refined location of the keypoint(s) of the surgical instrument in that particular video frame image. The temporal neural network model 530 may output multiple refined output images corresponding to each video frame image in the video input 110 to illustrate the changing locations of the keypoint(s) over the course of the surgical procedure. In some examples, the refined output image 560 may be displayed on a display system, such as the display system 610. In some examples, the predicted trajectories 550 may be displayed on a display system, such as the display system 610.
In some examples, if a particular video frame image does not include one or more of the keypoints (e.g., if a part of the surgical instrument is occluded from view), the trajectory of the missing keypoint may be interpolated. The missing keypoint may be interpolated based on the application of the smoothness filter.
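A minimal sketch of filling an occluded keypoint and then smoothing the track is shown below. It assumes linear interpolation followed by a centered moving average; the description does not name a specific filter, so both choices and the function names are illustrative.

```python
def interpolate_missing(track):
    """Linearly fill None entries; hold the nearest value at the ends."""
    known = [i for i, p in enumerate(track) if p is not None]
    if not known:
        raise ValueError("track has no observed positions")
    filled = list(track)
    for i, p in enumerate(track):
        if p is not None:
            continue
        before = [k for k in known if k < i]
        after = [k for k in known if k > i]
        if before and after:
            a, b = before[-1], after[0]
            w = (i - a) / (b - a)
            filled[i] = tuple(
                pa + w * (pb - pa) for pa, pb in zip(track[a], track[b])
            )
        else:
            filled[i] = track[before[-1] if before else after[0]]
    return filled

def moving_average(track, window=3):
    """Centered moving average; the window shrinks at the track ends."""
    half = window // 2
    smoothed = []
    for i in range(len(track)):
        lo, hi = max(0, i - half), min(len(track), i + half + 1)
        span = track[lo:hi]
        smoothed.append(tuple(sum(c) / len(span) for c in zip(*span)))
    return smoothed

track = [(0.0, 0.0), None, (2.0, 4.0), (3.0, 6.0)]   # second frame occluded
filled = interpolate_missing(track)
smoothed = moving_average(filled)
```

A filter with a motion model, such as a Kalman smoother, could serve the same purpose and handle longer occlusions more gracefully.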
One or more of the keypoint detection processes described above may be used to evaluate a performance of the surgical procedure based on the refined output image(s) 560 corresponding to each video frame image of the video input 110. For example, the locations of the keypoint(s) may be used to calculate and/or evaluate one or more objective performance indicators (OPI's). The OPI's may help evaluate how effectively and/or how efficiently a surgeon performed the surgical procedure. For example, a Time OPI may track the amount of time each surgical instrument was active and the amount of time each surgical instrument was inactive during the surgical procedure.
An Instrument Movement OPI may track one or more of: a speed of instrument movement (median and/or mean); a smoothness of instrument movement (e.g., normalized speed and/or speed peaks); a range of motion of the instrument; a distance of the instrument from a camera (e.g., an endoscopic camera); an economy of instrument motion (e.g., the total distance travelled by the instrument); and bimanual dexterity (e.g., proportion of instrument distances, difference in instrument distances). A Wrist and Grip Metric OPI may track instrument wrist and instrument grip angles (mean, median, total) and/or instrument wrist angular velocity.
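Several of these indicators can be computed directly from a tracked keypoint trajectory. The sketch below assumes image-space coordinates in pixels, a fixed frame rate, and a speed threshold for deciding when an instrument counts as active; the function name `movement_opis`, the `fps` value, and `active_speed` are hypothetical.

```python
import math
from statistics import mean, median

def movement_opis(track, fps=30.0, active_speed=1.0):
    """Compute a few movement indicators from one keypoint trajectory."""
    steps = [math.dist(a, b) for a, b in zip(track, track[1:])]
    speeds = [d * fps for d in steps]          # pixels per second
    active_frames = sum(1 for s in speeds if s > active_speed)
    return {
        "path_length": sum(steps),             # economy of motion
        "mean_speed": mean(speeds),
        "median_speed": median(speeds),
        "active_time_s": active_frames / fps,
        "inactive_time_s": (len(speeds) - active_frames) / fps,
    }

opis = movement_opis([(0, 0), (3, 4), (3, 4), (6, 8)], fps=10.0)
```

Converting pixel distances to physical units would require camera calibration or depth information, which this sketch does not attempt.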
One or more of the keypoint detection processes described above may additionally or alternatively be used to determine whether the surgical instrument will exceed its range of motion based on the determined trajectory 550 of each keypoint. In some examples, if a determination is made that the surgical instrument will exceed its range of motion, a warning may be generated to indicate that the surgical instrument will exceed its range of motion. The warning may be audible, haptic, textual, or any combination thereof. The determinations may be made by one or more processors of a control system (e.g., the control system 612).
One or more of the keypoint detection processes described above may additionally or alternatively be used to determine whether the surgical instrument will collide with another surgical instrument based on the determined trajectory 550 of each keypoint. In some examples, if a determination is made that the surgical instrument will collide with another surgical instrument, a warning may be generated to indicate that the surgical instrument will collide with the other surgical instrument. The warning may be audible, haptic, textual, or any combination thereof. The determinations may be made by one or more processors of a control system (e.g., the control system 612).
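A predicted-collision check of this kind can be sketched by comparing two predicted keypoint trajectories step by step. The function name `first_collision_step` and the `min_separation` threshold are assumptions made for illustration; a range-of-motion check could follow the same pattern by comparing each predicted location against a limit instead of against another trajectory.

```python
import math

def first_collision_step(traj_a, traj_b, min_separation=10.0):
    """Return the first predicted step where two keypoints come too close.

    Steps are counted from 1; None means no predicted conflict.
    """
    for step, (a, b) in enumerate(zip(traj_a, traj_b), start=1):
        if math.dist(a, b) < min_separation:
            return step
    return None

tool_a = [(0, 0), (5, 0), (10, 0)]      # predicted tip locations, instrument A
tool_b = [(30, 0), (22, 0), (14, 0)]    # predicted tip locations, instrument B
step = first_collision_step(tool_a, tool_b)
if step is not None:
    warning = f"Predicted instrument collision in {step} frame(s)"
```

The resulting message could then be routed to an audible, haptic, or textual warning channel.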
In some examples, the keypoint detection techniques of this disclosure, such as those discussed in relation to method 300 of
Robot-assisted medical system 600 also includes a display system 610 for displaying an image or representation of the surgical site and medical instrument system 604 generated by a sensor system 608 and/or an endoscopic imaging system 609. Display system 610 and master assembly 606 may be oriented so operator O can control medical instrument system 604 and master assembly 606 with the perception of telepresence.
In some examples, medical instrument system 604 may include components for use in surgery, biopsy, ablation, illumination, irrigation, or suction. Optionally, medical instrument system 604, together with sensor system 608, may be used to gather (e.g., measure) a set of data points corresponding to locations within anatomical passageways of a patient, such as patient P. In some examples, medical instrument system 604 may include components of the imaging system 609, which may include an imaging scope assembly or imaging instrument that records a concurrent or real-time image of a surgical site and provides the image to the operator O through the display system 610. The concurrent image may be, for example, a two- or three-dimensional image captured by an imaging instrument positioned within the surgical site. In some examples, the imaging system components may be integrally or removably coupled to medical instrument system 604. However, in some examples, a separate endoscope, attached to a separate manipulator assembly, may be used with medical instrument system 604 to image the surgical site. The imaging system 609 may be implemented as hardware, firmware, software, or a combination thereof which interact with or are otherwise executed by one or more computer processors, which may include the processors of the control system 612.
The sensor system 608 may include a position/location sensor system (e.g., an electromagnetic (EM) sensor system) and/or a shape sensor system (e.g., an optical fiber shape sensor system) for determining the position, orientation, speed, velocity, pose, and/or shape of the medical instrument system 604. In some examples, the sensor system 608 includes a shape sensor. The shape sensor may include an optical fiber extending within and aligned with the medical instrument system 604 (e.g., an elongate device). In one example, the optical fiber has a diameter of approximately 200 μm. In other examples, the dimensions may be larger or smaller. The optical fiber of the shape sensor forms a fiber optic bend sensor for determining the shape of the elongate device. In one alternative, optical fibers including Fiber Bragg Gratings (FBGs) are used to provide strain measurements in structures in one or more dimensions. Various systems and methods for monitoring the shape and relative position of an optical fiber in three dimensions are described in U.S. patent application Ser. No. 11/180,389 (filed Jul. 13, 2005) (disclosing “Fiber optic position and shape sensing device and method relating thereto”); U.S. patent application Ser. No. 12/047,056 (filed on Jul. 16, 2004) (disclosing “Fiber-optic shape and relative position sensing”); and U.S. Pat. No. 6,389,187 (filed on Jun. 17, 1998) (disclosing “Optical Fiber Bend Sensor”), which are all incorporated by reference herein in their entireties. Sensors in some examples may employ other suitable strain sensing techniques, such as Rayleigh scattering, Raman scattering, Brillouin scattering, and Fluorescence scattering. In some examples, the shape of the catheter may be determined using other techniques. For example, a history of the distal end pose of the elongate device can be used to reconstruct the shape of the elongate device over the interval of time.
Robot-assisted medical system 600 may also include control system 612. Control system 612 includes at least one memory 616 and at least one computer processor 614 for effecting control between medical instrument system 604, master assembly 606, sensor system 608, endoscopic imaging system 609, and display system 610. Control system 612 also includes programmed instructions (e.g., a non-transitory machine-readable medium storing the instructions) to implement some or all of the methods described in accordance with aspects disclosed herein, including instructions for providing information to display system 610.
Control system 612 may optionally further include a virtual visualization system to provide navigation assistance to operator O when controlling medical instrument system 604 during an image-guided surgical procedure. Virtual navigation using the virtual visualization system may be based upon reference to an acquired preoperative or intraoperative dataset of anatomical passageways. The virtual visualization system processes images of the surgical site imaged using imaging technology such as computerized tomography (CT), magnetic resonance imaging (MRI), fluoroscopy, thermography, ultrasound, optical coherence tomography (OCT), thermal imaging, impedance imaging, laser imaging, nanotube X-ray imaging, and/or the like.
The singular forms “a”, “an”, and “the” are intended to include the plural forms as well, unless the context indicates otherwise. And the terms “comprises,” “comprising,” “includes,” “has,” and the like specify the presence of stated features, steps, operations, elements, and/or components but do not preclude the presence or addition of one or more other features, steps, operations, elements, components, and/or groups. Components described as coupled may be electrically or mechanically directly coupled, or they may be indirectly coupled via one or more intermediate components. Components described as coupled may be directly or indirectly communicatively coupled. The auxiliary verb “may” likewise implies that a feature, step, operation, element, or component is optional.
In the description, specific details have been set forth describing some examples. Numerous specific details are set forth in order to provide a thorough understanding of the examples. It will be apparent, however, to one skilled in the art that some examples may be practiced without some or all of these specific details. The specific examples disclosed herein are meant to be illustrative but not limiting. One skilled in the art may realize other elements that, although not specifically described here, are within the scope and the spirit of this disclosure.
Elements described in detail with reference to one example, implementation, or application optionally may be included, whenever practical, in other examples, implementations, or applications in which they are not specifically shown or described. For example, if an element is described in detail with reference to one example and is not described with reference to a second example, the element may nevertheless be claimed as included in the second example. Thus, to avoid unnecessary repetition in the following description, one or more elements shown and described in association with one example, implementation, or application may be incorporated into other examples, implementations, or aspects unless specifically described otherwise, unless the one or more elements would make an example or implementation non-functional, or unless two or more of the elements provide conflicting functions.
Any alterations and further modifications to the described devices, instruments, methods, and any further application of the principles of the present disclosure are fully contemplated as would normally occur to one skilled in the art to which the disclosure relates. In addition, dimensions provided herein are for specific examples and it is contemplated that different sizes, dimensions, and/or ratios may be utilized to implement the concepts of the present disclosure. To avoid needless descriptive repetition, one or more components or actions described in accordance with one illustrative example can be used or omitted as applicable from other illustrative examples. For the sake of brevity, the numerous iterations of these combinations will not be described separately. For simplicity, in some instances the same reference numbers are used throughout the drawings to refer to the same or like parts.
The systems and methods described herein may be suited for navigation and treatment of anatomic tissues, via natural or surgically created connected passageways, in any of a variety of anatomic systems, including the lung, colon, the intestines, the kidneys and kidney calices, the brain, the heart, the circulatory system including vasculature, and/or the like. Although some of the examples described herein refer to surgical procedures or instruments, or medical procedures and medical instruments, the techniques disclosed apply to non-medical procedures and non-medical instruments. For example, the instruments, systems, and methods described herein may be used for non-medical purposes including industrial uses, general robotic uses, and sensing or manipulating non-tissue work pieces. Other example applications involve cosmetic improvements, imaging of human or animal anatomy, gathering data from human or animal anatomy, and training medical or non-medical personnel. Additional example applications include use for procedures on tissue removed from human or animal anatomies (without return to a human or animal anatomy) and performing procedures on human or animal cadavers. Further, these techniques can also be used for surgical and nonsurgical medical treatment or diagnosis procedures.
Further, although some of the examples presented in this disclosure discuss robotic-assisted systems or remotely operable systems, the techniques disclosed are also applicable to computer-assisted systems that are directly and manually moved by operators, in part or in whole.
Additionally, one or more elements in examples of this disclosure may be implemented in software to execute on a processor of a computer system such as a control processing system. When implemented in software, the elements of the examples of the present disclosure are essentially the code segments to perform the necessary tasks. The program or code segments can be stored in a processor readable storage medium (e.g., a non-transitory storage medium) or device that may have been downloaded by way of a computer data signal embodied in a carrier wave over a transmission medium or a communication link. The processor readable storage device may include any medium that can store information including an optical medium, semiconductor medium, and magnetic medium. Processor readable storage device examples include an electronic circuit, a semiconductor device, a semiconductor memory device, a read only memory (ROM), a flash memory, an erasable programmable read only memory (EPROM); a floppy diskette, a CD-ROM, an optical disk, a hard disk, or other storage device. The code segments may be downloaded via computer networks such as the Internet, Intranet, etc. Any of a wide variety of centralized or distributed data processing architectures may be employed. Programmed instructions may be implemented as a number of separate programs or subroutines, or they may be integrated into a number of other aspects of the systems described herein. In some examples, the control system may support wireless communication protocols such as Bluetooth, Infrared Data Association (IrDA), HomeRF, IEEE 802.11, Digital Enhanced Cordless Telecommunications (DECT), ultra-wideband (UWB), ZigBee, and Wireless Telemetry.
A computer is a machine that follows programmed instructions to perform mathematical or logical functions on input information to produce processed output information. A computer includes a logic unit that performs the mathematical or logical functions, and memory that stores the programmed instructions, the input information, and the output information. The term “computer” and similar terms, such as “processor” or “controller” or “control system”, are analogous.
Note that the processes and displays presented may not inherently be related to any particular computer or other apparatus, and various systems may be used with programs in accordance with the teachings herein. The required structure for a variety of the systems discussed above will appear as elements in the claims. In addition, the examples of the present disclosure are not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the present disclosure as described herein.
While certain examples of the present disclosure have been described and shown in the accompanying drawings, it is to be understood that such examples are merely illustrative of and not restrictive to the broad disclosed concepts, and that the examples of the present disclosure are not to be limited to the specific constructions and arrangements shown and described, since various other modifications may occur to those ordinarily skilled in the art.
This application claims priority to and the benefit of U.S. Provisional Application No. 63/488,257, filed Mar. 3, 2023, and entitled “Systems and Methods for Keypoint Detection and Tracking-By-Prediction for Multiple Surgical Instruments,” which is incorporated by reference herein in its entirety.
Number | Date | Country
---|---|---
63488257 | Mar 2023 | US