Various of the disclosed embodiments relate to systems and methods for deriving sensor-based data from surgical video.
Recent advances in data acquisition within surgical theaters, such as the introduction of surgical robotics, have provided a plenitude of opportunities for improving surgical outcomes. Sensors may be used to monitor tool usage and patient status, robotic assistants may track operator movement with greater precision, cloud-based storage may allow for the retention of vast quantities of surgical data, etc. Once acquired, such data may be used for a variety of purposes to improve outcomes, such as to train machine learning classifiers to recognize various patterns and to provide feedback to surgeons and their teams.
Unfortunately, further improvements and innovation may be limited by a variety of factors affecting data acquisition from the surgical theater. For example, legal and institutional restrictions may limit data availability, as when hospitals or service providers are reluctant to release comprehensive datasets which may inadvertently disclose sensitive information. Similarly, data acquisition may be impeded by technical limitations, as when different institutions implement disparate levels of technical adoption, consequently generating surgical datasets with differing levels and types of detail. Often, if any surgical data is collected, such data is only in the form of endoscopic video.
Accordingly, there exists a need for robust data analysis systems and methods to facilitate analysis even when the available data is limited or incomplete. Even where more complete data is available, there remains a need to corroborate data of one type based upon data of another type.
Various of the embodiments introduced herein may be better understood by referring to the following Detailed Description in conjunction with the accompanying drawings, in which like reference numerals indicate identical or functionally similar elements:
The specific examples depicted in the drawings have been selected to facilitate understanding. Consequently, the disclosed embodiments should not be restricted to the specific details in the drawings or the corresponding disclosure. For example, the drawings may not be drawn to scale, the dimensions of some elements in the figures may have been adjusted to facilitate understanding, and the operations of the embodiments associated with the flow diagrams may encompass additional, alternative, or fewer operations than those depicted here. Thus, some components and/or operations may be separated into different blocks or combined into a single block in a manner other than as depicted. The embodiments are intended to cover all modifications, equivalents, and alternatives falling within the scope of the disclosed examples, rather than limit the embodiments to the particular examples described or depicted.
The visualization tool 110b provides the surgeon 105a with an interior view of the patient 120, e.g., by displaying visualization output from a camera mechanically and electrically coupled with the visualization tool 110b. The surgeon may view the visualization output, e.g., through an eyepiece coupled with visualization tool 110b or upon a display 125 configured to receive the visualization output. For example, where the visualization tool 110b is an endoscope, the visualization output may be a color or grayscale image. Display 125 may allow assisting member 105b to monitor surgeon 105a's progress during the surgery. The visualization output from visualization tool 110b may be recorded and stored for future review, e.g., using hardware or software on the visualization tool 110b itself, capturing the visualization output in parallel as it is provided to display 125, or capturing the output from display 125 once it appears on-screen, etc. While two-dimensional video capture with visualization tool 110b may be discussed extensively herein, as when visualization tool 110b is an endoscope, one will appreciate that, in some embodiments, visualization tool 110b may capture depth data instead of, or in addition to, two-dimensional image data (e.g., with a laser rangefinder, stereoscopy, etc.). Accordingly, one will appreciate that it may be possible to apply the two-dimensional operations discussed herein, mutatis mutandis, to such three-dimensional depth data when such data is available. For example, machine learning model inputs may be expanded or modified to accept features derived from such depth data.
A single surgery may include the performance of several groups of actions, each group of actions forming a discrete unit referred to herein as a task. For example, locating a tumor may constitute a first task, excising the tumor a second task, and closing the surgery site a third task. Each task may include multiple actions, e.g., a tumor excision task may require several cutting actions and several cauterization actions. While some surgeries require that tasks assume a specific order (e.g., excision occurs before closure), the order and presence of some tasks in some surgeries may be allowed to vary (e.g., the elimination of a precautionary task or a reordering of excision tasks where the order has no effect). Transitioning between tasks may require the surgeon 105a to remove tools from the patient, replace tools with different tools, or introduce new tools. Some tasks may require that the visualization tool 110b be removed and repositioned relative to its position in a previous task. While some assisting members 105b may assist with surgery-related tasks, such as administering anesthesia 115 to the patient 120, assisting members 105b may also assist with these task transitions, e.g., anticipating the need for a new tool 110c.
Advances in technology have enabled procedures such as that depicted in
Similar to the task transitions of non-robotic surgical theater 100a, the surgical operation of theater 100b may require that tools 140a-d, including the visualization tool 140d, be removed or replaced for various tasks, as well as that new tools, e.g., new tool 165, be introduced. As before, one or more assisting members 105d may anticipate such changes, working with operator 105c to make any necessary adjustments as the surgery progresses.
Also similar to the non-robotic surgical theater 100a, the output from the visualization tool 140d may here be recorded, e.g., at patient side cart 130, surgeon console 155, from display 150, etc. While some tools 110a, 110b, 110c in non-robotic surgical theater 100a may record additional data, such as temperature, motion, conductivity, energy levels, etc., the presence of surgeon console 155 and patient side cart 130 in theater 100b may facilitate the recordation of considerably more data than the output of the visualization tool 140d alone. For example, operator 105c's manipulation of hand-held input mechanism 160b, activation of pedals 160c, eye movement within display 160a, etc. may all be recorded. Similarly, patient side cart 130 may record tool activations (e.g., the application of radiative energy, closing of scissors, etc.), movement of end effectors, etc. throughout the surgery.
This section provides a foundational description of machine learning model architectures and methods as may be relevant to various of the disclosed embodiments. Machine learning comprises a vast, heterogeneous landscape and has experienced many sudden and overlapping developments. Given this complexity, practitioners have not always used terms consistently or with rigorous clarity. Accordingly, this section seeks to provide a common ground to better ensure the reader's comprehension of the disclosed embodiments' substance. One will appreciate that it is not feasible to exhaustively address herein all known machine learning models, as well as all known possible variants of the architectures, tasks, methods, and methodologies thereof. Instead, one will appreciate that the examples discussed herein are merely representative and that various of the disclosed embodiments may employ many other architectures and methods than those which are explicitly discussed.
To orient the reader relative to the existing literature,
The conventional groupings of
Supervised learning models receive input datasets accompanied with output metadata (referred to as “labeled data”) and modify the model architecture's parameters (such as the biases and weights of a neural network, or the support vectors of an SVM) based upon this input data and metadata so as to better map subsequently received inputs to the desired output. For example, an SVM supervised classifier may operate as shown in
Semi-supervised learning methodologies inform their model's architecture's parameter adjustment based upon both labeled and unlabeled data. For example, a supervised neural network classifier may operate as shown in
Finally, the conventional groupings of
As mentioned, while many practitioners will recognize the conventional taxonomy of
In particular,
For clarity, one will appreciate that many architectures comprise both parameters and hyperparameters. An architecture's parameters refer to configuration values of the architecture, which may be adjusted based directly upon the receipt of input data (such as the adjustment of weights and biases of a neural network during training). Different architectures may have different choices of parameters and relations therebetween, but changes in the parameter's value, e.g., during training, would not be considered a change in architecture. In contrast, an architecture's hyperparameters refer to configuration values of the architecture which are not adjusted based directly upon the receipt of input data (e.g., the K number of neighbors in a KNN implementation, the learning rate in a neural network training implementation, the kernel type of an SVM, etc.). Accordingly, changing a hyperparameter would typically change an architecture. One will appreciate that some method operations, e.g., validation, discussed below, may adjust hyperparameters, and consequently the architecture type, during training. Consequently, some implementations may contemplate multiple architectures, though only some of them may be configured for use or used at a given moment.
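By way of a brief, non-limiting sketch (assuming the scikit-learn library purely for illustration), the distinction may be made concrete with an SVM: the kernel type and regularization value are hyperparameters fixed before training, whereas the support vectors are parameters determined directly from the input data.

# Minimal sketch (assuming scikit-learn) distinguishing hyperparameters from parameters.
from sklearn.svm import SVC
import numpy as np

X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])  # example feature vectors
y = np.array([0, 1, 0, 1])                                      # example labels

# Hyperparameters: chosen before training; changing them changes the architecture.
model = SVC(kernel="rbf", C=1.0)

# Parameters: adjusted directly based upon the input data during training.
model.fit(X, y)
print(model.support_vectors_)  # the learned support vectors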
In a similar manner to models and architectures, at a high level, methods 220d may be seen as species of their genus methodologies 220e (methodology I having methods I.1, I.2, etc.; methodology II having methods II.1, II.2, etc.). Methodologies 220e refer to algorithms amenable to adaptation as methods for performing tasks using one or more specific machine learning architectures, such as training the architecture, testing the architecture, validating the architecture, performing inference with the architecture, using multiple architectures in a Generative Adversarial Network (GAN), etc. For example, gradient descent is a methodology describing methods for training a neural network, ensemble learning is a methodology describing methods for training groups of architectures, etc. While methodologies may specify general algorithmic operations, e.g., that gradient descent take iterative steps along a cost or error surface, that ensemble learning consider the intermediate results of its architectures, etc., methods specify how a specific architecture should perform the methodology's algorithm, e.g., that the gradient descent employ iterative backpropagation on a neural network and stochastic optimization via Adam with specific hyperparameters, that the ensemble system comprise a collection of random forests applying AdaBoost with specific configuration values, that training data be organized into a specific number of folds, etc. One will appreciate that architectures and methods may themselves have sub-architecture and sub-methods, as when one augments an existing architecture or method with additional or modified functionality (e.g., a GAN architecture and GAN training method may be seen as comprising deep learning architectures and deep learning training methods). One will also appreciate that not all possible methodologies will apply to all possible models (e.g., suggesting that one perform gradient descent upon a PCA architecture, without further explanation, would seem nonsensical). One will appreciate that methods may include some actions by a practitioner or may be entirely automated.
As evidenced by the above examples, as one moves from models to architectures and from methodologies to methods, aspects of the architecture may appear in the method and aspects of the method in the architecture as some methods may only apply to certain architectures and certain architectures may only be amenable to certain methods. Appreciating this interplay, an implementation 220c is a combination of one or more architectures with one or more methods to form a machine learning system configured to perform one or more specified tasks, such as training, inference, generating new data with a GAN, etc. For clarity, an implementation's architecture need not be actively performing its method, but may simply be configured to perform a method (e.g., as when accompanying training control software is configured to pass an input through the architecture). Applying the method will result in performance of the task, such as training or inference. Thus, a hypothetical Implementation A (indicated by “Imp. A”) depicted in
The close relationship between architectures and methods within implementations precipitates much of the ambiguity in
For clarity, one will appreciate that the above explanation with respect to
In the above example SVM implementation, the practitioner determined the feature format as part of the architecture and method of the implementation. For some tasks, architectures and methods which process inputs to determine new or different feature forms themselves may be desirable. Some random forests implementations may, in effect, adjust the feature space representation in this manner. For example,
Tree depth in a random forest, as well as different trees, may facilitate the random forest model's consideration of feature relations beyond direct comparisons of those in the initial input. For example, if the original features were pixel values, the trees may recognize relationships between groups of pixel values relevant to the task, such as relations between “nose” and “ear” pixels for cat/dog classification. Binary decision tree relations, however, may impose limits upon the ability to discern these “higher order” features.
Neural networks, as in the example architecture of
where wi is the weight parameter on the output of ith node in the input layer, ni is the output value from the activation function of the ith node in the input layer, b is a bias value associated with node 315c, and A is the activation function associated with node 315c. Note that in this example the sum is over each of the three input layer node outputs and weight pairs and only a single bias value b is added. The activation function A may determine the node's output based upon the values of the weights, biases, and previous layer's nodes' values. During training, each of the weight and bias parameters may be adjusted depending upon the training method used. For example, many neural networks employ a methodology known as backward propagation, wherein, in some method forms, the weight and bias parameters are randomly initialized, a training input vector is passed through the network, and the difference between the network's output values and the desirable output values for that vector's metadata determined. The difference can then be used as the metric by which the network's parameters are adjusted, “propagating” the error as a correction throughout the network so that the network is more likely to produce the proper output for the input vector in a future encounter. While three nodes are shown in the input layer of the implementation of
One will recognize that many of the example machine learning implementations so far discussed in this overview are “discriminative” machine learning models and methodologies (SVMs, logistic regression classifiers, neural networks with nodes as in
P(output|input) (2)
That is, these models and methodologies seek structures distinguishing classes (e.g., the SVM hyperplane) and estimate parameters associated with that structure (e.g., the support vectors determining the separating hyperplane) based upon the training data. One will appreciate, however, that not all models and methodologies discussed herein may assume this discriminative form, but may instead be one of multiple “generative” machine learning models and corresponding methodologies (e.g., a Naïve Bayes Classifier, a Hidden Markov Model, a Bayesian Network, etc.). These generative models instead assume a form which seeks to find the following probabilities of Equation 3:
P(output),P(input|output) (3)
That is, these models and methodologies seek structures (e.g., a Bayesian Neural Network, with its initial parameters and prior) reflecting characteristic relations between inputs and outputs, estimate these parameters from the training data, and then use Bayes' rule to calculate the value of Equation 2. One will appreciate that performing these calculations directly is not always feasible, and so methods of numerical approximation may be employed in some of these generative models and methodologies.
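For example, writing the relation between Equations 2 and 3 explicitly in LaTeX notation, Bayes' rule recovers the discriminative quantity from the generative quantities, with the denominator obtained by marginalizing over the possible outputs:

P(\text{output} \mid \text{input}) = \frac{P(\text{input} \mid \text{output})\, P(\text{output})}{\sum_{\text{output}'} P(\text{input} \mid \text{output}')\, P(\text{output}')}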
One will appreciate that such generative approaches may be used mutatis mutandis herein to achieve results presented with discriminative implementations and vice versa. For example,
Returning to a general discussion of machine learning approaches, while
Many different feature extraction layers are possible, e.g., convolutional layers, max-pooling layers, dropout layers, cropping layers, etc. and many of these layers are themselves susceptible to variation, e.g., two-dimensional convolutional layers, three-dimensional convolutional layers, convolutional layers with different activation functions, etc. as well as different methods and methodologies for the network's training, inference, etc. As illustrated, these layers may produce multiple intermediate values 320b-j of differing dimensions and these intermediate values may be processed along multiple pathways. For example, the original grayscale image 320a may be represented as a feature input tensor of dimensions 128×128×1 (e.g., a grayscale image of 128 pixel width and 128 pixel height) or as a feature input tensor of dimensions 128×128×3 (e.g., an RGB image of 128 pixel width and 128 pixel height). Multiple convolutions with different kernel functions at a first layer may precipitate multiple intermediate values 320b from this input. These intermediate values 320b may themselves be considered by two different layers to form two new intermediate values 320c and 320d along separate paths (though two paths are shown in this example, one will appreciate that many more paths, or a single path, are possible in different architectures). Additionally, data may be provided in multiple "channels" as when an image has red, green, and blue values for each pixel as, for example, with the "x3" dimension in the 128×128×3 feature tensor (for clarity, this input has three "tensor" dimensions, but 49,152 individual "feature" dimensions). Various architectures may operate on the channels individually or collectively in various layers. The ellipses in the figure indicate the presence of additional layers (e.g., some networks have hundreds of layers). As shown, the intermediate values may change in size and dimensions, e.g., following pooling, as in values 320e. In some networks, intermediate values may be considered at layers between paths as shown between intermediate values 320e, 320f, 320g, 320h. Eventually, a final set of feature values appear at intermediate collection 320i and 320j and are fed to a collection of one or more classification layers 320k and 320l, e.g., via flattened layers, a SoftMax layer, fully connected layers, etc. to produce output values 320m at output nodes of layer 320l. For example, if N classes are to be recognized, there may be N output nodes to reflect the probability of each class being the correct class (e.g., here the network is identifying one of three classes and indicates the class "cat" as being the most likely for the given input), though some architectures may have fewer or have many more outputs. Similarly, some architectures may accept additional inputs (e.g., some flood fill architectures utilize an evolving mask structure, which may be both received as an input in addition to the input feature data and produced in modified form as an output in addition to the classification output values; similarly, some recurrent neural networks may store values from one iteration to be inputted into a subsequent iteration alongside the other inputs), may include feedback loops, etc.
TensorFlow™, Caffe™, and Torch™, are examples of common software library frameworks for implementing deep neural networks, though many architectures may be created “from scratch” simply representing layers as operations upon matrices or tensors of values and data as values within such matrices or tensors. Examples of deep learning network architectures include VGG-19, ResNet, Inception, DenseNet, etc.
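By way of a non-limiting sketch using one such framework (here, the TensorFlow™ Keras functional API; the specific layer counts, kernel sizes, and two-path topology are illustrative assumptions rather than a depiction of any particular figure), a small multi-path network accepting a 128×128×3 feature input tensor and producing three class outputs may be assembled as follows.

# Illustrative sketch only (assumed layer sizes), using the TensorFlow Keras functional API.
from tensorflow.keras import layers, Model, Input

inputs = Input(shape=(128, 128, 3))                        # RGB feature input tensor
x = layers.Conv2D(16, (3, 3), activation="relu")(inputs)   # first convolutional layer

# Two separate processing paths over the same intermediate values.
a = layers.Conv2D(32, (3, 3), activation="relu", padding="same")(x)
a = layers.MaxPooling2D((2, 2))(a)
b = layers.Conv2D(32, (5, 5), activation="relu", padding="same")(x)
b = layers.MaxPooling2D((2, 2))(b)

merged = layers.Concatenate()([a, b])                      # paths considered together
flat = layers.Flatten()(merged)
outputs = layers.Dense(3, activation="softmax")(flat)      # e.g., "cat", "dog", "other"

model = Model(inputs, outputs)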
While example paradigmatic machine learning architectures have been discussed with respect to
In the example of
Just as one will appreciate that ensemble model architectures may facilitate greater flexibility over the paradigmatic architectures of
For example, at block 330c a new incoming feature vector (a new facial image) may be converted to the unsupervised form (e.g., the principal component feature space) and then a metric (e.g., the distance between each individual's facial image group principal components and the new vector's principal component representation) or other subsequent classifier (e.g., an SVM, etc.) applied at block 330d to classify the new input. Thus, a model architecture (e.g., PCA) not amenable to the methods of certain methodologies (e.g., metric based training and inference) may be made so amenable via method or architecture modifications, such as pipelining. Again, one will appreciate that this pipeline is but one example—the KNN unsupervised architecture and method of
Some architectures may be used with training methods and some of these trained architectures may then be used with inference methods. However, one will appreciate that not all inference methods perform classification and not all trained models may be used for inference. Similarly, one will appreciate that not all inference methods require that a training method be previously applied to the architecture to process a new input for a given task (e.g., as when KNN produces classes from direct consideration of the input data). With regard to training methods,
At block 405b, the training method may adjust the architecture's parameters based upon the training data. For example, the weights and biases of a neural network may be updated via backpropagation, an SVM may select support vectors based on hyperplane calculations, etc. One will appreciate, as was discussed with respect to pipeline architectures in
When “training,” some methods and some architectures may consider the input training feature data in whole, in a single pass, or iteratively. For example, decomposition via PCA may be implemented as a non-iterative matrix operation in some implementations. An SVM, depending upon its implementation, may be trained by a single iteration through the inputs. Finally, some neural network implementations may be trained by multiple iterations over the input vectors during gradient descent.
As regards iterative training methods,
As mentioned, the wide variety of machine learning architectures and methods include those with explicit training and inference steps, as shown in
The operations of
Many architectures and methods may be modified to integrate with other architectures and methods. For example, an architecture successfully trained for one task may provide a more effective starting point for training on a similar task than beginning with, e.g., randomly initialized parameters. Methods and architectures employing parameters from a first architecture in a second architecture (in some instances, the architectures may be the same) are referred to as "transfer learning" methods and architectures. Given a pre-trained architecture 440a (e.g., a deep learning architecture trained to recognize birds in images), transfer learning methods may perform additional training with data from a new task domain (e.g., providing labeled data of images of cars to recognize cars in images) so that inference 440e may be performed in this new task domain. The transfer learning training method may or may not distinguish training 440b, validation 440c, and test 440d sub-methods and data subsets as described above, as well as the iterative operations 440f and 440g. One will appreciate that the pre-trained model 440a may be received as an entire trained architecture, or, e.g., as a list of the trained parameter values to be applied to a parallel instance of the same or similar architecture. In some transfer learning applications, some parameters of the pre-trained architecture may be "frozen" to prevent their adjustment during training, while other parameters are allowed to vary during training with data from the new domain. This approach may retain the general benefits of the architecture's original training, while tailoring the architecture to the new domain.
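As a hedged sketch of such freezing (assuming the TensorFlow™ Keras API; the object pretrained_model and the new-domain data are hypothetical placeholders), the pre-trained layers' parameters may be held fixed while a new output layer is trained for the new task domain.

# Sketch of transfer learning with frozen parameters (pretrained_model is a hypothetical
# Keras model received with its trained parameter values, e.g., the bird recognizer above).
from tensorflow.keras import layers, Model

for layer in pretrained_model.layers[:-1]:
    layer.trainable = False                    # "freeze" the earlier layers' parameters

features = pretrained_model.layers[-2].output              # reuse learned feature extraction
new_outputs = layers.Dense(2, activation="softmax")(features)  # new task: e.g., car / no car
transfer_model = Model(pretrained_model.input, new_outputs)

transfer_model.compile(optimizer="adam", loss="categorical_crossentropy")
# transfer_model.fit(new_domain_images, new_domain_labels)  # train only the unfrozen parameters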
Combinations of architectures and methods may also be extended in time. For example, "online learning" methods anticipate application of an initial training method 445a to an architecture, the subsequent application of an inference method with that trained architecture 445b, as well as periodic updates 445c by applying another training method 445d, possibly the same method as method 445a, but typically to new training data inputs. Online learning methods may be useful, e.g., where a robot is deployed to a remote environment following the initial training method 445a, where it may encounter additional data that may improve application of the inference method at 445b. For example, where several robots are deployed in this manner, as one robot encounters "true positive" recognition (e.g., new core samples with classifications validated by a geologist; new patient characteristics during a surgery validated by the operating surgeon), the robot may transmit that data and result as new training data inputs to its peer robots for use with the method 445d. A neural network may perform a backpropagation adjustment using the true positive data at training method 445d. Similarly, an SVM may consider whether the new data affects its support vector selection, precipitating adjustment of its hyperplane, at training method 445d. While online learning is frequently part of reinforcement learning, online learning may also appear in other methods, such as classification, regression, clustering, etc. Initial training methods may or may not include training 445e, validation 445f, and testing 445g sub-methods, and iterative adjustments 445k, 445l at training method 445a. Similarly, online training may or may not include training 445h, validation 445i, and testing 445j sub-methods, and iterative adjustments 445m and 445n, and, if included, these may be different from the sub-methods 445e, 445f, 445g and iterative adjustments 445k, 445l. Indeed, the subsets and ratios of the training data allocated for validation and testing may be different at each training method 445a and 445d.
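One hedged illustration of such periodic updating (assuming scikit-learn's SGDClassifier, which supports incremental updates; the variables initial_features, initial_labels, and validated_batches are hypothetical placeholders) follows.

# Sketch of online learning: an initial training method followed by periodic updates
# as newly validated examples (e.g., surgeon-confirmed "true positives") arrive.
from sklearn.linear_model import SGDClassifier
import numpy as np

classes = np.array([0, 1])
model = SGDClassifier()

# Initial training method (445a) on the initially available labeled data.
model.partial_fit(initial_features, initial_labels, classes=classes)

# Periodic updates (445d) as newly validated batches arrive from the field.
for new_features, new_labels in validated_batches:   # hypothetical data source
    model.partial_fit(new_features, new_labels)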
As discussed above, many machine learning architectures and methods need not be used exclusively for any one task, such as training, clustering, inference, etc.
As mentioned, each surgical operation may include groups of actions, each group forming a discrete unit referred to herein as a task. For example, surgical operation 510b may include tasks 515a, 515b, 515c, and 515e (ellipses 515d indicating that there may be more intervening tasks). Note that some tasks may be repeated in an operation or their order may change. For example, task 515a may involve locating a segment of fascia, task 515b dissecting a first portion of the fascia, task 515c dissecting a second portion of the fascia, and task 515e cleaning and cauterizing regions of the fascia prior to closure.
Each of the tasks 515 may be associated with a corresponding set of frames 520a, 520b, 520c, and 520d and device datasets including operator kinematics data 525a, 525b, 525c, 525d, patient-side device data 530a, 530b, 530c, 530d, and system events data 535a, 535b, 535c, 535d. For example, for video acquired from visualization tool 140d in theater 100b, operator-side kinematics data 525 may include translation and rotation values for one or more hand-held input mechanisms 160b at surgeon console 155. Similarly, patient-side kinematics data 530 may include data from patient side cart 130, from sensors located on one or more tools 140a-d, 110a, rotation and translation data from arms 135a, 135b, 135c, and 135d, etc. System events data 535 may include data for parameters taking on discrete values, such as activation of one or more of pedals 160c, activation of a tool, activation of a system alarm, energy applications, button presses, camera movement, etc. In some situations, task data may include one or more of frame sets 520, operator-side kinematics 525, patient-side kinematics 530, and system events 535, rather than all four.
One will appreciate that while, for clarity and to facilitate comprehension, kinematics data is shown herein as a waveform and system data as successive state vectors, some kinematics data may assume discrete values over time (e.g., an encoder measuring a continuous component position may be sampled at fixed intervals) and, conversely, some system values may assume continuous values over time (e.g., values may be interpolated, as when a parametric function may be fitted to individually sampled values of a temperature sensor).
In addition, while surgeries 510a, 510b, 510c and tasks 515a, 515b, 515c are shown here as being immediately adjacent so as to facilitate understanding, one will appreciate that there may be gaps between surgeries and tasks in real-world surgical video. Accordingly, some video and data may be unaffiliated with a task. In some embodiments, these non-task regions may themselves be denoted as tasks, e.g., “gap” tasks, wherein no “genuine” task occurs.
The discrete set of frames associated with a task may be determined by the task's start point and end point. Each start point and each end point may itself be determined by either a tool action or a tool-effected change of state in the body. Thus, data acquired between these two events may be associated with the task. For example, start and end point actions for task 515b may occur at timestamps associated with locations 550a and 550b respectively.
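As a simple, non-limiting sketch (the record layout and example timestamps are assumptions for illustration only), associating acquired data with a task given the task's start and end points may amount to selecting the timestamped records falling between those two events.

# Sketch: associate frames (and other timestamped data) with a task given its
# start and end timestamps, e.g., those at locations 550a and 550b.
def data_for_task(records, start_time, end_time):
    """records: iterable of (timestamp, payload) tuples; times in seconds (assumed layout)."""
    return [payload for timestamp, payload in records
            if start_time <= timestamp <= end_time]

# task_frames = data_for_task(video_frames, 125.0, 310.5)          # hypothetical timestamps
# task_kinematics = data_for_task(kinematics_samples, 125.0, 310.5)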
Additional examples of tasks include a "2-Hand Suture", which involves completing four horizontal interrupted sutures using a two-handed technique (i.e., the start time is when the suturing needle first pierces tissue and the stop time is when the suturing needle exits tissue, with only two-hand, e.g., no one-hand, suturing actions occurring in-between). A "Uterine Horn" task includes dissecting a broad ligament from the left and right uterine horns, as well as amputation of the uterine body (one will appreciate that some tasks have more than one condition or event determining their start or end time, as here, when the task starts when the dissection tool contacts either the uterine horns or uterine body and ends when both the uterine horns and body are disconnected from the patient). A "1-Hand Suture" task includes completing four vertical interrupted sutures using a one-handed technique (i.e., the start time is when the suturing needle first pierces tissue and the stop time is when the suturing needle exits tissue, with only one-hand, e.g., no two-hand, suturing actions occurring in-between). The task "Suspensory Ligaments" includes dissecting lateral leaflets of each suspensory ligament so as to expose the ureter (i.e., the start time is when dissection of the first leaflet begins and the stop time is when dissection of the last leaflet completes). The task "Running Suture" includes executing a running suture with four bites (i.e., the start time is when the suturing needle first pierces tissue and the stop time is when the needle exits tissue after completing all four bites). As a final example, the task "Rectal Artery/Vein" includes dissecting and ligating a superior rectal artery and vein (i.e., the start time is when dissection begins upon either the artery or the vein and the stop time is when the surgeon ceases contact with the ligature following ligation).
One may wish to process raw data 510, e.g., to provide real-time feedback to an operator during surgery, to monitor multiple active surgeries from a central system, to process previous surgeries to assess operator performance, to generate data suitable for training a machine learning system to recognize patterns in surgeon behavior, etc. Unfortunately, there may be many situations where only video frames 520 are available for processing, but not accompanying kinematics data 525, 530 or system events data 535 (while per-surgery and per-task sets of data were discussed with respect to
Thus, as shown in the schematic block diagram of
Example derived data to be inferred from the video may include, e.g., visualization tool movement (as a system event or corresponding to a kinematic motion), energy application (possibly including a type or amount of energy applied and the instrument used), names of in-use tools, arm swap events, master clutch events at the surgeon console, surgeon hand movement, etc. Visualization tool movements may refer to periods during surgery wherein the visualization tool is moved within the patient. Camera focus adjustment and calibration may also be captured as events in some embodiments. Energy application may refer to the activation of end effector functionality for energy application. For example, some forceps or cauterization tools may include electrodes designed to deliver an electrical charge. Recognizing frames wherein specific sets of tools are in use may be helpful in later inferring which task of a surgery is being performed. "Arm swap" events refer to when the operator swaps handheld input control 160b between different robotic arms (e.g., assigning a left hand control from a first robotic arm to a second robotic arm, as the operator can only control two such arms, one with each of the operator's hands, at a time). In contrast, "instrument exchange" events, where the instrument upon an arm is introduced, removed, or replaced, may be inferred from instrument name changes (reflected in the UI, on the tool itself in the frame, etc.) associated with the same robotic arm. Though the "arm" may be a robotic arm as in theater 100b, such tool swapping events can also be inferred in theater 100a in some embodiments. "Master clutch events" may refer to the operator's usage of pedals 160c (or on some systems to operation of a clutch button on hand manipulators 160b), e.g., where such pedals are configured to move the visualization tool, reassign the effect of operating hand-held input mechanism 160b, etc. Hand movement events may include operation of hand-held input mechanism 160b or movement of a tool 110a by the surgeon 105a of theater 100a.
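Purely for illustration (the field names are assumptions rather than a required schema), derived data entries of the kinds listed above may be represented as records indicating the event type, its start and stop times, and any associated parameters.

# Hypothetical record format for derived data entries (illustrative only).
from dataclasses import dataclass, field

@dataclass
class DerivedEvent:
    event_type: str                # e.g., "camera_movement", "energy_application", "arm_swap"
    start_time: float              # seconds from the start of the video
    end_time: float
    parameters: dict = field(default_factory=dict)  # e.g., {"tool": "monopolar curved scissors"}

# Example: an energy application inferred from the UI between 312.4 s and 313.1 s.
event = DerivedEvent("energy_application", 312.4, 313.1, {"energy_type": "coagulation"})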
In the first pipeline 615a, the system may attempt to derive data from a UI visible in the frames, based, e.g., upon icons and text appearing in the UI, at block 625 if such UI is determined to be visible at block 620. In some embodiments, consideration of the UI may suffice to derive visualization tool movement data (e.g., where the system seeks only to discern that the endoscope was moved, without considering a direction of movement, the appearance of a camera movement icon in the UI may suffice for data derivation). However, where the UI is not visible, or where the system wishes to estimate a direction or a velocity of camera movement not discernible from the UI, the system may employ block 630 (e.g., using optical flow methods described herein) to derive visualization tool movement data (kinematics or system data).
In the second tool detection and tracking pipeline 615b, the system may detect and recognize tools in a frame at block 640 and then track the detected tools across frames at block 645 to produce derived data 650 (e.g., kinematics data, tool entrance/removal system events data, etc.). Tools tracked may include, e.g., needle drivers, monopolar curved scissors, bipolar dissectors, bipolar forceps (Maryland or fenestrated), force bipolar end effectors, ProGrasp™ forceps, Cadiere forceps, small grasping retractors, tip-up fenestrated graspers, vessel sealers, Harmonic Ace™, clip appliers, staplers (such as a SureForm™ 60, SureForm™ 45, or EndoWrist™ 45), permanent cautery hook/spatulas, etc.
While
Once derived data 635 and 650 have been generated, the processing system may consolidate these results into consolidated derived data 660. For example, the system may reconcile redundant or overlapping derived data between pipelines 615a and 615b as discussed herein with respect to
To facilitate understanding, this section discusses the application of various features of some embodiments to specific GUIs shown in
Similarly, introduction of the Cadiere forceps on the second arm may have precipitated the presentation of overlay 710d and the monopolar curved scissors on the fourth arm may precipitate presentation of overlay 710c. The visualization tool itself may be affixed to the third arm and be represented by overlay 710b. Thus one will appreciate that overlays may serve as proxy indications of tool attachment or presence. Recognizing an overlay via, e.g., a template method, or text recognition method, as described herein may thus allow the data derivation system to infer the attachment or presence of a specific tool to an arm (e.g., text recognition identifying the arm numeral within the overlay and the tool identity in the text of the overlay, such as recognizing “1” and “Large Needle Driver” text in the lower left region of the frame indicates that the needle driver is affixed to the first robotic arm). Activation of tools may be indicated by opacity changes, color changes, etc. in the overlays 710a, 710b, 710c, 710d (e.g., if a tool is controlled by the surgeon the icon is light blue, and if it is not controlled by the surgeon, the icon may be gray; thus when the visualization tool moves, camera icon 710b may, e.g., turn light blue).
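As a hedged sketch of such text recognition (assuming the open-source Tesseract engine via the pytesseract wrapper; the crop coordinates, file name, and tool string are hypothetical and would depend upon the particular UI layout), overlay text may be extracted and matched against known tool names.

# Sketch: infer tool attachment from overlay text in a fixed region of the frame.
# The crop coordinates are hypothetical; they would depend on the particular UI layout.
import cv2
import pytesseract

frame = cv2.imread("frame.png")                  # a single video frame
overlay_region = frame[1000:1060, 40:400]        # assumed location of the lower-left overlay
text = pytesseract.image_to_string(overlay_region)

if "Large Needle Driver" in text:
    print("Needle driver inferred as attached to the arm identified in this overlay")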
In some embodiments, recognition of the same overlays as are presented to the surgeon may not be necessary, as the UI designer, anticipating such video-only based data derivation, may have inserted special icons (e.g., bar codes, Quick Response codes, conventional symbols, text, etc.) conveying the information during or after the surgery for ready recognition by data derivation processing system 605b. As older video, or video from different providers, is not likely to always include such fortuitous special icons with the desired data readily available, however, it is often important that data derivation processing system 605b not be dependent upon such pre-processing, but be able to infer data values based upon the original UI, absent such special icons. In some embodiments, data derivation processing system 605b may initially check to see if the frames include such pre-processed data conveying icons and, only in their absence, fall back upon data derivation from the original “raw” UI, using the methods discussed herein (or use data derivation from the raw UI to complement data derived from such symbols).
Returning to
Similar to the unique overlay features for the camera, the monopolar curved scissors may have unique functionality, such as the ability to apply electrical charge. Consequently, corresponding overlay 710c may include an indication 730a that a cutting energy electrode is active or an indication 730b that a coagulating energy electrode is active. Detecting either of these icons in an "active" state may result in corresponding event data.
As a surgeon may only be able to control some of the tools at a time, tools not presently subject to the user's control may be indicated as such using the corresponding overlay. For example, the overlay 710d is shown at a lower opacity than overlays 710a, 710b, and 710c, represented here with dashed outlines. Where a tool is selected, but has been without input following its attachment, overlay 715 may appear over the corresponding tool, inviting the operator to match the tool with the input by moving hand-held input mechanism 160b. Icon 720 may appear in some embodiments to help associate a robot arm with a tool in the operator's field of view (and may include a letter indicating whether it is associated with the operator's right or left hand controls). One will recognize that such icons and overlays may inform data derivation processing system 605b whether a tool is present, is selected by the operator, is in motion, is employing any of its unique functionality, etc. Thus, the system may make indirect inferences regarding derived data from the presented displays. For example, if the overlay 715 is visible, the system may infer that the tool below it has not moved in any preceding frames since the tool's time of attachment (consequently, contrary indications from pipeline 615b may be suppressed or qualified). Similarly, when a tool is indicated as not selected, as in overlay 710d, the system may infer that the tool is not moving during the period it is not selected. Where the overlays 710a, 710b, and 710c appear in a finite set of locations, template matching as discussed herein may suffice to detect their presence. Thus, in the same way that UI 700 communicates a plethora of information to the operator during the surgery, where the UI 700 is available in the video data, the processing system may similarly infer the various states of tools and the robotic system.
Activation of tool functionality associated with the operator's left and right hands may be indicated by changing the color of a first activation region and a second activation region, respectively. Specifically, the second activation region is shown here with the darkened region 830 corresponding to its being colored a specific color during activation. Naturally, once the data derivation system recognizes this UI, looking at pixel values in this region may facilitate the data derivation system's recognition of a system event (or its absence), such as energy activation. Active arms controlled by each of the operator's left and right hands, respectively, may be shown by the numerals in the positions of icons 815a and 815b (e.g., if the operator's left hand takes control of arm 3, icons 815b and 815c may exchange places). An intervening icon 845 may bisect the first activation region into a first portion 825a and a second portion 825b. Intervening icon 845 may indicate that the Prograsp™ forceps 805a are attached to the arm. Swapping icons 820a and 840 may indicate that left-hand control can be switched from the second arm (indicated by icon 815b) to the third arm (indicated by icon 815c). Icon 815a presently indicates that the monopolar curved scissors 805b reside on the first arm. One will appreciate that an intervening icon corresponding to intervening icon 845 may appear on the right side where it is instead the operator's right hand that is able to be reassigned.
Pedal region 835 may indicate which pedals 160c are activated and to what function they are assigned. Here, for example, the top right pedal is assigned to the “mono cut” function of the monopolar curved scissors, and is shown as activated in accordance with its being a different color from the other pedals. Energy activation may be depicted in this region by color coding, e.g., blue indicates that the operator's foot is on top of the energy pedal before pressing, while yellow indicates that the energy pedal is being pressed. Again, one will appreciate that recognizing text and pixel values in these regions in a frame may readily allow the processing system to infer derived data for system events. Text, both within the various overlays and, in some embodiments, appearing in the field of view (e.g., upon tools as in the case of identifiers 735, 860), facilitates inferences regarding, e.g., event occurrence and tool presence/location.
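By way of a hedged sketch (the region coordinates, reference color, and tolerance are illustrative assumptions), such a pixel-value check may be as simple as comparing a region's mean color against a reference value.

# Sketch: infer an energy activation event from pixel values in a known UI region.
import numpy as np

def region_active(frame, region, reference_color, tolerance=30):
    """frame: HxWx3 array; region: (y0, y1, x0, x1) in pixels (assumed layout)."""
    y0, y1, x0, x1 = region
    mean_color = frame[y0:y1, x0:x1].mean(axis=(0, 1))
    return np.all(np.abs(mean_color - np.array(reference_color)) < tolerance)

# e.g., a yellow-ish activation color at a hypothetical pedal-region location:
# if region_active(frame, (980, 1000, 860, 900), (0, 215, 235)):  # BGR ordering assumed
#     record_event("mono_cut_energy_activation")                  # hypothetical helper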
Camera icon 855a may indicate that the field of view is being recorded and/or may indicate that the endoscope is in motion. In some systems, an indication 855c may indicate that the full field of view is captured.
As before, an overlay 850 may appear when an instrument is not yet matched, in this case, Prograsp™ forceps 805a. As depicted here, overlay 850 may occlude various of the tools in the field of view (here Prograsp™ forceps 805a). Such occlusions may be anticipated during tracking as discussed in greater detail herein (e.g., as discussed in
Supplemental icon region 865a, though not displaying any icons in this example, may take on a number of different values. For example, as shown in example supplemental output 865b, a left hand, right hand, or, as shown here, both hands, may be displayed to show activation of the clutch. As another example, example supplemental output 865c shows a camera movement notification (one will appreciate that output 865b and 865c will appear in the region 865a when shown, and are depicted here in
Invitations to move and associate tools with hand controls may be shown via icons 950a and 950b as previously described. Lack of internet connectivity may be shown by icon 970a (again, detecting this icon may itself be used to identify a system event). Additional icons, such as icon 915a, not present in the previous GUIs, may occlude significant portions of the field of view, e.g., portions of tools 905a and 905c as shown here. As discussed, when such occlusion adversely affects data derivations in one set of video frame data, the system may rely upon reconciliation from data derived from another complementary video frame set (e.g., data derived from the GUI of
As mentioned, in some embodiments, GUI information from both the display 150 of electronics/control console 145 and the display 160a of surgeon console 155 may be considered together by processing system 605b. For example, the information displayed at each location may be complementary, indicating system or kinematic event occurrence at one of the locations but not the other. Accordingly, derived data from both of the interfaces depicted in both
For example, one will appreciate that camera icon 975a and text indication 980a in
In addition, one will appreciate that while many of the icons discussed with respect to
Detection or non-detection of a specific type of UI in the frames may facilitate different modes of operation in some embodiments. Different brands of robotic systems and different brands of surgical tools and recording systems may each introduce variants in their UI or icon and symbol presentation. Accordingly, at a high level, various embodiments implement a process 1020 as shown in
As an example implementation of the process 1020,
For example, as discussed with respect to
Thus, the system may determine whether the frames are associated with an Xi™ system at block 1005b or an Si™ system at block 1005c. Though only these two considerations are shown in this example for clarity, one will appreciate that different and more or fewer UI types may be considered, mutatis mutandis (e.g., the system may also seek to determine upon which robotic arm the visualization tool was attached based upon the UI configuration). For Xi™ detected frames, sampling may be performed at block 1005d, e.g., down sampling from a framerate specific to that device to a common frame rate used for data derived recognition. Regions of the frames unrelated to the Xi™ UI (the internal field of view of the patient) may be excised at block 1005e.
Different system types may implicate different pre-processing steps prior to UI extraction. For example, as discussed above, video data may be acquired at the Si™ system from either the surgeon console or from the patient side cart display, each presenting a different UI. Thus, where the Si™ frame type was detected at block 1005c, after sampling at block 1005i (e.g., at a rate specific to the Si™ system), at block 1005j, the system may seek to distinguish between the surgeon and patient side UI, e.g., using the same method of template matching (e.g., recognizing some icons or overlays which are only present in one of the UIs). Once the type is determined, the appropriate corresponding regions of the GUI may be cropped at blocks 1005k and 1005l respectively.
At block 1005f, the system may seek to confirm that the expected UI appears in the cropped region. For example, even though the data may be detected as being associated with an Xi™ device at block 1005b, the UI may have been disabled by an operator or removed in a previous post-processing operation. Indeed, throughout the course of a surgery, the UI may be visible in some frames, but not others.
If the type cannot be recognized during type identification 1005n or if the UI is not present at block 1005g, then the system may initiate UI-absent processing at block 1005m, as described elsewhere herein. For example, rather than rely upon icon identification to detect camera or tool movement, the system may rely upon optical flow measurements (again, the two need not be mutually exclusive in some embodiments). Conversely, where the UI is present and identified, data derivation processing based upon the identified UI may then be performed at block 1005h.
At block 1010g the system may check for an arm swap event in the frame. Arm swaps and instrument exchanges may be explicitly noted in the UI, or may be inferred from successively identified instruments at block 1010e, e.g., associated with a same input hand control. The master clutch state may be assessed at block 1010h, though this may only occur for those system types wherein the clutch state is apparent from the UI. One will appreciate that the locations of icons associated with the clutch may vary between systems.
At block 1010i, camera movement, as evidenced by the GUI, may be detected. For example, an icon may be displayed during motion, as when supplemental output 865c appears in the supplemental icon region 865a, or based on a feature of icon 855a (corresponding changes may occur in icons 950a and 950b as they change to a camera logo; one will appreciate that images of just icons 950a and 950b may thus be used as templates during template matching).
As the frames are considered, the system may update the derived data record at block 1010j, indicating start and stop times of the data events detected within the frames under consideration and the corresponding parameters and values. As events may be represented across frames, it may be necessary to maintain a temporary, frame-by-frame record of detected icons, values, etc. The system may consolidate entries from this temporary record into a single derived data entry, e.g., at block 1010b, once all the frames have been considered.
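One hedged way to perform such consolidation (the data layout is assumed for illustration) is to collapse runs of consecutive frames sharing the same detection into single entries with start and stop times.

# Sketch: consolidate per-frame detections (True/False per frame) into event entries.
def consolidate(per_frame_flags, frame_times):
    """per_frame_flags: list of booleans, one per frame; frame_times: timestamp per frame."""
    events, start = [], None
    for i, flagged in enumerate(per_frame_flags):
        if flagged and start is None:
            start = i                                   # a run of detections begins
        elif not flagged and start is not None:
            events.append((frame_times[start], frame_times[i - 1]))
            start = None
    if start is not None:                               # run extends to the final frame
        events.append((frame_times[start], frame_times[-1]))
    return events

# consolidate([False, True, True, False], [0.0, 0.5, 1.0, 1.5]) -> [(0.5, 1.0)]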
One will appreciate that a variety of different logical operations and machine learning models may be used to accomplish the operations described above. For example,
Specifically, the model may be used, e.g., during preliminary detection at block 1005a. A two-dimensional convolutional layer 1105k may be configured to receive all or a cropped portion of an image frame 1105a (e.g., the portion known to contain UI distinguishing features, such as the region 1015b). For example, in Keras™ commands as shown in code lines 2 and 3 of
Two-dimensional convolutional layer 1105k and pooling layer 1105l may form an atomic combination 1105b. Embodiments may include one or more instances of this atomic unit, thereby accommodating the recognition of higher order features in the image 1105a. For example, here, four such successive combinations 1105b, 1105c, 1105d, 1105e (with corresponding lines 2-10 of
The final output may be fed to a flattening layer 1105f (
Thus, the number of outputs in the final layer may correspond to the number of classes, e.g., using a SoftMax activation to ensure that each output falls between 0 and 1 and that the outputs together sum to 1. In this example, the classifier recognizes four GUI types (e.g., corresponding to each of the four possible arm placements of an endoscope, each placement producing a different UI arrangement) or indicates that no GUI is present (construed as a fifth GUI "type"). Specifically, the first GUI type was detected with probability 0.1, the second GUI type was detected with probability 0.45, the third GUI type was detected with probability 0.25, the fourth GUI type was detected with probability 0.05, and "no GUI" with probability 0.15. Thus, the classifier would classify the frame as being associated with GUI-Type 2. One may train such a model via a number of methods, e.g., as shown in
Consider, for example, a camera icon appearing in the region 1110d (or changing color if present) of the GUI frame 1110a during camera movement and absent otherwise. Some embodiments may perform template matching upon all or a portion of the frame using a template 1110c corresponding to the icon of interest. One will appreciate multiple ways to perform such matching. For example, some embodiments directly iterate 1110b the template 1110c across all or a portion of the frame and note if a similarity metric, e.g., the cosine similarity, exceeds a threshold. Alternatively, one will appreciate that Fourier, wavelet, and other signal processing representations may likewise be used to detect regions of the image corresponding to the template above a threshold. If no region of the frame exceeds such a similarity threshold, then the system may infer that the icon is absent in the frame. Absence of such an icon in this example may be used to infer that the camera is not experiencing movement in the frame, but absence of icons may also indicate, e.g., that the UI is not of a particular type, that the UI is or is not in an expected configuration, that an operation is or is not being performed, the character of such an operation, etc.
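As a hedged sketch of one such direct approach (here using OpenCV's normalized cross-correlation template matching, a similarity metric related to the cosine similarity mentioned above; the file names and threshold value are assumptions):

# Sketch: detect a UI icon (e.g., a camera-movement icon) via template matching.
import cv2

frame = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)
template = cv2.imread("camera_icon_template.png", cv2.IMREAD_GRAYSCALE)

# Normalized cross-correlation over the frame (or a cropped sub-region of it).
result = cv2.matchTemplate(frame, template, cv2.TM_CCOEFF_NORMED)
_, max_val, _, max_loc = cv2.minMaxLoc(result)

if max_val > 0.8:                # similarity threshold (an assumed value)
    print("icon present at", max_loc, "-> infer camera movement in this frame")
else:
    print("icon absent -> no camera movement inferred from the UI for this frame")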
Optical flow methods may be useful at block 630 or at block 645, e.g., to assess camera movement events, including the direction and magnitude of such movement. However, correctly interpreting optical flow may involve some knowledge of the surgical environment. For example, as shown in
Various embodiments consider a number of factors to distinguish camera movement from these other moving artifacts. For example,
flow=cv2.calcOpticalFlowFarneback(frame_previous,frame_next,None,0.5,3,15,3,5,1.2,0) (C1)
Metrics for the flow may then be determined at the collection of blocks 1220d. For example, metric determinations may include converting the flow determination to a polar coordinate form at block 1220e. For example, following the command of code line listing C1, one may use the command of code line listing C2:
mag,ang=cv2.cartToPolar(flow[...,0],flow[...,1]) (C2)
Specifically, at block 1220f, the processing system may determine the percentage of pixels included in the optical flow (i.e., the number of pixels associated with optical flow vectors having a magnitude over a threshold, relative to all the pixels in the image). For these pixels above the threshold magnitude, at block 1220g the system may additionally determine the standard deviation of their corresponding vector magnitudes (i.e., magnitude 1225e).
At block 1220h the processing system may then determine whether these optical flow metrics satisfy conditions indicating camera movement, rather than alternative sources of movement such as that depicted in
large_op=np.where(mag>=mag_lb)[1] (C3)
total=mag.shape[0]*mag.shape[1] (C4)
pixel_ratio=len(large_op)/total (C5)
mag_std=np.std(mag) (C6)
Where mag_lb refers to the lower bound on the magnitude (e.g., mag_lb may be 0.7). One will recognize the commands "np.where", "np.std", etc. as standard commands from the NumPy™ library.
The condition for camera movement may then be taken as shown in the code line listing C7:
if (pixel_ratio>=pixel_ratio_lb) and (mag_std<=mag_std_ub): (C7)
where “pixel_ratio_lb” is a lower bound on the pixel ratio and mag_std_ub is an upper bound on the magnitude standard deviation (e.g., pixel_ratio_lb may be 0.8 and mag_std_ub may be 7). Where these conditions are satisfied, the frame may be marked as indicative of camera movement at block 1220j (one will appreciate that, in some embodiments, the peer frames may not themselves be so marked, and further, that in some embodiments the final frames of the video, which may lack their own peer frames, may not themselves be considered for movement). Otherwise, no action may be taken or a corresponding recordation made at block 1220i. Where movement is noted at block 1220j, some embodiments may also record the direction, magnitude, or velocity of the movement (e.g., by considering the average direction and magnitude of the optical flow vectors).
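Gathering code line listings C1-C7 together, a hedged end-to-end sketch of the per-frame-pair check (using the example threshold values given above, and assuming single-channel frames of identical size) may read:

# Sketch combining listings C1-C7: decide whether a frame pair indicates camera movement.
import cv2
import numpy as np

def camera_moving(frame_previous, frame_next,
                  mag_lb=0.7, pixel_ratio_lb=0.8, mag_std_ub=7):
    # frames are assumed to be single-channel (e.g., grayscale) images of identical size
    flow = cv2.calcOpticalFlowFarneback(frame_previous, frame_next,
                                        None, 0.5, 3, 15, 3, 5, 1.2, 0)       # (C1)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])                    # (C2)
    large_op = np.where(mag >= mag_lb)[1]                                     # (C3)
    total = mag.shape[0] * mag.shape[1]                                       # (C4)
    pixel_ratio = len(large_op) / total                                       # (C5)
    mag_std = np.std(mag)                                                     # (C6)
    return (pixel_ratio >= pixel_ratio_lb) and (mag_std <= mag_std_ub)        # (C7)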
After identifying frames from which data may be derived, such as camera movement directions in accordance with the process 1220, some embodiments may perform a post-processing method to smooth and consolidate the selection of frames from which derived data will be generated. For example,
Generally, frame selection post-processing may involve two operations based upon regions 1305a, 1305b, 1305c, 1305d, 1305e, and 1305f. Specifically, a first set of operations 1320a may seek to isolate the regions of frames of interest into discrete sets. Such operations may thus produce sets 1325a, wherein the frames from each region associated with the derived data now appear in their own set, e.g., frames of region 1305a in set 1315a, frames of region 1305b in set 1315b, frames of region 1305c in set 1315c, and frames of region 1305f in set 1315e. As indicated, some of these operations may identify regions of frames very close to one another in time and merge them. For example, regions 1305d and 1305e follow so closely in time that they and their intermediate frames (which did not originally appear in a region) are merged into set 1315d. Intuitively, regions of frames marked as unaffiliated with derived data, sandwiched between reasonably sized regions of frames providing derived data, were likely falsely classified by the preceding process, e.g., process 1220, as being unaffiliated. This may not be true for all types of derived data, but for some types, such as camera movement or tool movement, this may often be the case (one will appreciate that reasonable ranges for joining or dividing regions may depend upon the original framerate and any down sampling applied to the frames 1310).
In some embodiments, operations 1320b may also be performed to produce further refined sets 1325b, in this case, removing sets of frames so short in duration that they are unlikely to genuinely represent events producing derived data (again, symptomatic of a false classification in a process such as process 1220). For example, the region 1305c may correspond to so few frames that it is unlikely that a movement or energy application event would have occurred for such a short duration. Accordingly, in these embodiments the operations 1320b may remove the set 1315c corresponding to the region 1305c from the final group of sets 1325b. While the operations are depicted in a particular order in
As an example implementation of the frame post-processing depicted in
Accordingly, locating such larger differences by comparing them to a threshold at block 1330c may facilitate dividing the array of all the frames in video 1310 into sets at block 1330d (again, one will appreciate that the original framerate, down sampling, and the nature of the derived data may each influence the selection of the thresholds T1, T2 at blocks 1330d and 1330h). For example, at block 1330c a difference exceeding the threshold would have been identified between the last frame of the region 1305b and the first frame of the region 1305c. A difference beyond the threshold would also have been identified between the last frame of the region 1305c and the first frame of the region 1305d. Thus, at block 1330d the system may produce set 1315c from region 1305c. One will appreciate that the first of all the considered frames and the last of all the considered frames in the regions will themselves be counted as set boundaries at block 1330d. One will also note that the operation of blocks 1330c and 1330d may precipitate the joinder of regions 1305d and 1305e into set 1315d, as the space between regions 1305d and 1305e would not be larger than the threshold T1.
Once the indices have been allocated into sets following block 1330d, the system may iterate through the sets and perform the filtering operations of block 1320b to remove sets of unlikely small durations. Specifically, at blocks 1330e and 1330g, the system may iterate through the sets of indices and consider each of their durations at block 1330h (the length of the set or the difference between the timestamps of the first and last frames of the set). Sets with lengths below a threshold T2 may be removed at block 1330i (corresponding to such removal of the set 1315c by operations 1320b). In contrast, if the set is longer than T2, the system may generate a corresponding derived data entry at block 1330j. For example, in some embodiments, camera movement events may be represented by three components, e.g.: a start time, a stop time, and a vector corresponding to the direction of camera motion. Such components may be readily inferred from the available information. For example, the start time may be determined from the video timestamp corresponding to the index of the first frame in a set, the stop time from the video timestamp corresponding to the index of the last frame in the set, and the vector may be discerned from the optical flow measurements (e.g., the vector addition of the average flow vectors across each frame of the set).
Once the derived data has been prepared and all the sets considered, then the system may provide all the derived data results at block 1330f (e.g., for consideration and consolidation with derived data from other pipelines and processes).
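One illustrative, non-limiting sketch of such post-processing follows, assuming the movement-marked frames are available as a sorted list of indices and that an average flow vector was recorded per frame; the thresholds T1 and T2, the frame rate, and the helper names are assumptions for illustration:

import numpy as np

def consolidate_marked_frames(marked_indices, flows_by_index, T1=10, T2=15, fps=30.0):
    marked = np.asarray(sorted(marked_indices))
    if marked.size == 0:
        return []
    # Split wherever the gap between successive marked frames exceeds T1,
    # thereby joining regions that follow closely in time (operations 1320a).
    split_points = np.where(np.diff(marked) > T1)[0] + 1
    sets = np.split(marked, split_points)
    events = []
    for s in sets:
        # Drop sets too short to plausibly represent a real event (operations 1320b).
        if len(s) < T2:
            continue
        # Derive a start time, stop time, and aggregate motion vector per set.
        vectors = [flows_by_index[i] for i in s if i in flows_by_index]
        direction = np.sum(vectors, axis=0) if vectors else None
        events.append({"start_time": s[0] / fps,
                       "stop_time": s[-1] / fps,
                       "direction": direction})
    return events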
To produce the outputs 1405e, tool tracking system 1405b may include one or more detection components 1405c, such as a You Only Look Once (YOLO) based machine learning model, and one or more tracking components 1405d, such as a channel and spatial reliability tracking (CSRT) tracker. In some embodiments, the detection components 1405c may include a text recognition component (e.g., for recognizing text in a UI, on a tool, etc.). Again, some embodiments may have only one of detection components 1405c or tracking components 1405d (e.g., where only tool detection derived data is desired). Where both components are present, they may complement one another's detection and recognition as described herein.
The tracking component 1410e may itself have a tracking model component 1410f and, in some embodiments, may also, or instead, have an optical flow tracking component 1410g. These components may follow a tool's motion development frame-by-frame following an initial detection of the tool by detection component 1410b.
Tool tracking system 1410a may produce an output record indicating, e.g., what tools were recognized, in which frames, or equivalently at what times, and at what locations. In some embodiments, tool location may be the corresponding pixel locations in the visualization tool field of view. However, one will appreciate variations, as when frame-inferred location is remapped to a three dimensional position relative to the visualization tool, within the patient body, within the surgical theater, etc. Such re-mappings may be performed in post-processing, e.g., to facilitate consideration with data from pipeline 615a.
Here, the output has taken the form of a plurality of data entries, such as JSON entries, for each recognized tool. For example, the entry 1410h may include an identification parameter 1410j indicating that the "Bipolar forceps" tool was detected in connection with an array of entries 1410k, 1410l, 1410m, each entry indicating the frame (or corresponding timestamp) and location of the detected tool (here, the boundary of the tool in the frame may be represented as a polygon within the frame, e.g., B1 being a first polygon, B2 being a second polygon, etc.). Similar entries may be produced for other recognized tools, e.g., entry 1410i, wherein the ID parameter 1410n indicates the "Small grasping retractor" tool is associated with entries 1410o, 1410p, 1410q. One will appreciate that the entries 1410k, 1410l, 1410m may not be temporally continuous. For example, some embodiments may recognize that the surgery includes no more than one instance of each type of tool. Thus, any recognition of a tool type may be the "same" tool and all the corresponding frames included in a single entry, e.g., 1410k, 1410l, 1410m, even though there may be temporal gaps in the detected frames. However, some embodiments may recognize that two instances of the same tool may be used in the surgical operation (e.g., during suturing, two needle drivers may be used and tracked separately with two different object IDs). These may be treated as distinct tools with two distinct entries in the output (i.e., another entry like 1410h and 1410i, but with the same ID parameter as when the tool was previously recognized). As another example, in some embodiments it may be desirable to distinguish between tools as they are applied at different portions of the surgery. Accordingly, a temporal threshold may be used to split a single entry into multiple entries, as when frames and tool locations associated with a task in an early portion of the surgery are to be distinguished from a task performed near the end of the surgery.
Similarly, one will appreciate that tools which were not detected may be noted in a variety of forms. For example, the output may simply omit entries for tools which were not detected, may list such non-detected tools separately, may include entries for the tools but mark such entries as “not detected”, etc.
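Purely by way of illustration, and not reflecting any particular required format, one such entry might be organized along the following lines (the field names and values here are hypothetical):

# Hypothetical structure of a single tool entry, e.g., corresponding to 1410h;
# frame numbers, timestamps, and polygon coordinates are invented for illustration.
entry_example = {
    "id": "Bipolar forceps",
    "detections": [
        # Each record pairs a frame (or timestamp) with the tool's boundary
        # polygon in that frame, e.g., B1, B2, etc.
        {"frame": 1500, "timestamp_s": 50.00,
         "boundary": [(120, 80), (180, 80), (180, 150), (120, 150)]},
        {"frame": 1501, "timestamp_s": 50.03,
         "boundary": [(122, 81), (182, 81), (182, 151), (122, 151)]},
    ],
}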
For each frame, the system may then consider any active trackers at block 1520. Trackers may be created in response to tool detections in a previous frame. Specifically, at a previous iteration, the system may attempt to detect tools in the frame field of view at block 1550, e.g., by applying a YOLO detection model to the frame to determine both tool identities and locations in the frame.
At block 1560, the system may pair each of the detection results (e.g., bounding polygons) with each of the trackers (e.g., if there were two detection results and three trackers, six pairs would result). For each of these pairs, at block 1565, the system may generate an Intersection Over Union (IOU) score (e.g., the area in which the pair's two bounding polygons overlap divided by the area of their union). The system may then remove pairs associated with an IOU score below a lower bound (e.g., 0.3) at block 1570.
Some embodiments may employ combinatorial optimization algorithms to select pairs at blocks 1565 and 1570, e.g., selecting pairs by employing algorithmic solutions to the linear assignment problem when minimizing a cost matrix. Specifically, continuing the above hypothetical of two detections and three trackers, the system may form a 2×3 matrix of IOU values ("IOU_matrix") corresponding to each respective pair. The matched indices may then be acquired by minimizing the negative of the IOU matrix (i.e., maximizing the overall IOU score), e.g., using the SciPy™ library as shown in code line listing C8.
det_id,trk_id=scipy.optimize.linear_sum_assignment(-IOU_matrix) (C8)
Here, the output provides indices to match detections with trackers, ensuring that each detection is associated with only one tracker and that each tracker is associated with only one detection. If there is one more tracker than detection, as in the hypothetical with two detections and three trackers, only two trackers will have matched detections (and vice versa where there are more detections than trackers). Pairs with IOU values below a threshold (e.g. 0.3, mentioned above) may then be removed (corresponding to block 1570).
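A non-limiting sketch of the pairing of blocks 1560-1570 and code line listing C8 follows, assuming lists of axis-aligned bounding boxes in (x, y, width, height) form; the helper names are illustrative:

import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(box_a, box_b):
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Intersection rectangle between the two boxes.
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def pair_detections_with_trackers(det_boxes, trk_boxes, iou_lb=0.3):
    if not det_boxes or not trk_boxes:
        return []
    # Build the IOU matrix (rows: detections, columns: trackers).
    IOU_matrix = np.array([[iou(d, t) for t in trk_boxes] for d in det_boxes])
    # Minimize the negative IOU, i.e., maximize the overall IOU (C8).
    det_id, trk_id = linear_sum_assignment(-IOU_matrix)
    # Discard pairs whose IOU falls below the lower bound (block 1570).
    return [(d, t) for d, t in zip(det_id, trk_id) if IOU_matrix[d, t] >= iou_lb]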
Thus, surviving pairs may reflect detections associated with existing trackers for a same object. In some embodiments, these associations may then be noted and recorded for each pair at blocks 1575 and 1576. At blocks 1577 and 1578, each of the detections determined at block 1550 which are no longer paired with a tracker (following the pair removals at block 1570) may precipitate the creation of a new tracker. Conversely, trackers unassociated with a detection in a frame may or may not be removed immediately. For example, the system may iterate through each of the active trackers without a surviving pair at block 1579, and increment an associated "presence time" counter for that tracker at block 1580 (the counter thus indicating the number of times none of the detection results was associated with the tracker, i.e., none had a sufficiently large IOU score). When a detection is paired with the tracker, the counter may be reset to 0 at block 1576. However, if a tracker does not receive an associated detection for a long time (e.g., if the counter increments exceed 10 seconds), as indicated by block 1581, the system may remove the tracker at block 1582.
One will appreciate that detection may not be performed at every frame (trackers may be able to interpolate across frames). For example, as indicated by block 1595, the system may consider whether an interval has passed since a last detection, all possible tools are accounted for (and consequently detection may be unnecessary), trackers have been lost, etc., before initiating detection, as detection may be temporally or computationally expensive. If every frame were to be considered, Kalman filters may be applied, though this may be slower and more resource intensive than the process 1500. Thus, one will appreciate that tracker removal at block 1582 may occur in lieu of, or complementary to, removal at block 1540, which results from the tracker's failure to track. Where both blocks 1582 and 1540 are present, block 1540 may refer to failures to track inherent to the tracker's operation (appreciating that trackers may be updated more frequently than detections are performed, i.e., block 1530 occurs more frequently than block 1550) as opposed to removal at block 1582, which occurs when the tracker repeatedly fails to associate with a detection.
Returning to block 1520, one will appreciate that based upon the trackers created at block 1578, the system may then iterate through each such created tracker at blocks 1520 and 1525. The tracker may be provided with the newly considered frame from block 1515 when updated at block 1530. Where the tracker is successful in continuing to track its corresponding tool in the frame at block 1535, the tracker may log the tracked tool information at block 1545, e.g., noting the position, bounding box or collection of pixels, detection score, tracker identifier, tool name, IOU scores (as discussed above), etc. associated with the tool by the tracker in the most recently considered frame. Where the tracker fails to continue tracking its tool, the tracker may be removed at block 1540 (again, in some embodiments tracker removal may only occur at block 1582). In some embodiments, tolerances may be included, wherein one or more failed trackings are permitted before the tracker is removed. As discussed, some embodiments may consider information from pipeline 615a to augment a tracker's functionality, decide whether to retain a tracker, to supplement tracker management, etc. For example, the tool's last known position and UI information may be used to distinguish tracker loss resulting from tool movement under a UI overlay or from smoke following energy application, from lost tracking resulting from the tool leaving the field of view.
As indicated, the detection operations at block 1550 may be supplemented at block 1555 with reference to other gathered data. For example, if UI recognition operations at 625 detected the introduction of a tool based on text appearing in a UI at a time corresponding to the currently considered frame, then the system may favor corresponding detections at block 1555 even if they were not the most probable prediction. For example, if the UI indicates that only a forceps is present onscreen, but a YOLO model indicates that curved scissors are present with only a slightly higher prediction probability than forceps, then the system may document the detection as being for the forceps. Additional examples of such derived data reconciliation are discussed in greater detail with respect to
Once all the frames have been considered at blocks 1510 and 1515, the system may post-process the tracked tool logs at block 1585 and output the derived data results at block 1590. For example, just as the post-processing operations discussed with respect to
fs=cv2.FileStorage("PARAMs.json",cv2.FileStorage_READ) (C9)
tracker.read(fs.getFirstTopLevelNode()) (C10)
The parameter “psr_threshold” was found to achieve good results at the 0.075 value indicated in an example reduction to practice of an embodiment. A higher “psr_threshold” value may increase the robustness of the tracker, especially when the object moves fast, but if the value is too high the tracker may persist upon the image even when tracking fails. In some embodiments, logic may balance these outcomes, periodically checking the existing tracker and removing the tracker when it persists beyond a reasonable period (e.g., when the detection module cannot verify the tool's presence for multiple frames, despite the tracker's insistence upon the tool's presence) and lowering the psr_threshold value in subsequent tracker creations. As discussed, psr_threshold may be modified in response to smoke, overlay obstructions, etc. and tracking rerun.
In some embodiments, to initiate the tracker, a video frame and the corresponding bounding box “bbox_trk_new” of the surgical tool (e.g., as detected by YOLO), may be provided to the tracker, e.g., as shown in code line listing C11:
success_ini=trk[0].init(frame,tuple(bbox_trk_new)) (C11)
The system may similarly provide the tracker with each new video frame at each update. An example of this updating process is illustrated in the code line listings C12 and C13:
for ind_tracker,trk in enumerate(trackers): (C12)
success,bbox_trk=trk[0].update(frame) (C13)
Specifically, line C12 is a for loop iterating over each of the trackers, line C13 updates the currently considered tracker, and "frame" is the video frame under consideration after, e.g., cropping out black borders and downsizing to 640×512 to increase computational efficiency in some embodiments.
Following the first tool detection (e.g., by YOLO), additional such detections may not be necessary during tracking (though, as mentioned, subsequent detections may be used to verify the tracker's behavior). As indicated in line C13, after initialization, the tracker will output an estimated bounding box location and size (found in the "bbox_trk" output). If the tracker fails during one of these updates, some embodiments may initiate a new detection (e.g., with YOLO) and, if detection is successful, reinitialize the tracker with this detection result.
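The creation, initialization, and update of such a tracker, consolidating code line listings C9-C13, might be sketched as follows; depending upon the OpenCV build, tracker creation may instead require cv2.legacy.TrackerCSRT_create(), and the file name and helper structure are merely illustrative:

import cv2

def create_csrt_tracker(params_path="PARAMs.json"):
    # Some OpenCV versions expose this as cv2.legacy.TrackerCSRT_create().
    tracker = cv2.TrackerCSRT_create()
    # Load custom parameters, such as psr_threshold, from file storage (C9, C10).
    fs = cv2.FileStorage(params_path, cv2.FileStorage_READ)
    tracker.read(fs.getFirstTopLevelNode())
    return tracker

def track_tool(frames, bbox_trk_new):
    tracker = create_csrt_tracker()
    # Initialize with the first frame and the detected bounding box (C11); in
    # some OpenCV builds init returns a success flag, in others it returns None.
    tracker.init(frames[0], tuple(bbox_trk_new))
    boxes = []
    # Provide each subsequent frame to the tracker at each update (C13).
    for frame in frames[1:]:
        success, bbox_trk = tracker.update(frame)
        boxes.append(bbox_trk if success else None)
    return boxes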
The use of a corpus of trackers may allow the system to avail itself of complementary features between the trackers. For example, a CSRT tracker may be slower but more accurate than other trackers, such as a KCF tracker, and may be more resilient to erratic motion. CSRT trackers may also be trained upon a single patch and adapt to scale, deformation, and rotation. However, CSRT trackers may not recover well from failures due to full occlusion, and so other trackers may provide suitable complements, particularly in environments where reconciliation with the UI may not be computationally feasible.
Thus, whereas at blocks 1520, 1525, 1530, 1535, 1540 and 1545 only a single tracker was associated with each detected tool, various embodiments instead consider the operations of process 1600, managing a corpus of trackers for each detected tool. Specifically, at block 1605a, the system may apply each of the trackers in the corpus to the frame (corresponding to the single tracker update at block 1530). At block 1605b the system may apply a condition to determine whether the tracker corpus agrees upon a result. For example, if more than half of the trackers track the tool, outputting a center point position within a tolerance (e.g., less than 5% of the frame width), then those results may be reconciled and consolidated into a recorded result at block 1605c (corresponding to block 1545 in the single tracker embodiments, using, e.g., methods such as non-maximum suppression).
In some embodiments, where less than a majority agrees, the system may immediately remove the trackers at block 1605g (corresponding to block 1540). However, as depicted here, in some embodiments, the system may still consider whether a minority of the trackers in the corpus agrees with supplemental tracking data at block 1605e. For example, if UI detection 625, text detection, or template detection indicated that a specific tool (e.g., forceps) is in use, and a minority of the trackers provide a response consistent with that indication (e.g., the responses correspond to that tool and each have center points within 5% of the frame width of one another) at block 1605e, then at block 1605f the system may instead log the consolidated values of the minority tracker results.
In each case, for corpuses of trackers with at least one failed tracker, the failed tracker may be "reset" at block 1605d. Some trackers may need no action before use in a future frame; however, others may be modified so that they may be used in a subsequent frame at block 1605a, e.g., by adjusting their parameters with synthetic values to suggest that, like their successful peers, they also tracked the tool as identified at block 1605c or 1605f. Such modification may occur in lieu of removing trackers in some embodiments.
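A simplified, non-limiting sketch of the corpus reconciliation of blocks 1605a-1605c follows, assuming each tracker reports a success flag and a center point for the frame; the median-based agreement test here stands in for the pairwise 5%-of-frame-width comparison described above, and the helper names are illustrative:

import numpy as np

def reconcile_corpus(results, frame_width, tol_frac=0.05):
    # results: list of (success, (x, y)) tuples, one per tracker in the corpus.
    centers = np.array([c for ok, c in results if ok])
    if len(centers) == 0:
        return None
    median = np.median(centers, axis=0)
    # Trackers agreeing within the tolerance of a robust central estimate.
    agree = centers[np.linalg.norm(centers - median, axis=1)
                    <= tol_frac * frame_width]
    if len(agree) > len(results) / 2:
        # Majority agreement: consolidate into a single recorded position (1605c).
        return agree.mean(axis=0)
    # Otherwise a caller may consult supplemental data (UI text, templates)
    # before accepting a minority consensus (1605e, 1605f) or resetting (1605d).
    return None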
While some embodiments may employ a custom machine learning model topology for tool detection (e.g., a model analogous to, or the same as, the network topology of
For example,
Where the detection model architecture is YOLOv3, the model weights may be initialized using the Common Objects in Context (COCO) detection training dataset (e.g., the 2014 COCO dataset with 80 classes in total). The dataset used for transfer learning may include human annotated video frames and/or annotation via system events/kinematics of surgical images.
Pretrained networks such as that depicted in
One will appreciate that the division between "head" and "non-head" portions may not always be rigorous, as the stochastic nature of model training may spread feature creation and classification operations throughout the network. Accordingly, in some embodiments, the entire Yolov3 architecture is frozen (i.e., all the weights, including those in head portion 1710c) and one or more new layers (e.g., fully connected layers) with a final SoftMax layer are appended, with the weights of the new and SoftMax layers allowed to vary during training. In the depicted example, however, as employed in some embodiments for tool detection, the final DBL blocks 1750a, 1750b, 1750c and convolutional layers 1750d, 1750e, 1750f producing each of the three respective outputs 1710d, 1710e, 1710f of the Yolov3 network are construed as the "head" and their weights allowed to vary during tool-specific training (though shown here to include layers 1750a, 1750b, 1750c, in some embodiments the head portion comprises only layers 1750d, 1750e, and 1750f). In some embodiments, only one or two of the outputs 1710d, 1710e, 1710f may be used for detection and so the other output paths in the head may be ignored.
In some embodiments, however, each of the three outputs 1710d, 1710e, 1710f may be used. The YOLO head may predict bounding boxes for objects at three different scales at outputs 1710d, 1710e, 1710f. Non-max suppression may be used to merge these outputs into one output. Between the YOLO head's output and the non-max suppression step, the outputs may be converted to bounding boxes, as YOLO may not directly predict the bounding box location in each cell/grid of the image, instead predicting the coordinate offset and width/height difference relative to a predefined dimension (e.g., anchor boxes). One will appreciate that sigmoid and exponential functions may be used to compute the final bounding box coordinates and size.
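For example, a single cell's prediction might be decoded into a bounding box along the following lines; the variable names are illustrative, with (cx, cy) denoting the cell's grid position, (pw, ph) an anchor's dimensions, and stride the number of image pixels per grid cell:

import numpy as np

def decode_yolo_box(tx, ty, tw, th, cx, cy, pw, ph, stride):
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    # Center coordinates: sigmoid offsets within the cell, scaled to image pixels.
    bx = (sigmoid(tx) + cx) * stride
    by = (sigmoid(ty) + cy) * stride
    # Width/height: exponential adjustment of the anchor dimensions.
    bw = pw * np.exp(tw)
    bh = ph * np.exp(th)
    return bx, by, bw, bh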
With a “head” portion identified for the network, various embodiments may train the network via the process of
While the YOLOv3 architecture has been extensively represented and discussed herein to facilitate clarity of understanding, one will appreciate that YOLOv3 merely represents one possible choice of pretrained neural network that may be used in various embodiments (e.g., Faster R-CNN, SSD, etc.). ResNet, DenseNet, VGG16, etc. are all examples of neural networks trained for an initial image task, which may be retrained as described herein to facilitate surgical tool detection in a video frame 1710b.
In some embodiments, the above transfer learning may apply an Adam optimizer with a learning rate of 0.001 and a batch size of 32 for a total of 50 epochs at block 1720d. In each epoch, the surgical GUI video images may be randomly shuffled with a buffer size of 1000. As some tools appear more frequently than others during surgery, they may likewise be overrepresented in the training data. One may use the Synthetic Minority Oversampling Technique (SMOTE) (e.g., using the Imblearn™ library function imblearn.over_sampling.SMOTE) or similar methods to compensate for such imbalance. Alternatively or in addition, some embodiments may employ a random blackout augmentation technique to black out the more frequent classes given the class distribution probability. For example, in some contexts, a stapler will be a minority class (e.g., rarely present in the video data) and mostly appear along with bipolar forceps, which will be a majority class (e.g., more frequently present in the video data). The augmentation method may randomly black out the bipolar forceps in the image with a given probability while retaining the stapler label. This may facilitate improved recognition of the minority class tools. Additional augmentation methods used during training may include random brightness, random rotation, horizontal flip, and the addition of Gaussian noise to the data.
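Assuming, purely for illustration, a TensorFlow/Keras implementation of the network, the training configuration described above might be sketched as follows; the model, head-layer names, dataset, and loss shown are placeholders rather than the original training code:

import tensorflow as tf

def configure_and_train(yolo_model, head_layer_names, train_dataset):
    # Freeze every layer outside the designated "head" portion.
    for layer in yolo_model.layers:
        layer.trainable = layer.name in head_layer_names
    # Adam optimizer with a learning rate of 0.001, as in the example above;
    # the loss here is a placeholder for a YOLO-style detection loss.
    yolo_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                       loss="mean_squared_error")
    # Random shuffling with a buffer size of 1000, batch size 32, for 50 epochs.
    dataset = train_dataset.shuffle(buffer_size=1000).batch(32)
    yolo_model.fit(dataset, epochs=50)
    return yolo_model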
Depending upon the detection and tracking methods employed, one will appreciate that tool location information within a frame may be represented in a variety of manners. For example,
Similarly, some detection systems may provide more granular assessments, indicating the actual frame pixels corresponding to their recognized tool (various flood-fill algorithms may likewise determine such regions from a given center point). Thus, as shown in
Similarly, as will be discussed with respect to
At block 1905 the system may decide whether to operate in “full image” or “known region” modes. For example, if text is known to appear only in certain locations (e.g., overlay locations for a given UI type), the system may limit its search to sub-images at those locations at blocks 1910 and 1920. In contrast, absent such contextual reference, the system may simply run a recognition algorithm over the entire image at block 1915.
One will recognize a variety of algorithms that may be run at blocks 1920 or 1915. For example, the Pytesseract™ library may be used in some embodiments, e.g., following brightness and contrast adjustment, as shown in code line listing C14:
candidate_text=pytesseract.image_to_string(image) (C14)
In this example, the library applies a pre-trained neural network to the image to detect characters. In some embodiments a preliminary geometric remapping transformation may be applied to the image before applying such a text recognition algorithm, as discussed herein. For example, when recognizing text in a UI (e.g., as discussed above with respect to block 1010e in the process of
As indicated at blocks 1925 and 1930 the system may consider all the instances of text identified by the algorithm in the image or sub-images. An initial filter may be applied at block 1935, e.g., to see if the recognized text is merely a garbled collection of letters (as may be caused, e.g., by various surface textures within the human body). Similarly, if the recognized text is shorter than the shortest candidate tool name or tool identifier, the system may transition back to block 1925. For those instances surviving the filtering of block 1935, at blocks 1940 and 1945 the system may iterate through the possible tool names and identifiers to see if the candidate text identified by the algorithm is sufficiently similar at block 1950 that a recognition should be recorded at block 1955. For example, the Hamming distance between the candidate text and a tool identifier may be compared to a threshold to determine if the text is sufficiently similar. In such embodiments, ties may be resolved by looking for corroborating recognitions, e.g., by the tool recognition system in the same or nearby frames.
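A simplified, non-limiting sketch of the filtering and matching of blocks 1935-1955 follows, assuming candidate text returned by the recognition algorithm and a list of known tool names; the garbled-text filter and the equal-length Hamming comparison here are illustrative simplifications:

def match_tool_name(candidate_text, tool_names, max_distance=2):
    text = candidate_text.strip().lower()
    # Initial filter (block 1935): discard empty or largely non-alphabetic text.
    if not text or sum(c.isalpha() or c.isspace() for c in text) / len(text) < 0.8:
        return None
    if len(text) < min(len(n) for n in tool_names):
        return None
    best = None
    for name in tool_names:
        target = name.lower()
        if len(text) != len(target):
            continue  # Hamming distance is defined for equal-length strings.
        distance = sum(a != b for a, b in zip(text, target))
        # Record a match when the distance falls within the threshold (block 1950).
        if distance <= max_distance and (best is None or distance < best[1]):
            best = (name, distance)
    return best[0] if best else None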
At block 2020, the system may reconcile 2050n the collections 2050d and 2050e, as indicated by arrows 2050n, to produce the collection 2050g. This collection 2050g may include the previously derived data, e.g., events D6 and D7. However, the system may also remove some derived data in favor of other data during reconciliation. For example, both derived data D1 and D9 may refer to the same event (e.g., camera movement detected based upon optical flow and movement detected based upon an icon appearing in the UI) and so the system may retain only one of the derived data records (in some embodiments modifying the retained record with complementary information from the other data item). Similarly, where some events are mutually exclusive, one event may be dropped in favor of a dominant event (e.g., D4 may have been removed as it is dominated by D8, as when more granular optical flow movement results are favored over binary UI movement icon data alone). Similarly, derived data records may be joined to create new derived data records (e.g., derived data D16 is recorded based upon the existence of derived data D10 and D2, as when camera movement and the camera tool name are joined). Though the order of this example considers UI and motion reconciliation, then tracking reconciliation, one will appreciate that the reconciliation order may instead begin with tracking and UI results, tracking and motion results, etc.
At block 2025, the system may perform tool tracking based detection 2050c to produce 2050k a collection of derived data 2050f (e.g., performing tool tracking over the entire video and/or specific periods of interest, as when energy is applied or the camera has moved). Thus, tool tracking 2050c may consider 2050m the results from previous operations (either post-consolidation, as shown here, or in their original forms) in its own assessment. At block 2030, the collection 2050f may be reconciled with the collection 2050g, as evidenced by arrows 2050o, to produce a final collection 2050h, again adding, removing, or retaining derived data. At block 2035, the set 2050h may be output as the final set of derived data detected by the system. During consolidation, tool tracking at block 2025 may be re-performed at particular times of interest, e.g., at a specific clustering of events, as when energy application events (determined, e.g., from the UI) suggest that smoke may have blurred the field of view and so more effective tracking for these periods may be performed again with more suitable tracker parameters (e.g., a different psr_threshold).
In some embodiments, the system may give precedence to derived data generated based upon the UI over those generated by motion detection or even tool detection, as UI-based recognition may be more consistent. Indeed, in some embodiments only UI recognition may be performed to derive data. In situations where the UI is given preference, in the event of overlap or conflict between the derived data, the UI-based derived data may dominate. Similarly, reconciliation may also resolve logical inconsistencies, as when the presence of one event makes impossible the presence of another event.
In some embodiments, various performance metrics may be employed to determine whether results from one source are high or low quality and should take precedence over, or be dominated by, other sources. For example, a "tracked percentage" metric may indicate the number of video frames having a specific tracked instrument in view divided by the total frame range over which the tool is being detected/tracked. If the metric falls below a threshold, e.g., 10%, UI-based tool results 2050i may be favored over tool-tracked results 2050c. Similarly, an event occurrence rate may be used to determine whether outliers/false detections are present. If the rate value for a particular time period is significantly larger (for example, 20 times larger) than the average rate computed over the entire time period, it may suggest that one of sources 2050a or 2050b should be dominated by source 2050c.
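These metrics might be computed, for example, as in the following sketch; the inputs and the example factor of 20 follow the description above, and the function names are illustrative:

def tracked_percentage(frames_with_tool_in_view, detection_frame_range):
    # Frames with the tracked instrument in view, relative to the full range
    # over which the tool is being detected/tracked.
    return len(frames_with_tool_in_view) / max(1, detection_frame_range)

def is_rate_outlier(events_in_window, window_seconds,
                    events_total, total_seconds, factor=20.0):
    # Flag a window whose event occurrence rate greatly exceeds the average rate.
    window_rate = events_in_window / window_seconds
    average_rate = events_total / total_seconds
    return window_rate > factor * average_rate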
To facilitate clarity of reader comprehension,
In this example, the “data” object contains five derived data entries. A camera movement event (lines 2-7) may indicate a plurality of frame indices at which camera movement was detected. This may be accomplished based upon the appearance of an icon in the GUI and using, e.g., the methods of
Frames for various energy application events, "energy blue USM2" (lines 8-12), "energy blue usm3" (lines 13-14), and "energy yellow USM1" (lines 15-17), are also indicated. These frames may likewise have been detected from the UI as discussed herein, or, alternatively or complementarily, via tool detection and recognition (e.g., as in
Similarly, one or both of UI monitoring and tool tracking may be used to recognize the frames at which an "arm swap" event occurred at lines 18 and 19. For example, a UI indication, such as a pedal activation icon, or a change in tool name text at a specific location, may imply such a swap event. Tool tracking may be used to corroborate such assessments, as discussed herein. For example, given a tool and right/left tag (e.g., as discussed with respect to the icon 720) over a series of frames, one may readily discern periods where a first of two tools that was active becomes static, while the other of the two tools, which was static, becomes active. Where the two tools may only be controlled by a single input (e.g., the left surgeon console control), this may imply an arm swap transfer of control event between the two periods.
Though this example output simply notes the frame index at which an event occurred, one will appreciate that other information and parameters may be readily included in the output than are depicted in this example. For example, using the text recognition techniques discussed herein, the "arm swap" parameter may indicate which tools are affected and the tools' locations at the frame index. Similarly, energy application events may include parameters for each frame indicating where the energy was applied (based upon tool tracking), which tool applied the energy (e.g., based upon the UI and/or tool tracking), and in what amount. For example, where the UI does not indicate the amount of energy, but only whether energy is being applied or not, the amount of energy may be inferred from the energy activation duration (e.g., the number of consecutive frames) in conjunction with the tool type applying the energy.
An example reduction to practice of an embodiment has demonstrated the effectiveness of the systems and methods disclosed herein. Specifically,
For "both hand clutch" events in plot 2205d, missing clutch events from the surgical theater sample recorded system data (i.e., genuine clutch events the system data failed to record) were identified by the video-based approach, which indicates that the video-based approach may derive events (e.g., hand clutch events) that were possibly missing even from a system data recorder. As mentioned, this may be beneficial for corroborating traditionally acquired data.
Plots 2205e and 2205f compare video-based derived tool data from a da Vinci Xi™ system with system recorded data. A total of 6 surgical tasks from 6 procedures were used to compare the linear distance traveled (or economy of motion, EOM) by the right and left hand tools obtained from derived data and surgical theater recorded tool kinematics. The unit of the video-derived data along the vertical axis of the plots 2205e and 2205f is pixels and the unit of the recorded system data along the horizontal axis is meters.
To compare surgical theater kinematics data with video-derived kinematics data, kinematics data and video-derived data generated using the example implementation from two different surgical procedures were considered. Both the three dimensional kinematics data and the video data derived kinematics results were projected upon a two dimensional pixel space to facilitate review (i.e., U, V coordinates where U ranges from 0 to 650 and V ranges from 0 to 512; camera calibration parameters were used to project the kinematics data). Schematic representations of the trajectories resulting from this projection are shown in
The one or more processors 2410 may include, e.g., an Intel™ processor chip, a math coprocessor, a graphics processor, etc. The one or more memory components 2415 may include, e.g., a volatile memory (RAM, SRAM, DRAM, etc.), a non-volatile memory (EPROM, ROM, Flash memory, etc.), or similar devices. The one or more input/output devices 2420 may include, e.g., display devices, keyboards, pointing devices, touchscreen devices, etc. The one or more storage devices 2425 may include, e.g., cloud based storages, removable USB storage, disk drives, etc. In some systems memory components 2415 and storage devices 2425 may be the same components. Network adapters 2430 may include, e.g., wired network interfaces, wireless interfaces, Bluetooth™ adapters, line-of-sight interfaces, etc.
One will recognize that only some of the components, alternative components, or additional components than those depicted in
In some embodiments, data structures and message structures may be stored or transmitted via a data transmission medium, e.g., a signal on a communications link, via the network adapters 2430. Transmission may occur across a variety of mediums, e.g., the Internet, a local area network, a wide area network, or a point-to-point dial-up connection, etc. Thus, “computer readable media” can include computer-readable storage media (e.g., “non-transitory” computer-readable media) and computer-readable transmission media.
The one or more memory components 2415 and one or more storage devices 2425 may be computer-readable storage media. In some embodiments, the one or more memory components 2415 or one or more storage devices 2425 may store instructions, which may perform or cause to be performed various of the operations discussed herein. In some embodiments, the instructions stored in memory 2415 can be implemented as software and/or firmware. These instructions may be used to perform operations on the one or more processors 2410 to carry out processes described herein. In some embodiments, such instructions may be provided to the one or more processors 2410 by downloading the instructions from another system, e.g., via network adapter 2430.
The drawings and description herein are illustrative. Consequently, neither the description nor the drawings should be construed so as to limit the disclosure. For example, titles or subtitles have been provided simply for the reader's convenience and to facilitate understanding. Thus, the titles or subtitles should not be construed so as to limit the scope of the disclosure, e.g., by grouping features which were presented in a particular order or together simply to facilitate understanding. Unless otherwise defined herein, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure pertains. In the case of conflict, this document, including any definitions provided herein, will control. A recital of one or more synonyms herein does not exclude the use of other synonyms. The use of examples anywhere in this specification including examples of any term discussed herein is illustrative only and is not intended to further limit the scope and meaning of the disclosure or of any exemplified term.
Similarly, despite the particular presentation in the figures herein, one skilled in the art will appreciate that actual data structures used to store information may differ from what is shown. For example, the data structures may be organized in a different manner, may contain more or less information than shown, may be compressed and/or encrypted, etc. The drawings and disclosure may omit common or well-known details in order to avoid confusion. Similarly, the figures may depict a particular series of operations to facilitate understanding, which are simply exemplary of a wider class of such collection of operations. Accordingly, one will readily recognize that additional, alternative, or fewer operations may often be used to achieve the same purpose or effect depicted in some of the flow diagrams. For example, data may be encrypted, though not presented as such in the figures, items may be considered in different looping patterns (“for” loop, “while” loop, etc.), or sorted in a different manner, to achieve the same or similar effect, etc.
Reference herein to “an embodiment” or “one embodiment” means that at least one embodiment of the disclosure includes a particular feature, structure, or characteristic described in connection with the embodiment. Thus, the phrase “in one embodiment” in various places herein is not necessarily referring to the same embodiment in each of those various places. Separate or alternative embodiments may not be mutually exclusive of other embodiments. One will recognize that various modifications may be made without deviating from the scope of the embodiments.
This application claims the benefit of, and priority to, U.S. Provisional Application No. 63/117,993, filed on Nov. 24, 2020, entitled "SURGICAL SYSTEM DATA DERIVATION FROM SURGICAL VIDEO," which is incorporated by reference herein in its entirety for all purposes.
Filing Document: PCT/US2021/060200; Filing Date: Nov. 19, 2021; Country: WO.
Related U.S. Provisional Application: No. 63/117,993; Date: Nov. 2020; Country: US.