The disclosure here generally relates to automated or computerized techniques for processing digital video of a surgery, to detect which frames of the video show the presence of an instrument that is used in the surgery.
Temporally locating and classifying instruments in surgical video is useful for the analysis and comparison of surgical techniques. Several machine learning models have been developed for this task; they can detect where in the video (in which video frames) a hook, grasper, scissors, or other instrument is present.
One aspect of the disclosure here is a machine learning model that has an action segmentation network preceded by an EfficientNetV2 featurizer, as a technique (a method or apparatus) that temporally locates and classifies (recognizes) instruments in surgical videos. The technique may achieve better mean average precision than previous approaches to this task on the open-source Cholec80 dataset of surgical videos. When ASFormer is used as the action segmentation network, the model outperforms LSTM and MS-TCN architectures that use the same featurizer. The recognition results may then be added as metadata associated with the analyzed surgical video, for example by inserting them into the corresponding surgical video file or by annotating that file. The model reduces the need for costly human review and labeling of surgical video and could be applied to other action segmentation tasks, driving the development of indexed surgical video libraries and instrument usage tracking. Examples of these applications are included with the results to highlight the power of this modeling approach.
The above summary does not include an exhaustive list of all aspects of the present disclosure. It is contemplated that the disclosure includes all systems and methods that can be practiced from all suitable combinations of the various aspects summarized above, as well as those disclosed in the Detailed Description below and particularly pointed out in the Claims section. Such combinations may have advantages not specifically recited in the above summary.
Several aspects of the disclosure here are illustrated by way of example and not by way of limitation in the figures of the accompanying drawings in which like references indicate similar elements. It should be noted that references to “an” or “one” aspect in this disclosure are not necessarily to the same aspect, and they mean at least one. Also, in the interest of conciseness and reducing the total number of figures, a given figure may be used to illustrate the features of more than one aspect of the disclosure, and not all elements in the figure may be required for a given aspect.
Video-based assessment (VBA) involves assessing a video recording of a surgeon's performance in order to support surgeons in their lifelong learning. Surgeons upload their surgical videos to online computing platforms, which analyze and document the videos using a VBA system. A surgical video library is an important feature of such platforms because it can help surgeons document and locate their cases efficiently.
To enable indexing through a surgical video library, video-based surgical workflow analysis with Artificial Intelligence (AI) is an effective solution. Video-based surgical workflow analysis involves several technologies, including surgical phase recognition, surgical gesture and action recognition, surgical event recognition, and surgical instrument segmentation and recognition. This disclosure focuses on surgical instrument recognition, which can help document surgical instrument usage for surgical workflow analysis as well as index through the surgical video library.
In this disclosure, long video segment temporal modeling techniques are applied to achieve surgical instrument recognition. In one aspect, a convolutional neural network called EfficientNetV2 (Tan and Le 2021) is applied to capture the spatial information from video frames. Instead of using Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber 1997) or Multi-Stage Temporal Convolutional Network (MS-TCN) (Farha and Gall 2019) for full video temporal modeling, a Transformer for Action Segmentation (ASFormer) (Yi et al. 2021) is used to capture the temporal information in the full video to improve performance. This version of the machine learning model is also referred to here as EfficientNetV2-ASFormer. It outperforms previous state-of-the-art designs for surgical instrument recognition and may be promising for instrument usage documentation and surgical video library indexing.
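For illustration, below is a minimal sketch of this two-stage design, not the exact disclosed implementation, assuming PyTorch and torchvision are available: a pretrained EfficientNetV2 backbone produces one feature vector per frame, and a full-video temporal model maps the feature sequence to per-frame instrument probabilities. The seven instrument classes correspond to the Cholec80 annotations; the sigmoid multi-label output and the placeholder linear temporal head are assumptions.

```python
# Sketch of the two-stage design: per-frame featurizer + full-video
# temporal model. Stage 2 is a placeholder; any full-video temporal
# model (LSTM, MS-TCN, ASFormer) mapping (T, 1280) -> (T, classes) fits.
import torch
import torch.nn as nn
from torchvision.models import efficientnet_v2_s, EfficientNet_V2_S_Weights

NUM_INSTRUMENTS = 7  # Cholec80 annotates 7 instrument classes

backbone = efficientnet_v2_s(weights=EfficientNet_V2_S_Weights.DEFAULT)
backbone.classifier = nn.Identity()  # keep the 1280-d pooled feature
backbone.eval()

@torch.no_grad()
def featurize(frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, 3, H, W) preprocessed video frames -> (T, 1280) features."""
    return backbone(frames)

# Sigmoid, not softmax, because several instruments can be present in
# the same frame (multi-label recognition).
temporal_model = nn.Linear(1280, NUM_INSTRUMENTS)

frames = torch.rand(16, 3, 384, 384)                      # 16 dummy frames
probs = torch.sigmoid(temporal_model(featurize(frames)))  # (16, 7)
```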
For feature extraction, the EfficientNetV2 developed by Tan and Le (2021) may be used. EfficientNetV2 is based on EfficientNetV1, a family of models optimized for FLOPs and parameter efficiency. It uses Neural Architecture Search (NAS) to find a baseline architecture with a better tradeoff between accuracy and FLOPs. The baseline model is then scaled up with a compound scaling strategy, which scales up network width, depth, and resolution with a set of fixed scaling coefficients.
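To make the compound scaling strategy concrete, the following sketch applies the coefficients reported for EfficientNetV1 (α = 1.2 for depth, β = 1.1 for width, γ = 1.15 for resolution, chosen so that α·β²·γ² ≈ 2, meaning each unit increase of the compound exponent φ roughly doubles FLOPs); the baseline dimensions in the example are illustrative assumptions.

```python
# Worked example of compound scaling: depth, width, and resolution are
# scaled together by fixed coefficients raised to a single exponent phi.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # coefficients from the EfficientNetV1 paper

def compound_scale(base_depth: int, base_width: int, base_resolution: int,
                   phi: float) -> tuple[int, int, int]:
    depth = round(base_depth * ALPHA ** phi)             # more layers
    width = round(base_width * BETA ** phi)              # more channels
    resolution = round(base_resolution * GAMMA ** phi)   # larger input images
    return depth, width, resolution

# e.g. scaling an assumed baseline of 18 layers, 64 channels, 224x224 input:
print(compound_scale(18, 64, 224, phi=2))  # -> (26, 77, 296)
```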
EfficientNetV2 was developed by studying the bottlenecks of EfficientNetV1. In the original V1, training with very large image sizes was slow, so V2 progressively adjusts the image size. EfficientNetV2 implements Fused-MBConv in addition to MBConv to improve training speed. EfficientNetV2 also implements a non-uniform scaling strategy to gradually add more layers to later stages of the network. Finally, EfficientNetV2 implements progressive learning: data regularization and augmentation are increased along with image size.
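A minimal sketch of such a progressive learning schedule is given below: training is split into stages, and both the image size and the regularization strength (here, a dropout rate) are interpolated from small/weak to large/strong. The start and end values are illustrative assumptions, not the exact schedule used by EfficientNetV2.

```python
# Progressive learning sketch: image size and regularization grow together
# across training stages, so early stages train fast on small images and
# later stages regularize harder on large images.
def progressive_schedule(num_stages: int,
                         size_range=(128, 384),
                         dropout_range=(0.1, 0.3)):
    for stage in range(num_stages):
        t = stage / max(num_stages - 1, 1)  # 0.0 -> 1.0 across stages
        image_size = int(size_range[0] + t * (size_range[1] - size_range[0]))
        dropout = dropout_range[0] + t * (dropout_range[1] - dropout_range[0])
        yield stage, image_size, dropout

for stage, image_size, dropout in progressive_schedule(4):
    # train_one_stage(model, image_size=image_size, dropout=dropout)
    print(f"stage {stage}: image_size={image_size}, dropout={dropout:.2f}")
```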
In one aspect of the machine learning model here, the action segmentation network of the model is MS-TCN, which is depicted by an example block diagram in one of the accompanying figures.
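For illustration, a minimal single-stage sketch of an MS-TCN-style temporal model is given below, assuming PyTorch; the channel width, layer count, and class count are assumptions. Each residual layer uses a dilated 1-D convolution whose dilation doubles with depth, so the temporal receptive field grows exponentially; a full MS-TCN stacks several such stages, each refining the previous stage's per-frame predictions.

```python
# Single MS-TCN-style stage over a sequence of per-frame features.
import torch
import torch.nn as nn

class DilatedResidualLayer(nn.Module):
    def __init__(self, channels: int, dilation: int):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.out = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):                    # x: (batch, channels, T)
        return x + self.out(torch.relu(self.conv(x)))

class TCNStage(nn.Module):
    def __init__(self, feat_dim: int, channels: int, num_classes: int,
                 num_layers: int = 10):
        super().__init__()
        self.proj = nn.Conv1d(feat_dim, channels, kernel_size=1)
        self.layers = nn.ModuleList(
            DilatedResidualLayer(channels, dilation=2 ** i)
            for i in range(num_layers))
        self.head = nn.Conv1d(channels, num_classes, kernel_size=1)

    def forward(self, x):                    # x: (batch, feat_dim, T)
        x = self.proj(x)
        for layer in self.layers:
            x = layer(x)
        return self.head(x)                  # (batch, num_classes, T)

stage = TCNStage(feat_dim=1280, channels=64, num_classes=7)
logits = stage(torch.rand(1, 1280, 1000))    # one 1000-frame video
```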
In another aspect of the machine learning model here, the action segmentation network is a natural language processing (NLP) module that performs spatial-temporal feature learning. In one instance, the NLP module is based on a transformer model, for example a vision transformer. Transformers (Vaswani et al. 2017) are utilized for natural language processing tasks, and recent studies have shown the potential of utilizing transformers, or redesigning them, for computer vision tasks. Vision Transformer (ViT) (Dosovitskiy et al. 2020), which is designed for image classification, may be used as the vision transformer. Video Vision Transformer (ViViT) (Arnab et al. 2021) is designed and implemented for action recognition. For the action segmentation network here, the Transformer for Action Segmentation (ASFormer) (Yi et al. 2021) was found to outperform several state-of-the-art algorithms and is depicted in the accompanying figures.
The first layer of the ASFormer encoder is a fully connected layer that adjusts the dimension of the input feature. It is then followed by a series of encoder blocks, as shown in the accompanying figures. Each encoder block contains a feed-forward layer, which is built on dilated temporal convolution, and a self-attention layer that operates within a local window.
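A simplified sketch of one such encoder block is given below, assuming PyTorch. The local window is approximated here by masking attention beyond a fixed distance, single-head attention is used, and the window size and dilation are assumptions (in ASFormer both grow with block depth, as noted next); the weighted residual used in the actual encoder is omitted for brevity.

```python
# Simplified ASFormer-style encoder block: dilated temporal convolution
# feed-forward, then single-head self-attention restricted to a local
# window, with a residual connection.
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, channels: int, dilation: int, window: int):
        super().__init__()
        self.feed_forward = nn.Conv1d(channels, channels, kernel_size=3,
                                      padding=dilation, dilation=dilation)
        self.attn = nn.MultiheadAttention(channels, num_heads=1,
                                          batch_first=True)
        self.window = window

    def forward(self, x):                      # x: (batch, channels, T)
        ff = torch.relu(self.feed_forward(x))
        q = ff.transpose(1, 2)                 # (batch, T, channels)
        T = q.shape[1]
        # mask out attention beyond the local window around each position
        idx = torch.arange(T)
        mask = (idx[None, :] - idx[:, None]).abs() > self.window
        attn_out, _ = self.attn(q, q, q, attn_mask=mask)
        return ff + attn_out.transpose(1, 2)   # residual connection
```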
The dilation rate in the feed-forward layer increases as the local window size increases. The decoder of ASFormer contains a series of decoder blocks. As shown in FIG. 3b, each decoder block contains a feed-forward layer and a cross-attention layer. As in the encoder blocks, dilated temporal convolution is utilized in the feed-forward layer. Unlike in the self-attention layer, the query Q and key K in the cross-attention layer are obtained from the concatenation of the output from the encoder and the output of the previous layer. This cross-attention mechanism generates attention weights that enable every position in the encoder to attend to all positions in the refinement process. In each decoder, a weighted residual connection is utilized for the output of the feed-forward layer and the cross-attention layer:
out = α × cross-attention(feed_forward_out) + feed_forward_out   (2)
where feed_forward_out is the output from the feed-forward layer and α is a weighting parameter. For the study on the Cholec80 surgical video dataset described here, the number of decoders was set to 1 and α was set to 1.
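The sketch below implements the weighted residual of equation (2) in a simplified decoder block, assuming PyTorch. The default α = 1 matches the Cholec80 setting stated above; deriving the shared query/key from a 1×1 convolution over the concatenation of encoder output and previous-layer output is an assumption of this sketch, and ASFormer's exact parameterization may differ.

```python
# Simplified ASFormer-style decoder block with the weighted residual of
# equation (2): out = alpha * cross_attention(ff_out) + ff_out.
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, channels: int, dilation: int, alpha: float = 1.0):
        super().__init__()
        self.alpha = alpha
        self.feed_forward = nn.Conv1d(channels, channels, kernel_size=3,
                                      padding=dilation, dilation=dilation)
        # query/key derived from [encoder output ; previous layer output]
        self.to_qk = nn.Conv1d(2 * channels, channels, kernel_size=1)
        self.attn = nn.MultiheadAttention(channels, num_heads=1,
                                          batch_first=True)

    def forward(self, x, enc_out):             # both: (batch, channels, T)
        ff = torch.relu(self.feed_forward(x))  # feed_forward_out
        qk = self.to_qk(torch.cat([enc_out, ff], dim=1)).transpose(1, 2)
        v = ff.transpose(1, 2)
        cross, _ = self.attn(qk, qk, v)        # cross-attention
        # weighted residual connection, equation (2)
        return self.alpha * cross.transpose(1, 2) + ff
```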
Some applications of the above-described two-stage machine learning-based method for surgical instrument recognition in surgical videos are now described. One application is an indexed surgical video library, in which the recognition results, stored as metadata, allow a surgeon to locate cases by the instruments used in them.
Another application is an AI-based intelligent video search whose keywords can be entered into a dialog box, as shown in the accompanying figures.
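For illustration, a hypothetical sketch of such a keyword search over instrument metadata is given below, assuming each video in the library has already been annotated with the recognized instruments as described above; the metadata layout and field names are illustrative assumptions.

```python
# Hypothetical keyword search over recognition-result metadata.
library = [
    {"video": "case_001.mp4", "instruments": {"hook", "grasper"}},
    {"video": "case_002.mp4", "instruments": {"scissors", "clipper"}},
]

def search(keywords: set[str]) -> list[str]:
    """Return videos whose recognized instruments include all keywords."""
    return [e["video"] for e in library if keywords <= e["instruments"]]

print(search({"hook"}))   # -> ['case_001.mp4']
```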
A third application is an instrument usage documentation and comparison tool having a graphical user interface, for example as shown in the accompanying figures.
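A hedged sketch of how per-frame predictions could back such a documentation tool is given below: each instrument's frame-level presence labels are collapsed into (start, end) time intervals and a total usage time. The thresholded labels and frame rate in the example are illustrative assumptions.

```python
# Collapse per-frame 0/1 instrument presence into usage intervals.
def usage_intervals(presence: list[int], fps: float) -> list[tuple[float, float]]:
    """presence: per-frame 0/1 labels for one instrument -> (start, end) in seconds."""
    intervals, start = [], None
    for i, p in enumerate(presence + [0]):     # sentinel closes a final run
        if p and start is None:
            start = i
        elif not p and start is not None:
            intervals.append((start / fps, i / fps))
            start = None
    return intervals

presence = [0, 0, 1, 1, 1, 0, 1, 1, 0, 0]      # e.g. thresholded hook scores
ivals = usage_intervals(presence, fps=1.0)      # [(2.0, 5.0), (6.0, 8.0)]
total = sum(e - s for s, e in ivals)            # 5.0 seconds of hook usage
print(ivals, total)
```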
The methods described above are for the most part performed by a computer system, which may have a general-purpose processor or other programmable computing device that has been configured, for example in accordance with instructions stored in memory, to perform the functions described herein.
While operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. Moreover, the separation of various system components in the various aspects described in this document should not be understood as requiring such separation in all cases. Only a few implementations and examples are described, and other implementations, enhancements and variations can be made based on what is described and illustrated in this document.
This patent application claims the benefit of U.S. Provisional Patent Application No. 63/357,413, entitled “Surgical Instrument Recognition From Surgical Videos” filed 30 Jun. 2022.
Number | Date | Country
---|---|---
63/357,413 | Jun. 2022 | US