1. Field of the Invention
The present invention relates generally to video analysis. More particularly, it relates to automated video analysis for improving surgical training in laparoscopic surgery.
2. Description of Related Art
Laparoscopic surgery has become popular for its potential advantages of a shorter hospital stay, a lower risk of infection, a smaller incision, etc. Compared with open surgery, laparoscopic surgery requires surgeons to operate in a small space through a small incision while watching a monitor that shows the inside of the body. Hence, a new set of cognitive and motor skills is required of a surgeon. Among other efforts, the Fundamentals of Laparoscopic Surgery (FLS) Program was developed by the Society of American Gastrointestinal and Endoscopic Surgeons to help train qualified laparoscopic surgeons. The key tool used in this program is the FLS Trainer Box, which supports a set of predefined tasks. The box has been widely used in many hospitals and training centers across the country. Although the box has seen wide adoption, its functionality is limited, especially in that it is mostly a passive platform for a trainee to practice on and does not provide any feedback to the trainee during the training process. Senior surgeons may be invited to watch a trainee's performance and provide feedback, but that is a costly option that cannot be readily available whenever the trainee is practicing.
In accordance with an exemplary embodiment, a method of providing training comprises receiving at least one video stream from a video camera observing a trainee's movements, processing the at least one video stream to extract skill-related attributes, and displaying the video stream and the skill-related attributes.
The skill-related attributes may be displayed on a display in real-time.
The method may also include receiving at least one data stream from a data glove and processing the at least one data stream from the data glove to extract skill-related attributes.
The method may also include receiving at least one data stream from a motion tracker and processing the at least one data stream from the motion tracker to extract skill-related attributes.
The extracted attributes may comprise motion features in a region of interest, and the motion features may comprise spatial motion, radial motion, relative motion, angular motion and optical flow.
The step of processing the at least one video stream may utilize a random forest model.
In accordance with another exemplary embodiment, an apparatus for training a trainee comprises a laparoscopic surgery simulation system having a first camera and a video monitor, a second camera for capturing a trainee's hand movement, and a computer for receiving video streams from the first and second cameras. The processor of the computer is configured to apply video analysis to the video streams to extract skill-related attributes.
The apparatus may include kinematic sensors for capturing kinematics of the hands and fingers or may include a motion tracker, such as a data glove.
The skill-related attributes may comprise smoothness of motion and acceleration.
In accordance with an exemplary embodiment, a method of providing instructive feedback comprises decomposing a video sequence of a training procedure into primitive action units, and rating each action unit using expressive attributes derived from established guidelines.
An illustrative video may be selected as a reference from a pre-stored database.
A trainee's practice sessions of the training procedure may be stored. Different trainee practice sessions of the training procedure may be compared.
The feedback may be provided live or offline.
The expressive attributes may be selected from the group consisting of hands synchronization, instrument handling, suture handling, flow of operation and depth perception.
The method may also include the steps of identifying worst action attributes of a trainee, retrieving illustrative video clips relating to the worst action attributes, and presenting the illustrative video clips to the trainee.
In the following detailed description, reference is made to the accompanying drawings, in which are shown exemplary but non-limiting and non-exhaustive embodiments of the invention. These embodiments are described in sufficient detail to enable those having skill in the art to practice the invention, and it is understood that other embodiments may be used, and other changes may be made, without departing from the spirit or scope of the invention. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the invention is defined only by the appended claims. In the accompanying drawings, like reference numerals refer to like parts throughout the various figures unless otherwise specified.
Following is a disclosure of a video-based skill coaching system for the domain of simulation-based surgical training. The system is aimed at providing automated feedback that has the following three features: (i) specific (locating where the errors and defects are); (ii) instructive (explaining why they are defects and how to improve); and (iii) illustrative (providing good examples for reference). Although the focus of the disclosure is on the specific application of simulation-based surgical training, the above features are important to effective skill coaching in general, and thus the proposed method may be extended to other video-based skill coaching applications.
Certain embodiments of the present invention accomplish the following technical tasks, namely action segmentation, action rating, and illustrative video retrieval, and utilize a suite of algorithms developed for addressing these tasks.
A challenge in these tasks is to map computable visual features to semantic concepts that are meaningful to a trainee. Recognizing the practical difficulty of lacking a sufficient amount of exactly labeled data for learning an explicit mapping, we utilize the concept of relative attribute learning for comparing the videos based on semantic attributes designed using domain knowledge.
In an embodiment of the system as shown in
One component of the proposed system in
The original FLS box is only a “pass-through” system without memory. The disclosed system stores a trainee's practice sessions, which can be used to support many capabilities including comparison of different sessions, enabling a trainee to review his/her errors, etc. The system may allow users to register so that their actions are associated with their user identification. One example of a registration screen is shown in
In training mode, a processor in the system may be employed to process the captured streams in real-time and display key skill-related attributes (for example, smoothness of motion and acceleration) on the monitor.
At any time, a user may choose to record a training session by pressing a record button 308. The system may provide a short pause (e.g., 5 seconds) before beginning to record to allow the user to prepare, and a message or other visible alert may be displayed to indicate that recording is in progress. A stop recording button may be provided; for example, the record button may change into a stop button while recording is in progress and revert to a record button when recording is stopped.
Once completed, the training session records are associated with the user and stored for future retrieval.
The system may allow a trainee to compare his/her performance between different practice sessions to provide insights as to how to improve by providing “offline feedback” to the user. This goes beyond simply providing videos from two sessions to the trainee, since computational tools can be used to analyze performance and deliver comparative attributes based on the video.
The system may also allow a user to compare performance against a reference video.
Automatic evaluation of the surgical performance of trainees has been a topic of research for many years. For example, prior work has discussed various aspects of the problem, where the criteria for surgical skill evaluation are mostly based on data streams from kinematic devices, including data gloves and motion trackers. Studies have also reported a high correlation between proficiency level and kinematic measurements. While these studies provide the foundation for building a system for objective evaluation of surgical skills, many of the metrics that have been discussed are difficult to obtain from video streams on an FLS box. In other words, there is no straightforward way of applying those criteria directly to the videos captured in an FLS box, which record the effect of the subject's action inside the box (but not directly the hand movements of the trainee).
In accordance with certain embodiments of the present invention, video analysis may be applied to the FLS box videos to extract the skill-defining features. One visual feature for movement analysis is the computed optical flow. Unfortunately, raw optical flow is not only noisy but also merely a 2-D projection of true 3-D motion, which is more relevant for skill analysis. To this end, we define a latent space for the low-level visual feature with the goal of making the inference of surgical skills more meaningful in that space. Recognizing the fact that both the visual feature and the kinematic measurements arise from the same underlying physical movements of the subject and thus they should be strongly correlated, we employ Canonical Correlation Analysis (CCA) to identify the latent space for the visual data.
With the above system, we collected data for the peg transfer operation (discussed in more detail below) from student participants who had no prior experience with the system (and hence it was reasonably assumed that each participant started as a novice). The data collection lasted for four weeks. In each week, the participants were asked to practice in 3 sessions on different days, and in each session each participant was required to perform the surgical simulation three times consecutively. All subjects were required to attend all the sessions. The synchronized multi-modal streams were recorded for the sessions. The subsequent analysis of the data is based on the recorded streams from 10 participants.
For each subject, the three streams of recorded data in one session are called a record. Due to subject variability, the records may not start with the subjects doing exactly the same action. To alleviate this issue, we first utilized the cycles of the data glove stream to segment each record into sub-records. For the same operation, this turned out to be very effective in segmenting the streams into actions such as “picking up”. This is illustrated in
With the above preparation, we learn the latent space for the video data by applying CCA between the video stream and the motion data stream. Formally, given the extracted HoF feature matrix Sx ∈ Rn×d and the first-order difference of the motion data Sy ∈ Rn×k of all the training records, we use CCA to find the projection matrices wx and wy by maximizing the correlation between the projected streams:

(wx, wy) = argmax (wxT·SxT·Sy·wy) / (√(wxT·SxT·Sx·wx)·√(wyT·SyT·Sy·wy)),    (1)
where n is the number of frames of the video stream or motion stream, d is the dimension of the HoF vector, and k is the dimension of the motion data feature. In the latent space, we care most about the top few dimensions, on which the correlation between the input streams is strongest.
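As a concrete reference for this step, below is a minimal sketch of learning the latent space with an off-the-shelf CCA implementation from scikit-learn. The function and variable names (learn_latent_space, hof_features, motion_data) are illustrative assumptions, not identifiers from the disclosure.

```python
# Minimal sketch of the CCA latent-space learning step described above.
# Assumes per-frame HoF features and a synchronized motion-tracker stream.
import numpy as np
from sklearn.cross_decomposition import CCA

def learn_latent_space(hof_features, motion_data, n_components=5):
    """hof_features: (n, d) matrix Sx; motion_data: (n, k) raw motion stream."""
    # First-order difference of the motion stream gives Sy, as in the text.
    motion_diff = np.diff(motion_data, axis=0)
    hof = hof_features[1:]                      # align lengths after differencing
    cca = CCA(n_components=n_components)        # n_components <= min(d, k)
    cca.fit(hof, motion_diff)
    hof_latent = cca.transform(hof)             # video feature in the latent space
    return cca, hof_latent
```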
To demonstrate that the above approach leads to a feature space appropriate for skill analysis, we carried out several evaluation experiments, as elaborated below. We made the reasonable assumption that the subjects would improve their skill over the four-week period, since they were required to practice in 3 sessions each week. Accordingly, if we compare the data from the first week with that from the last week, we should observe the most obvious difference in skill, if any. In the first experiment, we analyzed the acceleration computed from the records, which reflects the force the subjects applied during the surgical operation. For the video data, the acceleration was computed as the first-order difference of the original feature, and for the motion tracker data, the second-order difference was computed as the acceleration. In implementation, we adopted the root-mean-square (RMS) error between adjacent frames in computing the acceleration.
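The acceleration measure just described can be computed in a few lines of NumPy. The sketch below follows the text (first-order difference for the video feature, second-order difference for the tracker stream, RMS between adjacent frames); the helper name rms_acceleration is our own illustrative choice.

```python
# RMS acceleration per frame: order=1 for the video (latent) feature,
# order=2 for the motion-tracker stream, as described above.
import numpy as np

def rms_acceleration(stream, order):
    """stream: (n_frames, dim) feature matrix; returns one RMS value per frame pair."""
    diff = np.diff(stream, n=order, axis=0)     # n-th order difference over time
    return np.sqrt(np.mean(diff ** 2, axis=1))  # RMS across feature dimensions
```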
We also computed the area under the curves in
Finally, we used a classification framework to demonstrate that the learned latent space supports better analysis of surgical skills. Based on our assumption, we treated the videos captured in the first week as the novice class (Class 1) and those from the last week as the expert class (Class 2). We then used the Bag-of-Words (BoW) model to encode the HoF features for representing the videos. For classification, we experimented with kernel SVM and AdaBoost. We applied leave-one-subject-out cross-validation: we left out both the first- and last-week videos of one subject for testing and used the others for training the classifier. The results of the experiment are summarized in Table 2. The results clearly suggest that the classification accuracy in the latent space was consistently higher than that in the original space, demonstrating that the learned latent space supports better analysis of surgical skills.
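A hedged sketch of this classification experiment is given below: k-means builds the BoW codebook over per-frame HoF features, a kernel SVM separates the two classes, and scikit-learn's LeaveOneGroupOut implements the leave-one-subject-out protocol. The codebook size, kernel choice, and all names here are illustrative assumptions rather than details from the disclosure.

```python
# Sketch of the BoW + kernel-SVM experiment with leave-one-subject-out validation.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import accuracy_score

def bow_encode(videos, codebook_size=64):
    """videos: list of (n_frames, dim) HoF arrays; returns one histogram per video."""
    kmeans = KMeans(n_clusters=codebook_size, n_init=10).fit(np.vstack(videos))
    return np.array([np.bincount(kmeans.predict(v), minlength=codebook_size)
                     for v in videos], dtype=float)

def leave_one_subject_out_accuracy(videos, labels, subjects):
    X, y = bow_encode(videos), np.asarray(labels)
    scores = []
    for train, test in LeaveOneGroupOut().split(X, y, groups=subjects):
        clf = SVC(kernel="rbf").fit(X[train], y[train])   # kernel SVM classifier
        scores.append(accuracy_score(y[test], clf.predict(X[test])))
    return float(np.mean(scores))
```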
There is a set of standard operations defined for the FLS training system. For clarity of presentation, the subsequent discussion (including experiments) will focus on only one operation, termed "Peg Transfer" (illustrated in
The Peg Transfer operation consists of several primitive actions, or 'therbligs,' which serve as building blocks of manipulative surgical activities and are defined in Table 3. Ideally, these primitive actions are all necessary in order to finish one peg-transfer cycle. Since there are six objects to transfer from left to right and back, there are 12 cycles in total in one training session. Our experiments are based on video recordings (
Further details regarding the algorithms of the disclosed method for video-based skill coaching will now be provided. Suppose that a user has just finished a training session on the FLS box and a video recording is available for analysis. The system needs to perform the three tasks discussed above in order to deliver automated feedback to the user.
In the following sub-sections, we elaborate the components of the disclosed approach, organizing our presentation by the three tasks of action segmentation, action rating, and illustrative video retrieval.
As Table 3 suggests, the videos we consider should exhibit predictable motion patterns arising from the underlying actions of the human subject. Hence we adopt the hidden Markov model (HMM) for the segmentation task.
This allows us to incorporate domain knowledge into the transition probabilities, e.g., the lift action is followed by itself or by the loaded move with high probability. We assume that each state of the HMM represents a primitive action. The task of segmentation is then to find the optimal state path for the given video, assuming a given HMM. This can be done with the well-known Viterbi algorithm, and thus our discussion will focus on three new algorithmic components we designed to address several practical difficulties unique to our application: noisy video data, especially due to occlusion (among the tools and objects) and reflection; limited training videos with labels; and unpredictable erroneous actions breaking the normal pattern (frequent with novice trainees).
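For reference, a minimal Viterbi decoder consistent with this formulation is sketched below: states are the primitive actions, the per-frame emission probabilities come from the frame-level classifier described later, and the transition matrix carries the domain knowledge. This is a generic textbook implementation, not code taken from the disclosure.

```python
# Generic Viterbi decoding of the most likely action (state) path for a video.
import numpy as np

def viterbi(emission_probs, transition, initial):
    """emission_probs: (T, S) per-frame state probabilities; transition: (S, S); initial: (S,)."""
    T, S = emission_probs.shape
    log_e = np.log(emission_probs + 1e-12)
    log_a = np.log(transition + 1e-12)
    delta = np.log(initial + 1e-12) + log_e[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_a               # score of reaching each state from each predecessor
        back[t] = np.argmax(scores, axis=0)           # best predecessor per state
        delta = scores[back[t], np.arange(S)] + log_e[t]
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(T - 1, 0, -1):                     # backtrack the optimal state path
        path[t - 1] = back[t, path[t]]
    return path
```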
Since the FLS box is a controlled environment with strong color differences among several object classes, i.e., background, objects to move, pegs, and tools, we can use a random forest (RF) to obtain the label probability Pl(x), 1 ≤ l ≤ L, for each pixel x based on its color, where L is the number of classes to consider. The color segmentation result is obtained by assigning each pixel the label of highest probability. Based on the color segmentation result, we extract the tool tips and orientations of the two graspers controlled by the left and right hands. Since all surgical actions occur in the region around the grasper tips, this region is defined as the region of interest (ROI) to filter out irrelevant background. We detect motion by image frame differencing. Based on a comparison with the distribution of the background region, we estimate the probability that x belongs to a moving area, which is denoted as M(x).
Under the assumption of independence between the label and motion, M(x)·Pl(x) is the joint distribution of motion and object label, which is deemed important for action recognition. In effect, the multiplication with M(x) suppresses the static cluttered background in the ROI so that only the motion information of interest is retained. This is illustrated in
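The sketch below illustrates one plausible way to assemble this object motion distribution descriptor from the quantities just defined: per-pixel class probabilities Pl(x) from a color classifier, a motion probability M(x) from frame differencing, and pooling of their product inside the ROI. The Gaussian-style mapping from frame difference to M(x) and the pooling step are our own assumptions, not details taken from the disclosure.

```python
# Illustrative construction of the object motion distribution descriptor M(x)*P_l(x).
import numpy as np

def motion_distribution_descriptor(frame, prev_frame, roi, color_forest, sigma_bg=5.0):
    """frame, prev_frame: HxWx3 images; roi: boolean HxW mask around the grasper tips;
    color_forest: per-pixel color classifier with predict_proba (e.g., a random forest)."""
    h, w, _ = frame.shape
    pixels = frame.reshape(-1, 3).astype(float)
    P_label = color_forest.predict_proba(pixels).reshape(h, w, -1)    # P_l(x) per class
    diff = np.abs(frame.astype(float) - prev_frame.astype(float)).mean(axis=2)
    M = 1.0 - np.exp(-0.5 * (diff / sigma_bg) ** 2)                   # motion probability M(x)
    joint = M[..., None] * P_label                                    # M(x) * P_l(x)
    joint[~roi] = 0.0                                                 # keep only the ROI
    return joint.sum(axis=(0, 1)) / max(int(roi.sum()), 1)            # pooled per-class descriptor
```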
Different observation models have been proposed for HMMs, including the multinomial distribution (for discrete observations only) and Gaussian mixture models. These have been shown to be successful in some applications such as speech recognition, but they have some deficiencies for noisy video data. In certain embodiments, we use a random forest as our observation model. A random forest is an ensemble classifier consisting of a set of decision trees, and its output is based on majority voting of the trees in the forest. We train a random forest for frame-level classification and then use the output of the random forest as the observation of the HMM states. Assume that there are N trees in the forest and ni decision trees assign Label i to the input frame; then we can view the random forest as choosing Label i with probability ni/N, which can be taken as the observation probability for State i.
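A minimal sketch of this observation model using scikit-learn is shown below; predict_proba averages the class probabilities over the trees, which plays the role of the vote fraction ni/N described above. The function names are illustrative assumptions.

```python
# Random-forest observation model: per-frame class probabilities feed the HMM.
from sklearn.ensemble import RandomForestClassifier

def train_observation_model(frame_features, frame_labels, n_trees=100):
    """frame_features: (T, dim) per-frame descriptors; frame_labels: (T,) action labels."""
    forest = RandomForestClassifier(n_estimators=n_trees)
    return forest.fit(frame_features, frame_labels)

def observation_probs(forest, frame_features):
    # Averaged tree outputs stand in for the vote fraction n_i / N per state.
    return forest.predict_proba(frame_features)      # shape (T, number of states)
```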
When the state is observable, the transition probability from State i to State j can be computed as the ratio of the number of (expected) transitions from State i to State j to the total number of transitions out of State i. One potential issue with this method, however, is that in video segmentation we have limited training data and, even worse, the number of transitions among different states, i.e., the number of boundary frames, is typically much smaller than the total number of frames of the video. This results in a transition probability matrix whose off-diagonal elements are near zero and whose diagonal elements are almost one. Such a transition matrix degrades the benefit of using an HMM for video segmentation, namely enforcing the desired transition pattern in the state path.
In certain embodiments, we use a Bayesian approach for estimating the transition probability, employing the Dirichlet distribution, which enables us to combine the domain knowledge with the limited training data for the transition probability estimation. The model is shown in
Assuming that αi (with Σj αi(j) = 1) encodes our domain knowledge of the transition probabilities from State i to all states, we can draw the transition probability vector πi as:
πi ∼ dir(ρ·αi)    (2)
where dir denotes the Dirichlet distribution, a distribution over distributions, and ρ represents our confidence in the domain knowledge. The Dirichlet distribution always outputs a valid probability distribution, i.e., Σj πi(j) = 1.
Given the transition probability πi, the counts of transitions from State i to all states, denoted ni, follow a multinomial distribution:

ni ∼ Mult(πi)    (3)
Because the Dirichlet and multinomial distributions form a conjugate pair, the posterior of the transition probability is obtained by simply combining the transition counts among states with the domain knowledge (prior) as
πi ∼ dir(ni + ρ·αi)    (4)
When there are not enough training data, i.e., Σj ni(j) ≪ ρ, πi is dominated by αi, i.e., our domain knowledge; as more training data become available, πi approaches the empirical transition counts and the variance of πi decreases.
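The posterior-mean estimate implied by Eqn. (4) can be computed directly from the transition counts and the prior, as in the short sketch below; the function and variable names are illustrative.

```python
# Bayesian transition-matrix estimate: Dirichlet prior (rho * alpha) plus observed counts.
import numpy as np

def estimate_transitions(state_paths, alpha, rho):
    """state_paths: list of integer label sequences; alpha: (S, S) prior with rows summing to 1;
    rho: confidence in the domain knowledge."""
    S = alpha.shape[0]
    counts = np.zeros((S, S))
    for path in state_paths:
        for a, b in zip(path[:-1], path[1:]):          # accumulate the counts n_i(j)
            counts[a, b] += 1
    posterior = counts + rho * alpha                   # Dirichlet posterior parameters (Eqn. (4))
    return posterior / posterior.sum(axis=1, keepdims=True)   # posterior-mean transition matrix
```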
Segmenting the video into primitive action units only provides the opportunity to pin-point an error in the video; the natural next task is to evaluate the underlying skill in an action clip. As discussed previously, high-level and abstract feedback such as a numeric score does not enable a trainee to take corrective actions. In this work, we define a set of attributes as listed in Table 4 and design an attribute learning algorithm for rating each primitive action with respect to the attributes. With this, the system is able to expressively inform a trainee what is wrong in his action clip, since the attributes in Table 4 are all semantic concepts used in existing human-expert-based coaching (and thus they are well understood).
In order to cope with the practical difficulty of lacking detailed and accurate labels for the action clips, we propose to use relative attribute learning for the task of rating the clips. In this setting, we only need relative rankings of the clips with respect to the defined attributes, which are easier to obtain. Formally, for each action, we have a dataset {Vj, j=1, . . . , N} of N video clips with a corresponding feature vector set {vj}. There are K attributes in total, defined as {Ak, k=1, . . . , K}. For each attribute Ak, we are given a set of ordered pairs of clips Ok = {(i, j)} and a set of un-ordered pairs Sk = {(i, j)}, where (i, j) ∈ Ok means Vi exhibits better skill in terms of attribute Ak than Vj (i.e., Vi > Vj) and (i, j) ∈ Sk means Vi and Vj have similar strength of Ak (i.e., Vi ∼ Vj).
In relative attribute learning, the attribute Ak is computed as a linear function of the feature vector v:

rk(v) = wkT·v,    (5)
where the weight wk is trained under a quadratic loss function with penalties on the pairwise constraints in Ok and Sk. The cost function is similar to that of the SVM classification problem, but operates on pairwise difference vectors:

minimize ∥wk∥₂² + C·(Σ εi,j² + Σ γi,j²)
s.t. wkT·(vi − vj) ≥ 1 − εi,j, ∀(i, j) ∈ Ok,    (6)
|wkT·(vi − vj)| ≤ γi,j, ∀(i, j) ∈ Sk,
εi,j ≥ 0; γi,j ≥ 0,
where C is the trade-off constant to balance maximal margin and pairwise attribute order constraints. The success of an attribute function depends on both a good weight wk and a well-designed feature v.
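For illustration, the sketch below trains one attribute ranker by minimizing a smooth surrogate of Eqn. (6): a squared hinge on the ordered pairs and a squared penalty on the similar pairs, optimized with plain gradient descent. This is a simplified stand-in for a dedicated rank-SVM solver; the hyper-parameters and names are our assumptions.

```python
# Simplified relative-attribute ranker: learns w_k so that r_k(v) = w_k . v respects the pair orderings.
import numpy as np

def train_attribute_ranker(features, ordered_pairs, similar_pairs, C=1.0, lr=1e-3, iters=2000):
    """features: (N, d) clip features; ordered_pairs/similar_pairs: lists of (i, j) index tuples."""
    d_ord = np.array([features[i] - features[j] for i, j in ordered_pairs])
    d_sim = np.array([features[i] - features[j] for i, j in similar_pairs])
    w = np.zeros(features.shape[1])
    for _ in range(iters):
        grad = 2.0 * w                                        # gradient of ||w||^2
        if len(d_ord):
            slack = np.maximum(0.0, 1.0 - d_ord @ w)          # squared-hinge slack on ordered pairs
            grad -= 2.0 * C * (slack[:, None] * d_ord).sum(axis=0)
        if len(d_sim):
            grad += 2.0 * C * ((d_sim @ w)[:, None] * d_sim).sum(axis=0)
        w -= lr * grad
    return w                                                  # attribute score: r_k(v) = w @ v
```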
The features used for attribute learning are outlined below. First, we extract several motion features in the region of interest (ROI) around each grasper tip, as summarized in Table 5. Then auxiliary features are extracted as defined in Table 6. These features, together with the execution time, are combined to form the feature vector for each action clip.
(Fragment of Table 5: motion-feature expressions extracted in the ROI, including dx(t)/dt, r(t), dx̂(t)/dt, r̂(t), ∇×m, and m/∥m∥₂.)
With the above preparation, the system will retrieve an illustrative video clip from a pre-stored dataset and present it to a trainee as a reference. Since this is done on a per-action basis and with explicit reference to the potentially lagging attributes, the user can learn from watching the illustrative clip to improve his skill. With K attributes, a clip Vi can be characterized by a K-dimensional vector [αi,1, . . . , αi,K], where αi,k = rk(vi) is the k-th attribute value of Vi based on its feature vector vi. The attribute values of all clips (of the same action) {Vj, 1 ≤ j ≤ N} in the dataset form an N×K matrix A whose column vector αk contains the k-th attribute value of each clip. Similarly, from a user's training session, for the same action under consideration, we have another set of clips {V′i, 1 ≤ i ≤ M} with a corresponding M×K attribute matrix A′ whose column vector α′k contains the user's k-th attribute values in the training session.
The best illustration video clip V*j is selected from dataset {Vj} using the following criteria:
V*
j=argmaxjΣkI(α′k;A′,αk)·U(αj,k,α′k;αk), (7)
where I(α′k; A′, αk) is the attribute importance of Ak for the user, which is introduced to assess the user in the context of his current training session and the performance of other users on the same attribute in the given dataset, and U(αj,k, α′k; αk) is the attribute utility of video Vj on Ak for the user, which is introduced to assess how helpful a video Vj may be for the user on a given attribute. The underlying idea of (7) is that a good feedback video should have high utility on important attributes. We elaborate these concepts below.
Attribute importance is the importance of an attribute Ak for a user's skill improvement. According to the "bucket effect," how much water a bucket can hold depends not on its tallest stave but on its shortest one; accordingly, a skill attribute with a lower performance level should have a higher importance. We propose to measure the attribute importance of Ak from a user's relative performance level in two respects. The first is the position of the user's attribute performance (α′k) within the distribution of attribute values from people of different skill levels, whose cumulative distribution function is Fk(α) = P(αk ≤ α). Since each element of αk is a sample of Ak over people of random skill levels, we can estimate Fk(α) from αk as a Normal distribution. The performance level of any attribute value αk of Ak is then 1 − Fk(αk). Since each element in α′k is a sample of Ak from one of the user's performances, the relative performance level of the user on Ak in the context of αk is defined as:
I(α′k; αk) = 1 − Fk(μ′k) ∈ [0, 1]    (8)

where μ′k is the mean value of α′k and Fk(α) is the Normal cumulative distribution estimated from αk. Since there are K attributes in total, the importance of Ak should be further considered relative to the performance on the other attributes (A′). The final attribute importance of Ak is:

I(α′k; A′, αk) = I(α′k; αk) / Σl=1,…,K I(α′l; αl) ∈ [0, 1]    (9)
Attribute utility is the effectiveness of a video Vj for a user's skill improvement on attribute Ak. It can be measured by the difference between Vj's attribute value αj,k and a user's attribute performance α′k on Ak. Since the dynamic range of Ak may vary across attributes, some normalization may be necessary. Our definition is:
U(αj,k,α′k;αk)=(Fk(αj,k)−Fk(μ′k))/(1−Fk(μ′k)) (10)
With the above attribute analysis, the system picks the 3 worst action attributes whose absolute importance is above a threshold of 0.4, which means that more than 60 percent of the pre-stored action clips are better than the trainee on such an attribute. If all attribute importance values are below the threshold, we simply select the worst one. With the selected attributes, we retrieve the illustrative video clips, inform the trainee of the attributes on which he performed poorly, and direct him to the illustrative videos. This process is conceptually illustrated in
It is worth noting that, in the above process of retrieving an illustrative video, the defined concepts are dataset dependent. That is, the importance and utility values of an attribute depend on the given dataset. In practice, the dataset could be a local database captured and updated frequently in a training center, or a fixed standard dataset, and thus the system allows the setting of some parameters (e.g., the threshold 0.4) based on the nature of the database.
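Putting Eqns. (7) through (10) together, the sketch below scores every pre-stored clip of an action for a given user and returns the index of the best illustrative clip. The Normal CDF estimate follows the text; the epsilon guards and all names are our own illustrative choices.

```python
# Illustrative-clip selection combining attribute importance (Eqns. (8)-(9)) and utility (Eqn. (10)).
import numpy as np
from scipy.stats import norm

def select_illustrative_clip(A, A_user):
    """A: (N, K) attribute values of stored clips; A_user: (M, K) attribute values of the user's clips."""
    mu, sigma = A.mean(axis=0), A.std(axis=0) + 1e-12
    F_user = norm.cdf(A_user.mean(axis=0), mu, sigma)          # F_k(mu'_k) for each attribute
    importance = 1.0 - F_user                                  # Eqn. (8)
    importance /= importance.sum()                             # Eqn. (9)
    F_clips = norm.cdf(A, mu, sigma)                           # F_k(a_{j,k}) for each stored clip
    utility = (F_clips - F_user) / (1.0 - F_user + 1e-12)      # Eqn. (10)
    scores = (importance * utility).sum(axis=1)                # Eqn. (7)
    return int(np.argmax(scores))                              # index of the best clip V*_j
```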
Experiments have been performed using realistic training videos capturing the performance of resident surgeons in a local hospital during their routine training on the FLS platform. For evaluating the proposed methods, we selected six representative training videos, two for each of the three skill levels: novice, intermediate, and expert. Each video is a full training session consisting of twelve Peg Transfer cycles. Since each cycle should contain each of the primitive actions defined previously (Table 3), there are a total of 72 video clips for each primitive action. The exact frame-level labeling (which action each frame belongs to) was manually obtained as the ground truth for segmentation. For each primitive action, we randomly selected 100 pairs of video clips and then manually labeled them by examining all the attributes defined in Table 4 (this process manually determines which video in a given pair should exhibit better skill according to a given attribute).
Our action segmentation method consists of two steps. First, we use the object motion distribution descriptor and the random forest to obtain an action label for each frame. Then the output of the random forest (the probability vector instead of the action label) is used as the observation of each state in an HMM, and the Viterbi algorithm is used to find the best state path as the final action recognition result. The confusion matrices of the two recognition steps are presented in Table 7. It can be seen that the frame-based recognition result is already high for some actions (illustrating the strength of our object motion distribution descriptor), but overall the HMM-based method gives much-improved results, especially for actions L and P. The relatively low accuracy for actions L and P is mainly due to the trainees' unsmooth operation, which caused many unnecessary stops and moves that are hard to distinguish from UM and LM. We also present the recognition accuracy for each video in Table 8, which indicates that, on average, better segmentation was obtained for subjects with better skills. This also supports the observation that various unnecessary moves and errors by novices are the main difficulty for this task. All the above recognition results were obtained from 6-fold cross-validation with 1 video left out for testing. A comparative illustration of segmentation is also given (
Validity is an important characteristic in skill assessment; it refers to the extent to which a test measures the trait that it purports to measure. The validity of our learned attribute evaluator can be measured by its classification accuracy on attribute order. Based on the cost function in Eqn. (6), we take the order of attribute Ak between video pair Vi and Vj as Vi > Vj (or Vj > Vi) if wkT·(vi − vj) ≥ 1 (or ≤ −1), and Vi ∼ Vj if |wkT·(vi − vj)| < 1. The classification accuracy of each attribute is derived by 10-fold cross-validation on the 100 labeled pairs of each primitive action, as given in Table 9. The good accuracy in the table demonstrates that our attribute evaluator, albeit learned only from relative information, has high validity. In this experiment, only 3 primitive actions were considered, i.e., L, T, and P, since they are the main operation actions and the other LM and UM actions are just preparation for the operation. Also, some attributes are ignored for some actions because they are inappropriate for skill assessment of those actions; these correspond to the "N/A" entries in Table 9.
We compared our video feedback method (Eqn. (7)) with a baseline method that randomly selects one expert video clip of the primitive action. The comparison protocol is as follows. We recruited 12 subjects who had no prior knowledge of the dataset. For each testing video, we randomly selected one action clip for each primitive action. Then, for each attribute, one feedback video was obtained by either our method or the baseline method. The subjects were asked to select which one is the better instruction video for skill improvement for the given attribute. The subjective test results are summarized in Table 10, which shows that people found our feedback better than or comparable to the baseline feedback in 77.5% of cases. The satisfaction rate is as high as 83.3% and 80% for hand synchronization and suture handling, respectively, which shows that our attribute learning scheme has high validity for these two attributes. This is also consistent with the cross-validation results in Table 9. The result is especially satisfactory since the baseline method already employs an expert video (and thus our method is able to tell which expert video clip is more useful to serve as an illustrative reference).
This application claims priority to U.S. Provisional Application No. 61/761,917 filed Feb. 7, 2013, the entire contents of which is specifically incorporated by reference herein without disclaimer.
This invention was made with government support under Grant No. IIS-0904778 awarded by the National Science Foundation. The government has certain rights in the invention.