A field of the invention concerns interactive gesture acquisition systems. Example applications of the invention include cloud based training systems that compare user gestures at a user device, such as a mobile handset, to training representations, e.g. training avatars. Such systems can be useful for training users to conduct sports related or artistic movement related activities, or can be used to guide users in physical therapy related movements.
Physical therapy is a widely used type of rehabilitation in the treatment of many diseases. Normally, patients are instructed by specialists in physical therapy sessions and then expected to perform the activities at home, in most cases following paper instructions and figures they are given in the sessions. Useful feedback about at-home performance is unavailable and patients therefore have no idea how to improve their training without the supervision of professional physical therapists. To address this problem, some automatic training systems have been created to evaluate people's performance against standard or expected performance.
Some training systems provide virtual instructors generated by computing resources that are presented to a user via a user device, such as a computer, handset, game system or the like. User gestures are acquired by the end device and data about gestures is provided to the computing system. Systems can evaluate user performance based upon comparing sensed gestures to idealized movements. Various difficulties are encountered in attempting to match acquired gesture data to virtual or ideal models, and many systems fail to address the resulting mismatch error.
One approach for addressing such mismatch is provided by D. S. Alexiadis, et al., “Evaluating a dancer's performance using kinect-based skeleton tracking,” in Proc. of the 19th ACM international conference on Multimedia (MM'11), Scottsdale, November, 2011. This approach uses a Maximum Cross Correlation (MCC) algorithm, which assumes a constant shift between the standard/expected motion sequence and the user's motion sequence.
Another approach is provided by A. Yurtman, and B. Barshan, “Detection and evaluation of physical therapy exercises by dynamic time warping using wearable motion sensor units,” Information Sciences and Systems (SIU'14), Trabzon, April, 2014. This approach pre-defines a number of correct and incorrect templates and judges user performance by finding the best match of the user's execution among these templates.
One group proposed using the marker-based optical motion capture system Vicon and proved its effectiveness in gait analysis on subjects with hemiparesis caused by stroke. A. Mirelman, B. L. Patritti, P. Bonato, and J. E. Deutsch, “Effects of virtual reality training on gait biomechanics of individuals post-stroke,” Gait & posture, 31.4 433-437; (2010). Others demonstrated that the Microsoft Kinect sensor can provide high accuracy and convenient detection of the human skeleton compared with wearable devices. C. Y. Chang, et al., “Towards pervasive physical rehabilitation using Microsoft Kinect,” Pervasive Computing Technologies for Healthcare (PervasiveHealth'12), San Diego (May, 2012). Others developed a game-based rehabilitation system using Kinect for balance training. B. Lange, et al., “Development and evaluation of low cost game-based balance rehabilitation tool using the Microsoft Kinect sensor,” Engineering in Medicine and Biology Society (EMBC'11), Boston, (September, 2011).
The Maximum Cross Correlation (MCC) computes the time shift between the standard/expected motion sequence and the user's motion sequence. D. S. Alexiadis, et al., “Evaluating a dancer's performance using kinect-based skeleton tracking,” in Proc. of the 19th ACM international conference on Multimedia (MM'11), Scottsdale, (November, 2011). In this MCC technique, the user's motion sequence is shifted by the estimated time shift, the two sequences are aligned and their similarity is then calculated. For two discrete-time signals f and g, their cross correlation Rf,g(n) is given by:
Rf,g(n)=Σm f(m)g(m+n) (1)
and the time shift τ of the two sequences is estimated as the position of maximum cross correlation:
τ=argmaxn Rf,g(n) (2)
In the MCC process, when the lengths of the two sequences are very close, shifting one sequence by the estimated delay τ can align them and their similarity can be calculated. The present inventors have determined, however, that this MCC method merely calculates the overall delay for the entire sequence once it is complete (and off-line) and cannot address the problem of variant human reaction delay and network delay.
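The MCC estimate discussed above can be sketched in a few lines. The following is an illustrative pure-Python version; the function names and the example signals are ours, not the cited authors' code:

```python
def cross_correlation(f, g, n):
    """R_{f,g}(n): sum of f(m)*g(m+n) over valid m, per eq. (1)."""
    return sum(f[m] * g[m + n] for m in range(len(f)) if 0 <= m + n < len(g))

def mcc_shift(f, g, max_shift):
    """Estimated delay tau = argmax_n R_{f,g}(n), per eq. (2)."""
    return max(range(-max_shift, max_shift + 1),
               key=lambda n: cross_correlation(f, g, n))

# A motion signal delayed by two frames is recovered as tau = 2:
f = [0, 0, 1, 2, 3, 2, 1, 0, 0, 0]
g = [0, 0, 0, 0, 1, 2, 3, 2, 1, 0]  # same motion, started 2 frames later
```

As the sketch makes plain, a single shift `mcc_shift(f, g, 4)` applies to the whole sequence, which is exactly why the MCC approach cannot model a delay that varies during the exercise.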
An application of dynamic time warping (DTW), normally applied to speech recognition, was proposed to align movement data where the movement data was acquired with discrete wearable sensors. See, A. Yurtman, and B. Barshan, “Detection and evaluation of physical therapy exercises by dynamic time warping using wearable motion sensor units,” Information Sciences and Systems (SIU'14), Trabzon, April, 2014. This approach involved finding the best match of the user's execution among some correct and incorrect templates to judge the user's performance and provide an indication of the type of errors committed. The need for templates and the need to work off-line after receiving a complete set of data, as in the other approaches above, limits the usefulness of this approach.
More recently, cloud based training systems have been proposed. One cloud based system is proposed by Dennis Shen, Yao Lu and Sujit Dey, “Motion Data Alignment and Real-Time Guidance in Avatar Based Physical Therapy Training System,” in Proceedings of IEEE International Conference on E-health Networking, Application & Services (Healthcom), October 2015, Boston. This system enables a user to be trained by following a pre-recorded avatar instructor and getting real-time guidance using a mobile device through a wireless network. While matching is addressed, there is no attempt to address network latency and mismatches caused by network delays. This limits the accuracy of the technique.
The present inventors have identified the failure to address network delays during matching as a problem, and have also identified human induced delay as an issue to address. Difficulties in these types of systems include latencies. One type of latency is human reaction to a virtual instructor. Another type of latency includes data acquisition and transmission delays, which can be referred to as network delays. Inconsistency in the amount of the two types of delays causes difficulties in evaluating user performance because it is difficult to align the user's acquired gesture motion data with the virtual instructor motion data.
An embodiment of the invention is a server for virtual training that transmits avatar video data through a network for use by a display device for displaying a virtual trainer and receives user data generated by a gesture acquisition device for obtaining user responsive gesture data. The server includes a processor running code that resolves user gestures in view of network and user latencies. The code aligns subsequences in the user responsive gesture data with subsequences in the avatar video data and generates correction data to send through the network for use by the display device. The correction data can be generated and sent through the network in real time for display by the display device. The correction data can be avatar video data and/or text. The code preferably aligns subsequences via modified dynamic time warping. The modified dynamic time warping comprises pre-processing to first align two starting points by shifting a subsequence in the user responsive gesture data by a constant to align with a first point in a subsequence of the avatar video data and produce pre-processed data. The subsequences in the user responsive gesture data and the subsequences in the avatar video data can correspond to individual physical gestures in a sequence of physical gestures or can correspond to a predetermined number of frames.
The code preferably determines an optimal warping path for the preprocessed data and then applies the optimal path to subsequences in the user responsive gesture data and the avatar video data. The code preferably determines an optimal endpoint of user responsive gesture data as the frame of the data that leads to the best match between subsequences in the user responsive gesture data and the avatar video data and provides the minimum dynamic time warping distance. The code preferably estimates a global minimum point by detecting movement transition data, determining a local minimum point for a subsequence of data between movement transition data, and then testing for a global minimum over a number of following frames via calculation of warping distances. The code further preferably estimates dynamic time warping distances for subsequent frames and calculates an error vector between these estimated warping distances and the true warping distances for the subsequent frames. The code can determine a global minimum when the error vector is less than a predetermined threshold.
In preferred embodiments, the code calculates two dynamic time warp vectors to test each local minimum point in subsequences. The two vectors include a true dynamic time warp distance vector and an estimated dynamic time warp distance vector, and the code assigns a global minimum point when the true dynamic time warp distance vector and the estimated dynamic time warp distance vector are within a predetermined error range.
A preferred system of the invention includes a server and a client device. The client device includes a video encoder for encoding the avatar video data, the display device for displaying the virtual trainer, a gesture acquisition device for sensing user movements, and a network interface for receiving the avatar video data and transmitting the user responsive gesture data to the server.
A preferred method for aligning avatar video data with user responsive gesture data includes dividing the user responsive gesture data into subsequences by testing for local minimums in a subsequence of frames and calculating warping distances, and then testing subsequent frames to find an estimated global minimum that meets a predetermined error threshold range. Dynamic time warping is performed on subsequences in the user responsive data with subsequences in the avatar video data. Correction data is generated from the warping. Preferably, preprocessing is conducted on the user responsive gesture data by aligning the starting points of subsequences in the user responsive gesture data and the avatar video data.
An embodiment of the invention is a system for virtual training that includes a display device for displaying a virtual trainer, a gesture acquisition device for obtaining user responsive gestures, communications for communicating with a network and a processor that resolves user gestures in view of network and user latencies. Code run by the processor addresses reaction time and network delays by assessing each user gesture independently and correcting gestures individually for comparison against the training program. Errors detected in the user's performance can be corrected with feedback generated automatically by the system.
Preferred embodiment systems overcome at least two limitations in current remote training and physical therapy technologies. Presently, there exist systems which enable a remote user to follow along with a virtual therapist, repeating movements that are designed to improve strength and/or mobility. The challenge, however, is in assessing the quality and accuracy of the user's movements. Incorporating motion capture feedback, such as a Microsoft Kinect®, can provide information to the therapist as to the movements attempted by the user. Delays however, representing both user reaction time and network delays, can skew the user's data to appear out of alignment with the virtual therapist. This may cause the user to produce unsatisfactory therapy scores even though they are correctly performing the maneuvers. Likewise, incorrect feedback from the user makes it impossible to provide corrective suggestions by the therapy program.
Systems and methods of the invention correct acquired data to adjust for both the human reaction time delay and any network variability to correct the user's data prior to matching it against the therapy program. By accounting for the two forms of delay, the system allows the user's performance to be scored against the virtual therapy and corrective instructions can be sent back to the user as needed. The system can be implemented over a cloud based network to improve performance across end-user devices.
Preferred systems and methods of the invention provide gesture-based dynamic time warping to address both human reaction delay latencies and network delay latencies. The present methods and systems evaluate the user's performance and segment gestures, as well as provide detailed textual/visual guidance in real time. Compared to the approach of D. S. Alexiadis, et al., “Evaluating a dancer's performance using kinect-based skeleton tracking,” in Proc. of the 19th ACM international conference on Multimedia (MM'11), Scottsdale, (November, 2011), systems of the invention can align the user's and the avatar instructor's motion data with inconstant human reaction delay and network delay. Compared to A. Yurtman, and B. Barshan, “Detection and evaluation of physical therapy exercises by dynamic time warping using wearable motion sensor units,” Information Sciences and Systems (SIU'14), Trabzon, (April, 2014), methods and systems of the invention do not need any pre-recorded error template to evaluate the user's performance. Systems of the invention can operate online in real time and provide real-time guidance for the user, while these prior systems can only be applied offline once the entire motion sequence of the user is obtained. Unlike the cloud-based training system of Dennis Shen, Yao Lu and Sujit Dey, “Motion Data Alignment and Real-Time Guidance in Avatar Based Physical Therapy Training System,” in Proceedings of IEEE International Conference on E-health Networking, Application & Services (Healthcom), Boston, (October 2015), the present invention addresses network delay caused by the wireless network and human reaction delay.
A preferred system and method conducts dynamic gesture based time warping. Sequences are rescaled on a time axis to provide a best match via a warping path. However, this is not done directly. Preprocessing first finds an optimal path for comparison by aligning starting points prior to warping. Real time gesture segmentation is conducted with an estimation of a global minimum determination. Nonlinear rescaling and accuracy testing can be conducted.
Preferred systems and methods of the invention have the ability to effectively and efficiently train people for different types of physical therapy tasks like knee rehabilitation, shoulder stretches, etc. Real-time guidance rather than mere scores can be provided, which allows a user to adjust to the guidance and better accomplish the recommended therapy movements. The systems and methods of the invention thereby adapt to the abilities of the user and can react to the user's performance by dynamically determining the necessary adjustments to establish optimal conditions.
Methods and systems account for human reaction delay (user delay to follow avatar instructions/motion) and mobile network delay (which may delay when the cloud rendered avatar video reaches the user device) and correctly calculate the accuracy of the user's movement compared to the avatar instructor's movement. Misalignment is accounted for and corrected. In particular, the delay may cause the two motion sequences to be misaligned with each other and make it difficult to judge whether the user is following the avatar instructor correctly or not. A dynamic time warping based algorithm addresses the motion data misalignment problem. While not bound to the theory, to the knowledge of the inventors, there have been no prior methods that utilize dynamic time warping to determine alignment between frames of a training video and user sensed movement. Yurtman et al. require templates and off-line analysis. Preferred methods of the invention also apply a gesture based dynamic time warping algorithm to segment the gestures among the whole motion sequence to enable real-time visual guidance to the user.
Experiments have demonstrated a prototype avatar based real-time guidance system in accordance with the invention using mobile network profiles. The experimental results show the performance advantage of the present systems and methods over other evaluation methods, and the ability of the present methods and systems to conduct real-time cloud-based mobile virtual training and guidance.
Those knowledgeable in the art will appreciate that embodiments of the present invention lend themselves well to practice in the form of computer program products. Accordingly, it will be appreciated that embodiments of the present invention may comprise computer program products comprising computer executable instructions stored on a non-transitory computer readable medium that, when executed, cause a computer to undertake methods according to the present invention, or a computer configured to carry out such methods. The executable instructions may comprise computer program language instructions that have been compiled into a machine-readable format. The non-transitory computer-readable medium may comprise, by way of example, a magnetic, optical, signal-based, and/or circuitry medium useful for storing data. The instructions may be downloaded entirely or in part from a networked computer. Also, it will be appreciated that the term “computer” as used herein is intended to broadly refer to any machine capable of reading and executing recorded instructions. It will also be understood that results of methods of the present invention may be displayed on one or more monitors or displays (e.g., as text, graphics, charts, code, etc.), printed on suitable media, stored in appropriate memory or storage, etc.
Preferred embodiments of the invention will now be discussed with respect to the drawings. The drawings may include schematic representations, which will be understood by artisans in view of the general knowledge in the art and the description that follows. Features may be exaggerated in the drawings for emphasis, and features may not be to scale.
In an experimental system according to
In the experimental system, a Microsoft Kinect as the movement sensor 24 captures twenty joints of the user, with an x, y, z component of movement for each joint. For a given exercise, some specific body parts might be deemed important and the system can select such important body parts. For frame i, the system 10 includes joint coordinates of these important body parts as the feature vector fi. Apart from joint positions, some other quantities that are derived from the joint coordinates, like joint angles, can also be included in fi. The combination of the feature vectors for all frames is the motion data F={f1, f2, . . . , fm} for the entire exercise.
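The assembly of a per-frame feature vector fi can be sketched as follows. The joint names and the dictionary layout of a frame are assumptions made for illustration; they are not the Kinect API or the experimental system's actual data format:

```python
# Hypothetical subset of joints deemed important for a given exercise.
IMPORTANT_JOINTS = ["shoulder_left", "elbow_left", "wrist_left"]

def feature_vector(frame):
    """Concatenate the (x, y, z) coordinates of the selected joints
    for one frame into the feature vector f_i."""
    vec = []
    for joint in IMPORTANT_JOINTS:
        vec.extend(frame[joint])  # each entry is an (x, y, z) triple
    return vec

# One frame of assumed sensor output, mapping joint name -> (x, y, z):
frame = {"shoulder_left": (0.1, 0.5, 2.0),
         "elbow_left":    (0.2, 0.3, 2.0),
         "wrist_left":    (0.3, 0.1, 2.1)}
```

Collecting `feature_vector` over all frames would yield the motion data F={f1, f2, . . . , fm}; derived quantities such as joint angles could be appended to each vector in the same way.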
Given the motion data of the avatar instructor and the user, the accuracy analysis module 42 computes the similarity of the two sequences to evaluate the performance of the user 26. The analysis module 42 accounts for misalignment caused by two kinds of delays in the system 10: human reaction delay and network delay. Advantageously, the system 10 does not need to measure or determine either the human reaction delay or the network delay. Instead, the analysis module 42 aligns the sequences automatically without requiring a measured, quantified or calculated delay amount.
Human Reaction Delay
After seeing the movement of the avatar instructor on the screen 20, it may take the user 26 some time to react to this movement and then follow it. This delay is defined as the time period from when the avatar instructor starts the motion until the user starts the same motion. For training exercises including multiple separate gestures, the user's reaction delay might be different for these gestures. A gesture is defined herein as a sequence that represents the meaningful action of some body parts, for example when these body parts move and then return to the initial position, or when there is an abrupt change in direction. For example, raising one's hand and then putting it down can be considered a gesture. As another example, a step forward can be considered a gesture, and a subsequent step sideways another gesture. Gestures in a training exercise can also be segmented and defined offline by a physical therapist as a single movement or a sequence of a few movements.
Network Delay
Delays can be added by the network 16, and the network delay can vary in response to many factors, such as bandwidth and the network load. Under the influence of network delay, the user 26 may not only perform later than the avatar instructor, but may also appear to perform more slowly in data received by the cloud 12, depending on the amount of network delay during a gesture.
Gesture Based Dynamic Time Warping
The accuracy analysis module 42 in preferred embodiments conducts gesture based dynamic time warping. This technique is a modification of dynamic time warping, which is a technique often used in speech processing. See, D. J. Berndt, and J. Clifford, “Using Dynamic Time Warping to Find Patterns in Time Series,” KDD workshop, Vol. 10. No. 16. (1994). Dynamic time warping as applied to speech processing measures the similarity of two sequences by calculating their minimum distance. Given sequences A={a1, a2, . . . , am} and B={b1, b2, . . . , bn}, an m×n distance matrix d is defined, and d(i, j) is the distance between ai and bj:
d(i, j)=√(|ai−bj|²) (3)
To find the best match or alignment between the two sequences, a continuous warping path through the distance matrix d should be found such that the sum of the distances on the path is minimized. Hence, this optimal path stands for the optimal mapping between A and B such that their distance is minimized. The path is defined as P={p1, p2, . . . , pq} where max{m,n}≦q≦m+n−1 and pk=(xk, yk) indicates that axk is aligned with byk on the path. Moreover, this path is subject to the following constraints
Boundary constraint: p1=(1,1), pq=(m,n)
Monotonic constraint: xk+1≧xk and yk+1≧yk
Continuity constraint: xk+1−xk≦1 and yk+1−yk≦1
Under the three constraints, this path should start from (1,1) and end at (m, n). At each step, xk and yk will stay the same or increase by one.
To find this optimal path, an m×n accumulative distance matrix S is constructed where S(i, j) is the minimum accumulative distance from (1,1) to (i, j). The accumulative distance matrix S can be represented as the following:
S(i, j)=d(i, j)+min{S(i−1, j), S(i, j−1), S(i−1, j−1)} (4)
S(m,n) is defined as the DTW distance of the two sequences; a smaller DTW distance indicates that the two sequences are more similar. The corresponding path indicates the best way to align the two sequences. In this way the two sequences are rescaled on the time axis to best match each other. Time complexity of the DTW method is Θ(mn).
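The accumulative-distance computation described above can be sketched as follows. This is a minimal illustration of standard dynamic time warping on scalar-valued frames; the function name is an assumption for illustration:

```python
def dtw(A, B):
    """Standard DTW: build the accumulative distance matrix S and
    return (S, S(m, n)), following the recurrence described above."""
    m, n = len(A), len(B)
    INF = float("inf")
    S = [[INF] * (n + 1) for _ in range(m + 1)]
    S[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = abs(A[i - 1] - B[j - 1])       # d(i, j) for scalar frames
            S[i][j] = d + min(S[i - 1][j],     # step up
                              S[i][j - 1],     # step right
                              S[i - 1][j - 1]) # step diagonally
    return S, S[m][n]
```

For example, `dtw([1, 2, 3], [1, 2, 2, 3])` yields distance 0, since the repeated middle frame of the second sequence can be absorbed by the warping path; the boundary, monotonic and continuity constraints are all enforced by the three-way minimum.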
The accuracy analysis module 42 conducts data preprocessing and alignment to utilize dynamic time warping. The data misalignment problem caused by human reaction delay and network delay allows dynamic time warping to be used to rescale the two sequences on the time axis to align them, but only after pre-processing provided by the invention. Directly applying dynamic time warping on two sequences to evaluate their similarity is unreliable because the absolute amplitude of data may have influence on the optimal path and therefore the alignment result. An example illustrates this problem. For two sequences A={a1, a2, . . . , am} and B={b1, b2, . . . , bn}, if one applies dynamic time warping on them, the alignment result is not expected to change if a constant c is added to B. However, when computing the new distance matrix of A and B′=B+c, (3) becomes:
d′(i, j)=√(|ai−(bj+c)|²) (5)
Therefore, the new distance matrix d′ differs from d by more than the constant c: the relative sizes of elements in d are changed. Consequently, the choice in (4) at each step might be different and S′≠S+c, so B′ is aligned with A in a different way.
To solve this problem, the present invention preprocesses the data before applying dynamic time warping by aligning the two starting points a1 and b1 as (6):
B′=B+(a1−b1) (6)
Applying dynamic time warping on A and B′, we can obtain the optimal path P* and the DTW distance S′(m,n) for A and B′:
S′(m,n)=Σ(xk,yk)∈P* |axk−b′yk| (7)
so the DTW distance S(m,n) between the original data A and B is obtained by applying the same path P* to the original data:
S(m,n)=Σ(xk,yk)∈P* |axk−byk| (8)
A={a1, a2, . . . , am} and B={b1, b2, . . . , bn} are the training avatar's and user's motion sequences, respectively. a1 is the first point of sequence A, and b1 is the first point of B. The goal is to add the constant (a1−b1) to B (so that B becomes B′) such that the first point in B′ equals a1. In this way, it is possible to first find the optimal path P* using the preprocessed data A and B′, and then calculate the dynamic time warping distance for the original data A and B. The remaining description assumes that such preprocessing has been conducted.
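The preprocessing step and the recovery of the original distance can be sketched as follows, assuming scalar frames and illustrative function names: shift B so its first point matches A's, backtrack the optimal path on the preprocessed pair, then sum distances for the original pair along that path:

```python
def dtw_path(A, B):
    """DTW accumulative matrix plus the backtracked optimal path
    P = [(x1, y1), ..., (m, n)] with 1-based indices."""
    m, n = len(A), len(B)
    INF = float("inf")
    S = [[INF] * (n + 1) for _ in range(m + 1)]
    S[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = abs(A[i - 1] - B[j - 1]) + min(
                S[i - 1][j], S[i][j - 1], S[i - 1][j - 1])
    # Backtrack from (m, n) to (1, 1) through minimal predecessors.
    path, (i, j) = [], (m, n)
    while (i, j) != (1, 1):
        path.append((i, j))
        i, j = min([(i - 1, j), (i, j - 1), (i - 1, j - 1)],
                   key=lambda p: S[p[0]][p[1]])
    path.append((1, 1))
    return list(reversed(path))

def aligned_distance(A, B):
    """Preprocess per the starting-point alignment, find P* on (A, B'),
    then score the ORIGINAL pair (A, B) along that path."""
    Bp = [b + (A[0] - B[0]) for b in B]     # align starting points
    path = dtw_path(A, Bp)                  # optimal path P* on (A, B')
    return sum(abs(A[x - 1] - B[y - 1]) for x, y in path)
```

For instance, with B offset from A by a constant 5, the preprocessed pair aligns frame-for-frame and the original-pair distance along that path is simply 5 per frame.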
Since the dynamic time warping distance S(m,n) is a similarity measurement for the two sequences, the method normalizes S(m,n) over an arbitrary range, e.g., to 0˜100 as an evaluation score for the user. A smaller S(m,n) represents a higher score and indicates that the two sequences are more similar and the user performs better.
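One possible normalization is a simple linear map; both the mapping and the maximum-distance parameter `S_max` below are assumptions for illustration, since the range is left arbitrary:

```python
def score(S_mn, S_max):
    """Map a DTW distance to a 0-100 score: distance 0 -> 100,
    distance >= S_max -> 0, linear in between (assumed mapping)."""
    return max(0.0, 100.0 * (1.0 - S_mn / S_max))
```

Under this sketch a perfect alignment scores 100 and any distance at or beyond `S_max` scores 0.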
In a physical training session using the present system, there are multiple ways to provide guidance to help the user calibrate his or her movements. For example, an entire replay of the movements that the user has performed, together with the avatar instructor's movements, can be provided after the user has completed the whole training set (˜several minutes). This can be classified as non-real-time feedback. However, the present system can provide feedback after the user finishes each gesture (˜a couple of seconds), which can be considered real-time feedback.
For a given physical training exercise, gestures in the avatar instructor's motion sequence have been predefined and segmented by the physical therapist. Suppose that A1={a1, a2, . . . , am1} is defined as the first gesture in the avatar instructor's sequence A={a1, a2, . . . , am}. Dynamic time warping can be used to find the subsequence of the user's motion data which best matches the avatar instructor's gesture A1. A modified dynamic time warping algorithm, which can be called subsequence dynamic time warping, is used to search for a subsequence inside a longer sequence that optimally fits the other, shorter sequence. Supposing that the starting point of one gesture is straight after the endpoint of the last gesture, one can fix the starting point of the subsequence as b1. For the subsequence {b1, b2, . . . , bk} (k=2, 3, . . . , n) of the user, its dynamic time warping distance with the avatar's gesture A1 is S(m1,k). The optimal endpoint n1 of the user's gesture should be the frame that leads to the best match between the two sequences and gives the minimum dynamic time warping distance:
n1=argmink S(m1,k), k=2, 3, . . . , n (9)
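The fixed-start endpoint search described above can be sketched as follows: with the subsequence start fixed at b1, the last row of the accumulative matrix already contains S(m1,k) for every candidate endpoint k, and the best endpoint is its argmin. Scalar frames and the function name are assumptions for illustration:

```python
def best_endpoint(A1, B):
    """Subsequence DTW with the start fixed at b1: return the endpoint
    n1 minimizing S(m1, k) over k = 2 .. n."""
    m1, n = len(A1), len(B)
    INF = float("inf")
    S = [[INF] * (n + 1) for _ in range(m1 + 1)]
    S[0][0] = 0.0
    for i in range(1, m1 + 1):
        for k in range(1, n + 1):
            S[i][k] = abs(A1[i - 1] - B[k - 1]) + min(
                S[i - 1][k], S[i][k - 1], S[i - 1][k - 1])
    # The last row holds S(m1, k) for every candidate endpoint k.
    return min(range(2, n + 1), key=lambda k: S[m1][k])
```

For example, if the avatar gesture is an up-and-down movement `[0, 1, 2, 1, 0]` and the user performs it in frames 1-5 before idling and starting something else, the search recovers frame 5 as the gesture endpoint.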
If prior techniques for dynamic time warping were applied, then due to the existence of local minimum points, the endpoint of the user's gesture could not be determined until the whole motion sequence of the user is obtained. The entire sequence B={b1, b2, . . . , bn} would be searched to find the global minimum point. This means searching from k=2 to k=n, which requires significant computation. Methods and systems of the invention instead analyze a subsequence of the data, corresponding to a gesture, and avoid the need to search for a global minimum in the entire motion sequence of the user. A global minimum point is instead estimated by analysis of subsequences.
The accuracy analysis module 42 in the system 10 of
S′(m1,k+j)=d(m1,k+j)+S′(m1,k+j−1) (10)
where j=1, 2, . . . , e. Then for the true distance Strue and the estimated distance Sestimated, the relative error vector is
error=|Sestimated−Strue|·/Strue (11)
An error tolerance threshold δ is used to measure the relative error. |Sestimated−Strue| is the absolute error between Strue and Sestimated, and |Sestimated−Strue|·/Strue is the relative error. In the experiments, e=20 and δ=5% were used. These values were determined experimentally to provide good results. A preferred assumption is based upon the user completing one gesture, after which the user may stay in the end position for a short time (˜1 s, which is ˜30 frames). In this instance, when e<30, larger e means higher accuracy and larger computation, but when e>30, the assumption may not hold. An example practical range for e is from 15 to 30. Larger values of δ can result in false detection (which means that the point which satisfies Mean(error)<δ may not be the global minimum). Too small a δ may result in failure of detection (which indicates that the method cannot find a point where Mean(error)<δ holds, even at the true global minimum). A practical example range for δ is 3%˜10%. If the average relative error Mean(error)<δ, it is concluded that the local minimum point at k is the global minimum point and therefore the endpoint of this gesture. Otherwise, the method continues to test the next local minimum point. Transitions or pauses in physical gesture movements create a natural subsequence, but the selection of subsequences can also be a predetermined number of frames that do not correspond to a discrete physical gesture. Gestures for purposes of analysis can therefore correspond to a physical gesture, a portion of a physical gesture, a portion of sequential physical gestures, or a limited number of sequential physical gestures.
In sum, for each local minimum point k, the method decides/estimates whether it is the global minimum point. The following assumption is used: if k is the global minimum point, then frames k+1, k+2, . . . , k+e in user sequence B will all be aligned with frame m1 of the avatar instructor's sequence A when DTW is applied to A and B. Based on this assumption, two vectors are calculated for each local minimum point k. (1) Strue={S(m1,k+1), S(m1,k+2), . . . , S(m1,k+e)}. This is the true DTW warping distance vector for the sequence of frames. (2) Sestimated={S′(m1,k+1), S′(m1,k+2), . . . , S′(m1,k+e)}. This vector is the estimated DTW warping distance vector based on the above assumption. Then, Strue and Sestimated are compared using equation (11), which calculates the relative error between them. If Strue and Sestimated are within a predetermined error, the assumption holds for this local minimum point, and this local minimum point can be used as an estimate of the global minimum point.
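The local-minimum test built from (10) and (11) can be sketched as follows. Here `S_row` and `d_row` stand for the accumulative-distance row S(m1, ·) and the distance row d(m1, ·), each 1-indexed with index 0 unused; the names, indexing and default values are assumptions for illustration:

```python
def is_global_minimum(S_row, d_row, k, e=20, delta=0.05):
    """Test a local minimum at frame k: if the user holds the end
    position, the next e frames stay aligned with frame m1, so the
    estimate of eq. (10) should track the true distances of S_row."""
    S_true = [S_row[k + j] for j in range(1, e + 1)]
    S_est, prev = [], S_row[k]
    for j in range(1, e + 1):
        prev = d_row[k + j] + prev       # eq. (10): S'(m1,k+j)=d(m1,k+j)+S'(m1,k+j-1)
        S_est.append(prev)
    rel_err = [abs(a - b) / b for a, b in zip(S_est, S_true)]  # eq. (11)
    return sum(rel_err) / e < delta      # Mean(error) < delta
```

In the first case below the true distances follow the recurrence exactly (the user holds the end pose), so the point is accepted; in the second the true distances keep falling, revealing a mere local minimum.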
Using this approach, gesture segmentation is implemented in the process of dynamic time warping and scores for different gestures can be provided to the user in real time. Subsequences can be defined offline as a preliminary step when recording training avatar data. For the user data, the present method aligns subsequences in the data with subsequences in the avatar data and finds the corresponding gestures. The present methods are able to align the two sequences even in the presence of delays of any kind in the user data. For each gesture, the extra complexity to test local minimum points is only Θ(m1e). Moreover, if B1={b1, b2, . . . , bn1} is determined as the gesture related to the avatar instructor's gesture A1={a1, a2, . . . , am1}, dynamic time warping can be conducted from the new starting point (m1+1, n1+1).
When the number of gestures g is large, the present gesture segmented method can significantly decrease the computational complexity compared to default dynamic time warping on the entire sequence.
Based on the alignment result given by the optimal warping path in each gesture, the two motion sequences can be rescaled nonlinearly on the time axis to match them. When multiple adjacent frames in one sequence are aligned with one single frame in the other sequence, the single frame is repeated several times. For example, if frames Â={ai, ai+1, . . . , ai+w−1} of the avatar instructor are aligned with bj of the user, w−1 frames identical to bj will be inserted after frame j. In this way the user's movement in each frame matches the corresponding movement of the avatar instructor.
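The frame-repetition rescaling can be illustrated with a short sketch: expanding each sequence along the warping path automatically repeats a frame once for every alignment pair it appears in (the function name and path representation are illustrative):

```python
def rescale_to_alignment(seq, path, which=0):
    """Nonlinearly rescale one sequence along a DTW warping path.

    path is the optimal warping path as a list of (i, j) index
    pairs; which=0 expands the first sequence, which=1 the second.
    A frame aligned with w frames of the other sequence appears in
    w consecutive pairs, so it is emitted w times, and both
    rescaled sequences end up with len(path) frames.
    """
    return [seq[pair[which]] for pair in path]
```

For example, with the path [(0, 0), (1, 0), (2, 0), (3, 1)], user frame b0 is repeated three times, which corresponds to the w−1 inserted frames described above.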
Real Time Experiment of Gesture Segmented Dynamic Time Warping.
In the experiment, 10 subjects (aged 18-30; 7 males, 3 females) were required to perform a gesture designed by a physical therapist nine times. For each performance, the subject receives an evaluation Y ∈ {0,1} from the physical therapist, where Y=0 represents good performance and Y=1 indicates that the subject fails the gesture. In the meantime, the cloud-based virtual training system gives an evaluation score S for the same performance. The optimal threshold for S satisfies
PS|Y(s|0)PY(0)=PS|Y(s|1)PY(1) (13)
where PY(y) is the prior probability of each class. Assuming that the two classes are Gaussian-distributed,
where μy is the sample mean and σy2 is the sample variance of class y. From (13) and (14) the following is obtained:
The solution s0 of (15) is the optimal threshold for the evaluation score S given by the system. From the experiment, s0=62.8 is obtained; performances scoring below 62.8 would benefit from real-time guidance from the system.
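The threshold computation of (13)-(15) can be reproduced numerically. Taking logarithms of (13) with the Gaussian densities of (14) yields a quadratic in s; a sketch (the function name is illustrative, and the root lying between the two class means is taken as s0):

```python
import numpy as np

def optimal_threshold(mu0, var0, p0, mu1, var1, p1):
    """Solve PS|Y(s|0)PY(0) = PS|Y(s|1)PY(1) for Gaussian classes.

    mu0, var0, p0: sample mean, sample variance, and prior of the
    good class (Y=0); mu1, var1, p1: the same for the fail class.
    Taking logs gives a*s^2 + b*s + c = 0; the root between the
    class means is the decision threshold s0.
    """
    a = 1.0 / (2 * var1) - 1.0 / (2 * var0)
    b = mu0 / var0 - mu1 / var1
    c = (mu1 ** 2 / (2 * var1) - mu0 ** 2 / (2 * var0)
         + np.log(p0 / p1) + 0.5 * np.log(var1 / var0))
    if abs(a) < 1e-12:          # equal variances: equation is linear
        return -c / b
    lo, hi = sorted([mu0, mu1])
    for r in np.roots([a, b, c]):
        if abs(r.imag) < 1e-9 and lo <= r.real <= hi:
            return float(r.real)
    return float(np.roots([a, b, c])[0].real)
```

With equal variances and equal priors, the threshold reduces to the midpoint of the two class means, as expected from the symmetric Gaussian case.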
Experiments also tested providing users with visual and textual guidance through the system. First, we discuss the different alignment types in the result of gesture-based dynamic time warping. Here we define the monotonicity of a subsequence Â={ai, ai+1, . . . , ai+w−1} as follows: if all the features of  are monotonic (i.e., keep increasing or keep decreasing), then  is monotonic; otherwise it is non-monotonic. Suppose that all the frames in Â={ai, ai+1, . . . , ai+w−1} are aligned to bj; then there are two different cases. If  is monotonic, the effect of the multiple frames in  is similar to the effect of bj, which indicates that B is faster than A at that time. If  is non-monotonic, some reciprocating movements in  are aligned to one single frame bj, so B's gesture is incomplete for this reciprocating motion. Based on the different ways the avatar instructor and the user can be aligned, we summarize in Table 1 four types of alignments and their corresponding feedback (used as textual guidance) for the user.
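The monotonicity test for a subsequence Â follows directly from the definition above; a minimal sketch over per-frame feature vectors:

```python
def is_monotonic(frames):
    """Return True if a subsequence of frames is monotonic.

    frames: list of equal-length feature vectors, one per frame.
    The subsequence is monotonic if and only if every feature is
    non-decreasing or non-increasing across all of its frames.
    """
    n_features = len(frames[0])
    for f in range(n_features):
        vals = [frame[f] for frame in frames]
        increasing = all(a <= b for a, b in zip(vals, vals[1:]))
        decreasing = all(a >= b for a, b in zip(vals, vals[1:]))
        if not (increasing or decreasing):
            return False  # this feature reciprocates
    return True
```

A monotonic run aligned to one user frame signals a speed mismatch (types 1 and 2), while a non-monotonic run signals an incomplete reciprocating motion.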
Next, we discuss how to calculate an accurate evaluation score for each gesture based on the different kinds of training exercises and the types of alignments discussed above. Above, S(m1, n1) is used to provide the evaluation score for the user. However, when the user performs faster or slower than the avatar instructor, as in types 1 and 2, the difference between the two sequences is counted several times. For example, if all the frames in Â={ai, ai+1, . . . , ai+w−1} are aligned to bj, then the accumulative distance for this part is
However, for some training exercises where speed is not important, the distance should be counted only once, and (16) can be revised as
Therefore, for exercises in which speed is not important, we use (17) to calculate the evaluation score. For exercises where speed should be considered, the original accumulative distance in (16) is used.
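The two accumulation rules can be sketched over a warping path. The exact forms of (16) and (17) are not reproduced above, so this is an assumed reading: with speed considered, every alignment pair contributes its frame distance; with speed ignored, a run of instructor frames aligned to one user frame contributes a single (here, the minimum) distance:

```python
from itertools import groupby

def gesture_score_distance(path, dist, speed_matters=True):
    """Accumulate frame distances along a DTW warping path.

    path: (i, j) alignment pairs in path order; dist(i, j) gives
    the distance between instructor frame i and user frame j.
    speed_matters=True counts every pair, in the spirit of (16);
    speed_matters=False counts a run of pairs sharing one user
    frame j only once, taking the minimum distance over the run
    (an assumed reading of (17)).
    """
    if speed_matters:
        return sum(dist(i, j) for i, j in path)
    total = 0.0
    # groupby collapses consecutive pairs aligned to the same j.
    for j, run in groupby(path, key=lambda pair: pair[1]):
        total += min(dist(i, jj) for i, jj in run)
    return total
```

The accumulated distance is then mapped to the evaluation score S in the same way as S(m1, n1) above.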
After completing one gesture, the user can see the score of his performance on the screen. To better help the user correct his performance for any low-score gesture, a replay system can provide two kinds of guidance (visual and textual) for the user. First, the rescaled movements of the avatar instructor, together with the rescaled movements of important body parts of the user, are shown on the screen. In this way, the user can see the differences between his movements and the avatar instructor's and knows how to correct his performance. Second, according to the four types in Table 1, textual guidance can be shown on the screen to remind the user of his error type if he made mistakes in the speed or movement range of the gesture. (For exercises in which speed is not important, types 1 and 2 are ignored.)
Results
The experiments are based on a testbed (shown in the referenced figure).
The tested exercise is laterally moving one's left arm from the solid position to the dotted position and then returning to the solid position, with a different angle θ, five times. The angle of the left shoulder is measured and five gestures are defined for this exercise. The avatar instructor's motion data for the five gestures are shown as the upper curve in the referenced figure.
Results obtained with the present methods and system were compared to the prior MCC method and to default dynamic time warping on the entire sequence, searched through for a global minimum as discussed above. Data was obtained by calculating a correlation coefficient for the aligned sequences x and y in each method. The correlation coefficient ρ is defined as:
where x̄ and ȳ are the sample means of the aligned sequences x and y.
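Concretely, the correlation coefficient for two aligned sequences can be computed as the standard Pearson coefficient; a minimal sketch:

```python
import numpy as np

def correlation_coefficient(x, y):
    """Pearson correlation coefficient of two aligned sequences:

        rho = sum((x - mean(x)) * (y - mean(y)))
              / sqrt(sum((x - mean(x))**2) * sum((y - mean(y))**2))

    A value close to 1 indicates the two sequences are well aligned.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xm = x - x.mean()
    ym = y - y.mean()
    return float((xm * ym).sum()
                 / np.sqrt((xm ** 2).sum() * (ym ** 2).sum()))
```

A higher ρ for the present method than for MCC or whole-sequence dynamic time warping indicates a better alignment of the user's sequence to the instructor's.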
While specific embodiments of the present invention have been shown and described, it should be understood that other modifications, substitutions and alternatives are apparent to one of ordinary skill in the art. Such modifications, substitutions and alternatives can be made without departing from the spirit and scope of the invention, which should be determined from the appended claims.
Various features of the invention are set forth in the appended claims.
This application claims priority under 35 U.S.C. §119 from prior provisional application Ser. No. 62/239,481, which was filed Oct. 9, 2015.
This invention was made with government support under grant number IIS-1522125 awarded by the National Science Foundation. The government has certain rights in the invention.