The subject disclosure relates to video analysis processes, and more particularly, to a system and method for real-time automatic video key frame analysis for critical motion in complex athletic movements.
Video analysis of bio-mechanical movement is becoming very popular for applications such as sports motion and physical therapy. However, identifying which frames in a video contain the most significant content can be a challenge when poring over hundreds of frames in a sequence. Trying to do so quickly is prone to significant error. Conventionally, a user would freeze-frame different points in a general section of frames to show, for example, a human subject where the body was deficient in a movement. With the proliferation of mobile computing devices, video can be analyzed remotely. The analyzer can extract select frames to show the subject user the flaws in mechanics. However, single frames lack context, and seeing the action in real-time or at a high frame rate is more desirable.
Some approaches have attempted to automate the process of identifying critical points in motion. Previous solutions may use, for example, pose estimation to analyze movement, a technique that is computationally slow and inefficient. Pose estimation is prone to errors from lighting, background, and clothing type. Using poses as intermediate data (a feature vector) fed to other models injects human assumptions into the process, which creates significant issues. Moreover, the processing power required makes it nearly impossible to run analysis using pose estimation in real-time at high frame rates on a mobile device.
In one aspect of the disclosure, a method for identifying critical points of motion in a video is disclosed. The method includes using a feature extraction neural network trained to identify features among a plurality of images of movements. The features are associated with a critical point of movement. A video sequence is received including a plurality of video frames comprising a motion captured by a camera. The feature extraction neural network identifies individual frames from the plurality of video frames. The identified individual frames include a known critical point in the captured motion based on identified features associated with a critical point of movement. The identified frames which show the critical points in the captured motion are displayed.
It is understood that other configurations of the subject technology will become readily apparent to those skilled in the art from the following detailed description, wherein various configurations of the subject technology are shown and described by way of illustration. As will be realized, the subject technology is capable of other and different configurations and its several details are capable of modification in various other respects, all without departing from the scope of the subject technology. Accordingly, the drawings and detailed description are to be regarded as illustrative in nature and not as restrictive.
The detailed description set forth below is intended as a description of various configurations of the subject technology and is not intended to represent the only configurations in which the subject technology may be practiced. The appended drawings are incorporated herein and constitute a part of the detailed description. The detailed description includes specific details for the purpose of providing a thorough understanding of the subject technology. However, it will be apparent to those skilled in the art that the subject technology may be practiced without these specific details. Like or similar components are labeled with identical element numbers for ease of understanding.
In general, and referring to the Figures, illustrative embodiments of the subject technology provide processes which automate the identification of critical points of motion in a video that captured a sequence of movements. Aspects of the subject technology provide training for an artificial intelligence (A.I.) or machine learning module. The embodiments described below are capable of engineering training data in such a way that builds a neural network configured to learn and identify the optimal features from a video sequence, which is far superior to using, for example, pose estimation. It is computationally faster and significantly less prone to error. In another aspect, the automated process generates the display of video frames from a video that show the critical points of motion in a movement sequence.
In an illustrative embodiment, aspects of the subject technology may be applied to analysis of a bio-mechanical movement. For example, in sports, specific sequences of movements have the same general motion, but identifying the inefficient placement of one body part relative to another during the motion is a challenge. Generally, there are certain points in the motion that are more critical to generating a successful outcome from the movement than others. Finding the video frames that show these points in the movement, when one subject person's body differs from another's, is susceptible to error when done manually or by other techniques such as pose estimation. Aspects of the subject technology improve the accuracy in identifying the critical points in videos between different subjects. Some embodiments include additional analysis once critical frames are found. Embodiments may include a finely tuned neural network that identifies body parts for a particular critical frame (accuracy being key for body part identification). Body parts may then be evaluated for positioning deviation from an ideal position. An ideal position may, in some applications, be associated with generating an optimal characteristic (for example, maximum power, successful contact, successful object path after contact, among other characteristics depending on the movement). Other analysis, including for example finding objects (golf club, barbell, projectile), may also be performed on a specific critical frame, giving more insight into user/player performance. For example, the position of a golf club may be considered crucial to performance in some critical frames.
Step 20 may include automatic key frame labeling. In the keyframe labeling, the initial labeled data may be engineered using an automatic process that converts it from human labels to labels better understood by a machine. In some embodiments, image similarity scoring using, for example, a structured similarity index may be included prior to training. In some embodiments, using the labeled data, an automatic process may determine the optimal statistical range within which other images should be counted as a keyframe for training. For example, the unlabeled video frames may be scored based on an image similarity algorithm on a [0, 1] scale, with 1 indicating exactly the same image. Other embodiments may use the median of the distribution after removing outliers. Another embodiment may compute the mean and standard deviation and discard anything outside one standard deviation.
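The outlier-removal and median strategies described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name, the toy score list, and the choice of a dissimilarity convention (0.0 meaning identical) are assumptions for the example, and the one-standard-deviation trim is just one of the strategies the text mentions.

```python
import statistics

def derive_keyframe_threshold(similarity_scores):
    """Derive a labeling threshold from a distribution of similarity
    scores (here 0.0 = identical, 1.0 = completely dissimilar).
    Strategy: discard scores more than one standard deviation from
    the mean, then take the median of what remains."""
    mean = statistics.mean(similarity_scores)
    stdev = statistics.pstdev(similarity_scores)
    trimmed = [s for s in similarity_scores if abs(s - mean) <= stdev]
    return statistics.median(trimmed)

# Toy distribution of similarity scores for one keyframe.
scores = [0.05, 0.10, 0.20, 0.25, 0.30, 0.40, 0.95]
threshold = derive_keyframe_threshold(scores)
# Frames scoring at or below the threshold inherit the keyframe label.
labels = ["keyframe" if s <= threshold else "no-label" for s in scores]
```

With this toy data, the 0.95 outlier is trimmed before the median is taken, so a single badly mismatched frame does not skew the labeling range.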
Step 30 may include training feature extraction. In training for feature extraction, the engineered labels may then be used to train a feature extraction neural network. A custom constructed neural network may be generated which is ideal for extracting high level features capable of distinguishing critical points of motion from an image. Embodiments may integrate a unique and custom loss function which the model uses during training to evaluate itself and learn an optimal maximum. Through trial and error, and research, the optimal amount of stochastic augmentation to apply to input images may be found. This step may alter images just enough to challenge the neural network to learn better. Too little, and the network may not reach an optimal maximum; too much, and the network may not learn anything.
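The tuned stochastic augmentation described above can be sketched as a simple perturbation function. This is an illustrative stand-in only: the specific augmentations (brightness jitter, horizontal flip) and the `strength` knob are assumptions for the example; the text only requires that the amount of perturbation be tuned so the network is challenged but can still learn.

```python
import random

def augment(image, strength=0.1, rng=None):
    """Apply mild stochastic augmentations to a grayscale image
    represented as a list of rows of floats in [0, 1]."""
    rng = rng or random.Random()
    # Random brightness shift, clamped to the valid pixel range.
    shift = rng.uniform(-strength, strength)
    out = [[min(1.0, max(0.0, px + shift)) for px in row] for row in image]
    # Random horizontal flip half of the time.
    if rng.random() < 0.5:
        out = [row[::-1] for row in out]
    return out

img = [[0.2, 0.8], [0.5, 0.5]]
aug = augment(img, strength=0.1, rng=random.Random(0))
```

Raising `strength` toward 1.0 would destroy the signal (the network learns nothing); setting it to 0.0 removes the challenge entirely, matching the too-much/too-little trade-off described above.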
Step 40 may include training a temporal network. The feature extraction network may be used to train the temporal model. In some embodiments, a custom-built neural network may be developed, ideal for understanding high level features over time. These features may be extracted from the neural network in the previous step. In this step the same loss function may be utilized to boost training.
The final output of the above process may be a neural network embodiment which may be executed by computer hardware in real-time to analyze videos. The neural network takes in sequences of images, for example, video, and then identifies the best image from the set that matches a desired critical motion to analyze further. Once the neural network is trained, some embodiments may include a software embodiment which executes the processes of identifying the frames with critical points of motion for display to an end user.
In general, the feature extractor module converts a single two-dimensional image into a compressed representation optimal for temporal training. The temporal model uses that representation to then learn the underlying sequence of events and when the critical frames occur. The temporal neural network is trained on the input feature vectors from the feature extractor neural network. The temporal neural network then predicts the critical key frames using these feature vectors. Training is split into two steps in some embodiments because the computational power/memory requirement to construct a single monolithic neural network that goes from video images directly to critical frames may be too heavy for some computing systems. The subject approach is highly optimized since the process(es) require significantly less computational power/memory and is significantly faster to construct. This may be done by splitting training into two steps (feature extractor and temporal model). The feature extractor neural network essentially transforms/maps an image to a latent (hidden space) representation. The latent space is a compressed representation that optimizes training for the temporal model. The temporal model uses the latent space representation to predict the critical frames. The temporal model also learns the underlying temporal patterns between critical frames; for example, the sequence of events and the time that passes between critical frames.
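The two-stage split described above can be sketched with simple stand-ins. These functions are not neural networks: the per-row-mean "latent" mapping and the prototype-distance scorer are hypothetical placeholders chosen only to show the shape of the pipeline (image → latent vector → per-frame critical score → selected frame).

```python
def extract_features(frame):
    """Stand-in for the feature-extraction network: map a 2-D frame
    to a compact latent vector.  Here we use per-row means; in the
    disclosed system this mapping is learned."""
    return [sum(row) / len(row) for row in frame]

def predict_critical(latents, prototype):
    """Stand-in for the temporal model: score each latent vector by
    squared distance to a 'critical pose' prototype and return the
    index of the closest frame.  A real temporal network would also
    exploit ordering and timing between frames."""
    dists = [sum((a - b) ** 2 for a, b in zip(v, prototype)) for v in latents]
    return dists.index(min(dists))

video = [[[0.0, 0.0]], [[0.4, 0.6]], [[1.0, 1.0]]]  # three tiny frames
latents = [extract_features(f) for f in video]       # stage 1: compress
critical_idx = predict_critical(latents, prototype=[0.5])  # stage 2: select
```

The point of the split is visible even in this toy: stage 2 never touches raw pixels, only the much smaller latent vectors, which is what reduces the compute and memory cost relative to one monolithic network.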
The statistical range for determining whether an unlabeled frame should be converted into a labeled frame may be based on a distribution curve of all the similarity values calculated for a specific keyframe. For example, a normalized root mean squared error may be used as a similarity algorithm. Using this algorithm, comparing the specific keyframe to itself would produce a value of 0.0 because they are exactly equal. As a particular frame gets further away in time from the keyframe, the similarity value may go up, stopping at 1.0 because it is normalized from 0.0 to 1.0. The system may collect all these values for a specific keyframe across all the videos in the training dataset. Assuming, for example, there are 2000 similarity metrics for a keyframe, a statistical analysis may include removing outliers. For simplicity's sake, for a process that includes a basic outlier removal, the median value of the list of similarity metrics may be determined. If the median value is, for example, 0.25, this value may be used to determine which frames may receive a label from those unlabeled frames associated with a keyframe. Based on the similarity algorithm, the higher the value, the more dissimilar the frame is. So, any frame with a score greater than 0.25 (as an example using the illustrative value above) is placed into the no-label category. Conversely, if the score is less than or equal to 0.25, it is labeled as a keyframe.
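The NRMSE scoring and 0.25-threshold labeling described above can be sketched as follows. This is an illustrative example, assuming grayscale frames with pixel values in [0, 1] (so the RMSE already lies in [0, 1] and needs no further normalization); the toy frames and the 0.25 threshold mirror the illustrative value in the text.

```python
import math

def nrmse(frame_a, frame_b):
    """Normalized root mean squared error between two same-sized
    grayscale frames with pixel values in [0, 1]: 0.0 means the
    frames are identical, and values grow toward 1.0 with
    dissimilarity."""
    flat_a = [px for row in frame_a for px in row]
    flat_b = [px for row in frame_b for px in row]
    mse = sum((a - b) ** 2 for a, b in zip(flat_a, flat_b)) / len(flat_a)
    return math.sqrt(mse)

keyframe = [[0.5, 0.5]]
frames = [[[0.5, 0.5]], [[0.4, 0.6]], [[0.0, 1.0]]]
median_threshold = 0.25  # illustrative value from the text
labels = ["keyframe" if nrmse(keyframe, f) <= median_threshold else "no-label"
          for f in frames]
```

The identical frame scores 0.0 and a slightly perturbed frame scores 0.1, so both fall at or below the 0.25 threshold and inherit the keyframe label, while the strongly differing frame scores 0.5 and is placed in the no-label category.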
During the development of one embodiment, domain experts labelled a start and end point for a critical time period in a video. This resulted in inconsistent data due to humans' inability to consistently perceive tiny visual changes in images across many video sequences. These aggregated inconsistencies can introduce too much randomness to be modeled statistically. As will be appreciated, the automated portions of the subject technology (for example, via computer recognition) are more adept at identifying tiny visual changes and can consistently manage these across a wide variety of video sequences. This produces a statistically stable distribution.
As will be appreciated, users may benefit from the various aspects of the subject technology when applied to, for example, analyzing the motion of a movement to identify flaws or inefficiencies in the movement (for example, hitches, improper alignment of a body part, etc.). The output is provided nearly instantaneously in a software application embodiment which displays a frame with a critical point of motion right after the video is taken. Users are able to use the application anywhere and may do so on site where they are moving/performing a motion. For example, golfers may video their golf stroke during a round and, through the subject technology, see on a display why their shot went awry.
The network 806 may be, without limitation, a local area network (“LAN”), a virtual private network (“VPN”), a cellular network, the Internet, or a combination thereof. For example, the network 806 may include a mobile network that is communicatively coupled to a private network, sometimes referred to as an intranet that provides various ancillary services, such as communication with various application stores, libraries, and the Internet. The network 806 allows the image critical points analytics engine 810, which is a software program running on the image analytics service server 816, to communicate with the image input data source 812, computing devices 802(1) to 802(N), and the cloud 820, to provide analysis of critical points in complex movements. In one embodiment, the data processing is performed at least in part on the cloud 820.
For purposes of later discussion, several user devices appear in the drawing, to represent some examples of the computing devices that may be the source of image data. Image data may be in the form of video sequence data files that may be communicated over the network 806 with the image critical points analytics engine 810 of the image analytics service server 816. Today, user devices typically take the form of portable handsets, smart-phones, tablet computers, personal digital assistants (PDAs), and smart watches, which generally include cameras integrated into their respective device packaging.
For example, a computing device (e.g., 802(N)) may send a request 103(N) to image critical points analytics engine 810 to analyze the features of video sequence data captured by the computing device 802(N), so that critical points in the motion captured are identified and analyzed for deviation from an ideal form/positioning.
While the image input data source 812 and image critical points analytics engine 810 are illustrated by way of example to be on different platforms, it will be understood that in various embodiments, the image input data source 812 and the image analytics service server 816 may be combined. In other embodiments, these computing platforms may be implemented by virtual computing devices in the form of virtual machines or software containers that are hosted in a cloud 820, thereby providing an elastic architecture for processing and storage.
As discussed above, functions relating to critical point motion analysis of the subject disclosure can be performed with the use of one or more computing devices connected for data communication via wireless or wired communication, as shown in
The computer platform 900 may include a central processing unit (CPU) 904, a hard disk drive (HDD) 906, random access memory (RAM) and/or read only memory (ROM) 908, a keyboard 910, a mouse 912, a display 914, and a communication interface 916, which are connected to a system bus 902.
In one embodiment, the HDD 906 has capabilities that include storing a program that can execute various processes, such as the neural network engine 940, in a manner described herein. The neural network engine 940 may be part of the image analytics service server 816 of
In another embodiment, the computer platform 900 may represent an end user computer (for example, computing devices 802(1) to 802(N)) of
As will be appreciated by one skilled in the art, aspects of the disclosed invention may be embodied as a system, method or process, or computer program product. Accordingly, aspects of the disclosed invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module,” or “system.” Furthermore, aspects of the disclosed invention may take the form of a computer program product embodied in one or more computer readable media having computer readable program code embodied thereon.
Any combination of one or more computer readable media may be utilized. In the context of this disclosure, a computer readable storage medium may be any tangible or non-transitory medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.
Aspects of the disclosed invention are described below with reference to block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. The previous description provides various examples of the subject technology, and the subject technology is not limited to these examples. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein, but are to be accorded the full scope consistent with the language of the claims, wherein reference to an element in the singular is not intended to mean "one and only one" unless specifically so stated, but rather "one or more." Unless specifically stated otherwise, the term "some" refers to one or more. Pronouns in the masculine (e.g., his) include the feminine and neuter gender (e.g., her and its) and vice versa. Headings and subheadings, if any, are used for convenience only and do not limit the invention.
A phrase such as an "aspect" does not imply that such aspect is essential to the subject technology or that such aspect applies to all configurations of the subject technology. A disclosure relating to an aspect may apply to all configurations, or one or more configurations. An aspect may provide one or more examples. A phrase such as an aspect may refer to one or more aspects and vice versa. A phrase such as an "embodiment" does not imply that such embodiment is essential to the subject technology or that such embodiment applies to all configurations of the subject technology. A disclosure relating to an embodiment may apply to all embodiments, or one or more embodiments. An embodiment may provide one or more examples. A phrase such as an embodiment may refer to one or more embodiments and vice versa. A phrase such as a "configuration" does not imply that such configuration is essential to the subject technology or that such configuration applies to all configurations of the subject technology. A disclosure relating to a configuration may apply to all configurations, or one or more configurations. A configuration may provide one or more examples. A phrase such as a configuration may refer to one or more configurations and vice versa.
The word “exemplary” is used herein to mean “serving as an example or illustration.” Any aspect or design described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects or designs.
This application claims benefit under 35 U.S.C. § 119(e) of U.S. Provisional Application having Ser. No. 63/083,286 filed Sep. 25, 2020, which is hereby incorporated by reference herein in its entirety.
Number | Date | Country
---|---|---
63083286 | Sep 2020 | US