The present invention relates to a system and method for video processing, and, in particular embodiments, to a system and method for player highlighting in sports video.
Sports video broadcasting and production is a notable business for many cable, broadcasting, and entertainment companies. For example, ESPN has a sports video production division. Some sports video production divisions use proprietary software to perform advanced editing functions on sports videos, such as adding virtual objects (e.g., lines) into the video or video frames. More sports video production features and functionalities are expected to appear in future video production software. One building-block feature of such software is detecting and tracking moving objects in sports video, such as players on a sports field, which can be applied in many sports video editing scenarios. One example of such scenarios is avoiding player occlusion when inserting virtual objects into the video. Improving and adding production features and functionalities in video production software is desirable for improving sports and other video broadcasting and online streaming businesses, improving viewer quality of experience, and attracting more customers.
In one embodiment, a method for video detection and tracking includes detecting a plurality of objects in a video frame using a combined Histograms of Oriented Gradients (HOG) and Local Binary Pattern (LBP) algorithm, highlighting the detected objects, and tracking one of the detected objects that is selected by a user in a plurality of subsequent video frames.
In another embodiment, a user device for video detection and tracking includes a processor and a computer readable storage medium storing programming for execution by the processor, the programming including instructions to detect a plurality of objects in a video frame displayed on a display screen coupled to the user device using a combined HOG and LBP algorithm, highlight the detected objects on the display screen, and track one of the detected objects that is selected by a user in a plurality of subsequent video frames on the display screen.
In yet another embodiment, an apparatus for video detection and tracking includes a detection module configured to detect a plurality of objects in a frame in a video using a combined HOG and LBP algorithm, a tracking module configured to track one of the detected objects that is selected by a user in a plurality of subsequent frames in the video, and a graphic interface including a display configured to highlight the detected objects in the frame and the tracked object in the subsequent frames.
For a more complete understanding of the present invention, and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawing, in which:
The making and using of the presently preferred embodiments are discussed in detail below. It should be appreciated, however, that the present invention provides many applicable inventive concepts that can be embodied in a wide variety of specific contexts. The specific embodiments discussed are merely illustrative of specific ways to make and use the invention, and do not limit the scope of the invention.
System and method embodiments are disclosed herein to enable features or functionalities for video detection and tracking. The features automatically detect and localize the position of an object (e.g., a sports player) in a video frame and track the moving object in the video over time, e.g., in real time. The functionalities provide improved accuracy in detecting and tracking moving objects in video in comparison to current or previous algorithms or schemes. The functionalities include detecting and highlighting one or more objects (e.g., players) in a video (e.g., a sports video). A user can select a detected and highlighted object that is of interest to the user. The object (e.g., player) may be highlighted with a bounding box (or scanning window) in each frame when the video is playing. The selected and highlighted object is then tracked in subsequent video frames, e.g., until the detection process is restarted.
A combination of Histograms of Oriented Gradients (HOG) and Local Binary Pattern (LBP) algorithms is used to describe every scanning window in a sliding-window detection approach. The HOG algorithm is described by N. Dalal and B. Triggs in “Histograms of oriented gradients for human detection,” in the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 886-893, 2005, which is incorporated herein by reference. The HOG features (or descriptors) are related to edge orientation histograms, scale-invariant feature transform (SIFT) descriptors, and shape contexts; they are computed on a dense grid of uniformly spaced cells and use overlapping local contrast normalization for improved performance. The LBP algorithm is described by T. Ojala, et al. in “A comparative study of texture measures with classification based on feature distributions,” in Pattern Recognition, 29(1):51-59, 1996, which is incorporated herein by reference. The SIFT algorithm is described by D. G. Lowe in “Distinctive image features from scale-invariant keypoints,” in International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004, which is incorporated herein by reference.
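The combined descriptor may be illustrated with a minimal sketch. The following is a simplified, assumption-laden illustration only, not the referenced implementations: it computes an unsigned gradient-orientation histogram as a stand-in for the HOG cell/block structure, an 8-neighbor LBP code histogram, and concatenates the two normalized histograms into one feature vector.

```python
import numpy as np

def hog_histogram(patch, bins=9):
    # Magnitude-weighted histogram of unsigned gradient orientations,
    # L2-normalized (stand-in for the full HOG cell/block computation)
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0
    hist, _ = np.histogram(ang, bins=bins, range=(0, 180), weights=mag)
    return hist / (np.linalg.norm(hist) + 1e-6)

def lbp_histogram(patch):
    # 8-neighbor LBP codes: threshold each neighbor against the center pixel
    p = patch.astype(int)
    H, W = p.shape
    center = p[1:-1, 1:-1]
    code = np.zeros(center.shape, dtype=int)
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(shifts):
        code |= (p[1 + dy:H - 1 + dy, 1 + dx:W - 1 + dx] >= center).astype(int) << bit
    hist, _ = np.histogram(code, bins=256, range=(0, 256))
    return hist / (hist.sum() + 1e-6)

def hog_lbp_feature(patch):
    # Combined descriptor: concatenation of the two normalized histograms
    return np.concatenate([hog_histogram(patch), lbp_histogram(patch)])
```

In practice the referenced implementations compute HOG over overlapping normalized blocks and may use uniform LBP patterns; this sketch only conveys how the two histogram families are concatenated into one window descriptor.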
Features of LBP are also described by T. Ahonen, et al. in “Face Recognition with Local Binary Patterns,” in the Eighth European Conference on Computer Vision (ECCV), pp. 469-481, 2004, and in “Face Description with Local Binary Patterns: Application to Face Recognition,” in IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12): 2037-2041, 2006, both of which are incorporated herein by reference. The combined features of the locally normalized HOG and the LBP improve the performance of detecting moving objects in a video, as described below. A combined HOG and LBP scheme is described by Xiaoyu Wang, et al. in “An HOG-LBP Human Detector with Partial Occlusion Handling,” in the International Conference on Computer Vision (ICCV), 2009, which is incorporated herein by reference.
The system 100 also includes a user friendly graphic interface 130, for instance using Microsoft Foundation Class (MFC). The graphic interface 130 is coupled to the detector 110 and the tracking module 120, and is configured to display video frames and enable the functions by the detector 110 and the tracking module 120. For instance, the tracking module 120 can track a moving object, such as a player, displayed via the graphic interface 130 at a determined average rate, e.g., 15 frames per second (fps) with sufficiently stable and precise result. The player is initially detected by the detector 110 and selected by the user via the interface 130. The system 100 may be developed and implemented for different software platforms, for instance as a Windows™ version or a Linux version. The system 100 may correspond to or may be part of a user equipment (UE) at the customer location, such as a video receiver, a set top box, a desktop/laptop computer, a computer tablet, a smartphone, or other suitable devices. The system 100 can be used for detection and tracking of any still or moving video objects in any type of played video, e.g., real-time played or streamed video or saved and loaded video (such as from a hard disk or DVD).
The detection algorithm includes the steps of the method 500. At step 501, an input image (or video frame) is received. At step 502, the gradient at each pixel in the image is computed, in accordance with the HOG algorithm. At step 503, the gradients at the pixels are processed using convoluted tri-linear interpolation. At step 504, the output of step 503 is processed using integral HOG. At step 505, the LBP at each pixel in the image is also computed, in accordance with the LBP algorithm. At step 506, the output of step 505 is processed using integral LBP. The steps 502, 503, and 504 and the steps 505 and 506 can be implemented in parallel. At step 507, the outputs from steps 504 and 506 are processed using a combined HOG and LBP algorithm (to compute a HOG-LBP feature) for each scanning window. At step 508, the output of step 507 is processed using support vector machine (SVM) classification.
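The scanning-window classification at the end of the pipeline can be sketched as follows. This is a hedged illustration under stated assumptions: the integral-image speedup, the image pyramid for multiple scales, and the trained SVM weights are omitted; `feature_fn`, `svm_w`, and `svm_b` are hypothetical names standing in for the HOG-LBP feature extractor and a trained linear SVM.

```python
import numpy as np

def detect(frame, svm_w, svm_b, feature_fn, win=(8, 8), stride=4):
    """Sliding-window detection sketch: score each window's feature with a
    linear SVM decision function and keep windows scoring above zero."""
    H, W = frame.shape
    wh, ww = win
    hits = []
    for y in range(0, H - wh + 1, stride):
        for x in range(0, W - ww + 1, stride):
            feat = feature_fn(frame[y:y + wh, x:x + ww])
            score = float(np.dot(svm_w, feat) + svm_b)  # linear SVM decision value
            if score > 0:
                hits.append((x, y, score))  # positive windows are candidate objects
    return hits
```

A real implementation would evaluate the SVM at multiple scales and merge overlapping positive windows; here a single scale suffices to show the structure of steps 507-508.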
A deformable model algorithm described by P. Felzenszwalb, et al. in “A discriminatively trained, multiscale, deformable part model,” in CVPR, 2008, which is incorporated herein by reference, has achieved efficient detection results on various standard datasets, including the INRIA dataset of N. Dalal and B. Triggs, the PASCAL dataset of M. Everingham, et al. in “The PASCAL Visual Object Classes Challenge,” at http://www.pascal-network.org/challenges/VOC/voc2011/workshop/index.html, the TUD dataset of M. Andriluka, et al. in “People-Tracking-by-Detection and People-Detection-by-Tracking,” in CVPR 2008, and the Caltech pedestrian dataset of P. Dollar, et al. in “Pedestrian Detection: A Benchmark,” in CVPR 2009, Miami, USA, June 2009, all of which are incorporated herein by reference.
The HOG-LBP algorithm described above is able to handle deformable parts and to localize the object tightly in comparison to the deformable model algorithm. To compare the two, the deformable model algorithm is set up using the HOG-LBP features, taking two root filters and several part filters. The performance of the deformable model algorithm so configured is acceptable. However, its speed may be relatively slow. Thus, the deformable model algorithm is not suitable for directly processing sports videos, which may require faster implementation. The deformable model algorithm is applied on test images to compare its performance with that of the HOG-LBP detection algorithm described above.
To guarantee that the detection algorithm matches the speed requirement of real-time video playing, the tracking module can be integrated with the video detection software. For processing speed considerations, a practical and relatively simple approach is implemented by computing the similarity of candidate window patches (scanning windows or boxes) with the highlighted object's patch. Given the position of a player in the last frame, the patch is cropped out and the HOG-LBP feature is computed. A color histogram is also computed for this patch using the hue channel of an HSV color model. By combining the HOG-LBP feature and the color histogram, a feature is built to describe the object patch. In the current frame, a sliding window method is applied on the neighboring area of the object's last position. The HOG-LBP and color histogram features are extracted for every scanning window and compared with the object feature. The similarity of two patches is evaluated by computing the correlation of the two feature vectors, which is the inner product of the two features. The candidate window with the maximum score is selected, and its score is compared with a pre-determined threshold. The threshold is set to check whether the patch is similar enough to the last one. If the candidate window's score is higher than the threshold, the candidate window is accepted as the new location of the object and the object tracking continues. Otherwise, a verification module is invoked to correct the result or stop tracking to restart detection.
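The tracking step above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the production tracker: `feature_fn` is a hypothetical stand-in for the combined HOG-LBP-plus-color descriptor, the hue range (0, 180) follows the common OpenCV convention, and the search window, stride, and threshold values are arbitrary.

```python
import numpy as np

def hue_histogram(hue_patch, bins=16):
    # Normalized color histogram over the hue channel of an HSV image
    hist, _ = np.histogram(hue_patch, bins=bins, range=(0, 180))
    return hist / (hist.sum() + 1e-6)

def track_step(obj_feat, frame, feature_fn, last_box, search=16, stride=4, thresh=0.8):
    """Scan the neighborhood of the object's last position; the similarity
    measure is the inner product of the two feature vectors."""
    x0, y0, w, h = last_box
    H, W = frame.shape[:2]
    best_score, best_box = -np.inf, None
    for y in range(max(0, y0 - search), min(H - h, y0 + search) + 1, stride):
        for x in range(max(0, x0 - search), min(W - w, x0 + search) + 1, stride):
            cand = feature_fn(frame[y:y + h, x:x + w])
            score = float(np.dot(obj_feat, cand))  # correlation as inner product
            if score > best_score:
                best_score, best_box = score, (x, y, w, h)
    # Accept the best candidate only if it is similar enough to the last patch;
    # otherwise signal the caller to invoke verification
    return (best_box, best_score) if best_score >= thresh else (None, best_score)
```

The return value of `None` models the hand-off to the verification module when no candidate clears the similarity threshold.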
The tracking is used in addition to the detection to improve the performance of the system. While detection is implemented initially to identify the objects, the tracking function is used in subsequent frames to improve the speed of the system. Tracking a moving object in subsequent frames is simpler and faster to implement (in software) than applying object detection to each frame.
As described above, the advantage of tracking in comparison to detection in each frame is speed. However, the bounding box for tracking an object (or player) of interest may drift over time (e.g., after a number of frames), for instance due to variations in the object's (or player's) appearance, background clutter, illumination change, occlusion, and/or other changes or aspects in the frames. To handle the drift effect of tracking and correct the position of the box or window patch, a verification process is added to the detection and tracking processes. After the tracking process extracts the HOG-LBP and color histogram features in the neighboring area of the last tracked position, a next step verifies whether there exists a window in the neighboring area that contains a player or object. The HOG-LBP feature is sent to SVM processing to find candidate locations of the player. The color histogram of each candidate is then compared with one or more previous tracking results. The verification score is based on the weighted sum of the SVM and color histogram comparison results. The candidate patch with the maximum score is compared with a pre-determined verification threshold. If the score is greater than the threshold, the tracking continues.
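The verification scoring can be sketched as follows. This is a hedged sketch, not the actual implementation: the weight `alpha` and the threshold are hypothetical values, each candidate is assumed to arrive as a precomputed (HOG-LBP feature, hue histogram) pair, and the SVM is reduced to its linear decision value.

```python
import numpy as np

def verify(candidates, svm_w, svm_b, prev_hue_hist, alpha=0.5, thresh=0.6):
    """Score each candidate by a weighted sum of its SVM decision value and
    its hue-histogram correlation with the previous tracking result."""
    best_score, best_idx = -np.inf, None
    for i, (feat, hue_hist) in enumerate(candidates):
        svm_score = float(np.dot(svm_w, feat) + svm_b)
        color_score = float(np.dot(prev_hue_hist, hue_hist))  # histogram correlation
        score = alpha * svm_score + (1 - alpha) * color_score
        if score > best_score:
            best_score, best_idx = score, i
    # Continue tracking only if the maximum-scoring candidate clears the threshold
    return (best_idx, best_score) if best_score >= thresh else (None, best_score)
```

A `None` index signals a failed verification, which triggers the retry logic described next.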
However, if the score is below the threshold, the following steps are implemented. If the verification function is invoked for the first time (during tracking), a counter is initialized for the number of verification attempts, and the verification function is called again in the next frame. The tracking module or function is applied on the current frame to provide a prediction for the next verification. If the system cannot correct the position of the player after implementing the verification process on a plurality of subsequent frames, then the system resets the counter and ends the tracking. The system can then return to the detection process.
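The retry-and-reset logic above can be sketched as a small controller. This is an illustrative sketch only: `verify_frame` is a hypothetical callable that runs verification on the next frame and returns a bounding box on success or `None` on failure, and the attempt limit is an assumed value.

```python
def verification_controller(verify_frame, max_attempts=5):
    """Retry verification on consecutive frames; after max_attempts
    consecutive failures, reset and fall back to the detection process."""
    attempts = 0  # counter for the number of verification attempts
    while True:
        box = verify_frame()  # returns a box on success, None on failure
        if box is not None:
            return ('tracking', box)   # position corrected, resume tracking
        attempts += 1
        if attempts >= max_attempts:
            return ('detect', None)    # give up tracking, restart detection
```

The two return states model the hand-off back to either the tracking module or the detector described earlier.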
The CPU 910 may comprise any type of electronic data processor. The memory 920 may comprise any type of system memory such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous DRAM (SDRAM), read-only memory (ROM), a combination thereof, or the like. In an embodiment, the memory 920 may include ROM for use at boot-up, and DRAM for program and data storage for use while executing programs. In embodiments, the memory 920 is non-transitory. The mass storage device 930 may comprise any type of storage device configured to store data, programs, and other information and to make the data, programs, and other information accessible via the bus. The mass storage device 930 may comprise, for example, one or more of a solid state drive, hard disk drive, a magnetic disk drive, an optical disk drive, or the like.
The video adapter 940 and the I/O interface 960 provide interfaces to couple external input and output devices to the processing unit. As illustrated, examples of input and output devices include a display 990 coupled to the video adapter 940 and any combination of mouse/keyboard/printer 970 coupled to the I/O interface 960. Other devices may be coupled to the processing unit 901, and additional or fewer interface cards may be utilized. For example, a serial interface card (not shown) may be used to provide a serial interface for a printer.
The processing unit 901 also includes one or more network interfaces 950, which may comprise wired links, such as an Ethernet cable or the like, and/or wireless links to access nodes or one or more networks 980. The network interface 950 allows the processing unit 901 to communicate with remote units via the networks 980. For example, the network interface 950 may provide wireless communication via one or more transmitters/transmit antennas and one or more receivers/receive antennas. In an embodiment, the processing unit 901 is coupled to a local-area network or a wide-area network for data processing and communications with remote devices, such as other processing units, the Internet, remote storage facilities, or the like.
Although the present invention and its advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims. Moreover, the scope of the present application is not intended to be limited to the particular embodiments of the process, machine, manufacture, composition of matter, means, methods and steps described in the specification. As one of ordinary skill in the art will readily appreciate from the disclosure of the present invention, processes, machines, manufacture, compositions of matter, means, methods, or steps, presently existing or later to be developed, that perform substantially the same function or achieve substantially the same result as the corresponding embodiments described herein may be utilized according to the present invention. Accordingly, the appended claims are intended to include within their scope such processes, machines, manufacture, compositions of matter, means, methods, or steps.