The present invention generally relates to a method and associated apparatus for analyzing video to detect far-view scenes in sports video to determine when certain image processing algorithms should be applied. The method comprises analyzing and classifying the fields of view of images from a video signal, creating and classifying the fields of view of sets of sequential images, and selectively applying image processing algorithms to sets of sequential images representing a particular type of field of view.
This section is intended to introduce the reader to various aspects of art, which may be related to various aspects of the present invention that are described below. This discussion is believed to be helpful in providing the reader with background information to facilitate a better understanding of the various aspects of the present invention. Accordingly, it should be understood that these statements are to be read in this light, and not as admissions of prior art.
As mobile devices have become more capable and mobile digital television standards have evolved, it has become increasingly practical to view video programming on such devices. The small screens of these devices, however, present some limitations, particularly for the viewing of sporting events. Small objects, such as the ball in a sporting event, can be difficult to see. The use of high video compression ratios can exacerbate the situation by significantly degrading the appearance of small objects like a ball in a far-view scene.
It is possible to enhance the appearance of these objects, but such algorithms may be computationally costly or degrade the overall image quality if applied when not needed. It would be desirable to be able to detect particular types of scenes that could benefit from object enhancement algorithms such that the algorithms may be selectively applied. The invention described herein addresses this and/or other problems.
In order to solve the problems described above, the present invention concerns analyzing video to detect far-view scenes in sports video to determine when certain image processing algorithms should be applied. The method comprises analyzing and classifying the fields of view of images from a video signal, creating and classifying the fields of view of sets of sequential images, and selectively applying image processing algorithms to sets of sequential images representing a particular type of field of view. This and other aspects of the invention will be described in detail with reference to the accompanying drawings.
The above-mentioned and other features and advantages of this invention, and the manner of attaining them, will become more apparent, and the invention will be better understood, by reference to the following description of embodiments of the invention taken in conjunction with the accompanying drawings.
The exemplifications set out herein illustrate preferred embodiments of the invention. Such exemplifications are not to be construed as limiting the scope of the invention in any manner.
As described herein, the present invention provides a method and associated apparatus for analyzing video to detect far-view scenes in sports video to determine when certain image processing algorithms should be applied. The method comprises analyzing and classifying the fields of view of images from a video signal, creating and classifying the fields of view of sets of sequential images, and selectively applying image processing algorithms to sets of sequential images representing a particular type of field of view.
While this invention has been described as having a preferred design, the present invention can be further modified within the spirit and scope of this disclosure. This application is therefore intended to cover any variations, uses, or adaptations of the invention using its general principles. Further, this application is intended to cover such departures from the present disclosure as come within known or customary practice in the art to which this invention pertains and which fall within the limits of the appended claims.
In a preferred embodiment, the present invention may be implemented in signal processing hardware within a television production or transmission environment. The method is used to detect far-view scenes in sports video, with an exemplary application in soccer video. Far-view scenes are those corresponding to wide-angle camera views of the play field, wherein the objects of interest, for instance, the players and ball, are small enough to be easily degraded by video compression or not be clearly visible.
Information 130 about the identified objects is passed to the object enhancement stage 150 and to an object-aware encoder 170. This information may include, for instance, the location, size, trajectory, or a mask of a ball. Video frames 140 are processed at object enhancement stage 150 using the information 130 about the identified objects. For instance, a highlight color may be placed over the location of a ball or puck to allow the viewer to more easily identify its location.
The resulting video frames 160 with enhancement applied to the detected objects are then encoded by the object-aware encoder 170, resulting in an output bitstream 180. The use of object information 130 by the object-aware encoder 170 may allow the encoding to be adjusted to preserve the visibility and appearance of identified objects in the far-view scenes, such as players or the ball. For instance, a lower compression ratio may be used for scenes in which the ball appears as a small object, or for particular areas of frames where the ball appears.
In a system like system 100, object localization and enhancement are performed without regard to the type of view represented in the frames being processed. Thus, unnecessary processing is performed on some types of scenes, potentially resulting in wasted time, wasted processing resources, or image quality degradation.
Far-view scenes 230 are sent to object localization and enhancement processing 250. This processing may include highlighting of a detected ball, illustration of a ball trajectory, or other enhancements. Non-far-view scenes 240 bypass the object localization and enhancement stage and are sent directly to the object-aware encoder 280. Object information 260 produced by the object localization and enhancement stage 250 and the enhanced far-view scenes 270 are also sent to the object-aware encoder 280, which produces an encoded output bitstream 290.
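The routing just described can be sketched in a few lines. This is only an illustrative sketch: the names `route_scenes`, `localize_and_enhance`, and `encode` are hypothetical and do not appear in the disclosure, and the two callables merely mark where the object localization and enhancement stage 250 and the object-aware encoder 280 would sit.

```python
def route_scenes(scenes, localize_and_enhance, encode):
    """Route scenes as described for far-view and non-far-view scenes.

    scenes: iterable of (frames, is_far_view) pairs (hypothetical format).
    localize_and_enhance: callable returning (enhanced_frames, object_info).
    encode: callable taking (frames, object_info) and returning encoded data.
    """
    bitstream = []
    for frames, is_far_view in scenes:
        if is_far_view:
            # Far-view scenes pass through object localization and enhancement.
            frames, object_info = localize_and_enhance(frames)
        else:
            # Non-far-view scenes bypass that stage entirely.
            object_info = None
        # Both kinds of scene are sent to the object-aware encoder.
        bitstream.append(encode(frames, object_info))
    return bitstream
```

Passing the stages in as callables keeps the sketch independent of any particular enhancement or encoding technique.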
As described earlier, the use of object information 260 by the object-aware encoder 280 allows the encoding to be adjusted, for instance, to preserve the visibility and appearance of identified objects, such as soccer balls, in the far-view scenes. For instance, a lower compression ratio may be used for scenes in which the ball appears, or for particular areas of frames where the ball appears.
The detection of far-view scenes of step 220 comprises the following stages:
At step 330, “player-like” objects are identified through analysis of the foreground objects, which are connected sets of non-field pixels within the field boundary identified in step 320. For a foreground object o with constituent pixels {(x_i, y_i)}, the following object parameters are computed:
area: a_o = Σ_i 1 (i.e., the number of pixels in the object),
height: h_o = max_i y_i − min_i y_i + 1,
width: w_o = max_i x_i − min_i x_i + 1,
compactness: c_o = a_o/(h_o × w_o), and
aspect ratio: r_o = h_o/w_o.
Objects are considered “player-like” when the area, compactness, and aspect ratio each exceed a threshold related to known characteristics of players in a soccer video. Stated otherwise, if (a_o > t_a), (c_o > t_c), and (r_o > t_r), the object o is considered “player-like.” In a preferred embodiment directed toward 320×240-pixel soccer videos, the threshold values of t_a = 10, t_c = 0.1, and t_r = 0.5 are used.
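The player-like test can be illustrated with a short sketch. It assumes that the height and width of an object are the extents of its bounding box in pixels, which is an assumption of this example; the function names are hypothetical. The default thresholds are those given for 320×240-pixel soccer video.

```python
def object_parameters(pixels):
    """Return (area, height, width, compactness, aspect_ratio) for an object.

    pixels: non-empty collection of (x, y) coordinates of the object.
    Height and width are taken as bounding-box extents (an assumption).
    """
    pts = list(pixels)
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    area = len(pts)                       # number of pixels in the object
    height = max(ys) - min(ys) + 1
    width = max(xs) - min(xs) + 1
    compactness = area / (height * width)
    aspect_ratio = height / width
    return area, height, width, compactness, aspect_ratio


def is_player_like(pixels, t_a=10, t_c=0.1, t_r=0.5):
    """True when area, compactness, and aspect ratio all exceed their thresholds."""
    area, _, _, compactness, aspect_ratio = object_parameters(pixels)
    return area > t_a and compactness > t_c and aspect_ratio > t_r
```

For example, a filled block 4 pixels wide and 8 pixels tall has area 32, compactness 1.0, and aspect ratio 2.0, so it passes all three tests, whereas a wide, flat object 20 pixels by 2 pixels fails on aspect ratio.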
At step 340, the maximum area Amax and median area Amed of all player-like objects are computed. As described above, the area of a particular object may be computed simply as the number of pixels comprising the object. Likewise, the area of the field Afield may be computed as the number of pixels comprising the field mask.
At step 350, the area of the field, the median area of player objects, and the maximum area of player objects are compared to thresholds related to the expected areas in a far-view scene. If (Afield>Tfield), (Amed<Tmed), and (tmax<Amax<Tmax), the frame is labeled as FV at step 360. That is, if the field area in the frame is large, the median player area is small, and the maximum player area is within an expected range, the field-of-view of the scene is wide, or far.
If the frame is not labeled as FV, further determinations are made as to whether the frame may be far view, or is most likely not far view. The frame is classified as MFV at step 380 if the criteria (Afield>tfield) and (Amax≦tmax) are met at step 370. Stated otherwise, if the field area is above a lower threshold, but the maximum area of a player-like object is not above a minimum threshold, a reliable determination cannot be made based upon the single frame. If the frame is not labeled as FV or MFV, it is classified as NFV at step 390.
In a preferred embodiment directed toward 320×240-pixel soccer videos, the threshold values Tfield=0.4×H×W (40% of the frame area), tfield=0.3×H×W (30% of the frame area), Tmed=600, tmax=100, and Tmax=2400 are used, where H and W are the height and width of the frame in pixels.
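The frame-level decision of steps 350 through 390 can be sketched as below, using the thresholds stated for 320×240-pixel video. The function name is hypothetical, and treating a frame with no player-like objects as having maximum and median areas of zero is an assumption of this sketch that the text does not address.

```python
from statistics import median


def classify_frame(field_area, player_areas, height=240, width=320):
    """Label one frame as "FV", "MFV", or "NFV".

    field_area: number of pixels in the field mask.
    player_areas: areas (in pixels) of the player-like objects in the frame.
    """
    T_field = 0.4 * height * width   # upper field-area threshold
    t_field = 0.3 * height * width   # lower field-area threshold
    T_med, t_max, T_max = 600, 100, 2400

    # Assumption: with no player-like objects, both statistics are zero.
    a_max = max(player_areas) if player_areas else 0
    a_med = median(player_areas) if player_areas else 0

    if field_area > T_field and a_med < T_med and t_max < a_max < T_max:
        return "FV"
    if field_area > t_field and a_max <= t_max:
        return "MFV"
    return "NFV"
```

With these defaults, a frame whose field mask covers 40,000 pixels and whose player-like objects have areas 150, 200, and 300 is labeled FV, since the field area exceeds 30,720, the median area is below 600, and the maximum area lies between 100 and 2400.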
The steps of method 300 are repeated for each frame in the sequence. Thus, a sequence of classifications of field of view, one per frame, is obtained.
The frame-level classification of method 300 will generally produce some erroneous frame classifications. In order to mitigate the effects of such errors, the video is segmented into sets of contiguous “similar-looking” frames called “chunks.” The process of identifying chunks is described below.
At step 515, the list of chunks C is empty. The starting frame number j, the first frame in the chunk under construction, is initialized to a value of 1, the first frame in the video sequence.
At step 520, the color histogram Hj of the jth frame is computed. In a preferred embodiment, Hj is a 256-bin histogram of the grayscale values of the pixels in frame j. A smaller number of larger histogram bins may be utilized to reduce the computational cost of histogram comparison. The color histogram of the jth frame serves as the basis for assembly of the chunk.
At step 525, a loop begins for frames i=j+1, . . . , N, where N is the number of the last frame in the sequence. The frames following frame j will be analyzed one at a time for similarity to frame j to determine if they should be included in the chunk that begins with frame j.
At step 530, the color histogram Hi of the ith frame is computed using the same technique used at step 520. Then, at step 535, the histogram difference dij between the ith frame and the jth frame is computed. The difference may be computed as dij=∥Hi−Hj∥1, where ∥·∥1 refers to the 1-norm, or sum of absolute differences (SAD), between two vectors.
At step 540, a determination is made based on the difference dij from step 535 as to whether the color histogram of frame i is similar enough to that of frame j for frame i to be included in the chunk beginning with frame j. The frames are considered similar enough if the distance dij between their color histograms Hi and Hj does not exceed a threshold Tchunk. In a preferred embodiment, Tchunk=0.4.
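The histogram comparison can be sketched as follows. The text does not say whether the histograms are normalized; this sketch assumes each histogram is divided by the pixel count so that its entries sum to 1, which bounds the SAD between 0 and 2 and makes a threshold such as 0.4 meaningful regardless of frame size. The function names are hypothetical.

```python
def gray_histogram(pixels, bins=256):
    """Normalized histogram of 8-bit grayscale values (entries sum to 1).

    pixels: non-empty sequence of integers in the range 0..255.
    bins: number of equal-width bins; fewer bins reduce comparison cost.
    """
    counts = [0] * bins
    for value in pixels:
        counts[value * bins // 256] += 1
    total = len(pixels)
    return [c / total for c in counts]   # normalization is an assumption


def histogram_difference(h_i, h_j):
    """1-norm (sum of absolute differences) between two histograms."""
    return sum(abs(a - b) for a, b in zip(h_i, h_j))


def is_similar(h_i, h_j, t_chunk=0.4):
    """True when frame i may join the chunk that begins with frame j."""
    return histogram_difference(h_i, h_j) <= t_chunk
```

Under this normalization, two frames with identical gray-level distributions have a difference of 0, and two frames with no gray levels in common have a difference of 2.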
If dij>Tchunk, that is, if the color histogram difference of frame i from frame j is too large for frame i to be included in the chunk, the interval [j, i−1] is added to the list of chunks at step 545. Thus, the current chunk is terminated at the previous frame, frame i−1, the last frame meeting the similarity threshold. The starting frame number j for the new chunk is set to the current value of i at step 565, and the process returns to step 520 for building of the next chunk.
However, if dij is less than or equal to Tchunk, the frame i is considered similar enough to the initial frame of the chunk, j, to be added to the current chunk. A determination is then made at step 555 as to whether i=N (i.e., the last frame in the video has been reached). If not, the process returns to the beginning of the loop at step 525 and the next frame is considered for inclusion in the current chunk.
The ending frame number of the chunk under construction is thereby increased until either a sufficiently dissimilar frame is located or until the last frame is reached. If the last frame has been reached, that is, i=N, the interval [j N] consisting of the final frames is added to the list of chunks 575 at step 570 and the process terminates.
At the end of this process, a list of chunks C has been produced. Each chunk is represented by a pair [b e], where b is the beginning frame of the chunk and e is the ending frame.
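The chunk-building loop can be sketched as below. The sketch takes one precomputed histogram per frame and uses 1-based, inclusive frame numbers to match the text; the function name is hypothetical. It also emits a single-frame chunk when the final frame differs from the chunk before it, a corner case that the step description does not spell out and that is handled here by assumption.

```python
def segment_into_chunks(histograms, t_chunk=0.4):
    """Split a frame sequence into chunks of similar-looking frames.

    histograms: list with one histogram (sequence of numbers) per frame.
    Returns a list of (b, e) pairs with 1-based, inclusive frame numbers.
    """
    n = len(histograms)
    chunks = []
    j = 1                                  # first frame of the current chunk
    while j <= n:
        h_j = histograms[j - 1]
        i = j + 1
        while i <= n:
            # Each candidate frame is compared with the chunk's first frame,
            # not with the frame immediately before it.
            d_ij = sum(abs(a - b) for a, b in zip(histograms[i - 1], h_j))
            if d_ij > t_chunk:
                break                      # frame i starts a new chunk
            i += 1
        chunks.append((j, i - 1))          # chunk ends at the last similar frame
        j = i
    return chunks
```

Because every frame is compared with the first frame of its chunk, a slow drift in appearance eventually ends the chunk even when consecutive frames are nearly identical.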
At step 620, each frame of the input video chunk 610 is classified as FV, MFV, or NFV. This frame-level classification may be performed using method 300 described above.
At step 630, the percentage of FV frames is computed for the chunk. If more than 50% of the frames in the chunk are determined to be FV at step 640, the whole chunk is classified as FV at step 650. That is, if the majority of constituent frames are far view, the chunk is considered far view.
If the percentage of FV frames is not above 50%, the percentage of MFV frames in the chunk is computed at step 660. If more than 50% of frames are determined to be MFV at step 670, the chunk is classified as MFV at step 680. If neither criterion is satisfied, the chunk is classified as NFV at step 690. In an alternative embodiment, chunks may be classified as NFV if the frame count is below a certain threshold.
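The majority vote over a chunk can be sketched as follows. The function name and the `min_frames` parameter are hypothetical; the latter stands in for the alternative embodiment's frame-count threshold, whose value the text does not give, and it defaults to 0 so that the check is disabled.

```python
def classify_chunk(frame_labels, min_frames=0):
    """Classify a chunk from the "FV"/"MFV"/"NFV" labels of its frames."""
    n = len(frame_labels)
    if n == 0 or n < min_frames:
        return "NFV"                       # optional short-chunk rule
    # Integer comparison: "more than 50%" means twice the count exceeds n.
    if 2 * frame_labels.count("FV") > n:
        return "FV"
    if 2 * frame_labels.count("MFV") > n:
        return "MFV"
    return "NFV"
```

Note that a chunk split exactly evenly between FV and MFV frames satisfies neither strict majority and is therefore classified as NFV.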
At step 740, if an MFV chunk lies adjacent to an FV chunk, it is reclassified as FV. That is, if a determination regarding the field of view could not be made at step 730, the chunk will be considered far-view if it is adjacent to a far-view chunk. A chunk [b1 e1] is said to be adjacent to a chunk [b2 e2] if b1=e2+1 or e1=b2−1. In a preferred embodiment, only MFV chunks adjacent to original FV chunks are reclassified and reclassification based on adjacency to other reclassified FV chunks is not allowed.
At step 750, all remaining MFV chunks are reclassified as NFV. That is, if a determination regarding the field of view of the chunk could not be made at step 620 and the chunk is not adjacent to a chunk identified as far-view, the chunk will be assumed to not be far-view.
At step 760, the process merges all FV chunks that lie adjacent to each other into larger chunks. By merging two adjacent FV chunks [b1 e1] and [b2 e2], where e1=b2−1, C is modified by removing these two chunks and adding [b1 e2] in their place. The new chunk inherits the FV label of its constituent parts. This merging is repeated until no adjacent FV chunks remain.
If an FV chunk has fewer than Nmin frames, it may be reclassified as NFV. In one particular embodiment, Nmin is chosen to be 30 frames. Thus, processing of short scenes may be avoided.
Finally, at step 780, all of the remaining FV chunks in C are classified as far-view scenes. The beginning and ending frames of each FV chunk indicate the boundaries 790 of the corresponding far-view scene. Similarly, all remaining NFV chunks are classified as non-far-view scenes (after merging adjacent ones as described earlier).
At the end of this process, a list of far-view scenes SFV and a list of non-far-view scenes SNFV are obtained. Each scene is represented by a pair [b e], where b is the beginning frame of the scene and e is its end frame. These scene boundaries may then be used by the object highlighting system described above.
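The scene-level steps, from reclassifying MFV chunks through producing the two scene lists, can be sketched as below. The function names and the input format, a list of ((b, e), label) pairs sorted by starting frame, are hypothetical. The sketch follows the preferred embodiment in which only chunks originally labeled FV can promote an adjacent MFV chunk, and it uses Nmin = 30.

```python
def merge_same_label(chunks):
    """Merge adjacent chunks (e1 + 1 == b2) that carry the same label."""
    merged = []
    for (b, e), label in chunks:
        if merged and merged[-1][1] == label and merged[-1][0][1] + 1 == b:
            (prev_b, _), _ = merged[-1]
            merged[-1] = ((prev_b, e), label)
        else:
            merged.append(((b, e), label))
    return merged


def detect_scenes(chunks, n_min=30):
    """Return (far_view_scenes, non_far_view_scenes) as lists of (b, e)."""
    def adjacent(k, m):
        return chunks[k][0][1] + 1 == chunks[m][0][0]

    original = [label for _, label in chunks]
    relabeled = []
    for k, (span, label) in enumerate(chunks):
        if label == "MFV":
            # Only neighbors that were FV before any reclassification count.
            left = k > 0 and original[k - 1] == "FV" and adjacent(k - 1, k)
            right = (k + 1 < len(chunks) and original[k + 1] == "FV"
                     and adjacent(k, k + 1))
            label = "FV" if (left or right) else "NFV"
        relabeled.append((span, label))

    merged = merge_same_label(relabeled)
    # FV chunks shorter than n_min frames are reclassified as NFV.
    merged = [((b, e), "NFV" if label == "FV" and e - b + 1 < n_min else label)
              for (b, e), label in merged]
    merged = merge_same_label(merged)

    s_fv = [span for span, label in merged if label == "FV"]
    s_nfv = [span for span, label in merged if label == "NFV"]
    return s_fv, s_nfv
```

The second call to `merge_same_label` joins NFV chunks that become adjacent once a short FV chunk between them has been relabeled.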
While the present invention has been described in terms of a specific embodiment, it will be appreciated that modifications may be made which will fall within the scope of the invention. For example, various processing steps may be implemented separately or combined, and may be implemented in general purpose or dedicated data processing hardware or in software. The overall complexity of the method may be reduced by relaxing the criteria for objects in the field to be considered in decision-making. For example, instead of detecting player-like objects, all objects larger than a threshold area may be considered. Also, instead of using grayscale pixel values for computing histograms during chunk segmentation, it is possible to use full-color values (e.g. RGB, YUV). Furthermore, distance measures other than the SAD may be used for comparing histograms. Instead of using threshold-based criteria for frame classification, a classifier (e.g., support vector machine) may be learned from a labeled training set of FV, MFV, and NFV frames. Furthermore, the proposed method may be applied to other sports or events with moving objects of interest. Finally, the method may be used to detect other types of scenes than far view for specialized processing.
This application claims benefit, under 35 U.S.C. §365, of International Application PCT/US2010/002028, filed Jul. 19, 2010, which was published in accordance with PCT Article 21(2) on Jan. 27, 2011 in English and which claims the benefit of U.S. provisional patent application No. 61/271,381, filed Jul. 20, 2009.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/US2010/002028 | 7/19/2010 | WO | 00 | 1/12/2012 |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2011/011052 | 1/27/2011 | WO | A |
Number | Date | Country | |
---|---|---|---|
20120121174 A1 | May 2012 | US |
Number | Date | Country | |
---|---|---|---|
61271381 | Jul 2009 | US |