Broadband network operators, such as multiple system operators (MSOs), distribute and deliver services such as video, audio, and multimedia content to subscribers or end-users. For example, a broadband cable network MSO may utilize resources for transmitting digital video as linear (i.e., scheduled) services or as non-linear services that enable viewers to retrieve audiovisual content at any time, independent of the linear broadcast schedule. A non-linear service may be time-displaced from its corresponding linear broadcast service, for instance, by only a few minutes or by many months, or may involve repeat consumption of a program.
Highlight or replay viewing of content provides a specific example of non-linear viewing. For instance, a viewer that may not have sufficient free time to watch a three hour sports contest, such as a football game, from beginning to end may instead view the game at a later time on any number of different types of electronic viewing devices. For instance, viewing may be accomplished with a portable device with an objective of only watching plays of a specific team, plays involving a specific player, or exciting moments occurring within a game.
Professionally produced sports video typically includes replay clips or video segments which necessarily correlate with exciting moments or highlights occurring within a game. The ability to view such highlights and/or replays as a non-linear service may be particularly desirable for a viewer who is only able to begin watching a game after it has started, so that the viewer may quickly catch up on what has occurred in the game. Such viewing may also enable a viewer to quickly ascertain the essence of a completed game.
Various features of the embodiments described in the following detailed description can be more fully appreciated when considered with reference to the accompanying figures, wherein the same numbers refer to the same elements.
For simplicity and illustrative purposes, the principles of the embodiments are described by referring mainly to examples thereof. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the embodiments. It will be apparent, however, to one of ordinary skill in the art that the embodiments may be practiced without limitation to these specific details. In some instances, well-known methods and structures have not been described in detail so as not to unnecessarily obscure the embodiments.
According to an embodiment, a method and apparatus are provided enabling a highlight or replay service to be offered by a provider over a network to subscribers as a non-linear service. The service enables automatic, high quality, replay extraction from video and playback of the extracted replays to subscribers. Such a service may automatically extract and generate a sequence of highlight/replay clips within a given broadcast program, such as a football game, other sports contest, or the like. For this purpose, the method and apparatus should be able to readily identify highlight and replay clips or other clips of desire from video sequences to enable ready location and/or extraction thereof.
With respect to a video of a football game or the like, highlights and replays may be detectable to some extent by heuristics, such as locations in video where the use of telestration, slow motion, or the absence of a score panel may be detected. However, these readily detectable events are by nature non-exclusive and unreliable and may generate many false detections or may be unable to detect some highlights or replays that do not contain such events.
According to an embodiment, highlight and replay clips are located within video based on the use, for instance, in professional broadcast sports video, of visual cues to signal the start and/or end of highlights and replays. Thus, highlight or replay sequences or clips within a video are demarcated by a sequence of visually distinctive video frames. For purposes of this disclosure, these sequences of visually distinctive frames are referred to as “marker frames” or “markers”. The sequence of marker frames is also referred to as a demarcating segment. A video may contain multiple demarcating segments each serving a different purpose, for instance, one to signal the start of a replay clip and another to signal the end of a replay clip.
According to an embodiment, a method and apparatus are provided that are able to automatically identify, verify, and apply marker frames within a video for purposes of automatically generating a highlight/replay clip service.
Different video programs, types of programs, producers of programs, etc. necessarily use different demarcating segments or visual cues. Moreover, demarcating segments used by a particular video producer may change from one game to the next, from one part of the season to the next, and from one season to the next. Thus, the method is required to identify demarcating segments in each video and not rely on known demarcating segments previously identified in other videos. After the demarcating segments are identified within a given video, highlights and replays from the given video can be reliably detected by matching the signatures of the marker frames to the signatures of the video frames and determining their location within the video stream.
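By way of a non-limiting illustration, the matching of marker frame signatures to video frame signatures may be sketched as follows. The description does not define how a signature is computed, so the quantization in `signature`, the `step` parameter, and all function names are assumptions made only for this sketch.

```python
def signature(features, step=8):
    """Quantize a feature vector into a coarse, hashable signature.

    Assumption: dividing each value by `step` stands in for whatever
    signature an actual embodiment would compute.
    """
    return tuple(int(value) // step for value in features)


def locate_markers(frame_features, marker_signatures, step=8):
    """Return indices of frames whose signature matches a marker signature."""
    known = set(marker_signatures)
    return [
        index
        for index, features in enumerate(frame_features)
        if signature(features, step) in known
    ]


def demarcating_segments(indices):
    """Merge runs of consecutive frame indices into (first, last) spans."""
    spans = []
    for index in indices:
        if spans and index == spans[-1][1] + 1:
            spans[-1] = (spans[-1][0], index)  # extend the current run
        else:
            spans.append((index, index))  # start a new run
    return spans
```

Each returned span gives the first and last frame index of one run of matched marker frames, that is, the location of one demarcating segment within the video.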
A demarcating segment typically uses video frames that are visually distinct both statically and dynamically, for two reasons. First, a demarcating segment must provide sufficiently strong visual stimuli to indicate the start or end of a replay clip or like segment of interest. Second, a demarcating segment must not be confused with other visual content, such as content from other genres and advertisements.
According to an embodiment and based on the above information, video frames from the video that are visually distinct are automatically detected. As an example of video frames,
For purposes of obtaining the candidate marker frames 24, selected visual features may be extracted from each frame of the video. An example of a visual feature is a set of Color Layout Descriptors (CLDs), which capture the spatial distribution of color in an image or frame of video. Another example of a visual feature is the Edge Histogram Descriptor (EHD). Yet another example of a visual feature is a color histogram of a color image. Different visual features for a video frame can be combined into an augmented visual feature. For example, combining CLD and EHD coefficients would form a 92-dimension vector, where 12 coefficients come from the CLD and 80 coefficients come from the EHD. Based on the features from each frame unit (such as one or more consecutive frames), a statistical model for the video is generated. A probability vector for a frame unit is obtained by plugging the features from that frame unit into the statistical model. The probability vector may be used to automatically separate and identify visually distinct frames, or candidate marker frames, from the remaining video frames.
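A minimal sketch of this step is given below for illustration only. The description does not specify the form of the statistical model, so the sketch assumes an independent Gaussian per feature dimension, treats the per-dimension log densities as the probability vector, and flags a frame as visually distinct when its summed log density falls well below the median. The function names and the `margin` parameter are hypothetical.

```python
import math
import statistics

CLD_DIMS = 12  # CLD coefficients per frame, as stated in the description
EHD_DIMS = 80  # EHD coefficients per frame, as stated in the description


def augment(cld, ehd):
    """Concatenate CLD and EHD coefficients into one 92-dimension vector."""
    if len(cld) != CLD_DIMS or len(ehd) != EHD_DIMS:
        raise ValueError("expected 12 CLD and 80 EHD coefficients")
    return list(cld) + list(ehd)


def fit_model(vectors):
    """Fit a mean and variance per dimension (assumed diagonal Gaussian)."""
    n = len(vectors)
    dims = len(vectors[0])
    means = [sum(v[d] for v in vectors) / n for d in range(dims)]
    variances = [
        # Floor the variance so constant dimensions do not divide by zero.
        max(sum((v[d] - means[d]) ** 2 for v in vectors) / n, 1e-6)
        for d in range(dims)
    ]
    return means, variances


def log_probability_vector(vector, model):
    """Per-dimension Gaussian log densities of one frame under the model."""
    means, variances = model
    return [
        -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
        for x, mu, var in zip(vector, means, variances)
    ]


def candidate_marker_frames(vectors, margin):
    """Return indices of frames scoring more than `margin` below the median."""
    model = fit_model(vectors)
    scores = [sum(log_probability_vector(v, model)) for v in vectors]
    typical = statistics.median(scores)
    return [i for i, score in enumerate(scores) if score < typical - margin]
```

Under these assumptions, frames that resemble most of the video score near the median, while a frame with unusual color layout and edge statistics scores far lower and is returned as a candidate.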
After the candidate marker frames 24 are extracted from the video, the candidate marker frames 24 are grouped based on visual similarity into a plurality of different groups. In one embodiment, the candidate marker frames are grouped based on a subset of CLD coefficients, i.e., if the subsets of coefficient values for two video frames are identical, then the two frames fall into the same group. One way to implement this grouping is to create a hashing key by concatenating the selected subset of feature values into an integer. Other clustering methods, such as K-means or hierarchical clustering, can be used for the purpose of grouping. In another embodiment, the K-means clustering is used to group the candidate marker frames. See step 28 in
For purposes of identifying true replay marker frames from the groups of candidate marker frames, an embodiment provides each group with a proximity score. See step 40 in
Typically, the so-called "input video event" exists over a time period starting at the left boundary T_left and ending at the right boundary T_right. In one embodiment, in order to calculate the proximity score, a proximity function is defined for the left and right boundaries respectively, over the left window [T_left-K, T_left] and the right window [T_right, T_right+K], where K is the length of the window. The function value can be proportional to the proximity to the input video event, hence ascending over the left window and descending over the right window. Alternatively, the function can be flat over either or both of the windows. The flat function essentially defines a binary proximity score. In another embodiment, a single frame is selected within the time period of the given input video event; in this case T_left=T_right.
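The proximity function described above may be sketched, by way of example only, as follows. The linear ramp is one possible function that is proportional to proximity; the function name, the `flat` flag, and the choice to return the left and right scores as a pair are assumptions of this sketch.

```python
def proximity(t, t_left, t_right, k, flat=False):
    """Score how close time t lies to an event spanning [t_left, t_right].

    Returns (left_score, right_score). The left score is non-zero only
    inside the left window [t_left - k, t_left] and ascends toward the
    event; the right score is non-zero only inside the right window
    [t_right, t_right + k] and descends away from the event. With
    flat=True both windows yield a binary score of 1.0.
    """
    if k <= 0:
        raise ValueError("window length k must be positive")
    left = right = 0.0
    if t_left - k <= t <= t_left:
        left = 1.0 if flat else (t - (t_left - k)) / k
    if t_right <= t <= t_right + k:
        right = 1.0 if flat else ((t_right + k) - t) / k
    return left, right
```

For an event spanning times 10 to 20 with K=4, a candidate frame at time 8 receives a left score of 0.5 and a right score of 0. When a single frame represents the event (T_left=T_right), a candidate at that exact time receives the maximum score in both windows.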
Usually, different marker frames are used to demarcate the start and the end of a highlight/replay. However, sometimes the same sequence of marker frames is used to demarcate both the start and the end of a highlight/replay. In general, each group of candidate marker frames is scored for proximity against both the start and the end. Hence, if a group of marker frames demarcates both the start and the end of a highlight/replay, it will receive a high proximity score for both.
A score panel that may be shown during a sporting event, such as a football game, may disappear during replay. However, score panels may also be absent at the beginning of a game, during advertisements, or during other segments of a game. Telestration may appear in some replay segments, but not all replay segments. Slow motion may appear in some replay segments, but not all replay segments. Although detecting replay segments solely based on absence of score panel, detection of telestration, or detection of slow motion events may be inherently unreliable or inconclusive, these events can be useful in step 40 for identifying which group of the visually distinct candidate marker frames are the true marker frames of replay segments based on the development of a proximity score. The closer a candidate marker frame is temporally to one of the above replay events (score panel absence, telestration, slow motion, etc.) the higher the proximity score. Collectively, the group of candidate marker frames that contains the true replay marker frames will produce a higher average proximity score, because these frames collectively will be statistically closer, in time, to the given replay events.
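The collective scoring described above may be illustrated with the following sketch, which averages a per-frame proximity score over each group and ranks the groups. The linear per-frame score, the function names, and the group labels used in the usage example are hypothetical and are chosen only to make the arithmetic easy to follow.

```python
def frame_score(t, event_times, k):
    """Linear proximity of time t to the nearest event, zero beyond k."""
    nearest = min(abs(t - event) for event in event_times)
    return max(0.0, 1.0 - nearest / k)


def rank_groups(groups, event_times, k):
    """Rank groups of candidate marker frame times by mean proximity score."""
    scored = []
    for name, times in groups.items():
        mean = sum(frame_score(t, event_times, k) for t in times) / len(times)
        scored.append((name, mean))
    # Highest average proximity first.
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Usage: three replay events and two groups of candidate frame times.
events = [100, 300, 500]
groups = {
    "logo_wipe": [95, 295, 495, 700],   # mostly adjacent to events
    "crowd_shot": [50, 98, 400, 650],   # only one frame near an event
}
ranking = rank_groups(groups, events, k=10)
```

In this toy example, the single crowd-shot frame at time 98 has the highest individual score, yet the group containing frames that repeatedly fall near events obtains the higher average and ranks first.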
Accordingly, after the proximity score is determined for each group 30′, 32′, 34′, 36′ and 38′ of candidate marker frames in
Thus, as shown in
A list of possible video segments may be used to create a video playback application for a user. In one embodiment, the user may be presented with a sequence of the demarcated replay segments 44. Each segment may be represented, for example, using a text description, time information, or one or more images. The user may then select the demarcated replay segment 44 to play the segment. The user may select and watch one segment at a time, or, for example, may choose several segments and watch all selected segments in a sequence.
As is well known in the art, hashing is a data storage method utilizing a key-value store. For each value to be stored, a hash function takes the value to be stored as input and outputs a corresponding hash key. Subsequently, the value is then stored in the key-value store using the computed hash key. If multiple values correspond to the same hash key, then the multiple values are stored in a collection or list affiliated with the hash key. Grouping candidate marker frames into one or more groups based on visual similarity may be accomplished by the choice of the hash function. In one embodiment, the hash function takes as input the CLD values and outputs an integer hash key. For example, if a first CLD and a second CLD represent visually similar frames, then a hash function may output the same integer hash key for both CLDs. As a result, storing the CLDs in a key-value store will result in the two visually similar CLDs being stored together in the same collection or list.
In one embodiment, the hash function may be computed by concatenating (bitwise operation) a selected subset of visual features into an integer, which is used as the hash key.
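One way such a hash key could be formed is sketched below, for illustration only. The sketch assumes that each selected coefficient is a non-negative integer that fits in a fixed number of bits; the value of `BITS_PER_VALUE`, the function names, and the choice of which coefficients are selected are assumptions rather than details taken from the description.

```python
from collections import defaultdict

BITS_PER_VALUE = 6  # assumed: each selected coefficient lies in 0..63


def hash_key(features, selected):
    """Concatenate the selected coefficients bitwise into one integer key."""
    key = 0
    for index in selected:
        value = features[index]
        if not 0 <= value < (1 << BITS_PER_VALUE):
            raise ValueError("coefficient outside the assumed range")
        # Shift the key left and append the next coefficient's bits.
        key = (key << BITS_PER_VALUE) | value
    return key


def group_frames(frames, selected):
    """Map each hash key to the list of frame ids that share it."""
    groups = defaultdict(list)
    for frame_id, features in frames.items():
        groups[hash_key(features, selected)].append(frame_id)
    return dict(groups)
```

Two frames whose selected coefficients are identical produce the same key and therefore land in the same list, regardless of how their remaining coefficients differ.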
In step 60 of
In another embodiment, in step 60 of
In step 62 of
In the test video, twenty-six separate telestration input events were detected, indicating that there are at least twenty-six separate points in the test video where a replay clip may be provided. With use of the method described above and in
As shown in
In
Each of the 27,155 candidate marker frames 214 identified in the test video was grouped with visually similar candidate marker frames that are temporally scattered throughout the test video. See step 216 in
In step 218 of
A score is calculated for each group for each of the replay clips detected as having telestration. The collective result is determined for each group and thereafter, each group is ranked by proximity score in step 222 of
Thus, the collective result is used to identify the most likely groups indicating the seed marker frames 224 of the start and end of each replay from the 10,322 candidate marker groups. As shown in
As shown in
The above example demonstrates that the two-step process, in which input events such as telestration are first used to qualify marker frames and the qualified marker frames are then used to detect replay clips, provides better results than the direct use of input events, such as telestration, to search for replay clips in the test video.
In
Although not a seed marker frame, the first of the candidate marker frames 316 between replay clips 306 and 308 obtains a score based on its proximity to the false occurrence of a telestration event 312, and the candidate marker frame 316 occurring shortly before replay clip 310 obtains a score due to its close proximity to replay clip 310. However, based on the use of collective scoring within the group and not individual frame score, the collective proximity score of the group of candidate marker frames 316 will be less than that of the group of true marker frames 314. This is true despite telestration events not being present in each replay clip and despite telestration events occurring outside of replay clips. Provided the majority of true marker frames are adjacent to replay clips having telestration events, the true group of marker frames will be selected over other groups of candidate marker frames.
While the above examples primarily focus on automatic detection of replay clips in broadcast videos of football games, this is provided by way of example only. Replays in any type of sports broadcast can be automatically detected, as can any type of desired video segment that is demarcated in some manner within a video of any subject matter, which is not limited to broadcast sports games.
Further, a video processing electronic device configured to automatically detect frames in a video that demarcate a pre-determined type of video segment within the video is also contemplated. For instance, such a device may include at least one processing unit configured to identify candidate marker frames within a video as discussed above, group the candidate marker frames into a plurality of groups based on visual similarity as discussed above, compute a score for each of the groups based on temporal proximity of each of the candidate marker frames within the group to a detected event in the pre-determined type of video segment within the video as discussed above, and select at least one of the groups based on the score as marker frames that demarcate the pre-determined type of video segment. Such a device may also be configured to automatically locate the pre-determined type of video segments within the video by detecting the marker frames in the video and generate a highlight video containing only the pre-determined video segments.
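As a non-limiting sketch of how located marker frames might be turned into a list of segments for a highlight video, the following pairs the times of detected start markers with the times of detected end markers. The pairing rule (each start is matched to the first end marker that follows it, and starts falling inside an already paired clip are skipped) and the function names are assumptions; the description does not prescribe a particular rule.

```python
def pair_markers(start_times, end_times):
    """Pair each start marker with the first end marker that follows it."""
    clips = []
    ends = sorted(end_times)
    position = 0
    for start in sorted(start_times):
        if clips and start < clips[-1][1]:
            continue  # start falls inside the previous clip; skip it
        # Discard end markers that occur at or before this start marker.
        while position < len(ends) and ends[position] <= start:
            position += 1
        if position == len(ends):
            break  # no end marker remains for this or any later start
        clips.append((start, ends[position]))
        position += 1
    return clips


def highlight_duration(clips):
    """Total running time of a highlight video built from the clips."""
    return sum(end - start for start, end in clips)
```

Start markers with no following end marker are dropped rather than guessed at, so the resulting list contains only segments bounded on both sides by detected marker frames.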
The above referenced device and processing unit may include various processors, microprocessors, controllers, chips, disk drives, and like electronic components, modules, equipment, resources, servers, and the like for carrying out the above methods and may physically be provided on a circuit board or within another electronic device. It will be apparent to one of ordinary skill in the art that the processors, controllers, modules, and other components may be implemented as electronic components, software, hardware or a combination of hardware and software.
For example, at least one non-transitory computer readable storage medium having computer program instructions stored thereon that, when executed by at least one processor, cause the at least one processor to automatically detect frames in a video that demarcate a pre-determined type of video segment within the video is contemplated by the above described embodiments.
In the foregoing specification, specific embodiments have been described. However, one of ordinary skill in the art appreciates that various modifications and changes can be made without departing from the scope of the embodiments as set forth in the claims below. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of the embodiments.
The present application is a continuation of U.S. patent application Ser. No. 17/336,125, filed on Jun. 1, 2021, which is a continuation of U.S. patent application Ser. No. 14/302,229, filed Jun. 11, 2014, now U.S. Pat. No. 11,023,737, the contents of which are incorporated by reference herein.
Number | Date | Country | |
---|---|---|---|
20220366693 A1 | Nov 2022 | US |
Relation | Number | Date | Country |
---|---|---|---|
Parent | 17336125 | Jun 2021 | US |
Child | 17877557 | | US |
Parent | 14302229 | Jun 2014 | US |
Child | 17336125 | | US |