This invention relates broadly to methods and systems for replay generation for broadcast video, and to a data storage medium having stored thereon computer code means for instructing a computer to execute a method for generating replays of an event for broadcast video.
The growing interest in sporting excellence and patriotic passion, at both the international level and the domestic club level, has created new cultures and businesses in the sports domain. Sports video is widely distributed over various networks, and its mass appeal to large global audiences has led to increasing research attention on the sports domain in recent years.
Studies have been done on soccer video, and promising results have been reported. This research mainly focuses on semantic annotation, indexing, summarisation and retrieval for sports video. It does not address video editing and production tasks such as automatic replay generation and broadcast video generation.
Generating soccer highlights from a live game is a labour-intensive process. To begin with, multiple cameras are installed all around the sporting arena to maximise coverage. Each camera is often assigned a limited aspect of the game, such as close-ups on coaches and players to capture their emotions. A skilled operator therefore mans each camera, and their combined output adds value to the broadcast video, approximating the live atmosphere of the real thing. A broadcast director sits in the broadcast studio, monitoring the multiple video feeds and deciding which feed goes on-air. Of these cameras, a main camera, perched high above pitch level, provides a panoramic view of the game. The camera operator pans, tilts and zooms this camera to track the ball on the field and provide live game footage. This panoramic camera view often makes up the majority of the broadcast view. The broadcast director, however, has at his disposal a variety of state-of-the-art video editing tools to provide enhancement effects to the broadcast. These often come in the form of video overlays that include a textual score-board and game statistics, game-time, player substitutions, slow-motion replay, etc.
At moments in the game that he deems appropriate, the director may also decide to launch replays of the prior game action.
These replay clips are then immediately available for further editing and voice-over. Typically, these are then used during the half-time breaks for commentary and analysis. They may also be used to compile a sports summary for breaking news.
At least one embodiment of the present invention seeks to provide a system for automatic replay generation for video according to any of the embodiments described herein.
In accordance with a first aspect of the present invention there is provided a method for generating replays of an event for broadcast video, comprising the steps of receiving a video feed; automatically detecting said event from said video feed; generating a replay video of said event; and generating broadcast video incorporating said replay.
The replay video may be automatically generated.
The replay may be automatically incorporated into said broadcast video.
Said step of automatically detecting said event may comprise the steps of extracting a plurality of features from said video feed, and inputting said features into an event model to detect said event.
Said step of extracting the plurality of features may comprise the steps of analysing an audio track of said video feed, determining an audio keyword using said audio analysis and extracting the features using said audio keyword.
Said audio keyword may be determined from a set consisting of whistle, acclaim and noise.
Said step of extracting a plurality of features may further comprise the steps of analysing a visual track of said video feed, determining a position keyword using said visual analysis and extracting the features using said position keyword.
Said step of determining a position keyword may further comprise the steps of detecting one or more of a group consisting of field lines, a goal-mouth, and a centre circle using said visual analysis and determining said position keyword using one or more of said group.
Said step of extracting a plurality of features may further comprise the step of determining a ball trajectory keyword using said visual analysis and extracting the features using said ball trajectory keyword.
Said step of extracting a plurality of features may further comprise the step of determining a goal-mouth location keyword using said visual analysis and extracting the features using said goal-mouth location keyword.
Said step of extracting a plurality of features may further comprise the step of analysing the motion of said video feed, determining a motion activity keyword using said motion analysis and extracting the features using said motion activity keyword.
Said step of detecting said event may further comprise the step of constraining the keyword values within a moving window and/or synchronising the frequency of the keyword values for at least one of said position keyword, said ball trajectory keyword, said goal-mouth location keyword, said motion activity keyword and said audio keyword.
Said step of inputting said features into an event model may further comprise the step of classifying said event into one of a group consisting of an Attack event, a Foul event, an “Other” event, and a No event.
The method may further comprise the step of automatically detecting boundaries of said event in the video feed using at least one of the features.
The method may further comprise searching for changes in the at least one of the features for detecting the boundaries.
Said step of generating a replay video of said event may comprise the steps of concatenating views of said event from at least one camera, and generating a slow motion sequence incorporating said concatenated views.
Said step of generating the broadcast video may comprise the step of determining when to insert said replay video according to predetermined criteria.
Said replay video may be inserted instantly or after a delay based on said predetermined criteria.
Said predetermined criteria may depend on classifying said event into one of a group consisting of an Attack event, a Foul event, an “Other” event, and a No event.
Said video feed may be from a main camera.
In accordance with a second aspect of the present invention there is provided a system for generating replays of an event for broadcast video, the system comprising a receiver for receiving a video feed; a detector for automatically detecting said event from said video feed; a replay generator for generating a replay video of said event; and a broadcast generator for generating broadcast video incorporating said replay.
Said detector may extract a plurality of features from said video feed, and input said features into an event model to detect said event.
Said detector may analyse an audio track of said video feed, determine an audio keyword using said audio analysis and extract the features using said audio keyword.
Said audio keyword may be determined from a set consisting of whistle, acclaim and noise.
Said detector may analyse a visual track of said video feed, determine a position keyword using said visual analysis and extract the features using said position keyword.
Said detector may further detect one or more of a group consisting of field lines, a goal-mouth, and a centre circle using said visual analysis and determine said position keyword using one or more of said group.
Said detector may determine a ball trajectory keyword using said visual analysis and extract the features using said ball trajectory keyword.
Said detector may determine a goal-mouth location keyword using said visual analysis and extract the features using said goal-mouth location keyword.
Said detector may further analyse the motion of said video feed, determine a motion activity keyword using said motion analysis and extract the features using said motion activity keyword.
Said detector may constrain the keyword values within a moving window and/or synchronise the frequency of the keyword values for at least one of said position keyword, said ball trajectory keyword, said goal-mouth location keyword, said motion activity keyword and said audio keyword.
Said detector may classify said event into one of a group consisting of an Attack event, a Foul event, an “Other” event, and a No event.
Said detector may further detect boundaries of said event in the video feed using at least one of the features.
Said detector may search for changes in the at least one of the features for detecting the boundaries.
Said replay generator may concatenate views of said event from at least one camera, and generate a slow motion sequence incorporating said concatenated views.
Said broadcast generator may determine when to insert said replay video according to predetermined criteria.
Said broadcast generator may insert said replay video instantly or after a delay based on said predetermined criteria.
Said predetermined criteria may depend on classifying said event into one of a group consisting of an Attack event, a Foul event, an “Other” event, and a No event.
Said receiver may receive said video feed from a main camera.
In accordance with a third aspect of the present invention there is provided a data storage medium having stored thereon computer code means for instructing a computer to execute a method for generating replays of an event for broadcast video, the method comprising the steps of receiving a video feed; automatically detecting said event from said video feed; generating a replay video of said event; and generating broadcast video incorporating said replay.
One preferred form of the present invention will now be described with reference to the accompanying drawings.
With reference to the accompanying drawings, an example embodiment will now be described. The overall method comprises receiving a video feed, detecting an event (step 102), detecting the boundaries of the event (step 104), generating a replay (step 106) and inserting the replay into the broadcast video (step 108).
Each of the steps in the overall method is described in more detail below.
Event detection (referred to as step 102) is performed by extracting a plurality of features from the video feed (step 208) and inputting the features into event models (step 210).
The feature extraction (step 208) creates a set of mid-level keywords, summarised in Table 1, from the audio and visual tracks of the video feed.
Visual analysis (F1, F2, F3)
The visual analysis may involve 3 keywords: F1, F2, and F3.
The Position keyword (referred to as F1 in Table 1), which reflects the location of play on the soccer field, will now be discussed in more detail.
Video from the main camera may be used to identify the play region on the field. The raw video will only show a cropped version of the field as the main camera pans and zooms. In one embodiment, play regions spanning the entire field are identified. In order to identify the regions, the following three features may be used: (1) field-line locations, (2) goal-mouth location, and (3) centre circle location.
Field line detection is one factor in determining the F1 keyword and will now be described in detail. In the example embodiment, each field line detected in a frame is represented by its radial and angular co-ordinates:

(ρi, θi), i=1, 2, . . . , N (1)

where ρi and θi are the ith radial and angular co-ordinates respectively, and N is the total number of lines detected in the frame.
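The specification does not name the line detector, but Eq. (1) is exactly the (ρ, θ) line parameterisation produced by a Hough transform. The following is a minimal sketch under that assumption, using OpenCV; the function name `detect_field_lines` and the Canny/vote thresholds are illustrative choices, not values from the specification.

```python
import cv2
import numpy as np

def detect_field_lines(frame_bgr):
    """Return a list of (rho_i, theta_i) line parameters as in Eq. (1)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    # Field lines are bright edges on the pitch; thresholds are illustrative.
    edges = cv2.Canny(gray, 80, 160)
    # HoughLines returns (rho, theta) pairs directly: rho resolution 1 px,
    # theta resolution 1 degree, accumulator threshold 120 votes.
    lines = cv2.HoughLines(edges, 1, np.pi / 180, 120)
    if lines is None:
        return []
    return [(float(rho), float(theta)) for rho, theta in lines[:, 0]]
```

In practice such a detector would be restricted to the grass-coloured region of the frame so that stadium structures do not contribute spurious lines.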
Goal-Mouth Detection
The detection of the two goalposts may be used to identify the goal-mouth, which is another factor in determining the F1 keyword.
Centre circle detection is a further factor in determining the F1 keyword and is performed as follows.
There may be 4 unknown parameters {x0, y0, a², b²} in the horizontal ellipse expression (x−x0)²/a² + (y−y0)²/b² = 1, where (x0, y0) is the centre of the ellipse 1608, and a and b are the half major axis 1612 and half minor axis 1610 of the ellipse.
The y-axis locations of the two horizontal borderlines are yup and ydown, so we have:

y0 = (yup + ydown)/2 (2)

b = (ydown − yup)/2 (3)

x0 = ρi (4)

where ρi is the centre vertical line found in Eq. (1). The unknown parameter a² can be computed by the following transform to 1-D parameter space:

a² = (x − x0)² / (1 − (y − y0)²/b²) (5)
To improve efficiency, we may only need to evaluate (x, y) from region 2 (1604) to compute a².
The above steps may be applied to all possible borderline pairs, and the a² found with the largest accumulated value in parameter space is considered to be the solution. This method may be able to locate the ellipse even if it is cropped, provided the centre vertical line and the upper and lower borders are present. The detected centre circle may be represented by its central point (xe, ye). If no centre circle is detected, then xe = ye = −1.
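A minimal sketch of this fit is given below, assuming the equation numbering (2)-(5) reconstructed above: y0 and b follow from the two borderlines, x0 from the centre vertical line, and a² is voted on in a 1-D accumulator over candidate edge points from region 2. The binning parameters and the function name are illustrative.

```python
import numpy as np

def fit_centre_circle(edge_points, y_up, y_down, rho_centre,
                      a2_max=1e6, n_bins=512):
    """Estimate the ellipse parameters {x0, y0, a^2, b^2} of the centre circle.

    edge_points: iterable of (x, y) candidate edge pixels (region 2).
    y_up, y_down: y locations of the two horizontal borderlines.
    rho_centre:   x location of the centre vertical line from Eq. (1).
    """
    y0 = 0.5 * (y_up + y_down)      # Eq. (2): centre midway between borders
    b = 0.5 * (y_down - y_up)       # Eq. (3): half minor axis
    x0 = rho_centre                 # Eq. (4): centre vertical line gives x0
    acc = np.zeros(n_bins)          # 1-D parameter space for a^2
    for x, y in edge_points:
        denom = 1.0 - ((y - y0) / b) ** 2
        if denom <= 1e-6:
            continue                # point on or outside the borderlines
        a2 = (x - x0) ** 2 / denom  # Eq. (5): ellipse equation solved for a^2
        k = int(a2 / a2_max * n_bins)
        if 0 <= k < n_bins:
            acc[k] += 1             # accumulate a vote for this a^2
    k_best = int(np.argmax(acc))    # largest accumulated value wins
    a2_best = (k_best + 0.5) * a2_max / n_bins
    return x0, y0, a2_best, b * b
```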
Position Keyword Creation
In one embodiment the present invention adopts a Competition Network (CN) to detect the F1 keyword using the field-lines, the goal-mouth, and the centre circle. The CN consists of 15 dependent classifier nodes, each node representing one area of the field. The 15 nodes compete amongst each other, and the accumulated winning node may be identified as the chosen region of play.
The CN operates in the following manner: at time t, every detected field-line (ρit, θit), together with the goal-mouth (xgt, ygt) and centre circle (xet, yet), forms the feature vector vi(t), where i=1 . . . N and N is the number of lines detected at each time t. Specifically, vi(t) is

vi(t) = [ρit, θit, xgt, ygt, xet, yet]T, i=1, . . . , N (6)
The response rj(t) of each node is computed from the feature vectors vi(t) and the weight vector

wj = [wj1, wj2, . . . , wj6], j=1, . . . , 15 (8)

associated with the jth node, j=1 . . . 15 for the 15 regions. The set of winning nodes at time t, i.e. the nodes with the maximal instantaneous response, is

{j*(t)} = arg maxj rj(t) (9)
However, {j*(t)} sometimes is not a single entry. There are 3 possible scenarios for {j*(t)}, i.e., a single winning entry, a row-winning entry or a column-winning entry of the regions. This instantaneous winner list may not be the final output of the CN as it may not be robust. To improve classification performance, the accumulated response may be computed as
Rj(t) = Rj(t−1) + rj(t) − α·Dist(j, j*(t)) − β (10)

where Rj(t) is the accumulated response of node j, α is a positive scaling constant, β is the attenuation factor, and Dist(j, j*(t)) is the Euclidean distance from node j to the nearest instantaneous winning node within the list {j*(t)}. The term α·Dist(j, j*(t)) in Eq. (10) corresponds to the amount of attenuation introduced to Rj(t) based on the Euclidean distance of node j to the winning node; the further away, the larger the attenuation.
To compute the final output of the CN at time t, the maximal accumulated response may be found at node j#(t), where

j#(t) = arg maxj Rj(t) (11)

If Rj#(t)(t) exceeds a predetermined threshold, the region corresponding to node j#(t) is output as the position keyword F1.
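A minimal sketch of the CN update is given below. Eq. (7), the node response, was not reproduced in the text, so a best-match dot product between the weight vector and the feature vectors is assumed; the 3x5 grid layout of the 15 regions, the α and β values, and the class name are likewise assumptions.

```python
import numpy as np

class CompetitionNetwork:
    """Sketch of the 15-node Competition Network of Eqs. (6)-(11)."""

    def __init__(self, weights, grid_shape=(3, 5), alpha=0.1, beta=0.05):
        self.w = np.asarray(weights)    # 15 x 6 weight matrix, Eq. (8)
        self.R = np.zeros(len(self.w))  # accumulated responses Rj, Eq. (10)
        self.grid = grid_shape          # assumed layout of the 15 regions
        self.alpha, self.beta = alpha, beta

    def _coords(self, j):
        # Grid co-ordinates of node j, used for the Euclidean distance term.
        return np.array(divmod(j, self.grid[1]), dtype=float)

    def update(self, feats):
        """feats: N x 6 matrix of the vi(t) vectors of Eq. (6)."""
        feats = np.asarray(feats, dtype=float)
        # Assumed stand-in for Eq. (7): best match over the N feature vectors.
        r = (self.w @ feats.T).max(axis=1)
        winners = np.flatnonzero(r == r.max())        # {j*(t)}, Eq. (9)
        for j in range(len(self.R)):
            dist = min(np.linalg.norm(self._coords(j) - self._coords(jw))
                       for jw in winners)             # Dist(j, j*(t))
            # Eq. (10): accumulate response, attenuated by distance and beta.
            self.R[j] += r[j] - self.alpha * dist - self.beta
        j_hash = int(np.argmax(self.R))               # Eq. (11)
        return j_hash, self.R[j_hash]
```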
The trajectory of the ball may be useful for recognising some events. For example, the relative position between the ball and the goal-mouth can indicate events such as “scoring” and “shooting”. The ball trajectory is obtained using a trajectory-based ball detection and tracking algorithm. Unlike object-based algorithms, this algorithm does not evaluate whether a single object is a ball. Instead, it uses a Kalman filter to evaluate whether a candidate trajectory is a ball trajectory. The ball trajectory (referred to as F2 in Table 1) may be a two-dimensional vector stream recording the 2D co-ordinates of the ball in each frame.
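The specification does not detail the filter, so the following is a sketch of one plausible trajectory check: a constant-velocity Kalman filter is run along the candidate trajectory, and the trajectory is accepted if its mean prediction residual stays small. The motion model, noise levels and threshold are all assumptions.

```python
import numpy as np

def is_ball_trajectory(points, max_mean_residual=8.0):
    """Accept a candidate trajectory if a constant-velocity Kalman filter
    tracks it with a small mean prediction residual (threshold assumed)."""
    if len(points) < 2:
        return False
    F = np.array([[1, 0, 1, 0],     # state transition for [x, y, vx, vy]
                  [0, 1, 0, 1],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)
    H = np.array([[1, 0, 0, 0],     # only the position is observed
                  [0, 1, 0, 0]], dtype=float)
    Q, R = np.eye(4) * 0.1, np.eye(2) * 4.0   # assumed noise covariances
    x = np.array([*points[0], 0.0, 0.0])
    P = np.eye(4) * 100.0
    residuals = []
    for z in map(np.asarray, points[1:]):
        x, P = F @ x, F @ P @ F.T + Q                  # predict
        innov = z - H @ x                              # prediction residual
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)                 # Kalman gain
        x, P = x + K @ innov, (np.eye(4) - K @ H) @ P  # update
        residuals.append(np.linalg.norm(innov))
    return float(np.mean(residuals)) < max_mean_residual
```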
Besides being used in the position keyword model, the goal-mouth location (referred to as F3 in Table 1) may itself be an important indicator of an event. A goal-mouth can be formed from the two detected goalposts, and may be expressed by its four vertices: left-top vertex (xlt, ylt), left-bottom vertex (xlb, ylb), right-top vertex (xrt, yrt), and right-bottom vertex (xrb, yrb). The F3 keyword is thus an R8 vector stream.
In a soccer game, as the main camera generally follows the movement of the ball, the camera motion (referred to as F4 in Table 1) thus provides an important cue to detect events. In one embodiment the present invention calculates the camera motion keyword using motion vector field information available from the compressed video format.
In more detail, the motion vector field of each frame is analysed to form a motion activity measure. A grey-level statistic may be computed for each macro-block (MB) as the entropy

−Σk Pk log2 Pk (12)

where Pk is the probability of the kth grey-level in the MB. Also, the average motion magnitude pm is computed as:

pm = (1/M) Σi |mi| (13)

where mi is the ith motion vector and M is the number of motion vectors in the frame. Thus a motion activity vector [pz, pp, pt, pm]T is formed as a measure of the motion activity.
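The definitions of the components pz, pp and pt were not recoverable from the text, so the sketch below computes only the average motion magnitude pm of Eq. (13), a zero-motion ratio as one plausible reading of pz, and a grey-level entropy per Eq. (12); those labels and the quantisation are assumptions.

```python
import numpy as np

def motion_activity(motion_vectors, grey_levels=16):
    """Compute motion-activity statistics from a frame's motion vectors.

    motion_vectors: M x 2 array of macro-block motion vectors taken
    from the compressed video stream.
    """
    mv = np.asarray(motion_vectors, dtype=float)
    mag = np.linalg.norm(mv, axis=1)
    pm = float(mag.mean())                 # Eq. (13): average magnitude
    pz = float(np.mean(mag < 1e-6))        # assumed: share of zero vectors
    # Quantise magnitudes into grey-levels and take the entropy of Eq. (12).
    hist, _ = np.histogram(mag, bins=grey_levels)
    Pk = hist / max(hist.sum(), 1)
    Pk = Pk[Pk > 0]
    entropy = float(-(Pk * np.log2(Pk)).sum())
    return {"pm": pm, "pz": pz, "entropy": entropy}
```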
In one embodiment the purpose of the audio keyword (referred to as F5 in Table 1) may be to label each audio frame with a predefined class. As an example, 3 classes can be defined: “whistle”, “acclaim” and “noise”. In one embodiment a Support Vector Machine (SVM) with the following Gaussian (RBF) kernel function is used to classify the audio:

K(x, y) = exp(−‖x − y‖²/(2σ²)) (14)
As the SVM may be a two-class classifier, it may be modified and used as “one-against-all” for our three-class problem. The input audio feature to the SVM may be found by exhaustive search from amongst the following audio features tested: Mel Frequency Cepstral Coefficients (MFCC), Linear Prediction Coefficients (LPC), LPC Cepstral (LPCC), Short Time Energy (STE), Spectral Power (SP), and Zero Crossing Rate (ZCR). In one embodiment a combination of LPCC subset and MFCC subset features is employed.
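A minimal sketch of the one-against-all scheme with the RBF kernel of Eq. (14) is shown below, using scikit-learn as an assumed implementation vehicle. Feature extraction (the LPCC/MFCC subsets) is presumed done already; X is an (n_frames, n_features) matrix and y holds labels 0-2 for whistle, acclaim and noise. The gamma and C values are illustrative.

```python
import numpy as np
from sklearn import svm

def train_audio_classifiers(X, y, gamma=0.5, C=10.0):
    """Train one RBF-kernel SVM per audio class (one-against-all)."""
    y = np.asarray(y)
    classifiers = []
    for k in range(3):                        # whistle, acclaim, noise
        clf = svm.SVC(kernel="rbf", gamma=gamma, C=C)
        clf.fit(X, (y == k).astype(int))      # class k vs. all others
        classifiers.append(clf)
    return classifiers

def classify_frame(classifiers, x):
    """Label a frame with the class whose SVM gives the largest margin."""
    scores = [clf.decision_function(np.asarray(x).reshape(1, -1))[0]
              for clf in classifiers]
    return int(np.argmax(scores))
```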
One possible function of post-processing may be to eliminate sudden errors in the created keywords. The keywords are coarse semantic representations, so a keyword value should not change too fast. Any sudden change in the keyword sequences can therefore be considered an error, and can be eliminated using majority voting within a sliding window of length wl and step-size ws (frames). For each keyword, the sliding window has a different wl and ws, defined empirically.
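A sketch of this smoothing follows; replacing each window by its majority label is one straightforward reading of the scheme, and the way overlapping windows are combined is an assumption.

```python
import numpy as np

def majority_vote_smooth(labels, wl, ws):
    """Suppress sudden flips in an integer keyword sequence by majority
    voting inside a sliding window of length wl moved in steps of ws."""
    labels = np.asarray(labels).copy()
    for start in range(0, len(labels) - wl + 1, ws):
        window = labels[start:start + wl]
        majority = int(np.bincount(window).argmax())
        labels[start:start + wl] = majority
    return labels
```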
Another function of post-processing may be to synchronise keywords from different domains. Audio labels are created over a smaller sliding window (20 ms in our system) than the video frame duration (at 25 fps, each video frame lasts 40 ms). Since the audio label rate is twice the video frame rate, it is easy to synchronise them.
After post-processing, the keywords are used by the event models (step 210) to detect events.
The event models (referred to as step 210) address two issues:
1. defining general criteria for selecting which events to replay; and
2. achieving acceptable event detection accuracy from the video taken by the main camera, as fewer cues are available compared with event detection from broadcast video.
1) Selection of Replay Event
To find general criteria for the selection of events for replay, a quantitative study of 143 replays in several FIFA World-Cup 2002 games was conducted. It may be shown that all of the events replayed belong to the three types in Table 2, and our system will generate replays for these events (the types are examples only, and a person skilled in the art could generate an appropriate set of event types for a given application).
The labelled event Attack consists of a scoring or just-missing shot at goal. The event Foul consists of a referee decision (referee whistle), and Other consists of injury events and miscellaneous. If none of the above events is detected, the output of the classifier may default to “no-event”.
2) Event Moment Detection
Events may be detected based on the created keyword sequences. In broadcast video the transitions between types of shot/view may be closely related to the semantic state of the game, so a Hidden Markov Model (HMM) classifier, which is good at discovering temporal patterns, may be applicable. However, when applying an HMM to the keyword sequences created in the above section, we noticed that there is little temporal pattern in the keyword sequences, and this makes the HMM method less desirable. Instead we find that certain feature patterns appear in those keyword sequences at, and only at, a certain moment during the event. We name such a moment with a distinguishing feature pattern an “event moment”, e.g. the moment of hearing the whistle in “Foul”, or the moment of very close distance between goal-mouth and ball in “Attack”. By detecting this moment it may be possible to detect the occurrence of the event.
In more detail, to classify the three types of events, 3 classifiers are trained to detect the event moments for the associated events. To make the classifiers robust, each classifier uses a different set of mid-level keywords as input.
The chosen keyword streams are synchronised and integrated into a multi-dimensional keyword vector stream from which the event moment is to be detected. To avoid hand-crafted heuristics, a statistical classifier is employed to learn the decision boundary, e.g. how small the ball-to-goal-mouth distance is in an “Attack” event, or how slow the motion is during a “Foul” event.
The output of each classifier is “Attack”/no-event, “Foul”/no-event and “Other”/no-event respectively. The classifier used is the SVM with the Gaussian kernel (radial basis function (RBF)) in Eq (14).
To train the SVM classifier, event and non-event segments are first manually identified, and the mid-level representations are then created. To generate the training data, the specific event moments within the events are manually tagged and used as positive examples for training the classifier. Sequences from the rest of the clips are used as negative training samples. In the detection process, the entire keyword sequences from the test video are fed to the SVM classifier and the segments with the same statistical pattern as an event moment are identified. By applying post-processing, small fluctuations in the SVM classification results are eliminated to avoid duplicated detection of the event moment from the same event.
Event boundary detection (referred to as step 104) will now be described in more detail.
There are many factors affecting the human perception or understanding of the duration of an event. One factor is time, i.e. events usually occupy only a certain temporal duration. Another factor is the position where the event happens. Most events happen in a certain position, hence scenes from a previous location may not be of much interest to the audience. However, this assumption may not be true for fast-changing events.
A first embodiment of event boundary detection searches for view changes around the detected event moment.
Detecting a view change (referred to as step 302) may be performed by searching for changes in at least one of the extracted features. A backward search from the event moment detects the starting view change tes.
A forward search is applied to detect the ending view change tee (referred to as step 308).
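A minimal sketch of this boundary search is given below, using the position keyword (F1) as the feature whose changes mark view changes; the stability test and its length are assumptions, as is the choice of feature.

```python
def detect_event_boundaries(position_kw, t_moment, min_stable=12):
    """Search backward and forward from the event moment for the nearest
    feature changes, returning the event boundaries (t_es, t_ee)."""
    def is_change(t):
        # A change that persists for min_stable frames counts as a boundary.
        if t <= 0 or t + min_stable > len(position_kw):
            return False
        return (position_kw[t - 1] != position_kw[t] and
                all(k == position_kw[t]
                    for k in position_kw[t:t + min_stable]))

    t_es = t_moment
    while t_es > 0 and not is_change(t_es):          # backward search
        t_es -= 1
    t_ee = t_moment
    while t_ee < len(position_kw) - 1 and not is_change(t_ee):
        t_ee += 1                                    # forward search (step 308)
    return t_es, t_ee
```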
Replay generation (referred to as step 106) may comprise concatenating views of the event from at least one camera and generating a slow motion sequence incorporating the concatenated views.
Replay insertion (referred to as step 108) will now be described in more detail.
Referring to the drawings, a first method is used to insert replays related to attack events into the video.
Since foul events may be different from attack events in sports games, we use a different method to insert replays related to foul events into the video.
The parameters and the learning process used for replay insertion (referred to as step 812) may be adapted for a given application.
In a second embodiment of replay insertion, Table 3 shows the results of an example quantitative study done on a video database.
It is found from this example that all the replays belong to two classes: instant replay and delayed replay. Most replays are instant replays that are inserted almost immediately following the event, if the subsequent segments are deemed uninteresting. The other replay class, delayed replay, occurs for several reasons.
The event detection result segments the game into a sequential “event”/“no-event” structure, and replays are inserted into the no-event segments.
The replay starting time trs and ending time tre may be computed as:

trs = tee + D3 (15)

tre = trs + (tee − tes)·v (16)

where tes and tee are the starting and ending times of the event determined previously, D3 is set to 1 second in accordance with convention, and v is a factor defining how slowly the replay is displayed compared with real-time.
Then the system may examine whether the time slot from trs to tre in the subsequent no-event segment meets one of a number of predetermined conditions, for example a no-motion condition.
If so, an instant replay is inserted.
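A sketch of the instant-replay scheduling of Eqs. (15)-(16) follows. The predicate on a candidate slot is left as a parameter because the full list of conditions was not recoverable; the no-motion test mentioned for delayed replays is one example. Frame-based times and the 25 fps assumption are illustrative.

```python
def schedule_instant_replay(t_es, t_ee, no_event_slots,
                            D3=25, v=3.0, slot_ok=lambda slot: True):
    """Compute the replay window of Eqs. (15)-(16) and try to place it.

    t_es, t_ee:     event boundaries in frames.
    D3:             1 second (25 frames at 25 fps) per convention.
    v:              slow-motion factor.
    no_event_slots: list of (start, end) no-event segments.
    slot_ok:        assumed predicate on a candidate slot, e.g. no motion.
    """
    t_rs = t_ee + D3                       # Eq. (15)
    t_re = t_rs + int((t_ee - t_es) * v)   # Eq. (16)
    for start, end in no_event_slots:
        if start <= t_rs and t_re <= end and slot_ok((t_rs, t_re)):
            return t_rs, t_re              # insert an instant replay here
    return None                            # buffer event for a delayed replay
```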
Delayed replays may be inserted for MP, FI or IE events. The events may be buffered and a suitable time slot found to insert the delayed replays. In addition, to identify whether an event is an IE (important) event, an importance measure I is given to the event based on the duration of its event moment, as generally the longer the event moment, the more important the event:
I = tte − tts (17)

where tts and tte are the starting and ending times of the event moment.
Events with I > T4 are deemed important events. In the example embodiment, T4 is set to 80 frames so that only about 5% of detected events become important events. This ratio is consistent with broadcast video identification of important events. The duration of the delayed replay is the same as that of the instant replay in the example embodiment. The system will search in subsequent no-event segments for a time slot of length tre − trs that meets the condition of no motion.
If such a time slot is found, a delayed replay is inserted. This search continues until a suitable time slot is found for a FI event, or two delayed replays have been inserted for an IE event, or a more important IE event occurs.
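A sketch of the delayed-replay handling is shown below. The MP/FI/IE type labels come from the text but their definitions were not recoverable, so the event dictionary keys, the dict-based interface and the termination details are assumptions; Eq. (17) and the T4 = 80 threshold are from the text.

```python
def schedule_delayed_replay(event, no_event_slots, slot_len, T4=80):
    """Try to place a delayed replay for a buffered MP/FI/IE event.

    event: dict with the event-moment times 't_ts' and 't_te' and a
    'type' label; slot_len is the replay length tre - trs in frames.
    """
    importance = event["t_te"] - event["t_ts"]   # Eq. (17)
    is_important = importance > T4               # roughly the top 5% of events
    if not is_important and event.get("type") not in ("MP", "FI"):
        return None                              # no delayed replay warranted
    for start, end in no_event_slots:            # search later no-event segments
        if end - start >= slot_len:              # slot long enough (the
            return start, start + slot_len       # no-motion test is elided)
    return None
```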
In the following, results obtained using example embodiments will be described.
Position Keyword
As described above, suitable values of wj in Eq. (8) may be chosen such that the CN output is able to update in approximately 0.5 seconds if a new area is captured by the main camera.
To evaluate the performance of the position keyword creation, a video database with 7800 frames (10 minutes of video) was manually labelled. The result of keyword generation for this database is compared with the labels, and the accuracy of the position keyword is listed in Table 4. It is noted that the detection accuracy for field area 4 is low compared with the other labels. This may be attributable to the characteristics of field area 4.
Ball Trajectory
The ball trajectory test is conducted on 15 sequences (176 seconds of recording). These sequences consist of various shots with different durations, view types and ball appearances. Table 5 shows the performance.
Audio Keyword
Three audio classes are defined: “Acclaim”, “Whistle” and “Noise”. A 30-minute soccer audio database is used to evaluate the accuracy of the audio keyword generation module. In this experiment, a 50%/50% training/testing split is used. The performance of the audio features selected by exhaustive search is compared with existing techniques where feature selection is done using domain knowledge.
Event Detection
To examine the performance of our system on both main camera video and broadcast video, 50 minutes of unedited video from the main camera recording of an S-League game and 4.5 hours of FIFA World-Cup 2002 broadcast video are used in the experiment. Note that the broadcast video database 1100 is an edited recording, i.e. it has additional shot information besides the main camera capture 1102.
The “boundary decision accuracy (BDA)” in Table 7 and Table 8 is computed from τdb and τmb, where τdb and τmb are the automatically detected event boundary and the manually labelled event boundary, respectively. It is observed that the boundary decision accuracy for the event “Other” is lower compared with the other two events. This is because the “Other” event class is mainly made up of injuries or sudden events. The cameraman usually continues moving the camera to capture the ball until the game is stopped, e.g. the ball is kicked out of the touch-line so that the injured player can be treated. Then the camera is focused on the injured players. This results in either missing the exact event moment in the main camera footage or an unpredictable duration of camera movement. These factors affect the event moment detection and hence the boundary decision accuracy.
As both the automatically generated video and the broadcast video recorded from the broadcast TV programme are available in the example embodiment, one can use the latter as the ground truth to evaluate the performance of the replay generation. The following table compares the automatic replay generation by an example embodiment with the actual broadcast video replays.
The term “same” in Table 9 refers to replays that are inserted in both the automatically generated video and the broadcast video. From Table 9 it can be observed that, though the replay displays the main camera capture at 3 times slower than real-time speed (v=3.0 in Eq. (16)), the durations of the replay segments generated are shorter than those of the replays in the broadcast video. This may be mainly because the replays in broadcast video use not only the main camera capture but also the sub-camera captures. However, the audience may prefer shorter replays, as there will be less view interruption in the generated video.
Another result from Table 9 is that the example embodiment generates significantly more replays than the human broadcaster's selection. One reason for that result may be that an automated system will “objectively” generate replays whenever predefined conditions are met, whereas human-generated replays are inherently more subjective. Also, the strict time limit for generating a replay means that a good replay segment selection might be missed due to the labour-intensiveness of manual replay generation. Hence, with the assistance of an automatic system, more replays will be generated. The accuracy of the automated detection algorithms may also vary and may be optimised in different embodiments, e.g. utilising machine learning, supervised learning etc.
The present invention may be implemented in hardware and/or software by a person skilled in the art. For example, the method may be implemented as computer code executing on a general purpose computer, as set out in the third aspect above.
To those skilled in the art to which the invention relates, many changes in construction and widely differing embodiments and applications of the invention will suggest themselves without departing from the scope of the invention as defined in the appended claims. The disclosures and the descriptions herein are purely illustrative and are not intended to be in any sense limiting.
Filing Document | Filing Date | Country | Kind | 371(c) Date
---|---|---|---|---
PCT/SG2005/000248 | 7/22/2005 | WO | 00 | 6/18/2007

Number | Date | Country
---|---|---
60590732 | Jul 2004 | US