The invention relates to a method of annotating video footage, a data store for storing annotated video footage, a method of generation of a personalised video summary, and a system for annotating video footage and a system for generation of a personalised video summary.
Video footage, particularly sports footage, often includes periods of relative inactivity followed by more interesting or high-activity periods. Live broadcasts of such footage often include commentary and/or replays of the latter, as these are of most interest to the viewer. It is also common for later broadcasts of the footage to provide a video summary, which is often a combination of the most interesting replays. Typically a human production director manually chooses which portions of footage to use for replays and which replays to use in a video summary.
It is known in the art to automatically analyse video footage to attempt to replicate the decision process of the human production director. Generally, prior art methods attempt to identify “events” within the footage, such as a goal in football, and determine the “boundaries” of the event that will form the replay. An index of the footage may be formed that identifies the time of each event and the boundaries for the replay. In live broadcasts the index may be used for automatically inserting a replay into the broadcast, while in later broadcasts the index may be used to generate a video summary. The video summary generated in this way is therefore a summary of the events within the footage.
For example, in a paper by A. Ekin and A. M. Tekalp, entitled “Automatic Soccer Video Analysis and Summarization”, published in Symp. Electronic Imaging: Science and Technology: Storage and Retrieval for Image and Video Databases IV, IS&T/SPIE03, January 2003, CA, a soccer video analysis and summarization framework is disclosed using cinematic and object-based features, such as dominant colour region detection, robust shot detection, shot view classification, and some higher-level detection such as goal, referee and penalty-box detection, and replay detection. However, this does not allow identification of goals by a specific player of a team.
N. Babaguchi, Y. Kawai and T. Kitahashi, in a paper entitled “Event Based Indexing of Broadcasted Sports Video by Inter-modal Collaboration,” published in IEEE Trans. Multimedia, vol. 4, no. 1, pp. 68-75, March 2002, disclose a semantic analysis method for broadcast sports video based on video/audio and closed caption (CC) text. The closed caption text is created by manual transcription of the commentator's speech for the sports game. The video, audio and CC are processed to detect highlights, segment the story, and extract the play and players.
The CC from the commentators' speech is not structured and, on average, one minute of video contains as many as 10 sentences, or about 100 words. Due to the nature of commentator language it is also difficult to parse these sentences and extract information; a special technique, known as Natural Language Parsing (NLP), is required to extract information from the text. Techniques to parse unstructured text are highly computationally intensive and provide only limited accuracy and effectiveness. Additionally, speech transcription of CC text results in a delay in reporting live sports events.
In a further example, in a paper by DongQing Zhang and Shih-Fu Chang, entitled “Event Detection in Baseball Video Using Superimposed Caption Recognition”, published in ACM Multimedia 2002, Juan Les Pins, France, Dec. 1-6, 2002 (ACM MM 2002), a system for baseball video event detection and summarization using superimposed caption text detection and recognition, called video OCR, is disclosed. The system detects different types of events in baseball video, including scoring and the last pitch of each batter. The method is good for detecting game structure and certain events. However, because of the difficulty of achieving high accuracy in video OCR, its use for semantic analysis of sports video has been limited.
Correctly identifying who, that is which sportsperson, is involved in an event has proven a particularly difficult problem to solve. Other useful information about each event includes when it occurred, what type of event it was and where it occurred. Prior art methods of indexing and classification have failed to comprehensively characterise each event in the footage.
U.S. Pat. No. 6,751,776 discloses an automatic video content summarization system that is able to create a personalised multimedia summary based on a user-specified theme. Natural language processing (NLP) and video analysis techniques are used to extract important keywords from the closed caption (CC) text as well as prominent visual features from the video footage. A Bayesian statistical framework is used, which naturally integrates the user theme, the heuristics and the theme-relevant video characteristics within a unified platform. However, the use of NLP may be highly computationally intensive and may only provide limited accuracy and effectiveness because of the limitations of NLP technologies.
A need therefore exists to address at least one of the above problems.
In accordance with a first aspect of the invention there is provided a method of annotating footage that includes a structured text broadcast stream, a video stream and an audio stream, comprising the steps of:
extracting, directly or indirectly, one or more keywords and/or features from at least said structured text broadcast stream;
temporally annotating said footage with said keywords and/or features; and
analysing temporally adjacent annotated keywords and/or features to determine information about one or more events within said footage.
Said step of analysing temporally adjacent annotated keywords and/or features may comprise:
detecting one or more events in said video footage according to where at least one of said keywords and/or features meets one or more predetermined criteria, and
determining information about each detected event from annotated keywords and/or features temporally adjacent to each detected event.
Said step of detecting one or more events may comprise the step of comparing at least one keyword and/or feature extracted from the structured text broadcast stream to one or more predetermined criteria.
Said step of determining information may comprise the step of indexing each of said events using a play keyword extracted from said structured text broadcast stream.
Said step of indexing may further comprise the step of indexing each of said events using a time stamp extracted from said structured text broadcast stream.
Said step of indexing may further comprise the step of refining the indexing of each of said events using a video keyword extracted from said video stream.
Said step of indexing may further comprise the step of refining the indexing of each of said events using an audio keyword extracted from said audio stream.
Said video footage may relate to at least one sportsperson playing a sport, and said step of extracting may further comprise the step of extracting which sportsperson features in each event from the structured text broadcast stream, and said step of annotating may comprise annotating said footage with said sportsperson.
Said step of extracting may further comprise the step of extracting when each event occurred, what happened in each event and where each event happened from at least one of said streams, and wherein said step of annotating may further comprise annotating said footage according to when each event occurred, what happened in each event and where each event happened.
Said structured text broadcast may be sports webcasting text (SWT).
Said keywords and/or features may comprise one or more keyword(s), wherein each keyword may be determined from one or more low level features, and wherein each low level feature may be extracted from said footage.
Said one or more keyword(s) may comprise a play keyword extracted from said structured text broadcast stream, a video keyword extracted from said video stream and an audio keyword extracted from said audio stream.
Said event may comprise a state of increased action within the footage chosen from one or more of the following list: goal, free-kick, corner kick, red-card, yellow-card, where the footage is football footage.
Said one or more predetermined criteria may comprise said play keyword matching one of said states of increased action.
In accordance with a second aspect of the invention there is provided a data store for storing video footage, characterised in that in use said video footage is annotated according to the method of annotating footage described above.
In accordance with a third aspect of the invention there is provided a method of generation of a personalised video summary comprising the steps of:
storing video footage including one or more events, wherein each of said events is classified according to the method of annotating footage above;
receiving preferences for said personalised video summary;
selecting events to include from said stored video footage where the classification of a given event satisfies said preferences; and
generating said personalised video summary from said selected events.
In accordance with a fourth aspect of the invention there is provided a system for annotating footage comprising: a data store storing said footage and a computer program; and
a processor configured to execute said computer program to carry out the steps of the method of annotating footage above.
In accordance with a fifth aspect of the invention there is provided a system for generation of a personalised video summary comprising
a data store storing said footage and a computer program;
a processor configured to execute said computer program to carry out the steps of the method of generation of a personalised video summary above.
Example embodiments of the invention will now be described with reference to the drawings.
Video footage processing, particularly automatic video processing, requires some knowledge of the content of the footage. For example, in order to generate a video summary of events within the footage, the original footage needs some form of annotation. In this way a personalised video summary may be generated that only includes events that meet one or more criteria.
An example application is annotating sports video. In sports video typical annotations may include the time of an event, the player or team involved in the event and the nature or type of event. The venue of the event may also be used as an annotation. For the following embodiments football (soccer) will be used as one example, although it will be appreciated that other embodiments are not so restricted and may cover annotated video generally.
A user of sports video will typically have a preference for given players or teams and/or a particular nature or type of event. Accordingly, once the footage is annotated, events that meet the preferences may easily be selected from it to generate a personalised video summary. The summary may include video, audio and/or structured text broadcast (STB) streams.
The methods shown in
User preferences 303 are also received at the input 301. Video generation processor 312 receives the preferences and scans the database for events with annotations that satisfy the preferences. The summary video is provided at the output 314, or may be stored in the permanent data store 310 for later retrieval.
Each processor may take the form of a separate programmed digital signal processor (DSP) or may be combined into a single processor or computer.
In an example embodiment the content data is received (step 100 in
In order to facilitate annotation, a framework is necessary. In an example embodiment each of the streams of the footage is analysed and “keywords” are extracted (step 102 in
The video features, for example, have two axes: temporal and spatial. The former refers to variations along time; the latter refers to variations along the spatial dimensions, such as horizontal and vertical position.
For example the STB stream 400 is subjected to STB analysis 410, including parsing the text to extract key event information such as who, what, where and when. Then one or more “play keywords” 416 (PKW) are extracted from the STB stream. The keywords are defined depending on the type of footage and the requirements of annotation.
The video stream 402 is subjected to video analysis 406 including video structural parsing into play, replay and commercial video segments. Then one or more “video keywords” 412 (VKW) are extracted from the video stream and/or object detection is carried out.
The audio stream 404 is subjected to audio analysis 408, which includes low-level audio analysis. Then one or more “audio keywords” 414 (AKW) are extracted from the audio stream.
Once the keywords are extracted they may be aligned in time across the streams 418. Player, team and event detection and association 419 then takes place using the keywords. Here events refer to actions that take place during sports games. For instance, events for a soccer game include goal, free kick, corner kick, red-card, yellow-card, etc.; events for a tennis game include serve, deuce, etc. Each replay may then be classified 420, for example by identifying who features in each event, when each event occurred, what happened and where each event occurred. The semantically annotated video footage may then be stored in a database 422.
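By way of illustration only, the following Python sketch shows one possible arrangement of time-stamped keywords, their alignment into common time bins (cf. alignment 418) and a very simple detection and association step (cf. 419). The data structures, field names, event labels and the 60-second bin are assumptions made for this sketch and are not prescribed by the embodiments.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Event labels for football footage; the names are illustrative only.
EVENT_LABELS = {"goal", "free_kick", "corner_kick", "red_card", "yellow_card"}

@dataclass
class Keyword:
    """A time-stamped keyword extracted from one of the three streams."""
    time_s: float                     # offset from the start of the footage, in seconds
    stream: str                       # "STB" (play keyword), "video" (VKW) or "audio" (AKW)
    label: str                        # e.g. "goal", "replay", "excited_commentary"
    attrs: Dict[str, str] = field(default_factory=dict)  # e.g. {"player": "..."}

def align_keywords(keywords: List[Keyword], bin_s: float = 60.0) -> Dict[int, List[Keyword]]:
    """Group keywords from all streams into common time bins (cf. 418)."""
    bins: Dict[int, List[Keyword]] = {}
    for kw in sorted(keywords, key=lambda k: k.time_s):
        bins.setdefault(int(kw.time_s // bin_s), []).append(kw)
    return bins

def annotate_events(bins: Dict[int, List[Keyword]]) -> List[dict]:
    """Stand-in for detection and association (cf. 419): a bin containing a play
    keyword that matches an event label becomes an event, annotated with
    who/when/what taken from the keywords around it."""
    events = []
    for _, kws in sorted(bins.items()):
        for pkw in (k for k in kws if k.stream == "STB" and k.label in EVENT_LABELS):
            events.append({
                "when": pkw.time_s,
                "what": pkw.label,
                "who": pkw.attrs.get("player", "unknown"),
                "supporting": [k.label for k in kws if k is not pkw],
            })
    return events
```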
STB allows parsing of information that is easier, less computationally intensive and more effective compared to parsing transcriptions of commentary. Normal commentary may have long sentences, may be unstructured and may involve opinions and/or informal language. All of this combines to make it very difficult to reliably extract meaningful information about events from the commentary. Prior art Natural Language Parsing (NLP) techniques have been used to parse such transcribed commentary, but this has proven highly computationally intensive and provides only limited accuracy and effectiveness.
An example of an STB stream is Sports Webcasting Text (SWT). Sports game annotators manually create SWT in real time, and the SWT stream is broadcast on the Internet. SWT is structured text that describes all the actions of a sports game with relatively low delay. This allows extraction of information such as the time of an event, the player or team involved in the event and the nature or type of event. Typically SWT provides information on the action and associated players/teams approximately every minute during a live game.
SWT follows an established structure with regular time stamps.
The PKW extracted from the SWT may be used to identify events and may be used to classify each event.
In order to analyse the SWT and generate the PKW over the whole footage (416 in
The PKW may consist of a static and dynamic component. In
The dynamic component includes parsing over each ADT unit. Each ADT is parsed into the following four items: Game-Time-Stamp 606; Player/Team-ID 608; Event-ID 610; and Score-value 612. This is followed by an extraction performed on the PKW over a window of fixed length to extract the true sports event type and the associated player. Parsed ADTs within a time window ADTw are processed to extract player keywords and associated event keywords. For football (soccer) an example window of 2 minutes may be used, since each event typically has a duration longer than 1 minute.
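A minimal sketch of this dynamic parsing step follows. The ADT line format, the regular expression and the field names are hypothetical; only the parse into the four items and the fixed-length window reflect the description above.

```python
import re
from typing import List, Optional

# Hypothetical ADT line such as "67' | Beckham (Man Utd) | goal | 2-1";
# the real SWT/ADT syntax is not reproduced here.
ADT_PATTERN = re.compile(
    r"(?P<minute>\d+)'\s*\|\s*(?P<player>[^(|]+)\((?P<team>[^)]+)\)"
    r"\s*\|\s*(?P<event>[\w\s-]+?)\s*\|\s*(?P<score>\d+-\d+)"
)

def parse_adt(line: str) -> Optional[dict]:
    """Parse one ADT unit into Game-Time-Stamp, Player/Team-ID, Event-ID and Score-value."""
    m = ADT_PATTERN.match(line)
    if not m:
        return None
    return {
        "game_time_min": int(m.group("minute")),
        "player_id": m.group("player").strip(),
        "team_id": m.group("team").strip(),
        "event_id": m.group("event").strip().lower(),
        "score_value": m.group("score"),
    }

def window_adts(adts: List[dict], centre_min: int, window_min: int = 2) -> List[dict]:
    """Collect parsed ADTs inside a fixed-length window (2 minutes for football)."""
    half = window_min / 2.0
    return [a for a in adts if abs(a["game_time_min"] - centre_min) <= half]
```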
As shown in
In sporting footage, events may be inter-dependent rather than isolated. As seen in
Depending on the level of event granularity or temporal resolution required, the VKW may be used to further refine the indexed location and the indexed boundaries in the footage used to represent the event. For example the event may be detected using just the PKW, resulting in an event window of about 1 minute. If the event is first identified using the PKW, the VKW may then be used to refine the event window to a much shorter period. For example, using the VKW, the event may be refined to the replay of the event (already chosen by the human production director) within the footage.
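The refinement from a coarse, PKW-derived window to a replay segment found by video analysis could, for example, take the following form. The nearest-replay rule below is an assumption made for illustration.

```python
from typing import List, Tuple

def refine_event_window(pkw_time_s: float,
                        replay_segments: List[Tuple[float, float]],
                        coarse_window_s: float = 60.0) -> Tuple[float, float]:
    """Refine a coarse, PKW-derived event window using video keywords.

    pkw_time_s      : event time taken from the structured text broadcast, in seconds
    replay_segments : (start, end) boundaries of replay segments found by video analysis
    Returns the replay segment nearest the PKW time if one overlaps the coarse
    window; otherwise the original coarse window is kept.
    """
    lo, hi = pkw_time_s - coarse_window_s, pkw_time_s + coarse_window_s
    candidates = [(s, e) for (s, e) in replay_segments if s < hi and e > lo]
    if not candidates:
        return (max(0.0, lo), hi)   # fall back to the coarse window
    # Pick the replay whose midpoint is nearest the PKW time stamp.
    return min(candidates, key=lambda se: abs((se[0] + se[1]) / 2.0 - pkw_time_s))
```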
The VKW may also be used in synchronising the event boundaries between the video stream and the STB stream.
Video analysis (412 in
Video shot parsing involves parsing the footage into types of video segments (VS).
An example of a commercial detection algorithm is disclosed in U.S. Pat. No. 6,100,941. TV commercials are detected based on whether a black frame has occurred. Other parameters are used to refine the process including the average cut frame distance, cut rate, changes in the average cut frame distance, the absence of a logo, a commercial signature detection, brand name detection, a series of black frames preceding a high cut rate, similar frames located within a specified period of time before a frame being analyzed and character detection.
An example of a replay detection algorithm is disclosed in a paper by L. Y. Duan, M. Xu, Q. Tian and C. S. Xu, entitled “Mean shift based video segment representation and applications to replay detection”, published in ICIP 2004, Singapore. Replay segments are detected from sports video based on mean-shift video segmentation, where both spatial and temporal features are clustered to characterise video segments. For example colours and motions may be used for clustering. Parameters of these clusters can then be used to detect replays robustly because of the special characteristics of the replay logos.
An example of a play-break detection algorithm is disclosed in a paper by L. Xie, S.-F. Chang, A. Divakaran and H. Sun, entitled “Structure Analysis of Soccer Video with Hidden Markov Models”, published in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP-2002), Orlando, Fla., USA, May 13-17, 2002. An HMM-based method may be used to detect Play Video Segments (PVS) 904 and Break Video Segments (BVS) 906. Dominant colour ratio and motion intensity are used in HMM models to model the two states. Each state of the game has a stochastic structure that is modelled with a set of hidden Markov models. Finally, standard dynamic programming techniques are used to obtain the maximum likelihood segmentation of the game into the two states.
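For illustration only, a two-state segmentation in the spirit of the above could be sketched as follows, assuming per-frame dominant-colour ratio and motion intensity have already been extracted and that the hmmlearn library is available. This is a sketch, not the cited paper's implementation.

```python
import numpy as np
from hmmlearn import hmm   # assumption: the hmmlearn package is available

def segment_play_break(features: np.ndarray) -> np.ndarray:
    """Label each frame as play (1) or break (0) with a two-state Gaussian HMM.

    features : array of shape (n_frames, 2) holding dominant-colour ratio and
               motion intensity per frame (extracted elsewhere).
    """
    model = hmm.GaussianHMM(n_components=2, covariance_type="diag", n_iter=50)
    model.fit(features)                # unsupervised fit over the whole game
    states = model.predict(features)   # maximum-likelihood state sequence
    # Convention: the state with the higher mean dominant-colour ratio is "play".
    play_state = int(np.argmax([features[states == s, 0].mean() for s in (0, 1)]))
    return (states == play_state).astype(int)
```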
As shown in
There are at least three types of video keywords. A first type has a length of one video shot. A second type is a sub-video shot which is less than one video shot. Finally a third type is a super-video shot that covers more than one video shot.
An example of a sub-video shot would be where one video shot is rather long, including several rounds of camera panning which cover both defence and offence for a team, in for example basketball or football. In these situations it is better to segment these long video shots into sub-shots so that each sub-shot describes either a defence or an offence.
Similarly, a super-video shot relates to where more than one video shot can better describe a given sports event. For instance in tennis video, each serve starts with a medium view of the player who is preparing to serve. The medium view is then followed by a court view. Therefore the medium view can be combined with the following court view into one semantic unit: a single video keyword that represents the whole event of serving the ball.
The process of determining VKW types is now described. In step 1000 intra-video-shot features (colour, motion, shot length, etc.) are analysed. In step 1002 middle-level feature detections are performed to detect the sports field region, camera motion and object motions. In step 1004 a determination is made as to whether sub-shot based video keywords should be considered. Sub-shot video keywords can be identified and refined through steps 1000, 1002 and 1004. Similarly, super-shot video keywords are identified in step 1006 so that one semantic unit can be formed to include several video shots.
In step 1008 a video keyword classifier parses the input video shot/sub-shot/super-shot into a set of predefined VKWs. Many supervised classifiers can be used, such as neural networks (NN) and support vector machines (SVM).
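A minimal sketch of step 1008 using a support vector machine (via scikit-learn) is given below; the per-shot feature contents and the VKW class names are assumptions made for this sketch.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical VKW classes for football footage.
VKW_CLASSES = ["far_view", "medium_view", "close_up", "goalmouth_view", "replay"]

def train_vkw_classifier(shot_features: np.ndarray, shot_labels: np.ndarray) -> SVC:
    """Train a multi-class SVM over per-shot features (colour, motion, shot length...).

    shot_labels are integer indices into VKW_CLASSES.
    """
    clf = SVC(kernel="rbf", C=1.0, gamma="scale")
    clf.fit(shot_features, shot_labels)
    return clf

def classify_shot(clf: SVC, feature_vector: np.ndarray) -> str:
    """Map one shot/sub-shot/super-shot feature vector to a predefined VKW."""
    return VKW_CLASSES[int(clf.predict(feature_vector.reshape(1, -1))[0])]
```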
In step 1010, various types of object detection can be used to further annotate these video keywords, including detection of the ball, the goalmouth and other important landmarks. This allows higher precision in synchronising events between the streams.
An example of object detection is ball detection. As shown in
A further example of object detection is goalmouth location. The process is shown in
Similarly to the VKW, the AKW may be used to further refine the indexed location and the indexed boundaries in the footage used to represent the event. The AKW may also be used in synchronising the event boundaries between the audio stream and the STB stream.
Some example AKWs are listed below. AKWs may either be generic or sports specific.
Low level features 1100 that may be used for AKW extraction include Mel frequency cepstral coefficients (MFCC), zero crossing rate (ZCR), linear prediction coefficients (LPC), short time energy (ST), spectral power (SP), cepstral coefficients (CC), etc. The audio data is sampled from the audio stream at a 44.1 kHz sample rate, with stereo channels and 16 bits per sample.
The MFCC features may be computed from the FFT power coefficients of the audio data. A triangular band pass filter bank filters the power coefficients. The filter bank consists of K=19 triangular filters. They have a constant mel-frequency interval, and cover the frequency range of 0 Hz-20050 Hz. The Zero crossing rate may be used for analysis of narrowband signals, although most audio signals may include both narrowband and broadband components. Zero crossings may also be used to distinguish between applause and commentating.
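For illustration, the low-level audio features could be computed as follows, here using the librosa library as an assumed front end. The 19 coefficients simply echo the K=19 figure above (which describes the filter count rather than the coefficient count), and librosa's internal mel filter bank differs from the one described.

```python
import librosa
import numpy as np

def extract_audio_features(path: str) -> np.ndarray:
    """Extract MFCC and zero-crossing-rate features from the audio stream.

    The 44.1 kHz sampling rate follows the text; the frame/hop sizes and the
    use of librosa are assumptions made for this sketch.
    """
    y, sr = librosa.load(path, sr=44100, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=19)   # 19 coefficients per frame
    zcr = librosa.feature.zero_crossing_rate(y)          # per-frame zero crossing rate
    # One feature vector per frame: the MFCCs stacked with the ZCR.
    return np.vstack([mfcc, zcr]).T                      # shape (n_frames, 20)
```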
Supervised classifiers 1102 such as multi-class support vector machines (SVM), decision trees and hidden Markov models (HMM) can be used for AKW extraction. Samples of the pre-determined AKWs are prepared first; classifiers can then be trained over the training samples and tested over testing data for performance evaluation.
Cross-media alignment (418 in
It may be useful, depending on the application, to detect events within the footage, and annotate the footage with this additional information.
In a first example events are detected (step 104 in
The player and team involved in each event are determined based on an analysis of the surrounding PKW.
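One possible attribution rule, sketched below, takes a majority vote over the play keywords parsed from the SWT within the event window; the vote rule and the field names (matching the parsing sketch above) are assumptions made for illustration only.

```python
from collections import Counter
from typing import Dict, List, Optional

def associate_player_team(event_time_min: int,
                          pkws: List[Dict],
                          window_min: int = 2) -> Dict[str, Optional[str]]:
    """Attribute an event to a player and team using surrounding play keywords.

    pkws are parsed play keywords with "game_time_min", "player_id" and
    "team_id" fields. The most frequently mentioned player and team inside
    the window are chosen.
    """
    nearby = [p for p in pkws
              if abs(p["game_time_min"] - event_time_min) <= window_min / 2.0]
    players = Counter(p["player_id"] for p in nearby if p.get("player_id"))
    teams = Counter(p["team_id"] for p in nearby if p.get("team_id"))
    return {
        "player": players.most_common(1)[0][0] if players else None,
        "team": teams.most_common(1)[0][0] if teams else None,
    }
```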
In the second example events are identified based on the video stream. As one possible case, the visual analysis previously described is used to detect each of the replays inserted by the human production director. Each of the replays is then annotated and stored in a database. Various methods may be used to analyse the video stream and associate it with events. For example machine learning methods such as neural networks, support vector machines and hidden Markov models may be used to detect events in this configuration.
As seen in
It may also be useful, depending on the application, to detect and classify replays (420 in
Replay detection and classification is described in detail in other sections. Thus the indexing and classification of replays simply forms another level of semantic annotation of the footage once stored in the database.
According to a first embodiment
For instance, a video summary of all the goals by the football star David Beckham can be created by identifying all games for this year, then identifying all replays associated with David Beckham and selecting those replays that involve a goal.
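As a sketch, this selection step reduces to a filter over the annotated event records; the field names and preference format below are illustrative only.

```python
from typing import Dict, List

def select_events(annotated_events: List[Dict], preferences: Dict) -> List[Dict]:
    """Select annotated events whose classification satisfies the user preferences."""
    wanted_player = preferences.get("player")            # e.g. "David Beckham"
    wanted_events = set(preferences.get("events", []))   # e.g. {"goal"}
    selected = []
    for ev in annotated_events:
        if wanted_player and ev.get("who") != wanted_player:
            continue
        if wanted_events and ev.get("what") not in wanted_events:
            continue
        selected.append(ev)
    return selected

if __name__ == "__main__":
    # Tiny illustrative database of annotated events (fields as in the sketches above).
    demo = [
        {"who": "David Beckham", "what": "goal", "when": 67 * 60},
        {"who": "David Beckham", "what": "free_kick", "when": 30 * 60},
        {"who": "Other Player", "what": "goal", "when": 12 * 60},
    ]
    print(select_events(demo, {"player": "David Beckham", "events": ["goal"]}))
```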
According to a second embodiment
Firstly the text summary 1300 is parsed to produce a sequence of important action items 1302 identified with key players, actions and teams, and other possible additional information such as the time of the actions and the name and location of the sports games. This generates the preferences (200 in
SWT parsing produces sequences of time-stamped PKWs that describe actions taking place in the sports game. The event boundaries are refined and aligned with the video stream and audio stream, and the annotated video is stored in a database 1306.
The preferences from the text summary are then used to select 1304 which events to include (step 202 in
The selection of events may be further refined 1308, depending on the preferred length of the summary or other preferences.
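One simple refinement, sketched below, greedily keeps the highest-importance events until a preferred total duration is reached; the importance field and the greedy rule are assumptions made for this sketch.

```python
from typing import Dict, List

def refine_by_length(selected: List[Dict], max_length_s: float) -> List[Dict]:
    """Trim the selection to a preferred total summary length.

    Each event is assumed to carry "start"/"end" boundaries (seconds) and an
    optional "importance" score; higher-importance events are kept first.
    """
    ranked = sorted(selected, key=lambda e: e.get("importance", 0.0), reverse=True)
    kept, total = [], 0.0
    for ev in ranked:
        duration = ev["end"] - ev["start"]
        if total + duration <= max_length_s:
            kept.append(ev)
            total += duration
    # Restore chronological order for the final summary.
    return sorted(kept, key=lambda e: e["start"])
```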
Finally, the video summary 1310 is generated (204 in
According to a third embodiment, a learning process may be used for detection and classification of replays, and for summary generation. Video replays are widely used in sports game broadcasting to show highlights that occurred in various sessions of the broadcast. For a typical football (soccer) game there are 40-60 replays generated by the human production director.
There are three types of replays, generated at three different stages of broadcasting. Instant replay video segments RVSinstant appear during regular sessions such as the first half and second half of games. Break replay video segments RVSbreak and post-game replay video segments RVSpost appear during the break session between the two half play sessions and during the post-game session. On average there are 30-60 RVSinstant for each football (soccer) game, while the numbers of RVSbreak and RVSpost are much smaller because only the most interesting actions or highlights are selected for showing during the break and post-game sessions.
For a football (soccer) game, where the total numbers of replays are denoted N-RVSinstant, N-RVSbreak and N-RVSpost, N-RVSbreak and N-RVSpost are much smaller than N-RVSinstant. Since human production directors carefully select the RVSbreak and RVSpost from the RVSinstant, the selection process performed by human directors can be learned. The learning process may involve machine learning methods such as neural networks, decision trees or support vector machines, so that different weightings or priorities can be given to different types of RVSinstant, possibly together with consideration of users' preferences, to create more precise video replays for users.
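For illustration, the learning step could be sketched with a decision tree (via scikit-learn) trained to predict which RVSinstant the director re-selected for the break or post-game sessions; the feature choices and the use of the predicted probability as a weighting are assumptions made for this sketch.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def learn_replay_weighting(instant_features: np.ndarray,
                           was_reselected: np.ndarray) -> DecisionTreeClassifier:
    """Learn the director's selection behaviour from past games.

    instant_features : one row per RVSinstant (e.g. event type, audio energy,
                       crowd response, game context) -- illustrative features
    was_reselected   : 1 if that replay also appeared as RVSbreak/RVSpost, else 0
                       (both classes are assumed to be present in the training data)
    """
    clf = DecisionTreeClassifier(max_depth=4)
    clf.fit(instant_features, was_reselected)
    return clf

def weight_replays(clf: DecisionTreeClassifier, instant_features: np.ndarray) -> np.ndarray:
    """Score new RVSinstant segments; higher scores mean more summary-worthy."""
    return clf.predict_proba(instant_features)[:, 1]
```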
Based on the detected and classified RVSinstant, as well as the learned weighting factors in terms of their importance, a selection of the RVSinstant can be made to generate the personalised video summaries automatically.
It will be appreciated by a person skilled in the art that numerous variations and/or modifications may be made to the present invention as shown in the example embodiments without departing from the spirit or scope of the invention as broadly described. The example embodiments are, therefore, to be considered in all respects to be illustrative and not restrictive.