The present invention relates to video processing in general, and more particularly to movie processing. Still more particularly, the present invention relates to a system and method for deep annotation leading to semantic indexing of videos based on comprehensive analyses of the video, the audio, and the associated script.
Video portals delivering content-based services need to draw more users onto their portals in order to enhance revenue: one of the most practical ways to achieve this is to provide user interfaces that let users see the “bits & pieces” of content that are of interest to them. Specifically, a movie as a whole is of interest in the initial stages of viewership; with time, different users need different portions of the movie, wherein the portions could be based on scene details, actors involved, or dialogs. The users would want to query a video portal to get the relevant portions extracted from several of the videos, and the extracted content packaged for delivery to the users. This requirement of users is a great boon for video on demand (VoD) service providers: there is an excellent commoditization, and hence monetization, of small portions of the content. Such micro-monetization is not uncommon in web-based services: for example, in scientific publishing, there are several opportunities for micro-monetization such as (a) relevant tables and figures; and (b) experimental results associated with the various technical papers contained in a repository. In all such cases and domains, deep annotation of the content helps in providing the most appropriate answers to the users' queries. Consider DVDs containing movies: deep annotation of a movie offers an opportunity to deeply and semantically index a DVD so that users could construct a large number of “video shows” based on their interests. Here, a video show is packaged content based on “bits & pieces” of a movie contained in the DVD. An approach to achieve deep annotation of a video is to perform a combined analysis based on the audio, the video, and the script associated with the video.
U.S. Pat. No. 7,467,164 to Marsh; David J. (Sammamish, Wash.) for “Media content descriptions” (issued on Dec. 16, 2008 and assigned to Microsoft Corporation (Redmond, Wash.)) describes a media content description system that receives media content descriptions from one or more metadata providers, associates each media content description with the metadata provider that provided the description, and may generate composite descriptions based on the received media content descriptions.
U.S. Pat. No. 7,457,532 to Barde; Sumedh N. (Redmond, Wash.), Cain; Jonathan M. (Seattle, Wash.), Janecek; David (Woodinville, Wash.), Terrell; John W. (Bothell, Wash.), Serbus; Bradley S. (Seattle, Wash.), Storm; Christina (Seattle, Wash.) for “Systems and methods for retrieving, viewing and navigating DVD-based content” (issued on Nov. 25, 2008 and assigned to Microsoft Corporation (Redmond, Wash.)) describes a system for enhancing a user's DVD experience by building a playlist structure shell based on a hierarchical structure associated with the DVD, and metadata associated with the DVD.
U.S. Pat. No. 7,448,021 to Lamkin; Allan (San Diego, Calif.), Collart; Todd (Los Altos, Calif.), Blair; Jeff (San Jose, Calif.) for “Software engine for combining video or audio content with programmatic content” (issued on Nov. 4, 2008 and assigned to Sonic Solutions, a California corporation (Novato, Calif.)) describes a system for combining video/audio content with programmatic content, generating programmatic content in response to the searching, and generating an image as a function of the programmatic content and the representation of the audio/video content.
U.S. Pat. No. 7,366,979 to Spielberg; Steven (Los Angeles, Calif.), Gustman; Samuel (Universal City, Calif.) for “Method and apparatus for annotating a document” (issued on Apr. 29, 2008 and assigned to Copernicus Investments, LLC (Los Angeles, Calif.)) describes an apparatus for annotating a document that allows for the addition of verbal annotations to a digital document such as a movie script, book, or any other type of document, and further the system stores audio comments in data storage as an annotation linked to a location in the document being annotated.
“Movie/Script: Alignment and Parsing of Video and Text Transcription” by Cour; Timothee, Jordan; Chris, Miltsakaki; Eleni, and Taskar; Ben (appeared in the Proceedings of the 10th European Conference on Computer Vision (ECCV 2008), Oct. 12-18, 2008, Marseille, France) describes an approach in which scene segmentation, alignment, and shot threading are formulated as a unified generative model, and further describes a hierarchical dynamic programming algorithm that handles alignment and jump-limited reordering in linear time.
“A Video Movie Annotation System-Annotation Movie with its Script” by Zhang; Wenli, Yaginuma; Yoshitomo; and Sakauchi; Masao (appeared in the Proceedings of WCCC-ICSP 2000, 5th International Conference on Signal Processing, Volume 2, Page(s): 1362-1366) describes a movie annotation method for synchronizing a movie with its script based on dynamic programming matching, and a video movie annotation system based on this method.
The known systems do not address how to bootstrap deep annotation of videos based on video and script analyses. A bootstrapping process tolerates errors in the initial stages of the analyses while achieving enhanced accuracy towards the end. In the bootstrapping process, it is important to obtain as much coarse-grained annotation of a video as possible so that the error in script alignment is minimized and the effectiveness of the deep annotation is enhanced. The present invention provides a system and method that uses the script associated with a video in the coarse-grained annotation of the video, and uses the coarse-grained annotation along with the script to generate the fine-grained annotation (that is, the deep annotation) of the video.
The primary objective of the invention is to associate deep annotation and semantic index with a video/movie.
One aspect of the invention is to exploit the script associated with a video/movie.
Another aspect of the invention is to analyze the script to identify a closed-world set of key-phrases.
Yet another aspect of the invention is to perform coarse-grained annotation of a video based on the closed-world set of key-phrases.
Another aspect of the invention is to perform coarse-grained annotation of a script.
Yet another aspect of the invention is to map a key frame of a video scene to one or more script segments based on the coarse-grained annotation of the key frame and the coarse-grained annotation of script segments.
Another aspect of the invention is to identify the best possible script segment to be mapped onto a video scene.
Yet another aspect of the invention is to analyze the script segment associated with a video scene to achieve a fine-grained annotation of the video scene.
Another aspect of the invention is to identify homogeneous video scenes based on the fine-grained annotation of the video scenes of a video.
a provides additional information about script and scene structures.
a provides illustrative annotations.
a depicts an approach for deep indexing of a video.
a depicts an illustrative segment mapping.
Deep annotation and semantic indexing of videos/movies help in providing enhanced and enriched access to content available on the web. At the same time, deep annotation of videos to give access to “bits & pieces” of content poses countless challenges. On the other hand, there has been tremendous work on the shallow annotation of videos, though without great success even at that level. One approach is to exploit any and all additional information available along with a movie. One such information base is the movie script: the script provides detailed and necessary information about the movie in the making; that is, the script is prepared well before the movie is shot. Because of this factor, the script and the finished movie may not correspond with each other; that is, the script and the movie may not match one to one. Additionally, it should be noted that the script, from the point of view of the movie, could be outdated, incomplete, and inconsistent. This poses a big challenge in using the textual description of the movie contained in the script. This means that independent video processing is complex and, at the same time, independent script processing is also complex. A way to address this two-dimensional complexity is to design a system that bootstraps through incremental analyses.
A video script provides information about a video and is normally meant to be understood by human beings (for example, shooting crew members) so that a video can be shot effectively.
a provides additional information about script and scene structures.
A video script is based on a set of Key Terms that provide a list of reserved words with pre-defined semantics. Similarly, Key Identifiers provide certain placeholders in the script structure that are to be filled in appropriately during the authoring of a script.
(EXT.), DAY, NIGHT, CLOSEUP, FADE IN, HOLD ON, PULL BACK TO REVEAL, INTO VIEW, KEEP HOLDING, . . . ;
Input: The script of a video, say a movie;
Output: A set of key-phrases that defines the closed world for the given video;
Step 1: Analyze all the instances of OBJECT and obtain a set of key-phrases, SA, based on, say, Frequency Analysis;
Step 2: Analyze all the instances of PERSON and obtain a set of key-phrases, SB, based on, say, Frequency Analysis;
Step 3: Analyze all the instances of LOCATION and obtain a set of key-phrases, SC, based on, say, Frequency Analysis;
Step 4: Analyze SCENE descriptions and obtain a set of key-phrases, SD, based on, say, Frequency Analysis;
Step 5: Analyze DIALOG descriptions and obtain a set of key-phrases, SE, based on, say, Frequency Analysis;
Step 6: Analyze ACTION descriptions and obtain a set of key-phrases, SF, based on, say, Frequency Analysis;
Step 7: Perform consistency analysis on the above sets SA-SF and arrive at a consolidated set of key-phrases, CW-Set; an illustrative sketch of this procedure is given below.
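The following is a minimal Python sketch of Steps 1 through 7, assuming the script has already been parsed into labeled instances of OBJECT, PERSON, LOCATION, SCENE, DIALOG, and ACTION; the frequency threshold and the simple de-duplication used for the consistency analysis are illustrative assumptions rather than prescribed choices.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set

# Illustrative frequency threshold; not prescribed by the method.
MIN_FREQUENCY = 3

def frequency_key_phrases(instances: Iterable[str]) -> Set[str]:
    """Steps 1-6 (per category): keep phrases occurring at least MIN_FREQUENCY times."""
    counts = Counter(phrase.strip().lower() for phrase in instances)
    return {phrase for phrase, count in counts.items() if count >= MIN_FREQUENCY}

def build_cw_set(parsed_script: Dict[str, List[str]]) -> Set[str]:
    """Step 7: consolidate the per-category sets SA-SF into CW-Set.

    `parsed_script` maps each category (OBJECT, PERSON, LOCATION, SCENE,
    DIALOG, ACTION) to the textual instances extracted from the script;
    consistency analysis is approximated here by simple de-duplication.
    """
    categories = ("OBJECT", "PERSON", "LOCATION", "SCENE", "DIALOG", "ACTION")
    cw_set: Set[str] = set()
    for category in categories:
        cw_set |= frequency_key_phrases(parsed_script.get(category, []))
    return cw_set
```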
a provides illustrative annotations.
Video analysis to arrive at annotations makes use of multiple techniques: some are based on image processing, some on text processing, and some on audio processing.
Given:
For each scene VSi, Generate a fine-grained annotation FAi
The process of deep annotation receives, as input, content described in terms of a set of video scenes, and uses the script associated with the content, described in terms of a set of script segments, along with a set of coarse-grained annotations associated with the set of video scenes, to arrive at a fine-grained annotation for each of the video scenes.
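A minimal sketch of the data involved in this process, with illustrative (not prescribed) container types and field names for key frames, video scenes, and script segments:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class KeyFrame:
    frame_id: str
    # VKFA: CW-Set key-phrases detected in the frame (coarse-grained annotation).
    coarse_annotation: List[str] = field(default_factory=list)

@dataclass
class VideoScene:
    scene_id: str
    key_frames: List[KeyFrame] = field(default_factory=list)

@dataclass
class ScriptSegment:
    segment_id: int
    text: str
    # Coarse-grained annotation of the segment, drawn from the same CW-Set.
    coarse_annotation: List[str] = field(default_factory=list)

# FAi / VSAi: scene identifier mapped to its fine-grained annotation,
# for example a list of SVO-style descriptors.
FineAnnotations = Dict[str, List[str]]
```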
a depicts an approach for deep indexing of a video.
Deep Annotation and Semantic Indexing
Given: Script segments and video scenes
Note: Multiple video scenes may correspond to a single script segment
Step 1: Based on script structure, identify script segments and make each segment complete by itself;
Step 2: Analyze input script and generate a closed world set (CW-Set) of key-phrases;
Step 3: Use CW-Set and annotate each video key frame (VKFi) of each video scene VSi;
Step 4: For each VKFi of VSi, based on VKFAi (video key frame annotation), identify K matching script segments (SSj's) based on the coarse-grained annotation associated with each script segment. This step accounts for both the inaccuracy in the coarse-grained annotation and the outdatedness of the script.
Step 4a: Apply a warping technique to identify the best possible script segment that matches with most of the key frames of the video scene VSi.
Step 5: Analyze the script segment associated with VSi to generate VSAi (video scene annotation). Note that this step employs a multitude of semi-structured text processing to arrive at an annotation of the video scene.
Step 7: Identify homogeneous video scenes, called video shows, based on the VSA's. A typical way to achieve this is to use a clustering technique based on the annotations of the video scenes. The identified clusters group together video scenes that have similar annotations and hence, the corresponding scenes are similar as well.
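As a hedged illustration of Step 4, the sketch below ranks script segments for a single key frame by the overlap between coarse-grained annotations; the overlap count is an illustrative similarity choice, and all identifiers and data in the example are hypothetical.

```python
from typing import Dict, List, Set

def top_k_segments(frame_annotation: Set[str],
                   segment_annotations: Dict[int, Set[str]],
                   k: int = 5) -> List[int]:
    """Step 4 (sketch): rank script segments for one key frame by the overlap
    between the frame's coarse annotation and each segment's coarse annotation."""
    scored = sorted(segment_annotations.items(),
                    key=lambda item: len(frame_annotation & item[1]),
                    reverse=True)
    return [segment_id for segment_id, _ in scored[:k]]

# Hypothetical example: a key frame mentioning a location and a person.
frame = {"harbor", "captain"}
segments = {4: {"harbor", "night"}, 5: {"harbor", "captain", "boat"}, 6: {"office"}}
print(top_k_segments(frame, segments, k=2))   # -> [5, 4]
```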
Given: Video scene VS
Start from SS11 and generate a sequence of X positional weights as follows:
a depicts an illustrative segment mapping. For illustrative purposes, consider the following (900):
Video Scene: 1; VS1 is the video scene.
Number of key frames: 6; VKF11, VKF12, VKF13, VKF14, VKF15, and VKF16 are the illustrative key frames.
Number of segments per key frame: 5; that is, the top 5 matched segments per key frame are selected for further analysis.
Total number of segments: 20
910 depicts the best matched segment (SS 5) with respect to the key frame VKF11, while 920 depicts the second best matched segment (SS 6) with respect to the key frame VKF16. There are in total 7 IsoSegmental lines, and 930 depicts the first of them. 940 depicts the various computations associated with the first IsoSegmental line: 950 indicates the script segment number (SS 5), 960 indicates the positional weight sequence, 970 depicts the associated error sequence, and 980 provides the error value. Based on the error values associated with the 7 IsoSegmental lines, IsoSegmental line 2, which has the least error, is selected, and its segment (SS 6) is the best matched segment for mapping onto VS1.
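The positional-weight and error computations are described with reference to the figure; as a simplified, assumed stand-in, the sketch below treats each candidate segment as an IsoSegmental line, accumulates a rank-based error over the key frames of the scene, and selects the candidate with the least total error.

```python
from typing import List

def best_segment_for_scene(top_k_per_frame: List[List[int]]) -> int:
    """Sketch of Step 4a: for each candidate segment (an "IsoSegmental line"),
    accumulate an error over the scene's key frames and keep the candidate with
    the least total error.  The per-frame error here is the rank of the candidate
    in that frame's top-K list (K if absent), an illustrative stand-in for the
    positional-weight computation described above."""
    k = max(len(ranks) for ranks in top_k_per_frame)
    candidates = {seg for ranks in top_k_per_frame for seg in ranks}
    def total_error(segment: int) -> int:
        return sum(ranks.index(segment) if segment in ranks else k
                   for ranks in top_k_per_frame)
    return min(candidates, key=total_error)

# Six key frames, top-5 candidate segments each (illustrative numbers only).
top_k = [[5, 6, 4, 7, 3],
         [6, 5, 7, 4, 8],
         [6, 7, 5, 8, 4],
         [5, 6, 8, 7, 9],
         [6, 5, 9, 7, 8],
         [6, 5, 7, 8, 4]]
print(best_segment_for_scene(top_k))   # -> 6
```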
Note that VSAi is a set with each element providing information in the form of SVO triplets associated with an OBJECT, PERSON, LOCATION, SCENE, DIALOG, or ACTION;
Primarily, there are six dimensions: OBJECT dimension, PERSON dimension, LOCATION dimension, SCENE dimension, DIALOG dimension, and ACTION dimension;
In order to identify homogeneous scenes, two things are essential: a homogeneity factor and a similarity measure. The homogeneity factor provides an abstract and computational description of a set of homogeneous scenes; for example, the OBJECT dimension is an illustration of a homogeneity factor. The similarity measure, on the other hand, defines how two video scenes correlate with each other along the homogeneity factor; for example, term by term matching of two SVO triplets is an illustration of a similarity measure.
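A minimal sketch of such a similarity measure, using term-by-term matching of SVO triplets and an assumed averaging scheme for comparing two scene annotations (the triplets in the example are hypothetical):

```python
from typing import Set, Tuple

SVO = Tuple[str, str, str]   # (subject, verb, object) triplet

def triplet_similarity(a: SVO, b: SVO) -> float:
    """Term-by-term matching of two SVO triplets: fraction of positions that agree."""
    return sum(x == y for x, y in zip(a, b)) / 3.0

def scene_similarity(vsa1: Set[SVO], vsa2: Set[SVO]) -> float:
    """Similarity of two fine-grained scene annotations along one homogeneity
    factor: average best-match similarity of the triplets in the first scene
    (an assumed aggregation, not a prescribed one)."""
    if not vsa1 or not vsa2:
        return 0.0
    return sum(max(triplet_similarity(a, b) for b in vsa2) for a in vsa1) / len(vsa1)

# Hypothetical use: two scenes sharing one OBJECT-dimension triplet.
s1 = {("captain", "boards", "boat"), ("crowd", "watches", "harbor")}
s2 = {("captain", "boards", "boat")}
print(scene_similarity(s1, s2))   # -> 0.5
```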
Thus, a system and method for deep annotation and semantic indexing is disclosed. Although the present invention has been described particularly with reference to the figures, it will be apparent to one of ordinary skill in the art that the present invention may appear in any number of systems that need to overcome the complexities associated with deep textual processing and deep multimedia analysis. It is further contemplated that many changes and modifications may be made by one of ordinary skill in the art without departing from the spirit and scope of the present invention.