This invention relates generally to the description of multimedia content, and more particularly to constructing semantic descriptions using transform technology.
A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever. The following notice applies to the software and data as described below and in the drawings hereto: Copyright © 2001, Sony Electronics, Inc., All Rights Reserved.
Digital multimedia information is becoming widely distributed through broadcast transmission, such as digital television signals, and interactive transmission, such as the Internet. The information may be in still images, audio feeds, or video data streams. However, the availability of such a large volume of information has led to difficulties in identifying content that is of particular interest to a user. Various organizations have attempted to deal with the problem by providing a description of the information that can be used to search, filter and/or browse to locate the particular content. The Moving Picture Experts Group (MPEG) has promulgated a Multimedia Content Description Interface standard, commonly referred to as MPEG-7, to standardize the content descriptions for multimedia information. In contrast to preceding MPEG standards such as MPEG-1 and MPEG-2, which define coded representations of audio-visual content, an MPEG-7 content description describes the structure and semantics of the content and not the content itself.
Using a movie as an example, a corresponding MPEG-7 content description would contain “descriptors,” which are components that describe the features of the movie, such as scenes, titles for scenes, shots within scenes, and time, color, shape, motion, and audio information for the shots. The content description would also contain one or more “description schemes,” which are components that describe relationships among two or more descriptors, such as a shot description scheme that relates together the features of a shot. A description scheme can also describe the relationship among other description schemes, and between description schemes and descriptors, such as a scene description scheme that relates the different shots in a scene, and relates the title feature of the scene to the shots.
MPEG-7 uses a Data Definition Language (DDL) to define descriptors and description schemes, and provides a core set of descriptors and description schemes. The DDL definitions for a set of descriptors and description schemes are organized into “schemas” for different classes of content. The DDL definition for each descriptor in a schema specifies the syntax and semantics of the corresponding feature. The DDL definition for each description scheme in a schema specifies the structure and semantics of the relationships among its child components, namely the descriptors and description schemes. The DDL may be used to modify and extend the existing description schemes and create new description schemes and descriptors.
The MPEG-7 DDL is based on the Extensible Markup Language (XML) and XML Schema standards. The descriptors, description schemes, semantics, syntax, and structures are represented with XML elements and XML attributes. Some of the XML elements and attributes may be optional.
The MPEG-7 content description for a particular piece of content is an instance of an MPEG-7 schema; that is, it contains data that adheres to the syntax and semantics defined in the schema. The content description is encoded in an “instance document” that references the appropriate schema. The instance document contains a set of “descriptor values” for the required elements and attributes defined in the schema, and for any necessary optional elements and/or attributes. For example, some of the descriptor values for a particular movie might specify that the movie has three scenes, with scene one having six shots, scene two having five shots, and scene three having ten shots. The instance document may be encoded in a textual format using XML, or in a binary format, such as the binary format specified for MPEG-7 data, known as “BiM,” or a mixture of the two formats.
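By way of illustration only, the movie example above can be sketched as a small textual instance document. The element names `Movie`, `Scene`, and `Shot` and the `id` attribute are hypothetical placeholders chosen for this sketch; they are not taken from any MPEG-7 schema, and no schema reference or validation is attempted.

```python
import xml.etree.ElementTree as ET


def build_instance_document(shots_per_scene):
    """Build a toy content description with one Scene per entry and its Shots."""
    # "Movie", "Scene", and "Shot" are illustrative names, not MPEG-7 DDL elements.
    root = ET.Element("Movie")
    for scene_no, shot_count in enumerate(shots_per_scene, start=1):
        scene = ET.SubElement(root, "Scene", id=str(scene_no))
        for shot_no in range(1, shot_count + 1):
            ET.SubElement(scene, "Shot", id=f"{scene_no}.{shot_no}")
    return root


# Three scenes with six, five, and ten shots, as in the example above.
movie = build_instance_document([6, 5, 10])

# Textual (XML) encoding of the description.
xml_text = ET.tostring(movie, encoding="unicode")

# Descriptor values read back from the document.
shot_counts = [len(scene.findall("Shot")) for scene in movie.findall("Scene")]
```

A receiving system could read the same values back from `xml_text` with `ET.fromstring`; a binary encoding such as BiM is outside the scope of this sketch.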
The instance document is transmitted through a communication channel, such as a computer network, to another system that uses the content description data contained in the instance document to search, filter and/or browse the corresponding content data stream. Typically, the instance document is compressed for faster transmission. An encoder component may both encode and compress the instance document or the functions may be performed by different components. Furthermore, the instance document may be generated by one system and subsequently transmitted by a different system. A corresponding decoder component at the receiving system uses the referenced schema to decode the instance document. The schema may be transmitted to the decoder separately from the instance document, as part of the same transmission, or obtained by the receiving system from another source. Alternatively, certain schemas may be incorporated into the decoder.
Description schemes directed to describing content generally relate to either the structure or the semantics of the content. Structure-based description schemes are typically defined in terms of segments that represent physical, spatial and/or temporal features of the content, such as regions, scenes, shots, and the relationships among them. The details of the segments are typically described in terms of signals, e.g., color, texture, shape, motion, etc.
The semantic description of the content is provided by the semantic-based description schemes. These description schemes describe the content in terms of what it depicts, such as objects, people, events, and their relationships. Depending on user domains and applications, the content can be described using different types of features, tuned to the area of application. For example, the content can be described at a low abstraction level using descriptions of such content features as objects' shapes, sizes, textures, colors, movements and positions. At a higher abstraction level, a description scheme may provide conceptual information about the reality captured by the content such as information about objects and events and interactions among objects. For example, a high abstraction level description may provide the following semantic information: “This is a scene with a barking brown dog on the left and a blue ball that falls down on the right, with the sound of passing cars in the background.”
Current methods for constructing semantic descriptions allow for automatic creation of simple, low-level descriptions. However, human descriptions are often referential and metaphorical. Hence, the above methods cannot be used for semantic descriptions resembling more complex human descriptions.
Existing descriptions are blended to create a new description, and a residue is extracted from each of the plurality of existing descriptions. Further, a set of image style pyramids is created for the new description using residues extracted from the existing descriptions.
In the following detailed description of embodiments of the invention, reference is made to the accompanying drawings in which like references indicate similar elements, and in which is shown, by way of illustration, specific embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention, and it is to be understood that other embodiments may be utilized and that logical, mechanical, electrical, functional and other changes may be made without departing from the scope of the present invention. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined only by the appended claims.
Beginning with an overview of the operation of the invention,
The client system 113 includes a content accessing module 115 that uses the content description 101 to search, filter and/or browse the corresponding content data stream. The content accessing module 115 may employ a decoder 119 to obtain the structure and semantic information about the content using the instance document 111.
In one embodiment, the description constructor 127 creates a set of image style pyramids for the new content description 101. The set of image style pyramids may include, for example, a Gaussian pyramid, a Laplacian pyramid, and a wavelet pyramid. The encoder 109 then transmits the image style pyramids of the new description to the client 113. In one embodiment, the repository 103 stores image style pyramids of semantic descriptions to facilitate efficient construction of new descriptions. In addition, the image style pyramids may be used for analysis of the semantic descriptions or any other processing of the semantic descriptions. Subject to restrictions governing data loss, the image style pyramids may be decoded to recover the original descriptions.
In one embodiment, the new description is an MPEG-7 description scheme (DS) pertaining to semantic aspects of the content. Each semantic description may be represented as a graph, with nodes deriving from the SemanticBase DS and edges being semantic relations selected from a list of conforming relations from semantic objects. In particular, graphical classification schemes (GCS) may be used to store templates of descriptions that may be reused and graph transformation steps that may be reused. Graph transformations may include, for example, a pushout such as a single pushout known as a pasting operation or a double pushout known as a cut and paste operation, and a pullback such as a single pullback known as a node replacement or a double pullback known as a replace operation for complex parts. Depending on the area of content, a description may belong to a certain application domain representing a grammar with respect to templates and transformation in the GCS. The grammar may be used to partition a description. That is, factoring a description by templates or by several distinct grammars in the GCS may be used to break down a description.
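As a minimal sketch of the graph representation and the pasting operation mentioned above, a semantic description may be modelled as a set of named nodes and labeled edges, and two descriptions may be glued along the nodes they share by name. The class `SemanticGraph`, the functions `paste` and `shared_part`, and the relation labels are assumptions introduced for illustration; matching nodes by name is a simplification of a pushout and is not asserted to be how an MPEG-7 implementation performs it.

```python
from dataclasses import dataclass, field


@dataclass
class SemanticGraph:
    """A description as a graph: named nodes and (source, relation, target) edges."""
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)

    def add_relation(self, source, relation, target):
        self.nodes.update((source, target))
        self.edges.add((source, relation, target))


def shared_part(left, right):
    """Nodes and edges present in both graphs (the part along which they are glued)."""
    return SemanticGraph(left.nodes & right.nodes, left.edges & right.edges)


def paste(left, right):
    """Simplified pasting: union of both graphs, identifying nodes with the same name."""
    return SemanticGraph(left.nodes | right.nodes, left.edges | right.edges)


# Two small descriptions that share the node "scene" (relation labels are illustrative).
dog = SemanticGraph()
dog.add_relation("scene", "depicts", "dog")
dog.add_relation("dog", "agentOf", "bark")

ball = SemanticGraph()
ball.add_relation("scene", "depicts", "ball")
ball.add_relation("ball", "agentOf", "fall")

shared = shared_part(dog, ball)
pasted = paste(dog, ball)
```

Here `pasted` contains a single `scene` node connected to both the dog and the ball subgraphs, which is the effect the pasting operation is intended to have in this simplified model.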
In one embodiment, the description constructor 127 constructs new semantic descriptions using a process that resembles a mental space model. Mental spaces provide context for communication by importing a lot of information not included in the speech, thereby providing a mechanism for interpreting semantic content in language. This information is imported using maps. These maps function by using (i.e., “recruiting”) frames which represent predefined constructs for interpretation, projecting structure from one mental space to another, and integrating or abstracting imported material from more than one other mental space. Accordingly, each mental space may represent an extended description containing entities, relationships, and frames. Several mental spaces may be active at once, in order to properly define all the entities in the description. These mental spaces enter into relationships with each other. Because the mental spaces borrow structure and entities from each other, mappings are necessary between these mental spaces. The whole composite forms a backdrop to the expressed description and completes the process of attaching semantic meaning to the entities involved.
Referring to
The MPEG-7 model allows mental spaces that include, for example, basic descriptions created for a current description, template elements allowing for validation and recruitment, production steps to provide the process (“run the space”), production steps and ontology links to allow interpretation and recruitment, and basic elements that are graphs and productions. In addition, the MPEG-7 model allows for blending. Results of the blend may be expressed as a selective projection (restriction of the pushout maps that can be done by restricting to subsets of the input set), composition (fusion in the iterative step), completion (recruitment from GCS that has been tapped to do the description), elaboration (tentative running of processes discovered by completion), and an emergent structure (recorded to add new entries to the GCS or to complete the description).
Referring to
Next, processing logic blends the identified content descriptions together. In particular, processing logic creates a blend for each pair of the identified descriptions (processing block 404), creates a generic space for each pair of the identified descriptions (processing block 406), and extracts a residue from each of the input descriptions (processing block 408). Then, processing logic blends each pair of the prior results (processing block 410), creates a next generic space for each pair of the prior results (processing block 412), and extracts a residue from each of the prior results (processing block 414). The operations of processing blocks 410 through 414 are performed iteratively until processing block 410 produces a single output (processing block 416).
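The iteration described above may be sketched as follows, under several simplifying assumptions that are not part of the described embodiments: each description is modelled as a set of elements, the generic space of a pair is taken to be the set intersection, the blend is taken to be the set union, a residue is an input minus the generic space, and pairs are formed from adjacent entries with an unpaired entry carried forward unchanged. The function names `blend_pair` and `blend_all` are hypothetical.

```python
def blend_pair(a, b):
    """Blend two descriptions, modelled here as sets of elements."""
    generic = a & b                          # structure common to both inputs
    blend = a | b                            # inputs glued along the generic space
    residues = (a - generic, b - generic)    # what each input adds beyond it
    return blend, generic, residues


def blend_all(descriptions):
    """Blend pairwise, level by level, until a single output remains."""
    if not descriptions:
        raise ValueError("at least one description is required")
    current = [frozenset(d) for d in descriptions]
    levels = []
    while len(current) > 1:
        blends, generics, residues = [], [], []
        for i in range(0, len(current) - 1, 2):
            blend, generic, pair_residues = blend_pair(current[i], current[i + 1])
            blends.append(blend)
            generics.append(generic)
            residues.extend(pair_residues)
        if len(current) % 2:
            # Assumption: an unpaired description is carried to the next level as-is.
            blends.append(current[-1])
        levels.append(
            {"blends": blends, "generic_spaces": generics, "residues": residues}
        )
        current = blends
    return current[0], levels


final, levels = blend_all([
    {"dog", "barks", "left"},
    {"dog", "brown"},
    {"ball", "blue"},
    {"ball", "falls", "right"},
])
```

Each entry of `levels` holds the blends, generic spaces, and residues produced at one iteration, which are the ingredients from which the pyramids are subsequently built.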
Further, processing logic creates a set of image style pyramids for the new description using the residues, resulting generic spaces, and/or resulting blends (processing block 418). The set of image style pyramids may include, for example, a wavelet pyramid, a Laplacian pyramid, and a Gaussian pyramid.
The creation of image style pyramids allows for analyses of descriptions, efficient transmission and storage of descriptions, and efficient construction of new descriptions.
In one embodiment, depending on the rules for running the blend and the information saved in the wavelet pyramid, all the pyramids in the set can be used to reconstruct the original descriptions. If subtraction (cutting) of the generic space from the blended space results in two spaces, then the wavelet transform may be recoverable. Otherwise, the extra spaces may need to be saved, as will be discussed in greater detail below in conjunction with
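The two-space condition above can be illustrated with a simple check, assuming the blended space is modelled as an undirected graph and "cutting" the generic space means deleting its nodes and any edges touching them. The functions `components`, `cut`, and `splits_into_two` are hypothetical names for this sketch, which tests only the counting condition and does not itself perform the recovery.

```python
def components(nodes, edges):
    """Return the connected components of an undirected graph as a list of node sets."""
    adjacency = {n: set() for n in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, parts = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, part = [start], set()
        while stack:
            node = stack.pop()
            if node in part:
                continue
            part.add(node)
            stack.extend(adjacency[node] - part)
        seen |= part
        parts.append(part)
    return parts


def cut(blend_nodes, blend_edges, generic_nodes):
    """Remove the generic space from the blend and return the remaining spaces."""
    kept_nodes = blend_nodes - generic_nodes
    kept_edges = {(u, v) for u, v in blend_edges if u in kept_nodes and v in kept_nodes}
    return components(kept_nodes, kept_edges)


def splits_into_two(blend_nodes, blend_edges, generic_nodes):
    """True when cutting the generic space leaves exactly two spaces."""
    return len(cut(blend_nodes, blend_edges, generic_nodes)) == 2


blend_nodes = {"scene", "dog", "bark", "ball", "fall"}
blend_edges = {("scene", "dog"), ("dog", "bark"), ("scene", "ball"), ("ball", "fall")}
parts = cut(blend_nodes, blend_edges, {"scene"})
two_spaces = splits_into_two(blend_nodes, blend_edges, {"scene"})

# A third piece attached only to the generic space yields an extra space after the cut.
three_spaces = splits_into_two(
    blend_nodes | {"car"}, blend_edges | {("scene", "car")}, {"scene"}
)
```

In the first case the cut leaves one space per input description; in the second, the additional space is the kind of extra information that may need to be saved.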
In one embodiment, multiple image descriptions are encoded as a wavelet transform that includes a set of new image descriptions. Subsequently, the original image descriptions may be decoded from the wavelet transform in a lossless or lossy fashion depending on restrictions governing data loss.
Referring to
At processing block 504, processing logic creates a blend of these source descriptions based on their matching elements. The blend may be created by performing the pushout and then running the blend.
At processing block 506, processing logic creates a generic space for the source descriptions by pulling the resulting map back to the generic space.
At processing block 508, processing logic extracts a residue from each input source description.
If the source descriptions include more than two descriptions, process 500 is repeated for each additional pair of source descriptions, and then the results are blended in subsequent iterations until a single output is produced.
Referring to
The generic space 608 may be used to extract residues from the input descriptions 602 and 604.
Residues may also be derived from blends. Then, the sequence of generic spaces may lead to a wavelet pyramid 624 or 626 illustrated in
The image style pyramids 620 through 624 have familiar image analysis and multimedia names and properties, allowing for analysis of descriptions, as well as their efficient storage, transmission and construction.
A method and apparatus for constructing semantic descriptions using transform technology have been described. Although specific embodiments have been illustrated and described herein, it will be appreciated by those of ordinary skill in the art that any arrangement which is calculated to achieve the same purpose may be substituted for the specific embodiments shown. This application is intended to cover any adaptations or variations of the present invention.
The terminology used in this application with respect to MPEG-7 is meant to include all environments that provide content descriptions. Therefore, it is manifestly intended that this invention be limited only by the following claims and equivalents thereof.
This application is related to and claims the benefit of U.S. Provisional Patent application Ser. No. 60/506,931 filed Sep. 29, 2003, which is hereby incorporated by reference.
Number | Date | Country
---|---|---
60506931 | Sep 2003 | US