This application claims the benefit of Korean Patent Application No. 2004-10820, filed on Feb. 18, 2004, in the Korean Intellectual Property Office, the disclosure of which is incorporated herein in its entirety by reference.
1. Field of the Invention
The present invention relates to an image reproducing method, medium, and apparatus, and more particularly, to a method, medium, and apparatus for summarizing a plurality of frames, which classify the plurality of frames and output a frame summary by selecting representative frames from the classified frames.
2. Description of the Related Art
In general, an image reproducing apparatus, which plays back still images or video streams stored in a storage medium for a user to watch via a display device, decodes encoded image data and outputs the decoded image data. Recently, networks, digital storage media, and image compression/decompression technologies have been developed. Accordingly, apparatuses that store digital images in storage media and reproduce the digital images have become popular.
When a number of digital video streams or still images are stored in a bulk storage medium, it is necessary to provide functions that allow a user to easily and quickly select and reproduce a desired image, or to select, reproduce, and edit only an interesting or desired portion of a video from among the stored images. A function allowing a user to understand the contents of video streams easily and quickly is called “video summarization”.
One method of summarizing a plurality of frames is to select representative frames from the plurality of frames and browse the representative frames, or to view a shot (i.e., a segment including the same scene) including the representative frames in a video stream. The number of selected representative frames or the method of browsing the representative frames can vary according to the particular application. In general, to select representative frames, a chosen video stream is split into a number of shots corresponding to scene changes, and one or more keyframes are selected from each shot. Since a number of shots exist in a video stream and the number of keyframes obtained from the shots is very large, it is impractical to use the keyframes directly for video summarization. Therefore, clusters are formed by classifying the keyframes according to a similarity between frames, a representative frame is chosen from each cluster, and then a frame summary of the video stream is generated. This is a general representative frame selecting method.

To form clusters, various clustering methods have been disclosed. Ratakonda (U.S. Pat. No. 5,995,095) discloses the Linde-Buzo-Gray method applied between consecutive frames; however, since frames having low similarity are classified into the same cluster when a pair of keyframes having low similarity is repeated, it may be inappropriate to apply the result to video summarization. Liou et al. (U.S. Pat. No. 6,278,446) apply the nearest neighborhood method to cluster generation; however, it is difficult to control the number of output clusters, and since whether a frame is included in a cluster is determined with a specific threshold value, an appropriate threshold value must be set for each input video stream. Yeo et al. (U.S. Pat. No. 5,821,945), Uchihachi et al. (U.S. Pat. No. 6,535,639), and Loui et al. (U.S. Publication No. 2003-0058268) apply hierarchical methods to cluster generation. However, since these references adopt a general hierarchical method or a method based on a Bayesian model, problems can arise when the length of a video stream is long but the number of required clusters is small, when a video stream does not fit the assumed model, or when frames having high similarity are classified into different clusters. In particular, when the latter problem occurs in a case where the required number of representative frames is very small, a plurality of similar frames can be included in a summary, and a user may not trust the provided video summarization function.
Accordingly, the present invention provides a method, medium, and apparatus for summarizing a plurality of frames, which classify the plurality of frames according to a similarity between frames and output a frame summary by selecting representative frames from the classified frames. The present invention solves the above conventional problems and provides convenience to a user of an image reproducing apparatus by summarizing a plurality of still images or a video stream into a certain number of frames.
Additional aspects and/or advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
The foregoing and/or other aspects of the present invention are achieved by providing a method of summarizing video streams, the method including receiving a video stream and extracting a keyframe for each shot, selecting a predetermined number of representative frames from the keyframes corresponding to the shots, and outputting a frame summary using the representative frames.
The receiving of the video stream and extracting of the keyframe for each shot may include splitting the input video stream into shots, and extracting a keyframe for each shot.
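The shot discriminating method and the keyframe selection rule are left open here (and are noted later as being independent of this design), so the following Python sketch is only an assumed example: shots are cut where a global color-histogram difference exceeds a threshold, and the middle frame of each shot is taken as its keyframe.

```python
import numpy as np

def split_shots_and_keyframes(frames, threshold=0.3):
    """Assumed example of shot splitting and keyframe extraction.

    frames: sequence of H x W x 3 uint8 arrays (assumed decoded frames).
    Returns (shots, keyframes): shot index ranges and one keyframe index per shot.
    """
    def hist(frame):
        # global intensity histogram, normalized to sum to 1
        h = np.histogram(frame, bins=32, range=(0, 256))[0].astype(float)
        return h / h.sum()

    hists = [hist(f) for f in frames]
    boundaries = [0]
    for i in range(1, len(frames)):
        # declare a scene change when consecutive histograms differ strongly
        if np.abs(hists[i] - hists[i - 1]).sum() > threshold:
            boundaries.append(i)
    boundaries.append(len(frames))

    shots = [(boundaries[i], boundaries[i + 1]) for i in range(len(boundaries) - 1)]
    keyframes = [(start + end) // 2 for start, end in shots]  # middle frame per shot
    return shots, keyframes
```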
The selecting of the predetermined number of representative frames from the keyframes corresponding to the shots may include splitting a plurality of keyframes corresponding to shots into a number of clusters which is the same as a predetermined number of representative frames, and extracting a representative frame from each cluster.
The splitting of the plurality of keyframes corresponding to the shots into a number of clusters which is the same as the predetermined number of representative frames may include: composing a node having zero depth (i.e., depth information) for each keyframe of the plurality of keyframes, and calculating feature values of the keyframes and differences between the feature values of the keyframes; until a number of highest nodes is equal to the predetermined number of representative frames, selecting two highest nodes having the minimum difference between feature values, connecting the two selected nodes to a new node having a depth obtained by adding 1 to the largest value of the depths of the highest nodes, and calculating a feature value of the new node; and until the number of highest nodes each including more keyframes than a predetermined value (MIN) is equal to the predetermined number of representative frames, removing highest nodes each including fewer keyframes than the predetermined value (MIN), together with descendant nodes of those highest nodes, and removing a highest node having the largest depth among the remaining highest nodes.
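For illustration only, the following Python sketch shows one way the clustering procedure just described could be realized. The Euclidean distance between feature vectors, the use of the mean of member features as the feature value of a merged node, and the default MIN value are assumptions, since none of these is specified above.

```python
import numpy as np

def cluster_keyframes(features, num_rep, min_size=2):
    """Split keyframes into num_rep clusters by bottom-up node merging.

    features: one feature vector per keyframe (assumed representation).
    min_size: the predetermined value MIN (default is an assumption).
    Returns a list of clusters, each a list of keyframe indices.
    """
    features = np.asarray(features, dtype=float)
    # Phase 1: one zero-depth node per keyframe.
    top = [{"depth": 0, "frames": [i], "feat": features[i], "children": None}
           for i in range(len(features))]

    # Phase 2: merge the two most similar highest nodes until num_rep remain.
    while len(top) > num_rep:
        best = None
        for i in range(len(top)):
            for j in range(i + 1, len(top)):
                d = np.linalg.norm(top[i]["feat"] - top[j]["feat"])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        a, b = top[i], top[j]
        frames = a["frames"] + b["frames"]
        merged = {"depth": max(a["depth"], b["depth"]) + 1,
                  "frames": frames,
                  # feature value of the new node: mean of member features (assumption)
                  "feat": features[frames].mean(axis=0),
                  "children": (a, b)}
        top = [n for k, n in enumerate(top) if k not in (i, j)] + [merged]

    # Phase 3: drop highest nodes with too few keyframes and split the deepest
    # remaining node until num_rep sufficiently large highest nodes remain.
    while sum(len(n["frames"]) > min_size for n in top) < num_rep:
        top = [n for n in top if len(n["frames"]) > min_size]
        if not top:
            break
        deepest = max(top, key=lambda n: n["depth"])
        if deepest["children"] is None:   # a zero-depth node cannot be split
            break
        top.remove(deepest)
        top.extend(deepest["children"])   # its two children become highest nodes
    return [n["frames"] for n in top]
```

Calling cluster_keyframes(features, num_rep) with one feature vector per shot keyframe would return num_rep lists of keyframe indices, one list per cluster.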
The extracting of the representative frame from each cluster may include calculating a mean value of feature values of keyframes included in each cluster, calculating differences between the mean value and the feature values of the keyframes, and selecting a keyframe having the minimum difference value as a representative frame.
As an alternative, the extracting of the representative frame from each cluster may include calculating a mean value of feature values of keyframes included in each cluster, calculating differences between the mean value and the feature values of the keyframes, selecting two keyframes having the minimum difference values, and selecting a keyframe satisfying a predetermined condition out of the two selected keyframes as a representative frame.
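The two selection variants above can be sketched in Python as follows. The Euclidean distance to the cluster mean is assumed, and since the “predetermined condition” for choosing between the two closest keyframes is not specified, picking the temporally earlier frame is used purely as a placeholder.

```python
import numpy as np

def representative_by_mean(cluster, features):
    """Variant 1: the keyframe whose feature is closest to the cluster mean."""
    feats = np.asarray([features[i] for i in cluster], dtype=float)
    dists = np.linalg.norm(feats - feats.mean(axis=0), axis=1)
    return cluster[int(np.argmin(dists))]

def representative_by_condition(cluster, features):
    """Variant 2: take the two keyframes closest to the cluster mean, then apply
    a predetermined condition; choosing the earlier frame index is an assumption."""
    feats = np.asarray([features[i] for i in cluster], dtype=float)
    dists = np.linalg.norm(feats - feats.mean(axis=0), axis=1)
    two_closest = [cluster[k] for k in np.argsort(dists)[:2]]
    return min(two_closest)
```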
The outputting of the frame summary using the representative frames may include summarizing the video stream using the selected representative frames and information of the selected representative frames, and outputting a frame summary and frame information. As an alternative, the outputting of the frame summary using the representative frames may include arranging the selected representative frames in temporal order using information of the selected representative frames, outputting a frame summary and frame information, and, when a number of representative frames is re-designated, outputting a frame summary and frame information by arranging representative frames, which are selected according to the re-designated number of representative frames, in temporal order. As another alternative, the outputting of the frame summary using the representative frames may include increasing the number of representative frames until a sum of the durations of the shots including the selected representative frames is longer than a predetermined time, and then, until the sum of the durations of the shots including the selected representative frames is shorter than the predetermined time, calculating, for each representative frame, the standard deviation of the time differences between the shots including the representative frames remaining after that representative frame is excluded, and removing the representative frame whose exclusion yields the minimum standard deviation.
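As a rough illustration of the last alternative, the sketch below first grows and then trims the set of representative frames against a target summary duration. The shot representation (start time and duration in seconds) and the select_reps(k) callback, standing in for the clustering and selection steps above, are assumptions made for this example.

```python
import numpy as np

def fit_summary_to_duration(shots, target_sec, select_reps):
    """shots: list of (start_sec, duration_sec) per shot (assumed format).
    select_reps(k): returns indices of the shots holding k chosen representatives.
    """
    def total(shot_ids):
        return sum(shots[s][1] for s in shot_ids)

    # Increase the number of representatives until the shots containing them
    # cover at least the predetermined time.
    k = 1
    rep_shots = list(select_reps(k))
    while total(rep_shots) < target_sec and k < len(shots):
        k += 1
        rep_shots = list(select_reps(k))

    # While the summary is still too long, remove the representative whose
    # exclusion leaves the remaining shots with the minimum standard deviation
    # of time differences, i.e., the most even temporal spread.
    while total(rep_shots) > target_sec and len(rep_shots) > 1:
        best = None
        for r in rep_shots:
            rest = sorted(s for s in rep_shots if s != r)
            starts = [shots[s][0] for s in rest]
            sd = float(np.std(np.diff(starts))) if len(starts) > 1 else 0.0
            if best is None or sd < best[0]:
                best = (sd, r)
        rep_shots = [s for s in rep_shots if s != best[1]]
    return rep_shots
```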
It is another aspect of the present invention to provide a method of summarizing a plurality of still images, the method including receiving still images and selecting a predetermined number of representative frames, and outputting a frame summary using the selected representative frames.
The receiving of still images and selecting of the predetermined number of representative frames may include splitting a plurality of still images into a number of clusters which is the same as a predetermined number of representative frames, and extracting each representative frame for each cluster.
The splitting of the plurality of still images into a number of clusters which is the same as the predetermined number of representative frames may include: composing a node having zero depth for each still image, and calculating feature values of the still images and differences between the feature values of the still images; until the number of highest nodes is equal to the predetermined number of representative frames, selecting two highest nodes having the minimum difference between feature values, connecting the two selected nodes to a new node having a depth obtained by adding 1 to the largest value of the depths of the highest nodes, and calculating a feature value of the new node; and until the number of highest nodes each including more still images than a predetermined value (MIN) is equal to the predetermined number of representative frames, removing highest nodes each including fewer still images than the predetermined value (MIN), together with descendant nodes of those highest nodes, and removing a highest node having the largest depth among the remaining highest nodes.
The extracting of each representative frame for each cluster may include calculating a mean value of feature values of still images included in each cluster, calculating differences between the mean value and the feature values of the still images, and selecting a still image having the minimum difference value as a representative frame.
As an alternative, the extracting of each representative frame for each cluster may include calculating a mean value of feature values of still images included in each cluster, calculating differences between the mean value and the feature values of the still images, selecting two still images having the minimum difference values, and selecting a still image satisfying a predetermined condition out of the two selected still images as a representative frame.
It is another aspect of the present invention to provide an apparatus for summarizing video streams, the apparatus including a representative frame selector receiving a video stream and selecting representative frames, and a frame summary generator summarizing the video stream using the selected representative frames and outputting a frame summary and frame information.
The representative frame selector may include a keyframe extractor receiving a video stream, extracting a keyframe for each shot, and outputting keyframes corresponding to shots, a frame splitting unit receiving the keyframes corresponding to shots and splitting the keyframes corresponding to shots into a number of clusters which is the same as a predetermined number of representative frames, and a cluster representative frame extractor selecting one representative frame among keyframes corresponding to shots included in each cluster and outputting the representative frames.
The frame splitting unit may include a basic node composing unit receiving the keyframes corresponding to shots and composing a node having zero depth for each keyframe, a feature value calculator calculating feature values of the keyframes of the nodes and differences between the feature values, and a highest node composing unit selecting two highest nodes having the minimum difference between the feature values and connecting the two selected nodes to a new node having a depth obtained by adding 1 to the largest value of depths of the highest nodes.
The highest node composing unit may further include a minor cluster removing unit removing highest nodes, each including fewer keyframes than a predetermined value (MIN), and descendant nodes of the highest nodes, and a cluster splitting unit removing a highest node having the largest depth among the remaining highest nodes.
It is another aspect of the present invention to provide an apparatus for summarizing still images, the apparatus including a representative still image selector receiving still images and selecting a predetermined number of representative frames and a still image summary generator summarizing the still images using the selected representative frames and outputting a frame summary and frame information.
The representative still image selector may include a still image splitting unit receiving the still images and splitting the still images into a number of clusters which is the same as a predetermined number of representative frames, and a cluster representative still image extractor selecting one representative frame among still images included in each cluster and outputting the representative frames.
The still image splitting unit may include a still image basic node composing unit receiving the still images and composing a node having 0 depth for each still image, a still image feature value calculator calculating feature values of the still images of the nodes and differences between the feature values, and a still image highest node composing unit selecting two highest nodes having the minimum difference between the calculated feature values and connecting the two selected nodes to a new node having a depth obtained by adding 1 to the largest value of depths of the highest nodes.
The still image highest node composing unit may further include a still image minor cluster removing unit removing highest nodes, each including fewer still images than a predetermined value (MIN), and descendant nodes of the highest nodes, and a still image cluster splitting unit removing a highest node having the largest depth among the remaining highest nodes.
It is another aspect of the present invention to provide a medium comprising computer readable code implementing embodiments of the present invention.
These and/or other aspects and advantages of the invention will become apparent and more readily appreciated from the following description of the embodiments taken in conjunction with the accompanying drawings of which:
Reference will now be made in detail to the embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the like elements throughout. The embodiments are described below to explain the present invention by referring to the figures.
The representative frame selector 10 receives a decoded video stream from the video stream decoder 40 and selects a number of representative frames equal to the predetermined number of representative frames provided by the frame summary generator 20. The frame summary generator 20 provides the predetermined number of representative frames designated by a user to the representative frame selector 10, receives the representative frames selected by the representative frame selector 10, and outputs a frame summary having a format desired by the user to the display unit 60.
The user interface unit 30 provides data generated by a user operation to the frame summary generator 20. The video stream decoder 40 decodes an encoded video stream stored in the video storage unit 50 and provides the decoded video stream to the representative frame selector 10. The video storage unit 50 stores encoded video streams. The display unit 60 receives frames summarized in response to a user's command from the frame summary generator 20 and displays the frame summary so that the user can view it.
The keyframe extractor 100 receives a video stream from the video stream decoder 40, extracts a keyframe for each shot, and outputs the keyframes corresponding to the shots to the frame splitting unit 110. The frame splitting unit 110 receives the keyframes corresponding to the shots from the keyframe extractor 100 and splits the keyframes corresponding to the shots into a number of clusters which is the same as a predetermined number of representative frames provided by the frame summary generator 20. The cluster representative frame extractor 120 receives the split keyframes corresponding to the shots from the frame splitting unit 110, selects one representative frame among keyframes corresponding to the shots included in each cluster, and outputs the representative frames to the frame summary generator 20.
The basic node composing unit 130 receives the keyframes corresponding to the shots from the keyframe extractor 100 and composes a basic node having zero depth (i.e., depth information) for each keyframe. The feature value calculator 140 calculates feature values of the keyframes of the basic nodes included in highest nodes and differences between the feature values. The highest node composing unit 150 selects two highest nodes having the minimum difference, i.e., the highest similarity, between the calculated feature values and connects the two selected nodes to a new highest node having a depth increased by 1.
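The description does not specify what the feature value calculator 140 computes; a normalized color histogram compared with an L1 difference is a common choice and is used below purely as an assumed example.

```python
import numpy as np

def frame_feature(frame, bins=8):
    """Assumed feature value: a normalized per-channel color histogram.
    frame: H x W x 3 uint8 array (assumption about the decoded frame format)."""
    hists = [np.histogram(frame[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    feat = np.concatenate(hists).astype(float)
    return feat / feat.sum()

def feature_difference(f1, f2):
    """Assumed dissimilarity between two feature values (L1 distance);
    a smaller difference means a higher similarity."""
    return float(np.abs(f1 - f2).sum())
```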
The minor cluster removing unit 160 removes, from among the highest nodes received from the highest node composing unit 150, highest nodes each including a smaller number of keyframes than a predetermined value (MIN), together with descendant nodes of those highest nodes. The cluster splitting unit 170 removes a highest node having the largest depth among the remaining highest nodes.
The representative still image selector 200 receives still images from the still image storage unit 230 and selects representative frames according to the predetermined number of representative frames provided from the still image summary generator 210. The still image summary generator 210 provides the predetermined number of representative frames designated by a user to the representative still image selector 200, receives representative frames selected by the representative still image selector 200, and outputs a frame summary to the display unit 235.
The still image user interface unit 220 provides data generated by a user operation to the still image summary generator 210. The still image storage unit 230 stores still images. The display unit 235 receives the frame summary from the still image summary generator 210 and displays the frame summary so that the user can view the frame summary.
The still image splitting unit 240 receives the still images from the still image storage unit 230 and splits the still images into a number of clusters which is the same as a predetermined number of representative frames provided by the still image summary generator 210. The cluster representative still image extractor 250 receives the split still images from the still image splitting unit 240, selects one representative frame among still images included in each cluster, and outputs the representative frames to the still image summary generator 210.
The still image basic node composing unit 255 receives still images from the still image storage unit 230 and composes a basic node having zero depth (depth information) for each still image. The still image feature value calculator 260 calculates feature values of the still images included in highest nodes and differences between the feature values. The still image highest node composing unit 265 selects two highest nodes having the minimum difference, i.e., the highest similarity, from among the calculated feature values and connects the two selected nodes to a new highest node having a depth increased by 1.
The still image minor cluster removing unit 270 removes, from among the highest nodes received from the still image highest node composing unit 265, highest nodes each including fewer still images than a predetermined value (MIN), together with descendant nodes of those highest nodes. The still image cluster splitting unit 275 removes a highest node having the largest depth among the remaining highest nodes.
Operations of an apparatus for summarizing a plurality of frames according to an embodiment of the present invention will now be described with reference to the accompanying drawings.
Since the process of extracting the representative frames from the still images according to the predetermined number of representative frames is a process in which the keyframes corresponding to shots are substituted by the still images in the above-described process of extracting the representative frames from a video stream, a repeated detailed description thereof is omitted.
Exemplary embodiments may be embodied in general-purpose computing devices by running computer readable code from a medium, e.g., a computer-readable medium, including but not limited to storage/transmission media such as magnetic storage media (ROMs, RAMs, floppy disks, magnetic tapes, etc.), optically readable media (CD-ROMs, DVDs, etc.), and carrier waves (transmission over the Internet). Exemplary embodiments may also be embodied as a computer-readable medium having a computer-readable program code unit embodied therein for causing a number of computer systems connected via a network to effect distributed processing. The network may be a wired network, a wireless network, or any combination thereof. The functional programs, codes, and code segments for embodying the present invention may be easily deduced by programmers in the art to which the present invention pertains.
As described above, according to a method, medium, and apparatus for summarizing a plurality of frames according to embodiments of the present invention, since video summarization adaptively responds to the number of clusters demanded by a user, various video summarization types are possible, and the user can understand the contents of video streams easily and quickly and perform activities such as selection, storing, editing, and management. Also, since representative frames are selected from clusters including frames corresponding to scenes with a high appearance frequency, frames whose contents are not distinguishable or whose appearance frequencies are low can be excluded from the video summarization, and the possibility that the selected frames correspond to different scenes is higher. Therefore, the user's confidence in a frame summary can be higher. In addition, since the method and apparatus are designed independently of video formats, decoder characteristics, characteristics of a shot discriminating method, and characteristics of a shot similarity function, they can be applied to various application environments.
Although a few embodiments of the present invention have been shown and described, it would be appreciated by those skilled in the art that changes may be made in these embodiments without departing from the principles and spirit of the invention as defined by the claims and their equivalents.