The present invention relates to a method for navigating in a sequence of images, e.g. in a movie, and for interactive rendering of the same, specifically for videos rendered on portable devices that allow easy user interaction, and to an apparatus for conducting the method.
For video analysis, different technologies exist. A technology called “object segmentation” is known in the art for producing spatial image segmentations, i.e. object boundaries, based on color and texture information. Using object segmentation technology, a user defines an object quickly, just by selecting one or more points within the object. Known algorithms for object segmentation are “graph cut” and “watershed”. Another technology is called “object tracking”. After an object has been defined by its spatial boundary, the object is tracked automatically in the subsequent sequence of images. For object tracking, the object is typically described by its color distribution. A known algorithm for object tracking is “mean shift”. For increased precision and robustness, some algorithms rely on the object appearance structure. A known descriptor for object tracking is the scale-invariant feature transform (SIFT). A further technology is called “object detection”. Generic object detection technology makes use of machine learning for computing a statistical model of the appearance of the object to be detected. This requires many examples of the objects (ground truth). Automatic object detection is done on new images by using the models. Models typically rely on SIFT descriptors. The most common machine learning techniques used nowadays include boosting and support vector machines (SVM). In addition, face detection is a specific object detection application. In this case, the features used are typically filter parameters, more specifically “Haar wavelet” parameters. A well-known implementation relies on cascaded boosted classifiers, e.g. Viola & Jones.
Users watching video content such as news or documentaries might want to interact with the video by skipping some segment or going directly to some point. This possibility is even more desirable when using a tactile device such as a tablet used for video rendering that makes it easy to interact with the display.
To make this non-linear navigation possible, several means are available on some systems. A first example is skipping a fixed amount of playback time, e.g. moving forward in the video by 10 or 30 seconds. A second example is to make a jump to the next cut or to the next group of pictures (GOP). These two cases provide a limited semantic level of the underlying analysis. The skipping mechanism is oriented according to the video data, not according to the content of the movie. It is not clear to the user what image is displayed at the end of the jump. Further, the length of the interval skipped is short.
A third example is a jump to the next scene. A scene is a part of the action in a single location in a TV show or movie, composed of a series of shots. Skipping a whole scene generally means jumping to a part of the movie where a different action begins, at a different location in the movie. The skipped video portion might be too long. A user might want to move in finer steps.
On some systems where in-depth video analysis is available, some objects or persons can even be indexed. The users can then click on these objects or faces when they are visible in the video, and the system can then move to the point where these persons appear again or display additional information on the particular object. This method relies on the number of objects that the system can effectively index. For the time being, there are relatively few detectors compared to the huge variety of objects one can encounter in e.g. an average news video.
It is an object of the invention to propose a method for navigation and an apparatus for conducting the method, which overcome the limitations outlined above and offer a more user-friendly and intuitive navigation.
According to the invention, a method for navigating in a sequence of images is proposed. The method comprises the steps of:
The method has the advantage that a user watching a sequence of images, which is a movie or news program, either broadcast or recorded, navigates through the sequence of images according to the content of the images and does not depend on some fixed structure of the broadcast stream, which is defined mainly for technical reasons. Navigation is made intuitive and more user-friendly. Preferably, the method is performed in real-time so that the user has the feeling of actually moving the object. By a specific interaction, the user asks for the point in time where the designated object disappears from the screen.
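The request for the point in time where the designated object disappears from the screen can be illustrated with a minimal sketch. It assumes that object tracking has already produced one entry per image, with `None` marking images in which the object is not visible. The name `first_disappearance` and the box layout are hypothetical and chosen only for this illustration.

```python
from typing import Optional, Sequence, Tuple

# Assumed layout of a tracked bounding box: (x, y, width, height).
Box = Tuple[float, float, float, float]


def first_disappearance(track: Sequence[Optional[Box]], current: int) -> Optional[int]:
    """Return the index of the first image after `current` without the object.

    `track[i]` is the tracked box in image i, or None if the object is not
    visible there. Returns None if the object never disappears.
    """
    for index in range(current + 1, len(track)):
        if track[index] is None:
            return index
    return None


if __name__ == "__main__":
    track = [(0, 0, 1, 1), (1, 0, 1, 1), None, (2, 0, 1, 1)]
    print(first_disappearance(track, current=0))
```

Playback would then resume at the returned image, or stay at the current one if no such image exists.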
The first input for selecting the first object is clicking on the object or drawing a bounding box around the object. Thus, the user applies commonly known input methods for a man-machine interface. If an indexing exists, the user is also able to choose the objects by this index from a database.
According to the invention, the step of moving the first object to a second position according to a second input includes:
The step of identifying further includes identifying at least one image in the sequence of images where the relative position of the destination of the first object is close to the position of the second object.
This has the advantage that a user can not only choose a location on the screen which is related to the physical coordinates of the screen, but can also choose a position where he expects the object with respect to other objects in the image. For example, in a recorded soccer game, the first object might be the ball, and the user can move the ball in the direction of the goal, as he expects a scene of interest when the ball is close to the goal, because this might be shortly before the team scores or a player kicks the ball over the goal. This kind of navigation by object is completely independent of the coordinates of the screen, but depends on the relative distance of two objects in the image. The position of the destination of the first object being close to the position of the second object also includes the cases that the second object is exactly at the same position as the destination or that the second object overlaps the destination of the moved first object. Advantageously, the size of the objects and its variation over time are considered to define the relative position of the two objects to each other. A further alternative is that the user selects an object, e.g. a face, and then zooms the bounding box of the face in order to define the size of the face. Afterwards, the sequence of images is searched for an image in which the face is displayed at this size or a size close to it. This feature is advantageous if e.g. an interview is played back and the user is interested in the speech of a specific person, assuming that the face of this person covers the biggest part of the screen when this person speaks. Thus, an advantage of the invention is that there is an easy method for jumping to a part of the recording where a specific person is interviewed. The first and the second object do not necessarily have to be selected in the same image of the sequence of images.
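The two searches described above, by relative position of two objects and by displayed size, can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: both objects are assumed to be tracked as axis-aligned boxes `(x, y, width, height)`, the relative position is assumed to be the offset between centres scaled by the size of the second object, and the size match is assumed to compare box areas. All function names and the `tolerance` parameter are hypothetical.

```python
import math
from typing import Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x, y, width, height)
Point = Tuple[float, float]


def center(box: Box) -> Point:
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)


def relative_offset(point: Point, reference: Box) -> Point:
    """Offset of `point` from the centre of `reference`, in units of its size."""
    cx, cy = center(reference)
    scale = max(reference[2], reference[3]) or 1.0  # avoid division by zero
    return ((point[0] - cx) / scale, (point[1] - cy) / scale)


def find_by_relative_position(
    first_track: Sequence[Optional[Box]],
    second_track: Sequence[Optional[Box]],
    destination: Point,
    second_box: Box,
    tolerance: float = 0.5,
) -> Optional[int]:
    """First image where the first object sits, relative to the second object,
    roughly where the user dropped it relative to `second_box`."""
    wanted = relative_offset(destination, second_box)
    for index, (first, second) in enumerate(zip(first_track, second_track)):
        if first is None or second is None:
            continue  # one of the objects is not visible in this image
        found = relative_offset(center(first), second)
        if math.hypot(found[0] - wanted[0], found[1] - wanted[1]) <= tolerance:
            return index
    return None


def find_by_size(track: Sequence[Optional[Box]], width: float, height: float) -> Optional[int]:
    """Image in which the tracked box area is closest to the requested size."""
    best_index, best_error = None, float("inf")
    for index, box in enumerate(track):
        if box is None:
            continue
        error = abs(box[2] * box[3] - width * height)
        if error < best_error:
            best_index, best_error = index, error
    return best_index


if __name__ == "__main__":
    goal = (90.0, 40.0, 10.0, 20.0)
    ball_track = [(10.0, 48.0, 2.0, 2.0), (50.0, 48.0, 2.0, 2.0), (88.0, 49.0, 2.0, 2.0)]
    goal_track = [goal, goal, goal]
    print(find_by_relative_position(ball_track, goal_track, (92.0, 50.0), goal))
```

Scaling the offset by the size of the second object is one simple way to take object size into account, so that the same gesture matches both wide shots and close-ups. Other normalisations are equally conceivable.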
The further input for selecting the second object is clicking on the object or drawing a bounding box around the object. Thus, the user applies commonly known input methods for a man-machine interface. If an indexing exists, the user is also able to choose the objects by this index from a database.
For selecting the objects, object segmentation, object detection or face detection is employed. When the first object is detected, object tracking techniques are used to track the position of this object in the subsequent images of the sequence of images. A key-point technique is also employed for selecting an object. Further, key-point description is used for determining the similarity of objects in different images in the sequence of images. A combination of the above mentioned techniques for selecting, identifying and tracking an object is used. Hierarchical segmentation produces a tree whose nodes and leaves correspond to nested areas of the images. This segmentation is done in advance. If a user selects an object by tapping on a given point of an image, the smallest node containing this point is selected. If a further tap of the user is received, the node selected with the first tap is considered as father of the node selected with the second tap. Thus, the corresponding area is considered to define the object.
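The selection in a hierarchical segmentation tree can be sketched as below. The sketch makes several assumptions that go beyond the text: nested areas are simplified to rectangles, the tree is built by hand rather than by a segmentation algorithm, and each further tap is read as widening the selection to the father of the currently selected node. The names `Region`, `smallest_region_at` and `select_by_taps` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(eq=False)
class Region:
    """One node of the segmentation tree; rectangles stand in for real segments."""
    name: str
    box: Tuple[int, int, int, int]  # assumed layout: (x, y, width, height)
    children: List["Region"] = field(default_factory=list)
    parent: Optional["Region"] = field(default=None, repr=False)

    def add(self, child: "Region") -> "Region":
        child.parent = self
        self.children.append(child)
        return child

    def contains(self, point: Tuple[int, int]) -> bool:
        x, y, w, h = self.box
        return x <= point[0] < x + w and y <= point[1] < y + h


def smallest_region_at(root: Region, point: Tuple[int, int]) -> Optional[Region]:
    """Descend from the root to the smallest nested area containing `point`."""
    if not root.contains(point):
        return None
    node = root
    while True:
        for child in node.children:
            if child.contains(point):
                node = child
                break
        else:
            return node  # no child contains the point, so this node is the smallest


def select_by_taps(root: Region, point: Tuple[int, int], taps: int) -> Optional[Region]:
    """Assumed reading: every tap after the first moves up to the father node."""
    node = smallest_region_at(root, point)
    for _ in range(taps - 1):
        if node is None or node.parent is None:
            break
        node = node.parent
    return node


if __name__ == "__main__":
    image = Region("image", (0, 0, 100, 100))
    person = image.add(Region("person", (20, 10, 30, 80)))
    person.add(Region("face", (30, 15, 10, 10)))
    print(select_by_taps(image, (33, 18), taps=1).name)
    print(select_by_taps(image, (33, 18), taps=2).name)
```

Because the tree is computed in advance, each tap only requires a descent or a single step towards the root, which keeps the interaction responsive.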
According to the invention, only a part of the images of the sequence of images is analyzed for identifying at least one image where the object is close to the second position. This part to be analyzed is a certain number of images following the current image, the certain number of images representing a certain playback time following the currently displayed image. Another way to implement the method is to analyze all following images from the currently displayed image or all previous images from the currently displayed image. This is a familiar way for a user to navigate in a sequence of images, as it represents a fast forward or fast backward navigation. According to another implementation of the invention, only I pictures, only I and P pictures, or all pictures are analyzed for the object based navigation.
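The restriction of the analysis to a part of the images can be sketched as follows. The sketch assumes that every image carries its picture type and the tracked centre of the selected object, and that closeness is measured as Euclidean distance in pixels. The `Frame` class, the function names and the parameters `max_images`, `backward`, `picture_types` and `max_distance` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Iterator, Optional, Sequence, Tuple


@dataclass
class Frame:
    picture_type: str                        # "I", "P" or "B"
    position: Optional[Tuple[float, float]]  # tracked object centre, None if not visible


def candidates(
    frames: Sequence[Frame],
    current: int,
    max_images: Optional[int] = None,
    backward: bool = False,
    picture_types: Tuple[str, ...] = ("I", "P", "B"),
) -> Iterator[int]:
    """Yield the indices of the images to analyze, starting next to `current`.

    `max_images` limits the search window; for a window given as playback time
    it would be chosen as round(seconds * frame_rate).
    """
    step = -1 if backward else 1
    index, examined = current + step, 0
    while 0 <= index < len(frames):
        if max_images is not None and examined >= max_images:
            break
        examined += 1
        if frames[index].picture_type in picture_types:
            yield index
        index += step


def find_image_near(
    frames: Sequence[Frame],
    current: int,
    second_position: Tuple[float, float],
    max_distance: float,
    **options,
) -> Optional[int]:
    """First analyzed image in which the object is close to the second position."""
    for index in candidates(frames, current, **options):
        position = frames[index].position
        if position is None:
            continue
        dx = position[0] - second_position[0]
        dy = position[1] - second_position[1]
        if math.hypot(dx, dy) <= max_distance:
            return index
    return None


if __name__ == "__main__":
    frames = [
        Frame("I", (0, 0)), Frame("B", (9, 9)), Frame("P", (5, 5)),
        Frame("B", (10, 10)), Frame("I", (10, 9)), Frame("P", (10, 10)),
    ]
    print(find_image_near(frames, 0, (10, 10), 1.5))
    print(find_image_near(frames, 0, (10, 10), 1.5, picture_types=("I",)))
```

Analyzing only I pictures reduces the number of images to inspect at the cost of a coarser jump target, whereas analyzing all pictures gives the finest granularity.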
The invention further concerns an apparatus for navigation in a sequence of images according to the above described method.
For better understanding, the invention shall now be explained in more detail in the following description with reference to the figures. It is understood that the invention is not limited to this exemplary embodiment and that specified features can also expediently be combined and/or modified without departing from the scope of the present invention.
Number | Date | Country | Kind |
---|---|---|---|
11305767.3 | Jun 2011 | EP | regional |

Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/EP2012/060723 | 6/6/2012 | WO | 00 | 3/17/2014 |