The present principles relate generally to presenting video content and, more particularly, to summarizing and browsing video content.
Many works try to represent a whole video sequence as a single static or animated image. Given the desired dimensions of the output image, the idea is to incorporate as many “interesting image regions” as possible within these fixed dimensions. In such solutions, “interesting image regions” are usually key-frames. Key-frames should be mutually distinct so that near duplicates are not selected. Such key-frames are usually selected manually, by uniform subsampling in time, or automatically, for example, by using a shot and sub-shot detector. In the latter case, for each sub-shot, the frame with the best quality and the maximum saliency is selected; alternatively, an objective function is minimized so that the selected key-frames (whose number is assumed to be fixed) are optimal in that, among all the input frames, they enable the reconstruction of the complete sequence with minimum cost.
Once the representative frames or regions have been selected for the summary, the next problem is to arrange them such that the output representation remains compact and coherent and allows efficient browsing. Few previous solutions address the browsing problem.
Embodiments of the present principles are directed at least in part to addressing the deficiencies of the prior art by providing a method, an apparatus and an arrangement for summarizing video content and for intra-video and inter-video browsing of video content. Various embodiments of the present principles provide a new compact representation of an input video set that enables efficient temporal browsing inside a single input video and also browsing from one video to another based on an inter-video relationship.
In one embodiment of the present principles, an arrangement for the presentation of a plurality of video sequences for viewing includes at least one horizontal strip having time-sequenced video frames belonging to a single video sequence and at least one vertical strip having a plurality of video frames belonging to different video sequences, each of the plurality of video frames of the at least one vertical strip having at least one feature in common. In such an embodiment the at least one horizontal strip and the at least one vertical strip are configured to intersect at a video frame of the at least one horizontal strip having the at least one feature in common with the video frames in the at least one vertical strip.
In an alternate embodiment of the present principles, a method for arranging video sequences for summarizing and browsing includes arranging video frames of a single video sequence in at least one strip having a first direction, the at least one strip arranged in the first direction having time-sequenced video frames, arranging video frames of different video sequences in at least one strip having a second direction, the frames of the at least one strip arranged in the second direction having at least one feature in common, and configuring the video frames of the at least one strip arranged in the second direction to intersect the at least one strip arranged in the first direction at a video frame of the at least one strip arranged in the first direction having the at least one feature in common with the video frames of the at least one strip arranged in the second direction.
In an alternate embodiment of the present principles, an apparatus for arranging video sequences for summarizing and browsing includes a memory for storing at least control programs, instructions, software, video content, video sequences and data, and a processor for executing the control programs and instructions. In such an embodiment, when executing the control programs, the processor configures the apparatus to arrange video frames of a single video sequence in at least one strip having a first direction, the at least one strip arranged in the first direction having time-sequenced video frames, arrange video frames of different video sequences in at least one strip having a second direction, the frames of the at least one strip arranged in the second direction having at least one feature in common, and configure the video frames of the at least one strip arranged in the second direction to intersect the at least one strip arranged in the first direction at a video frame of the at least one strip arranged in the first direction having the at least one feature in common with the video frames of the at least one strip arranged in the second direction.
In an alternate embodiment of the present principles, a machine-readable medium has one or more executable instructions stored thereon which, when executed by a digital processing system, cause the digital processing system to perform a method for arranging video sequences for summarizing and browsing. The method includes arranging video frames of a single video sequence in at least one strip having a first direction, the at least one strip arranged in the first direction having time-sequenced video frames, arranging video frames of different video sequences in at least one strip having a second direction, the frames of the at least one strip arranged in the second direction having at least one feature in common, and configuring the video frames of the at least one strip arranged in the second direction to intersect the at least one strip arranged in the first direction at a video frame of the at least one strip arranged in the first direction having the at least one feature in common with the video frames of the at least one strip arranged in the second direction.
The teachings of the present invention can be readily understood by considering the following detailed description in conjunction with the accompanying drawings, in which:
To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the figures. The drawings are not to scale, and one or more features may be expanded or reduced for clarity.
Embodiments of the present principles advantageously provide a method, an apparatus and an arrangement for summarizing and browsing video content. Although the present principles will be described primarily within the context of horizontal and vertical strips, the specific embodiments of the present principles should not be treated as limiting the scope of the invention. It will be appreciated by those skilled in the art and informed by the teachings of the present principles that the concepts of the present principles can be advantageously applied to video frames comprising strips oriented in substantially any direction.
The embodiments of the present principles leverage both width and height of a dedicated space of a presentation space/display screen. The interactive representation enables efficient intra and inter-video browsing while avoiding visual overload on a presentation space/display screen contrary to multi-track representation of, for example, existing video editors.
More specifically, various embodiments of the present principles provide a compact representation of a collection of input video sequences based on crossed horizontal and vertical strips. In one embodiment, horizontal strips correspond to key-frames of filmstrips or video sequences and vertical strips are composed of key-frames from at least one different video having a common feature with a key-frame of the horizontal sequence(s). The vertical key-frames are connected together based on a common feature. For example, in one embodiment the key-frames can be connected because they all contain the same detected and recognized face, or they all correspond to the same scene captured roughly simultaneously but from different viewpoints.
Thus, in such embodiments, a single horizontal strip summarizes a single input video while enabling efficient intra-video browsing, as provided by a continuous timeline on the x-axis, and a single vertical strip enables efficient inter-video browsing by presenting, to a user, key-frames that come from different videos and depict certain similarities.
Embodiments of the present principles leverage both width and height of a dedicated space of a presentation space such as a display screen. The interactive representation of the present principles enables efficient intra and inter-video browsing while avoiding visual overload.
In one embodiment of the present principles, inter-video key-frame connection is based on facial detection/recognition. The idea is to connect Kj1i1 with Kj2i2 if in both videos a common face was detected and both faces match.
In an alternate embodiment of the present principles, inter-video key-frame connection is based on image similarity. The idea is to connect Kj1i1 with Kj2i2 (defined below) using content-based image retrieval algorithms, without any assumption about the objects that compose the scene. It should be noted that any kind of metadata could be used to establish the inter-video key-frame connections and thus to deal with any video collection (e.g. same actor, same action (smoking, swimming, couple kissing . . . ), same place, etc.).
Thus, in embodiments of the present principles, such as in the embodiment of
In one embodiment in which N input videos constitute an input video set ν={νi}i=1 . . . N and where key-frames are selected with the same uniform temporal subsampling for every input video, where strips are simple key-frames strips and where connections between key-frames from different input videos are based on temporal synchronization, for each input video νi={Iti}, given a temporal subsampling step s>0, Mi key-frames {Kji}j=0 . . . M
In one embodiment, for constructing the N horizontal strips, it is assumed that all input videos have the same frame width and frame height. If not, conversion to a common format is performed. For each i, given the key-frame set {Kji}, the horizontal strip image Hi is defined as a simple horizontal image stack, in one embodiment, in accordance with equation one (1) which follows:
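$$H^{i}(x_1,y_1)=K_{j}^{i}(x_2,y_2)\quad\text{with } y_1=y_2 \text{ and } x_1=x_2+j\cdot\text{frame\_width},$$
$$\forall x_2\in[0,\text{frame\_width}[,\; y_2\in[0,\text{frame\_height}[,\; j\in[0,M_i-1] \qquad (1)$$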
where (x1,y1) and (x2,y2) correspond to pixel locations in the domain of the horizontal strip image Hi (whose width may change for every i since Mi may change) and the image domain [0,frame_width[×[0,frame_height[, respectively.
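By way of non-limiting illustration, the horizontal image stack of equation (1) can be sketched as follows. The sketch assumes that each key-frame is a nested list indexed as frame[y][x] with one value per pixel; the names Frame and horizontal_strip are hypothetical and are not part of the present principles.

```python
from typing import List

# Assumption: a frame is a list of rows, frame[y][x], one value per pixel.
Frame = List[List[int]]


def horizontal_strip(key_frames: List[Frame]) -> Frame:
    """Stack key-frames side by side so that H(x2 + j*frame_width, y) = K_j(x2, y)."""
    if not key_frames:
        raise ValueError("at least one key-frame is required")
    height = len(key_frames[0])
    width = len(key_frames[0][0])
    for frame in key_frames:
        if len(frame) != height or any(len(row) != width for row in frame):
            raise ValueError("all key-frames must share the same dimensions")
    # Row y of the strip is the concatenation of row y of every key-frame.
    return [[px for frame in key_frames for px in frame[y]] for y in range(height)]


# Two 2x2 key-frames filled with the values 1 and 2.
k0 = [[1, 1], [1, 1]]
k1 = [[2, 2], [2, 2]]
strip = horizontal_strip([k0, k1])
```

The width of the resulting strip is the number of key-frames multiplied by the frame width, which is why the width of Hi may differ from one input video to another.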
$$t_{0}^{i_1}+\delta_{i_1,i_2}=t_{0}^{i_2}. \qquad (2)$$
To provide proper temporal synchronization between the two input videos, the time offset has to be determined. In various embodiments of the present principles, such synchronization information can be determined using metadata associated with the video files for the input videos if capture devices for the different videos were previously synchronized. Alternatively, such information can also be determined using audio or image feature matching between the two input videos, as discussed by Bagri et al., “A Scalable Framework for Joint Clustering and Synchronizing Multi-Camera Videos”, European Signal Processing Conference (EUSIPCO), 2013; and by Elhayek et al., “Feature-Based Multi-video Synchronization with Subframe Accuracy”, DAGM 2012 (Deutsche Arbeitsgemeinschaft für Mustererkennung DAGM e.V.—German Association for Pattern Recognition).
In one embodiment of the present principles, a time threshold u is defined such that 0<u<s/2 and two key-frames Kj1i1 and Kj2i2 from different input videos νi1 and νi2 are considered. The two key-frames, Kj1i1 and Kj2i2, are considered as “connected” if |j1*s+δi1,i2−j2*s|<u. That is, the j1-th key-frame of νi1 and the j2-th key-frame of νi2 are captured at times separated by less than u. Note that with u<s/2, one key-frame of a video νi1 cannot be connected with more than one key-frame of a video νi2, because consecutive key-frames of νi2 are separated by s and at most one of them can therefore lie within u of a given instant.
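By way of non-limiting illustration, the connection test described above can be sketched as follows. The sketch assumes that the subsampling step s, the time offset and the threshold u are all expressed in the same unit; the function name are_connected and the numerical values are hypothetical.

```python
def are_connected(j1: int, j2: int, s: float, delta: float, u: float) -> bool:
    """Return True when |j1*s + delta - j2*s| < u, with 0 < u < s/2."""
    if not 0 < u < s / 2:
        raise ValueError("threshold u must satisfy 0 < u < s/2")
    return abs(j1 * s + delta - j2 * s) < u


# Hypothetical values: step s = 2.0, offset delta = 4.1, threshold u = 0.5.
first_pair = are_connected(0, 2, s=2.0, delta=4.1, u=0.5)   # distance of about 0.1
second_pair = are_connected(0, 3, s=2.0, delta=4.1, u=0.5)  # distance of about 1.9
```

With these values, the key-frame of index 0 of the first video is connected to the key-frame of index 2 of the second video only, which illustrates the uniqueness noted above.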
In accordance with an embodiment of the present principles, for the construction of vertical strips for each key-frame, Kji, the following set of key-frames ji=Kji∪{Cpi,j}p=1 . . . P
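$$V_{j}^{i}(x_1,y_1)=C_{p}^{i,j}(x_2,y_2)\quad\text{with } x_1=x_2 \text{ and } y_1=y_2+p\cdot\text{frame\_height},$$
$$\forall x_2\in[0,\text{frame\_width}[,\; y_2\in[0,\text{frame\_height}[,\; p\in[0,P_{j}^{i}] \qquad (3)$$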
where (x1,y1) and (x2,y2) correspond to pixel locations in the domain of the vertical strip image Vji (whose height may change for every (i, j) since Pji may change) and the image domain [0,frame_width[×[0,frame_height[, respectively.
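By way of non-limiting illustration, the vertical image stack of equation (3) can be sketched as follows. The sketch assumes that each key-frame is a nested list indexed as frame[y][x] and that the list is ordered by increasing p; whether increasing y is drawn upward or downward is left to the display and is not decided here. The names Frame and vertical_strip are hypothetical.

```python
from typing import List

# Assumption: a frame is a list of rows, frame[y][x], one value per pixel.
Frame = List[List[int]]


def vertical_strip(connected_frames: List[Frame]) -> Frame:
    """Stack frames along y so that V(x, y2 + p*frame_height) = C_p(x, y2)."""
    if not connected_frames:
        raise ValueError("at least one key-frame is required")
    height = len(connected_frames[0])
    if any(len(frame) != height for frame in connected_frames):
        raise ValueError("all key-frames must share the same height")
    strip: Frame = []
    for frame in connected_frames:  # p = 0, 1, ..., P
        strip.extend(list(row) for row in frame)
    return strip


# Three 1x2 key-frames; index p = 0 holds the selected key-frame.
c0, c1, c2 = [[5, 5]], [[6, 6]], [[7, 7]]
column = vertical_strip([c0, c1, c2])
```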
Equation (3) describes the construction of a vertical strip from bottom to top with a selected key-frame on the lowest row. Of course, similarly constructing a vertical strip from top to bottom with the selected key-frame on the highest row is straightforward, as is constructing a vertical strip with the selected key-frame on an intermediate row. This means that there are several options for the vertical arrangement of the key-frames, ji={Cpi,j}p=0 . . . P
In one embodiment of the present principles, the vertical arrangement is deduced automatically and in real time from what is actually displayed on the screen, so as to maintain the key-frames from a same video on a same row. That is, in accordance with an embodiment of the present principles, if an already displayed vertical strip contains a key-frame of video νi on a certain row and the user requests the display of another vertical strip also containing a key-frame of video νi, the arrangement is done so that both key-frames of video νi appear on the same row. If it is not possible to do so without introducing holes in the required vertical strip, an optimization is performed so that the vertical locations are maintained for the largest possible number of key-frames. For example, in one embodiment of the present principles, all configurations can be tested by considering a vertical segment whose length is the number of key-frames in the second vertical strip, sliding this segment along the y-axis, and retaining the position that maximizes the number of rows of this segment containing a key-frame of the first vertical strip that corresponds to a video for which another key-frame has to be displayed in the second vertical strip.
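By way of non-limiting illustration, the sliding-segment test described above can be sketched as follows. The sketch assumes that the first vertical strip is described by a mapping from row index to video identifier, that each video appears at most once per strip, and that ties are resolved in favour of the first position found; the names best_offset and layout, and the rule used to fill the remaining rows, are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple


def best_offset(first_rows: Dict[int, str], second_videos: Set[str]) -> Tuple[int, int]:
    """Slide a segment of len(second_videos) rows and return (offset, score)."""
    n = len(second_videos)
    if n == 0 or not first_rows:
        raise ValueError("both strips must contain at least one key-frame")
    best = (0, -1)
    # Every position at which the segment overlaps the first strip is tested.
    for offset in range(min(first_rows) - n + 1, max(first_rows) + 1):
        score = sum(
            1
            for row in range(offset, offset + n)
            if first_rows.get(row) in second_videos
        )
        if score > best[1]:
            best = (offset, score)
    return best


def layout(first_rows: Dict[int, str], second_videos: Set[str]) -> Dict[int, str]:
    """Assign a row to every video of the second strip without leaving holes."""
    offset, _ = best_offset(first_rows, second_videos)
    n = len(second_videos)
    rows: List[Optional[str]] = [None] * n
    for k in range(n):
        video = first_rows.get(offset + k)
        if video in second_videos:
            rows[k] = video  # same video, same row as in the first strip
    # Assumption: the remaining videos fill the free rows in sorted order.
    rest = iter(sorted(second_videos - {v for v in rows if v is not None}))
    return {offset + k: (v if v is not None else next(rest)) for k, v in enumerate(rows)}


first = {0: "A", 1: "B", 2: "C", 3: "D"}
second = {"C", "D", "E"}
placement = layout(first, second)
```

In this example, the retained position keeps videos C and D on the rows they already occupy in the first strip, and video E takes the remaining row of the segment.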
In accordance with embodiments of the present principles, when displaying simultaneously a vertical strip image and a horizontal strip image having a common key-frame, they are crossed at their key-frame in common. For a given pair (i, j) and for p such that Cpi,j=Kji,
$$H^{i}(x+j\cdot\text{frame\_width},\,y)=V_{j}^{i}(x,\,y+p\cdot\text{frame\_height}),$$
$$\forall x\in[0,\text{frame\_width}[,\; y\in[0,\text{frame\_height}[.$$
That is, at an intersection, pixels of horizontal and vertical strip images have color values in common since they intersect at a common key-frame.
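By way of non-limiting illustration, the crossing condition can be checked as follows. The sketch assumes that both strip images are nested lists indexed as strip[y][x]; the function name strips_cross_consistently and the sample values are hypothetical.

```python
from typing import List

Image = List[List[int]]  # assumption: image[y][x], one value per pixel


def strips_cross_consistently(
    h_strip: Image, v_strip: Image, j: int, p: int, frame_width: int, frame_height: int
) -> bool:
    """Check H(x + j*frame_width, y) == V(x, y + p*frame_height) over one frame."""
    return all(
        h_strip[y][x + j * frame_width] == v_strip[y + p * frame_height][x]
        for y in range(frame_height)
        for x in range(frame_width)
    )


# Frames of width 2 and height 1; the key-frame of value 2 is shared.
h = [[1, 1, 2, 2, 3, 3]]          # horizontal strip: key-frames 1, 2, 3
v = [[2, 2], [7, 7], [8, 8]]      # vertical strip: shared key-frame at p = 0
ok = strips_cross_consistently(h, v, j=1, p=0, frame_width=2, frame_height=1)
bad = strips_cross_consistently(h, v, j=0, p=0, frame_width=2, frame_height=1)
```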
In one embodiment of the present principles, to avoid visual overload during display and collisions or inconsistencies between strips, the simultaneous display of two or more horizontal strips together with two or more vertical strips is prevented. For example, assume that at initialization only a first horizontal strip is shown, that a user then requests the display of a first vertical strip, and then requests the display of a second horizontal strip. If the user then requests the display of a second vertical strip during the display of the second horizontal strip, the first horizontal strip will be removed/hidden before the display of the second vertical strip. For example,
In various embodiments of the present principles, if a single strip exists in either the horizontal or vertical direction, the display of multiple strips in the other direction is possible. For example,
In an embodiment in which key frames in a vertical strip are selected, there can exist frames in horizontal strips that have no feature in common. For example,
In at least one embodiment of the present principles key-frame selection is based on saliency, activity and/or aesthetic estimation. Frames with a local maximum of saliency or aesthetic score or local minimum of activity are considered. Alternatively, key frame selection can be performed manually by a user.
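By way of non-limiting illustration, the selection of frames at local maxima of a score can be sketched as follows. The sketch assumes that one score per frame is already available, that a local maximum is strictly greater than both of its neighbours, and that the first and last frames are not considered; the function name local_maxima and the sample scores are hypothetical. A local minimum of activity can be obtained by applying the same function to the negated activity values.

```python
from typing import List


def local_maxima(scores: List[float]) -> List[int]:
    """Return the indices whose score is strictly greater than both neighbours."""
    return [
        t
        for t in range(1, len(scores) - 1)
        if scores[t - 1] < scores[t] > scores[t + 1]
    ]


saliency = [0.1, 0.4, 0.3, 0.2, 0.6, 0.5]
activity = [0.9, 0.2, 0.8, 0.7, 0.1, 0.6]
salient_frames = local_maxima(saliency)
calm_frames = local_maxima([-a for a in activity])
```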
However, when key-frame selection is not based on uniform temporal subsampling, inter-video key-frame connection, if still based on synchronization, must be adapted. That is, in such embodiments, one key-frame is connected to the closest key-frame of each other video if their temporal distance does not exceed a given threshold.
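By way of non-limiting illustration, the adapted connection rule can be sketched as follows. The sketch assumes that key-frame times are known for each video, that the time offset between the two videos is applied as in equation (2), and that a distance equal to the threshold is accepted; the function name closest_key_frame and the sample values are hypothetical.

```python
from typing import List, Optional


def closest_key_frame(
    t: float, other_times: List[float], delta: float, threshold: float
) -> Optional[int]:
    """Index of the key-frame of the other video closest to t, or None if too far."""
    if not other_times:
        return None
    shifted = t + delta  # time of the key-frame expressed on the other video's clock
    best = min(range(len(other_times)), key=lambda k: abs(shifted - other_times[k]))
    return best if abs(shifted - other_times[best]) <= threshold else None


times_other_video = [3.0, 10.5, 18.0]
near = closest_key_frame(10.0, times_other_video, delta=1.0, threshold=1.0)
far = closest_key_frame(10.0, times_other_video, delta=1.0, threshold=0.25)
```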
In at least one embodiment of the present principles, the key frames described above can comprise videos as well as still pictures. As such, vertical strip images can contain still pictures that have been connected to the considered video key-frame(s) of a horizontal strip and vice versa.
In addition, in accordance with various embodiments of the present principles, to enable a user to quickly recall frames of interest previously watched, video indices and key-frame indices or other such references to previously used key-frames are stored in a memory/queue. In such embodiments, corresponding thumbnails can be displayed in a dedicated display space. As such, when a thumbnail is selected, the associated horizontal and vertical strips are displayed.
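By way of non-limiting illustration, such a memory/queue of previously used key-frames can be sketched as follows. The sketch assumes a bounded queue of (video index, key-frame index) pairs in which a revisited key-frame is moved to the most recent position; the class name History, its capacity and its method names are hypothetical.

```python
from collections import deque
from typing import Deque, List, Tuple

Reference = Tuple[int, int]  # (video index, key-frame index)


class History:
    """Bounded queue of references to previously used key-frames."""

    def __init__(self, capacity: int = 10) -> None:
        self._items: Deque[Reference] = deque(maxlen=capacity)

    def visit(self, video_index: int, key_frame_index: int) -> None:
        entry = (video_index, key_frame_index)
        if entry in self._items:
            self._items.remove(entry)  # avoid duplicate thumbnails
        self._items.append(entry)      # the oldest entry is dropped when full

    def thumbnails(self) -> List[Reference]:
        """References to display, most recent first."""
        return list(reversed(self._items))


history = History(capacity=2)
for video, key_frame in [(0, 3), (1, 5), (0, 3), (2, 1)]:
    history.visit(video, key_frame)
recent = history.thumbnails()
```

Selecting one of the returned references would then trigger the display of the horizontal strip of the stored video index and the vertical strip of the stored key-frame index.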
In various embodiments of the present principles, a computer readable medium (e.g., memory, storage device, removable media, and so on) is provided with stored program instructions, which, when executed by a processor, will cause a method to be implemented, such as described above according to one or more embodiments of the present principles.
The top right section (604) of the user interface 600 of
In the single lower section (606) of the user interface 600 of
In the lower section (606) of the user interface 600 of the embodiment of
The user interface 600 of the present principles further enables a user to double touch or click on any region of a horizontal or vertical strip shown in the lower section (606) to play the corresponding video from the corresponding instant in the top right section (604), or in a separate window, or a second screen.
At step 704, video frames of different video sequences are arranged in at least one strip having a second direction (e.g., a vertical direction), the frames of the at least one strip arranged in the second direction having at least one feature in common. The method 700 can then proceed to step 706.
At step 706, the video frames of the at least one strip arranged in the second direction are configured to intersect the at least one strip arranged in the first direction at a video frame of the at least one strip arranged in the first direction having the at least one feature in common with the video frames of the at least one strip arranged in the second direction. The method 700 can then be exited.
Optionally, in one embodiment of the present principles, only one strip is arranged in the first direction if there is more than one strip arranged in the second direction, and only one strip is arranged in the second direction if there is more than one strip arranged in the first direction. An arrangement of the present principles can be limited as such to prevent visual confusion when displaying an arrangement of the present principles.
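By way of non-limiting illustration, this limitation can be sketched as a small display state in which at most one direction holds more than one strip. The sketch assumes that, when a request would violate the limitation, the oldest strips of the other direction are hidden first; the class name StripDisplay and the strip identifiers are hypothetical.

```python
from typing import List


class StripDisplay:
    """Tracks displayed strips so that at most one direction holds several strips."""

    def __init__(self) -> None:
        self.horizontal: List[str] = []
        self.vertical: List[str] = []

    def show(self, direction: str, strip_id: str) -> None:
        if direction not in ("horizontal", "vertical"):
            raise ValueError("direction must be 'horizontal' or 'vertical'")
        same = self.horizontal if direction == "horizontal" else self.vertical
        other = self.vertical if direction == "horizontal" else self.horizontal
        if strip_id in same:
            return  # already displayed
        same.append(strip_id)
        # Assumption: the oldest strips of the other direction are hidden first.
        while len(same) > 1 and len(other) > 1:
            other.pop(0)


display = StripDisplay()
display.show("horizontal", "H1")
display.show("vertical", "V1")
display.show("horizontal", "H2")
display.show("vertical", "V2")  # hides H1 so that only one horizontal strip remains
```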
Although the apparatus 800 of
While the foregoing is directed to various embodiments of the present principles, other embodiments of the invention may be devised without departing from the basic scope thereof. For example, one or more features described in the examples above can be modified, omitted and/or used in different combinations. Thus, the appropriate scope of the invention is to be determined according to the claims that follow.
Number | Date | Country | Kind |
---|---|---|---|
15307079.2 | Dec 2015 | EP | regional |

Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/EP2016/080220 | 12/8/2016 | WO | 00 |