This invention relates to the field of video summarization and video indexing.
Prior art references considered to be relevant as a background to the invention are listed below and their contents are incorporated herein by reference. Additional references are mentioned in the above-mentioned U.S. provisional applications Nos. 60/898,698; 60/911,839 and 60/971,582 and their contents are incorporated herein by reference. Acknowledgement of the references herein is not to be inferred as meaning that these are in any way relevant to the patentability of the invention disclosed herein. Each reference is identified by a number enclosed in square brackets and accordingly the prior art will be referred to throughout the specification by numbers enclosed in square brackets.
Sorting through a collection of raw video is time consuming since it is necessary to view a video clip in order to determine if anything of interest has been recorded. While this tedious task may be feasible in personal video collections, it is impossible when endless video, as recorded by surveillance cameras and webcams, is involved. Millions of webcams are covering the world capturing their field of view 24 hours a day. It is reported that in the UK alone there are millions of surveillance cameras covering the city streets. Many webcams even transmit their video publicly over the internet for everyone to watch. Many security cameras in stores, airports and other public areas are also available online.
One of the problems in utilizing webcams is that they provide raw, unedited, data. Most surveillance video is therefore never watched or examined. In our earlier WO2007/057893 [25] we proposed a method for video synopsis for creating shortened videos by combining selected portions from multiple original images of a scene. A video clip describes visual activities along time, and compressing the time axis allows viewing a summary of such a clip in a shorter time. Fast-forward, where several frames are skipped between selected frames, is the most common tool used for video summarization. A special case of fast-forward is called “time lapse”, generating a video of very slow processes like growth of flowers, etc. Since fast-forward may lose fast activities during the dropped frames, methods for adaptive fast forward have been developed [12, 18, 4]. Such methods attempt to skip frames in periods of low interest or lower activity, and keep frames in periods of higher interest or higher activity. A similar approach extracts from the video a collection of short video sequences best representing its contents [21].
Many approaches to video summary eliminate completely the time axis, and show a synopsis of the video by selecting a few key frames [8, 24]. These key frames can be selected arbitrarily, or selected according to some importance criteria. But key frame representation loses the dynamic aspect of video. Comprehensive surveys on video abstraction appear in [11, 13].
In both approaches above, entire frames are used as the fundamental building blocks. A different methodology uses mosaic images together with some meta-data for video indexing [6, 19, 16]. In this case the static synopsis image includes objects from different times.
Object-based approaches to video synopsis were first presented in [20, 7], where moving objects are represented in the space-time domain. The concatenation of portions of images representing an object or activity across successive frames of a video is called a “tube”. As objects are represented by tubes in the space-time volume, the terms “objects” and “tubes” are used interchangeably in the following description. These papers [20, 7] introduced a new concept: creating a synopsis video that combines activities from different times (see
An example of an object-based approach is disclosed in WO2007/057893 [25] assigned to the present applicant wherein a subset of frames in an input video is obtained that show movement of one or more objects. Selected portions from the subset that show non-spatially overlapping appearances of the objects in the first dynamic scene are copied from multiple input frames to a reduced number of frames in the output video sequence such that multiple locations of the objects as seen at different times in the input video are shown simultaneously in the output video.
The approaches disclosed in references [20, 7] are based on the observation that more activities can be shown in shorter video if the chronological order is not enforced. It would be useful to extend such an approach to the synopsis of endless video sequences such as obtained using surveillance cameras so as to limit the duration of the output video to a desired limit while nevertheless doing so in a controlled manner that reduces the risk of feature loss.
Efficient indexing, retrieval and browsing of long video is growing in importance, especially given the rapid increase in the number of surveillance cameras that endlessly collect video. Conventional video indexing uses manual annotation of the video with keywords, but this method is time-consuming and impractical for surveillance cameras. Additional video indexing methods have been proposed, based on selection of representative key frames or representative time intervals from the input video.
Video synopsis can be used for indexing, retrieval and browsing as many objects in a covered time period are shown in a short synopsis video. However, since many different objects are shown simultaneously, examining the simple synopsis video may be confusing.
US20060117356 (Microsoft) discloses a video browser that provides interactive browsing of unique events occurring within an overall video recording. In particular, the video browser processes the video to generate a set of video sprites representing unique events occurring within the overall period of the video. These unique events include, for example, motion events, security events, or other predefined event types, occurring within all or part of the total period covered by the video. Once the video has been processed to identify the sprites, the sprites are then arranged over a background image extracted from the video to create an interactive static video montage. The interactive video montage illustrates all events occurring within the video in a single static frame. User selection of sprites within the montage causes either playback of a portion of the video in which the selected sprites were identified, or concurrent playback of the selected sprites within a dynamic video montage.
WO0178050 (Inmotion Technologies Ltd.) discloses a system and method for using standard video footage even from a single video camera to obtain, in an automated fashion, a stroboscope sequence of a sports event, for example. The sequence may be represented as a static image of a photographic nature, or by a video sequence in which camera motion remains present, in which case the video sequence can be rendered as a panning camera movement on a stroboscope picture or as an animated stroboscope sequence in which the moving object leaves a trailing trace of copies along its path. Multiple cameras can be used for an expanded field of view or for comparison of multiple sequences, for example.
JP-2004-336172 discloses a system for shortening a surveillance video, which maintains chronological order of events, without separating concurrently moving objects. Maintaining chronological order substantially limits the shortening possibilities. Also there is no suggestion to index objects so that the original time of an object in the synopsis video can be easily determined.
According to a first aspect of the invention there is provided a computer-implemented method for generating a synopsis video from a substantially endless source video stream as generated by a video surveillance camera, the method comprising:
receiving in real time object-based descriptions of at least three different source objects in said source video stream, each source object being a connected subset of image points from at least three different frames of the source video stream;
continuously maintaining a queue of said received object-based descriptions including for each respective source object its duration and location;
selecting a subset of at least three source objects from said queue based on given criteria;
sampling from each selected source object one or more synopsis objects by temporal sampling;
determining for each synopsis object a respective display time for starting its display in the synopsis video; and
generating the synopsis video by displaying selected synopsis objects or objects derived therefrom each at its respective predetermined display time;
such that at least three points, each derived from different respective times in the source video stream, are displayed simultaneously in the synopsis video and at least two points, both derived from the same time, are displayed at different times in the synopsis video.
According to a second aspect of the invention there is provided a system for generating a synopsis video from a substantially endless source video stream as generated by a video surveillance camera, the system comprising:
a source object selector adapted to be coupled to an object memory that stores a continuously maintained queue of object-based descriptions of at least three different source objects in said source video stream, said object-based descriptions including for each respective source object its duration and location, the source object selector being adapted to select at least three different source objects according to one or more defined constraints, each source object being a connected subset of image points from at least three different frames of the source video stream;
a synopsis object sampler coupled to the source object selector for sampling from each selected source object one or more synopsis objects by temporal sampling using image points derived from specified time periods;
a time selection unit coupled to the synopsis object sampler for determining for each synopsis object a respective display time for starting its display in the synopsis video;
a stitching unit coupled to the time selection unit for stitching each of the selected synopsis objects or objects derived therefrom at a respective display time so as to generate successive synopsis video frames, such that in the synopsis video frames at least three points that each derive from different respective times in the source video stream are displayed simultaneously; and
a synopsis frame memory coupled to the stitching unit for storing said synopsis video frames.
The video synopsis disclosed by the present invention is a temporally compact representation of the video that enables video browsing and retrieval and allows indexing of different features so as to allow selected features to be isolated and for their temporal progression in a specified time interval to be displayed. In accordance with some embodiments of the invention, a hierarchical video indexing based on video synopsis is employed wherein indexing is based on first selecting the class of desired objects or activities, and only later selecting an individual object or activity. This procedure may be repeated so as to allow multi-level hierarchical indexing.
An example of the general type of video synopsis with which the invention is concerned is described in [25] with reference to the space-time volume shown in
Applying this principle to infinite video as obtained by webcams and surveillance cameras involves many additional challenges:
Video synopsis can make surveillance cameras and webcams more useful by giving the viewer the ability to view summaries of the endless video, in addition to the live video stream. To enable this, a synopsis server can view the live video feed, analyze the video for interesting events, and record an object-based description of the video. This description lists for each webcam the interesting objects, their duration, location, and their appearance.
A query that could be answered by the system may be similar to “I would like to watch in one minute a synopsis of the video from this webcam captured during the last hour”, or “I would like to watch in five minutes a synopsis of last week”, etc. Responding to such a query, the most interesting events (“tubes”) are collected from the desired period, and are assembled into a synopsis video of the desired length. The synopsis video is an index into the original video as each object includes a pointer to its original time.
While webcam video is endless, and the number of objects is unbounded, the available data storage for each webcam may be limited. To keep a finite object queue we propose a procedure for removing objects from this queue when space is exhausted. Removing objects from the queue should be done according to similar importance criteria as done when selecting objects for inclusion in the synopsis, allowing the final optimization to examine fewer objects.
Within the context of the invention and the appended claims, the term “video” is synonymous with “movie” in its most general sense providing only that it is accessible as a computer image file amenable to post-processing and includes any kind of movie file e.g. digital, analog. The camera is preferably at a fixed location, by which is meant that it can rotate and zoom but is not subjected to translation motion as is done in hitherto-proposed techniques. The scenes with which the present invention is concerned are dynamic at least some of the time.
In order to describe the invention use will be made of a construct that we refer to as the “space-time volume” to create the synopsis videos. The space-time volume may be constructed from the input sequence of images by sequentially stacking all the frames along the time axis. However, it is to be understood that so far as actual implementation is concerned, it is not necessary actually to construct the space-time volume, for example by actually stacking in time 2D frames of a dynamic source scene. More typically, source frames are processed individually to construct target frames, but it will aid understanding to refer to the space-time volume as though it is a physical construct rather than a conceptual construct.
In order to understand the invention and to see how it may be carried out in practice, embodiments will now be described, by way of non-limiting example only, with reference to the accompanying drawings, in which:
FIGS. 2a to 2d show background images from a surveillance camera at Stuttgart airport at different times;
FIGS. 3a to 3d show four extracted tubes shown “flattened” over the corresponding backgrounds from
FIGS. 4a and 4b show two extracted tubes from a “Billiard” scene;
FIGS. 5a and 5b show spatial distribution of activity in the airport scene shown in
FIGS. 10a and 10b show synopsis frames from a video captured over 24 hours at Stuttgart airport;
FIGS. 12a and 12b show synopsis frames generated from a video captured overnight in St. Petersburg;
FIGS. 13a and 13b show synopsis frames generated from a webcam taken over five hours of a quiet parking lot;
Examples of video synopsis as shown as representative frames in the figures are, of course, best viewed in video. Examples can be accessed at http://www.vision.huji.ac.il/video-synopsis/.
From each object, segments are created by selecting subsets of frames in which the object appears. Such segments can represent different time intervals, optionally taken at different sampling rates.
In order to apply such a technique to generate a useful synopsis of endless video, interesting objects and activities (tubes) should be identified. In many cases the indication of interest is simple: a moving object is interesting. While we use object motion as an indication of interest in many examples, exceptions must be noted. Some motions may have little importance, like leaves on a tree or clouds in the sky. People or other large animals in the scene may be important even when they are not moving. While we do not address these exceptions, it is possible to incorporate object recognition (e.g. people detection [14, 17]), dynamic textures [5], or detection of unusual activities [31]. We will give a simple example of video synopsis giving preferences to different classes of objects.
To enable segmentation of moving foreground objects we start with background construction. In short video clips the appearance of the background does not change, and it can be built by using a temporal median over the entire clip. In the case of surveillance cameras, the appearance of the background changes in time due to changes in lighting, changes of background objects, etc. In this case the background for each time can be computed using a temporal median over a few minutes before and after each frame. We normally use a median over four minutes. Other methods for background construction are possible, even when using a shorter temporal window [3, 9], but we used the median due to its efficiency.
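The temporal-median background described above can be illustrated by the following minimal sketch. It assumes grayscale frames stored as nested lists and a window given as a number of frames on each side; the function name `temporal_median_background` and the toy clip are hypothetical and are not part of the described system.

```python
from statistics import median

def temporal_median_background(frames, center, half_window):
    """Per-pixel temporal median over the frames surrounding `center`.

    frames: list of 2D grayscale frames (lists of rows), all the same size.
    half_window: number of frames taken before and after `center`.
    """
    lo = max(0, center - half_window)
    hi = min(len(frames), center + half_window + 1)
    window = frames[lo:hi]
    height, width = len(window[0]), len(window[0][0])
    return [
        [median(f[y][x] for f in window) for x in range(width)]
        for y in range(height)
    ]

# Toy clip: a static background of value 10; a bright object covers
# pixel (row 0, column 1) in a single frame only.
frames = [[[10, 10], [10, 10]] for _ in range(5)]
frames[2][0][1] = 200
background = temporal_median_background(frames, center=2, half_window=2)
```

Because the object occupies the pixel in only one of the five frames, the median discards it and the recovered background is uniform.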
FIGS. 2a to 2d show background images from a surveillance camera at Stuttgart airport.
We used a simplification of [22] to compute the space-time tubes representing dynamic objects. This is done by combining background subtraction together with min-cut to get a smooth segmentation of foreground objects. As in [22], image gradients that coincide with background gradients are attenuated, as they are less likely to be related to motion boundaries. The resulting “tubes” are connected components in the 3D space-time volume, and their generation is briefly described below.
Let B be the current background image and let I be the current image to be processed. Let V be the set of all pixels in I, and let N be the set of all adjacent pixel pairs in I. A labeling function ƒ labels each pixel r in the image as foreground (ƒr=1) or background (ƒr=0). A desirable labeling ƒ usually minimizes the Gibbs energy [2]:
$$E(f) = \sum_{r \in V} E_1(f_r) + \lambda \sum_{(r,s) \in N} E_2(f_r, f_s), \qquad (1)$$
where E1(ƒr) is the unary-color term, E2(ƒr, ƒs) is the pairwise-contrast term between adjacent pixels r and s, and λ is a user defined weight.
As a pairwise-contrast term, we used the formula suggested by [22]:
$$E_2(f_r, f_s) = \delta(f_r - f_s) \cdot \exp(-\beta d_{rs}), \qquad (2)$$
where β=2⟨∥I(r)−I(s)∥²⟩⁻¹ is a weighting factor (⟨·⟩ is the expectation over the image samples), and drs are the image gradients, attenuated by the background gradients, and given by:
In this equation, zrs measures the dissimilarity between the foreground and the background:
$$z_{rs} = \max\{\|I(r) - B(r)\|, \|I(s) - B(s)\|\}, \qquad (4)$$
and K and σz are parameters, set to 5 and 10 respectively as suggested by [22].
As for the unary-color term, let dr=∥I(r)−B(r)∥ be the color difference between the image I and the current background B. The foreground (1) and background (0) costs for a pixel r are set to:
where k1 and k2 are user defined thresholds. Empirically k1=30/255 and k2=60/255 worked well in our examples.
We do not use a lower threshold with infinite weights, since the later stages of our algorithm can robustly handle pixels that are wrongly identified as foreground. For the same reason, we construct a mask of all foreground pixels in the space-time volume, and apply a 3D morphological dilation on this mask. As a result, each object is surrounded by several pixels from the background. This fact will be used later by the stitching algorithm.
Finally, the 3D mask is grouped into connected components, denoted as “activity tubes”.
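Grouping the dilated 3D mask into activity tubes can be sketched as a breadth-first labeling of connected components in the space-time volume. The sketch below assumes a binary mask indexed as `mask[t][y][x]` and 26-connectivity; the connectivity choice and the name `label_tubes` are assumptions made for illustration only.

```python
from collections import deque

def label_tubes(mask):
    """Label 26-connected foreground components of mask[t][y][x].

    Returns (labels, count), where labels has the same shape as mask and
    holds 0 for background and 1..count for the tubes.
    """
    T, H, W = len(mask), len(mask[0]), len(mask[0][0])
    labels = [[[0] * W for _ in range(H)] for _ in range(T)]
    offsets = [(dt, dy, dx)
               for dt in (-1, 0, 1) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
               if (dt, dy, dx) != (0, 0, 0)]
    count = 0
    for t in range(T):
        for y in range(H):
            for x in range(W):
                if not mask[t][y][x] or labels[t][y][x]:
                    continue
                count += 1                      # start a new tube
                labels[t][y][x] = count
                queue = deque([(t, y, x)])
                while queue:
                    ct, cy, cx = queue.popleft()
                    for dt, dy, dx in offsets:
                        nt, ny, nx = ct + dt, cy + dy, cx + dx
                        if (0 <= nt < T and 0 <= ny < H and 0 <= nx < W
                                and mask[nt][ny][nx]
                                and not labels[nt][ny][nx]):
                            labels[nt][ny][nx] = count
                            queue.append((nt, ny, nx))
    return labels, count

# Three frames of a 1x5 scene: one object drifts right from x=0 to x=2,
# while a second object appears at x=4 in the first frame only.
mask = [
    [[1, 0, 0, 0, 1]],
    [[0, 1, 0, 0, 0]],
    [[0, 0, 1, 0, 0]],
]
labels, tube_count = label_tubes(mask)
```

The drifting object forms a single tube spanning all three frames, and the isolated appearance forms a second tube.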
Each tube b is represented by its characteristic function
where B(x, y, t) is a pixel in the background image, I(x, y, t) is the respective pixel in the input image, and tb is the time interval in which this object exists.
Other methods for segmentation of moving objects are possible. For example, in binary segmentation, every element in the image can be classified as belonging to an object or not belonging to an object. Segmentation can also be fuzzy, assigning to each element in an image a grade of membership in an object. Suitable approaches are described in [32, 33, 34, 35]. The notion of fuzzy connectivity is explained in [33]. Fuzzy segmentation is sometimes called fuzzy matting [35], and is extensively used in graphics for insertion of objects into different backgrounds [35]. In our work we treat all objects as binary. However, an extension to fuzzy objects is straightforward. For example, all elements of the cost functions can be multiplied by the fuzzy membership values of the involved elements. Also, when a fuzzy object is inserted into an image, the membership value can be used for “alpha matting”, allowing a transparency effect.
We now define the energy of interaction between tubes. This energy will later be used by the optimization stage, creating a synopsis having maximum activity while avoiding conflicts and overlap between objects. Let B be the set of all activity tubes. Each tube b is defined over a finite time segment in the original video stream tb=[tbs, tbe].
The synopsis video is generated based on a temporal mapping M, shifting each object b in time from its original time in the input video into the time segment t̂b=[t̂bs, t̂be] in the video synopsis. M(b)=b̂ indicates the time shift of tube b into the synopsis, and when b is not mapped to the output synopsis M(b)=Ø. We define an optimal synopsis video as the one that minimizes the following energy function:
$$E(M) = \sum_{b \in B} E_a(\hat{b}) + \sum_{b, b' \in B} \left( \alpha E_t(\hat{b}, \hat{b}') + \beta E_c(\hat{b}, \hat{b}') \right), \qquad (7)$$
where Ea is the activity cost, Et is the temporal consistency cost, and Ec is the collision cost, all defined below. Weights α and β are set by the user according to their relative importance for a particular query. Reducing the weights of the collision cost, for example, will result in a denser video where objects may overlap. Increasing this weight will result in sparser video where objects do not overlap and less activity is presented. An example for the different synopsis obtained by varying β is given in
After extracting the activity tubes the pixel based cost can be replaced with object based cost. Specifically, the Stitching cost associated with prior art approaches such as discussed in [25] is replaced by the Collision cost in Eq. (7) (described below). This cost penalizes for stitching two different objects together, even if their appearance is similar (e.g. two people). In addition, a “Temporal Consistency” cost is defined, penalizing for the violation of the temporal relations between objects (or tubes). Such features of the synopsis are harder to express in terms of pixel-based costs.
The activity cost favors synopsis movies with maximum activity. It penalizes for objects that are not mapped to a valid time in the synopsis. When a tube is excluded from the synopsis, i.e. M(b)=Ø, then
where χb(x, y, t) is the characteristic function as defined in Eq. (6). For each tube b whose mapping b̂=M(b) is partially included in the final synopsis, we define the activity cost similarly to Eq. (8), but only pixels that did not enter the synopsis are added to the activity cost.
For every two “shifted” tubes and every relative time shift between them, we define the collision cost as the volume of their space-time overlap weighted by their activity measures:
where t̂b∩t̂b′ is the time intersection of b and b′ in the synopsis video. This expression gives a low penalty to pixels whose color is similar to the background but which were added to an activity tube in the morphological dilation process. Changing the weight of the collision cost Ec changes the density of objects in the synopsis video as shown in
The temporal consistency cost adds a bias towards preserving the chronological order of events. The preservation of chronological order is more important for tubes that have a strong interaction. For example, it would be preferred to keep the relative time of two people talking to each other, or to keep the chronological order of two events with a causal relation. Yet, it is very difficult to detect such interactions. Instead, the amount of interaction d(b,b′) between each pair of tubes is estimated from their relative spatio-temporal distance as described below:
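As an illustration of the collision cost, the following sketch computes the space-time overlap of two shifted tubes. It assumes each tube is stored as a dictionary mapping a source-video point (x, y, t) to its characteristic-function value, and that overlapping points are weighted by the product of the two values; both the representation and the product weighting are assumptions of this sketch, as is the name `collision_cost`.

```python
def collision_cost(tube_a, shift_a, tube_b, shift_b):
    """Sum of chi_a * chi_b over points the two tubes share in the synopsis.

    Each tube is a dict {(x, y, t): chi} with t in source-video frames;
    shift_a and shift_b move the tubes into synopsis time.
    """
    shifted_b = {(x, y, t + shift_b): chi for (x, y, t), chi in tube_b.items()}
    return sum(chi * shifted_b.get((x, y, t + shift_a), 0.0)
               for (x, y, t), chi in tube_a.items())

# Two objects that crossed the same pixels at very different source times.
tube_a = {(0, 0, 100): 2.0, (1, 0, 101): 3.0}
tube_b = {(0, 0, 500): 4.0, (1, 0, 501): 0.5}

# Shifting both to start at synopsis frame 0 makes them collide ...
overlapping = collision_cost(tube_a, -100, tube_b, -500)
# ... while starting the second tube two frames later avoids the overlap.
apart = collision_cost(tube_a, -100, tube_b, -498)
```

Pixels whose characteristic value is small, such as background pixels added by the dilation, contribute little to the sum, which matches the behavior described above.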
where d (b,b′,t) is the Euclidean distance between the pair of closest active pixels from b and b′ in frame t and σspace determines the extent of the space interaction between tubes.
If tubes b and b′ do not share a common time in the synopsis video, and assuming that b is mapped to an earlier time than b′, their interaction diminishes exponentially with time:
$$d(b, b') = \exp\left(-\frac{\hat{t}_{b'}^{s} - \hat{t}_{b}^{e}}{\sigma_{time}}\right), \qquad (11)$$
where σtime is a parameter defining the extent of time in which events are still considered as having temporal interaction.
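The exponential decay of Eq. (11) can be evaluated directly. In the minimal sketch below, the function name `temporal_interaction` and the numeric values are illustrative assumptions; times are expressed in synopsis frames.

```python
import math

def temporal_interaction(start_later, end_earlier, sigma_time):
    """Interaction of two tubes with no common synopsis time, as in Eq. (11).

    start_later: synopsis start frame of the later tube b'.
    end_earlier: synopsis end frame of the earlier tube b.
    """
    return math.exp(-(start_later - end_earlier) / sigma_time)

# With sigma_time = 30 frames, a tube starting right after another one
# interacts fully, and the interaction decays as the gap grows.
no_gap = temporal_interaction(100, 100, 30.0)
gap_30 = temporal_interaction(130, 100, 30.0)
gap_300 = temporal_interaction(400, 100, 30.0)
```

A larger σtime keeps distant events coupled for longer, so more pairs of tubes are subject to the temporal consistency penalty.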
The temporal consistency cost creates a preference for maintaining the temporal relations between objects by penalizing cases where these relations are violated:
where C is a constant penalty for events that do not preserve temporal consistency.
Since the global energy function in Eqs. (7) and (15) is written as a sum of energy terms defined on single tubes or pairs of tubes, it can be minimized by various MRF-based techniques such as Belief Propagation [23] or Graph Cuts [10]. In our implementation we used the simpler simulated annealing method [9] which gave good results. The simulated annealing was applied in the space of all possible temporal mappings M , including the special case when a tube is not used at all in the synopsis video.
Each state describes the subset of tubes that are included in the synopsis, and neighboring states are defined as states in which a single activity tube is removed or changes its mapping into the synopsis. As an initial state we used the state in which all tubes are shifted to the beginning of the synopsis movie. Also, in order to accelerate computation, it is possible to restrict the temporal shifts of tubes to be in jumps of 10 frames.
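The annealing search over temporal mappings can be sketched as follows. This is a simplified illustration rather than the implementation used in our examples: the energy is a hypothetical stand-in in which every excluded tube pays its duration and every pair of placed tubes pays its temporal overlap (as if all tubes occupied the same image region), and the cooling schedule, iteration count and names `anneal` and `toy_energy` are assumptions.

```python
import math
import random

def anneal(tubes, synopsis_len, energy, step=10, iters=2000,
           t0=1.0, cooling=0.995, seed=0):
    """Search temporal mappings; returns (best_mapping, best_energy).

    tubes: dict name -> duration in frames.
    A mapping sends each name to a synopsis start frame, or None if excluded.
    """
    rng = random.Random(seed)
    state = {name: 0 for name in tubes}   # initial state: all tubes at the start
    current = energy(state)
    best, best_energy = dict(state), current
    temperature = t0
    names = sorted(tubes)
    for _ in range(iters):
        # Neighboring state: one tube is removed or changes its mapping.
        name = rng.choice(names)
        last_start = max(1, synopsis_len - tubes[name] + 1)
        starts = list(range(0, last_start, step))  # shifts in jumps of `step`
        candidate = dict(state)
        candidate[name] = rng.choice(starts + [None])
        cand_energy = energy(candidate)
        delta = cand_energy - current
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            state, current = candidate, cand_energy
            if current < best_energy:
                best, best_energy = dict(state), current
        temperature *= cooling
    return best, best_energy

durations = {"a": 30, "b": 30, "c": 30}

def toy_energy(mapping):
    """Hypothetical energy: excluded duration plus pairwise temporal overlap."""
    cost = float(sum(durations[n] for n, s in mapping.items() if s is None))
    placed = [(s, s + durations[n]) for n, s in mapping.items() if s is not None]
    for i in range(len(placed)):
        for j in range(i + 1, len(placed)):
            overlap = min(placed[i][1], placed[j][1]) - max(placed[i][0], placed[j][0])
            cost += max(0, overlap)
    return cost

initial_energy = toy_energy({name: 0 for name in durations})
best, best_energy = anneal(durations, synopsis_len=60, energy=toy_energy)
```

In a fuller implementation the energy callable would combine the activity, temporal consistency and collision costs with the weights α and β chosen for the query.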
As mentioned earlier, millions of webcams and surveillance cameras are covering the world, capturing their field of view 24 hours a day. One of the problems in utilizing these cameras is that they provide unedited raw data. A two-hour feature film, for example, is usually created from hundreds or even thousands of hours of raw video footage. Without editing, most of the webcam data is irrelevant. Also, viewing a camera in another continent may be convenient only during hours of non-activity because of time-zone differences.
An important feature of the present invention is to make the webcam resource more useful by giving the viewer the ability to view summaries of the endless video, in addition to the live video stream provided by the camera. A user may wish to watch in five minutes a synopsis of all content captured during the previous week. To enable this, we describe a system that may be based on the object-based synopsis as described in WO2007/057893, but includes additional components that allow dealing with endless videos.
In this system, a server can view the live video feed, analyze the video for interesting events, and record an object-based description of the video. This description lists for each camera the interesting objects, their duration, location, and their appearance.
A two phase process is proposed for synopsis of endless video:
Many surveillance cameras and webcams image scenes that exhibit no activity over long periods. For storage efficiency, frames corresponding to such time periods are commonly filtered out during the online phase. The original time of the remaining frames is recorded together with each frame. In one implementation, frames were recorded according to two criteria: (1) A global change in the scene, measured by the sum of squared difference (SSD) between the incoming frame and the last kept frame. This criterion tracked the lighting changes expressed by a gradual illumination change in the entire frame. (2) Existence of a moving object, measured by the maximal SSD in small windows.
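The two recording criteria can be sketched as below. The sketch assumes grayscale frames as nested lists, non-overlapping windows, and that both criteria compare the incoming frame with the last kept frame; the thresholds, the window size and the function names are hypothetical.

```python
def ssd(a, b):
    """Sum of squared differences between two equally sized frames."""
    return sum((p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb))

def max_window_ssd(a, b, win):
    """Largest SSD found in any win x win window (non-overlapping tiling)."""
    height, width = len(a), len(a[0])
    best = 0
    for y0 in range(0, height, win):
        for x0 in range(0, width, win):
            s = sum((a[y][x] - b[y][x]) ** 2
                    for y in range(y0, min(y0 + win, height))
                    for x in range(x0, min(x0 + win, width)))
            best = max(best, s)
    return best

def filter_frames(frames, global_thresh, local_thresh, win=2):
    """Keep frames showing a global change or a local moving object.

    Returns a list of (original_index, frame) so the source time is retained.
    """
    kept = [(0, frames[0])]
    for idx in range(1, len(frames)):
        last = kept[-1][1]
        if (ssd(frames[idx], last) > global_thresh
                or max_window_ssd(frames[idx], last, win) > local_thresh):
            kept.append((idx, frames[idx]))
    return kept

def blank():
    return [[0] * 4 for _ in range(4)]

# Four frames: nothing happens, then a small object appears and stays put.
frames = [blank(), blank(), blank(), blank()]
frames[2][1][1] = 50
frames[3][1][1] = 50
kept = filter_frames(frames, global_thresh=10000, local_thresh=1000)
kept_indices = [idx for idx, _ in kept]
```

The small object is too weak to trigger the global criterion but exceeds the local one, so only the frame in which it first appears is added to the first frame.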
By assuming that moving objects with a very small duration (e.g. less than a second) are not important, video activity can be measured only once in a few frames.
One of the main challenges in handling endless videos is developing a scheme to “forget” older objects when new objects arrive. The naive scheme of discarding the oldest activity is not good, as a user may wish to get a summary of a long time duration which may include objects from the entire period. Instead, we propose an alternative scheme that aims to estimate the importance of each object to possible future queries, and discard objects accordingly.
All detected objects, represented as tubes in the space-time volume, are stored in a queue awaiting user queries. When an object is inserted into the queue, its activity cost (Eq. (8)) is computed to accelerate the future construction of synopsis videos. As the video generated by the webcam is endless, it is likely that at some point the allocated space will be exhausted, and objects will have to be removed from the queue.
When removing objects (tubes) from the queue, we prefer to remove objects that are least likely to be included in a final synopsis. In our examples we used three simple criteria that can be computed efficiently: “importance” (activity), “collision potential”, and “age”. But other options are possible, for example when specific appearance or activity is of interest.
A possible measure for the importance of an object is the sum of its characteristic function as defined in Eq. (8).
Since the collision cost cannot be computed before receiving the user query, an estimate for the collision cost of tubes is made using the spatial activity distribution in the scene. This spatial activity is represented by an image which is the sum of active pixels of all objects in each spatial location, normalized to sum to one. A similar spatial activity distribution is computed for each individual object (this time not normalized). The correlation between these two activity distributions is used as a “potential collision” cost for this object.
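The “potential collision” estimate can be sketched as follows, assuming each tube is a set of active points (x, y, t) and taking the correlation to be the sum of element-wise products of the two activity images. That reading of “correlation”, and the function names, are assumptions of this sketch.

```python
def spatial_activity(tube, height, width):
    """Per-location count of frames in which the tube is active (not normalized)."""
    image = [[0.0] * width for _ in range(height)]
    for (x, y, _t) in tube:
        image[y][x] += 1.0
    return image

def scene_activity(tubes, height, width):
    """Activity of all tubes per location, normalized to sum to one."""
    total = [[0.0] * width for _ in range(height)]
    for tube in tubes:
        image = spatial_activity(tube, height, width)
        for y in range(height):
            for x in range(width):
                total[y][x] += image[y][x]
    norm = sum(map(sum, total)) or 1.0
    return [[v / norm for v in row] for row in total]

def potential_collision(tube, scene, height, width):
    """Correlation (sum of products) of the tube's activity with the scene's."""
    own = spatial_activity(tube, height, width)
    return sum(own[y][x] * scene[y][x]
               for y in range(height) for x in range(width))

# A 1x3 scene: two objects pass through the busy left pixel, one through
# the quiet right pixel.
busy_1 = {(0, 0, 0), (0, 0, 1)}
busy_2 = {(0, 0, 5), (0, 0, 6)}
quiet = {(2, 0, 9)}
scene = scene_activity([busy_1, busy_2, quiet], height=1, width=3)
p_busy = potential_collision(busy_1, scene, 1, 3)
p_quiet = potential_collision(quiet, scene, 1, 3)
```

An object that spends its time in heavily used parts of the scene receives a higher potential collision cost than one in a quiet region.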
There are several possible approaches to address the removal of older objects from the queue, taking into consideration the desired distribution of objects in the synopsis. For example, the user can be interested to focus on newer events but leave some representation for old events in case they were significant. Alternatively, the synopsis should have a uniform representation of every time interval. For example, in a synopsis of 24 hours a user may be interested to see objects from each and every hour if applicable.
In the first approach we can assume that the density of objects in the queue should decrease exponentially with the age of the objects. For example, if we divide the age axis into discrete time intervals, the number of objects in the t-th interval, Nt, should be proportional to
where σ is the decay coefficient, and K is determined to control the total number of objects in the queue. When an object should be removed from the queue, the number of objects in each time interval t is compared to Nt. Only objects from time intervals t whose population exceeds Nt will be evaluated using the activity cost and the potential to collision. The object with minimal activity and maximal collision will be removed.
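The removal rule can be sketched as below. Two assumptions are made loudly: the target population is taken as Nt = K·exp(−t/σ), which is one form consistent with the exponential decay described above, and “minimal activity and maximal collision” is combined into a single score, activity minus a weighted collision potential. The dictionary keys, the `weight` parameter and the function name are hypothetical.

```python
import math

def choose_object_to_remove(queue, K, sigma, weight=1.0):
    """Pick the object to discard, or None if no interval is over-populated.

    queue: list of dicts with keys 'name', 'age_bin', 'activity', 'collision'.
    Target population (assumed form): N_t = K * exp(-t / sigma).
    """
    counts = {}
    for obj in queue:
        counts[obj["age_bin"]] = counts.get(obj["age_bin"], 0) + 1
    # Only intervals whose population exceeds the target are evaluated.
    over = {t for t, n in counts.items() if n > K * math.exp(-t / sigma)}
    candidates = [obj for obj in queue if obj["age_bin"] in over]
    if not candidates:
        return None
    # Assumed scoring: low activity and high collision potential go first.
    return min(candidates,
               key=lambda obj: obj["activity"] - weight * obj["collision"])

queue = [
    {"name": "new_a", "age_bin": 0, "activity": 0.5, "collision": 0.2},
    {"name": "new_b", "age_bin": 0, "activity": 4.0, "collision": 0.1},
    {"name": "old_high", "age_bin": 1, "activity": 5.0, "collision": 0.1},
    {"name": "old_low", "age_bin": 1, "activity": 1.0, "collision": 0.3},
]
removed = choose_object_to_remove(queue, K=2, sigma=1.0)
```

Here the newest interval holds exactly its target of two objects and is left alone, even though it contains the least active object; the older interval is over-populated, so its weaker object is discarded.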
An example of temporal distribution of objects arriving into the queue appears in
The object queue can be accessed via queries such as “I would like to have a one-minute synopsis of this camera broadcast during the past day”. Given the desired period from the input video, and the desired length of the synopsis, the synopsis video is generated using four operations. (i) Generating a background video. (ii) Once the background video is defined, a consistency cost is computed for each object and for each possible time in the synopsis. (iii) An energy minimization step determines which tubes (space-time objects) appear in the synopsis and at what time. (iv) The selected tubes are combined with the background time-lapse to get the final synopsis. These operations are described in this section. The reduction of the original video to an object based representation enables a fast response to queries.
After a user query, a second (smaller) object queue is generated, containing only objects from the desired time period. To enable fast optimization, the collision cost in Eq. (9) between every two objects in the smaller queue is computed in advance.
The background of the synopsis video is a time lapse background video, generated before adding activity tubes into the synopsis. The background video has two tasks: (i) It should represent the background changes over time (e.g. day-night transitions, etc.). (ii) It should represent the background of the activity tubes. These two goals are conflicting, as representing the background of activity tubes will be done best when the background video covers only active periods, ignoring, for example, most night hours.
We address this trade-off by constructing two temporal distributions. (i) A temporal activity distribution Ha of the video stream as shown in
Background frames are selected for the time-lapse background video according to the interpolated temporal distribution. This selection is done such that the area of the histogram between every two selected background frames is equal. More frames are selected from active time durations, while not totally neglecting inactive periods.
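The equal-area selection can be sketched as below. This is only an illustration under stated assumptions: the interpolated distribution is taken to be a weighted mix of the normalized activity histogram and a uniform distribution, and the function name `select_background_frames` and the mixing weight `lam` are hypothetical.

```python
def select_background_frames(activity_hist, num_frames, lam=0.5):
    """Pick num_frames indices so that each carries an equal share of histogram area.

    activity_hist: activity per time bin (one bin per candidate background frame).
    lam: assumed mixing weight; 1.0 follows activity only, 0.0 is uniform in time.
    """
    n = len(activity_hist)
    total = float(sum(activity_hist))
    ha = [a / total for a in activity_hist] if total > 0 else [1.0 / n] * n
    ht = [1.0 / n] * n  # uniform distribution (assumption)
    mixed = [lam * a + (1.0 - lam) * u for a, u in zip(ha, ht)]

    selected = []
    cumulative = 0.0
    k = 0
    for i, mass in enumerate(mixed):
        cumulative += mass
        # Frame k is taken where the cumulative area crosses the centre of
        # the k-th equal-area slice; a small tolerance absorbs rounding.
        while k < num_frames and cumulative >= (k + 0.5) / num_frames - 1e-12:
            selected.append(i)
            k += 1
    return selected
```

A bin that holds a large share of the area can be selected more than once, which simply holds the same background for several frames of the time lapse.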
Alternatively, the background may be replaced by a synthetic background, and objects will be placed on top of this synthetic background.
Since we do not assume accurate segmentation of moving objects, we prefer to stitch tubes to background images having a similar appearance. This tube-to-background consistency can be taken into account by adding a new energy term Eb(M). This term measures the cost of stitching an object to the time-lapse background. Formally, let Ib̂(x, y, t) be the color values of the mapped tube b̂ and let Bout(x, y, t) be the color values of the time-lapse background. We set:
$$E_b(\hat{b}) = \sum_{x,y \in \sigma(\hat{b}),\; t \in \hat{t}_b \cap t_{out}} \left\| I_{\hat{b}}(x,y,t) - B_{out}(x,y,t) \right\| \qquad (14)$$
where σ(b̂) is the set of pixels in the border of the mapped activity tube b̂ and tout is the duration of the output synopsis. This cost assumes that each tube is surrounded by pixels from its original background (resulting from our morphological dilation of the activity masks).
The background consistency term in Eq. (14) is added to the energy function described in Eq. (7), giving:
where α, β, γ are user-selected weights that are query dependent.
The stitching of tubes from different time periods poses a challenge to existing methods (such as [1, 16]). Stitching all the tubes at once may result in a blending of colors from different objects, which is an undesired effect. It is better to preserve the sharp transitions between different objects, while eliminating the seams only between the objects and the background. An accurate segmentation of the objects may solve this problem, but an accurate segmentation is unrealistic. Instead, the boundaries of each tube consist of background pixels due to the morphological dilation we apply when generating the activity tubes.
The α-Poisson Image Blending proposed by [27] may be a good solution for the stitching between objects, but it is not as good as Poisson Editing [15] for stitching the objects to the background. The suggested approach is to use the observation that all objects have a similar background (up to illumination changes), and to stitch each tube independently to the time-lapse background. Any blending method is possible, and we used a modification of Poisson editing: we added a regularization that preserves the original appearance of the objects even if they were stitched to background images with different lighting conditions (e.g., people seen during the day, stitched on top of an evening-time background).
It should also be noted that the objects pasted on to the background need not be graphically identical to the corresponding objects in the source video. For example, at least one object in the synopsis video may be formed by spatially warping the object or replacing a corresponding object in the source video by a predetermined symbol or icon. Also, when objects are added to the synopsis video, the pixel value of the objects may not necessarily replace the background value. The new value can be an average of the background and the object, creating a transparency effect.
Let Ω be an image domain with boundary ∂Ω. Let ƒ, b be the foreground object (tube) and background (time lapse) pixel colors, and let s be the unknown values of the stitched object over the interior of Ω. The result of the Poisson blending with regularization is given by:
$$\min_s \sum_{\Omega} \left[ (\Delta s - \Delta f)^2 + \lambda (s - f)^2 \right], \quad \text{such that } s_{\partial\Omega} = b_{\partial\Omega}$$
where λ is the weight of the regularization term. In [28] it was shown that stitching in the gradient domain can be done very efficiently.
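The effect of the regularization weight can be seen in a one-dimensional sketch. It assumes that the objective penalizes the difference between the first differences of s and ƒ plus λ times the squared difference between s and ƒ, with s fixed to the background on the boundary. The function `blend_1d` and the Gauss-Seidel solver are illustrative choices, not the efficient gradient-domain method of [28].

```python
def blend_1d(f, b_left, b_right, lam, iters=2000):
    """Regularized Poisson blending on a 1D scanline (illustrative sketch).

    f: foreground values, including one boundary sample at each end.
    b_left, b_right: background values imposed at the two boundary samples.
    lam: regularization weight; larger values keep s closer to f.
    """
    n = len(f)
    # Work with the offset d = s - f. Minimizing
    #   sum((d[i+1] - d[i])**2) + lam * sum(d[i]**2)
    # over the interior gives (2 + lam) * d[i] = d[i-1] + d[i+1].
    d = [0.0] * n
    d[0] = b_left - f[0]
    d[-1] = b_right - f[-1]
    for _ in range(iters):
        for i in range(1, n - 1):
            d[i] = (d[i - 1] + d[i + 1]) / (2.0 + lam)
    return [fi + di for fi, di in zip(f, d)]
```

With `lam = 0` the object takes on the background's brightness offset, as in plain Poisson editing. With a large `lam` the interior keeps its original appearance and only the pixels near the seam are adjusted.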
After stitching each tube to the background, overlapping tubes are blended together by letting each pixel be a weighted average of the corresponding pixels from the stitched activity tubes b̂, with weights proportional to the activity measures χb̂(x, y, t). Alternatively, transparency can be avoided by taking the pixel with maximal activity measure instead of the weighted average.
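Both blending options for a single output pixel can be sketched as follows. The function name `blend_pixel` and the fallback to a plain average when all activity weights are zero are assumptions made for this example.

```python
def blend_pixel(colors, activities, use_max=False):
    """Combine the colors of several stitched tubes that cover the same pixel.

    colors: list of (r, g, b) tuples, one per overlapping tube.
    activities: activity measure of each tube at this pixel (same order).
    use_max: if True, take the most active tube instead of averaging.
    """
    if use_max:
        best = max(range(len(colors)), key=lambda i: activities[i])
        return colors[best]

    total = sum(activities)
    if total == 0:
        # Assumed fallback: no tube is active here, so use a plain average.
        n = len(colors)
        return tuple(sum(c[ch] for c in colors) / n for ch in range(3))

    # Weighted average with weights proportional to the activity measures.
    return tuple(sum(a * c[ch] for a, c in zip(activities, colors)) / total
                 for ch in range(3))
```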
It may be possible to use depth ordering when “object tubes” are combined, where closer tubes will occlude further tubes. A simple “ground plane” heuristic can be used, which assumes that an object whose vertical image position is lower is also closer. Other depth ordering methods include [29]. The frequency of object occlusion cases depends on the weight of the collision cost (which prevents such cases) relative to the other costs.
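The ground-plane heuristic amounts to a sort, as in the sketch below. It assumes image coordinates with the origin at the top-left, so a larger bottom `y` means lower in the image. The `(tube_id, bottom_y)` representation is hypothetical.

```python
def paint_order(tubes_in_frame):
    """Return tube ids ordered from farthest to nearest for one synopsis frame.

    tubes_in_frame: list of (tube_id, bottom_y) pairs, where bottom_y is the
    lowest image row covered by the tube in this frame (origin at top-left).
    Painting in the returned order lets nearer tubes occlude farther ones.
    """
    return [tube_id for tube_id, _ in sorted(tubes_in_frame, key=lambda t: t[1])]
```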
Synopsis based hierarchical video indexing aims to provide a compact and easy method of representing and browsing video content, using visual queries, even for endless video as is the case in surveillance cameras.
To build the proposed index, the video is first analyzed and active/important objects are extracted from the video to generate an object based representation of the video.
In the second stage of the indexing the objects are clustered into clusters of similar objects using any clustering method. A possible way to perform such clustering is building an affinity (similarity) matrix based on some similarity measure between every pair of objects.
An affinity measure between objects can be based on various features including, but not limited to, a correlation between the space-time representations of the objects. To compute the similarity measure efficiently, objects that are represented as 3D tubes in a space-time representation of the video can first be warped to a common coordinate system, and a space-time alignment can be performed to overcome their possibly different locations and scales. Such warping can be useful as it results in a similarity measure that is invariant to the space-time location of the objects in the video and to various projective transformations. Objects that have similar motion paths but appear in different locations in the video will be considered similar even if their original appearances differ because of perspective effects. Additional affinity measures can be the shape, size or colors of objects, and many other possible similarity measures as known in the art.
Once the affinity matrix has been constructed, a clustering method such as [30] can be used to classify each object into its corresponding class. It is important to note that the clustering process can also be used to help identify “irregular” objects and behavior. An object that is not clustered well to any of the classes can be suspected of being “unique” or “irregular” and can be visualized with special marking in the process of the synopsis generation, which is described later.
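A minimal sketch of the affinity matrix and of flagging poorly clustered objects is given below. The Gaussian affinity over per-object feature vectors, the rule "flag an object whose best affinity to any member of its own class is below a threshold", and all names are assumptions for illustration. The class labels are taken as given from whatever clustering method is used.

```python
import math

def affinity_matrix(features, scale=1.0):
    """Pairwise affinity exp(-||fi - fj||^2 / (2 * scale^2)) between feature vectors."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    n = len(features)
    return [[math.exp(-dist2(features[i], features[j]) / (2.0 * scale ** 2))
             for j in range(n)] for i in range(n)]

def irregular_objects(aff, labels, threshold=0.5):
    """Indices of objects that do not fit well into their assigned class.

    aff: affinity matrix; labels: class label per object (from any clustering).
    An object is flagged when even its closest class peer has low affinity.
    """
    flagged = []
    for i, label in enumerate(labels):
        peers = [aff[i][j] for j, other in enumerate(labels)
                 if other == label and j != i]
        best = max(peers) if peers else 0.0
        if best < threshold:
            flagged.append(i)
    return flagged
```

Flagged objects could then be given the special marking mentioned above when the synopsis is rendered.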
An example of such automatic clustering process on the objects extracted from a video sequence of 24 hours, shot in a parking lot, is illustrated in
Alternatively, probabilistic clustering can be used whereby, instead of having a hard decision as to which object belongs to which class, a probability vector can be defined for each object and the different classes. This can be used in the hierarchical indexing process. For example, an object can be associated with more than one class if it fits well to those classes. It can also be used in the case where irregular activity is detected as manifested by an object that has substantially equal probabilities of belonging to different classes.
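How a probability vector might be interpreted is sketched below. The thresholds `accept` and `flat_margin` and the function name are hypothetical; the sketch only illustrates multi-class association and the "substantially equal probabilities" test.

```python
def interpret_membership(probs, accept=0.3, flat_margin=0.1):
    """Interpret one object's class-probability vector.

    Returns (classes, irregular): the classes the object is associated with,
    and whether its probabilities are nearly equal across all classes.
    """
    # Soft assignment: the object joins every class it fits well enough.
    classes = [k for k, p in enumerate(probs) if p >= accept]
    # Nearly flat probabilities suggest the object fits no class in particular.
    irregular = (max(probs) - min(probs)) <= flat_margin
    return classes, irregular
```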
Once the clusters are determined, and objects are grouped into clusters, a collection of hierarchical video synopsis sequences for indexing can be generated based on this clustering (Synopsis from Clusters—SFC).
A possible indexing hierarchy could first present to the user a “top-level synopsis”: a synopsis video containing only a few representatives from each cluster. E.g. from the clusters shown in
The top-level synopsis can be used as an interactive indexing tool to get to each desired object or activity in the original video. Once a user has selected a specific cluster or a collection of clusters, the next synopsis in the hierarchy is displayed. Such a synopsis will contain more representatives or even all the objects from those clusters. At this stage the user can specify the desired object and get to its original time in the input video. If there are many objects in each class and it is hard to generate a short synopsis, it is possible to add more levels to the hierarchy and generate several sub-clusters from each original cluster. For example, selecting the “cars moving to the right” cluster may generate two sub-clusters, one of trucks and one of sedans. In this case, one of the sub-clusters must be selected before getting to the final result of the search, which shows a synopsis with most of the original objects.
Such an approach provides a very quick search and indexing tool into a very large video which is based on visual queries and enables every object and activity in the original video to be reached in a reasonable time.
We tested video synopsis on a few video streams captured off the Internet. As the frame rate is not constant over the Internet, and frames drop periodically, whenever we use a temporal neighborhood we do not count the number of frames, but we use the absolute times of each frame.
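A temporal neighborhood defined by absolute time rather than by frame count can be sketched as follows. It assumes each frame carries a capture timestamp in seconds and that timestamps are sorted; the function name is hypothetical.

```python
from bisect import bisect_left, bisect_right

def temporal_neighbors(timestamps, center_time, half_window):
    """Indices of frames captured within +/- half_window seconds of center_time.

    timestamps: sorted capture times in seconds, one per received frame.
    Dropped frames or a varying frame rate do not change the time span covered.
    """
    lo = bisect_left(timestamps, center_time - half_window)
    hi = bisect_right(timestamps, center_time + half_window)
    return list(range(lo, hi))
```

Where frames were dropped the neighborhood simply contains fewer frames, instead of silently stretching over a longer period.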
FIGS. 10a and 10b show the effect of the choice of collision cost on the density of objects in the video synopsis.
In most cases not all objects are of interest. A traffic surveillance application may be interested only in cars, while other applications may prefer pedestrians. Filtering of objects can be done in several places. Objects can be filtered out before entering the queue, in which case it will never be possible to retrieve them. Alternatively, objects can be filtered only at the query stage. In this case the queue will include all objects, and different queries can extract different objects from the queue. It is also possible to create a customized energy function for each application.
A simple example of customization is shown in
There are a few schemes for specifying the duration and quality of the video synopsis.
(a) Let the user specify the desired duration of the video synopsis and the penalty for object collision. In this case, the optimization stage will maximize the amount of activity that can be included in the synopsis under the specified constraints.
(b) Let the user specify the desired duration of the video synopsis and the percentage of activity to be included in it. The optimization stage will generate a video synopsis having minimum collisions under the specified constraints.
(c) Let the user specify the allowed percentage of lost objects and the penalty for object collision. The optimization stage will minimize the duration of the synopsis under the specified constraints.
In our experiments we have implemented option (a), where the duration of the video synopsis was determined by the user as a hard constraint. Surveillance video may prefer options (b) or (c), assuring that most objects will be represented in the synopsis.
Fast-forward is the most common tool used for video summarization, and has always been applied to entire frames. For example, “time lapse” videos display in a short time slow processes like the growth of flowers, etc. Some current methods suggest an adaptive fast-forward [12, 18, 4] but are still limited to the framework of entire frames. With video synopsis each object can have its own “fast forward” based on its importance, or based on its original velocity. Slow objects may be accelerated, but not fast objects. Alternatively, fast objects may be slowed down for easier viewing.
Object speed changes can be done in a simple manner, e.g. bringing all moving objects to a uniform velocity. For this purpose slow objects will be speeded up, and fast objects will be slowed down. Alternatively, the change of speed of objects can be determined during the optimization stage, giving some penalty to speed changes of objects. Adding object-based speed changes to the optimization stage can further improve the temporal compression rate of the synopsis video, at the expense of increasing the complexity of the optimization.
Speed changes of an object can be performed by sampling pixels from an object at some selected time periods. If the number of selected time periods is smaller than the number of frames in the tube, the general effect is that the object is sped up. If the number of selected time periods is larger than the number of frames in the tube, the object is slowed down. When a selected time period does not fall exactly on a frame, the pixel at this time can be interpolated from neighboring pixels at neighboring frames closest in time to the selected time. Any possible interpolation method may be used.
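Temporal resampling of a tube can be sketched as below, using linear interpolation between the two nearest frames as one possible interpolation method. The representation of a tube as a list of equally sized per-frame pixel lists and the function name are assumptions for this example.

```python
def resample_tube(frames, num_samples):
    """Resample a tube to num_samples frames, evenly spaced over its duration.

    frames: list of frames, each a list of pixel values of equal length.
    Fewer samples than frames speeds the object up; more slows it down.
    """
    n = len(frames)
    if n == 1:
        return [list(frames[0]) for _ in range(num_samples)]
    if num_samples == 1:
        return [list(frames[0])]

    out = []
    for k in range(num_samples):
        t = k * (n - 1) / (num_samples - 1)  # fractional position in the tube
        i = min(int(t), n - 2)               # index of the earlier neighbor
        w = t - i                            # weight of the later neighbor
        out.append([(1 - w) * a + w * b
                    for a, b in zip(frames[i], frames[i + 1])])
    return out
```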
Phase transitions occur when a moving object becomes stationary and merges with the background, or when a stationary object starts moving. Examples are cars being parked or getting out of parking. In most cases phase transitions are significant events, and we detect and mark each phase transition for use in the query stage.
We can find phase transitions by looking for background changes that correspond to beginning and ending of tubes. These transitions are important as they explain the changes in the background. Since phase transitions correspond to changes in the background, the stitching of phase transitions into the background should be given special attention. Two effects may occur in the synopsis video when phase transitions are not inserted into the background at the right time. (i) Background objects will appear and disappear with no reason, causing a flickering effect. (ii) Moving objects will disappear when they stop moving, rather than become part of the background. To minimize such effects in the video synopsis, phase transitions should be inserted into the time lapse background at a time that corresponds to their original time.
Referring now to
computing image motion parameters between frames in the source video;
warping the video frames in the source video so that stationary objects in the imaged scene will be stationary in the video.
The pre-processor 13 also includes a source object detector 15 that detects objects in the source video and queues the detected objects in an object memory 16. As noted above, when an object is inserted into the queue, its activity cost (Eq. (8)) is computed to accelerate the future construction of synopsis videos, this also being done by the pre-processor 13. It is to be understood that the pre-processor 13 is shown for the sake of completeness owing to its use when creating a synopsis video from an endless source video. The invention also contemplates a reduced system without the pre-processor 13 that is adapted to be coupled to the object memory 16 for manipulating the object queue so as to create a synopsis video according to defined criteria. Such a system is realized by the remaining components in
Thus, a user interface 17 is coupled to the object memory 16 for allowing user-defined constraints to be defined. Such constraints may be used, for example, to define a time window within the source video to be summarized. It may also be used to define the required duration of the synopsis video. The user interface 17 is also used to select objects or object classes for indexing purposes. It will be appreciated that the constraints may also be predefined, in which case some embodiments of the invention will not require the user interface 17.
A source object selector 18 is coupled to the object memory 16 for selecting from the subset at least three different source objects according to the user-defined constraints or to default constraints defined by the system. Each of the different source objects is a connected subset of image points from at least three different frames of the source video. A clustering unit 19 may optionally be coupled to the source object selector 18 for clustering objects according to defined criteria, which may be specified by the user using the user interface 17. A synopsis object sampler 20 is coupled to the source object selector 18 or to the clustering unit 19 when provided, for sampling from each selected source object one or more synopsis objects by temporal selection using image points derived from some selected frames. The “sampler” may be used to change the speed of individual objects. A frame generator 21 includes a cluster selector 22 that allows only selected clusters to be included in the synopsis video. The frame generator 21 also includes a time selector 23 for selecting for each synopsis object a respective time for starting its display in the synopsis video. The frame generator 21 further includes a color transformation unit 24 for selecting for each synopsis object and each frame a respective color transformation for displaying the synopsis object. Optionally, the frame generator 21 may include a warping unit 25 for spatially warping objects prior to stitching to the synopsis video. Within the context of the description and the appended claims, the term “warping” is intended to embrace any spatial editing of an object. As noted above, this can include replacing an object in its entirety by another object such as an icon; or it can simply involve effecting slight geometric adjustments to an object prior to its being stitched in the synopsis video. 
A stitching unit 26 within the frame generator 21 stitches the selected color-transformed synopsis objects so as to generate successive synopsis video frames.
The frames of the synopsis video are stored in a synopsis frame memory 27 for subsequent processing or display by a display unit 28 that displays the temporally shifted objects at their specified time and color transformation.
The system 10 may in practice be realized by a suitably programmed computer having a graphics card or workstation and suitable peripherals, all as are well known in the art.
Object-based synopsis can be used to create a short video that is a synopsis of endless video streams, such as those recorded by surveillance cameras. The method includes two phases. In the input phase, which is performed in real time, the video stream is analyzed and objects of interest are detected and segmented from their background. While an object interest function based on motion has been described, any other approach for object detection, recognition, and segmentation can be used for the generation of the “tubes”, the 3D space-time representation of each object.
Queue management is necessary to bridge the gap between infinite video and finite storage, and to enable fast response to user queries. Several methodologies have been described for determining which objects should be removed from the queue once it becomes full, but other methodologies are possible. Even a random selection of objects for removal from the queue may work fine.
The second phase occurs after the user's query is given. A subset of the queue is extracted based on the period of interest, and the object tubes are arranged (by temporal shifts) to generate the optimal video synopsis. This stage, which requires off-line computation, delivers the video synopsis to the user.
Some very interesting aspects concern periodicity in background. Day-night periods are particularly amenable to detection. In most cases when a few days are covered by a single synopsis, the time-lapse background may cover only a single day, while the activities will come from all days. This should be an option given to the user specifying the query.
It will be understood that reference to “image points”, unless specifically limited to binary segmentation, is intended to embrace also image points as determined by interpolation or by non-binary segmentation methods such as fuzzy segmentation.
It is also to be understood that when a source video is monochrome, the color transformation unit may be used to determine an appropriate gray scale transformation to be applied to selected synopsis objects prior to stitching. Therefore, within the context of the appended claims, the term “color” is not intended to be limited only to RGB but may also be monochrome.
It should also be noted that transformation of color or grayscale is only one type of transformation that may be applied to selected synopsis objects prior to stitching. As explained above, this is particularly beneficial when generating a synopsis video from a source video that spans a long duration in order to ensure background consistency. But it may be less critical when a synopsis video is derived from a source video in which the background hue is sufficiently constant during the required time window.
It will also be understood that the system according to the invention may be a suitably programmed computer. Likewise, the invention contemplates a computer program being readable by a computer for executing the method of the invention. The invention further contemplates a machine-readable memory tangibly embodying a program of instructions executable by the machine for executing the method of the invention.
This application is a continuation of U.S. patent application Ser. No. 12/522,843, entitled METHOD AND SYSTEM FOR VIDEO INDEXING AND VIDEO SYNOPSIS, filed Jul. 10, 2009 as a National Stage Application of International Patent Application No. PCT/IL2007/001520, filed Dec. 9, 2007, and further claims the benefit of U.S. Provisional Patent Applications Nos. 60/898,698, filed Feb. 1, 2007; 60/911,839, filed Apr. 13, 2007; and 60/971,582, filed Sep. 12, 2007, all of whose contents are included herein by reference.
Number | Date | Country
---|---|---
60898698 | Feb 2007 | US
60911839 | Apr 2007 | US
60971582 | Sep 2007 | US

Relation | Number | Date | Country
---|---|---|---
Parent | 12522843 | Jul 2009 | US
Child | 13644850 | | US