This application claims the right of priority based on British patent application number 09 012 63.4 filed on 26 Jan. 2009, which is hereby incorporated by reference herein in its entirety as if fully set forth herein.
The invention relates to a method, apparatus and computer program product for the detection of similar video segments.
In recent years there has been a sharp increase in the amount of digital video data that consumers have access to and keep in their video libraries. These videos may take the form of commercial DVDs and VCDs, personal camcorder recordings, off-air recordings onto HDD and DVR systems, video downloads on a personal computer or mobile phone or PDA or portable player, and so on. This growth of digital video libraries is expected to continue and accelerate with the increasing availability of new high capacity technologies such as Blu-Ray. However, this abundance of video material is also a problem for users, who find it increasingly difficult to manage their video collections. To address this, new automatic video management technologies are being developed that allow users efficient access to their video content and functionalities such as video categorisation, summarisation, searching and so on.
One problem that arises is the need to identify similar video segments. The potential applications include the identification of recurrent video-segments (e.g. TV-station jingles), and video database retrieval, based for instance on the identification of a short fragment provided by the user within a large database of video. Another potential application is the identification of repeated video segments before and after commercials.
In GB 2 444 094 A “Identifying repeating video sections by comparing video fingerprints from detected candidate video sequences” a method is devised to identify repeated sequences as a means of identifying commercial breaks. Initially, the detection of hard cuts, fades, and audio level changes identifies candidate segments. Whenever a certain number of hard cuts/fades is identified, a candidate segment is considered and stored. This is then compared against subsequently identified candidate segments. Comparison is performed using features from a set of possible embodiments: audio level, colour histogram, colour coherence vector, edge change ratio, and motion vector length.
The problem with this method is that it relies on clear boundaries between a segment and its neighbours in order for the segment to be identified in the first place and then compared against other segments. Also, partial repetitions (i.e. where only one section of a segment is repeated) cannot be detected. Furthermore, colour coherence vectors provide very little spatial information and are therefore unsuitable for frame-to-frame matching. Finally, some of the features suggested are not available in uncompressed video and therefore must be calculated ad hoc, noticeably increasing the computational and time requirements.
In WO 2007/053112 A1 “Repeat clip identification in video data” a method and system for identifying repeated clips in video data is presented. The method comprises partitioning the video data into ordered video units utilising content-based keyframe sampling, wherein each video unit comprises a sequence interval between two consecutive keyframes; creating a fingerprint for each video unit; grouping at least two consecutive video units into one time-indexed video segment; and identifying the repeated clip instance based on correlation of the video segments.
The video is first scanned and for each frame a colour histogram is calculated. When a change in histogram between two frames exceeds a given threshold, the second frame is marked as a keyframe. The set of frames between one keyframe and the next constitutes a video unit. A unit-level colour signature is then extracted, as well as frame-level colour signatures. Furthermore, the unit time length is also considered as a feature. A minimum of two consecutive video units are then united to form a segment. Each segment is compared against every other segment in the video. L1 distances are calculated for the unit-level signatures and time lengths and, if both are below fixed thresholds, a match is detected and the corresponding point in a correlation matrix is set to 1 (0 otherwise). Sequences of 1s then indicate sequences of matching segments. The frame-level features are used only as a post-processing verification step, and not in the proper detection process.
One drawback with the technique in WO 2007/053112 A1 is that it is based on video units, a video unit being the video between non-uniformly sampled content-based keyframes. Thus, a unit is a significant structural element, e.g. a shot or more. This is a significant problem since, in the presence of very static or very dynamic video content, the key-frame extraction process itself will become unstable and detect too few or too many units. Also, for video segments which match but also differ in small ways, e.g. by the addition of a text overlay, or a small picture-in-picture, and so on, the key-frame extraction may also become unstable and detect very different units. A segment is then defined as the grouping of two or more units, and the similarity metric is applied at the segment level, i.e. similarities are detected at the level of unit-pairs. So, the invention is quite limited in that it is targeted at the matching of longer segments, e.g. groups of shots, and cannot be applied to ad-hoc segments that last only a few frames. The authors acknowledge this and claim that this problem can be addressed by assuming, for example, sampling at more than one keyframe per second. This, however, can only be achieved by uniform rather than content-based sampling. A major problem that emerges in that case is that video unit-level features will lose all robustness to frame rate changes. In all cases, a fundamental flaw of this method is that it makes decisions on the similarity of segments (i.e. unit-pairs) based on a fixed threshold, but without taking into consideration what similarity levels the neighbouring segments exhibit. The binarized correlation matrix may provide an excessively coarse description of the matching, and result in an excessive number of 1s, e.g. due to the presence of noise. Then, linear sequences of matching segments are searched for. With non-uniform key-frame sampling these lines of matching unit-pairs may be non-contiguous and made of broken and non-collinear segments, and a complex line-tracking algorithm is employed to deal with all these cases. And although frame-level features are available, these are only used for verification of already detected matching segments, not for the actual detection of matching segments.
In general, the aforementioned prior art is mostly concerned with the identification of equal-length segments with very high similarity and distinctive boundaries with respect to neighbouring segments. This situation reasonably suits the application of such methods to the identification of repeated commercials, which are usually characterized by sharp boundaries (e.g. a few dark frames before/after a commercial), distinctive audio levels, and equal length of the repetitions. However, the aforementioned prior art lacks the generality necessary to deal with more arbitrary applications.
One problem that is not addressed is the partial repetition of even a short segment, i.e. only a portion of a segment is repeated. In this case, it is not possible to use segment length as a feature/fingerprint for identification.
Another problem that is not addressed is the presence of text overlay in one of the two segments, or linear/non-linear distortion of one of the two segments (e.g. blurring, or luminance/contrast/saturation changes). Such distortion must be taken into account when considering more general applications.
In WO 2004/040479 A1 “Method for mining content of video” a method for detecting similar segments in a video signal is described. A video of unknown and arbitrary content and length is subject to feature extraction. Features can be audio- and video-based, e.g. motion activity, colour, audio, texture, such as MPEG-7 descriptors. A feature progression in time constitutes a time series. A self-distance matrix is constructed from this time series using the Euclidean distance between each point of the time series (or each vector of a multi-dimensional time series). In the claims, other measures are mentioned, specifically dot product (angle distance) and histogram intersection. Where multiple features are considered (e.g. audio, colour, etc.), the method of finding paths in the distance matrix is applied to each feature independently. The resulting identified segments are subsequently fused.
The method finds diagonal or quasi-diagonal line paths in the distance matrix using dynamic programming techniques, i.e. finding paths of minimal cost, defined by an appropriate cost function. This cost function includes a fixed threshold that defines, in the distance matrix, where the match between two frames is to be considered “good” (low distance) or “bad” (high distance). Therefore points whose value is above the threshold are not considered, while all the points in the distance matrix whose value is below the threshold are considered. Subsequently, paths which are consecutive (close endpoints) are joined, and paths that partially or totally overlap are merged. After joining and merging, short paths (less than a certain distance between the end points) are removed.
One drawback with the technique in WO 2004/040479 A1 is that the application of dynamic programming to search for linear patterns in the distance matrix may be computationally very intensive. Furthermore, one should consider that dynamic programming is applied to all points in the distance matrix that fall below a certain fixed threshold. This fixed threshold may lead to a very large or very small number of candidate points. A large number of points is produced if segments in a video are strongly self-similar, i.e. the frames in the segment are very similar. In this case a fixed threshold that is too high may generate an impractically large number of points to be tracked.
In the eventuality that a repeated segment is composed of identical frames, the problem of finding a least-cost path could be ill-posed, since all diagonal paths connecting a point of the first segment with a point of the second segment would yield the same cost. This would generate a very large number of parallel patterns. An example of these patterns is illustrated in
On the other hand, in the presence of strong non-linear editing (e.g. text overlay, blur, brightening/darkening) the distance between frames may rise above the fixed threshold, resulting in an insufficient number of candidate points.
Another problem may arise when a replicated segment is partially edited, e.g. some frames of the segment are replicated with blur or text overlay. In this case a break is generated in the path of minimal cost, resulting in two split segments even if the two segments are semantically connected.
Another problem with both WO 2007/053112 A1 and WO 2004/040479 A1 is the complexity and cost of calculating the distance matrix and storing the underlying descriptors, which become prohibitive for very large sequences when a real-time or faster operation is required. What is required is a method which alleviates these problems so as to allow fast processing of large sequences, e.g. entire programmes.
Certain aspects of the present invention are set out in the accompanying claims. Other aspects are described in the embodiments below and will be appreciated by the skilled person from a reading of this description.
An embodiment of the present invention provides a new method and apparatus for detecting similar video segments, which:
More particularly, given two video sequences, an embodiment of the invention performs processing for each frame of each sequence to:
Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings, in which:
A method that is performed by a processing apparatus in an embodiment of the invention will now be described. The method comprises a number of processing operations. As explained at the end of the description, these processing operations can be performed by a processing apparatus using hardware, firmware, a processing unit operating in accordance with computer program instructions, or a combination thereof.
Given two video sequences, Sa and Sb, the processing performed in an embodiment finds similar segments between the two sequences.
According to the present embodiment, video frames
F(n,m) = {Fc(n,m)}, n = 1, …, N, m = 1, …, M, c = 1, …, C
may be described by their pixel values in any suitable colour space (e.g. C=3 in RGB or YUV colour space, or C=1 for greyscale images), or by any suitable descriptor derived therefrom.
In one embodiment of the invention, each frame in Sa and Sb is described by its pixel values. In a preferred embodiment of the invention, each frame in Sa and Sb is described by one or more compact descriptors derived from its pixel values.
Such descriptors may be calculated using the techniques described in EP 1,640,913 and EP 1,640,914, the full contents of which are incorporated herein by cross-reference. For example, such descriptors may be calculated using a multi-resolution transform (MRT), such as the Haar or Daubechies wavelet transforms. In a preferred embodiment, a custom, faster transform is used that is calculated locally on a 2×2 pixel window and is defined as
In a similar fashion to the Haar transform, this MRT is applied to every 2×2 non-overlapping window in a resampled frame of dimensions N=M equal to a power of 2. For an N×M frame F(n,m) it produces, for each colour channel c, (N×M)/4 LPc elements and (3×N×M)/4 HPc elements. Then, it may be applied to the LPc elements that were previously calculated, and so on until eventually only 1 LPc and (N×M−1) HPc elements remain per colour channel.
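Since the definition of the custom transform is not reproduced above, the following sketch is illustrative only: it implements the recursive 2×2 decomposition just described using the standard Haar basis as a stand-in for the custom transform. The use of numpy and the function names are assumptions of the sketch, not part of the invention.

```python
import numpy as np

def transform_2x2_level(channel):
    """One decomposition level on every non-overlapping 2x2 window; the
    standard Haar basis is used here as a stand-in for the custom MRT."""
    a = channel[0::2, 0::2]  # top-left pixel of each 2x2 window
    b = channel[0::2, 1::2]  # top-right
    c = channel[1::2, 0::2]  # bottom-left
    d = channel[1::2, 1::2]  # bottom-right
    lp = (a + b + c + d) / 4.0     # low-pass: local average
    hps = ((a - b + c - d) / 4.0,  # high-pass: horizontal detail
           (a + b - c - d) / 4.0,  # high-pass: vertical detail
           (a - b - c + d) / 4.0)  # high-pass: diagonal detail
    return lp, hps

def multi_resolution_transform(channel):
    """Recurse on the LP band until 1 LP and N*M-1 HP elements remain,
    as described above; 'channel' must be square with power-of-2 side."""
    lp = channel.astype(np.float64)
    hp_all = []
    while lp.size > 1:
        lp, hps = transform_2x2_level(lp)
        hp_all.extend(h.ravel() for h in hps)
    return lp.ravel(), np.concatenate(hp_all)
```

For example, a frame resampled to 64×64 yields, per colour channel, one LP element and 4095 HP elements.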
For each frame F(n,m) the LP and HP elements, or a suitable subset of them, are arranged in a vector (hereinafter referred to as a descriptor) Φ = [φd], d = 1, …, D (step S2), where each element φd belongs to a suitable subset of the LP and HP components (e.g. D=C×N×M).
Each element of the vector φd is then binarized (quantized) according to the value of its most significant bit (MSB) (step S3)
Φbin = [φdbin], d = 1, …, D, where φdbin = MSB(φd), φd ∈ Φ
In different embodiments of the invention, different frame descriptors, or different elements of each descriptor, are subject to individual binarization (quantisation) parameters, such as MSB selection, locality-sensitive hashing (for example as described in Samet H., “Foundations of Multidimensional and Metric Data Structures”, Morgan Kaufmann, 2006), etc.
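As an illustration of the binarization of step S3, the sketch below takes the most significant bit of a signed transform coefficient to be its sign bit, which is one plausible reading of the MSB rule above; the helper names are hypothetical.

```python
import numpy as np

def binarize_descriptor(phi):
    """Step S3 (one reading): quantise each descriptor element to a single
    bit by its sign, treating the sign bit of a signed coefficient as its
    most significant bit."""
    return (np.asarray(phi) >= 0).astype(np.uint8)

def pack_descriptor(phi_bin):
    """Pack the bits into bytes so that Hamming distances reduce to XOR
    plus population counts."""
    return np.packbits(phi_bin)
```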
Each frame Fi(a) in Sa=[Fi(a)], i = 1, …, A is compared against each frame Fj(b) in Sb, where Sb=[Fj(b)], j = 1, …, B, by means of the Hamming distance δij of their respective binarized descriptors.
The elements δij are arranged in a distance matrix (step S4)
Δ = [δij], i = 1, …, A, j = 1, …, B
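A common way to compute such a matrix efficiently is XOR followed by a byte-wise population-count table, consistent with the table-lookup remark made later for the L1 case. The following sketch assumes the packed binary descriptors of the previous sketch; note that materialising the full A×B matrix may be memory-intensive for long sequences.

```python
import numpy as np

# Population count of every byte value, for table-lookup Hamming distances.
POPCOUNT = np.array([bin(v).count("1") for v in range(256)], dtype=np.uint16)

def distance_matrix(packed_a, packed_b):
    """Delta as an A x B matrix of Hamming distances between packed binary
    descriptors: packed_a has shape (A, K) bytes, packed_b shape (B, K)."""
    xor = np.bitwise_xor(packed_a[:, None, :], packed_b[None, :, :])
    return POPCOUNT[xor].sum(axis=2, dtype=np.uint32)
```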
In the preferred embodiment of the invention, local minima of Δ are then searched along each of its columns (steps S5 and S6): a point δij is retained as a local minimum μij if its value is smaller than that of its immediate column-wise neighbours δ(i−1)j and δ(i+1)j.
A local minimum μij at the i-th row of the j-th column of Δ indicates that the frame Fi(a) is the most similar to Fj(b) within its column-wise neighbourhood N. In the simple minimum finding procedure described above, the neighbourhood is defined as N = {Fi−1(a), Fi(a), Fi+1(a)}. Consequently, a local minimum μij which is also global in the j-th column indicates that the frame Fi(a) is the best match to Fj(b). Local minima are evaluated against a threshold (step S7). The algorithm preserves only those minima whose value is sufficiently small, i.e. that imply a sufficiently strong match between the corresponding frames in Sa and Sb.
The threshold in step S7 is adaptively calculated so that at least a minimum amount mm and no more than a maximum amount Mm of minima are kept. However, if the number of minima found at step S6 is smaller than mm, then the threshold is adapted so as to preserve all of them.
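One simple realisation of this adaptation rule, assuming the minima values have already been collected, is to place the threshold at an order statistic of those values; the sketch below is one possible reading.

```python
import numpy as np

def adaptive_threshold(minima_values, m_min, m_max):
    """Step S7's adaptation rule (a sketch): keep all minima when fewer
    than m_min were found, otherwise keep between m_min and m_max of the
    smallest (strongest-matching) ones by thresholding at an order
    statistic of the collected minima values."""
    values = np.sort(np.asarray(minima_values, dtype=float))
    if values.size == 0:
        return 0.0
    keep = values.size if values.size < m_min else min(values.size, m_max)
    return values[keep - 1]  # minima with value <= threshold are preserved
```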
For each local minimum μ a set V of valley points is found (step S8). These are defined as the non-minima points immediately below and above (column-wise in Δ) the corresponding minimum, i.e.
∀ μij = δij : V = [δ(i−v)j, …, δ(i−1)j, δ(i+1)j, …, δ(i+v)j]
where v is a default parameter (such as 3) or alternatively is defined heuristically. The goal of V is to provide continuity information in the neighbourhood of each μ and thereby counter the discontinuity and non-collinearity that arise from any form of sampling, non-linear editing and, in general, the lack of a “strong” match between the two sequences Sa and Sb.
Valley points are evaluated against a threshold (step S9). The algorithm preserves only those valley points whose value is sufficiently small i.e. that imply a sufficiently strong match between the corresponding frames in Sa and Sb.
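Steps S8 and S9 might be sketched as follows, assuming Δ is held as a two-dimensional numpy array and (i, j) indexes a retained local minimum; the function name and signature are illustrative.

```python
def valley_points(delta, i, j, v, threshold):
    """Steps S8-S9 (a sketch): collect up to v points immediately above and
    below the local minimum at (i, j), column-wise in Delta, keeping only
    those whose distance value is below the valley threshold."""
    A = delta.shape[0]
    points = []
    for k in range(1, v + 1):
        if i - k >= 0 and delta[i - k, j] < threshold:
            points.append((i - k, j))
        if i + k < A and delta[i + k, j] < threshold:
            points.append((i + k, j))
    return points
```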
Local minima and valley points are collectively denominated candidate matching segment points π (step S10). An example of π is illustrated in
It should be noted that in a different embodiment of the invention, local minima and valley points may be searched in an analogous fashion along rows of the distance matrix instead of columns. In yet another embodiment of the invention, local minima and valley points may be searched in an analogous fashion in both dimensions of the distance matrix.
A line segment searching algorithm is applied to the set of π (step S11). The rationale is that if a video segment of Sa is repeated in Sb, this will give rise to a set of consecutive (adjacent) π in Δ arranged in a line segment σ oriented at θ = tan⁻¹(ρa/ρb), where ρa and ρb are respectively the frame rates of Sa and Sb. If the frame rate does not change from Sa to Sb, it follows that ρa=ρb and θ=45°.
Valley points V therefore help to fill any gaps that may arise from the presence of noise or from imperfect matching due to coarse time sampling. An example of the line segment searching algorithm is illustrated in
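Assuming equal frame rates (θ=45°) and a boolean mask of candidate points π, a minimal line segment search could scan the diagonals of the mask for runs of candidate points, tolerating the small gaps that the valley points are intended to bridge. The gap and length parameters below are illustrative assumptions.

```python
import numpy as np

def find_line_segments(candidates, min_length=5, max_gap=1):
    """Scan each 45-degree diagonal of the boolean candidate-point mask for
    runs of candidate points, tolerating gaps of up to max_gap points.
    Returns segments as (i_start, j_start, i_stop, j_stop) tuples."""
    A, B = candidates.shape
    segments = []
    for offset in range(-(A - 1), B):
        diag = np.diagonal(candidates, offset=offset)
        start, last_hit = None, None
        for t, hit in enumerate(diag):
            if hit:
                if start is None:
                    start = t
                last_hit = t
            elif start is not None and t - last_hit > max_gap:
                if last_hit - start + 1 >= min_length:
                    segments.append(_to_coords(start, last_hit, offset))
                start, last_hit = None, None
        if start is not None and last_hit - start + 1 >= min_length:
            segments.append(_to_coords(start, last_hit, offset))
    return segments

def _to_coords(t0, t1, offset):
    """Map positions along a diagonal back to (i, j) matrix coordinates."""
    if offset >= 0:
        return (t0, t0 + offset, t1, t1 + offset)
    return (t0 - offset, t0, t1 - offset, t1)
```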
In a preferred embodiment of the invention, the line segment searching is followed by a hysteretic line segment joining algorithm: if two detected line segments are approximately collinear with endpoints in close proximity, and the average value in Δ of the points connecting them is lower than a given threshold, therefore indicating sufficient matching between the intermediate frames in Sa and Sb, then the two line segments are connected (step S13).
In a preferred embodiment, line segments σ (step S14), and therefore matching video segments, are validated according to their average value in Δ, calculated as

δ̄(σ) = (1/L(σ)) Σπ∈σ δπ

where L(σ) is the length (number of π) of the line segment σ (step S15). Line segments yielding an average value above a given threshold are rejected.
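Under the same 45° assumption as the search sketch above, the validation of step S15 reduces to averaging Δ along the detected segment; the names below are again illustrative.

```python
def validate_segment(delta, segment, max_average):
    """Step S15 (a sketch): keep a line segment only if its average value
    in Delta is low enough; assumes the 45-degree orientation and the
    (i_start, j_start, i_stop, j_stop) layout of the search sketch."""
    i0, j0, i1, j1 = segment
    length = i1 - i0 + 1                           # L(sigma): number of points
    total = sum(delta[i0 + t, j0 + t] for t in range(length))
    return total / length <= max_average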
In a preferred embodiment, an ambiguity resolution procedure (AR) is employed to remove multiple matches and ambiguous results. An example of the final result is provided in
The AR works in two stages as follows:
σa ∈ ζ(σb) : L(σb) ≥ L(σa)
In one embodiment of the invention, the case is considered where two or more video segments in Sa (in Sb) have the same match in Sb (in Sa). The corresponding line segments in Δ are said to be competing, as they “compete” to associate the same frames in Sb (in Sa) with different frames in Sa (in Sb). Trivially, competing line segments do not shadow each other (that eventuality would be dealt with by stage 2). Given two line segments σ1, σ2, σ1 is said to compete with σ2 if
[xstart(σ1), xstop(σ1)] ∩ [xstart(σ2), xstop(σ2)] ≠ ∅ and [ystart(σ1), ystop(σ1)] ∩ [ystart(σ2), ystop(σ2)] = ∅

or

[xstart(σ1), xstop(σ1)] ∩ [xstart(σ2), xstop(σ2)] = ∅ and [ystart(σ1), ystop(σ1)] ∩ [ystart(σ2), ystop(σ2)] ≠ ∅
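The competing-segment test amounts to checking that the two segments' extents overlap on exactly one axis of Δ; a sketch, with the hypothetical tuple layout of the earlier sketches, follows.

```python
def _intervals_overlap(lo1, hi1, lo2, hi2):
    """True if the closed intervals [lo1, hi1] and [lo2, hi2] intersect."""
    return lo1 <= hi2 and lo2 <= hi1

def compete(s1, s2):
    """Two line segments compete when their extents overlap on exactly one
    axis of Delta, i.e. they claim the same frames of one sequence for
    different frames of the other; segments are (i_start, j_start,
    i_stop, j_stop) tuples as in the earlier sketches."""
    x_overlap = _intervals_overlap(s1[0], s1[2], s2[0], s2[2])
    y_overlap = _intervals_overlap(s1[1], s1[3], s2[1], s2[3])
    return x_overlap != y_overlap  # one axis overlaps, the other does not
```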
Although competing line segments may legitimately occur, their presence may in fact betray a false result by the algorithm, and therefore they are assessed as follows:
In different embodiments of the invention, and according to the target application, either Stage 1 or Stage 2 or the entire AR procedure may be omitted.
In one embodiment of the invention, the two video sequences Sa and Sb are one and the same, i.e. Sa=Sb=S, and the method is aimed at finding repeated video segments within S. In that case only the upper-triangular part of Δ requires processing, since Sb=Sa trivially implies that Δ is symmetric and that the main diagonal is a locus of global minima (self-similarity). So, given a line segment σ = {xstart, xstop, ystart, ystop}, we have to guarantee that xstart < ystart and xstop < ystop. Furthermore, to avoid detection of self-similarity we have to ensure that any detected line segment implies two non-overlapping time intervals in Sa and Sb; in other words xstop < ystart, i.e. the repeated video segment in Sb must start after the end of its copy in Sa. Since ystart < ystop and xstart < xstop, the condition xstop < ystart is sufficient, as it also implies that the segment lies in the upper-triangular part. In an alternative embodiment of the invention, the lower-triangular part of the distance matrix may be processed instead of the upper-triangular part in an analogous fashion.
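For the self-matching case, the constraint reduces to a one-line test on the segment's coordinates (tuple layout as in the earlier sketches, with x indexing Sa and y indexing Sb):

```python
def valid_self_match(segment):
    """Self-matching (Sa = Sb = S): keep a segment only if its repetition
    starts after its first copy ends; x_stop < y_start also places the
    segment in the upper-triangular part of Delta, away from the
    self-similar main diagonal."""
    x_start, y_start, x_stop, y_stop = segment
    return x_stop < y_start
```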
In different embodiments of the invention, Sa and Sb may be described by multiple descriptors, e.g. separately for different colour channels and/or for LP and HP coefficients, resulting in multiple distance matrices Δ. This is understood to better harness the similarity between frames by addressing separately the similarity in colour, luminosity, detail, average colour/luminosity, etc.
In a preferred embodiment, we consider the YUV colour space, and we separate the HP and LP coefficients for the Y-channel and retain only the LP coefficients of the U- and V-channels. This results in three distance matrices ΔY-HP, ΔY-LP, and ΔUV-LP. In such an embodiment, each distance matrix may be processed individually. For example, the minima and valley points found on ΔY-HP may be further validated according to their values in ΔY-LP and ΔUV-LP. In a similar fashion, line segments σ may be validated according to their average values in the three matrices, i.e. according to
In different embodiments of the invention, the descriptor elements are not binarised but quantised to a different number of bits, e.g. 2 or 3 bits, in which case the Hamming distance is replaced by a suitable distance measure, e.g. L1, which may be efficiently implemented using table lookup operations, in a fashion similar to that commonly employed for the Hamming distance.
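For instance, with 2-bit quantisation, four elements pack into a byte and a 256×256 table of byte-wise L1 distances makes the comparison a pure lookup. The sketch below assumes this packing scheme; the names and layout are illustrative.

```python
import numpy as np

def build_l1_table(bits=2):
    """Table of L1 distances between all pairs of bytes, each byte holding
    8/bits packed quantised descriptor elements (here 2-bit elements)."""
    per_byte = 8 // bits
    mask = (1 << bits) - 1
    table = np.zeros((256, 256), dtype=np.uint16)
    for x in range(256):
        for y in range(256):
            table[x, y] = sum(
                abs(((x >> (bits * k)) & mask) - ((y >> (bits * k)) & mask))
                for k in range(per_byte))
    return table

L1_TABLE = build_l1_table(bits=2)

def l1_distance(packed_a, packed_b):
    """Sum the per-byte L1 distances of two packed descriptors by lookup."""
    return int(L1_TABLE[packed_a, packed_b].sum())
```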
In different embodiments of the invention, one or more of the aforementioned multiple descriptors may be calculated from only a portion, e.g. the central section, of the corresponding frames. This can reduce computational costs and may improve accuracy.
In different embodiments of the invention, the frame descriptors may be calculated from spatially and/or temporally subsampled video, e.g. from low-resolution video frame representations, and employing frame skipping. In one embodiment, Sa and/or Sb are MPEG coded and frame matching is performed based on the DC or subsampled DC representations of I-frames. This means that no video decoding is required, which results in a great increase in computational efficiency.
A data processing apparatus 1 for performing the processing operations described above is shown in
The apparatus 1 comprises conventional elements of a data processing apparatus, which are well-known to the skilled person, such that a detailed description is not necessary. In brief, the apparatus 1 of
Although the processing apparatus 1 described above performs processing in accordance with computer program instructions, an alternative processing apparatus can be implemented in any suitable or desirable way, as hardware, software or any suitable combination of hardware and software. It is furthermore noted that the present invention can also be embodied as a computer program that executes one of the above-described methods of processing image data when loaded into and run on a programmable processing apparatus, and as a computer program product, e.g. a data carrier storing such a computer program.
The foregoing description of embodiments of the invention has been presented for the purpose of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed. Alterations, modifications and variations can be made without departing from the spirit and scope of the present invention.