Reference is made to commonly assigned, co-pending U.S. patent application Ser. No. ______ (Docket K000877), entitled: “Video representation using a sparsity-based model”, by Kumar et al., which is incorporated herein by reference.
This invention relates generally to the field of video understanding, and more particularly to a method for determining scene boundaries in a video using a sparse representation.
With the development of digital imaging and storage technologies, video clips can be conveniently captured by consumers using various devices such as camcorders, digital cameras or cell phones and stored for later viewing and processing. Efficient content-aware video representation models are critical for many video analysis and processing applications including denoising, restoration, and semantic analysis.
Developing models to capture spatiotemporal information present in video data is an active research area and several approaches to represent video data content effectively have been proposed. For example, Cheung et al. in the article “Video epitomes” (Proc. IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 42-49, 2005), teach using patch-based probability models to represent video content. However, their model does not capture spatial correlation.
In the article “Recursive estimation of generative models of video” (Proc. IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 79-86, 2006), Petrovic et al. teach a generative model and learning procedure for unsupervised video clustering into scenes. However, they assume videos to have only one scene. Furthermore, their framework does not model local motion.
Peng et al., in the article “RASL: Robust alignment by sparse and low-rank decomposition for linearly correlated images” (Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 763-770, 2010), teach a sparsity-based method for simultaneously aligning a batch of linearly correlated images. Clearly, this model is not suitable for video processing as video frames, in general, are not linearly correlated.
Another method taught by Baron et al., in the article “Distributed compressive sensing” (preprint, 2005), models both intra- and inter-signal correlation structures for distributed coding algorithms.
In the article “Compressive acquisition of dynamic scenes” (Proc. 11th European Conference on Computer Vision, pp. 129-142, 2010), Sankaranarayanan et al. teach a compressed sensing-based model for capturing video data at much lower rate than the Nyquist frequency. However, this model works only for single scene video.
In the article “A compressive sensing approach for expression-invariant face recognition” (Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 1518-1525, 2009), Nagesh et al. teach a face recognition algorithm based on the theory of compressed sensing. Given a set of registered training face images from one person, their algorithm estimates a common image and a series of innovation images. The innovation images are further exploited for face recognition. However, this algorithm is not suitable for video modeling as it was designed explicitly for face recognition and does not preserve pixel-level information.
There remains a need for a video representation framework that is data adaptive, robust to noise and to varying content, and applicable to a wide variety of video applications including reconstruction, denoising, and semantic understanding.
The present invention represents a method for determining a scene boundary location between a first scene and a second scene in an input video sequence including a time sequence of input video frames, the input video frames in the first scene including some common scene content that is common to all of the input video frames in the first scene and some dynamic scene content that changes between at least some of the input video frames in the first scene and the input video frames in the second scene including some common scene content that is common to all of the input video frames in the second scene and some dynamic scene content that changes between at least some of the input video frames in the second scene, comprising:
defining a set of basis functions for representing the dynamic scene content;
determining a scene boundary location dividing the input video sequence into the first and second scenes responsive to a merit function value, wherein the merit function value is a function of the candidate scene boundary location and is determined by:
storing an indication of the determined scene boundary location in a processor-accessible memory;
wherein the method is performed at least in part using a data processing system.
The present invention has the advantage that the use of the sparse combination technique makes the process of determining the scene boundary locations robust to image noise.
The disclosed method has the additional advantage that it does not require the computation of motion vectors or frame similarity metrics, which are generally computationally complex and less reliable.
The invention is inclusive of combinations of the embodiments described herein. References to “a particular embodiment” and the like refer to features that are present in at least one embodiment of the invention. Separate references to “an embodiment” or “particular embodiments” or the like do not necessarily refer to the same embodiment or embodiments; however, such embodiments are not mutually exclusive, unless so indicated or as are readily apparent to one of skill in the art. The use of singular or plural in referring to the “method” or “methods” and the like is not limiting.
The phrase, “digital content record”, as used herein, refers to any digital content record, such as a digital still image, a digital audio file, or a digital video file.
It should be noted that, unless otherwise explicitly noted or required by context, the word “or” is used in this disclosure in a non-exclusive sense.
The data processing system 110 includes one or more data processing devices that implement the processes of the various embodiments of the present invention, including the example processes of
The data storage system 140 includes one or more processor-accessible memories configured to store information, including the information needed to execute the processes of the various embodiments of the present invention, including the example processes of
The phrase “processor-accessible memory” is intended to include any processor-accessible data storage device, whether volatile or nonvolatile, electronic, magnetic, optical, or otherwise, including but not limited to, registers, floppy disks, hard disks, Compact Discs, DVDs, flash memories, ROMs, and RAMs.
The phrase “communicatively connected” is intended to include any type of connection, whether wired or wireless, between devices, data processors, or programs in which data may be communicated.
The phrase “communicatively connected” is intended to include a connection between devices or programs within a single data processor, a connection between devices or programs located in different data processors, and a connection between devices not located in data processors at all. In this regard, although the data storage system 140 is shown separately from the data processing system 110, one skilled in the art will appreciate that the data storage system 140 may be stored completely or partially within the data processing system 110. Further in this regard, although the peripheral system 120 and the user interface system 130 are shown separately from the data processing system 110, one skilled in the art will appreciate that one or both of such systems may be stored completely or partially within the data processing system 110.
The peripheral system 120 may include one or more devices configured to provide digital content records to the data processing system 110. For example, the peripheral system 120 may include digital still cameras, digital video cameras, cellular phones, or other data processors. The data processing system 110, upon receipt of digital content records from a device in the peripheral system 120, may store such digital content records in the data storage system 140.
The user interface system 130 may include a mouse, a keyboard, another computer, or any device or combination of devices from which data is input to the data processing system 110. In this regard, although the peripheral system 120 is shown separately from the user interface system 130, the peripheral system 120 may be included as part of the user interface system 130.
The user interface system 130 also may include a display device, a processor-accessible memory, or any device or combination of devices to which data is output by the data processing system 110. In this regard, if the user interface system 130 includes a processor-accessible memory, such memory may be part of the data storage system 140 even though the user interface system 130 and the data storage system 140 are shown separately in
An initialize intermediate digital video step 204 is used to initialize an intermediate digital video 205. The intermediate digital video 205 is a modified video estimated from the input digital video 203.
A get video segments step 206 detects the scene boundaries (i.e., the scene change locations) in the intermediate digital video 205. The intermediate digital video 205 is divided at the scene change locations to provide a set of video segments, which are collected in a video segments set 207.
A select video segment step 208 selects a particular video segment from the video segments set 207 to provide a selected video segment 209.
A get affine transform coefficients step 210 determines an affine transform having a set of affine transform coefficients for each input video frame of the selected video segment 209. The sets of affine transform coefficients for each video frame are collected in an affine transform coefficients set 211. The affine transform coefficients of the video frames corresponding to the selected video segment 209 are used to align the common scene content present in the selected video segment 209.
Finally, a get common and dynamic video frames step 212 uses the selected video segment 209 and the affine transform coefficients set 211 to determine a common frame and a set of dynamic frames. The common video frame represents the common scene content that is common to all of the video frames of the selected video segment 209. The set of dynamic video frames represent the scene content that changes between at least some of the video frames of the selected video segment 209. The common video frame and dynamic video frames are collected in a common and dynamic video frames set 213.
The individual steps outlined in
The get video segments step 206 analyzes the intermediate digital video 205 to provide the video segments set 207. The video segments set 207 represents the scene boundary locations in the intermediate digital video 205. Mathematical algorithms for determining scene boundary locations are well-known in the art. Any such method can be used in accordance with the present invention. In a preferred embodiment, the get video segments step 206 uses the method for determining scene boundary locations that will be described below with respect to
The select video segment step 208 selects a video segment from the video segments set 207 to provide the selected video segment 209. The selected video segment 209 can be selected in any appropriate way known to those skilled in the art. In a preferred embodiment, a user interface is provided enabling a user to manually select the video segment to be designated as the selected video segment 209. In other embodiments, the video segments set 207 can be automatically analyzed to designate the selected video segment 209 according to a predefined criterion. For example, the video segment depicting the maximum amount of local motion can be designated as the selected video segment 209.
The get affine transform coefficients step 210 determines an affine transform defined by a set of affine transform coefficients for each video frame of the selected video segment 209. Let T(Θi) be the affine transform having the set of affine transform coefficients Θi corresponding to the ith video frame of the selected video segment 209, where 1≦i≦n. The affine transform coefficients Θi include parameters for displacement along x- and y-axis, rotation and scaling for the ith video frame of the selected video segment 209. In a preferred embodiment of the present invention, Θi contains only the displacements along the x- and y-axis (i.e., Θi={xi, yi}, where xi, and yi are global displacements along x- and y-axis, respectively) for the ith video frame of the selected video segment 209. The affine transform T(Θi) is a spatial transform that can be applied to a given input image z(p,q) to provide a transformed image z(p′,q′). Functionally this can be expressed as T(Θi)z(p,q)=z(p′,q′), where
The affine transform coefficients Θi (1≦i≦n) are collected in the affine transform coefficients set 211. The estimation of Θi is explained next.
$x_i = x_{i-1} + \Delta x_{i-1}$ (2)
and
$y_i = y_{i-1} + \Delta y_{i-1}$ (3)
where 1≦i≦n. Furthermore, it is assumed that Δx0=Δy0=0.
In a determine measurement vector step 304, a set of measurement vectors is determined responsive to the selected video segment 209. The determined measurement vectors are collected in a measurement vector set 305. In the preferred embodiment, the determine measurement vector step 304 computes the global displacements in x- and y-directions between successive video frames of the selected video segment 209. Mathematical algorithms for determining global displacements between a pair of images are well-known in the art. An in-depth analysis of image alignment, its mathematical structure and relevancy can be found in the article by Brown entitled “A survey of image registration techniques” (ACM Computing Surveys, Vol. 24, issue 4, pp. 325-376, 1992), which is incorporated herein by reference.
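By way of illustration only, the global displacement between two successive video frames could be estimated by phase correlation, which is one of many well-known registration techniques and is not asserted to be the technique of the preferred embodiment. The following sketch assumes grayscale frames of equal size, integer displacements, and a circular translation model; the function name global_displacement is hypothetical.

```python
import numpy as np

def global_displacement(frame_a, frame_b):
    """Estimate the integer (dx, dy) shift that maps frame_a onto frame_b.

    Phase correlation sketch; assumes a pure (circular) translation.
    """
    Fa = np.fft.fft2(frame_a)
    Fb = np.fft.fft2(frame_b)
    cross = Fb * np.conj(Fa)
    cross /= np.maximum(np.abs(cross), 1e-12)  # keep only the phase term
    corr = np.fft.ifft2(cross).real            # peak location encodes the shift
    peak_row, peak_col = np.unravel_index(np.argmax(corr), corr.shape)
    rows, cols = corr.shape
    # Peaks past the midpoint correspond to negative shifts (wrap-around).
    dy = peak_row - rows if peak_row > rows // 2 else peak_row
    dx = peak_col - cols if peak_col > cols // 2 else peak_col
    return int(dx), int(dy)

# Usage: shift a synthetic frame by 3 rows and -5 columns, then estimate it.
rng = np.random.default_rng(0)
frame_1 = rng.random((32, 64))
frame_2 = np.roll(frame_1, shift=(3, -5), axis=(0, 1))
X_1, Y_1 = global_displacement(frame_1, frame_2)
```

Applying such an estimator to each pair of successive frames would yield the elements Xi and Yi of the measurement vector set 305.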
An estimate affine transform coefficients step 306 uses the measurement vector set 305 and transform coefficients model set 303 to determine the affine transform coefficients set 211. The affine transform coefficients set 211 can be determined in any appropriate way known to those skilled in the art. In a preferred embodiment, the affine transform coefficients set 211 is determined using a sparse representation framework where the measurement vector set 305 and the auto regressive model of the transform coefficients model set 303 are related using a sparse linear relationship. The affine transform coefficients set 211 is then determined responsive to the sparse linear relationship as explained next.
Let f1, f2, . . . , fn be the video frames of the selected video segment 209. Furthermore, let X=[X1, X2, . . . , Xn−1]T, and Y=[Y1, Y2, . . . , Yn−1]T be the elements of the measurement vector set 305 corresponding to the selected video segment 209 representing global displacements along the x- and y-axis, respectively. The ith (1≦i≦n−1) element of X represents the global displacement between video frames fi and fi+1 in the x-direction. Similarly, the ith element of Y represents the global displacement between video frames fi and fi+1 in the y-direction. In equation form, the sparse linear relationship between X and the auto regressive model stored in the transform coefficients model set 303 (Eqs. (2) and (3)) can be expressed using Eq. (4):
where [X1, X2, . . . , Xn−1]T are known and [x1, Δx1, . . . Δxn−1]T are unknowns. Clearly, there are more unknowns than equations. Furthermore, video frames corresponding to the same scene are expected to display smooth transitions. Therefore, the vector [x1, Δx1, . . . Δxn−1]T is expected to be sparse (i.e., very few elements of this vector should be non-zero). Accordingly, in the preferred embodiment of the present invention, [x1, Δx1, . . . Δxn−1]T is estimated by applying a sparse solver to Eq. (4). Mathematical algorithms for determining sparse combinations are well-known in the art. An in-depth analysis of sparse combinations, their mathematical structure and relevancy can be found in the article entitled “From sparse solutions of systems of equations to sparse modeling of signals and images,” (SIAM Review, pp. 34-81, 2009) by Bruckstein et al., which is incorporated herein by reference.
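As a non-limiting illustration of what a sparse solver does, the following sketch implements the iterative soft-thresholding algorithm (ISTA), one of the well-known solvers of the type surveyed by Bruckstein et al. It approximately minimizes 0.5‖Au−b‖² + λ‖u‖₁ for an underdetermined system Au=b. The matrix A, the regularization weight lam, and the iteration count are placeholders chosen for illustration; in particular, A is a random stand-in and not the matrix of Eq. (4).

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink each entry of v toward zero by t (proximal map of the l1 norm)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, b, lam=0.1, n_iter=500):
    """Approximately minimize 0.5*||A u - b||_2^2 + lam*||u||_1 by ISTA."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    u = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ u - b)           # gradient of the quadratic term
        u = soft_threshold(u - grad / L, lam / L)
    return u

# Usage on a hypothetical underdetermined system (5 equations, 12 unknowns).
rng = np.random.default_rng(0)
A_demo = rng.standard_normal((5, 12))
b_demo = rng.standard_normal(5)
u_demo = ista(A_demo, b_demo, lam=0.1)
```

The l1 penalty drives most entries of the estimate to exactly zero, which is the property relied upon when the unknown vector is expected to be sparse.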
Similarly, [y1, Δy1, . . . Δyn−1]T is estimated by solving the linear equation given by Eq. (5) using a sparse solver:
Note that, from Eqs. (2) and (3), it is clear that knowledge of [x1, Δx1, . . . Δxn−1]T and [y1, Δy1, . . . Δyn−1]T is sufficient to determine xi and yi, respectively, ∀i, 1≦i≦n. The affine transform coefficients set 211 is determined by collecting the vectors [x1, Δx1, . . . Δxn−1]T and [y1, Δy1, . . . Δyn−1]T.
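The recovery of the per-frame displacements from the estimated vector can be sketched as follows. Unrolling Eq. (2) gives xi = x1 + Δx1 + . . . + Δxi−1, which is a running sum over the vector [x1, Δx1, . . . Δxn−1]T. The function name and the numeric values below are hypothetical.

```python
import numpy as np

def displacements_from_increments(v):
    """Map [x1, dx1, ..., dx_{n-1}] to [x1, x2, ..., xn] by unrolling Eq. (2)."""
    return np.cumsum(np.asarray(v, dtype=float))

# Usage: a mostly-zero (sparse) increment vector for a five-frame segment.
x_positions = displacements_from_increments([2, 0, 1, 0, -3])
```

The same function applies unchanged to [y1, Δy1, . . . Δyn−1]T by virtue of Eq. (3).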
A determine common video frame step 404 determines a common video frame 405 in response to the first set of basis functions 403 as given by Eq. (6) below:
C=ψβ (6)
where C is a vector representation of the common video frame 405 and ψ is a matrix representation of the first set of basis functions 403. β is a sparse vector of weighting coefficients where only a minority of the elements of β are non-zero. The matrix ψ can be determined in any appropriate way known to those skilled in the art. In a preferred embodiment, ψ is a discrete cosine transform (DCT) matrix.
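For illustration, Eq. (6) can be exercised with an orthonormal DCT-II matrix as follows. The sketch uses a one-dimensional signal of length 8 for brevity, whereas a vectorized video frame would in practice call for a two-dimensional DCT; the function name dct_basis and the coefficient values are assumptions made for this example.

```python
import numpy as np

def dct_basis(n):
    """Return psi (n x n) whose columns are orthonormal DCT-II basis vectors."""
    k = np.arange(n).reshape(-1, 1)        # frequency index
    t = np.arange(n).reshape(1, -1)        # sample index
    M = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    M[0, :] /= np.sqrt(2.0)                # normalize the constant basis vector
    return M.T                             # synthesis form, so that C = psi @ beta

# Usage: a sparse beta with two non-zero weights yields the common frame C.
psi = dct_basis(8)
beta = np.zeros(8)
beta[0], beta[3] = 4.0, -1.5
C = psi @ beta                             # Eq. (6)
beta_back = psi.T @ C                      # psi is orthonormal, so this recovers beta
```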
In a define second set of basis functions step 406, a set of basis functions that can be used to estimate a set of dynamic scenes for the selected video segment 209 is defined. The set of basis functions produced by the define second set of basis functions step 406 is collected as second set of basis functions 407. In a preferred embodiment, the second set of basis functions 407 is the same set of DCT basis functions that were used for the first set of basis functions 403. However, in other embodiments a different set of basis functions can be used.
A determine dynamic video frames step 408 determines a dynamic video frames set 409 responsive to the second set of basis functions 407. The dynamic video frames set 409 can be determined in any appropriate way known to those skilled in the art. In a preferred embodiment, a set of sparse linear combinations of the basis functions of the second set of basis functions 407 is determined to represent the dynamic video frames set 409 as given by Eq. (7) below:
$D_i = \phi\alpha_i, \quad 1 \le i \le n$ (7)
where Di is the vector representation of the dynamic scene corresponding to fi, φ is the matrix representation of the second set of basis functions 407, and αi (1≦i≦n) are sparse vectors of weighting coefficients. In a preferred embodiment, φ is assumed to be the same as ψ (i.e., φ=ψ).
A determine common and dynamic video frames step 410 produces the common and dynamic video frames set 213 responsive to the affine transform coefficients set 211, the selected video segment 209, the common video frame 405, and the dynamic video frames set 409. The common and dynamic video frames set 213 can be determined in any appropriate way known to those skilled in the art. In a preferred embodiment, the determine common and dynamic video frames step 410 solves Eq. (8) to determine the common and dynamic video frames set 213.
From Eq. (8), it is clear that fi=T(Θi)C+Di, where Θi={xi, yi}, C=ψβ, and Di=φαi=ψαi. Due to the sparse nature of β and αi, the vector [β, α1, . . . , αn]T is estimated using a sparse solver. Mathematical algorithms to solve a linear equation of the form shown in Eq. (8) for determining a sparse vector are well-known in the art. An in-depth analysis of sparse solvers, their mathematical structures and relevancies can be found in the aforementioned article by Bruckstein et al. entitled “From Sparse Solutions of Systems of Equations to Sparse Modeling of Signals and Images.” The common and dynamic video frames set 213 is determined by collecting the common video frame C and the dynamic video frames Di (1≦i≦n), where C=ψβ, and Di=ψαi.
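One way the relationship fi=T(Θi)ψβ+ψαi might be arranged as a single linear system in the unknown vector [β, α1, . . . , αn]T is sketched below. The block layout is an assumption inferred from that relationship and is not a reproduction of Eq. (8). The sketch further assumes one-dimensional frames and models T(Θi) as a circular shift; build_system is a hypothetical name.

```python
import numpy as np

def build_system(psi, shifts):
    """Return A such that [f_1; ...; f_n] = A @ [beta; alpha_1; ...; alpha_n].

    Row block i is [T_i psi, 0, ..., psi, ..., 0], with psi in column block i+1.
    """
    m, k = psi.shape
    n = len(shifts)
    A = np.zeros((n * m, (n + 1) * k))
    for i, s in enumerate(shifts):
        rows = slice(i * m, (i + 1) * m)
        A[rows, :k] = np.roll(psi, s, axis=0)          # T(theta_i) applied to psi
        A[rows, (i + 1) * k:(i + 2) * k] = psi         # basis for the dynamic frame D_i
    return A

# Usage: three frames of length 6, identity basis used purely for illustration.
psi_demo = np.eye(6)
shifts_demo = [0, 1, 2]
A_sys = build_system(psi_demo, shifts_demo)
```

The resulting matrix has more columns than rows, so the stacked system is underdetermined and a sparse solver would be applied to it to estimate β and the αi.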
The common and dynamic video frames set 213, in conjunction with the affine transform coefficients set 211, contains sufficient information to reconstruct the selected video segment 209.
$\hat{f}_i = T(\Theta_i)C + D_i$ (9)
where $\hat{f}_i$ is the reconstructed estimate of the ith video frame, fi, of the selected video segment 209. The reconstructed video frames $\hat{f}_i$ (1≦i≦n) are collected in the reconstructed video segment set 603. Due to the noise robustness property of sparse solvers, the reconstructed video segment set 603 is robust to noise. In other words, denoising is automatically achieved during the video reconstruction process.
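A minimal sketch of the reconstruction of Eq. (9) for two-dimensional frames is given below. It assumes that Θi contains only integer displacements (dx, dy) and models the translation with wrap-around at the frame edges, which is a simplification adopted for this example; the function names are hypothetical.

```python
import numpy as np

def translate(image, dx, dy):
    """Apply T(theta) for theta = {dx, dy}; circular wrap-around is an assumption."""
    return np.roll(image, shift=(dy, dx), axis=(0, 1))

def reconstruct_segment(common, dynamics, thetas):
    """Eq. (9): reconstructed frame i = T(theta_i) C + D_i for every frame."""
    return [translate(common, dx, dy) + d for d, (dx, dy) in zip(dynamics, thetas)]

# Usage: a 3x4 common frame, two dynamic frames, and two displacement pairs.
common_demo = np.arange(12, dtype=float).reshape(3, 4)
dynamics_demo = [np.zeros((3, 4)), np.zeros((3, 4))]
dynamics_demo[0][1, 1] = 5.0
thetas_demo = [(0, 0), (1, 0)]
recon_demo = reconstruct_segment(common_demo, dynamics_demo, thetas_demo)
```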
In addition to reconstruction and denoising, the proposed algorithm can be used for many useful video editing and tracking applications without performing motion estimation and compensation. A preferred embodiment of a method for modifying the common scene content of the selected video segment 209 is shown in
The reconstructed video segment set 807 can be determined in any appropriate way known to those skilled in the art. In a preferred embodiment, the reconstruct video segment step 806 uses Eq. (10) to produce the reconstructed video segment set 807:
$f_i^R = \nu C^N + \rho D_i$ (10)
where fiR is the reconstructed version of the ith video frame, fi, of the selected video segment 209, CN is the value of the new common video frame 805, and ν and ρ are constants. In a preferred embodiment, ν and ρ are pre-determined constants that control the visual quality of fiR. The reconstructed video frames fiR (1≦i≦n) are collected in the reconstructed video segment set 807.
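The weighted combination of Eq. (10) can be sketched as follows. The default values of nu and rho and the function name are placeholders for illustration only, since the specification does not fix particular values for the constants ν and ρ.

```python
import numpy as np

def replace_common_content(new_common, dynamics, nu=1.0, rho=1.0):
    """Eq. (10): combine a new common frame with each dynamic frame D_i."""
    return [nu * new_common + rho * d for d in dynamics]

# Usage: swap in a flat gray common frame while keeping the dynamic content.
new_common_demo = np.full((2, 3), 0.5)
dyn_demo = [np.zeros((2, 3)), np.ones((2, 3))]
edited_demo = replace_common_content(new_common_demo, dyn_demo, nu=1.0, rho=0.2)
```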
Similar to the application described in
where T is a threshold. The threshold T can be determined in any appropriate way known to those skilled in the art. In some embodiments, the threshold T is a predetermined constant. However, it has been found that in many cases it is preferable for the threshold T to be video dependent. A user interface can be provided enabling the user to specify a heuristically determined threshold T that works best for a particular selected video segment 209. The co-ordinates corresponding to |Di(r,s)|=1 are collected in the moving objects set 905.
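The thresholding of a dynamic video frame and the collection of moving-object co-ordinates might be sketched as follows. The strict comparison |Di(r,s)|>T is an assumption made for this example, as are the function name and the threshold value.

```python
import numpy as np

def moving_object_coordinates(dynamic_frame, threshold):
    """Return (r, s) co-ordinates where |D_i(r, s)| exceeds the threshold T."""
    mask = np.abs(dynamic_frame) > threshold   # binary map: True marks a moving pixel
    return np.argwhere(mask)

# Usage: two strong responses and one weak response in a 4x5 dynamic frame.
D_demo = np.zeros((4, 5))
D_demo[1, 2], D_demo[3, 0], D_demo[0, 0] = 0.9, -0.7, 0.1
coords_demo = moving_object_coordinates(D_demo, threshold=0.5)
```

Because the absolute value is thresholded, both positive and negative deviations from the common scene content are retained.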
The method described earlier with respect to
Returning to a discussion of
An evaluate merit function step 1006 evaluates a merit function for a set of candidate scene boundary locations 1007. The evaluate merit function step 1006 analyzes the digital video section 1003 responsive to the basis functions set 1005 for each of the candidate scene boundary locations 1007 to determine corresponding merit function values 1009. The merit function values 1009 provide an indication of the likelihood that a particular candidate scene boundary location 1007 corresponds to a scene boundary. A preferred form for the merit function will be described relative to
A scene boundary present test 1010 evaluates the determined merit function values 1009 to determine whether a scene boundary is present in the digital video section 1003. Let S={π1, π2, . . . , πω} be the set of candidate scene boundary locations 1007, wherein each πi∈[1, . . . , N], 1≦i≦ω. The corresponding set of merit function values 1009 can be represented as Π={MFπ
If the scene boundary present test 1010 determines that no scene boundary is present (i.e., Πmax/Πmin<TS), then a no scene boundary found step 1012 is used to indicate that the digital video section 1003 does not include a scene boundary.
If the scene boundary present test 1010 determines that a scene boundary is present, then a determine scene boundary location step 1014 determines a scene boundary location 1015 which divides the digital video section 1003 into first and second scenes responsive to the merit function values 1009.
In a preferred embodiment, the scene boundary location 1015 is defined to be the candidate scene boundary location 1007 corresponding to the minimum merit function value in the set of merit function values 1009. The determine scene boundary location step 1014 selects πmin, which is the element of S that corresponds to the minimum merit function value Πmin=Min(Π)=MFπ
The discussion above describes the case where the candidate scene boundary locations 1007 include each of the video frames in the digital video section 1003. This corresponds to performing an exhaustive search of all of the possible candidate scene boundary locations. One skilled in the art will recognize that other embodiments can use other search techniques to identify the candidate scene boundary location 1007 producing the minimum merit function value Πmin. For example, an iterative search technique, such as the well-known golden section search technique, can be used to converge on the desired solution for the scene boundary location 1015. Such iterative search techniques have the advantage that they require fewer computations as compared to the exhaustive search technique.
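The exhaustive-search form of the scene boundary present test 1010 and the determine scene boundary location step 1014 can be sketched as follows. The function name, the example merit values, and the threshold value ts are hypothetical, and treating a zero minimum as indicating a boundary is an assumption of this sketch.

```python
def select_scene_boundary(candidates, merit_values, ts):
    """Return the candidate with the minimum merit value, or None if no boundary.

    No boundary is reported when max(merit) / min(merit) < ts.
    """
    merit_values = list(merit_values)
    pi_max, pi_min = max(merit_values), min(merit_values)
    if pi_min > 0 and pi_max / pi_min < ts:
        return None                      # merit values are nearly flat
    return candidates[merit_values.index(pi_min)]

# Usage with hypothetical merit values for four candidate locations.
flat_result = select_scene_boundary([1, 2, 3, 4], [10.0, 9.5, 10.2, 9.8], ts=2.0)
peak_result = select_scene_boundary([1, 2, 3, 4], [10.0, 9.0, 1.0, 8.0], ts=2.0)
```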
The method discussed relative to
The evaluate merit function step 1006 evaluates a predefined merit function for each of the candidate scene boundary locations 1007 to determine corresponding merit function values 1009.
The set of candidate scene boundary locations 1007 that are evaluated can be determined using any method known to those skilled in the art. In a preferred embodiment of the present invention, each of the video frames in the digital video section 1003 is evaluated as a candidate scene boundary location 1007. Let ζ1, ζ2, . . . , ζN be the video frames stored in the digital video section 1003. Let π be the value of the candidate scene boundary location 1007; then π∈{1, 2, . . . , N}, where N is the total number of video frames in the digital video section 1003.
A determine left and right video frames sets step 1104 partitions the digital video section 1003 into a left video frames set 1105 and a right video frames set 1107 by dividing the digital video section 1003 at the candidate scene boundary location 1007 (π). Accordingly, the left video frames set 1105 contains the video frames of the digital video section 1003 preceding the candidate scene boundary location 1007 (i.e., ζ1, ζ2, . . . , ζπ−1). Similarly, the right video frames set 1107 contains the video frames following the candidate scene boundary location 1007 (i.e., ζπ, ζπ+1, . . . , ζN).
A get left dynamic content step 1108 uses the basis functions set 1005 to determine left dynamic content 1109 providing an indication of the dynamic scene content in the left video frames set 1105. In a preferred embodiment, the dynamic scene content for each of the video frames ζ1, ζ2, . . . , ζπ−1 in the left video frames set 1105 is represented using a sparse combination of the basis functions in the basis functions set 1005, wherein the sparse combination of the basis functions is determined by finding a sparse vector of weighting coefficients for each of the basis functions in the basis functions set 1005. The sparse vector of weighting coefficients for each video frame in the left video frames set 1105 can be determined using any method known to those skilled in the art. In a preferred embodiment, the same method that was discussed relative to
Similarly, a get right dynamic content step 1110 uses the basis functions set 1005 to determine right dynamic content 1111 providing an indication of the dynamic scene content in the right video frames set 1107. In a preferred embodiment, the dynamic scene content for each of the video frames ζπ, ζπ+1, . . . , ζN in the right video frames set 1107 is represented using a sparse combination of the basis functions in the basis functions set 1005, wherein the sparse combination of the basis functions is determined by finding a sparse vector of weighting coefficients for each of the basis functions in the basis functions set 1005. The sparse vector of weighting coefficients for each video frame in the right video frames set 1107 can be determined using any method known to those skilled in the art. In a preferred embodiment, the method that was discussed relative to
A compute merit function value step 1112 determines the merit function value 1009 by combining the left dynamic content 1109 and the right dynamic content 1111. The compute merit function value step 1112 can use any method known to those skilled in the art to determine the merit function value 1009. In a preferred embodiment, the weighting coefficients in the left dynamic content 1109 and the right dynamic content 1111 are concatenated to form a combined vector of weighting coefficients. The compute merit function value step 1112 then computes an ℓ1 norm of the combined vector of weighting coefficients to determine the merit function value 1009 as given by Eq. (12):
$MF_\pi = \left\| [\alpha_1^L, \ldots, \alpha_{\pi-1}^L, \alpha_\pi^R, \ldots, \alpha_N^R]^T \right\|_1$ (12)
where MFπ is the merit function value 1009 for the candidate scene boundary location 1007 (π), and ∥·∥1 denotes the ℓ1 norm.
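The evaluation of Eq. (12) over a set of candidate scene boundary locations can be sketched as follows. The function merit_function follows Eq. (12) directly. The function dynamic_coefficients, by contrast, is a deliberately simplified stand-in for the sparse decomposition: it takes the per-pixel median of a frame set as the common content and projects the residuals onto an orthonormal basis. That substitution, the zero-based split index s (corresponding to π=s+1), and all names and data are assumptions made for this example.

```python
import numpy as np

def merit_function(left_coeffs, right_coeffs):
    """Eq. (12): l1 norm of the concatenated weighting-coefficient vectors."""
    vectors = [np.ravel(a) for a in list(left_coeffs) + list(right_coeffs)]
    return float(np.sum(np.abs(np.concatenate(vectors))))

def dynamic_coefficients(frames, psi):
    """Stand-in decomposition (NOT a sparse solver): median common content,
    residuals expressed in the orthonormal basis psi."""
    frames = np.asarray(frames, dtype=float)
    common = np.median(frames, axis=0)
    return [psi.T @ (f - common) for f in frames]

def merit_values(frames, psi):
    """Merit value for every split s, with left = frames[:s], right = frames[s:]."""
    frames = np.asarray(frames, dtype=float)
    return {s: merit_function(dynamic_coefficients(frames[:s], psi),
                              dynamic_coefficients(frames[s:], psi))
            for s in range(1, len(frames))}

# Usage: four frames of one static scene followed by four of another.
scene_a = np.array([1.0, 2.0, 3.0, 4.0])
scene_b = np.array([9.0, 7.0, 5.0, 3.0])
video_demo = np.array([scene_a] * 4 + [scene_b] * 4)
mf_demo = merit_values(video_demo, np.eye(4))
best_split = min(mf_demo, key=mf_demo.get)
```

When the split coincides with the true scene change, each side contains a single scene and its dynamic content is small, so the combined coefficient vector has a small ℓ1 norm; a misplaced split forces frames of one scene to be represented as dynamic content relative to the other scene, which increases the norm.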
In a preferred embodiment, the get video segments step 206 of
It is to be understood that the exemplary embodiments disclosed herein are merely illustrative of the present invention and that many variations of the above-described embodiments can be devised by one skilled in the art without departing from the scope of the invention. It is therefore intended that all such variations be included within the scope of the following claims and their equivalents.