The present invention relates to digital video processing and editing, and, in particular embodiments, to apparatus and methods for video foreground-background segmentation with multi-view spatial-temporal graph cuts.
Digital image/video processing and editing includes separating the image foreground from the background in digital images and videos, including those captured from multiple viewpoints. This separation process is referred to as foreground-background segmentation. Image foreground-background segmentation is used in many video applications, such as video editing and composition, TV broadcasting, video surveillance, augmented reality, and other applications. For example, in the TV and movie industry, foreground-background segmentation is typically achieved using a green-screen approach, in which the background is covered with green cloth and the actors (in the foreground) wear different colors. A simple threshold in the green color channel can then be used during video editing to separate the foreground from the background. This approach has high accuracy. However, it can only be applied in a controlled color-scheme environment while the image/video is being generated. Other approaches to foreground-background segmentation include using a background model or marking the background/foreground. However, such approaches have varying accuracy depending on the image colors, or can be costly due to the required marking. There is a need for an effective image/video foreground-background segmentation that provides accurate results and can be applied in various environments, e.g., independently of how the image/video is generated.
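For illustration, the green-screen separation described above reduces to a per-pixel test on the color channels. The following is a minimal sketch in Python with NumPy; the threshold values and the channel test are illustrative assumptions, not production keying logic.

```python
import numpy as np

def green_screen_foreground(frame, green_min=120, other_max=100):
    """Return a boolean foreground mask for an RGB frame shot against
    a green screen. Pixels where green dominates and red/blue stay low
    are classified as background. Thresholds are illustrative only.
    """
    r, g, b = frame[..., 0], frame[..., 1], frame[..., 2]
    background = (g >= green_min) & (r <= other_max) & (b <= other_max)
    return ~background  # True where the pixel is foreground
```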
In accordance with an embodiment, a method for image foreground and background segmentation includes obtaining a plurality of video frames corresponding to a plurality of views for a video stream over time, and generating a graph-cut model for the video frames belonging to each one of the views using both color and image difference. The method further includes adding temporal links to the graph-cut model for each one of the views, and then generating a four-dimensional graph-cut model for the video frames by adding spatial links to the graph-cut model across the plurality of views. Foreground-background segmentation is then performed in the plurality of video frames using the four-dimensional graph-cut model.
In accordance with another embodiment, a method for image foreground and background segmentation includes generating, using first color and image feature models, a first graph-cut model for a plurality of first video frames belonging to a first view of a video stream, and generating, using second color and image feature models, a second graph-cut model for a plurality of second video frames belonging to a second view of a video stream. First temporal links are then added to the first graph-cut model, and second temporal links are added to the second graph-cut model. The method further includes adding spatial links across the first graph-cut model and the second graph-cut model to generate a four-dimensional graph-cut model for the first video frames with the second video frames. Foreground-background segmentation is then performed in the first video frames and the second video frames using the four-dimensional graph-cut model.
In accordance with yet another embodiment, an apparatus for image foreground and background segmentation includes at least one processor coupled to a memory, and a non-transitory computer readable storage medium storing programming for execution by the at least one processor. The programming includes instructions to obtain a plurality of video frames corresponding to a plurality of views for a video stream over time, and generate a graph-cut model for the video frames belonging to each one of the views using both color and image difference. The programming also includes instructions to add temporal links to the graph-cut model for each one of the views, and generate a four-dimensional graph-cut model for the video frames by adding spatial links to the graph-cut model across the plurality of views. Instructions to perform foreground-background segmentation in the plurality of video frames using the four-dimensional graph-cut model are also included in the programming.
The foregoing has outlined rather broadly the features of an embodiment of the present invention in order that the detailed description of the invention that follows may be better understood. Additional features and advantages of embodiments of the invention will be described hereinafter, which form the subject of the claims of the invention. It should be appreciated by those skilled in the art that the conception and specific embodiments disclosed may be readily utilized as a basis for modifying or designing other structures or processes for carrying out the same purposes of the present invention. It should also be realized by those skilled in the art that such equivalent constructions do not depart from the spirit and scope of the invention as set forth in the appended claims.
For a more complete understanding of the present invention, and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings.
Corresponding numerals and symbols in the different figures generally refer to corresponding parts unless otherwise indicated. The figures are drawn to clearly illustrate the relevant aspects of the embodiments and are not necessarily drawn to scale.
The making and using of the presently preferred embodiments are discussed in detail below. It should be appreciated, however, that the present invention provides many applicable inventive concepts that can be embodied in a wide variety of specific contexts. The specific embodiments discussed are merely illustrative of specific ways to make and use the invention, and do not limit the scope of the invention.
Background subtraction is one approach in image/video processing for performing foreground-background segmentation. In background subtraction, each input image is subtracted from a known background image to obtain a difference image. Typically, the known background image is a still or fixed image over time. For example, images of moving vehicles are subtracted from known road images, or images of people walking along a hallway are subtracted from a known hallway image. The difference image is used to classify whether a pixel in the image belongs to the foreground or the background. This approach can be useful in video surveillance, for example. Several extensions for building background models for this approach have been proposed. However, the approach remains sensitive to cases where the foreground color is similar or close to the background color.
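As a rough sketch of the subtraction step, the following Python/OpenCV fragment differences an input frame against a fixed background image and thresholds the result; the threshold value is an illustrative assumption.

```python
import cv2

def background_subtraction_mask(frame, background, thresh=30):
    """Classify pixels as foreground where the input frame differs
    from a known, fixed background image by more than a threshold.
    frame and background are same-size BGR uint8 images.
    """
    diff = cv2.absdiff(frame, background)          # per-channel |frame - background|
    gray = cv2.cvtColor(diff, cv2.COLOR_BGR2GRAY)  # collapse to a single channel
    _, mask = cv2.threshold(gray, thresh, 255, cv2.THRESH_BINARY)
    return mask  # 255 = foreground candidate, 0 = background
```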
In another approach to image foreground-background segmentation, markers such as line strokes are added to annotate the foreground and background of the images. This annotation is used to build foreground and background color models, for instance Gaussian Mixture Models (GMMs). A graph-cut based algorithm is then used to segment the remaining pixels into foreground and background by minimizing both the cost of fitting the GMMs and the cost of a color smoothness term. In addition to requiring annotation as input, the performance of this approach is limited when applied to video foreground-background segmentation.
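OpenCV's grabCut implements a closely related GMM-plus-graph-cut formulation; the sketch below uses it with stroke-style seeds to illustrate the annotation-based approach. The stroke inputs and iteration count are assumptions for the example.

```python
import cv2
import numpy as np

def segment_with_strokes(image, fg_strokes, bg_strokes, iterations=5):
    """Graph-cut segmentation seeded by user strokes. grabCut fits
    foreground/background GMMs from the seeds and minimizes a data
    term plus a color smoothness term, as outlined above.

    fg_strokes, bg_strokes: boolean H x W arrays marking annotated pixels.
    """
    mask = np.full(image.shape[:2], cv2.GC_PR_BGD, dtype=np.uint8)
    mask[bg_strokes] = cv2.GC_BGD  # definite background strokes
    mask[fg_strokes] = cv2.GC_FGD  # definite foreground strokes
    bgd_model = np.zeros((1, 65), np.float64)  # internal GMM state
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(image, mask, None, bgd_model, fgd_model,
                iterations, cv2.GC_INIT_WITH_MASK)
    return np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD))  # True = foreground
```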
Other approaches for image foreground-background segmentation may involve considering the similarity and constraints among multiple images and segmenting them simultaneously. Foreground-background segmentation for multiple images simultaneously is also referred to as co-segmentation. The multiple images can be frames from a video, images that contain the same foreground object but different backgrounds, or images from multiple viewpoints. Such methods usually require explicit three-dimensional (3D) image point reconstruction and involve iterative solutions.
System and method embodiments are provided herein for achieving multi-view video foreground-background segmentation with spatial-temporal graph cuts. A co-segmentation algorithm is used in which a four-dimensional (4D) graph is constructed by adding links across neighboring views over space and across consecutive frames over time. The links enforce segmentation consistency across multiple viewpoints and over time. The algorithm does not involve reconstructing 3D point graphs and adding them to the graph cuts. Instead, spatial links are added using pair-wise matched feature points between multiple views (e.g., from different cameras) of the same object. This approach avoids 3D reconstruction problems such as occlusion, and adds more spatial constraints for segmentation. The co-segmentation uses both the color values of each input image and the image difference between the input image and the background image. By using the background subtraction result as the initial segmentation seed, no user annotation is needed to perform the co-segmentation. The algorithm significantly improves the performance and robustness of foreground-background segmentation.
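To make the spatial-link idea concrete, the sketch below matches feature points between two synchronized views and returns the matched pixel pairs that would be linked in the 4D graph. ORB features and the match cap are illustrative choices; the actual matching method and link weights of the described algorithm are not specified here. Temporal links would analogously connect pixels of consecutive frames within each view.

```python
import cv2

def spatial_link_candidates(frame_view_a, frame_view_b, max_links=200):
    """Propose cross-view links by pairwise feature matching between two
    synchronized frames of the same scene. Each returned pair of (x, y)
    pixel coordinates would become a spatial link in the 4D graph.
    ORB + brute-force Hamming matching is an illustrative choice.
    """
    orb = cv2.ORB_create()
    kp_a, des_a = orb.detectAndCompute(frame_view_a, None)
    kp_b, des_b = orb.detectAndCompute(frame_view_b, None)
    if des_a is None or des_b is None:
        return []  # no features detected in one of the views
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des_a, des_b), key=lambda m: m.distance)
    return [(kp_a[m.queryIdx].pt, kp_b[m.trainIdx].pt)
            for m in matches[:max_links]]
```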
Based on the observation above, an algorithm is implemented to jointly use both color values and the image difference for foreground segmentation. Accordingly, a graph-cut problem can be formulated as an energy function in the form:
E(x, \omega_C, \omega_D, z, d) = U(x, \omega_C, \omega_D, z, d) + V(x, z, d),
where x is the per-pixel label (1 for foreground, 0 for background), ω_C and ω_D are the parameters of the color and image-difference models, respectively, z is the pixel color value, and d is the image difference value. The term U is a data term that measures how well each pixel fits the foreground and background models, and can be defined in terms of background and foreground likelihoods h_B and h_F for color (superscript C) and image difference (superscript D) as:
U(x, \omega_C, \omega_D, z, d) = -\sum_p \big( \alpha_C \log h_B^C(z_p) + \alpha_D \log h_B^D(d_p) \big)\,[x_p = 0] - \sum_p \big( \alpha_C \log h_F^C(z_p) + \alpha_D \log h_F^D(d_p) \big)\,[x_p = 1].
The term V is a smoothness term and can be defined as:
V(x, z, d) = \sum_{(p,q) \in N} \operatorname{dist}(p, q)^{-1} \big( \gamma_C \exp\{-\beta_C \lVert z_p - z_q \rVert^2\} + \gamma_D \exp\{-\beta_D \lVert d_p - d_q \rVert^2\} \big)\,[x_p \neq x_q].
In the energy function E(x, ω_C, ω_D, z, d), N denotes the set of neighboring pixel pairs and dist(p, q) is the spatial distance between pixels p and q. Both the data term U and the smoothness term V are weighted linear combinations of a color part and an image-difference part. Color GMMs and image-difference GMMs can be trained in a pre-processing step to match pixel labels according to the energy function. The weight terms α_C, α_D, γ_C, and γ_D are used to control the relative importance of the color and the image difference in the formulation. The weights are nonnegative values, where α_C + α_D = 1 and γ_C + γ_D = 1.
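As an illustration of how the data term U could be evaluated, the sketch below fits color and image-difference GMMs from seed pixels (e.g., the background-subtraction result) and computes the weighted negative log-likelihoods for each pixel. The component count, seed source, and default weights are assumptions; the resulting costs would be fed to a graph-cut solver as terminal-link weights.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_models(fg_colors, bg_colors, fg_diffs, bg_diffs, k=5):
    """Fit the four GMMs (foreground/background x color/difference)
    from seed pixels. k components per mixture is illustrative."""
    return {name: GaussianMixture(n_components=k).fit(data)
            for name, data in (("fg_color", fg_colors), ("bg_color", bg_colors),
                               ("fg_diff", fg_diffs), ("bg_diff", bg_diffs))}

def data_term(models, z, d, alpha_c=0.5, alpha_d=0.5):
    """Per-pixel unary costs in the form of U above: weighted sums of
    negative color and image-difference log-likelihoods, with
    alpha_c + alpha_d = 1. z: N x 3 pixel colors; d: matching differences.
    """
    cost_bg = -(alpha_c * models["bg_color"].score_samples(z)
                + alpha_d * models["bg_diff"].score_samples(d))
    cost_fg = -(alpha_c * models["fg_color"].score_samples(z)
                + alpha_d * models["fg_diff"].score_samples(d))
    return cost_fg, cost_bg  # t-link weights for a graph-cut solver
```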
To evaluate the performance of the algorithm 500, a database with ground-truth segmentation is constructed. The database has three scenes: a Yoga scene representing slow motion, a KongFu scene representing fast motion, and a Two-Person-Game scene representing occlusion cases. In the KongFu and Two-Person-Game scenes, the subjects' clothing is similar in color to parts of the background, which makes the scenes more realistic and challenging. Each scene is captured by four cameras oriented at 0, 90, 180, and 270 degrees. For each camera and each scene, the foreground and background are labeled as the ground truth. The images are captured with PointGrey Cricket cameras at 1920×1080 resolution and 30 frames per second (fps).
where Area(.) returns the number of non-zero pixels in the region. The Ratio, as defined above, is a value from 0 to 1. The higher the value of the Ratio, the better the segmentation matches the ground truth.
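The formula for the Ratio is not reproduced in this excerpt. One overlap measure consistent with the stated properties (a 0-to-1 value where higher means a closer match) is intersection-over-union; the sketch below assumes that definition for illustration only.

```python
import numpy as np

def overlap_ratio(segmentation, ground_truth):
    """Hypothetical Ratio implementation assuming an intersection-over-
    union measure; the document's exact definition is not shown here.
    Area(.) corresponds to counting non-zero pixels.
    """
    seg = segmentation.astype(bool)
    gt = ground_truth.astype(bool)
    union = np.count_nonzero(seg | gt)
    if union == 0:
        return 1.0  # degenerate case: both masks empty
    return np.count_nonzero(seg & gt) / union
```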
The CPU 1310 may comprise any type of electronic data processor. The memory 1320 may comprise any type of system memory such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous DRAM (SDRAM), read-only memory (ROM), a combination thereof, or the like. In an embodiment, the memory 1320 may include ROM for use at boot-up, and DRAM for program and data storage for use while executing programs. In embodiments, the memory 1320 is non-transitory. The mass storage device 1330 may comprise any type of storage device configured to store data, programs, and other information and to make the data, programs, and other information accessible via the bus. The mass storage device 1330 may comprise, for example, one or more of a solid-state drive, a hard disk drive, a magnetic disk drive, an optical disk drive, or the like.
The video adapter 1340 and the I/O interface 1360 provide interfaces to couple external input and output devices to the processing unit. As illustrated, examples of input and output devices include a display 1390 coupled to the video adapter 1340 and any combination of mouse/keyboard/printer 1370 coupled to the I/O interface 1360. Other devices may be coupled to the processing unit 1301, and additional or fewer interface cards may be utilized. For example, a serial interface card (not shown) may be used to provide a serial interface for a printer.
The processing unit 1301 also includes one or more network interfaces 1350, which may comprise wired links, such as an Ethernet cable or the like, and/or wireless links to access nodes or one or more networks 1380. The network interface 1350 allows the processing unit 1301 to communicate with remote units via the networks 1380. For example, the network interface 1350 may provide wireless communication via one or more transmitters/transmit antennas and one or more receivers/receive antennas. In an embodiment, the processing unit 1301 is coupled to a local-area network or a wide-area network for data processing and communications with remote devices, such as other processing units, the Internet, remote storage facilities, or the like.
While several embodiments have been provided in the present disclosure, it should be understood that the disclosed systems and methods might be embodied in many other specific forms without departing from the spirit or scope of the present disclosure. The present examples are to be considered as illustrative and not restrictive, and the intention is not to be limited to the details given herein. For example, the various elements or components may be combined or integrated in another system or certain features may be omitted, or not implemented.
In addition, techniques, systems, subsystems, and methods described and illustrated in the various embodiments as discrete or separate may be combined or integrated with other systems, modules, techniques, or methods without departing from the scope of the present disclosure. Other items shown or discussed as coupled or directly coupled or communicating with each other may be indirectly coupled or communicating through some interface, device, or intermediate component whether electrically, mechanically, or otherwise. Other examples of changes, substitutions, and alterations are ascertainable by one skilled in the art and could be made without departing from the spirit and scope disclosed herein.