The present disclosure generally relates to video processing, and more particularly, to a system and method for tracking objects utilizing a contour weighting map.
Over the years, digital content has gained increasing popularity with consumers. Through the Internet, consumers have access to an ever-growing amount of digital content using computers, smartphones, and other sources. Furthermore, many devices (e.g., smartphones) and services are readily available that allow consumers to capture and generate video content.
Upon capturing or downloading video content, the process of tracking objects is commonly performed for editing purposes. For example, a user may wish to augment a video with special effects where one or more graphics are superimposed onto an object. In this regard, precise tracking of the object is important. However, challenges may arise when tracking objects, particularly as the object moves from frame to frame. This may cause, for example, the object to vary in shape and size. Additional challenges may arise when the object includes regions or elements that easily blend in with the background. This may be due to the thickness of the elements, the color make-up of the elements, and/or other attributes of the elements.
Briefly described, one embodiment, among others, is a method for tracking an object in a plurality of frames, comprising obtaining a reference contour of an object in a reference frame and estimating, for a current frame after the reference frame, a contour of the object. The method further comprises comparing the reference contour with the estimated contour and determining at least one local region of the reference contour in the reference frame based on a difference between the reference contour and the estimated contour. Based on the difference, at least one corresponding region of the current frame is determined. The method further comprises computing a degree of similarity between the at least one corresponding region in the current frame and the at least one local region in the reference frame, adjusting the estimated contour in the current frame according to the degree of similarity, and designating the current frame as a new reference frame and a frame after the new reference as a new current frame.
Another embodiment is a system for tracking an object in a plurality of frames, comprising a processing device. The system further comprises an object selector executable in the processing device for obtaining a reference contour of an object in a reference frame and a contour estimator executable in the processing device for estimating, for a current frame after the reference frame, a contour of the object. The system further comprises a local region analyzer executable in the processing device for: comparing the reference contour with the estimated contour, determining at least one local region of the reference contour in the reference frame based on a difference between the reference contour and the estimated contour, determining at least one corresponding region of the current frame based on the difference, and computing a degree of similarity between the at least one corresponding region in the current frame and the at least one local region in the reference frame. The contour estimator adjusts the estimated contour in the current frame according to the degree of similarity and designates the current frame as a new reference frame and a frame after the new reference as a new current frame.
Another embodiment is a non-transitory computer-readable medium embodying a program executable in a computing device, comprising code that generates a user interface and obtains a reference contour of an object in a reference frame, code that estimates, for a current frame after the reference frame, a contour of the object, code that compares the reference contour with the estimated contour and code that determines at least one local region of the reference contour in the reference frame based on a difference between the reference contour and the estimated contour. The program further comprises code that determines at least one corresponding region of the current frame based on the difference, code that computes a degree of similarity between the at least one corresponding region in the current frame and the at least one local region in the reference frame, code that adjusts the estimated contour in the current frame according to the degree of similarity, and code that designates the current frame as a new reference frame and a frame after the new reference as a new current frame.
Other systems, methods, features, and advantages of the present disclosure will be or become apparent to one with skill in the art upon examination of the following drawings and detailed description. It is intended that all such additional systems, methods, features, and advantages be included within this description, be within the scope of the present disclosure, and be protected by the accompanying claims.
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawings will be provided by the Office upon request and payment of the necessary fee.
Many aspects of the disclosure can be better understood with reference to the following drawings. The components in the drawings are not necessarily to scale, emphasis instead being placed upon clearly illustrating the principles of the present disclosure. Moreover, in the drawings, like reference numerals designate corresponding parts throughout the several views.
The process of tracking one or more objects within a video stream may be challenging, particularly when the object moves from frame to frame, as the object may vary in shape and size when moving from one location to another. Additional challenges may arise when the object includes regions or elements that tend to blend in with the background. In order to produce high quality video editing results, an object tracking system should accurately estimate the contour of the object as the object moves. However, the object tracking process may occasionally yield erroneous results. For example, in some cases, one or more portions of the object being tracked will not be completely surrounded by the estimated contour that corresponds to an estimation of where and how the object is positioned. As temporal dependency exists in the object tracking process, an erroneous tracking result will, in many cases, lead to a series of erroneous results, thereby affecting the video editing process that follows.
In some cases, the user can reduce the number of erroneous results by manually refining the estimated contour on a frame-by-frame basis as needed and then allowing the tracking system to resume object tracking based on the refinements made by the user. However, if a portion of the object is difficult to track due to its color, shape, contour, or other attributes, the object tracking algorithm may continually yield erroneous results for that portion. The user must then constantly refine the tracking results in order to produce an accurate estimated contour of the object, which can be a time-consuming process.
Various embodiments are disclosed for improving the tracking of objects within an input stream of frames, particularly for objects that include elements or regions that may be difficult to track by conventional systems due to color, shape, contour, and other attributes. For some embodiments, the position and contour of the object are estimated on a frame-by-frame basis. The user selects a frame in the video and manually specifies the contour of an object in the frame. As described in more detail below, for the video frames that follow, the object tracking system iteratively performs a series of operations that include refining estimated contours based on the contour in a previous frame.
First, an object contour in the current video frame is received from the user and designated as a reference contour. An object tracking algorithm is then utilized to estimate the object contour in the next video frame, and a tracking result is generated whereby an estimated contour is derived. The object tracking system compares the generated tracking result with the recorded reference contour, and a “local region” corresponding to a region containing the difference in contour between the two is derived. Based on the content of the local region in the current video frame and the content of the local region in the next video frame, the object tracking system computes the similarity of the corresponding local regions between the two video frames, and refines the tracking result (i.e., the estimated contour) of the next frame according to the degree of similarity. The iterative tracking process continues until all the frames are processed or until the user stops the tracking process.
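By way of illustration only, the iterative process described above may be sketched as follows. This is a minimal sketch rather than the disclosed implementation: the names `track`, `estimate`, `refine`, and `should_stop` are hypothetical, and the contour is assumed to be represented as the set of pixel coordinates it encloses. The tracking algorithm and the refinement step are passed in as callables because the disclosure does not tie the loop to any particular algorithm.

```python
from typing import Callable, List, Sequence, Set, Tuple

Pixel = Tuple[int, int]   # (row, column); an assumed representation
Contour = Set[Pixel]      # pixels enclosed by a contour; an assumed representation


def track(
    frames: Sequence[object],
    user_contour: Contour,
    estimate: Callable[[object, Contour, object], Contour],
    refine: Callable[[object, Contour, object, Contour], Contour],
    should_stop: Callable[[], bool] = lambda: False,
) -> List[Contour]:
    """Return one contour per processed frame, starting with the user's contour."""
    results = [user_contour]
    reference_frame, reference = frames[0], user_contour
    for current_frame in frames[1:]:
        if should_stop():  # the user stops the tracking process
            break
        # Estimate the contour in the next frame from the reference contour.
        estimated = estimate(reference_frame, reference, current_frame)
        # Compare against the reference and refine according to local similarity.
        refined = refine(reference_frame, reference, current_frame, estimated)
        results.append(refined)
        # The current frame becomes the new reference frame.
        reference_frame, reference = current_frame, refined
    return results
```

In this sketch the refined contour, not the raw estimate, is carried forward as the new reference, which reflects the temporal dependency noted earlier: an uncorrected error in one frame would otherwise propagate to the frames that follow.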
A system for facilitating object tracking is now described, followed by a discussion of the operation of the components within the system.
For embodiments where the video editing system 102 is embodied as a smartphone 109 or tablet, the user may interface with the video editing system 102 via a touchscreen interface (not shown). In other embodiments, the video editing system 102 may be embodied as a video gaming console 171, which includes a video game controller 172 for receiving user preferences. For such embodiments, the video gaming console 171 may be connected to a television (not shown) or other display 104.
The video editing system 102 is configured to retrieve, via the media interface 112, digital media content 115 stored on a storage medium 120 such as, by way of example and without limitation, a compact disc (CD) or a universal serial bus (USB) flash drive, wherein the digital media content 115 may then be stored locally on a hard drive of the video editing system 102. As one of ordinary skill will appreciate, the digital media content 115 may be encoded in any of a number of formats including, but not limited to, Motion Picture Experts Group (MPEG)-1, MPEG-2, MPEG-4, H.264, Third Generation Partnership Project (3GPP), 3GPP-2, Standard-Definition Video (SD-Video), High-Definition Video (HD-Video), Digital Versatile Disc (DVD) multimedia, Video Compact Disc (VCD) multimedia, High-Definition Digital Versatile Disc (HD-DVD) multimedia, Digital Television Video/High-definition Digital Television (DTV/HDTV) multimedia, Audio Video Interleave (AVI), Digital Video (DV), QuickTime (QT) file, Windows Media Video (WMV), Advanced System Format (ASF), Real Media (RM), Flash Media (FLV), an MPEG Audio Layer III (MP3), an MPEG Audio Layer II (MP2), Waveform Audio Format (WAV), Windows Media Audio (WMA), or any number of other digital formats.
As depicted in
The digital camera 107 may also be coupled to the video editing system 102 over a wireless connection or other communication path. The video editing system 102 may be coupled to a network 118 such as, for example, the Internet, intranets, extranets, wide area networks (WANs), local area networks (LANs), wired networks, wireless networks, or other suitable networks, etc., or any combination of two or more such networks. Through the network 118, the video editing system 102 may receive digital media content 115 from another computing system 103. Alternatively, the video editing system 102 may access one or more video sharing websites 134 hosted on a server 137 via the network 118 to retrieve digital media content 115.
The object selector 114 in the video editing system 102 is configured to obtain an object contour selection from the user of the video editing system 102, where the user is viewing and/or editing the media content 115 obtained by the media interface 112. For some embodiments, the object selection is used as a reference contour from which a local region is derived for purposes of refining subsequent contour estimations, as described in more detail below.
The contour estimator 116 is configured to estimate a contour on a frame-by-frame basis for the object being tracked. The local region analyzer 119 determines a local region based on a difference between the reference contour and the estimated contour. As referred to herein, a “local region” generally refers to one or more areas or regions within a given frame corresponding to a portion or element of an object that is lost or erroneously added during the tracking process. To further illustrate the concept of a local region, reference is made briefly to
As shown in
Turning now to
The processing device 202 may include any custom made or commercially available processor, a central processing unit (CPU) or an auxiliary processor among several processors associated with the video editing system 102, a semiconductor based microprocessor (in the form of a microchip), a macroprocessor, one or more application specific integrated circuits (ASICs), a plurality of suitably configured digital logic gates, and other well known electrical configurations comprising discrete elements both individually and in various combinations to coordinate the overall operation of the computing system.
The memory 214 can include any one or a combination of volatile memory elements (e.g., random-access memory (RAM), such as DRAM and SRAM) and nonvolatile memory elements (e.g., ROM, hard drive, tape, CDROM, etc.). The memory 214 typically comprises a native operating system 217, one or more native applications, emulation systems, or emulated applications for any of a variety of operating systems and/or emulated hardware platforms, emulated operating systems, etc.
The applications may include application specific software which may comprise some or all the components (media interface 112, object selector 114, contour estimator 116, local region analyzer 119) of the video editing system 102 depicted in
Input/output interfaces 204 provide any number of interfaces for the input and output of data. For example, where the video editing system 102 comprises a personal computer, these components may interface with one or more user input devices via the I/O interfaces 204, where the user input devices may comprise a keyboard 106 (
In the context of this disclosure, a non-transitory computer-readable medium stores programs for use by or in connection with an instruction execution system, apparatus, or device. More specific examples of a computer-readable medium may include by way of example and without limitation: a portable computer diskette, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM, EEPROM, or Flash memory), and a portable compact disc read-only memory (CDROM) (optical).
With further reference to
Reference is made to
Although the flowchart of
Beginning with block 310, the object selector 114 (
The user utilizes the region selection tool to specify or define the contour of the object to be tracked in a video stream. After the tracking results are generated as described in more detail below, the tracking results may then be utilized for video editing. For example, the user may elect to adjust the color and/or brightness of the object or augment frames with the object with content from another video stream.
In block 315, the contour estimator 116 (
In block 330, the local region analyzer 119 (
In block 340, the local region analyzer 119 determines a local region based on a difference between the reference contour and the estimated contour. Referring back to the illustration of
With reference back to
The sum of absolute difference (SAD) metric used to compute the degree of similarity is described in connection with
Determination of the SAD metric comprises computing the absolute difference of pixel values for every pair of pixels and then accumulating the absolute differences as a measurement between the two regions. A smaller SAD value indicates a higher similarity between two regions, while a larger SAD value indicates that the two regions are different. In the examples shown in
Many times during the tracking process, however, the local regions cannot be located precisely due, for example, to a small shift or deformation between the video frames or an error in the contour estimation. An example of such a scenario is shown in
Thus in accordance with various embodiments, a robust measurement is utilized where the SAD metric accurately evaluates local regions that are slightly misaligned while still accurately identifying local regions with a low degree of similarity. To achieve this, an alternative SAD technique is implemented for various embodiments. With reference to
In the example shown, a local search reveals that pixel B′ has the same value as anchor pixel A′ and is therefore selected for purposes of computing the absolute difference. A local search is performed for a plurality of pixel pairs to match a pixel in one frame to another pixel in the other frame. The range of the local search should be small enough that local regions with obviously different content are still identified as different, while still accommodating a misalignment of the local regions by one or two pixels. In this example, multiple searches are performed for the regions 1206 and 1208 to compute their SAD value. Each search yields a pixel pair from one region to the other region. Each local search may also select a pixel at a different position relative to the anchor pixel used for that search. For example, the selected pixel B′ is one pixel to the left of the anchor pixel A′, but the pixel selected in another search may lie at a different position relative to its anchor pixel. This allows pixel matching between two regions where slight deformation occurs, which is typical during video tracking.
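For purposes of illustration, a single local search of the kind described above may be sketched as follows. The function name `best_match`, the square search window, and the default radius of one pixel are assumptions made for this sketch. Frames are assumed to be lists of rows of single-channel values, and the anchor is assumed to lie inside the frame.

```python
from typing import List, Tuple


def best_match(
    frame: List[List[int]],
    anchor: Tuple[int, int],
    value: int,
    radius: int = 1,
) -> Tuple[Tuple[int, int], int]:
    """Search around `anchor` for the pixel whose value is closest to `value`.

    Returns ((row_offset, col_offset), absolute_difference) for the selected
    pixel. The first candidate in scan order wins when several tie.
    """
    rows, cols = len(frame), len(frame[0])
    anchor_row, anchor_col = anchor
    best = None
    for d_row in range(-radius, radius + 1):
        for d_col in range(-radius, radius + 1):
            r, c = anchor_row + d_row, anchor_col + d_col
            if 0 <= r < rows and 0 <= c < cols:  # ignore positions off the frame
                diff = abs(frame[r][c] - value)
                if best is None or diff < best[1]:
                    best = ((d_row, d_col), diff)
    return best


# A bright pixel (value 200) sits one column to the left of the anchor.
other_frame = [
    [10, 10, 10, 10],
    [10, 200, 10, 10],
    [10, 10, 10, 10],
]
print(best_match(other_frame, anchor=(1, 2), value=200))  # ((0, -1), 0)
```

With `radius=0` the search window collapses to the anchor pixel itself, so the same call would report the full difference of 190 instead of a match.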
Based on the disclosed local search mechanism, the final SAD value is computed based on the following formula:
$$\mathrm{SAD}(R_1, R_2) = \sum_{p_i \in P_1} \; \min_{q_j \in S(\mathrm{anchor}(p_i))} D\big(v(p_i), v(q_j)\big)$$
where $R_1$ and $R_2$ are the two regions and $P_1$ is a set of pixels, which can be all pixels or a subset of the pixels in $R_1$. For each pixel $p_i$ in $P_1$, $\mathrm{anchor}(p_i)$ is the anchor pixel in the video frame containing $R_2$. The anchor pixel corresponds to the pixel $p_i$ and can be determined from the locations of the two regions in the video frames. $S(\mathrm{anchor}(p_i))$ represents the set of pixels forming the search region around $\mathrm{anchor}(p_i)$, and the search is performed over each pixel $q_j$ in the search region. The values of pixels $p_i$ and $q_j$ are represented as $v(p_i)$ and $v(q_j)$, and $D(v(p_i), v(q_j))$ is a metric for computing the absolute difference of the values, where $v(p_i) = \{v_1(p_i), \ldots, v_n(p_i)\}$ and $v(q_j) = \{v_1(q_j), \ldots, v_n(q_j)\}$.
In various embodiments, each pixel contains a fixed number $n$ of channels, where $n$ is at least one, and there is a value for each channel. $D(v(p_i), v(q_j))$ corresponds to the absolute difference of the values according to one of the following formulas:
$$D\big(v(p_i), v(q_j)\big) = \sum_{k=1}^{n} \left| v_k(p_i) - v_k(q_j) \right|,$$
$$D\big(v(p_i), v(q_j)\big) = \sum_{k=1}^{n} \big( v_k(p_i) - v_k(q_j) \big)^2, \text{ or}$$
$$D\big(v(p_i), v(q_j)\big) = \sqrt{\sum_{k=1}^{n} \big( v_k(p_i) - v_k(q_j) \big)^2},$$
where $|x|$ is the absolute value of $x$. The first metric corresponds to computing the absolute difference between the values of the two pixels for each channel and then accumulating the absolute differences among all channels. However, in some cases, another metric may be used to represent the discrimination of pixel values, such as computing the squared values of the differences and then accumulating the squared values. The pixel $q_j$ that contributes to the summation in $\mathrm{SAD}(R_1, R_2)$ is the pixel which results in the minimal absolute difference within the search region. By leveraging this revised SAD technique, the SAD value computed from local regions 1206, 1208 is a relatively small value and indicates a high degree of similarity between the local regions 1206, 1208.
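As a non-limiting sketch, the revised SAD technique and the three per-pixel metrics described above may be expressed as follows. The function names, the square search window, and the use of a single `offset` to map each pixel of the first region to its anchor pixel are assumptions made for this sketch. Frames are assumed to be lists of rows of per-channel value tuples, and every anchor is assumed to fall inside the second frame.

```python
import math
from typing import Callable, Iterable, List, Sequence, Tuple

Value = Sequence[float]    # one value per channel
Frame = List[List[Value]]  # frame[row][col] -> channel values


def d_abs(v1: Value, v2: Value) -> float:
    """Sum of per-channel absolute differences."""
    return sum(abs(a - b) for a, b in zip(v1, v2))


def d_squared(v1: Value, v2: Value) -> float:
    """Sum of per-channel squared differences."""
    return sum((a - b) ** 2 for a, b in zip(v1, v2))


def d_euclidean(v1: Value, v2: Value) -> float:
    """Square root of the sum of per-channel squared differences."""
    return math.sqrt(d_squared(v1, v2))


def tolerant_sad(
    frame1: Frame,
    pixels1: Iterable[Tuple[int, int]],
    frame2: Frame,
    offset: Tuple[int, int] = (0, 0),
    radius: int = 1,
    metric: Callable[[Value, Value], float] = d_abs,
) -> float:
    """Accumulate, for each pixel, the minimal difference found near its anchor."""
    rows, cols = len(frame2), len(frame2[0])
    total = 0.0
    for r, c in pixels1:
        anchor_row, anchor_col = r + offset[0], c + offset[1]
        # Search region: positions within `radius` of the anchor, clipped to the frame.
        candidates = [
            metric(frame1[r][c], frame2[qr][qc])
            for qr in range(anchor_row - radius, anchor_row + radius + 1)
            for qc in range(anchor_col - radius, anchor_col + radius + 1)
            if 0 <= qr < rows and 0 <= qc < cols
        ]
        total += min(candidates)  # only the best-matching pixel contributes
    return total
```

Setting `radius=0` reduces the computation to the conventional SAD, which makes it straightforward to compare the two measurements on regions that are misaligned by a pixel.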
Thus, when the local regions 602a, 602b are very similar across two frames, an estimated contour with the local region(s) omitted will likely be an erroneous estimate as the estimated contour differs substantially from the previously estimated contour. In cases where there is not a large degree of similarity of the local regions 602a, 602b across two frames, this typically means that the object has moved significantly or the shape of the object has changed substantially between frames. For such cases, no further refinement is made to the estimated contour.
In block 360, based on the degree of similarity, the contour estimator 116 adjusts or further refines the estimated contour. In cases where there is a large degree of similarity between the local regions 602a, 602b across two frames and where the respective estimated contours differ substantially (e.g., where one of the estimated contours is missing the local region), the contour estimator 116 may be configured to incorporate the missing local region(s) into the erroneous estimated contour as part of the refinement process.
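The refinement decision described above may be sketched as follows, by way of example only. The function name `refine_contour` and the `threshold` parameter are hypothetical. Contours are assumed to be represented as sets of enclosed pixel coordinates, the degree of similarity is assumed to be supplied as a SAD value (smaller meaning more similar), and the missing region is assumed to be restored at the same pixel coordinates it occupied in the reference frame.

```python
from typing import Set, Tuple

Pixel = Tuple[int, int]


def refine_contour(
    reference: Set[Pixel],
    estimated: Set[Pixel],
    sad_value: float,
    threshold: float,
) -> Set[Pixel]:
    """Restore pixels lost from the estimate when the local regions are similar."""
    lost = reference - estimated  # local region missing from the estimate
    if lost and sad_value <= threshold:
        # High similarity: the omission is likely a tracking error, so the
        # missing local region is incorporated into the estimated contour.
        return estimated | lost
    # Low similarity: the object likely moved or changed shape; no refinement.
    return set(estimated)
```

The threshold separating a "large degree of similarity" from a small one is not specified here and would need to be chosen for the content being tracked.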
To further illustrate the operations discussed above for blocks 350 and 360, reference is made to
In the example of
At decision block 370, a determination is made on whether the last frame in the video stream has been processed or whether the user wishes to stop the tracking process. If neither condition is true, the tracking process resumes at block 315, where the contour estimation and local region comparison operations outlined in the blocks that follow are repeated. Returning to decision block 370, if at least one of the conditions is true, then the object tracking process stops, and the user may then perform other operations via the video editing system 102, such as editing the tracked object based on the tracking results derived in the blocks above.
To further illustrate the various concepts disclosed, reference is made to
The contour drawn around the object is represented by the outline surrounding the object. For the video frames that follow (as shown in
Typically, the object being tracked moves or the shape of the object changes over time. However, the amount of movement tends to be fairly small within a short amount of time. Successive frames in a video are typically spaced apart by approximately 1/30th of a second. Thus, even if the object is moving or if the shape of the object changes, the rate of change is relatively small on a frame-by-frame basis.
Based on the information represented by the arrows in
Assume, for purposes of illustration, that the object tracking algorithm loses track of one or more portions/regions of the object 902. As shown in
As shown in
Note that the refinement technique disclosed may also remove regions that are erroneously included in a contour estimation. Reference is made to
With reference to
In some cases, certain restrictions may be implemented during the object tracking process disclosed in order to further enhance the accuracy of generating an estimated contour. For embodiments of the object tracking technique disclosed, a major assumption is that the previous tracking result contains an accurate estimation of the contour. Based on this assumption, the estimated contour may be further refined on a frame-by-frame basis.
Over time, however, the contour of the object may change substantially, thereby resulting in erroneous adjustments made based on an erroneous contour. As such, attributes other than the local regions may also be compared, where such attributes include, for example, the color of the object and the color of the background. If the color of a local region is close to the background color, then refining the estimated contour using this region may lead to an erroneous refinement, because the region cannot be reliably distinguished from the background. As such, by utilizing other comparisons, the refinement process may be improved.
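One possible form of the color comparison described above is sketched below. The disclosure does not specify how closeness of color is measured, so the use of mean colors, a Euclidean distance, and the `max_distance` value are all assumptions of this sketch, as are the function names. Frames are assumed to be lists of rows of per-channel value tuples, and both pixel sets are assumed to be non-empty.

```python
import math
from typing import Iterable, List, Sequence, Tuple

Frame = List[List[Sequence[float]]]  # frame[row][col] -> channel values


def mean_color(frame: Frame, pixels: Iterable[Tuple[int, int]]) -> Tuple[float, ...]:
    """Average each channel over the given pixel coordinates."""
    coords = list(pixels)
    channels = len(frame[0][0])
    return tuple(
        sum(frame[r][c][k] for r, c in coords) / len(coords) for k in range(channels)
    )


def blends_with_background(
    frame: Frame,
    region: Iterable[Tuple[int, int]],
    background: Iterable[Tuple[int, int]],
    max_distance: float = 30.0,  # assumed cutoff, not taken from the disclosure
) -> bool:
    """Return True when the region's mean color is close to the background's."""
    region_color = mean_color(frame, region)
    background_color = mean_color(frame, background)
    distance = math.sqrt(
        sum((a - b) ** 2 for a, b in zip(region_color, background_color))
    )
    return distance <= max_distance
```

A caller could skip the refinement for any local region where this check returns True, since a similarity measurement on such a region would say little about whether the region belongs to the object.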
To further illustrate, reference is made to
As shown in the example of
For some embodiments, the original contour shape 1102 specified by the user is compared to the reference contour 1110 by calculating a degree of similarity between the original contour shape 1102 and the reference contour 1110 to determine whether the two are substantially similar. If the reference contour 1110 is substantially similar to the original contour 1102 specified by the user, then looser restrictions are applied, otherwise stricter restrictions are applied.
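The selection between looser and stricter restrictions may be sketched as follows, for illustration only. Several details are assumed: the degree of similarity between the two contours is measured here with the Jaccard index of their enclosed pixel sets, the restriction being adjusted is taken to be the SAD threshold a local region must satisfy before refinement, and the numeric values are placeholders. None of these choices is taken from the disclosure.

```python
from typing import Set, Tuple

Pixel = Tuple[int, int]


def jaccard(a: Set[Pixel], b: Set[Pixel]) -> float:
    """Overlap of two pixel sets: |intersection| / |union| (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pick_sad_threshold(
    original: Set[Pixel],
    reference: Set[Pixel],
    similar_at: float = 0.8,  # assumed cutoff for "substantially similar"
    loose: float = 40.0,      # assumed looser restriction
    strict: float = 10.0,     # assumed stricter restriction
) -> float:
    """Allow a larger SAD threshold only while the reference still resembles the original."""
    if jaccard(original, reference) >= similar_at:
        return loose
    return strict
```

Because the Jaccard index compares pixel positions directly, this sketch would treat an object that has merely translated as dissimilar; a shape comparison that is invariant to position could be substituted where that behavior is not desired.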
It should be emphasized that the above-described embodiments of the present disclosure are merely possible examples of implementations set forth for a clear understanding of the principles of the disclosure. Many variations and modifications may be made to the above-described embodiment(s) without departing substantially from the spirit and principles of the disclosure. All such modifications and variations are intended to be included herein within the scope of this disclosure and protected by the following claims.
This application claims priority to, and the benefit of, U.S. Provisional Patent Application entitled, “Systems and Methods for Tracking Objects,” having Ser. No. 61/724,389, filed on Nov. 9, 2012, which is incorporated by reference in its entirety.
Number | Date | Country | |
---|---|---|---|
20140133701 A1 | May 2014 | US |
Number | Date | Country | |
---|---|---|---|
61724389 | Nov 2012 | US |