Embodiments of the invention relate to image processing and, in particular examples, to mean shift-based visual tracking in target representation and localization.
Systems and methods have been developed for defining an object in video and for tracking that object through the frames of the video. In various applications, a person may be the "object" to be tracked. For example, sports video applications and applications using surveillance cameras often need to follow the actions of a person.
Related work has mostly used background information to build a discrimination measure. For example, some related work searches for the best scale in scale space using Difference of Gaussian filters or level set functions, which is time consuming. A simpler method searches for the scale using the same metric as in location estimation, which results in the shrinkage problem. Other related work uses multiple kernels to model the relationship between the target appearance and its motion characteristics, which yields complex and noise-sensitive algorithms. Some related work addresses template update only, using Kalman filtering or adaptive alpha-blending to update the histogram, but still suffers from accumulated errors.
In a first embodiment, an image processing method is performed on a video image that includes an initial frame and a plurality of subsequent frames. An object is located within the initial frame of the video image and a histogram related to the object is generated. A foreground map that includes the object is also generated. For each subsequent frame, a mean shift iteration is performed to adjust the location of the object within the current frame. The histogram related to the object and the foreground map can then be updated.
In certain embodiments, the mean shift iteration includes performing first, second and third searches for the object within the current frame. The first search is performed with an original scale of the object, the second search is performed with an enlarged scale of the object, and the third search is performed with a shrunken scale of the object. The best match of the three searches can then be selected.
Other embodiments and features are described herein.
For a more complete understanding of the present invention, and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
The making and using of the presently preferred embodiments are discussed in detail below. It should be appreciated, however, that the present invention provides many applicable inventive concepts that can be embodied in a wide variety of specific contexts. The specific embodiments discussed are merely illustrative of specific ways to make and use the invention, and do not limit the scope of the invention.
Embodiments of the present invention contribute up to three features for color histogram-based visual object tracking: 1) adoption of a mean shift-based object tracking method with soft constraints of the foreground map, which can be generated from either motion segmentation, background modeling and subtraction, or the field model; 2) update of the target histogram in a conservative way by aligning it with the initial target model; and/or 3) scale adaption where the optimal scale in histogram matching should make the tracked target most discriminative from its surrounding background.
In various embodiments, the present invention provides methods for overcoming shortcomings of the mean shift-based method in tracking a fast moving object. The methods include adding soft constraints of the foreground map from segmentation into the mean shift-based tracking framework, which alleviates the shortcoming in handling the fast moving object. The methods also include adapting the scale in histogram-based visual tracking by a proposed discriminant function, which takes into account the discrimination of the target from its surrounding background to avoid the "shrinkage" problem in scale determination. The methods also include updating the color histogram-based appearance model in visual tracking in a conservative way to cope with drifting artifacts, so that the drifting errors are further reduced.
Numerous benefits are achieved using aspects of the present invention over conventional techniques. Several embodiments, which can be utilized individually or in various combinations, are provided. For example, a foreground map is introduced into the mean shift iteration, modifying the distribution and mean shift vector only. A feature of searching the scale in tracking is based on discrimination of the tracked target from its background. A feature aligns the color histogram of the target model with the initial target model.
Embodiments of the invention may be potentially applied in interactive television (iTV), internet protocol television (IPTV), surveillance, smart room, and event analysis, as just some examples. Embodiments of the invention may provide value added services for IPTV, interactive TV, interactive video, personalized IPTV, social TV, and tactics analysis in sports video data.
In hyperlinked video, objects are selectable resulting in an associated action, akin to linked rich media content about the objects of interest. Possible venues for hyperlinked video include broadcast TV, streaming video and published media such as DVD. Hyperlinked video offers interaction possibilities with streaming media.
Interactive TV is a popular application area of hyperlinked video with the convergence between broadcast and network communications. For example, the European GMF4iTV (Generic Media Framework for Interactive Television) project has developed such a system where active video objects are associated to metadata information, embedded in the program stream at production time and can be selected by the user at run time to trigger the presentation of their associated metadata. Another European PorTiVity (Portable Interactivity) project is developing and experimenting with a complete end-to-end platform providing Rich Media Interactive TV services for portable and mobile devices, realizing direct interactivity with moving objects on handheld receivers connected to DVB-H (broadcast channel) and UMTS (unicast channel).
IPTV (Internet Protocol Television) is a system where a digital television service is delivered using Internet Protocol over a network infrastructure, which may include delivery by a broadband connection. An IP-based platform also allows significant opportunities to make the TV viewing experience more interactive and personalized. Interactive TV services will be a key differentiator for the multitude of IPTV offerings that are emerging. Interactivity via a fast two-way connection will lift IPTV ahead of today's television.
Localization of objects of interest is important for interactive services in IPTV systems, so that a regular TV broadcast (MPEG-2/-4) is augmented with additional information (MPEG-7 encoded) that defines those objects in the video, along with additional content to be displayed when they are selected. Specification of objects with additional content (metadata) is usually implemented by an authoring tool, which includes such functions as extraction of shots and key frames, specification of the interactive regions, and tracking of the specified regions to get the region locations in all frames. Therefore, an object tracking module is used in the authoring tool for the realization of interactive services in IPTV. Visual object tracking is also important for other kinds of applications, such as visual surveillance, smart rooms, video compression and vision-based interfaces.
Two major components can be distinguished in a typical visual object tracker. Target Representation and Localization is mostly a bottom-up process that also has to cope with the changes in the appearance of the target. Filtering and Data Association is mostly a top-down process dealing with the dynamics of the tracked object, learning of scene priors, and evaluation of different hypotheses. The most abstract formulation of the filtering and data association process is through the state space approach for modeling discrete-time dynamic systems, such as the Kalman filter and the particle filter. Algorithms for target representation and localization are specific to images and related to registration methods. Both target localization and registration maximize a likelihood-type function. Mean shift-based visual tracking approaches fall into target representation and localization as a gradient-based searching process for histogram matching. However, an apparent drawback of mean shift-based tracking methods is their strong requirement of significant overlap of the target kernels in consecutive frames. Background/foreground information would help overcome this shortcoming by adding constraints into the mean shift iteration.
Various approaches, such as appearance-based approaches, template matching approaches and histogram matching approaches, are used. For example, appearance-based approaches for visual tracking vary from approaches that strictly preserve region structure (templates) to approaches that totally ignore the region structure and track based on feature distributions (histograms). In order to take into account the variations of the visual appearance, an appearance-based visual tracking method needs to update the target model in one way or another. Otherwise drifting artifacts will occur, eventually resulting in the loss of the target. Drifting artifacts are caused by the accumulation of small errors in the appearance model, introduced each time the appearance is updated. Frequent appearance updates are required to keep the target model up-to-date with the changing target appearance; on the other hand, hasty updates of the appearance model will damage its integrity in the face of drift errors. Therefore, appearance updates should be carefully designed.
Moreover, template matching approaches require pixel-wise alignment between the target template and the target candidate region. They perform well for tracking rigid bodies and have been generalized to track deformable objects if the deformation models are known. Scale adaptation in template matching-based tracking is easier to do since the target motion can be modeled clearly.
Furthermore, histogram matching approaches have great flexibility to track deformable objects and are robust to partial occlusion, but can lose the tracked region to another region with a similar feature distribution, because histogram matching approaches are less discriminative to appearance changes and less sensitive to certain motions. Histogram matching-based tracking handles scale change with more difficulty since the structure information has been deemphasized.
Referring to
The background extraction 205 includes background pixel detection 210, connected component analysis 215, morphological filtering (e.g., dilation, erosion) 220 and size filtering 225.
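Two of the stages named above, connected component analysis and size filtering, can be illustrated with a minimal sketch. It operates on a small binary mask stored as nested lists; the function names, the 4-connectivity choice and the `min_size` threshold are assumptions made for illustration and are not taken from the described embodiment. Morphological filtering is omitted for brevity.

```python
from collections import deque


def label_components(mask):
    """Label 4-connected regions of non-zero cells; returns (labels, sizes)."""
    rows, cols = len(mask), len(mask[0])
    labels = [[0] * cols for _ in range(rows)]
    sizes = {}
    next_label = 1
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or labels[r][c]:
                continue
            # Breadth-first flood fill from an unlabeled non-zero cell.
            labels[r][c] = next_label
            queue = deque([(r, c)])
            count = 0
            while queue:
                cr, cc = queue.popleft()
                count += 1
                for nr, nc in ((cr - 1, cc), (cr + 1, cc), (cr, cc - 1), (cr, cc + 1)):
                    inside = 0 <= nr < rows and 0 <= nc < cols
                    if inside and mask[nr][nc] and not labels[nr][nc]:
                        labels[nr][nc] = next_label
                        queue.append((nr, nc))
            sizes[next_label] = count
            next_label += 1
    return labels, sizes


def size_filter(mask, min_size):
    """Keep only components with at least min_size cells (hypothetical threshold)."""
    labels, sizes = label_components(mask)
    return [[1 if lab and sizes[lab] >= min_size else 0 for lab in row]
            for row in labels]
```

In this sketch, isolated cells that survive pixel-level detection are discarded, while larger connected regions are kept.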
In some applications, such as sports videos, there are many shots where the majority of the image is the background or playfield. Based on this observation, an unsupervised segmentation technique can obtain the background model. However, the background area in any given frame is not always large enough for the dominant color assumption to be valid. Therefore, supervised methods for learning the playfield model can be used. A drawback of the supervised methods is the requirement for a large amount of labeled data, since hand-labeling is tedious and expensive.
In one embodiment, two options are defined. In the first option, a small set of labeled data, i.e., the pixels in a given background area, is used to generate a rough background model with a single Gaussian or a mixture of Gaussian distributions (for the latter, more labeled data is required). Then, this model can be modified by collecting more background pixels based on an unsupervised method using dominant color detection.
In a second option, one frame, where the dominant color assumption is satisfied, is selected. Then its dominant mode is extracted to generate the initial background model. Like the first option, this model can be modified by collecting more playfield pixels based on dominant color detection.
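As a minimal sketch of how a rough single-Gaussian background model might be estimated from a small set of labeled pixels, the following computes a mean and standard deviation per RGB channel. The function name and the small floor on the standard deviation are assumptions for illustration only.

```python
import math


def fit_channel_gaussians(pixels):
    """Estimate (mean, std) for each RGB channel from labeled background pixels.

    pixels: list of (r, g, b) tuples taken from a known background area.
    """
    n = len(pixels)
    model = []
    for ch in range(3):
        values = [p[ch] for p in pixels]
        mean = sum(values) / n
        var = sum((v - mean) ** 2 for v in values) / n
        # The floor avoids a zero spread when all samples are identical (assumption).
        model.append((mean, max(math.sqrt(var), 1e-6)))
    return model
```

The resulting per-channel parameters could then be refined as more background pixels are collected by dominant color detection.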
The determination of the background model is discussed in greater detail in commonly-owned and invented provisional patent application Ser. No. 61/144,386 (HW 09FW005P) and non-provisional patent application Ser. No. 12/686,902 (HW 09FW010), which applications are incorporated herein by reference. Further information can be derived from these applications in their entirety. Further aspects and embodiments of these co-pending applications can also be used in conjunction with the aspects and embodiments disclosed herein.
The framework in
Returning to
The playfield is shown visually in several types of sports videos, such as soccer, football, basketball, baseball and tennis. For example, the playfield is grass in soccer, baseball and football videos. Given a playfield (grass) model learned from labeled data (it can be updated on-line if possible), the probability of being playfield or non-playfield can be estimated for each pixel in a frame of the sports video. The playfield model can be represented by a single Gaussian, a mixture of Gaussians or a color histogram ratio (playfield and non-playfield).
For example, assume the playfield model in the RGB color space to be a single Gaussian distribution as
$$p_i(x) = N(x; \mu_i, \sigma_i), \quad i = R, G, B, \tag{1}$$
where N denotes the pdf value at x of a Gaussian with mean $\mu_i$ and covariance $\sigma_i$. Thus, the possibility of a pixel y with RGB value [r,g,b] in the frame is
$$F(y) = p(\text{playfield} \mid [r,g,b]) = N(r; \mu_R, \sigma_R) \cdot N(g; \mu_G, \sigma_G) \cdot N(b; \mu_B, \sigma_B). \tag{2}$$
A binary classification for the pixel y to be playfield or non-playfield can yield a weight mask as
where T is the weight for the foreground (T>1) and t is a scaling factor (1.0<t<3.0).
Correspondingly, the possibility of a pixel to be foreground in the frame is given by 1.0−F(y).
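The per-pixel playfield value F(y) of equation (2) and the corresponding foreground value 1.0−F(y) can be sketched as follows. The sketch treats the second parameter of each channel model as a standard deviation and uses the Gaussian pdf directly, as the text states; the function names and the `model` layout (a list of three `(mean, std)` pairs) are assumptions for illustration.

```python
import math


def gaussian_pdf(x, mean, std):
    """Value at x of a 1-D Gaussian density with the given mean and std."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def playfield_likelihood(rgb, model):
    """F(y): product of per-channel Gaussian pdf values, as in equation (2)."""
    value = 1.0
    for x, (mean, std) in zip(rgb, model):
        value *= gaussian_pdf(x, mean, std)
    return value


def foreground_likelihood(rgb, model):
    """1.0 - F(y), the foreground counterpart described in the text."""
    return 1.0 - playfield_likelihood(rgb, model)
```

Because F(y) is a product of density values, it can be small even for playfield pixels; a practical implementation might rescale it before use, which is a design choice outside what the text specifies.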
In the tracking initialization, the target model (normalized color histogram) qt can be obtained as
qt,u=CqΣi=1n
where m is the number of bins in the histogram (in a preferred embodiment, 8×8×8 bins in the RGB color space), δ is the Kronecker delta function, k(x) is the kernel profile function, {xi*}i=1˜n
Cq=1/(Σi=1n
q0 is defined as the normalized color histogram of the initial target model that is obtained at the first frame in tracking. It will be used later for target model updating.
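A kernel-weighted, normalized color histogram with 8×8×8 RGB bins can be sketched as below. This is a minimal illustration under stated assumptions: the Epanechnikov profile is chosen as k(x), pixel coordinates are normalized by the half width and half height of the target window, and the optional `fg_weight` callable multiplies each pixel's contribution by a foreground value. How exactly the foreground map enters the target model is not recoverable from this text, so that placement is hypothetical, as are all function names.

```python
def epanechnikov_profile(x):
    """Kernel profile k(x), where x is a squared normalized distance."""
    return 1.0 - x if x < 1.0 else 0.0


def bin_index(rgb, bins_per_channel=8):
    """Map an 8-bit (r, g, b) value to a flat bin index."""
    step = 256 // bins_per_channel
    r, g, b = (min(v // step, bins_per_channel - 1) for v in rgb)
    return (r * bins_per_channel + g) * bins_per_channel + b


def target_histogram(pixels, center, half_size, fg_weight=None, bins_per_channel=8):
    """Normalized kernel-weighted histogram.

    pixels: iterable of ((x, y), (r, g, b)); center and half_size describe
    the target window; fg_weight is an optional callable on (x, y).
    """
    hist = [0.0] * bins_per_channel ** 3
    cx, cy = center
    hw, hh = half_size
    for (x, y), rgb in pixels:
        d2 = ((x - cx) / hw) ** 2 + ((y - cy) / hh) ** 2
        w = epanechnikov_profile(d2)
        if fg_weight is not None:
            w *= fg_weight((x, y))  # assumed placement of the foreground weight
        hist[bin_index(rgb, bins_per_channel)] += w
    total = sum(hist)
    return [h / total for h in hist] if total > 0 else hist
```

Calling the function once on the first frame would give a histogram that can be stored as the initial model and reused later for target model updating.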
The next step in
In this example, the mean shift iteration includes performing first, second and third searches for the object within the current frame. The first search is performed with an original scale of the object, the second search is performed with an enlarged scale of the object, and the third search is performed with a shrunken scale of the object. The best match of the three searches can then be selected.
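The three-search scheme can be sketched as a small wrapper that runs a location search at the original, enlarged and shrunken window sizes and keeps the best match. The `search` callable, its `(location, score)` return convention, and the default percentage step `theta` are assumptions made for illustration; the source does not fix these details here.

```python
def search_three_scales(search, width, height, theta=10.0):
    """Run `search` at three window sizes and return the best result.

    search: hypothetical callable (w, h) -> (location, score), where a
    higher score means a better match.
    theta: size change in percent (illustrative default).
    Returns (score, location, (w, h)) for the best of the three searches.
    """
    factor = theta / 100.0
    best = None
    for scale in (1.0, 1.0 + factor, 1.0 - factor):
        w, h = width * scale, height * scale
        location, score = search(w, h)
        if best is None or score > best[0]:
            best = (score, location, (w, h))
    return best
```

The score passed back by `search` determines which scale wins, so the choice of similarity measure directly controls whether the window tends to shrink or grow.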
The histogram matching approaches can be either exhaustive searching or a gradient descent searching based on a given similarity measure. In order to cope with the changing scale of the target in time, searching in scale is also necessary. The target region is assumed to be a rectangle and its size is w×h at the previous frame, then histogram matching in the current frame is repeated using the window sizes of original, plus or minus θ percent of the original size. In the example shown in
1. Initialize the location of the candidate target in the current frame with ŷ0, and compute its normalized color histogram p(ŷ0) at location ŷ0 as
pu(ŷ0)=CpΣi=1n
with h as the bandwidth of the kernel profile, {xi}i=1˜n
Cp=1/(Σi=1n
Afterwards the following expression can be evaluated:
$$\rho[q_t, p(\hat{y}_0)] = \sum_{u=1}^{m} \sqrt{q_{t,u}\, p_u(\hat{y}_0)}. \tag{8}$$
Formula (8) is called the Bhattacharyya coefficient.
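A minimal sketch of the Bhattacharyya coefficient between two normalized histograms stored as equal-length lists follows; the function name is chosen for illustration.

```python
import math


def bhattacharyya(q, p):
    """Sum over bins of sqrt(q_u * p_u) for two normalized histograms."""
    if len(q) != len(p):
        raise ValueError("histograms must have the same number of bins")
    return sum(math.sqrt(qu * pu) for qu, pu in zip(q, p))
```

The coefficient equals 1 for identical normalized histograms and 0 for histograms with no overlapping bins.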
2. Derive the weight according to
3. Find the next location of the target candidate according to
with g(x)=−k′(x).
4. Compute the normalized color histogram p(ŷ1) at location ŷ1 and evaluate
$$\rho[p(\hat{y}_1), q_t] = \sum_{u=1}^{m} \sqrt{p_u(\hat{y}_1)\, q_{t,u}}. \tag{11}$$
5. While ρ[p(ŷ1),qt]<ρ[p(ŷ0),qt], do ŷ1←(ŷ0+ŷ1)/2 and evaluate ρ[p(ŷ1),qt].
6. If ∥ŷ0−ŷ1∥≤ε, stop. Otherwise, set ŷ0←ŷ1 and go to step 2.
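Steps 1 to 6 can be sketched as the loop below. It is a simplified illustration under several assumptions: pixels are given as `((x, y), bin)` pairs with bins already quantized, the kernel is an isotropic Epanechnikov profile (so g(x)=−k′(x) is constant inside the window), the per-pixel weight takes the classical form sqrt(q_u/p_u(ŷ0)), and the optional foreground value `fg` simply multiplies each pixel's contribution. The weight and update expressions are not recoverable from this text, so those forms are assumed, not quoted.

```python
import math


def profile(x):
    """Epanechnikov profile k(x); its derivative is constant for x < 1."""
    return 1.0 - x if x < 1.0 else 0.0


def histogram(pixels, center, h, m, fg=None):
    """Normalized kernel-weighted histogram with m bins and bandwidth h."""
    hist = [0.0] * m
    for (x, y), u in pixels:
        d2 = ((x - center[0]) ** 2 + (y - center[1]) ** 2) / (h * h)
        hist[u] += profile(d2) * (fg((x, y)) if fg else 1.0)
    total = sum(hist)
    return [v / total for v in hist] if total > 0 else hist


def bhattacharyya(q, p):
    return sum(math.sqrt(a * b) for a, b in zip(q, p))


def mean_shift(pixels, q, y0, h, m, fg=None, eps=0.01, max_iter=100):
    """Iterate steps 1-6 from start location y0; returns the final location."""
    y0 = tuple(y0)
    for _ in range(max_iter):
        p0 = histogram(pixels, y0, h, m, fg)               # step 1
        rho0 = bhattacharyya(q, p0)
        sx = sy = sw = 0.0
        for (x, y), u in pixels:                           # steps 2 and 3
            d2 = ((x - y0[0]) ** 2 + (y - y0[1]) ** 2) / (h * h)
            if d2 >= 1.0 or p0[u] <= 0.0:
                continue
            w = math.sqrt(q[u] / p0[u]) * (fg((x, y)) if fg else 1.0)
            sx, sy, sw = sx + w * x, sy + w * y, sw + w
        if sw == 0.0:
            return y0
        y1 = (sx / sw, sy / sw)
        rho1 = bhattacharyya(q, histogram(pixels, y1, h, m, fg))  # step 4
        # Step 5: halve the step while the similarity decreases. The distance
        # guard is an added safeguard against endless halving (assumption).
        while rho1 < rho0 and math.dist(y0, y1) > 1e-3:
            y1 = ((y0[0] + y1[0]) / 2.0, (y0[1] + y1[1]) / 2.0)
            rho1 = bhattacharyya(q, histogram(pixels, y1, h, m, fg))
        if math.dist(y0, y1) <= eps:                       # step 6
            return y1
        y0 = y1
    return y0
```

With `fg` left as `None`, every pixel has unit foreground weight and the loop behaves like a plain mean shift tracker in this sketch.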
It is noted that, to save computation cost, an alternative to the above modification is to replace 1−F(x) with G(x). If F(x)=0.0, the proposed method reduces to the traditional mean shift-based tracking method.
The best scale is chosen by evaluating a measurement function that is a discriminant of the target from its background.
$$\rho[q_t, p_f(y)] = \sum_{u=1}^{m} \sqrt{q_{t,u}\, p_{f,u}(y)}. \tag{12}$$
But this simple metric cannot stop the scale shrinking when used for scale adaptation. Here a proposed metric is defined below.
First, the minimal non-zero value among the elements of the background histogram pb(y) is denoted as θ, and a weight function is calculated as
This weight function is employed to define a transformation for the representations of the target model and candidates. It diminishes the importance of those features which are prominent in the background.
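One plausible form of such a weight, sketched below, is the classical background-weighted histogram rule v_u = min(θ/p_b,u, 1), where θ is the minimal non-zero background bin value. The exact weight expression is not reproduced in this text, so this form and the function name are assumptions; empty background bins are given weight 1.

```python
def background_weights(pb):
    """Per-bin weights that down-weight colors prominent in the background.

    pb: normalized background histogram. Assumed rule: min(theta / pb_u, 1),
    with theta the smallest non-zero entry of pb.
    """
    nonzero = [v for v in pb if v > 0.0]
    if not nonzero:
        return [1.0] * len(pb)
    theta = min(nonzero)
    return [min(theta / v, 1.0) if v > 0.0 else 1.0 for v in pb]
```

Under this rule, the most common background color receives the smallest weight, and colors absent from the background keep full weight.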
Then, we calculate the weighted target model q′t as
q′t,u=CqvuΣi=1n
where the normalization constant Cq is expressed as
Cq=1/(Σi=1n
Correspondingly, the weighted target candidate model p′f(y) is given by
p′f,u(y)=CfvuΣi=1n
with h as the bandwidth of the kernel profile and the normalization constant Cf expressed as
Cf=1/(Σi=1n
Consequently, the best scale is obtained by maximizing the defined similarity function as follows
$$\max\; \rho[q'_t, p'_f(y)] = \sum_{u=1}^{m} \sqrt{q'_{t,u}\, p'_{f,u}(y)}. \tag{18}$$
A benefit of this measure in scale adaptation is that it weights the target histogram with the background, and thus the discrimination of the target from its background is enhanced.
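The weighted similarity and the scale selection can be sketched as follows. Because the weight depends only on the bin, multiplying each bin of an already normalized histogram by its weight and renormalizing gives the same result as applying the weight inside the kernel sum. The sketch assumes one weight vector `v` shared by the model and all candidates, and a dictionary mapping each tested scale to its candidate histogram; both are illustrative conventions, not details from the source.

```python
import math


def reweight(hist, v):
    """Multiply each bin by its weight and renormalize to sum to 1."""
    weighted = [h * w for h, w in zip(hist, v)]
    total = sum(weighted)
    return [x / total for x in weighted] if total > 0 else weighted


def discriminative_similarity(q, p, v):
    """Bhattacharyya coefficient between the weighted model and candidate."""
    q_w, p_w = reweight(q, v), reweight(p, v)
    return sum(math.sqrt(a * b) for a, b in zip(q_w, p_w))


def best_scale(q, candidates, v):
    """Return the scale whose candidate histogram maximizes the similarity.

    candidates: dict mapping scale -> candidate histogram at that scale.
    """
    return max(candidates, key=lambda s: discriminative_similarity(q, candidates[s], v))
```

A candidate window that has shrunk onto a background-like part of the target loses score here, because the bins it matches on are the ones the weights suppress.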
Finally, alpha-blending is used to smooth the adapted scale: if the previous scale is $h_{prev}$ and the adapted scale based on the defined metric (18) is $h_{opt}$, then the new scale $h_{new}$ is given as
$$h_{new} = \alpha\, h_{opt} + (1-\alpha)\, h_{prev}, \tag{19}$$
with the blending factor 0<α<0.5.
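The scale smoothing of equation (19) can be written as a one-line helper. The default value of `alpha` below is only an illustrative choice within the stated range.

```python
def blend_scale(h_prev, h_opt, alpha=0.1):
    """New scale as alpha * h_opt + (1 - alpha) * h_prev, with 0 < alpha < 0.5."""
    if not 0.0 < alpha < 0.5:
        raise ValueError("alpha must satisfy 0 < alpha < 0.5")
    return alpha * h_opt + (1.0 - alpha) * h_prev
```

Keeping α below 0.5 means the previous scale always dominates, so a single poor scale estimate changes the window size only slightly.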
Referring back to
First, the final estimated location ŷ0* is recorded. Then, ŷ0 and qt are replaced with ŷ0* and q0 (the initial histogram), respectively, and the mean shift iteration is run again. The location estimated in the second iteration is denoted as ŷ1*. The histogram update strategy is defined as follows. If ∥ŷ0*−ŷ1*∥≤ε (where ε is a small threshold ensuring that the second gradient descent iteration does not diverge too far from the result of the first iteration), the normalized color histogram p(ŷ1*) is calculated at location ŷ1* and used as the updated target model qt+1, i.e., qt+1=p(ŷ1*). Otherwise, acting conservatively, the target model is not updated, i.e., qt+1=qt.
Finally, the similarity measure is checked again. If it is very low, the object is considered lost in tracking; otherwise, the recursion continues to the next frame.
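The conservative update rule can be sketched as below. The `run_mean_shift` and `histogram_at` callables stand in for the tracker's own routines and are hypothetical, as is the default threshold `eps`.

```python
import math


def conservative_update(q_t, q_0, y0_star, run_mean_shift, histogram_at, eps=2.0):
    """Return the target model for the next frame.

    q_t: current target model; q_0: initial target model.
    y0_star: location found by the first mean shift pass using q_t.
    run_mean_shift: hypothetical callable (start, model) -> location.
    histogram_at: hypothetical callable location -> normalized histogram.
    """
    # Second pass: start from y0_star but match against the initial model.
    y1_star = run_mean_shift(y0_star, q_0)
    if math.dist(y0_star, y1_star) <= eps:
        # Both passes agree, so the model is refreshed at y1_star.
        return histogram_at(y1_star)
    # The passes disagree, so the current model is kept unchanged.
    return q_t
```

Aligning each update with the initial model in this way limits how far accumulated errors can pull the appearance model away from the original target.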
The video is provided at an input of image processor 620. The image processor 620 is typically a computer system that includes a processor, e.g., a microprocessor or digital signal processor, programmed to perform the image processing steps, e.g., those algorithms and methods disclosed herein. The image processor 620 generally includes a memory for storing program code to cause execution of the processing and further memory for storing the image data during processing. The memory can be a single unit for all functions or multiple memories.
The processed video image, which may now include metadata relating to the location of the object or objects being tracked, can be used for a number of purposes.
In the embodiment, the user is registered with the IMS infrastructure. The TV content is to be enhanced with metadata information for the playfield description. The IPTV client is enhanced with such a service, which implies an environment to run additional services and respectively execute advanced program code on the IPTV client for on-line player localization (segmentation or tracking).
The IPTV Service Control Function 150 manages all user-to-content and content-to-user relationships and controls the Content Delivery and Storage 140 and the Content Aggregator 110. The IPTV Application Function 145 supports various service functions and provides an interface through which the user 160 is notified of IPTV service information and through which service requests of the user (such as registration or authentication) are accepted. The IPTV Application Function 145, in conjunction with the Service Control Function 150, provides users with the value added services they request.
The Content Preparation 130 sends a content distribution request to the Content Delivery Control 135. The Content Delivery Control 135 produces a distribution task between Content Preparation 130 and the Content Delivery and Storage 140 according to the defined distribution policy when it receives the request of content distribution. The Content Delivery and Storage 140 delivers aggregated and metadata-enhanced content to the user 160, and may perform player localization in implementations where these tasks are not performed at the IPTV Client 155. The system may further perform team/player classification functions as described in co-pending application Ser. No. 12/686,902 (HW 09FW010).
The Content Aggregator 110 links the content 120 to the metadata 125 via the Authoring Tool 115 and aggregates content that is enhanced with metadata information for interactive service purposes. The Authoring Tool 115 runs play field learning and generates the MPEG-7 metadata.
A specific example of an interactive television system will now be described with respect to
This scenario describes a rich media interactive television application. It focuses on new concepts for interaction with moving objects in the sport programs. Based on direct interaction with certain objects, the viewer can retrieve rich media content about objects of his choice.
The interaction is based on the combination of information prepared on the IPTV server side and real time object localization (detection/tracking) on the IPTV client side. The information on the server side is stored as metadata in the MPEG-7 format and describes the playfield. The client side does the real time object processing and presents the related media information on a screen for user interaction.
The TV content is enhanced with metadata information. For example, a description of the field is represented by a color histogram. The user has to be registered with the IMS infrastructure. The IPTV client has to be enhanced with such a service, which implies an environment to run additional services and respectively execute advanced program code on the IPTV client for content processing and object highlighting. Charging can be used for transaction and accounting.
Referring now to
The IPTV client 820, for example a set top box (STB), is responsible for providing the viewer 830 with the functionality to make use of the interaction: real time object processing, spotting the highlighting of objects that contain additional content, selecting objects and viewing additional content. The IMS based IPTV client 820 is enabled with techniques such as real time object processing for providing the interactive service. In another example, if the video content is not enhanced with the metadata information, the IPTV client 820 can provide a user interface to the user 830 for collecting such information.
The user 830 makes use of the service by selecting objects, and consuming additional content. The delivery system 840, typically owned by the service provider 810, delivers aggregated and metadata-enhanced content to the user 830, provides trick functions and highly efficient video and audio coding technologies.
The content aggregator 850 links the content 860 to the metadata 870 via the authoring tool 880. This aggregator 850 aggregates content, which is enhanced with metadata information for interactive service purposes. The content aggregator 850 provides the delivery system 840 with aggregated content and attaches enhanced content to it. Therefore, MPEG-7, as the standard for multimedia metadata descriptions, should be considered. The authoring tool 880 provides algorithms for field learning in video streams and an MPEG-7 metadata generator.
In the operation of the system 800, the user 830 registers with the service provider 810 and requests the desired service. For this example, the user 830 is able to click on a player to start tracking the player.
In response to the request from the user 830, the service provider 810 causes the aggregator 850 to prepare the enhanced content. In doing so, the aggregator 850 communicates with the authoring tool 880, which processes the content image and enhances the content 860 with the metadata 870. The aggregator 850 can then provide the aggregated content to the delivery system 840.
The delivery system 840 forwards the enhanced content to the IPTV client 820, which interacts with the user 830. The user 830 also provides stream control to the delivery system 840, either via the IPTV client 820 or otherwise.
Features of each of the functional units shown in
Features of the service provider 810 include:
Features of the IPTV client 820 include:
Features of the user 830 include:
Features of the delivery system 840 include:
Features of the aggregator 850 include:
Features of the authoring tool 880 include:
Although the present invention targets interactive services in IPTV systems, the invention is not so limited. The proposed scheme can be used in other video delivery systems with improved accuracy and low computational complexity.
While this invention has been described with reference to illustrative embodiments, this description is not intended to be construed in a limiting sense. Various modifications and combinations of the illustrative embodiments, as well as other embodiments of the invention, will be apparent to persons skilled in the art upon reference to the description. It is therefore intended that the appended claims encompass any such modifications or embodiments.
This application claims the benefit of U.S. Provisional Application No. 61/144,393, filed on Jan. 13, 2009, entitled “Mean Shift-Based Object Tracking with Scale Adaptation and Target Model Updating,” which application is hereby incorporated herein by reference.
Number | Date | Country | |
---|---|---|---|
20100177194 A1 | Jul 2010 | US |
Number | Date | Country | |
---|---|---|---|
61144393 | Jan 2009 | US |