Automated recognition of human actions in video clips has many useful applications, including surveillance, health care, human-computer interaction, computer games, and telepresence. In general, a trained action classifier (model) processes the video clips to determine whether a particular action takes place.
To learn an effective action classifier model, previous approaches rely on a significant amount of labeled training data, i.e., training labels. In general, a classifier trained in this way works well for one dataset, but not for another; for example, the background, lighting, and so forth may differ across datasets.
As a result, to recognize the actions in a different dataset, labeled training data approaches have heretofore been used to retrain the model with new labels. However, labeling video sequences is a very tedious and time-consuming task, especially when detailed spatial locations and time durations are needed. For example, when the background is cluttered and there are multiple people appearing in the same frame, the labelers need to provide a bounding box for every subject, together with the starting/ending frames of each action instance. For a video as long as several hours, the labeling process may take on the order of weeks.
This Summary is provided to introduce a selection of representative concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used in any way that would limit the scope of the claimed subject matter.
Briefly, various aspects of the subject matter described herein are directed towards a technology by which unlabeled data of a target dataset is used to adaptively train an action model based upon similar actions in a source dataset. In one aspect, the target dataset, containing (e.g., unlabeled) video data, is processed into a background model. The background model is processed into an action model by using a source dataset with (e.g., labeled) video data that includes action of interest data. The target dataset is searched to find one or more detected regions (subvolumes) where similar action of interest data occurs. When found, the action model is updated based on the detected regions. The action model is refined by iteratively searching with the most current action model, then updating that action model based upon the action of interest regions, and so forth, for a number of iterations. The action model may then be output for use as a classifier.
In one aspect, the background model and action model comprise spatial-temporal interest points modeled as Gaussian mixture models. Searching the target dataset comprises performing a localization task based upon a scoring function, in which the searching may comprise branch and bound searching.
Other advantages may become apparent from the following detailed description when taken in conjunction with the drawings.
The present invention is illustrated by way of example and not limitation in the accompanying figures, in which like reference numerals indicate similar elements and in which:
Various aspects of the technology described herein are generally directed towards an adaptive action detection approach that combines model adaptation and action detection. To this end, the technology trains a background model from unlabeled data (in a target dataset), and uses this background model and labeled training data (in a source dataset) to extract a foreground (action of interest) model (a classifier). The action of interest model may be iteratively refined via further data processing. In this way, the technology effectively leverages unlabeled target data in adapting the action of interest model in a cross-dataset action detection approach. As can be readily appreciated, such cross-dataset action detection is valuable in many scenarios, such as surveillance applications.
It should be understood that any of the examples herein are non-limiting. As such, the present invention is not limited to any particular embodiments, aspects, concepts, structures, functionalities or examples described herein. Rather, any of the embodiments, aspects, concepts, structures, functionalities or examples described herein are non-limiting, and the present invention may be used in various ways that provide benefits and advantages in computing and video processing in general.
The mechanism 102 processes the source dataset 104, such as one containing labeled data that is relatively simple, e.g., containing only video clips recorded with clean backgrounds, with each video clip involving only one type of repetitive action performed by a single person. In contrast, in another dataset (e.g., a target dataset 106), the background may be cluttered, and there may be multiple people moving around with occasional occlusions. As described herein, based on the source dataset 104, a background model 108 and an iterative procedure, the adaptive action detection mechanism 102 processes the target dataset 106 to recognize similar actions therein, and iteratively repeats the processing to generate and refine an action of interest model 110 for action classification.
As represented in
As represented in
With the adapted action of interest model 110, the adaptive action detection mechanism 102 estimates the location of an action instance (block 112) in the target dataset 106 by differentiating between the STIPs in the background and the STIPs for an action of interest. This is generally represented in
Turning to additional details, one example approach combines model adaptation and action detection into a Maximum a Posteriori (MAP) estimation framework, which explores the spatial-temporal coherence of actions and makes good use of the prior information that can be obtained without supervision. The technology described herein combines action detection and classifier adaptation into a single framework, which provides benefits in cross-dataset detection. Note that cross-dataset learning and semi-supervised learning both learn a model with a limited amount of labeled data. However, semi-supervised learning assumes the labeled data and unlabeled data are generated from the same distribution, which is not the case in cross-dataset learning. In cross-dataset learning, the actions of interest are assumed to share some similarities across the datasets, but are not exactly the same, and further, the background models are typically significantly different from dataset to dataset. Because of this, semi-supervised learning is not suitable, and the technology described herein instead treats the action model in the source dataset as a prior, and employs maximum a posteriori estimation to adapt the model to the target dataset. Moreover, the model provides both classification and detection (spatial and temporal localization), while conventional semi-supervised learning algorithms consider only classification.
Another aspect is directed towards the spatial-temporal coherence nature of the video actions, using a three-dimensional (3D) subvolume to represent a region in the 3D video space that contains an action instance. As generally described in U.S. patent application Ser. No. 12/481,579, herein incorporated by reference, a 3D subvolume is parameterized as a 3D cube with six degrees of freedom in (x, y, t) space. Spatial and temporal localization of an action in a video sequence is rendered as searching for the optimal subvolume.
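By way of a non-limiting illustration, the following Python sketch shows one possible representation of such a six-degree-of-freedom subvolume in (x, y, t) space; the class and field names are hypothetical and not taken from the referenced application.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Subvolume:
    """Hypothetical 3D subvolume: a cube in (x, y, t) video space with six
    degrees of freedom (two bounds per axis)."""
    x1: int
    x2: int
    y1: int
    y2: int
    t1: int
    t2: int

    def contains(self, x: float, y: float, t: float) -> bool:
        # True if a spatial-temporal interest point (STIP) falls inside the cube.
        return (self.x1 <= x <= self.x2
                and self.y1 <= y <= self.y2
                and self.t1 <= t <= self.t2)
```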
The technology described herein simultaneously locates the action and updates the GMM parameters. The benefits of doing so include that action detection provides more useful information to the user, and that locating the spatial-temporal subvolumes allows iteratively filtering out the STIPs in the background, thus refining the model estimation.
A video sequence may be represented as a collection of spatial-temporal interest points (STIPs), where each STIP is represented by a feature vector q. To model the probability of each STIP, a Gaussian mixture model (GMM) represents a universal background distribution. If a GMM contains K components, the probability can be written as
Pr(q|θ)=Σk=1K wk N(q; μk, Σk),
where N(•) denotes the normal distribution, and μk and Σk denote the mean and variance of the kth normal component, respectively. Each component is associated with a weight wk that satisfies Σk=1K wk=1. The parameter of the GMM is denoted by θ={μk, Σk, wk}, of which the prior takes the form:
where Pr(μk, Σk) is a normal-Wishart density distribution representing the prior information. In cross-dataset detection, Pr(θ) represents the prior information obtained from the source dataset 104. Because the action information in different datasets is likely correlated, the prior Pr(θ) from the source dataset is likely beneficial for the action detection in the target dataset 106.
The task of action detection is to distinguish the action of interest from the background. To this end, two GMM models are employed, namely a background model 108, θb={μbk, Σbk, wbk}, and the model 110 for the actions of interest, θc={μck, Σck, wck}. The corresponding prior distributions are denoted as Pr(θb) and Pr(θc), respectively. The task of action detection is modeled as finding 3D subvolumes in the spatial and temporal domains that contain the actions of interest.
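As a minimal sketch of these two models (not the actual implementation), scikit-learn's GaussianMixture may stand in for the GMM machinery; the function names and the choice of diagonal covariances are assumptions for illustration, and the STIP descriptors are presumed to be available as NumPy arrays.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_background_model(stips_target: np.ndarray, K: int = 64) -> GaussianMixture:
    """Fit the universal background model (theta_b) on the (N, D) array of STIP
    descriptors extracted from the unlabeled target dataset."""
    gmm_b = GaussianMixture(n_components=K, covariance_type="diag", random_state=0)
    gmm_b.fit(stips_target)
    return gmm_b

def stip_log_likelihoods(gmm: GaussianMixture, stips: np.ndarray) -> np.ndarray:
    """Per-STIP log Pr(q | theta) under a given GMM (background or action model)."""
    return gmm.score_samples(stips)
```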
Let Q={Q1, Q2, . . . } denote the set of subvolumes, each of which contains an instance of the action, and let UQ=∪QεQ Q denote the union of those subvolumes.
To detect action in the target dataset, the mechanism 102 needs to find the optimal action model θc together with the action subvolume Q in the target dataset 106.
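A plausible form of equation (3), assumed here from the surrounding description (a prior on θc, STIPs inside the subvolumes explained by θc, and the remaining STIPs explained by the fixed background model θb), is:

```latex
% Assumed reconstruction of the MAP objective; not quoted from the original.
(\theta_c^{*}, \mathcal{Q}^{*}) = \arg\max_{\theta_c,\, \mathcal{Q}}
  \; \Pr(\theta_c) \prod_{q \in U_{\mathcal{Q}}} \Pr(q \mid \theta_c)
  \prod_{q \notin U_{\mathcal{Q}}} \Pr(q \mid \theta_b)
```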
However, directly optimizing equation (3) is intractable. An effective approach in practice is to find the solution in an iterative way:
where θc* and Q* represent the updated action model and subvolumes, respectively. Note that the background model θb is fixed in one implementation, and is thus referred to as a universal background model.
Equation (4) provides a tool to incorporate the cross-dataset information. By way of example, consider a labeled source dataset S and an unlabeled target dataset T. The process can estimate θb by fitting the GMM with the STIPs in T. However, it is difficult to estimate θc directly because there is no label information of Q in T. Instead, an initial θc may be obtained by applying equation (4) to the source dataset S, where label information is available. Starting from this initial θc, equation (4) may then be used to update the estimation of θc and Q in T. This approach adapts the action model from S to T and is referred to as adaptive action detection. The following algorithm, also represented in
Iterative Adaptive Action Detection
In general, the objective function is optimized by fixing the action model and updating the set of subvolumes Q containing the action, then fixing the set of subvolumes and optimizing the action model θc, and so on until convergence is detected (e.g., if the measured improvement between iterations is below some threshold) or some fixed number of iterations is performed. The action model may then be output for use as a classifier, along with data corresponding to the subvolumes (the detected regions of action of interest), e.g., the frames and locations therein.
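A compact sketch of this alternating optimization is given below. It is illustrative only: the helper callables (detect_subvolumes, adapt_action_model) are hypothetical stand-ins for the subvolume search and MAP update steps described in the remainder of this description.

```python
def adaptive_action_detection(stips, gmm_background, gmm_action_prior,
                              detect_subvolumes, adapt_action_model,
                              max_iters: int = 10, tol: float = 1e-3):
    """Illustrative alternating optimization: fix the action model and search for
    subvolumes, then fix the subvolumes and re-estimate the action model, until
    the measured improvement falls below a threshold or max_iters is reached."""
    gmm_action = gmm_action_prior                 # initial theta_c from the source dataset
    subvolumes, prev_score = [], float("-inf")
    for _ in range(max_iters):
        # Step 1: with theta_c fixed, locate the subvolumes Q and their total score.
        subvolumes, score = detect_subvolumes(stips, gmm_action, gmm_background)
        # Step 2: with Q fixed, MAP-adapt theta_c toward the STIPs inside Q.
        gmm_action = adapt_action_model(gmm_action, stips, subvolumes)
        if score - prev_score < tol:              # convergence test on the objective
            break
        prev_score = score
    return gmm_action, subvolumes
```

The action model returned by such a loop corresponds to the classifier that is output, along with the final set of detected subvolumes.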
Turning to computing the updated action model θc*, the optimal parameter θc maximizes equation (2):
When Q is given and the background model θb is fixed, the problem is simplified as:
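A plausible form of the simplified problem, assumed here by analogy with standard MAP estimation for Gaussian mixtures (the terms involving the fixed background model contribute only a constant), is:

```latex
% Assumed reconstruction; background terms are constant and omitted.
\theta_c^{*} = \arg\max_{\theta_c} \; \Pr(\theta_c)
  \prod_{q \in U_{\mathcal{Q}}} \Pr(q \mid \theta_c)
```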
The model of μck in the source dataset is taken as the prior. Because the Gaussian distribution is the conjugate prior for the mean of a Gaussian, the MAP estimation for equation (6) is obtained in a simple form:
μck*=αk Eck(x)+(1−αk)μck
Σck*=βk Eck(x²)+(1−βk)(Σck+μck²)−(μck*)²
where αk represents the weight that adjusts the contribution of the prior model to the updated model. The variable Eck is the weighted summation of samples in the target dataset. Note that, for faster speed and robustness, only μk (and not Σk) is updated. The variable Eck can be estimated as:
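One standard estimate, assumed here following common GMM adaptation practice (with nk denoting the soft count of STIPs assigned to the kth component and Pr(k|qi, θc) the posterior responsibility), is:

```latex
% Assumed standard form of the weighted sample mean used in MAP adaptation.
E^{c}_{k}(x) = \frac{1}{n_k} \sum_{i} \Pr(k \mid q_i, \theta_c)\, q_i,
\qquad n_k = \sum_{i} \Pr(k \mid q_i, \theta_c)
```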
Note that the weighting parameter αk also may be simplified as:
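A common relevance-factor form, assumed here from standard GMM adaptation practice rather than quoted from the original, is:

```latex
% Assumed simplification of the adaptation weight.
\alpha_k = \frac{n_k}{n_k + r}
```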
where r is the controlling variable for adaptation. The adaptation approach effectively makes use of the prior information from source dataset 104, and requires only a small amount of training data to obtain the adaptation model.
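Putting the adaptation step together, a minimal sketch is shown below; it assumes diagonal-covariance GMMs, updates only the means (as noted above), and uses an illustrative function name and relevance factor.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def adapt_means(gmm_action: GaussianMixture, stips_inside: np.ndarray,
                relevance: float = 16.0) -> GaussianMixture:
    """MAP-adapt the component means of the action GMM toward the STIPs that fall
    inside the detected subvolumes; covariances and weights are kept fixed."""
    resp = gmm_action.predict_proba(stips_inside)        # Pr(k | q_i), shape (N, K)
    n_k = resp.sum(axis=0) + 1e-10                       # soft counts per component
    E_k = (resp.T @ stips_inside) / n_k[:, None]         # weighted sample means E_k(x)
    alpha = n_k / (n_k + relevance)                      # adaptation weights alpha_k
    gmm_action.means_ = alpha[:, None] * E_k + (1.0 - alpha[:, None]) * gmm_action.means_
    return gmm_action
```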
Turning to subvolume detection, given the model θc, the best subvolume containing the action of interest can be found.
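One decomposition consistent with the description that follows, assumed here rather than quoted, writes the data log-likelihood so that the subvolume-dependent part separates from a second term taken over all STIPs:

```latex
% Assumed decomposition; the second summation does not depend on Q.
Q^{*} = \arg\max_{Q} \sum_{q \in Q}
        \big[\log \Pr(q \mid \theta_c) - \log \Pr(q \mid \theta_b)\big]
        + \sum_{q} \log \Pr(q \mid \theta_b)
```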
The second term is constant given the universal background model 108. Thus, a simplified form of subvolume detection may be used:
Assigning each STIP q a score
s(q)=log Pr(q|θc)−log Pr(q|θb)
allows defining a scoring function for a subvolume Q:
ƒ(Q)=Σq∈Q s(q)
It is known that the localization task based on the scoring function in (11) can be accomplished efficiently by a well-known Branch and Bound (BB) search method. Denoting UQ as the collection of 3D subvolumes gives Q ε UQ. Assuming that there are two subvolumes Qmin and Qmax, such that Qmin ⊂ Q ⊂ Qmax, let ƒ+ and ƒ− be two functions defined as:
ƒ+(Q)=Σq∈Q max(s(q), 0) and ƒ−(Q)=Σq∈Q min(s(q), 0), which gives:
ƒ(Q)≦ƒ0(Q)=ƒ+(Qmax)+ƒ−(Qmin) (12)
for every Q ε UQ, which is the basic upper bound function used in BB search.
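As an illustration of the scoring and of the upper bound in equation (12), the following sketch assumes the log-likelihood-ratio score discussed above; it is not the branch and bound search itself, only the per-STIP scores and the bound ƒ+(Qmax)+ƒ−(Qmin) that such a search would use to prune candidate subvolumes.

```python
import numpy as np

def stip_scores(stips, gmm_action, gmm_background):
    """s(q) = log Pr(q | theta_c) - log Pr(q | theta_b) for every STIP descriptor."""
    return gmm_action.score_samples(stips) - gmm_background.score_samples(stips)

def upper_bound(scores_in_qmax: np.ndarray, scores_in_qmin: np.ndarray) -> float:
    """Upper bound f(Q) <= f+(Qmax) + f-(Qmin) for any subvolume Q nested between
    Qmin and Qmax: f+ keeps only the positive scores of the largest candidate,
    while f- keeps only the negative scores of the smallest candidate, which
    every admissible Q must contain."""
    f_plus = np.clip(scores_in_qmax, 0.0, None).sum()
    f_minus = np.clip(scores_in_qmin, None, 0.0).sum()
    return float(f_plus + f_minus)
```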
In this manner, there is provided cross-dataset learning that adapts an existing classifier from a source dataset to a target dataset, while using only a small number of labeled samples (or possibly no labels at all). The cross-dataset action detection is possible even though the videos may be taken on different occasions, with different backgrounds, with actions that may appear different when performed by different people, and with different lighting conditions, scales and action speeds, and so forth. Notwithstanding, the actions in different datasets still share similarities to an extent, and the classifier adaptation leverages the spatial and temporal coherence of the individual actions in the target dataset.
The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to: personal computers, server computers, hand-held or laptop devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and so forth, which perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in local and/or remote computer storage media including memory storage devices.
With reference to
The computer 310 typically includes a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by the computer 310 and includes both volatile and nonvolatile media, and removable and non-removable media. By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer 310. Communication media typically embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above may also be included within the scope of computer-readable media.
The system memory 330 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 331 and random access memory (RAM) 332. A basic input/output system 333 (BIOS), containing the basic routines that help to transfer information between elements within computer 310, such as during start-up, is typically stored in ROM 331. RAM 332 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 320. By way of example, and not limitation,
The computer 310 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media, described above and illustrated in
The computer 310 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 380. The remote computer 380 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 310, although only a memory storage device 381 has been illustrated in
When used in a LAN networking environment, the computer 310 is connected to the LAN 371 through a network interface or adapter 370. When used in a WAN networking environment, the computer 310 typically includes a modem 372 or other means for establishing communications over the WAN 373, such as the Internet. The modem 372, which may be internal or external, may be connected to the system bus 321 via the user input interface 360 or other appropriate mechanism. A wireless networking component such as comprising an interface and antenna may be coupled through a suitable device such as an access point or peer computer to a WAN or LAN. In a networked environment, program modules depicted relative to the computer 310, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
An auxiliary subsystem 399 (e.g., for auxiliary display of content) may be connected via the user interface 360 to allow data such as program content, system status and event notifications to be provided to the user, even if the main portions of the computer system are in a low power state. The auxiliary subsystem 399 may be connected to the modem 372 and/or network interface 370 to allow communication between these systems while the main processing unit 320 is in a low power state.
While the invention is susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the invention to the specific forms disclosed, but on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the invention.