A significant number of images and videos uploaded to the Internet, such as YouTube videos or Flickr images, contain scenes of people interacting with people. However, there is currently no automatic method to classify or search these images based on different social interactions and activities. Thus, digital apps cannot automatically sort or arrange photos in digital photo collections based on social activities and interactions. In addition, image search engines cannot retrieve photos depicting various social activities unless someone manually provides text tags with the images.
We present a unified framework called 3D Visual Proxemics Analysis (VPA3D) for detecting and classifying people interactions in unconstrained user generated images. VPA3D first estimates people/face depths in 3D and the camera pose, then performs perspective rectification to map people locations from the scene space to the 3D space. The 3D layout is computed by a novel algorithm that robustly fuses semantic constraints into a linear camera model. To characterize the human interaction space, we introduce visual proxemes: a set of prototypical patterns that characterize social interactions. Finally, a set of spatial and structural features is used to detect and recognize a variety of social interactions, including people dining together, family portraits, and people addressing an audience.
Vast amounts of Internet content capture people-centric events. Detecting and classifying people interactions in images and videos could help us to automatically tag, retrieve and browse large archives using high-level, concept-based keywords. Such a representation would support queries such as "find me the video segment where we are walking down the aisle" or "find me the photos where I am cutting the birthday cake". Such queries are very hard to represent using low-level features. To bridge this semantic gap between human-defined phrases and image features, we present an intermediate representation using visual proxemes. Briefly, the invention can have the following commercial impact: 1) The invention can be used in digital apps to sort or arrange photos in digital photo collections. Social networking sites (e.g., Facebook and Google Picasa photo albums) would directly benefit from this. 2) The invention can aid image search for photos depicting various types of social activities in large image databases. 3) The invention can be used to accurately determine the physical distance between people in a photograph. This information can potentially be used for forensic analysis of photographs or CCD images.
Group behavior based on people tracks has been studied previously. Owing to the accuracy of face detection systems, detected faces can be used to localize people and determine their layout in the image. Our contributions vis-à-vis the current state of the art are two-fold. First, we localize the explicit 3D positions of people in the real world, which improves the understanding of relative distances between people. To achieve an accurate layout, we develop an algorithm that robustly fuses semantic constraints about human inter-positions into a linear camera model. Previous work has only considered 2D layout. Additionally, we also compute the camera location and pose to make our solution view invariant; this capability is not present in previous works. Second, we analyze a large number of people configurations and provide statistical and structural features to capture them. Previous work has only considered simple posed photo shoots of people.
A significant number of images and videos uploaded to the Internet, such as YouTube videos or Flickr images, contain scenes of people interacting with people. Studying people interactions by analyzing their spatial configuration, also known as Proxemics in anthropology, is an important step towards understanding web images and videos. However, recognizing human spatial configurations (i.e., proxemes) has received relatively little attention in computer vision, especially for unconstrained user generated content.
A number of research groups [10, 4, 7, 5] have conducted insightful studies for understanding people interactions in images and videos, though with limited scope. Most of these approaches [10, 4] perform their analysis in the 2D camera space. Although these approaches have demonstrated their effectiveness, their robustness is fundamentally limited by the 2D analysis paradigm, and they cannot handle the diversity in camera pose and people depths often seen in user generated Internet content.
In recent work, [7] proposes to estimate the 3D locations of people from their faces and uses these locations to detect social interactions among people. In [5], the locations of faces in the 3D space around a camera-wearing person are used to detect attention patterns. However, these approaches only attempt to detect a very limited set of human interactions, and their 3D estimation cannot effectively handle the diversity of people in terms of age (adults vs. small children), height (tall vs. short), and pose, such as sitting, standing and standing on platforms. Additionally, these approaches do not take camera location and pose into account when analyzing people interactions, which can be an important cue about the intent of the shot.
The theory of Proxemics [8] studies the correlation between humans' use of space (proxemic behavior) and interpersonal communication. It provides a platform to understand the cues that are relevant in human interactions. Proxemics has been applied in the field of cinematography, where it is used for optimizing the scene layout and the position of the camera with respect to the characters in the scene. We believe these concepts are relevant beyond cinematic visuals and pervade all types of images and videos captured by people. Inspired by the role of Proxemics in the visual domain, we propose to analyze and recognize human interactions using the attributes studied in this theory.
In this paper, we propose a unified framework called 3D Visual Proxemics Analysis (VPA3D), for detecting and classifying people interactions from a single image. VPA3D first estimates people/face depths in 3D, then performs perspective rectification to map people locations from the scene space to the 3D space. Finally, a set of spatial and structural features are used to detect and recognize the six types of people interaction classes.
The proposed VPA3D approach surpasses state-of-the-art people configuration analysis in the following three aspects. First, VPA3D uses 3D reasoning for robust depth estimation in the presence of age, size, height and human pose variation in a single image. Second, a set of shape descriptors derived from the attributes of Proxemics is used to capture the type of people interaction from the perspective of each individual participant, not only for robust classification but also for classifying each individual's role in a visual proxeme. Additionally, the type of camera pose is used as a prior indicating the possible intent of the camera-person who took the picture. Third, to characterize the human interaction space, we introduce visual proxemes: a set of prototypical patterns that represent commonly occurring people interactions in social events. The source of our visual proxemes is the NIST TRECVID Multimedia Event Detection dataset [2], which contains annotated data for 15 high-level events. A set of 6 commonly occurring visual proxemes (shown in
Broadly, our 3D Visual Proxemic Analysis formulates a framework that unifies three related aspects, as illustrated in the system pipeline (
Proxemics is a branch of cultural anthropology that studies man's use of space as a means of nonverbal communication [8]. In this work, we leverage the findings in Proxemics to guide us in our analysis and recognition of human interactions in visual media, including images and videos. We call this Visual Proxemics and summarize our taxonomy of attributes in
A key concept in Proxemics is "personal space", which associates inter-person distance with the relationships among people. It is categorized into four classes: "intimate distance" for close family, "personal distance" for friends, "social distance" for acquaintances and "public distance" for strangers. Additionally, the people configuration needs to support the communicative factors, such as physical contact, touch, visual and voice factors, needed in an interaction [1]. Based on these factors, we can see that certain types of interactions will result in distinct shape configurations in 3D. For example, in
One area of interest is the application of proxemics to cinematography, where the shot composition and camera viewpoint are optimized for visual weight [1]. In cinema, a shot is either a long shot, a medium shot or a close-up depending on whether it depicts "public proxemics", "personal proxemics" or "intimate proxemics", respectively. Similarly, the camera viewpoint is chosen based on the degree of occlusion allowed in the scene. To assure full visibility of every character in the scene, a high-angle shot is chosen, whereas for intimate scenes and close-ups, an eye-level shot or low-angle shot is more suitable.
From these attributes, we can see that each of the interactions specified in
Given the 2D face locations in an image, the goal is to recover the camera height and the face positions in the X-Z plane relative to the camera center. These parameters are computed by using the camera model described in [9] and iterating between the following two steps:
1. Initializing the model with coarse parameter estimates through a robust estimation technique. In addition to the parameters, we also detect outliers: face locations that do not fit the model hypothesis of uniform people heights and poses. This is described as the outlier detection step.
2. Refining the parameter estimates by 3D reasoning about the positions of outliers in relation to the inliers, based on domain constraints that relate people's heights and poses. This is called the outlier reasoning step.
The model alternates between estimating camera parameters and applying positional constraints until convergence is reached. We illustrate this approach in
This section describes an algorithm to estimate face depths, horizon line and camera height from 2D face locations in an image. Our model is based on the camera model described in [9]. The derivation is variously adapted from the presentations in [9, 7, 3]. We provide the derivation explicitly for the sake of completeness.
The coordinate transformation of a point using a typical pinhole camera model with uniform aspect ratio, zero skew and restricted camera rotation is given by,
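Equation (1) itself is not reproduced in this text. A plausible form, written under the assumption of a standard pinhole model with focal length $f$, principal point $(u_c, v_c)$, a single tilt $\theta_x$ about the x axis, and the camera at height $y_c^w$ above the ground ($\lambda$ is a projective scale factor; the exact parameterization and sign conventions in [9] may differ), is:

$$
\lambda \begin{bmatrix} u^i \\ v^i \\ 1 \end{bmatrix} =
\begin{bmatrix} f & 0 & u_c \\ 0 & f & v_c \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\theta_x & -\sin\theta_x \\ 0 & \sin\theta_x & \cos\theta_x \end{bmatrix}
\begin{bmatrix} x^w \\ y^w - y_c^w \\ z^w \end{bmatrix}
$$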
where $(u^i, v^i)$ are its image coordinates, $(x^w, y^w, z^w)$ are its 3D coordinates, and $(u_c, v_c)$ are the coordinates of the camera center (superscript $w$ denotes 3D world coordinates and superscript $i$ denotes image coordinates). We assume that the camera is located at $(x_c^w = 0, z_c^w = 0)$ and tilted slightly along the x axis by $\theta_x$; $y_c^w$ is the camera height and $f$ is the focal length.
At this stage some simplifying assumptions are made: (a) faces have constant heights; (b) faces rest on the ground plane, which implies $y^w = 0$; the grounded position projects onto the bottom edge of the face bounding box in the image, $u^i = u_b^i$, $v^i = v_b^i$; (c) the camera tilt is small, which implies $\cos\theta_x \approx 1$ and $\sin\theta_x \approx \theta_x \approx \tan\theta_x \approx (v_c - v_0^i)/f$, where $v_0^i$ is the height of the horizon line (also known as the vanishing line) in image coordinates. By applying these approximations, we estimate $z^w$ and $x^w$ respectively,
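The resulting expressions are likewise not reproduced here. Under the linearized model of [9] they plausibly take the following form, up to the sign conventions of the image coordinate frame (treat this as an assumption rather than the exact equations):

$$
z^w \approx \frac{f\, y_c^w}{v_0^i - v_b^i}, \qquad
x^w \approx \frac{z^w\,(u_b^i - u_c)}{f}
$$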
The estimated $z^w$ is the 3D distance in depth from the camera center $z_c^w$ and $x^w$ is the horizontal distance from the camera center $x_c^w$. Using these $(x^w, z^w)$ coordinates, we can undo the perspective projection of the 2D image and recover the perspective-rectified face layout in the 3D coordinate system. Substituting the value of $z^w$ into the equation for $y^w$ and ignoring small terms, we get,
$$y^w\,(v_b^i - v_0^i) = y_c^w\,(v^i - v_b^i) \qquad (2)$$
This equation relates the world height of a face ($y^w$) to its image height ($v^i - v_b^i$) through its vertical position in the image ($v_b^i$) and through two unknowns: the camera height ($y_c^w$) and the horizon line ($v_0^i$). In general, given $N \geq 2$ faces in an image, we have the following system of linear equations.
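The system itself is not reproduced in this text; rewriting equation (2) for each face $j = 1, \dots, N$ with an assumed constant world face height $y^w$, one plausible arrangement of the unknowns $(y_c^w, v_0^i)$ is:

$$
\begin{bmatrix}
v_1^i - v_{b,1}^i & y^w \\
\vdots & \vdots \\
v_N^i - v_{b,N}^i & y^w
\end{bmatrix}
\begin{bmatrix} y_c^w \\ v_0^i \end{bmatrix}
=
\begin{bmatrix} y^w\, v_{b,1}^i \\ \vdots \\ y^w\, v_{b,N}^i \end{bmatrix}
$$

where the subscript $j$ indexes the detected faces.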
Thus, given an image with at least two detected faces, we can simultaneously solve for the two unknowns by minimizing the linear least squares error.
To get meaningful camera parameters, it is essential to filter out irregular observations that violate the model hypothesis. We use Random Sample Consensus (RANSAC [6]) to reject these so-called outliers and obtain robust estimates. RANSAC is an iterative framework with two steps.
First, a minimal sample set (2 face locations) is selected and the model parameters $(\hat{z}^w, \hat{y}_c, \hat{v}_0)$ are computed by a least squares estimator (as explained above). Next, each instance of the observation set is checked for consistency with the estimated model. We estimate the face height in the image according to the model using $\hat{h}^i = h^w (v_b - \hat{v}_0)/\hat{y}_c$ and compute the deviation from the observed height using $e_M^i = \lVert h^i - \hat{h}^i \rVert$ to find the estimator error for that face. Outliers are instances whose summed errors over all the iterations exceed a pre-defined threshold.
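A minimal sketch of this robust fitting step is given below, assuming each face is represented by its bottom-edge position $v_b$ and image height $h$ (NumPy arrays) and that the world face height is a fixed constant; the function names, threshold rule and parameter values are illustrative assumptions, not the actual implementation.

```python
import numpy as np

def fit_camera(v_b, h, face_height_w=1.0):
    """Least-squares fit of camera height y_c and horizon v_0 from equation (2)."""
    # Each face contributes one row: h_j * y_c + face_height_w * v_0 = face_height_w * v_b_j
    A = np.column_stack([h, np.full(len(h), face_height_w)])
    b = face_height_w * v_b
    (y_c, v_0), *_ = np.linalg.lstsq(A, b, rcond=None)
    return y_c, v_0

def ransac_camera_fit(v_b, h, face_height_w=1.0, n_iters=200, rel_thresh=0.2, seed=0):
    """RANSAC-style loop: fit minimal samples, accumulate per-face errors, flag outliers."""
    rng = np.random.default_rng(seed)
    n = len(v_b)
    errors = np.zeros(n)
    for _ in range(n_iters):
        i, j = rng.choice(n, size=2, replace=False)        # minimal sample: two faces
        y_c, v_0 = fit_camera(v_b[[i, j]], h[[i, j]], face_height_w)
        h_pred = face_height_w * (v_b - v_0) / y_c         # model-predicted face heights
        errors += np.abs(h - h_pred)                       # deviation from observed heights
    # Flag faces whose accumulated error is large relative to their observed size (assumed rule).
    outliers = errors / n_iters > rel_thresh * np.abs(h)
    inliers = ~outliers
    y_c, v_0 = fit_camera(v_b[inliers], h[inliers], face_height_w)
    return (y_c, v_0), inliers, outliers
```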
The linearized model is based on the hypothesis that all faces are (a) located on the same plane and (b) of the same size. However, these assumptions do not always hold in practice. The faces that violate these assumptions are detected as outliers in the RANSAC step. Conventionally, outliers are treated as noisy observations and rejected from the estimates. However, outlier faces may occur because of variations in face sizes and heights arising from differences in age, pose (sitting versus standing) and physical planes of reference (ground level or on a platform). Hence, instead of eliminating them from consideration, we attempt to reason about them and restore them in our calculations. To do this, we make use of the semantics of Visual Proxemics to constrain the possible depth orderings of the outlier faces in the image. In particular, we consider two types of constraints: the visibility constraint and the localized pose constraint, as explained below.
Consider the pose configuration in
$$\delta(x^i - x^{i*})\,\delta(y^i - y^{i*})\,(\hat{z}^w - \hat{z}^{w*}) \leq 0 \qquad (4)$$
where $\delta(a - b)$ is 1 when $a = b$, $\hat{z}^w$ is the RANSAC estimate of depth, and the starred quantities refer to the other face in the pair. For each outlier in the image, we determine whether it shares a visibility constraint with any of the inliers and maintain an index of all such pairs. Each such (outlier, inliers) pair is assumed to share a common ground plane (i.e., the people are standing or sitting at the same ground level). Based on this assumption, the height estimates of the outliers are refined, as described in Section 3.2.3.
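A rough sketch of how such pairs might be indexed is given below; the box representation and the overlap test are assumptions for illustration, while the actual criterion is the one stated in equation (4).

```python
def visibility_constraint_pairs(outlier_boxes, inlier_boxes):
    """Index (outlier, inlier) pairs whose face boxes overlap in the image.

    Overlapping faces are subject to the visibility constraint of equation (4):
    their estimated depth ordering must be consistent with which face occludes
    the other. Boxes are assumed to be (x_min, y_min, x_max, y_max) tuples in
    image coordinates; only candidate pairs are collected here.
    """
    def overlap(a, b):
        return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

    return [(oi, ii)
            for oi, obox in enumerate(outlier_boxes)
            for ii, ibox in enumerate(inlier_boxes)
            if overlap(obox, ibox)]
```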
This constraint assumes that the people who are physically close to each other also share the same pose. Consider the group photo in
Formally, let Nx
For each outlier in the image, we perform this constraint test to determine the (outlier, inliers) pairs that satisfy the localized pose constraint. These pairs are used to refine the height estimates of the outliers, as described in the following section.
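Since the formal definition above is truncated in this text, the sketch below only illustrates the stated intuition that nearby people share the same pose and ground level; the neighborhood test over horizontal image positions and the window size are assumptions.

```python
def localized_pose_pairs(outlier_x, inlier_x, window=50.0):
    """Pair each outlier with inliers whose horizontal image positions are nearby.

    Nearby people are assumed to share the same pose and ground level, so the
    matched inliers can anchor the outlier's height refinement. The window size
    (in pixels) is an assumed parameter for illustration only.
    """
    pairs = []
    for oi, ox in enumerate(outlier_x):
        neighbors = [ii for ii, ix in enumerate(inlier_x) if abs(ox - ix) <= window]
        if neighbors:
            pairs.append((oi, neighbors))
    return pairs
```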
The height estimates of the outliers are refined using the semantically constrained set of inliers. Specifically, we make use of a piecewise-constant ground plane assumption in the image to estimate the outlier heights in the world. By assuming that the outliers are located at the same level as the related inliers, the world height ($h^w$) of an outlier can be calculated in proportion to the inliers. Let $B_j^{out}$ be the body height of an outlier and $\hat{G}_k^{in}$ be the ground plane approximation for a neighboring inlier. The ground level is calculated by translating the vertical position of the face by a quantity proportional to the image height (we assume the body height is 7 times the face height). The body height of the outlier is based on the average ground plane estimated from its inliers. The face height is then calculated as a fraction of the estimated body height.
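A minimal sketch of this refinement, under the stated assumptions (body height roughly 7 face heights, a shared ground level with the paired inliers, and an image v axis measured upward), is given below; the variable names and exact bookkeeping are illustrative.

```python
FACE_TO_BODY = 7.0  # assumed body-height-to-face-height ratio, as stated in the text

def refine_outlier_heights(face_h, face_bottom_v, pairs):
    """Estimate corrected face heights for outlier faces from their paired inliers.

    face_h[j]        : detected face height of person j in the image (pixels)
    face_bottom_v[j] : vertical position of the bottom of face j (v axis pointing up)
    pairs            : list of (outlier_index, [inlier_indices]) from the constraints
    """
    refined = {}
    for oi, inlier_ids in pairs:
        # Ground level implied by each paired inlier: face bottom shifted down by its body height.
        ground_levels = [face_bottom_v[ii] - FACE_TO_BODY * face_h[ii] for ii in inlier_ids]
        ground = sum(ground_levels) / len(ground_levels)   # average ground plane of the inliers
        # Body height of the outlier measured from that shared ground level,
        # then its face height as a fixed fraction of the estimated body height.
        body_h = face_bottom_v[oi] - ground
        refined[oi] = body_h / FACE_TO_BODY
    return refined
```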
The new height ratios are inputs to the next round of RANSAC step that produce new estimates of face depths and camera heights. We perform this iteration 3-4 times in our model.
To capture the spatial arrangement of people, we construct features based on the attribute taxonomy of Visual Proxemics presented in Section 3.1 (
The raw features measure different types of statistics and thus lie on different scales. To fit the distribution of each feature within a common scale, we use a sigmoid function that converts feature values into probabilistic scores between zero and one. Additionally, some of these features are meaningful within a certain range of values. Shifting a sigmoid function according to the threshold value allows soft thresholding. If σ is the threshold for feature x and c is the weight, we get the following expression for the sigmoid function.
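The expression itself is not reproduced in this text; a standard shifted sigmoid consistent with the description above (an assumed form) would be:

$$ p(x) = \frac{1}{1 + e^{-c\,(x - \sigma)}} $$

which maps values well below the threshold $\sigma$ toward 0 and values well above it toward 1, with $c$ controlling the sharpness of the transition.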
To compute an aggregate feature from all the faces in an image, we consider the mean and variance values of each feature and then fit the sigmoid function to re-adjust the values. The feature corresponding to an image is a concatenated vector of these probability scores.
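The following minimal sketch illustrates this aggregation step; the per-face feature layout and the per-feature thresholds and weights are hypothetical placeholders rather than the values used in the actual system.

```python
import numpy as np

def soft_threshold(x, sigma, c):
    """Shifted sigmoid used for soft thresholding (assumed form, see above)."""
    return 1.0 / (1.0 + np.exp(-c * (x - sigma)))

def image_feature(per_face_features, sigmas, weights):
    """Aggregate per-face raw features into a single image-level descriptor.

    per_face_features : array of shape (num_faces, num_raw_features)
    sigmas, weights   : per-feature threshold and sharpness parameters (assumed)
    """
    mean = per_face_features.mean(axis=0)
    var = per_face_features.var(axis=0)
    # Map the mean and variance of every raw feature into [0, 1] and concatenate.
    return np.concatenate([soft_threshold(mean, sigmas, weights),
                           soft_threshold(var, sigmas, weights)])
```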
In this paper we present 3D Visual Proxemics Analysis, a framework that integrates Visual Proxemics with 3D arrangements of people to identify typical social interactions in Internet images. Our results demonstrate that this unified approach surpasses the state of the art both in 3D estimation of people layout from detected faces and in classification of social interactions. We believe that the inclusion of semantics allowed us to estimate a better 3D layout than purely statistical approaches. A better 3D geometry, in turn, allowed us to define features informed by Proxemics that improved our semantic classification. In the future, we hope to delve deeper into this synergistic approach by adding other objects and expanding our semantic vocabulary of Visual Proxemics.
Further embodiments of the invention will utilize algorithmic proxemic analysis of images to provide a variety of functions in a variety of systems and applications, including:
This disclosure is to be considered as exemplary and not restrictive in character, and all changes and modifications that come within the spirit of the disclosure are desired to be protected.
The present application claims the benefit of and priority to U.S. Provisional Application Ser. No. 61/787,375, filed Mar. 15, 2013, which is incorporated herein by this reference in its entirety.
This invention was made in part with government support under contract number D11PC20066 awarded by the United States Department of the Interior—National Business Center. The Government has certain rights in this invention.