Referring now to the drawings, the configuration of the posture estimation apparatus according to this embodiment will be described.
The posture estimation apparatus includes a posture dictionary A that stores information on various postures, an image capture unit 1 that captures images, an image feature extracting unit 2 that extracts image features such as a silhouette or an edge from an image acquired by the image capture unit 1, a posture prediction unit 3 that predicts the posture in the current frame using the result of estimation in the previous frame and the information in the posture dictionary A, and a tree structure posture estimation unit 4 that estimates the current posture, on the basis of the tree structure of postures stored in the posture dictionary A, using the predicted posture and the image features extracted by the image feature extracting unit 2.
The posture estimation apparatus is realized by, for example, using a general computer apparatus as basic hardware. That is, the image feature extracting unit 2, the posture prediction unit 3, and the tree structure posture estimation unit 4 are realized by causing a processor mounted in the computer apparatus to execute a program. The posture estimation apparatus may be realized by installing the program in the computer apparatus in advance, or by storing the program in a storage medium such as a CD-ROM or distributing it through a network and installing it in the computer apparatus as needed. The posture dictionary A is realized by utilizing, as needed, a memory provided externally or integrally with the computer apparatus, a hard disk, or a storage medium such as a CD-R, CD-RW, DVD-RAM, or DVD-R.
In this specification, the term “prediction” means to obtain information on the current posture only from information on the postures in the past. The term “estimation” means to obtain the information on the current posture from the information on the predicted current posture and an image of the current posture.
The posture dictionary A is prepared in advance, before the posture estimation is performed. The posture dictionary A stores tree structure data including a plurality of nodes, each of which holds joint angle data A1 for various postures, image features with occlusion information A2 obtained, for the respective postures, from three-dimensional shape data of the body of the person whose posture is to be estimated, and representing posture information A3 indicating the representing posture of each node.
A method of preparing the posture dictionary A by the dictionary generating unit 10 will be described.
A posture acquiring unit 101 collects the joint angle data A1, and may be configured by a commercially available motion capture system using markers, sensors, or the like.
Since the acquired postures include redundant postures, similar postures are deleted as follows.
Each of the joint angle data A1 is a set of three rotation angles rx, ry, rz (Euler angles) about the three-dimensional space axes of the respective joints. Assuming that the human body has Nb joints, the posture data Xa of a posture "a" is expressed as Xa = {rx1, ry1, rz1, rx2, . . . , rz(Nb)}. The difference between two posture data Xa and Xb is defined as the maximum absolute difference of the corresponding elements, that is, the maximum absolute difference of the respective joint rotation angles, and one of the two postures is deleted when this difference is smaller than a certain value.
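As an illustration, the following sketch removes redundant postures by the maximum-absolute-difference criterion described above; the function name, the angle units, and the 5-degree threshold are assumptions for the example, not values prescribed by the embodiment.

```python
import numpy as np

def prune_similar_postures(postures, threshold_deg=5.0):
    """Keep only postures whose maximum joint-angle difference to every
    already kept posture is at least the threshold."""
    kept = []
    for x in postures:
        # Difference between two postures = maximum absolute difference
        # of the corresponding joint rotation angles.
        if all(np.max(np.abs(x - y)) >= threshold_deg for y in kept):
            kept.append(x)
    return np.asarray(kept)

# Example: 1000 random postures for a body with 15 joints (45 Euler angles each).
rng = np.random.default_rng(0)
raw = rng.uniform(-90.0, 90.0, size=(1000, 45))
compact = prune_similar_postures(raw)
print(len(raw), "postures reduced to", len(compact))
```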
A three-dimensional shape acquiring unit 102 measures the person whose posture is to be estimated with a commercially available three-dimensional scanner or the like, and acquires vertex position data of polygons which approximate the shape of the surface of the human body.
When there are too many polygons, the number of vertexes is reduced, and a three-dimensional shape model of the human body is generated by setting the positions of the joints of the human body (such as the elbows, knees, and shoulders) and assigning every polygon to a portion of the human body (such as the upper arms, head, or chest).
Although this operation may be performed by any method, it is generally performed manually using commercially available computer graphics software. Reduction of the vertexes may be achieved automatically by thinning the vertexes at regular distances, or by thinning more vertexes from portions of the surface having smaller curvature. It is also possible to prepare a plurality of three-dimensional shape models of standard body shapes instead of measuring the person whose posture is actually to be estimated, and to select the three-dimensional shape model which is most similar to the body shape of that person.
A three-dimensional shape deforming unit 103 changes positions of vertexes of the polygons which constitute the three-dimensional model by setting the joint angles in the respective postures acquired by the posture acquiring unit 101 to the respective joints of the three-dimensional shape model of the human body generated by the three-dimensional shape acquiring unit 102, so that the three-dimensional shape model is deformed to the respective postures.
A virtual image capture unit 104 generates projected images of the three-dimensional shape model in the respective postures by projecting the polygons which constitute the three-dimensional shape models deformed into the respective postures by the three-dimensional shape deforming unit 103 onto an image plane, using a virtual camera configured in the computer to have the same camera parameters as the image capture unit 1, while taking the occlusion relations of the polygons into consideration.
When the polygons are projected onto the image, the index number of the human body portion to which each polygon belongs is set as the value of the pixels onto which that polygon is projected, so that a projected image with portion indexes is generated.
An image feature extracting unit 105 extracts a silhouette and an outline from the projected image with the portion indexes generated by the virtual image capture unit 104 as image features, and prepares a “model silhouette” and a “model outline”. These image features are stored in the posture dictionary A in coordination with the joint angle data of the posture.
An occlusion detection unit 106 obtains an area (number of pixels) for the respective portions using the projected image with portion indexes, and extracts the portions having an area of 0 or an area smaller than a threshold value as occluded portions.
When storing these portions in the posture dictionary A, flags are prepared for each portion, and the flags of the occluded portions are turned on. These flags are coordinated with the joint angle data of the respective postures and are stored in the posture dictionary A.
A tree structure generating unit 107 generates a tree structure of postures, on the basis of the image feature distance between postures defined from the image features extracted by the image feature extracting unit 105, so that the image feature distance between the postures registered in each node becomes smaller (that is, the postures become more similar) toward the lower levels.
The image feature distance d(a, b) between a posture "a" and a posture "b" is calculated on the basis of the outline information extracted by the image feature extracting unit 105 as follows.
A plurality of evaluation points Ra are set on the outline Ca of the posture "a". The evaluation points may be all the pixels on the outline Ca, or pixels obtained by thinning the outline at adequate intervals. For each evaluation point pa, the distance to the closest point among the points pb on the outline Cb of the posture "b" is calculated, and the average over all the evaluation points is taken as the image feature distance between the posture "a" and the posture "b":

d(a, b) = (1/Nca) Σ_{pa∈Ra} min_{pb∈Cb} ||pa − pb||

where Nca represents the number of evaluation points included in Ra. The image feature distance is zero when the two postures are the same, and increases as the projected images of the posture "a" and the posture "b" become more different.
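As a concrete sketch of this outline distance, the following function uses a k-d tree to find the closest outline point; the thinning step and the function name are illustrative assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def outline_distance(outline_a, outline_b, thin_step=1):
    """Average distance from evaluation points on outline a to the closest
    point on outline b, i.e. the image feature distance d(a, b).

    outline_a, outline_b: (N, 2) arrays of pixel coordinates on the outlines.
    thin_step: keep every thin_step-th pixel of outline a as an evaluation point.
    """
    eval_points = outline_a[::thin_step]      # evaluation points Ra
    tree_b = cKDTree(outline_b)               # outline Cb of posture "b"
    nearest_dist, _ = tree_b.query(eval_points)
    return float(nearest_dist.mean())         # average over the Nca evaluation points
```

Note that d(a, b) defined in this way is not necessarily equal to d(b, a).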
The tree structure generating unit 107 generates the tree structure by the following procedure.
The uppermost level, which corresponds to the root of the tree structure, is set as the current level, and a node is generated. All the postures acquired by the posture acquiring unit 101 are registered to this node.
The current level is then moved down by one level.
When the current level exceeds a defined maximum number of levels, generation of the tree structure is ended. Otherwise, the following procedure is repeated for each of the nodes (parent nodes) of the level immediately above the current level.
The image feature distances between an arbitrary posture among the postures registered in the parent node (referred to as "parent postures"), for example the posture registered first, and the remaining parent postures are calculated, and a histogram of these image feature distances is prepared. The posture closest to the most frequent value of the histogram is determined as the first selected posture.
For each parent posture which has not yet been selected, the minimum image feature distance to the already selected postures is calculated; this value is referred to as the "selected posture minimum distance." The posture whose selected posture minimum distance is the largest is determined as a new selected posture.
When no selected posture minimum distance exceeds the threshold value specified for the current level, the posture selection step is ended. By setting the threshold value smaller for lower levels, a tree structure which has more nodes toward the lower levels can be generated.
Nodes are generated for the respective selected postures, and the selected postures are registered to the corresponding nodes. The generated nodes are connected to the parent node. Each parent posture which is not selected is registered to the node of the selected posture at the minimum image feature distance from it.
When the processing has not been completed for all the parent nodes, the next parent node is selected and the procedure returns to the first posture selection step. When it has been completed, the procedure returns to the step of moving down to the next lower level.
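The posture selection and node assignment described above can be sketched as follows, assuming that a matrix of image feature distances between the parent postures has already been computed; the histogram binning and the per-level threshold are illustrative parameters, not values fixed by the embodiment.

```python
import numpy as np

def split_parent_node(dist, threshold):
    """Select representative postures for the child nodes of one parent node
    and assign every parent posture to one of them.

    dist: (N, N) matrix of image feature distances between the parent postures.
    threshold: threshold on the selected posture minimum distance for this level.
    Returns (list of selected posture indices, assignment array of length N).
    """
    # First selected posture: closest to the most frequent value of the
    # histogram of distances from an arbitrary posture (here, posture 0).
    hist, edges = np.histogram(dist[0], bins=20)
    mode_value = 0.5 * (edges[np.argmax(hist)] + edges[np.argmax(hist) + 1])
    selected = [int(np.argmin(np.abs(dist[0] - mode_value)))]

    while True:
        # Selected posture minimum distance for every not-yet-selected posture.
        min_to_selected = dist[:, selected].min(axis=1)
        min_to_selected[selected] = -np.inf          # exclude already selected postures
        candidate = int(np.argmax(min_to_selected))
        if min_to_selected[candidate] <= threshold:
            break                                    # no remaining posture is far enough
        selected.append(candidate)

    # Register every unselected posture to the node of the closest selected posture.
    assignment = np.asarray(selected)[np.argmin(dist[:, selected], axis=1)]
    return selected, assignment
```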
The data stored in the posture dictionary A is organized as follows.
The joint angle data A1, the model silhouette and the model outline extracted by the image feature extracting unit 105, and the occlusion flags obtained by the occlusion detection unit 106 are stored for each posture acquired by the posture acquiring unit 101. The model silhouette, the model outline, and the occlusion flags are collectively referred to as the image feature with occlusion information A2. An address is assigned to each posture, so all of its data are accessible by referring to that address.
The addresses are assigned also to the respective nodes of the tree structure, and the addresses of the postures which are registered to the corresponding node, and the addresses of the nodes connected thereto on the upper level and the lower level (which are referred to as parent nodes and child nodes respectively) are stored in each node. The posture dictionary A stores the set of these data relating to all the nodes as the image feature tree structure.
A method of posture estimation performed from the image obtained from a camera using the posture dictionary A will be described.
The image capture unit 1 is a camera that captures images of the person whose posture is to be estimated and transmits the captured images to the image feature extracting unit 2.
The image feature extracting unit 2 detects a silhouette and an edge in each image transmitted from the image capture unit 1; these are referred to as the observed silhouette and the observed edge, respectively.
An observed silhouette extracting unit 21 acquires, in advance, a background image that does not contain the person whose posture is to be estimated, and calculates the difference in luminance or color between the background image and the image of the current frame. The observed silhouette extracting unit 21 generates the observed silhouette by assigning a pixel value of 1 to pixels whose difference is larger than a threshold value and a pixel value of 0 to the other pixels. This is the most basic background subtraction method, and other background subtraction methods may be employed.
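A minimal sketch of this basic background subtraction, assuming grayscale images held as NumPy arrays; the threshold value is an illustrative assumption.

```python
import numpy as np

def observed_silhouette(frame, background, threshold=30):
    """Pixels whose luminance differs from the pre-acquired background image
    by more than the threshold are set to 1, all other pixels to 0."""
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    return (diff > threshold).astype(np.uint8)
```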
An observed edge extracting unit 22 calculates the gradient of the luminance or of each color band by applying a differential operator such as the Sobel operator to the image of the current frame, and detects the set of pixels whose gradient takes a local maximum as the observed edge. This is one of the most basic edge detection methods, and other edge detection methods such as the Canny edge detector may be employed.
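The sketch below approximates this step with a Sobel gradient magnitude followed by thresholding (a simplification of the local-maximum selection); the Canny detector mentioned in the text is shown as an alternative, and the threshold values are assumptions.

```python
import numpy as np
import cv2

def observed_edge(frame_gray, grad_threshold=100.0):
    """Binary edge map from the Sobel gradient magnitude of a grayscale frame."""
    gx = cv2.Sobel(frame_gray, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(frame_gray, cv2.CV_32F, 0, 1, ksize=3)
    magnitude = np.sqrt(gx * gx + gy * gy)
    return (magnitude > grad_threshold).astype(np.uint8)

# Alternative mentioned in the text:
# edges = cv2.Canny(frame_gray, 50, 150)
```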
The posture prediction unit 3 predicts the posture of the current frame using a dynamic model from the posture estimation results of a previous frame.
The posture prediction may be represented in the form of a probability density distribution, and the state transition probability density with which the posture (joint angles) Xt−1 of the previous frame changes to the posture Xt of the current frame is expressed as p(Xt|Xt−1). Determining the dynamic model corresponds to determining this probability density distribution. The simplest dynamic model is a normal distribution whose mean is the posture of the previous frame and whose variance-covariance matrix is a predetermined constant:
p(Xt|Xt−1) = N(Xt−1, Σ)   (Expression 2)
where N( ) represents the normal distribution. That is, the dynamic model includes a parameter that determines the representative value of the predicted posture and a parameter that determines the range of the predicted posture. In the case of Expression 2, the parameter that determines the representative value is the constant 1, which is the coefficient of Xt−1, and the parameter that determines the range of the predicted posture is the variance-covariance matrix Σ.
In addition, there are a method of linearly predicting the mean value by assuming that the velocity of the previous frame is constant and a method of predicting it by assuming a constant acceleration. All of these dynamic models are based on the assumption that the posture does not change significantly from the posture of the immediately preceding frame.
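A sketch of the simplest dynamic model of Expression 2, drawing predicted postures from a normal distribution centered on the previous posture; the per-joint standard deviations and the sample count are assumptions, and the constant-velocity variant is indicated in a comment.

```python
import numpy as np

def predict_postures(x_prev, sigma, num_samples=100, rng=None):
    """Sample candidate postures for the current frame from
    p(Xt | Xt-1) = N(Xt-1, Sigma), with Sigma a diagonal covariance here."""
    rng = rng or np.random.default_rng()
    cov = np.diag(np.asarray(sigma) ** 2)       # variance-covariance matrix Sigma
    return rng.multivariate_normal(mean=x_prev, cov=cov, size=num_samples)

# Constant-velocity variant: center the distribution on a linear extrapolation,
# mean = x_prev + (x_prev - x_prev_prev), instead of on x_prev itself.
```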
The variance represents the uncertainty of the prediction: the larger the variance, the larger the variation of the predicted posture in the current frame. If the variance-covariance matrix Σ is assumed to be constant, the following problem occurs when a portion is occluded.
The current posture is determined by considering both the prediction (prior probability) and the conformity (likelihood) with the observation obtained from the image. However, while a portion is occluded by another portion and is therefore not visible from the image capture unit 1, it cannot be observed in the image, and the posture of that portion in the current frame is determined only by the prediction based on the dynamic model. If the variance of the dynamic model is constant, then when the occluded portion reappears with a posture outside the range predictable by the dynamic model, the prior probability of the actual current posture is very low. Consequently, even though the conformity with the observation obtained from the image is high, the actual posture of the current frame cannot be obtained, and the posture estimation fails.
This problem is solved by increasing only the variance of the occluded portion. Since each posture in the posture dictionary A stores the occlusion flags of the respective portions, the occluded portion is identified using the occlusion flags relating to the posture Xt−1 of the previous frame, and the joint angles of the occluded portion are predicted using a variance larger than that of the portions which are not occluded. It is also possible to use a variable variance which increases gradually in proportion to the length of time the portion has been occluded. For example, an upper limit of the variance is preset, and the variance is increased in proportion to the occluded time until it reaches the upper limit.
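The occlusion-dependent variance can be sketched as below; the per-angle occlusion flags follow from the portion flags stored in the dictionary, while the growth rate and the upper limit are illustrative assumptions rather than values given in the text.

```python
import numpy as np

def occlusion_aware_sigma(base_sigma, occluded, occluded_frames,
                          growth_per_frame=0.2, max_scale=3.0):
    """Enlarge the prediction standard deviation of the joint angles that
    belong to occluded portions, in proportion to how long they have been
    occluded, up to a preset upper limit.

    base_sigma: (num_angles,) standard deviations of the dynamic model.
    occluded: (num_angles,) boolean flags, True for angles of occluded portions.
    occluded_frames: (num_angles,) consecutive frames each portion has been occluded.
    """
    scale = np.minimum(1.0 + growth_per_frame * np.asarray(occluded_frames), max_scale)
    return np.where(occluded, np.asarray(base_sigma) * scale, base_sigma)
```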
The tree structure posture estimation unit 4 estimates the current posture while referring to the tree structure of the posture dictionary A, using the posture predicted by the posture prediction unit 3 and the observed silhouette and observed edge extracted as image features by the image feature extracting unit 2. Details of the posture estimating method using the tree structure are described in the above-described document by B. Stenger et al., and an outline of this method is described briefly below.
Each node of the tree structure stored in the posture dictionary A contains a plurality of postures whose image features are close to each other. For each node, the posture whose sum of image feature distances to the other postures belonging to that node is the smallest is determined as the representing posture, and the image feature of the representing posture is taken as the representative image feature of the node. This representative image feature corresponds to the representing posture information A3.
A calculating node reducing unit 41 obtains the prior probability that the representative image feature of a node is observed as the image feature of the current frame, using the posture prediction of the posture prediction unit 3 and the estimation result of the previous frame. When this prior probability is sufficiently small, the subsequent calculation for that node is skipped.
When the probability of the posture estimation result of the current frame (calculated by the posture estimation unit 43) has already been obtained for the upper level, the subsequent calculation is also skipped for the nodes of the current level that are connected to an upper-level node whose probability is sufficiently small.
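A sketch of this pruning is shown below; the node representation, the dictionary lookups, and the cutoff values are assumptions for the example.

```python
def prune_nodes(nodes, prior, parent_posterior, prior_eps=1e-4, parent_eps=1e-4):
    """Skip nodes whose prior probability is negligible, and nodes whose
    parent node received a negligible posterior probability in the upper level.

    nodes: list of node objects, each with .index and .parent attributes.
    prior: dict mapping node index -> prior probability from the posture prediction.
    parent_posterior: dict mapping parent node index -> posterior from the upper level.
    """
    survivors = []
    for node in nodes:
        if prior.get(node.index, 0.0) < prior_eps:
            continue            # the prediction makes this node implausible
        if node.parent is not None and parent_posterior.get(node.parent, 0.0) < parent_eps:
            continue            # its parent node was already ruled out in the upper level
        survivors.append(node)
    return survivors
```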
A similarity calculating unit 42 calculates the image feature distance between the representative image features of the respective nodes and the observed image feature extracted by the image feature extracting unit 2.
The image feature distances are calculated for the various positions and scales in the vicinity of the estimated position and scale in the previous frame in order to estimate the 3D position of a person to be recognized.
The movement of the position on the image corresponds to the movement in the three-dimensional space in the direction parallel to the image plane, and the change of the scales corresponds to the parallel movement in the direction of the optical axis.
In the case of the outline, the image feature distance described for the tree structure generating unit 107 can be used. Furthermore, a method of dividing the outline into a plurality of bands on the basis of the edge direction (for example, into four bands for the horizontal direction, the vertical direction, the direction inclined rightward and upward, and the direction inclined leftward and upward) and calculating the outline distance for each band is often used.
In the case of the silhouette, the exclusive OR is calculated for each pixel of the model silhouette and the observed silhouette, and the sum of the resulting values, each of which is 1 or 0, is taken as the silhouette distance. There is also a method of weighting this sum so that pixels closer to the center of the observed silhouette contribute more.
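The silhouette distance can be sketched as a per-pixel exclusive OR; the optional center weighting below uses a distance transform, which is one possible weighting scheme and an assumption not specified in the text.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def silhouette_distance(model_sil, observed_sil, center_weighted=False):
    """Sum of the pixelwise exclusive OR of two binary silhouettes."""
    xor = np.logical_xor(model_sil > 0, observed_sil > 0)
    if not center_weighted:
        return float(xor.sum())
    # Weight mismatches more heavily toward the interior of the observed silhouette.
    weights = 1.0 + distance_transform_edt(observed_sil > 0)
    return float((xor * weights).sum())
```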
A Gaussian distribution over the silhouette distance and the outline distance is assumed as the likelihood model, and the likelihood of the observation given a certain node is calculated from it.
In this apparatus, the calculation of the similarity, which is the processing of the similarity calculating unit 42, requires the largest amount of computation because it is performed for a large number of nodes. Since the posture dictionary A stored in this apparatus is configured on the basis of image feature distances, postures whose image features are similar are registered in the same node even when their joint angles differ significantly, so the similarity does not need to be calculated separately for these postures; the amount of calculation is thereby reduced and an efficient search is achieved.
The posture estimation unit 43 first obtains, by Bayes estimation, the posterior probability of each node given the current observed image feature from the prior probabilities and the likelihoods of the respective nodes.
The distribution of these probabilities itself corresponds to the estimation result for the current level. In the lowest level, however, the current posture may be determined uniquely; in this case, the node which has the highest probability is selected.
When the selected node in the lowest level includes a plurality of postures, the state transition probability is calculated between the postures registered in the selected node and the estimated posture in the previous frame, and the posture having the highest transition probability is outputted as the current posture.
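Putting these steps together, the per-level Bayes update and the final selection in the lowest level can be sketched as follows; the Gaussian likelihood on the image feature distance and the Gaussian form of the transition probability are illustrative assumptions consistent with the description above.

```python
import numpy as np

def estimate_level(priors, distances, sigma_obs=10.0):
    """Posterior probability of each node at the current level.

    priors: (num_nodes,) prior probabilities from the posture prediction.
    distances: (num_nodes,) image feature distances between each node's
        representative image feature and the observed image feature.
    """
    likelihood = np.exp(-0.5 * (np.asarray(distances) / sigma_obs) ** 2)
    posterior = np.asarray(priors) * likelihood
    return posterior / posterior.sum()

def select_final_posture(node_postures, x_prev, sigma_trans=5.0):
    """Among the postures registered in the selected lowest-level node, output
    the one with the highest transition probability from the previous estimate."""
    diffs = np.asarray(node_postures) - np.asarray(x_prev)
    log_trans = -0.5 * np.sum((diffs / sigma_trans) ** 2, axis=1)
    return node_postures[int(np.argmax(log_trans))]
```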
Since the posture prediction unit 3 performs the prediction while taking the occluded portions into consideration, the prior probability does not become excessively low even when the posture differs significantly before and after the occlusion, and stable posture estimation is achieved even when occlusion occurs.
Lastly, a level renewing unit 44 transfers the processing to the lower level if the current level is not the lowest level, and terminates the posture estimation if it is the lowest level.
With the apparatus configured as described above, the efficient and stable posture estimation of the human body is achieved.
The number of cameras is not limited to one, and a plurality of the cameras may be used.
In this case, the image capture unit 1 and the virtual image capture unit 104 each consist of the plurality of cameras. Accordingly, the image feature extracting unit 2 and the image feature extracting unit 105 perform their processing for each camera image, and the occlusion detection unit 106 sets the occlusion flags for the portions occluded from all the cameras.
The image feature distances (the silhouette distance or the outline distance) calculated by the tree structure generating unit 107 and the similarity calculating unit 42 are also calculated for the respective camera images, and an average value is employed as the image feature distance. The silhouette information, the outline information to be registered in the posture dictionary A, and the background information used for the background difference processing by the observed silhouette extracting unit 21 are held separately for the respective camera images.
When performing the search using the tree structure, a method of calculating the similarity using a low resolution for the upper levels and a high resolution for the lower levels is also applicable.
With the adjustment of the resolution as such, the calculation cost for calculating the similarity in the upper levels is reduced, so that the search efficiency may be increased.
Since the image feature distance between the nodes is large in the upper levels, the risk of obtaining a local optimal solution increases if the search is performed by calculating the similarity with the high resolution. In terms of this point, the adjustment of the resolution as described above is effective.
When the plurality of resolutions are employed, the image features relating to all the resolutions are obtained by the image feature extracting unit 2 and the image feature extracting unit 105. The silhouette information and the outline information on all the resolutions are also registered in the posture dictionary A. When transferring the processing to the next level by the level renewing unit 44, the resolution used in the next level is selected.
Although the silhouette and the outline are used as the image features in the embodiment shown above, it is also possible to use only the silhouette or only the outline.
When only the silhouette is used, the silhouette is extracted by the image feature extracting unit 105, and the tree structure is generated on the basis of the silhouette distance by the tree structure generating unit 107.
When only the outline is used, the outline may be divided into two types of boundaries, a boundary with the background and a boundary between portions of the human body, and either or both may be used as the outline.
The invention is not limited to the embodiments shown above, and may be embodied by modifying components without departing from the scope of the invention in the stage of implementation. Various embodiments may be configured by combining the plurality of components disclosed in the embodiments shown above as needed. For example, several components may be eliminated from all the components shown in the embodiments. Alternatively, the components in the different embodiments may be combined as needed.
Number | Date | Country | Kind |
---|---|---|---
2006-140129 | May 2006 | JP | national |