1. Field of the Invention
The present invention relates to an apparatus for behavior analysis and a method thereof. More particularly, it relates to an apparatus, an algorithm, and a method for behavior analysis, irregular activity detection, and video surveillance of specific objects such as human beings.
2. Prior Art
Behavior analysis, for example of humans, is an important task in applications such as video surveillance, video retrieval, human interaction systems, and medical diagnosis. The result of behavior analysis can provide important safety information, allowing users to recognize suspicious people, detect unusual surveillance states, find illegal events, and thus understand all kinds of daily human activities from videos. Many approaches have been proposed for analyzing human behaviors directly from videos. For example, a visual surveillance system has been proposed to model and recognize human behaviors using HMMs (Hidden Markov Models) and the trajectory feature. A trajectory-based recognition system has also been proposed to detect pedestrians outdoors and recognize their activities from multiple views based on a mixture-of-Gaussians classifier. In addition to trajectories, other approaches use human postures or body parts (such as head, hands, torso, and feet) to analyze human behaviors. For example, complex 3-D models and multiple video cameras are used to extract 3-D voxels for 3-D posture analysis, and 3-D laser scanners and the wavelet transform are used to recognize different 3-D human postures. Although 3D features are more useful for classifying human postures in detail, the inherent correspondence problem and the high cost of 3D acquisition equipment make them infeasible for real-time applications. Therefore, more approaches to human behavior analysis are based on 2D postures. For example, a probabilistic posture classification scheme is provided for classifying human behaviors such as walking, running, squatting, or sitting. In addition, a 2D posture classification system is presented for recognizing human gestures and behaviors within an HMM framework. Furthermore, the Pfinder system, based on a 2D blob model, is used for tracking and recognizing human behaviors.
The challenge in incorporating 2D posture models into human behavior analysis is the ambiguity between the model used and real human behaviors, caused by mutual occlusions between body parts, loose clothes, or similar colors between body articulations. Thus, although the cardboard model is good for modeling articulated human motions, its requirement that body parts be well segmented makes it infeasible for analyzing real human behaviors.
To solve this problem of body part segmentation, a dynamic Bayesian network has been used to segment a body into different parts, using the concept of blobs to model body parts. This blob-based approach is very promising for analyzing human behaviors up to a semantic level, but it is very sensitive to illumination changes. In addition to blobs, another large class of posture classification approaches is based on the human silhouette. For example, negative minimum curvatures can be tracked along body contours to segment body parts, and body postures can then be recognized using a modified ICP algorithm. Furthermore, a skeleton-based method is provided to recognize postures by extracting different skeleton features along the curvature changes of the human silhouette. In addition, different morphological operations are applied to extract skeleton features from postures, which are then recognized using an HMM framework. The contour-based method is simple and efficient for making a coarse classification of human postures. However, it is easily disturbed by noise, imperfect contours, or occlusions. Another kind of approach to classifying postures for human behavior analysis uses Gaussian probabilistic models. In some such methods, a probabilistic projection map is used to model each posture, and a frame-by-frame posture classification is performed to validate different human behaviors. This method uses the concept of a state-transition graph to integrate temporal information of postures, in order to handle occlusions and make the system more robust in indoor environments. However, the projection histogram used in this system is still not a good feature for posture classification, owing to its dramatic changes under different lighting or viewing conditions.
The present invention provides an apparatus and a method thereof, based on a new posture classification system, for analyzing different behaviors, such as human behaviors, directly from video sequences using the technique of triangulation.
When the present invention is applied to human behavior analysis, each human behavior is treated as a sequence of human postures, which are of different types and change rapidly over time. To analyze the postures, the technique of Delaunay triangulation is first used to decompose a body posture into different triangle meshes. Then, a depth-first search is performed to obtain a spanning tree from the result of triangulation. From the spanning tree, the skeleton features of a posture can be easily extracted and further used for a coarse posture classification.
In addition to the skeleton feature, the spanning tree also provides important information for decomposing a posture into different body parts such as head, hands, or feet. Thus, a new posture descriptor, called a centroid context, is provided for describing a posture up to a semantic level; it records different visual characteristics viewed from the centroids of the analyzed posture and its corresponding body parts. Since the two descriptors are complementary and describe a posture not only by its syntactic meaning (using skeletons) but also by its semantic meaning (using body parts), the present invention can compare and classify the desired human postures accurately. Based on the discriminating abilities of these two descriptors, a clustering technique is further proposed to automatically generate a set of key postures for converting a behavior into a string of symbols. The string representation integrates all possible posture changes and their corresponding temporal information. Based on this representation, a novel string matching scheme is then proposed for accurately recognizing different human behaviors. Even though each behavior may exhibit different time scaling changes, the proposed matching scheme can still recognize the desired behavior types accurately. Extensive results reveal the feasibility and superiority of the present invention for human behavior analysis.
The various objects and advantages of the present invention will be more readily understood from the following detailed description when read in conjunction with the appended drawings, in which:
a) is the sampling of control points—Point with a high curvature.
b) is the sampling of control points—Points with high curvatures but too close to each other.
a) is the triangulation result of a body posture—Input posture.
b) is the triangulation result of a body posture—Triangulation result of
a) is the skeleton of human model—Original image.
b) is the skeleton of human model—Spanning tree of
c) is the skeleton of human model—Simple skeleton of
a) is the distance transform of a posture—Triangulation result of a human posture.
b) is the distance transform of a posture—Skeleton extraction of
c) is the distance transform of a posture—Distance map of
a) is the body component extraction—Triangulation result of a posture.
b) is the body component extraction—A spanning tree of
c) is the body component extraction—Centroids of different body part extracted by taking off all the branch nodes.
a) is the multiple centroid contexts using different numbers of sectors and shells—4 shells and 15 sectors.
b) is the multiple centroid contexts using different numbers of sectors and shells—8 shells and 30 sectors.
a) is the three kinds of different behaviors with different camera views—Walking.
b) is the three kinds of different behaviors with different camera views—Picking up.
c) is the three kinds of different behaviors with different camera views—Fall.
a) is the irregular activity detection—Five key postures defining several regular human actions.
b) is the irregular activity detection—A normal condition is detected.
c) is the irregular activity detection—Triggering a warning message due to the detection of an irregular posture.
a) is the irregular posture detection—Regular postures were detected.
b) is the irregular posture detection—Irregular ones were detected due to the unexpected “shooting” posture.
c) is the irregular posture detection—Regular postures were detected.
d) is the irregular posture detection—Irregular ones were detected due to the unexpected “climbing wall” posture.
A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.
This invention discloses an apparatus for behavior analysis and a method thereof, and in particular a novel triangulation-based system for analyzing human behaviors directly from videos. The apparatus for behavior analysis of the present invention is based on a posture recognition technique. An apparatus for posture recognition comprises a triangulation unit and a recognition unit. The triangulation unit divides a posture, captured by background subtraction, into several triangular meshes. The recognition unit then forms a spanning tree corresponding to the triangular meshes from the triangulation unit. The apparatus for behavior analysis receives the time-varying postures analyzed by the apparatus for posture recognition and builds a behavior from them. The apparatus for behavior analysis comprises a clustering unit, a coding unit, and a matching unit. The clustering unit merges the time-varying postures iteratively to obtain several key postures. The coding unit then translates the key postures into corresponding symbols, which the matching unit interprets as a behavior.
Furthermore, a system for irregular human action analysis based on the present invention, introduced later, comprises an action recognition apparatus and a judging apparatus. The action recognition apparatus is built on the abovementioned posture and behavior apparatuses and is able to integrate the behaviors clustered from the postures into a human action. Based on the human action obtained, the judging apparatus identifies whether the human action is irregular. If the action is identified as regular, no alarm is given. If the action is identified as irregular or suspicious, the warning unit sends an alarm, for example to a surveillance system, to alert the guard or any other responsible person.
As shown in
The present invention assumes that all the analyzed video sequences are captured by a still camera. When the camera is static, the background of the analyzed video sequence can be constructed using a mixture of Gaussian functions. Different human postures can then be detected and extracted by background subtraction. After subtraction, a series of simple morphological operations is applied for noise removal. This section describes the technique of constrained Delaunay triangulation for dividing a posture into different triangle meshes. Then, two important posture features, i.e., the skeleton and the centroid contexts, can be extracted from the triangulation result for more accurate posture classification.
Assume that P is the analyzed posture, which is a binary map extracted by image subtraction. To triangulate P, a set of control points should be extracted in advance along its contour. Let B be the set of boundary points extracted along the contour of P. In the present invention, a sampling technique is applied to detect all the points with higher curvatures from B as the set of control points. Let α(p) be the angle of a point p in B. Shown in
$$d_{\min} \le |p - p_{+}| \le d_{\max} \quad\text{and}\quad d_{\min} \le |p - p_{-}| \le d_{\max}, \tag{1}$$
where dmin and dmax are two thresholds set to |B|/30 and |B|/20, respectively, and |B| is the length of B. With p+ and p−, the angle α(p) can be determined by Eq. (2) below:
If α is larger than a threshold Tα, i.e., 150°, p is selected as a control point. In addition to Eq. (2), any two control points are expected to be far from each other: the distance between any two control points must be larger than the threshold dmin defined in Eq. (1). Referring to
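As a minimal sketch of the control-point sampling described above, the following Python fragment computes the angle at a boundary point p from its two neighbors p+ and p−, and enforces the minimum spacing dmin between selected points. The law-of-cosines form of the angle is an assumption and may differ from Eq. (2). The function names `angle_at` and `enforce_min_spacing`, the use of Euclidean distance for the spacing rule, and the greedy selection order are likewise hypothetical choices made for illustration.

```python
import math


def angle_at(p, p_plus, p_minus):
    """Angle in degrees at p between the directions to p_plus and p_minus."""
    a = math.dist(p, p_plus)
    b = math.dist(p, p_minus)
    c = math.dist(p_plus, p_minus)
    # Law of cosines (assumed form); clamp to [-1, 1] against rounding error.
    cos_alpha = max(-1.0, min(1.0, (a * a + b * b - c * c) / (2.0 * a * b)))
    return math.degrees(math.acos(cos_alpha))


def enforce_min_spacing(candidates, d_min):
    """Keep candidates, in input order, so kept points are more than d_min apart."""
    kept = []
    for p in candidates:
        if all(math.dist(p, q) > d_min for q in kept):
            kept.append(p)
    return kept
```

The threshold test against Tα is left to the caller, so that the sketch does not commit to a particular definition of α.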
Referring to the
As what illustrated in
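$$v_k \in U_{ij}, \quad \text{where } U_{ij} = \{\, v \in V \mid e(v_i, v) \subset \Phi,\ e(v_j, v) \subset \Phi \,\} \tag{i}$$
$$C(v_i, v_j, v_k) \cap U_{ij} = \emptyset \tag{ii}$$
where C is the circumcircle of vi, vj, and vk. That is, the interior of C(vi,vj,vk) includes no vertex v ∈ Uij.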
According to the abovementioned definition, a divide-and-conquer algorithm was developed to obtain the constrained Delaunay triangulation of V in O(n log n) time. The algorithm works recursively. When V contains only three vertices, V is the result of triangulation. When V contains more than three vertices, an edge is chosen from V and the corresponding third vertex satisfying the properties in Eq. (i) and Eq. (ii) is searched for. V is then subdivided into two sub-polygons Va and Vb. The same division procedure is recursively applied to Va and Vb until the processed polygon contains only one triangle. The algorithm performs the following four steps, which are shown in
S04: Repeat Steps 1-3 on Va and Vb until the processed polygon consists of only one triangle.
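Condition (ii) above can be checked with the standard in-circle determinant. The sketch below is illustrative only: the function names are hypothetical, and the set Uij of vertices visible from both vi and vj is assumed to be supplied by the caller, since the visibility test of condition (i) is not implemented here.

```python
def in_circumcircle(a, b, c, d):
    """Return True if point d lies strictly inside the circle through a, b, c."""
    # Orientation of the triangle: positive when a, b, c are counterclockwise.
    orient = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    ax, ay = a[0] - d[0], a[1] - d[1]
    bx, by = b[0] - d[0], b[1] - d[1]
    cx, cy = c[0] - d[0], c[1] - d[1]
    det = ((ax * ax + ay * ay) * (bx * cy - by * cx)
           - (bx * bx + by * by) * (ax * cy - ay * cx)
           + (cx * cx + cy * cy) * (ax * by - ay * bx))
    # Multiplying by orient makes the test independent of vertex order.
    return det * orient > 0


def satisfies_condition_ii(vi, vj, vk, u_ij):
    """True if no vertex of u_ij lies inside the circumcircle of vi, vj, vk."""
    return not any(in_circumcircle(vi, vj, vk, v) for v in u_ij if v != vk)
```

A candidate third vertex vk would be accepted for the edge (vi, vj) only when `satisfies_condition_ii` returns True.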
At last, the
In the present invention, two important posture features are extracted from the result of triangulation, i.e., the skeleton and the centroid context. This section discusses the method of skeleton extraction using the triangulation technique. Traditional methods of extracting skeleton features are mainly based on body contours: different feature points with negative minimum curvatures are extracted along the body contours of a posture to construct its body skeleton. To avoid the drawbacks of such heuristic and noise-sensitive skeleton construction, a graph search scheme is disclosed to find a spanning tree that corresponds to a specified body skeleton. Thus, in the present invention, different postures can be recognized using their skeleton features.
In the section on deformable triangulations, a technique is presented to triangulate a human body into different triangle meshes. By connecting the centroids of any two connected meshes, a graph is formed. Through the technique of depth-first search, the desired skeleton for posture recognition is found from this graph.
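The construction of the mesh graph and its depth-first spanning tree can be sketched as follows. This is a minimal illustration under stated assumptions: triangles are given as triples of vertex indices, two meshes are taken to be connected when they share an edge, and the names `mesh_centroid`, `mesh_adjacency`, and `dfs_spanning_tree` are hypothetical.

```python
def mesh_centroid(tri, points):
    """Centroid of a triangle given as three indices into points."""
    xs = [points[i][0] for i in tri]
    ys = [points[i][1] for i in tri]
    return (sum(xs) / 3.0, sum(ys) / 3.0)


def mesh_adjacency(triangles):
    """Adjacency lists of the mesh graph: meshes sharing an edge are connected."""
    edge_owner = {}
    adj = {t: [] for t in range(len(triangles))}
    for t, tri in enumerate(triangles):
        for k in range(3):
            edge = frozenset((tri[k], tri[(k + 1) % 3]))
            if edge in edge_owner:
                other = edge_owner[edge]
                adj[t].append(other)
                adj[other].append(t)
            else:
                edge_owner[edge] = t
    return adj


def dfs_spanning_tree(adj, root=0):
    """Parent map of a depth-first spanning tree rooted at root."""
    parent = {root: None}
    stack = [(root, iter(adj[root]))]
    while stack:
        node, neighbours = stack[-1]
        for nb in neighbours:
            if nb not in parent:
                parent[nb] = node
                stack.append((nb, iter(adj[nb])))
                break
        else:
            stack.pop()  # all neighbours visited, backtrack
    return parent
```

The skeleton is then the set of line segments joining the centroid of each mesh to the centroid of its parent in the returned tree.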
Assume that P is a binary posture. According to the technique of triangulation, P will be decomposed to a set ΩP of triangle meshes, i.e.,
Each triangle mesh Ti in ΩP has the centroid CT
Further, in what follows, details of the algorithm for skeleton extraction are summarized.
First, the procedures of the triangulation-based simple skeleton extraction shown in
Actually, the spanning tree of P obtained by the depth-first search is also a skeleton feature. Referring to
In the previous section, a triangulation-based method was proposed for extracting skeleton features from a body posture. Assume that SP and SD are two skeletons extracted from a testing posture P and another posture D in the database, respectively. In what follows, a distance transform is applied to convert each skeleton into a gray-level image. Based on the distance maps, the similarity between SP and SD can be compared.
First, assume that DTS
where d(r,q) is the Euclidean distance between r and q. To enhance the strength of distance changes, Eq. (3) is further modified as Eq. (4):
where κ=0.1. As shown in
where |DTS
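The idea of comparing two skeletons through their distance maps can be illustrated with the brute-force sketch below. It assumes that the distance transform assigns to each pixel r the smallest Euclidean distance d(r,q) to any skeleton pixel q, and that two maps are compared by their mean absolute difference; both forms are assumptions for illustration, the enhancement with κ is omitted, and the function names are hypothetical.

```python
import math


def distance_map(skeleton_pixels, width, height):
    """Map each pixel (x, y) to its smallest Euclidean distance to the skeleton."""
    return [[min(math.dist((x, y), q) for q in skeleton_pixels)
             for x in range(width)]
            for y in range(height)]


def skeleton_distance(dt_a, dt_b):
    """Mean absolute difference of two equally sized distance maps (assumed form)."""
    total, count = 0.0, 0
    for row_a, row_b in zip(dt_a, dt_b):
        for va, vb in zip(row_a, row_b):
            total += abs(va - vb)
            count += 1
    return total / count
```

The brute-force map costs one pass over all skeleton pixels per image pixel; a practical system would likely use a linear-time distance transform instead.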
In the previous section, a skeleton-based method was proposed to analyze different human postures from video sequences. This method has the advantages of simplicity and efficiency in recognizing body postures. However, the skeleton is a coarse feature for representing human postures and is used here for a coarse search in posture recognition. To recognize different postures more accurately, this section proposes a new representation, i.e., the centroid context, for describing human postures in more detail.
The present invention provides a shape descriptor that finely captures the interior visual characteristics of postures using a set of triangle mesh centroids. Since the triangulation result may vary from one instance to another, the distribution over relative positions of mesh centroids is used as a robust and compact descriptor. Assume that all the analyzed postures are normalized to a unit size. Similar to the technique used in shape context, a uniform sampling in log-polar space is used for labeling each mesh, where m shells are used for quantizing the radius and n sectors for quantizing the angle. The total number of bins used for constructing the centroid context is then m×n. For the centroid r of a triangle mesh in an analyzed posture, a vector histogram is constructed as in Eq. (6) below:
$$h_r = (h_r(1), \ldots, h_r(k), \ldots, h_r(mn)). \tag{6}$$
In this embodiment, hr(k) is the number of triangle mesh centroids residing in the kth bin when r is taken as the reference origin. The relationship between hr(k) and r is given by Eq. (7):
$$h_r(k) = \#\{\, q \ne r,\ (q - r) \in \mathrm{bin}_k \,\}, \tag{7}$$
where bink is the kth bin of the log-polar coordinate. Then, the distance between two histograms hr
where Kbin is the number of bins and Nmesh is the number of meshes, which is fixed in all the analyzed postures. With the help of Eq. (6) and Eq. (7), a centroid context can be defined to describe the characteristics of a posture P.
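The histogram of Eq. (6) and Eq. (7) can be sketched as follows. The shell boundaries are not specified in the text, so the inner and outer radii `r_min` and `r_max`, the clamping of points outside that range, and the bin ordering (shell index times n plus sector index) are assumptions, as is the function name `centroid_context`.

```python
import math


def centroid_context(r, centroids, m, n, r_min=0.05, r_max=1.5):
    """Log-polar histogram of mesh centroids seen from the reference centroid r.

    m is the number of shells (radius) and n the number of sectors (angle).
    """
    hist = [0] * (m * n)
    log_min, log_max = math.log(r_min), math.log(r_max)
    for q in centroids:
        if q == r:
            continue  # Eq. (7) counts only q != r
        dx, dy = q[0] - r[0], q[1] - r[1]
        rho = max(math.hypot(dx, dy), r_min)
        # Uniform quantization of log-radius; distant points fall in the last shell.
        t = (math.log(rho) - log_min) / (log_max - log_min)
        shell = min(m - 1, int(t * m))
        theta = math.atan2(dy, dx) % (2.0 * math.pi)
        sector = min(n - 1, int(theta / (2.0 * math.pi) * n))
        hist[shell * n + sector] += 1
    return hist
```

With m = 4 shells and n = 15 sectors, the histogram has 60 bins, and every centroid other than r contributes exactly one count.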
In the previous section, a tree searching method is presented to find a spanning tree TdfsP from a posture P according to its triangulation result. Referring to
Given a path pathiP, a set ViP of triangle meshes can be collected along pathiP. Let ciP be the centroid of the triangle mesh that is closest to the center of this set of triangle meshes. As shown in
P={h_{c_i^P}}_{i=0, . . . , |V
where |VP| is the number of elements in VP. According to
where wiP and wjQ are the area ratios of the ith and jth body parts residing in P and Q, respectively. Based on Eq. (10), an arbitrary pair of postures can be compared. In what follows, the algorithm shown in
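The decomposition of a posture into body parts by taking off the branch nodes of the spanning tree, and the choice of one representative centroid per part, can be sketched as below. This is an illustrative reading rather than the disclosed procedure: a branch node is assumed to be a tree node with three or more neighbors, the center of a part is assumed to be the mean of its mesh centroids, and the function names are hypothetical.

```python
import math


def split_at_branch_nodes(adj):
    """Remove nodes of degree >= 3 and return the remaining connected parts."""
    branch = {v for v, nbs in adj.items() if len(nbs) >= 3}
    seen, parts = set(branch), []
    for start in adj:
        if start in seen:
            continue
        part, stack = [], [start]
        seen.add(start)
        while stack:
            v = stack.pop()
            part.append(v)
            for nb in adj[v]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        parts.append(sorted(part))
    return parts


def representative_centroid(part, centroids):
    """Index of the mesh whose centroid is closest to the mean centroid of part."""
    cx = sum(centroids[i][0] for i in part) / len(part)
    cy = sum(centroids[i][1] for i in part) / len(part)
    return min(part, key=lambda i: math.dist(centroids[i], (cx, cy)))
```

Each returned part corresponds to one path of the tree, and a centroid context histogram can then be computed from its representative centroid.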
The skeleton feature and centroid context of a given posture can be extracted using the techniques described in the sections on triangulation-based skeleton extraction and centroid context of postures, respectively. The distance between any two postures can then be measured using Eq. (5) (for the skeleton) or Eq. (10) (for the centroid context). The skeleton feature is for a coarse search and the centroid context feature is for a fine search. To obtain better recognition results, the two distance measures should be integrated. A weighted sum is used to represent the total distance, as follows:
$$\mathrm{Error}(P,Q) = w\, d_{skeleton}(P,Q) + (1 - w)\, d_{CC}(P,Q), \tag{11}$$
where Error(P,Q) is the total distance between two postures P and Q, and w is a weight used for balancing the two distances dskeleton(P,Q) and dCC(P,Q). However, this weight w is difficult to determine automatically, and different settings of w lead to different performance and accuracy in posture recognition.
In the present invention, each behavior is represented by a sequence of postures that change over time. For analysis, the sequence is converted into a string of posture symbols. Different behaviors can then be recognized and analyzed through a novel string matching scheme. This analysis requires a process of key posture selection. Therefore, in what follows, a method is disclosed to automatically select a set of key postures from training video sequences. Then, a novel string matching scheme is proposed for effective behavior recognition.
In the present invention, different behaviors are directly analyzed from videos. A video clip typically contains many redundant and repeated postures, which are not suitable for behavior modeling. Therefore, a clustering technique is used to select a set of key postures from a collection of training video clips.
Assume that all the postures have been extracted from a video clip, that each frame has only one posture, and that Pt is the posture extracted from the tth frame. In this embodiment, the distance dt between two adjacent postures Pt−1 and Pt is calculated using Eq. (11), where w is set to 0.5. Let Td be the average value of dt over all pairs of adjacent postures; a posture change event occurs for a posture Pt when dt is greater than 2Td. By collecting all the postures that trigger a posture change event, a set SKPC of key posture candidates is obtained. However, SKPC still contains many redundant and repeated postures, which will degrade the effectiveness of behavior modeling. To tackle this problem, a clustering technique is proposed for finding a better set of key postures.
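The detection of posture change events can be sketched as follows. The posture distance is passed in as a function, so the fragment does not depend on how the distance is computed; the name `key_posture_candidates` and the choice to return Td alongside the candidates are illustrative assumptions.

```python
def key_posture_candidates(postures, distance):
    """Collect postures P_t whose distance d_t to P_(t-1) exceeds 2 * T_d.

    Returns the candidate list and T_d, the mean of all adjacent distances.
    """
    d = [distance(postures[t - 1], postures[t]) for t in range(1, len(postures))]
    if not d:
        return [], 0.0
    t_d = sum(d) / len(d)
    candidates = [postures[t] for t, d_t in enumerate(d, start=1) if d_t > 2.0 * t_d]
    return candidates, t_d
```

For a toy sequence in which postures are plain numbers and the distance is their absolute difference, only the frame at which the value jumps is reported as a candidate.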
Initially, each element ei in SKPC forms a cluster zi. Then, two cluster elements zi and zj in SKPC are selected and the distance between these two cluster elements is defined by Eq. (12):
where Error(.,.) is defined in Eq. (11) and |zk| is the number of elements in zk. According to Eq. (12), an iterative merging scheme is performed to find a compact set of key postures from SKPC. Let zit be the ith cluster and Zt the collection of all clusters zit at the tth iteration. At each iteration, the pair of clusters zit and zjt whose distance dcluster(zi,zj) is the minimum over all pairs in Zt is chosen, i.e., the pair satisfying the following Eq. (13):
As mentioned above, when dcluster(zi,zj) is less than Td, the two clusters zit and zjt are merged to form a new cluster, thereby constructing a new collection Zt+1 of clusters. The merging process is performed iteratively until no pair of clusters can be merged. Based on the assumption that
As referring to Eq. (14) and checking all clusters in
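The iterative merging scheme can be sketched as below. Several details are assumptions made for illustration: the cluster distance is taken to be the average pairwise Error between the elements of two clusters, and the key posture of a cluster is taken to be the element with the smallest summed Error to the other elements. These may differ from Eq. (12) and Eq. (14), and the function names are hypothetical.

```python
from itertools import combinations


def cluster_distance(zi, zj, error):
    """Average pairwise error between two clusters (assumed form)."""
    return sum(error(a, b) for a in zi for b in zj) / (len(zi) * len(zj))


def merge_clusters(candidates, error, t_d):
    """Repeatedly merge the closest pair of clusters while its distance < t_d."""
    clusters = [[e] for e in candidates]
    while len(clusters) > 1:
        (i, j), best = min(
            (((i, j), cluster_distance(clusters[i], clusters[j], error))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1])
        if best >= t_d:
            break  # no remaining pair is close enough to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


def key_posture(cluster, error):
    """Element with the smallest summed error to the rest (assumed rule)."""
    return min(cluster, key=lambda e: sum(error(e, other) for other in cluster))
```

Each merge reduces the number of clusters by one, so the loop terminates after at most as many iterations as there are candidates.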
According to the result of key posture selection and posture classification, different behaviors with strings can be modeled. For example, in
Assume that Q and D are two behaviors whose string representations are SQ and SD, respectively. The edit distance between SQ and SD, which is defined as the minimum number of edit operations required to change SQ into SD, is used to measure the dissimilarity between Q and D. The operations include replacements, insertions, and deletions. For any two strings SQ and SD, the definition of DS
D^e_{S_Q,S_D}(i,j)=min[D_S
where the “insertion”, “deletion”, and “replacement” operations are the transition from cell (i−1,j) to cell (i,j), the one from cell (i,j−1) to cell (i,j), and the other one from cell (i−1,j−1) to cell (i,j), respectively.
Assume that the query Q is a walking video clip whose string representation is “swwwwwwe”. However, the string representation of Q is different to the one of
D^e_{S_Q,S_D}(i,j)=min[D_S
In the present invention, the “replacement” operation is considered more important than the “insertion” and “deletion” ones since a replacement means a change of posture type. Thus, the costs of “insertion” and “deletion” are chosen cheaper than the one of “replacement” and assumed to be ρ, where ρ<1. According to this, when an “insertion” is adopted in calculating the distance DS
D^e_{S_Q,S_D}(i,j)=min[D_S
where $C^{I}_{i,j} = \rho + (1-\rho)\,\alpha(i-1,j)$ and $C^{D}_{i,j} = \rho + (1-\rho)\,\alpha(i,j-1)$. In the present invention, one “replacement” operation means a change of posture type. This implies that ρ should be much smaller than 1, and it is therefore set to 0.1 in this invention. This setting makes the proposed method nearly scaling-invariant.
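A dynamic-programming sketch of the string matching with the costs defined above is given below. It rests on several assumptions that should be checked against the full equations: α(i,j) is taken to be 0 when the ith symbol of SQ equals the jth symbol of SD and 1 otherwise (also 1 when an index is out of range), the replacement cost is taken to be α(i,j), and the boundary cells are filled with unit costs. The function name `scaled_edit_distance` is hypothetical.

```python
def scaled_edit_distance(s_q, s_d, rho=0.1):
    """Edit distance in which repeating an already matched symbol costs only rho."""
    n, m = len(s_q), len(s_d)

    def alpha(i, j):
        # 1-indexed symbol comparison; assumed to be 1 outside the strings.
        if i < 1 or j < 1 or i > n or j > m:
            return 1.0
        return 0.0 if s_q[i - 1] == s_d[j - 1] else 1.0

    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = float(i)  # assumed unit-cost boundary
    for j in range(1, m + 1):
        dist[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c_ins = rho + (1.0 - rho) * alpha(i - 1, j)  # cell (i-1,j) -> (i,j)
            c_del = rho + (1.0 - rho) * alpha(i, j - 1)  # cell (i,j-1) -> (i,j)
            dist[i][j] = min(dist[i - 1][j] + c_ins,
                             dist[i][j - 1] + c_del,
                             dist[i - 1][j - 1] + alpha(i, j))
    return dist[n][m]
```

Under these assumptions, matching "swwwwwwe" against a hypothetical shorter string "swwe" costs 4ρ = 0.4 with ρ = 0.1, since each of the four extra "w" symbols is absorbed at cost ρ. Setting ρ = 1 makes both costs equal to 1 and reduces the sketch to the ordinary unit-cost edit distance, which is 4 for the same pair.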
To analyze the performance of our approach, a test database containing thirty thousand postures from three hundred video sequences was constructed. Each sequence records a specific behavior.
In addition to posture classification, the proposed method can also be used to analyze irregular or suspicious human actions for safety guarding. The task first extracts a set of “normal” key postures from training video sequences to learn different “regular” human actions such as walking or running. Input postures can then be judged as “regular” or not. If irregular or suspicious postures appear continuously, an alarm message is triggered as a safety warning. For example, in
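The judgment of regular versus irregular postures and the triggering of an alarm can be sketched as follows. The decision rule is an assumption made for illustration: a posture is treated as regular when its smallest distance to any normal key posture does not exceed a threshold, and an alarm is raised once a given number of consecutive frames are irregular. The parameters `threshold` and `min_run` and the function names are hypothetical.

```python
def is_regular(posture, key_postures, error, threshold):
    """True if the posture is close enough to at least one normal key posture."""
    return min(error(posture, k) for k in key_postures) <= threshold


def alarm_frames(postures, key_postures, error, threshold, min_run):
    """Frame indices at which min_run or more consecutive postures are irregular."""
    alarms, run = [], 0
    for t, posture in enumerate(postures):
        if is_regular(posture, key_postures, error, threshold):
            run = 0
        else:
            run += 1
        if run >= min_run:
            alarms.append(t)
    return alarms
```

Requiring a run of irregular frames, rather than a single one, keeps an isolated misclassified posture from raising an alarm.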
In the final embodiment, the performance of the proposed algorithm for behavior analysis with string matching is disclosed. Three hundred behavior sequences were collected for measuring the accuracy and robustness of behavior recognition using the proposed string matching method. Ten behavior types are included in this set of behavior sequences, so each behavior type has thirty testing video sequences. Table 1 lists the details of comparisons among different behavior categories. Each behavior sequence has different scaling changes and wrong posture types caused by recognition errors. However, the proposed string matching method of the present invention still performed well in recognizing all behavior types.