The present invention relates to video surveillance, and, more particularly, to using motion-texture analysis to perform video analytics.
The field of video surveillance has become increasingly important in recent years following terrorist actions and threats. In particular, demand has increased for intelligent video surveillance, which involves high-level event detection (i.e., detection of the activity of people, such as people falling, loitering, etc.). Traditionally, high-level event detection is performed using low-level image-processing modules (e.g., modules for motion detection and object tracking). In such a motion-detection module, each pixel in an input image is classified into either a foreground region or a background region. Pixels grouped into the foreground region may represent a moving object in the input image. Typically, these foreground regions are tracked over time and analyzed to recognize activity.
However, there are problems associated with using these low-level image-processing modules. For instance, such a module can be ineffective when performing video analytics in a crowded area. As an example, in crowded scenes, people and other moving objects are more likely to be grouped into a single moving region. When a group of people are grouped into a single moving region, using video analytics to perform activity recognition of an individual within the single moving region may become more difficult.
Embodiments of the invention are described herein with reference to the drawings, in which:
Methods of using motion-texture analysis to perform video analytics are disclosed. According to an example, a method may include segmenting regions in a video sequence that display consistent patterns of activities. The method includes partitioning a given frame in a video sequence into a plurality of patches, forming a vector model for each patch by analyzing motion textures associated with that patch, and clustering patches having vector models that show a consistent pattern. Clustering patches that show a consistent pattern (i.e., segmenting a region in the frame) may individually segment an object that is moving as a single block with other objects. Hence, for a group of objects moving as a single block, each object may be individually distinguished.
According to another example, a method may include using motion textures to recognize activities of interest in a video sequence. The method includes selecting a plurality of frames from a video sequence, analyzing motion textures in the plurality of frames to identify a flow, extracting features from the flow, and characterizing the extracted features to perform activity recognition. Performing activity recognition may assist a user to identify the movement of a particular object in a crowded or sparse scene, or to isolate a particular type of motion of interest (e.g., loitering, falling, running, walking in a particular direction, standing, and sitting) in a crowded or sparse scene, as examples.
According to another example, a method may include using motion textures to detect abnormal activity. The method includes selecting a first plurality of frames from a first video sequence, analyzing motion textures in the first plurality of frames to identify a first flow, extracting first features from the first flow, comparing the first features with second features extracted during a previous training phase, and, based on the comparison, determining whether the first features indicate abnormal activity. Determining whether the first features indicate abnormal activity may alert a user that an object is moving in an unauthorized direction (e.g., entering an unauthorized area), for example.
These as well as other aspects and advantages will become apparent to those of ordinary skill in the art by reading the following sections, with appropriate reference to the accompanying drawings.
The method 100 may include segmenting regions in a video sequence that display consistent patterns of activities. As depicted in
At block 102, the method includes partitioning a given frame in a video sequence into a plurality of patches. The given frame may be part of a plurality of frames in the video sequence. For instance, T frames of the video sequence may be selected from a sliding window of time (e.g., t+1, . . . , t+T). A given frame in the video sequence may include one or more objects, such as a person or any other type of object that may move, or be moved, over the course of the time period set by the sliding window. Further, the given frame includes a plurality of pixels, with each pixel defining a respective pixel position and intensity value.
Partitioning a given frame into a plurality of patches may include spatially partitioning the frame into n patches. Each patch in the plurality of patches is adjacent to neighboring patches. Further, each of the patches may overlap with one another.
Additionally, the patches may take any of a variety of shapes, such as squares, rectangles, or pentagons. Further, each patch includes a corresponding group of pixels. Also, the pixel size of the patches may vary. For instance, the patch size may range from a 5×5 pixel dimension to a 40×40 pixel dimension. As a given object may intersect with a plurality of patches, the pixel size of each patch may determine the spatial resolution of the segmentation of each object.
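By way of illustration only, the following sketch shows one possible way to partition a grayscale frame into overlapping patches in Python/NumPy. The function name, the 20×20 patch size, and the 10-pixel stride are illustrative assumptions, not values prescribed by the description above.

```python
import numpy as np

def partition_into_patches(frame, patch_size=20, stride=10):
    """Sketch: spatially partition a grayscale frame (2-D array) into overlapping
    square patches. A stride smaller than the patch size makes neighbors overlap."""
    patches = {}
    h, w = frame.shape
    for i in range(0, h - patch_size + 1, stride):
        for j in range(0, w - patch_size + 1, stride):
            # key the patch by its grid position so neighbors are easy to look up
            patches[(i // stride, j // stride)] = frame[i:i + patch_size, j:j + patch_size]
    return patches
```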
At block 104, the method includes forming a vector model for each patch by analyzing motion textures associated with that patch. The vector model for each patch may be formed in any of a variety of ways. For instance, forming the vector model may include (i) estimating motion-texture parameters for each patch in the plurality of patches, (ii) for each given patch in the plurality of patches and for each neighboring patch to the given patch, calculating a motion-texture distance between the motion-texture parameters of the given patch and the motion-texture parameters of the neighboring patch, and (iii) based on the motion-texture-distance calculations for each patch in the plurality of patches, forming a vector model for each patch in the plurality of patches.
Estimating motion-texture parameters for each patch in the plurality of patches may be done using any of a variety of techniques, such as the Soatto suboptimal method of matrices estimation. Further details regarding Soatto's suboptimal method of matrices estimation are provided in S. Soatto, G. Doretto, and Y. N. Wu, “Dynamic Textures,” International Journal of Computer Vision, 51, No. 2, 2003, pp. 91-109 (“Soatto”), which is hereby incorporated by reference in its entirety.
In one embodiment, before estimating motion-texture parameters, each of the patches of the frame may be reshaped. This may include reshaping each patch into a multi-dimensional array (Y) that includes dimensions xp (e.g., a horizontal axis), yp (e.g., a vertical axis), and T (e.g., a time dimension). After each patch is reshaped in such a way, the motion-texture parameters for each patch may then be estimated. However, motion-texture parameters for each patch may be estimated without reshaping each patch as well.
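As a minimal sketch of such a reshaping step (assuming NumPy and a per-patch stack of T frames; the function name is illustrative):

```python
import numpy as np

def patch_observation_matrix(patch_stack):
    """Sketch: reshape one patch observed over T frames into an observation matrix Y.
    patch_stack has shape (T, yp, xp); the result has one column per frame,
    i.e., shape (yp * xp, T)."""
    T = patch_stack.shape[0]
    return patch_stack.reshape(T, -1).T
```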
To estimate motion-texture parameters for each patch, motion textures may first be mathematically approximated. For instance, motion textures may be associated with an auto-regressive, moving average process of a second order with an unknown input. As an example, the following equations may cooperatively represent a motion texture:

x(t+1)=Ax(t)+v(t)

y(t)=Cx(t)+w(t)
In the above equations, y(t) represents the observation vector. The observation vector y(t) may correspond to a respective intensity value for each pixel, the intensity value ranging from 0 to 255, for instance. Additionally, x(t) represents a hidden state vector. As opposed to the observation vector y(t), the hidden state vector is not observable. Further, A represents the system matrix, and C represents the output matrix. Additionally, v(t) represents the driving input to the system, such as Gaussian white noise, and w(t) represents the noise associated with observing the intensity of each pixel, such as the noise of the digital picture intensity, for instance. Further details regarding the variables of the auto-regressive, moving average process equations can be found in Soatto.
Once the respective motion texture for each of the patches is mathematically approximated, the motion-texture parameters for each patch may then be estimated. For example, the motion-texture parameters may be represented by the matrices A, C, Q (the covariance matrix of the driving input, v(t)), and R (the covariance matrix of the measurement noise, w(t)). To obtain estimations for the matrices A, C, Q, and R, the Soatto suboptimal method of matrices estimation may be used. In such a method of matrices estimation, let m>>n, rank(C)=n, and C^T C=I_n, so as to identify a unique model from a sample path y(t), where I_n is the identity matrix. The suboptimal method of matrices estimation is shown as follows:
(1) First, perform singular value decomposition on Y, such that:
Y=UΣV^T
(2) Then, estimate matrix C as:
Ĉ(τ)=U
(3) Next, the sequence of states X is estimated as X̂=ΣV^T
(4) Then, the matrix A is estimated as:
where I_{r−1} is the identity matrix of dimension (r−1)×(r−1)
(5) Next, estimate the driving input as:
v(k)=x(k)−Ax(k−1)
(6) Then, estimate the driving input covariance matrix Q as:
(7) Finally, compute the covariance matrix of the measurement noise R as:
R=Y−C*X.
Hence, estimations may be obtained for the matrices A, C, Q, and R, and the estimations of these matrices may be used to cooperatively represent the respective motion-texture parameters for each of the patches.
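By way of illustration only, the following NumPy sketch mirrors the estimation steps above for a single patch. Steps (4), (6), and (7), whose exact expressions are not reproduced above, are filled in with common variants (a least-squares estimate of A and sample covariances for Q and R), so this is a sketch under those assumptions rather than a definitive implementation.

```python
import numpy as np

def estimate_motion_texture(Y, n):
    """Sketch of the suboptimal estimation of (A, C, Q, R) for one patch.
    Y : (m, T) observation matrix, one vectorized patch frame per column.
    n : assumed state dimension, with n much smaller than m."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)     # (1) Y = U Sigma V^T
    U, s, Vt = U[:, :n], s[:n], Vt[:n, :]                # keep the n dominant components
    C_hat = U                                            # (2) output matrix estimate
    X_hat = np.diag(s) @ Vt                              # (3) state sequence, shape (n, T)

    # (4) system matrix from successive states (least-squares variant, an assumption)
    A_hat = X_hat[:, 1:] @ np.linalg.pinv(X_hat[:, :-1])

    # (5) driving input v(k) = x(k) - A x(k-1)
    V_drive = X_hat[:, 1:] - A_hat @ X_hat[:, :-1]

    # (6) driving-input covariance Q as a sample covariance (an assumption)
    Q_hat = (V_drive @ V_drive.T) / V_drive.shape[1]

    # (7) measurement-noise covariance from the residual Y - C X (an assumption)
    W = Y - C_hat @ X_hat
    R_hat = (W @ W.T) / W.shape[1]

    return A_hat, C_hat, Q_hat, R_hat
```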
Next, for each given patch in the plurality of patches and for each neighboring patch to the given patch, forming the vector model may include calculating a motion-texture distance between the motion-texture parameters of the given patch and the motion-texture parameters of the neighboring patch. Motion-texture distances for each patch may be determined in any of a variety of ways. For instance, calculating the motion-texture distances may include comparing the motion-texture parameters of the given patch with the motion-texture parameters of the neighboring patch.
As another example, calculating a motion-texture distance between the motion-texture parameters of the given patch and the motion-texture parameters of the neighboring patch may include determining a respective Mahalanobis distance between the motion-texture parameters of the given patch (i.e., the given patch's observation) and the motion-texture parameters of the neighboring patch (i.e., the respective observation of the neighboring patch). The Mahalanobis distance between the motion-texture parameters of a given patch and the motion-texture parameters of the neighboring patch may be calculated using the method disclosed in A. Chan and N. Vasconcelos, "Mixtures of Dynamic Textures," Intl. Conf. on Computer Vision, 2005 ("Chan"), which is hereby incorporated by reference in its entirety. Using Chan's method, a calculation is made as to the probability that a measured sequence Y is generated by motion textures with particular motion-texture parameters. Specifically, this probability is computed as the Mahalanobis distance between a measurement y(t) and an estimate ŷ(t) under a distribution with covariance Σ. The Mahalanobis distance may be defined as MDC(ŷ,y)=√((ŷ−y)^T Σ^{-1} (ŷ−y)), where Σ=C*E(t)*C′+R, and E(t) is the error covariance matrix computed by a Kalman filter.
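As a minimal sketch of such a distance computation (assuming NumPy; the function and argument names are illustrative, and E is taken to be the Kalman-filter error covariance described above):

```python
import numpy as np

def mahalanobis_distance(y_meas, y_pred, C, E, R):
    """Sketch: Mahalanobis distance between a measured patch vector y_meas and the
    prediction y_pred under a neighboring patch's model, using Sigma = C E C^T + R."""
    Sigma = C @ E @ C.T + R
    d = y_pred - y_meas
    return float(np.sqrt(d @ np.linalg.solve(Sigma, d)))   # (d^T Sigma^-1 d)^(1/2)
```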
Next, forming a vector model for each patch may include forming a vector model for each patch based on the motion-texture distance calculations for each patch. Each patch may be represented by its respective vector model. For example, when an eight-neighborhood is used to form a vector model for a given patch, forming a vector model for the given patch may include selecting at least one neighboring patch. A selected neighboring patch may include motion-texture parameters that define the shortest motion-texture distance between the motion-texture parameters of the given patch and the motion-texture parameters of each of the neighboring patches. Further, the vector model may originate from approximately the center of the given patch and may generally point towards the one or more selected neighboring patches. Additionally, the vector model includes a magnitude that may represent the motion-texture distance between the motion-texture parameters of the given patch and the motion-texture parameters of the one or more selected neighboring patches.
Further,
where k is along the x-direction and l is along the y-direction. The magnitude, s, of the vector model, V, is given by s=√(k²+l²), and the angle of the vector model, α, is given by α=arctan(l/k).
The magnitude of the vector model may reflect the distance between actual patch 408 and its neighboring patches. Further, the vector model may point towards the patch that is most similar to the actual patch 408. As a result of this calculation, the vector model for the patch 408 may be formed.
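By way of illustration only, the following sketch forms such a vector model for one patch from precomputed motion-texture distances to its eight neighbors. The dictionary layout and function name are illustrative assumptions; whether the magnitude carries the geometric length or the motion-texture distance itself may vary by embodiment, so both are returned here.

```python
import numpy as np

def form_vector_model(neighbor_distances):
    """Sketch: neighbor_distances maps neighbor offsets (k, l) -> motion-texture distance.
    The vector model points toward the most similar (smallest-distance) neighbor."""
    (k, l), best = min(neighbor_distances.items(), key=lambda item: item[1])
    magnitude = np.hypot(k, l)                   # s = sqrt(k^2 + l^2)
    angle = np.degrees(np.arctan2(l, k))         # alpha, in degrees
    return {"k": k, "l": l, "magnitude": magnitude, "angle": angle, "distance": best}
```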
Next, at block 106, the method includes clustering patches having vector models that show a consistent pattern. A consistent pattern of vector models may be shown in any of a variety of ways. For example, vector models that show a consistent pattern may include vector models that are concentric around a given patch. To illustrate, the vector models for each patch in a frame may cooperatively define a vector-model map, and the vector-model map may include a center. The patches that have vector models that generally point toward the center may be clustered.
A center in the vector-model map may be defined as a patch that has a threshold number of neighboring patches that each have vector models that are angled toward the patch. As an example of determining a center in a vector-model map,
Each of the above angles corresponding to vector models 504, 506, 508, 510, 512, 514, 516, and 518 represents an ideal angle that may be used to determine whether a given vector model is angled toward patch 502. In this ideal situation, patch 502 is a center because all eight of the surrounding vector models are angled toward patch 502 (additionally, patch 502 may be a center because the vector model for patch 502 is approximately zero). However, patch 502 may still be determined to be a center even if all eight of the surrounding vector models are not angled toward patch 502. For instance, patch 502 may be determined to be a center so long as a threshold number of surrounding vector models are angled toward it. The threshold number of vector models may range from 4 to 8, for example.
Furthermore, a given surrounding vector model may be angled towards patch 502 even if the given vector model is not angled at its respective ideal angle. Deviations from the ideal angles are possible. As an example, an allowable angle of deviation for a given vector model may range from −θ to θ (e.g., θ can be 15°). Further, the respective allowable angle of deviation for each surrounding vector model may vary from one another.
Once a center in the vector-model map is determined, the patches that have vector models that generally point toward the center are clustered. In other words, the region that includes patches that have vector models generally pointing toward the center is segmented. Of course, the vector-model map may contain more than one center, in which case each center will be associated with its own corresponding class of vector models that generally point toward it.
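By way of illustration only, the following sketch applies such a center test to a grid of vector-model angles. It assumes, purely as an illustrative convention, that each neighbor's ideal angle is the direction from that neighbor back toward the candidate patch; the threshold of 6 votes and the 15° deviation are example values consistent with the ranges above.

```python
import numpy as np

def is_center(angles, i, j, threshold=6, deviation=15.0):
    """Sketch: angles[i, j] holds each patch's vector-model angle in degrees
    (NaN where undefined). Patch (i, j) is a center if enough neighbors point at it."""
    votes = 0
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if di == 0 and dj == 0:
                continue
            ni, nj = i + di, j + dj
            if not (0 <= ni < angles.shape[0] and 0 <= nj < angles.shape[1]):
                continue
            if np.isnan(angles[ni, nj]):
                continue
            ideal = np.degrees(np.arctan2(-di, -dj))           # direction back toward (i, j)
            diff = (angles[ni, nj] - ideal + 180) % 360 - 180  # wrap to (-180, 180]
            if abs(diff) <= deviation:
                votes += 1
    return votes >= threshold
```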
There are a variety of ways to determine the vector models that generally point toward a center. To illustrate an example,
Since vector model 614, the final vector model in the linked list of vector models, is pointed toward the center 604, the trajectory of the linked list of vector models is pointed toward the center 604. Since the trajectory of the linked list of vector models is pointed toward the center 604, each vector model in the linked list of vector models (i.e., the sequence of vector models 602) is grouped into a class corresponding to the center 604.
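As a minimal sketch of one way to follow such a chain of vector models (the data layout, function name, and step limit are illustrative assumptions):

```python
def assign_to_center(start, vector_models, centers, max_steps=100):
    """Sketch: follow the linked vector models from patch `start` (a grid coordinate).
    vector_models[(i, j)] gives the (di, dj) offset toward the selected neighbor;
    centers is a set of grid coordinates. Every patch visited along a chain that
    reaches a center is grouped into that center's class."""
    chain, cur = [], start
    for _ in range(max_steps):
        chain.append(cur)
        if cur in centers:
            return cur, chain            # all patches in the chain map to this center
        di, dj = vector_models.get(cur, (0, 0))
        if (di, dj) == (0, 0):
            break                        # no consistent motion: chain dead-ends
        cur = (cur[0] + di, cur[1] + dj)
    return None, chain                   # chain did not reach any center
```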
Additionally, just as each center preferably corresponds to its own class of vector models that generally point toward the respective center, each class of vector models preferably corresponds to an object in the frame of the video sequence. Hence, if a given frame includes a plurality of objects, clustering patches having vector models that show a consistent pattern may include clustering the patches into a plurality of clusters that each correspond to a given object.
To illustrate,
Next, a representation of the one or more clusters of patches may be displayed to a user, or used as input for activity recognition. The representation of the clusters of patches may take any of a variety of forms, such as a depiction of binary objects. Further, the clusters of patches may be displayed on any of a variety of output devices, such as a graphical-user-interface display. Displaying a representation of the one or more clusters of patches may assist a user to perform activity recognition and/or segment objects that are moving together in a frame.
The method 800 may include using motion textures to recognize activities of interest in a video sequence. As depicted in
At block 802, the method includes selecting a plurality of frames from a video sequence. The plurality of frames may include a first frame corresponding to a first time, a second frame corresponding to a second time, and a third frame corresponding to a third time. Further, the first frame may include an object, and the second and third frames may also include the object. Additional objects may be present in one or more of the frames as well.
At block 804, the method includes analyzing motion textures in the plurality of frames to identify a flow. The flow may define a temporal and spatial segmentation of respective regions in the frames, and the regions may show a consistent pattern of motion. Further, analyzing motion textures in the plurality of frames to identify a flow may include (i) partitioning each frame into a corresponding plurality of patches, (ii) for each frame, identifying a respective set of patches in the corresponding plurality of patches, wherein the respective set of patches correspond to the respective region in the frame, and (iii) identifying the flow that defines a temporal and spatial segmentation of the respective set of patches in each of the frames, wherein the respective set of patches for each of the frames show a consistent pattern of motion.
By way of example,
At block 806, the method includes extracting features from the flow. Extracting features from the flow may take any of a variety of configurations. As an example, extracting features from the flow may include producing parameters that describe a movement. An example of such parameters is a set of numerical values, with a first numerical value indicating an area of segmentation for an object in a frame, a second numerical value indicating a direction of movement, and a third numerical value indicating a speed.
As another example, extracting features from the flow may include forming a movement vector (a movement vector may be an example of a more general motion-texture model). A movement vector may be formed in any of a variety of ways. By way of example, forming the first movement vector may include subtracting the intensity value of each pixel in frame 902b from the intensity value of a corresponding pixel in frame 904b to create an intensity-difference gradient. The intensity-difference gradient may include respective intensity-value differences between (1) each pixel in the first set of pixels and a corresponding pixel in frame 904b, and (2) each pixel in the second set of pixels and a corresponding pixel in frame 902b. The intensity-value differences between (1) each pixel in the first set of pixels and a corresponding pixel in frame 904b cooperatively correspond to the object 906a in the frame 902a, and the intensity-value differences between (2) each pixel in the second set of pixels and a corresponding pixel in frame 902b cooperatively correspond to the object 906b in the frame 904a.
The intensity-value differences, diff(t), may be computed where y(t) is the tth frame of the patch and T is the number of frames of the patch. For example, diff(t) may be computed as:
diff(t)=|y(t)−y(t−1)|, t=1, . . . , T−1
As depicted in the above equation, subtracting the intensity values may include taking the absolute value of the difference between the intensity value of each pixel in frame 902b and the intensity of the corresponding pixel in frame 904b.
To further illustrate,
Forming the first movement vector for the object may further include filtering the intensity-difference gradient by zeroing the respective intensity-value differences that are below a threshold. Zeroing the respective intensity-value differences that are below a threshold may highlight the pixel positions corresponding to the significant intensity-value differences. The pixel positions corresponding to the significant intensity-value differences may correspond to important points of the object, such as the object's silhouette. Further, zeroing the respective intensity-value differences that are below a threshold may also allow just the significant intensity-value differences to be used to form the first movement vector.
The threshold may be computed in any of a variety of ways. For instance, the intensity values corresponding to the first and second set of pixels may include a maximum-intensity value (e.g., 200), and the threshold may equal 90%, or any other percentage, of the maximum-intensity value (e.g., 180). Hence, the intensity-value differences below 180 will be zeroed, and only the intensity-value differences at or above 180 will remain after the filtering step. To further illustrate.
Forming the first movement vector may further include, based on the remaining intensity-value differences in the filtered intensity-difference gradient 918, determining a first average-pixel position corresponding to object 906a in frame 902a and a second average-pixel position corresponding to object 906b in frame 904a.
Next, forming the first movement vector may include forming the first movement vector such that the first movement vector originates from the first average-pixel position (which may correspond to a first patch) and ends at the second average-pixel position (which may correspond to a second patch).
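By way of illustration only, the following NumPy sketch combines the steps above: the absolute intensity difference, the threshold filtering, the two average-pixel positions, and the resulting movement vector. Splitting the significant difference pixels into "vacated" and "entered" sets by the sign of the difference is an illustrative assumption (it presumes the object is brighter than the background), not a step taken from the description above.

```python
import numpy as np

def movement_vector(frame_a, frame_b, rel_threshold=0.9):
    """Sketch: form a movement vector from two grayscale frames of equal shape."""
    a, b = frame_a.astype(float), frame_b.astype(float)
    diff = np.abs(b - a)                              # diff(t) = |y(t) - y(t-1)|
    diff[diff < rel_threshold * diff.max()] = 0.0     # zero differences below the threshold

    signed = b - a
    left_ys, left_xs = np.nonzero((diff > 0) & (signed < 0))     # pixels the object vacated
    enter_ys, enter_xs = np.nonzero((diff > 0) & (signed > 0))   # pixels the object entered
    if left_xs.size == 0 or enter_xs.size == 0:
        return None                                   # no significant motion detected

    p1 = np.array([left_xs.mean(), left_ys.mean()])   # first average-pixel position
    p2 = np.array([enter_xs.mean(), enter_ys.mean()]) # second average-pixel position
    return p2 - p1                                    # movement vector from p1 to p2
```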
As yet another example, extracting features from the flow may include forming a plurality of movement vectors. Each movement vector may correspond to a predetermined number of frames. As an example, in a plurality of frames including a first frame (frame 902a), second frame (frame 904a), and third frame (not depicted), a first movement vector that corresponds to the first and second frames may be formed, and a second movement vector that corresponds to the second and third frames may be formed. To illustrate,
Of course, a given movement vector in the plurality of movement vectors may correspond to more than two frames. As an example, a given movement vector may correspond to three frames. By way of example, the given movement vector may be formed by summing the first and second movement vectors. As shown in
At block 808, the method includes characterizing the extracted features to perform activity recognition. Characterizing the extracted features to perform activity recognition may take any of a variety of configurations. For instance, when the extracted features from the flow include parameters that describe a movement, characterizing the extracted features may include determining whether the parameters describing the movement are within a threshold of a predetermined motion model. By way of example, the parameters describing the movement may include the set of numerical values depicted in table 1502, and the predetermined motion model may include a predetermined set of numerical values, which, by way of example, is depicted in table 1504 of
As another example, when the extracted features from the flow include a movement vector (or a plurality of movement vectors), characterizing the extracted features may include estimating characteristics (e.g., amplitude and/or orientation) of the movement vector(s). Characterizing the extracted features may further include comparing the characteristics of the movement vector(s) to the characteristics of at least one predetermined vector.
As yet another example, the movement vector may traverse a patch (e.g., a patch corresponding to the first-average pixel position, second-average pixel position, or any other patch the movement vector may traverse), and characterizing the extracted features may include determining whether the movement vector is similar to a motion pattern defined by the patch.
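As a minimal sketch of such a comparison (assuming the extracted features are a small set of named parameters; all names and tolerance values are illustrative):

```python
def matches_motion_model(params, model, tolerances):
    """Sketch: decide whether extracted parameters (e.g., {"area": ..., "direction": ...,
    "speed": ...}) fall within per-parameter tolerances of a predetermined motion model."""
    return all(abs(params[key] - model[key]) <= tolerances[key] for key in model)

# Example use with illustrative values:
# matches_motion_model({"area": 410, "direction": 92.0, "speed": 1.4},
#                      {"area": 400, "direction": 90.0, "speed": 1.5},
#                      {"area": 50,  "direction": 15.0, "speed": 0.5})  # -> True
```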
As still yet another example, characterizing the extracted features to perform activity recognition may include performing simple-activity recognition. Simple-activity recognition may be used to determine whether each person in a crowd of people is moving in a predetermined direction (or not moving), for example. During simple-activity recognition, a predetermined motion model may be formed (e.g., during a training phase). The predetermined motion model may be formed in any of a variety of ways. For example, the predetermined motion model may be selected from a remote or local database containing a plurality of predetermined motion models. As another example, the predetermined motion models may be formed by analyzing sample video sequences.
The predetermined motion model may take any of a variety of configurations. For instance, the predetermined motion model may include a predetermined intensity threshold. As another example, the predetermined motion model may include one or more predetermined vectors. The one or more predetermined vectors may be selected from a database, or formed using a sample video sequence that includes one or more objects moving in one or more directions, as examples. Further, the predetermined vector may include a single predetermined vector (e.g., predetermined vector 1202 pointing to the right), or two predetermined vectors (e.g., predetermined vectors 1302 and 1304). Of course, additional predetermined vectors may also be used.
When analyzing a video sequence of an entryway into a secured area (e.g., during a testing phase), for example, every object whose respective movement vector is not in the general direction of the predetermined vector(s) (e.g., not in the exact direction of a predetermined vector, and also not within a certain angle of variance of the predetermined vector, such as plus or minus 15°) will be flagged as abnormal. Additionally or alternatively, every object in the video sequence whose intensity falls outside a certain range of the predetermined intensity threshold may also be flagged as abnormal.
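By way of illustration only, the following sketch performs such a direction check against one or more predetermined vectors (the function name and the 15° default are illustrative):

```python
import numpy as np

def is_abnormal_direction(movement_vec, predetermined_vecs, max_deviation_deg=15.0):
    """Sketch: flag a movement vector as abnormal when its direction is not within
    max_deviation_deg of the direction of any predetermined vector."""
    ang = np.degrees(np.arctan2(movement_vec[1], movement_vec[0]))
    for ref in predetermined_vecs:
        ref_ang = np.degrees(np.arctan2(ref[1], ref[0]))
        diff = (ang - ref_ang + 180) % 360 - 180        # wrap to (-180, 180]
        if abs(diff) <= max_deviation_deg:
            return False     # within an allowed direction: not abnormal
    return True              # outside every allowed direction: flag as abnormal
```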
As another example, characterizing the extracted features to perform activity recognition may include performing complex-activity recognition. Performing complex-activity detection may include determining whether a predetermined number of simple activities have been detected. Further, determining whether a predetermined number of simple activities have been detected may include using a graphical model (e.g., a dynamic Bayesian network and/or a Hidden Markov Model).
To illustrate,
As noted, performing complex-activity detection may include determining whether a predetermined number of simple activities have been detected. By way of example, for three frames, an object's first movement vector may point to the right, and the first movement vector may count as one simple activity for the object. In the next three frames, the object's second movement vector may point to the left, and this may count as a second simple activity for the object. In the next three frames, the object's third movement vector may point upwards, and the third movement vector may count as a third simple activity for the object. When three simple activities are detected for the object (the three simple activities may be unique to one another, or may repeat), the complex-activity detection node may be triggered. In the dynamic Bayesian network 1400, if the transition from the observation node 1414 to the observation node 1416 includes a third simple activity for the object, finish node 1406 may become a logic "1," thus indicating a complex activity has been detected. On the other hand, if three simple activities for the object have not been detected during the transition from observation node 1414 to the observation node 1416, then the finish node may remain as a logic "0," thus indicating that a complex activity has not been detected. Of course, other examples exist for detecting complex activity. Performing activity recognition may assist a user to identify the movement of a particular object in a crowded scene, for instance.
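As a minimal sketch of the counting logic only (a plain counter standing in for the dynamic Bayesian network described above; the function name and inputs are illustrative):

```python
def detect_complex_activity(simple_activities, required=3):
    """Sketch: report a complex activity once `required` simple activities have been
    detected. `simple_activities` holds one label per group of frames (e.g., "right",
    "left", "up"), or None where no simple activity was detected."""
    count = 0
    for label in simple_activities:
        if label is not None:
            count += 1
        if count >= required:
            return True      # the "finish node" becomes a logic "1"
    return False             # the "finish node" remains a logic "0"
```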
The method 1600 may include using motion textures to detect abnormal activity. As depicted in
At block 1602, the method includes selecting a first plurality of frames from a first video sequence. Selecting a first plurality of frames from a first video sequence may be substantially similar to selecting a plurality of frames from a video sequence from block 802.
At block 1604, the method includes analyzing motion textures in the first plurality of frames to identify a first flow. Likewise, this step may be substantially similar to analyzing motion textures in the plurality of frames to identify a flow from block 804.
At block 1606, the method includes extracting first features from the first flow. Again, this step may be substantially similar to extracting features from the flow from block 806.
At block 1608, the method includes comparing the first features with second features extracted during a previous training phase. The training phase may take any of a variety of configurations. For instance, the training phase may include selecting second features from a plurality of predetermined features stored in a local or remote database. As another example, the training phase may include (i) selecting a second plurality of frames from a sample video sequence, (ii) analyzing motion textures in the second plurality of frames to identify a second flow, wherein the second flow defines a second temporal and second spatial segmentation of respective regions in the second plurality of frames, and wherein the regions show a second consistent pattern of motion, and (iii) extracting second features from the second flow. Of course, other examples exist for the training phase.
Further, comparing the first features with the second features may take any of a variety of configurations. For instance, the first and second features may include first and second motion-texture models, and the first and second motion-texture models may be compared. By way of example, the first and second motion-texture models may include first and second movement vectors, respectively, and the magnitude and/or direction of the first and second movement vectors may be compared. As another example, the first and second features may include first and second parameters that describe a movement (e.g., a first and second set of numerical values), respectively, and the first and second parameters may be compared. Of course, other examples exist for comparing the first features with the second features.
At block 1610, based on the comparison, the method includes determining whether the first features indicate abnormal activity. Determining whether the first features indicate abnormal activity may include determining whether a similarity measure between the first and second features exceeds a predetermined threshold. For instance, if the first and second features include first and second motion-texture models, abnormal activity may be determined if a similarity measure between the first and second motion-texture models exceeds a predetermined threshold. By way of example, if the first and second motion-texture models include first and second movement vectors, a similarity measure between the first and second vectors may include a measure of the difference between the respective magnitudes and/or directions of the first and second movement vectors. If the difference between the magnitude and/or direction of the first and second movement vectors exceeds a predetermined threshold, then the object may be flagged as abnormal.
To illustrate, the predetermined threshold (e.g., an allowable departure from a learned motion model) may include a predetermined threshold for a feature (e.g., an angle of 25° for a movement vector). If a difference between the respective directions of the first and second movement vectors is within the predetermined threshold (e.g., 25° or less), then the first features will not indicate abnormal activity (i.e., the object characterized by the first features will not be flagged as abnormal). On the other hand, if the difference between the respective directions of the first and second movement vectors is greater than the predetermined threshold (e.g., greater than 25°), then the first features will indicate abnormal activity (i.e., the object characterized by the first features will be flagged as abnormal). Determining whether the first features indicate abnormal activity may help a user determine whether an object is entering an unauthorized area, for example.
Exemplary embodiments of the present invention have been described above. Those skilled in the art will understand, however, that changes and modifications may be made to the embodiments described without departing from the true scope and spirit of the present invention, which is defined by the claims.