Various embodiments of the invention relate to the field of classifying objects in video data.
Object classification in video data involves labeling an object as a human, a vehicle, multiple humans, or an “Other” based on a binary blob input taken from the output of a motion detection algorithm. In general, the features of the blob are extracted and form the basis for a classification module, and the extracted features are subjected to various mathematical analyses that determine the label to be applied to the blob (e.g., human, vehicle, etc.).
Such classification has been addressed using a variety of methods based on supervised and/or unsupervised classification theories such as Bayesian Probability, Neural Networks, and Support Vector Machines. To date, however, the applicability of these methods has been restricted to typical ideal scenarios such as those depicted in the standard video databases that are available online from various sources. The challenges posed by realistic video datasets and application scenarios have gone unaddressed in many such classification methods.
Some of the challenges in such real-life scenarios include:
Existing methods for object classification extract one or more features from the object and use a neural network classifier or modeling method to analyze and classify the object based on those features. In each method, the extracted features and the classifier or method used for analysis and classification depend on the particular application. The accuracy of the system depends on the feature type and on the methodology adopted for effectively using those features for classification.
In one method, a consensus is obtained from the individual outputs of a number of classifiers. The method detects a moving object, extracts two or more features from the object, and classifies the object based on the two or more features using a classifier. The features extracted include the x-gradient, the y-gradient, and the x-y gradient. The classification method used is a Radial Basis Function Network for training and classifying a moving object.
Another object classification method known in the art uses features such as the object's area, the object's percentage occupancy of the field of view, the object's direction of motion, the object's speed, the object's aspect ratio, and the object's orientation as feature vectors for the classifier. The different features used in this method are labeled as scene-variant, scene-specific, and non-informative features. The instance features are used to arrive at a class label for the object in a given image, and the labels are observed in other frames. The observations are then used by a discriminative model, a support vector machine (SVM) with a soft margin and a Gaussian kernel, as the instance classifier for obtaining the final label. This classifier suffers from high computational complexity.
In a further classification method known in the art, the classification is done in a simpler but less robust way using only the height and width of an object. The ratio of the height to the width of each bounding box is examined to separate pedestrians from vehicles. For a vehicle, this value should be less than 1.0. For a pedestrian, this value should be greater than 1.5. To provide flexibility for special situations such as a running person or a long or tall vehicle, if the ratio is between 1.0 and 1.5, then the information from the corner list of this object is used to classify it as a vehicle or a pedestrian (i.e., a vehicle produces more corners).
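As a rough illustration of this height-to-width rule, the following sketch assumes that the bounding box dimensions and a pre-computed corner count are available; the corner-count threshold is a hypothetical placeholder, since the method only states that a vehicle produces more corners than a pedestrian.

def classify_by_aspect_ratio(height, width, corner_count, corner_threshold=8):
    """Sketch of the simple height/width-ratio rule (corner_threshold is hypothetical)."""
    ratio = height / float(width)
    if ratio < 1.0:
        return "vehicle"
    if ratio > 1.5:
        return "pedestrian"
    # Ambiguous range 1.0-1.5: fall back on corner information, on the
    # assumption that a vehicle produces more corners than a pedestrian.
    return "vehicle" if corner_count > corner_threshold else "pedestrian"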
Another classification scheme uses a Maximum Likelihood Estimation (MLE) to classify objects. In MLE, a classification metric is computed based on the dispersion and the total area of the object. The dispersion is the ratio of the square of the perimeter to the area. This method has difficulty classifying multiple humans as humans and may label them as a vehicle. While the classification metric in this method is computationally inexpensive, the estimation technique tends to decrease the overall speed of the algorithm.
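For reference, the dispersion metric described above reduces to a one-line computation; the perimeter and area are assumed to be available from the motion detection output.

def dispersion(perimeter, area):
    """Dispersion = perimeter^2 / area; larger for elongated or irregular blobs."""
    return (perimeter * perimeter) / float(area)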
In a slightly different approach to classifying objects, a method known in the art uses a system that consists of two major parts: a database containing contour-based representations of prototypical video objects, and an algorithm to match extracted objects with those database representations. The objects are matched in two steps. In the first step, each automatically segmented object in a sequence is compared to all objects in the database, and a list of the best matches is built for further processing. In the second step, the results are accumulated and a confidence value is calculated. Based on the confidence value, the object class of the object in the sequence is determined. Problems associated with this method include the need for a large database with consequently extended retrieval times, and the fact that the selection of different prototypes for the database is difficult.
Thus, in the techniques known in the art for object classification, the major emphasis is placed on obtaining an accurate classification by employing very sophisticated estimation techniques, while the features that are extracted are treated as secondary. The art is therefore in need of a novel method of classifying objects in video data that departs from this school of thought.
In the following detailed description, reference is made to the accompanying drawings that show, by way of illustration, specific embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention. It is to be understood that the various embodiments of the invention, although different, are not necessarily mutually exclusive. For example, a particular feature, structure, or characteristic described herein in connection with one embodiment may be implemented within other embodiments without departing from the scope of the invention. In addition, it is to be understood that the location or arrangement of individual elements within each disclosed embodiment may be modified without departing from the scope of the invention. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined only by the appended claims, appropriately interpreted, along with the full range of equivalents to which the claims are entitled. In the drawings, like numerals refer to the same or similar functionality throughout the several views.
In an embodiment, an object classification system for video data emphasizes the features extracted from an object rather than the actual method of classification. Consequently, in general, the more features associated with an object, the more accurate the classification will be. Once the features are extracted from an object in an image, the classification of that object (e.g., human, vehicle, etc.) involves a simple check on a range of values based on those features.
The algorithm of an embodiment is referred to as a statistical weighted average decision (SWAD) classifier. Compared to other systems known in the art, the SWAD is not very computationally complex. Despite this low complexity, the SWAD exploits statistical properties as well as shape properties, as captured by a plurality of representative features drawn from different theoretical backgrounds such as shape descriptors used in medical image classification, fundamental binary blob features such as those used in template matching, and contour distortion features. The SWAD classifies a given binary object blob as a human, a vehicle, an other, or an unknown.
In an embodiment, motion segmented results obtained from a Video Motion Detection (VMD) module, coupled with a track label from a Video Motion Tracking (VMT) module, form the inputs to an object classification (OC) module. The OC module extracts the blob features and generates a classification confidence for the object over the entire existence of the object in the scene or region of interest (ROI). The label (human, vehicle) obtained after attaining a sufficiently high level of classification confidence is termed the true class label for the object. The confidence is built temporally based on the consistency of the features generated from the successive frame object blobs associated with each unique tracked object.
These feature range values overlap for different types of blobs. Depending on the percentage of overlap, weighted values are assigned to the feature ranges of each of the classes (human, vehicle, others). These weighted values are used as scaling factors, along with feature dynamic range values, to formulate a voting scheme (unit-count voting and weighted-count voting) for each class, i.e., human, vehicle, or others. Based on the voting results, and with a few heuristic refinements, a normalized class confidence measure value is derived for the blob to be classified as human, vehicle, or other. Based on an experimental embodiment, a class confidence of 60% is sufficient (to account for the real-life scenarios mentioned above) to give a final class-label decision for each tracked object.
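A minimal sketch of this temporal confidence build-up follows, assuming a per-frame class label is available for each tracked object; the function and variable names are illustrative and not part of the embodiment.

from collections import Counter

def temporal_class_confidence(per_frame_labels, threshold=0.60):
    """Return (label, confidence) once one class label dominates the tracked history.
    per_frame_labels: per-frame labels for one tracked object, e.g. ["human", "human", "other"]."""
    if not per_frame_labels:
        return None, 0.0
    label, count = Counter(per_frame_labels).most_common(1)[0]
    confidence = count / float(len(per_frame_labels))
    # A final class-label decision is committed only when the confidence reaches
    # the 60% level noted above; otherwise no label is assigned yet.
    return (label, confidence) if confidence >= threshold else (None, confidence)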
Referring to
The features in
Normalized MBR Length=MBR Length/Blob Rows;
Normalized MBR Width=MBR Width/Blob Columns;
Normalized MBR Area=Normalized MBR Length*Normalized MBR Width;
Normalized MBR L-W Ratio=Normalized MBR Length/Normalized MBR Width.
The blob rows and blob columns represent the number of pixels that the blob occupies in its length and width respectively.
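A minimal sketch of these fundamental features follows, assuming the blob is supplied as a full-frame binary mask together with its minimum bounding rectangle, and taking the blob rows and blob columns as the dimensions of that binary image so that the features are normalized with respect to the image size (the array and parameter names are illustrative):

import numpy as np

def normalized_mbr_features(binary_image, mbr):
    """binary_image: full-frame binary mask (1 = foreground).
    mbr: (row0, col0, row1, col1) minimum bounding rectangle of one blob."""
    blob_rows, blob_cols = binary_image.shape        # Blob Rows, Blob Columns (assumed image size)
    row0, col0, row1, col1 = mbr
    mbr_length = row1 - row0 + 1                     # MBR Length
    mbr_width = col1 - col0 + 1                      # MBR Width
    norm_length = mbr_length / float(blob_rows)      # Normalized MBR Length
    norm_width = mbr_width / float(blob_cols)        # Normalized MBR Width
    norm_area = norm_length * norm_width             # Normalized MBR Area
    norm_lw_ratio = norm_length / norm_width         # Normalized MBR L-W Ratio
    return norm_length, norm_width, norm_area, norm_lw_ratio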
The segment features are derived from the length 212, width 214, and area 216. The MBR SegPerimeter may be determined by summing the number of white pixels around the perimeter of the binary image. Similarly, the MBR SegArea may be determined by summing the total number of white pixels in the binary image. The features segment compactness 236 and fill ratio 238 are strong representations of the blob's density in the MBR. All these values are also normalized with respect to the image size.
Norm MBR SegPerimeter=MBR SegPerimeter/(2*(Blob Columns+Blob Rows))
Norm MBR SegArea=MBR SegArea/(Blob Columns*Blob Rows)
MBR SegComp=(MBR SegPerimeter*MBR SegPerimeter)/MBR SegArea
MBRFillRatio=MBR SegArea/MBR Area
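A companion sketch for the segment features, under the same assumptions as above; the perimeter is approximated here by counting foreground pixels that touch at least one background pixel, which is one plausible reading of summing the white pixels around the perimeter of the binary image.

import numpy as np

def segment_features(binary_image, mbr):
    """Sketch of MBR segment perimeter, area, compactness, and fill ratio."""
    blob_rows, blob_cols = binary_image.shape
    row0, col0, row1, col1 = mbr
    patch = binary_image[row0:row1 + 1, col0:col1 + 1] > 0

    # MBR SegPerimeter: foreground pixels with at least one 4-neighbour background pixel.
    padded = np.pad(patch, 1, mode="constant", constant_values=False)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
                padded[1:-1, :-2] & padded[1:-1, 2:])
    seg_perimeter = float(np.count_nonzero(patch & ~interior))

    seg_area = float(np.count_nonzero(patch))                       # MBR SegArea
    norm_seg_perimeter = seg_perimeter / (2.0 * (blob_cols + blob_rows))
    norm_seg_area = seg_area / float(blob_cols * blob_rows)
    seg_compactness = (seg_perimeter * seg_perimeter) / seg_area    # MBR SegComp
    fill_ratio = seg_area / float(patch.shape[0] * patch.shape[1])  # MBRFillRatio
    return norm_seg_perimeter, norm_seg_area, seg_compactness, fill_ratio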
The shape features 240 such as the circularity 242, convexity 244, and elongation indent 248 are computed using the segment area 234 and the perimeter 232.
MBR SegCircularity=4*PI*MBR SegArea/(MBR SegPerimeter)^2
MBR SegConvexity=MBR SegPerimeter/sqrt(MBRSegArea)
MBR SegSFactor=MBRSegArea/(MBRSegPerimeter^0.589)
MBR ElongIndent=sqrt(CoSqr+SfSqr)
Where,
The computation of miscellaneous features captures class-dependent information and/or variations for the human and vehicle classes. These features use row and column projection histograms of the blobs.
A projection histogram feature 225 provides a distinct measure for classifying the blobs, as the histogram values represent the shape of the object. The blob is split into four quadrants and the row and column projection histograms are calculated. The standard deviations of these projection histogram values are weighted to calculate the representative feature value.
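A minimal sketch of one way to realize this projection histogram feature follows, assuming equal quadrant weights; the weights below are placeholders, since the embodiment does not specify the weighting here.

import numpy as np

def projection_histogram_feature(patch, weights=(0.25, 0.25, 0.25, 0.25)):
    """patch: binary blob cropped to its MBR. Returns a single weighted value built
    from the standard deviations of the quadrant row/column projection histograms."""
    h, w = patch.shape
    quadrants = [patch[:h // 2, :w // 2], patch[:h // 2, w // 2:],
                 patch[h // 2:, :w // 2], patch[h // 2:, w // 2:]]
    stds = []
    for q in quadrants:
        if q.size == 0:
            stds.append(0.0)
            continue
        row_proj = q.sum(axis=1)                       # row projection histogram
        col_proj = q.sum(axis=0)                       # column projection histogram
        stds.append(float(np.std(np.concatenate([row_proj, col_proj]))))
    return float(np.dot(weights, stds))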
In an embodiment, the Minimum Object Size (MOS) is calculated using the focal length of the image capturing device, and the vertical and horizontal distance that the object is from the device. The MOS is then used as an initial determiner of whether to classify a blob as an “Other.” The following are the measurement values used in calculating the MOS.
Total Field of View (FOV)=2 tan^-1(d/2f)
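As a worked example of the formula above, with purely illustrative values for the aperture dimension d and the focal length f:

import math

d = 4.8   # illustrative sensor dimension in mm
f = 8.0   # illustrative focal length in mm
fov = 2.0 * math.atan(d / (2.0 * f))   # Total FOV = 2 tan^-1(d/2f), in radians
print(math.degrees(fov))               # approximately 33.4 degrees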
The binary blobs may be misclassified due to the non-availability of direction information, because the aspect ratios of the blobs vary depending on their direction of motion in the scene. Hence, all blobs should be similarly oriented with respect to the center before classification. To account for this, rotation handling 135 is performed as a pre-processing step in object classification.
In an embodiment as illustrated in
where r represents the number of rows (i.e., length) occupied by the image, c represents the number of columns (i.e., width) occupied by the image, and I(r,c) represents the pixel value at row r and column c of the image I, from which the center location is computed. The summations in the numerator and denominator above are over the rows and columns of the image (i.e., 1 to the number of rows and 1 to the number of columns).
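The equation itself is not reproduced here, but the surrounding description is consistent with a standard weighted centroid over the blob pixels; the following is a minimal sketch under that assumption (0-based indices are used):

import numpy as np

def blob_center(image):
    """Weighted centroid of a binary blob image I(r, c): the numerators sum r*I(r,c)
    and c*I(r,c) over all rows and columns, and the denominator sums I(r,c)."""
    rows, cols = np.indices(image.shape)       # 0-based row and column indices
    mass = float(image.sum())
    center_r = float((rows * image).sum()) / mass
    center_c = float((cols * image).sum()) / mass
    return center_r, center_c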
In an embodiment, the next step involves a first level classification of the blob 205. A rotated blob is subjected to a first level of analysis in which it is determined whether the MBR Area of the blob satisfies the Normalized MOS. If the blob satisfies the MOS condition, it is subjected to a further level of analysis for classification as Others (otherwise the blob is labeled as Others). This further level of analysis includes verifying the fundamental feature values of the blob and using Fourier analysis to verify whether the given blob falls under the category of Others. The fundamental features used in the first level of classification include the L/W Ratio, Segment Perimeter, Segment Compactness, and Fill Ratio.
The algorithm for the Fourier-based analysis for the Others classification is as follows. The input blob boundaries are padded with zeros twice, and the image is resized to a standard size. In one embodiment, that standard size is 32 by 32 pixels. The magnitude of the radix-2 Fast Fourier Transform of the resized image is calculated, and the normalized standard deviation of the FFT magnitudes is computed. A threshold value is defined for the standard deviation, and the computed standard deviation is compared against this threshold.
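A minimal sketch of this Fourier-based check follows, assuming a nearest-neighbour resize to 32 by 32 pixels; the threshold value and the direction of the comparison are illustrative assumptions rather than values taken from the embodiment.

import numpy as np

def is_other_by_fft(blob_patch, std_threshold=0.25, size=32):
    """Pad the blob boundary with zeros twice, resize to size x size, take the FFT
    magnitude, and compare its normalized standard deviation to a threshold."""
    padded = np.pad(blob_patch.astype(float), 2, mode="constant")   # two rings of zeros
    # Nearest-neighbour resize to the standard size.
    r_idx = np.arange(size) * padded.shape[0] // size
    c_idx = np.arange(size) * padded.shape[1] // size
    resized = padded[np.ix_(r_idx, c_idx)]
    magnitude = np.abs(np.fft.fft2(resized))           # radix-2 FFT on the 32x32 input
    norm_std = magnitude.std() / magnitude.mean()      # normalized standard deviation
    return norm_std < std_threshold                    # comparison direction is an assumption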
After completing the first level of classification (in which the blob may be classified as “Others”), a second level of classification is applied to the blob. The derived features such as the circularity 242, convexity 244, elongation indent 248, and the projection histogram features 225 are computed for the second level of classification of the blob. In this second classification, the ranges that the features may fall into for each class (human, vehicle, etc.) are defined, and class weights are derived based on the overlap between the feature ranges of the different classes.
For example, referring to
For vehicles, the range 0.0 to 0.75 has no overlap with any other classes, so a direct weight of 0.75 is derived for vehicles. The rest of the vehicle range, 0.75 to 1.0, overlaps with the OTHER class range. Therefore, a value of 0.125 (by distributing the overlap range value 0.25 equally to the overlapping classes) is added to the direct weight value of 0.75. Consequently, the Vehicle Derived Weight Calculation is as follows:
Total Derived Weight for Vehicle (DWV)=0.75+0.125=0.875
Percentage Derived Weight for Vehicle (PDWV)=(0.875/3.0)*100=29.2
For OTHERS, the range from 1.0 to 1.5 overlaps with the Multiple Human class. Therefore, a weight value of 0.25 (distributing the overlap range value 0.5 equally between the overlapping classes) is included in the derived weights. Also, the range from 0.75 to 1.0 overlaps with the vehicle class, so a weight value of 0.125 (distributing the overlap range value 0.25 equally between the overlapping classes) is added to the derived weights calculation. The OTHERS Derived Weight Calculation is as follows:
Total Derived Weight for Others (DWO)=0.25+0.125=0.375
Percentage Derived Weight (PDWO)=(0.375/3.0)*100=12.5
For the Multiple Human category, the range 1.0 to 1.5 overlaps with the OTHER class. Hence, 0.25 (distributing the overlap range value 0.5 equally to the overlapping classes) is included in the derived weights. Also, the range from 1.5 to 2.0 overlaps with the HUMAN class. So a weight value of 0.25 (distributing the overlap range value 0.5 equally to the overlapping classes) is added to the derived weights. The Multiple Human Derived Weight Calculation is as follows:
Total Derived Weight for Multiple Human (DWM)=0.25+0.25=0.5
Percentage Derived Weight for Multiple Human (PDWM)=(0.5/3.0)*100=16.66
For the HUMAN range, 1.5 to 2.0 overlaps with the Multiple Human class. Hence, 0.25 (distributing the overlap range value 0.5 equally to the overlapping classes) is included in the derived weights. A value of 1.0 is added to the derived weights for the range 2.0 to 3.0. The Human Derived Weight Calculation is as follows:
Total Derived Weight for Human (DWH)=0.25+1.0=1.25
Percentage Derived Weight for Human (PDWH)=(1.25/3.0)*100=41.66
The derived weights for this example are summarized below:
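The worked example above can be reproduced mechanically. The sketch below derives the same weights and percentages from the example L/W-ratio class ranges implied by the discussion (the figure containing the ranges is not reproduced here), splitting each overlapping sub-range equally among the classes that share it.

# Example L/W-ratio ranges for each class, as implied by the discussion above.
ranges = {
    "vehicle":        (0.0, 1.0),
    "others":         (0.75, 1.5),
    "multiple_human": (1.0, 2.0),
    "human":          (1.5, 3.0),
}

# Split the axis at every range boundary and share each sub-range equally
# among the classes whose ranges cover it.
points = sorted({p for lo, hi in ranges.values() for p in (lo, hi)})
weights = {name: 0.0 for name in ranges}
for lo, hi in zip(points, points[1:]):
    members = [n for n, (a, b) in ranges.items() if a <= lo and hi <= b]
    for n in members:
        weights[n] += (hi - lo) / len(members)

total = sum(weights.values())                      # 3.0 for this example
for name, w in weights.items():
    print(name, round(w, 3), round(100.0 * w / total, 2))
# vehicle 0.875 29.17, others 0.375 12.5, multiple_human 0.5 16.67, human 1.25 41.67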
The derived features from the blob 205 are validated with respect to the predefined human 440, vehicle 410, multiple human 430, and other 420 ranges as illustrated by example in
Specifically, in an embodiment, starting with the features extracted from the binary object blobs (i.e., MBR Length, MBR Width, MBR Area, etc.), the minimum and maximum values of all features are initialized for the four classes of objects: “Human (H)”, “Vehicle (V)”, “Others (O)”, and “Multiple Human (M)”. Then, for a given binary object blob that is to be classified, the following steps are performed. The feature values of the binary object blob are computed and compared against the feature value ranges for all classes and for all features. If a feature value of the blob under consideration falls in the range of a particular class, then the blob gets a “vote” for that class. These are referred to as Unit-Count (UC) votes. The UC votes are accumulated over all feature values for all classes. Weighted Unit-Count (WUC) votes are generated by multiplying the UC votes obtained above by the pre-determined feature weightage values.
The UC and WUC votes are then summed class-wise for the binary blob under consideration. This gives the scores corresponding to the UC and WUC for each of the classes for the given binary blob. These scores may be referred to as Scores-UC (SUC) and Scores-WUC (SWUC).
The SUC and SWUC values of each of the four classes are converted into percentage values using the following equations (following are the equations for H class):
Percentage SUC for H=PSUC_H=SUC_H/(sum of SUC for 4 classes)
A similar computation is done to obtain PSUC_V, PSUC_O, and PSUC_M, and a similar computation is done to obtain the class-wise Percentage SWUCs, i.e., PWSUC_H, PWSUC_V, PWSUC_O, and PWSUC_M.
Then, the final class score for the binary object blob is computed as follows:
a. Class_H_Score=(PWSUC_H+(PSUC_H/2.0));
b. Class_V_Score=(PWSUC_V+(PSUC_V/2.0));
c. Class_O_Score=(PWSUC_O+(PSUC_O/2.0));
d. Class_M_Score=(PWSUC_M+(PSUC_M/2.0)).
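A minimal sketch of this per-instance scoring follows, assuming the per-class feature ranges and per-feature weightage values are available from the initialization step; the data structures and names below are illustrative.

def classify_instance(feature_values, class_ranges, feature_weights):
    """feature_values: {feature: value} for one binary object blob.
    class_ranges: {class: {feature: (min, max)}} initialized per class.
    feature_weights: {class: {feature: weight}} pre-determined weightage values.
    Returns the instance class label and the per-class scores."""
    classes = list(class_ranges)
    suc = {c: 0.0 for c in classes}     # Scores-UC
    swuc = {c: 0.0 for c in classes}    # Scores-WUC
    for c in classes:
        for feat, value in feature_values.items():
            lo, hi = class_ranges[c][feat]
            if lo <= value <= hi:                       # Unit-Count vote
                suc[c] += 1.0
                swuc[c] += feature_weights[c][feat]     # Weighted Unit-Count vote

    total_suc = sum(suc.values()) or 1.0
    total_swuc = sum(swuc.values()) or 1.0
    scores = {}
    for c in classes:
        psuc = 100.0 * suc[c] / total_suc       # Percentage SUC
        pwsuc = 100.0 * swuc[c] / total_swuc    # Percentage SWUC
        scores[c] = pwsuc + psuc / 2.0          # Class score for this instance
    return max(scores, key=scores.get), scores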
The given binary blob is then assigned the class label corresponding to whichever of the above four scores is highest. This class label is treated as the class label for the current instance (i.e., the current frame of the video sequence) of the moving object in the video scene. The final class label is then arrived at as follows. The scores thus obtained per occurrence instance are accumulated over the sequence of video frames in which the moving object exists. A class confidence value is computed based on the number of instances in which the binary object blob receives identical class labels. For example, in the following case:
In the foregoing detailed description of embodiments of the invention, various features are grouped together in one or more embodiments for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the invention require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive subject matter lies in less than all features of a single disclosed embodiment. Thus the following claims are hereby incorporated into the detailed description of embodiments of the invention, with each claim standing on its own as a separate embodiment. It is understood that the above description is intended to be illustrative, and not restrictive. It is intended to cover all alternatives, modifications and equivalents as may be included within the scope of the invention as defined in the appended claims. Many other embodiments will be apparent to those of skill in the art upon reviewing the above description. The scope of the invention should, therefore, be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled. In the appended claims, the terms “including” and “in which” are used as the plain-English equivalents of the respective terms “comprising” and “wherein,” respectively. Moreover, the terms “first,” “second,” and “third,” etc., are used merely as labels, and are not intended to impose numerical requirements on their objects.
The abstract is provided to comply with 37 C.F.R. 1.72(b) to allow a reader to quickly ascertain the nature and gist of the technical disclosure. The Abstract is submitted with the understanding that it will not be used to interpret or limit the scope or meaning of the claims.