Fields of the invention include image analysis, vision systems, moving object detection, driving assistance systems and self-driving systems.
Image analysis systems that can detect moving objects can be applied in various environments, such as vehicle assistance systems, vehicle guidance systems, targeting systems and many others. Moving object detection is especially challenging when the image acquisition device(s), e.g. a camera, is non-stationary. This is the case for driver assistance systems on vehicles. One or more cameras are mounted on a vehicle to provide a video feed to an analysis system. The analysis system must analyze the video feed and detect threat objects from the feed. Static objects have relative movement with respect to a moving vehicle, which complicates the detection of other objects that have relative movement with respect to the static surrounding environment.
Moving object detection can play an important role in driver assistance systems. Detecting an object moving towards a vehicle can alert a driver and/or trigger a vehicle safety system, such as automatic braking assistance, and help avoid collisions when the driver is distracted. This is an area of active research. Many recent efforts focus on specific objects, such as pedestrians. See, R. Benenson, M. Omran, J. Hosang, and B. Schiele, “Ten years of pedestrian detection, what have we learned?” in European Conference on Computer Vision. Springer, 2014, pp. 613-627. Such specific object systems are limited to the objects that they have been designed to detect, and can fail to provide assistance in common driving environments, e.g. expressway driving.
Semantic segmentation concerns techniques that enable identification of multiple moving objects and types of objects in one frame, e.g., vehicles, cyclists, pedestrians, etc. Many semantic segmentation methods are too complicated to work in real time with modern vehicle computing power. See, L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Semantic image segmentation with deep convolutional nets and fully connected CRFs,” arXiv:1412.7062v4, 2014; J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431-3440. Real time approaches frequently suffer from significant noise and error. See, J. Shotton, M. Johnson, and R. Cipolla, “Semantic texton forests for image categorization and segmentation,” in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008, pp. 1-8. Another problem inherent to segmentation methods is that such methods only identify or display objects. Without motion information, the systems cannot determine whether an object is moving, which is highly valuable information for triggering driver warning systems or automatic vehicle systems.
Zhun Zhong et al. recently proposed methods that re-rank object proposals to include moving vehicles on the KITTI dataset. Z. Zhong, M. Lei, S. Li, and J. Fan, “Re-ranking object proposals for object detection in automatic driving,” CoRR, vol. abs/1605.05904, 2016. This proposed approach uses many complex features, such as semantic segmentation results, CNN (convolutional neural network) features, and stereo information. The complexity is not amenable to hardware implementation with modern on-vehicle systems. Even with sufficient computing power, the approach is likely to perform poorly on sparsely annotated datasets such as CamVid. See, G. J. Brostow, J. Fauqueur, R. Cipolla, “Semantic object classes in video: A high-definition ground truth database,” Pattern Recognition Letters 30(2): 88-97, 2009.
Embodiments of the invention include a method for moving object detection in an image analysis system. The method analyzes consecutive video frames from a single camera to extract box properties and exclude objects that are not of interest based upon the box properties. Motion and structure data is obtained for boxes not excluded. The motion and structure data is sent to a trained classifier. Moving object boxes are identified by the trained classifier. The moving object box identification is provided to a vehicle system. The data sent to the classifier preferably consists of the motion and structure data. The structure data can include box coordinates, normalized height, width and box area, and a histogram of color space components. The motion data can include a histogram of direction data for each box of the boxes not excluded and a plurality of neighboring patches for each box. The box properties can include bottom y and center x coordinate, normalized height, width and box area, and aspect ratio. Boxes can be excluded, for example, when the boxes are less than a predetermined size or adjacent to a frame boundary. The motion data preferably includes the magnitude and direction of the motion for each pixel in boxes and for neighboring patches, and the classifier determines moving object boxes based upon differences between the motion of a box and the motion of its neighboring patches.
A preferred driver assistance system on a motor vehicle includes at least one camera providing video frames of scenes external to the vehicle. The video frames are provided to an image analysis processor, and the processor executes the method of the previous paragraph. The result of the analysis is used to trigger an alarm, a warning, a display or other indication to an operator of the vehicle; to trigger a vehicle safety system, such as automatic braking, speed control, or steering control; or to provide input to a vehicle autonomous driving control system.
A preferred motor vehicle system includes at least one camera providing video frames of scenes external to the vehicle. An image analysis system receives consecutive video frames from the at least one camera. The image analysis system analyzes consecutive video frames from a single camera of the at least one camera to extract box properties and exclude objects that are not of interest based upon the box properties, obtains motion and structure data for boxes not excluded and sends the motion and structure data to a trained classifier. The classifier identifies moving object boxes. The data sent to the classifier consists of the motion and structure data. A driving assistance or autonomous driving system includes an object identification system and receives and responds to moving object boxes detected by the trained classifier.
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
Preferred embodiments of the invention include moving object detection methods and systems that provide a hardware-friendly framework for moving object detection. Instead of using complex features, preferred methods and systems identify a predetermined feature set to achieve successful detection of different types of moving objects. Preferred methods train a classifier, but avoid the need for deep learning. The classifier needs only pre-selected box and motion properties to determine objects of interest. A system of the invention can therefore perform detection more quickly and with less computing power than systems and methods that leverage deep learning.
A preferred system of the invention is a vehicle, such as an automobile. The vehicle includes one or more cameras. The one or more cameras provide image data to an image analysis system. The image analysis system analyzes the image data in real time separately for each of the one or more cameras, and analyzes consecutive video frames from a camera. The image analysis system provides critical data to a driving assistance or autonomous driving system, which can include acceleration, braking, steering, and warning systems. Example autonomous driving systems that can be utilized in a vehicle system of the invention are described, for example, in U.S. Pat. No. 8,260,482 assigned to Google, Inc. and Waymo, LLC, which is incorporated by reference herein. A specific preferred embodiment of the invention replaces the object detection component of the '482 patent with an image analysis system of the present invention that detects objects, or modifies the object detection component with a method for moving object detection of the invention.
Those knowledgeable in the art will appreciate that embodiments of the present invention lend themselves well to practice in the form of computer program products. Accordingly, it will be appreciated that embodiments of the present invention may comprise computer program products comprising computer executable instructions stored on a non-transitory computer readable medium that, when executed, cause a computer to undertake methods according to the present invention, or a computer configured to carry out such methods. The executable instructions may comprise computer program language instructions that have been compiled into a machine-readable format. The non-transitory computer-readable medium may comprise, by way of example, a magnetic, optical, signal-based, and/or circuitry medium useful for storing data. The instructions may be downloaded entirely or in part from a networked computer. Also, it will be appreciated that the term “computer” as used herein is intended to broadly refer to any machine capable of reading and executing recorded instructions. It will also be understood that results of methods of the present invention may be displayed on one or more monitors or displays (e.g., as text, graphics, charts, code, etc.), printed on suitable media, stored in appropriate memory or storage, etc.
Preferred embodiments of the invention will now be discussed with respect to drawings and experiments. The drawings and experiments will be understood by artisans in view of the general knowledge in the art and the description that follows to demonstrate broader aspects of the invention.
A preferred method for moving object detection in an image analysis system is provided and illustrated in
Particular preferred methods and systems use a set of three features: 1) box properties; 2) color and structure properties; and 3) motion properties. In a preferred embodiment, color information of typical road surfaces is leveraged by extracting an LAB histogram of bottom patches of the target object. In the preferred embodiment, the three features are used for training an SVM (support vector machine) classifier (step 18). Then, for each input box, the system can detect a moving object by applying the trained SVM classifier (step 20). In a training phase (step 18), the classifier learns. In a testing (operational) phase (step 20), the trained classifier can, for example, utilize the properties of candidate boxes to detect moving objects, usually vehicles.
As an example process, for boxes identified with objects (step 12), step 16 computes the features of these boxes (bottom y and center x coordinate, normalized height, width and box area, as well as aspect ratio). Boxes that are too small or near the edge of the frame, for example, are excluded from further consideration, leaving a group of candidate boxes for motion analysis. Information for the motion analysis is provided via step 12, which performs optical flow (computing the magnitude and direction of motion for each pixel). With the intuition that moving objects in candidate boxes should have different motion patterns from their surrounding area, the process in step 16 considers four neighboring patches of the candidate box with an object. In a preferred implementation, the mean magnitude difference with the four neighbors is calculated, then the direction histograms (e.g., 20 bins each) of the four neighbor patches and the candidate box are collected as the final motion features to provide when the SVM classifier is run in step 20.
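For illustration only (not part of the claims), the box-property computation and exclusion step described above can be sketched as follows; the minimum-area fraction and edge margin shown are illustrative assumptions, not values taken from the preferred embodiment.

```python
def box_properties(x1, y1, x2, y2, frame_w, frame_h):
    """Box-only features: bottom y, center x, normalized height,
    width and area, and aspect ratio."""
    w = x2 - x1
    h = y2 - y1
    return {
        "bottom_y": y2 / frame_h,
        "center_x": (x1 + x2) / 2.0 / frame_w,
        "norm_h": h / frame_h,
        "norm_w": w / frame_w,
        "norm_area": (w * h) / float(frame_w * frame_h),
        "aspect": w / float(h),
    }

def is_candidate(x1, y1, x2, y2, frame_w, frame_h,
                 min_area_frac=0.001, margin=2):
    """Exclude boxes that are too small or touch the frame border
    (thresholds are illustrative)."""
    p = box_properties(x1, y1, x2, y2, frame_w, frame_h)
    near_edge = (x1 < margin or y1 < margin or
                 x2 > frame_w - margin or y2 > frame_h - margin)
    return p["norm_area"] >= min_area_frac and not near_edge
```

Boxes that pass `is_candidate` form the candidate group that proceeds to the color, structure, and motion analysis.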
In preferred methods and systems, extracting box properties includes extracting the features purely related to the box itself, which include bottom y and center x coordinate, normalized height, width and box area, as well as aspect ratio. This is illustrated in
An example is shown in
Having excluded objects by applying box properties, the color and feature analysis then analyzes objects of non-excluded boxes. The preferred example method considers color and structure information inside the boxes being analyzed. For the color feature, a LAB histogram (CIELAB color space; other color spaces can be used) is created, such as a 20-bin (or another number N of bins) representation for each L, A, B component, where N determines the number of discrete values for each color component. HOG (Histogram of Oriented Gradients) features are utilized for the structure information. For each pixel in the box, the histogram of oriented gradients (edge direction) is determined. See, Dalal and Triggs, “Histograms of oriented gradients for human detection,” CVPR'05. After PCA (principal component analysis), a particularly preferred method keeps a limited predetermined number of components (dominant eigenvectors to express the data), e.g., fewer than 100 or more preferably only 50 components, without sacrificing significant accuracy. The preferred method also extracts an LAB histogram for the bottom patch (a bottom patch is defined as a box that has the same size as the candidate box and is directly under and adjacent to the candidate box). This operation recognizes that objects of interest are on the ground, instead of being elevated therefrom.
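For illustration only, the LAB color histogram and the bottom-patch construction described above can be sketched as follows, assuming a patch already converted to an 8-bit LAB encoding (values in 0-255, as in common image libraries); the HOG and PCA steps are omitted for brevity.

```python
import numpy as np

def lab_histogram(lab_patch, n_bins=20):
    """Concatenated n_bins-bin histograms of the L, A, and B
    channels, each normalized to sum to 1.
    lab_patch: H x W x 3 array with channel values in [0, 255]."""
    feats = []
    for c in range(3):
        hist, _ = np.histogram(lab_patch[..., c],
                               bins=n_bins, range=(0, 256))
        feats.append(hist / hist.sum())
    return np.concatenate(feats)  # length 3 * n_bins

def bottom_patch(box, frame_h):
    """Box of the same size directly under and adjacent to the
    candidate box, clipped to the frame."""
    x1, y1, x2, y2 = box
    h = y2 - y1
    return (x1, min(y2, frame_h), x2, min(y2 + h, frame_h))
```

In use, `lab_histogram` would be applied both to the candidate box and to its `bottom_patch`, and the two histograms concatenated with the (PCA-reduced) HOG vector to form the color and structure features.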
For the motion analysis, after applying real-time optical flow, the method can obtain the magnitude and direction of the motion for each pixel. Real-time optical flow is preferably conducted with the method of T. Brox and J. Malik, “Large displacement optical flow: Descriptor matching in variational motion estimation,” PAMI, 2011. With the intuition that a moving object should have a different motion pattern from its surroundings, the preferred method considers four neighboring patches of the predetermined box that are the same size as the predetermined box. First, the mean magnitude difference with the four neighbors is calculated; then the direction histograms (N, e.g., 20 bins each) of all five patches are combined as the final motion features. The preferred method divides 360 degrees into N=20 bins. For each pixel, the direction (angle) of the motion is computed.
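For illustration only, the motion-feature construction described above — the mean magnitude differences with the four same-size neighboring patches, plus the 20-bin direction histograms of all five patches — can be sketched as follows, given a dense flow field from any optical flow method. Clipping neighbors at the frame border is an illustrative assumption.

```python
import numpy as np

def direction_histogram(flow_patch, n_bins=20):
    """n_bins-bin histogram of per-pixel motion direction.
    flow_patch: H x W x 2 array of (dx, dy) flow vectors."""
    ang = np.arctan2(flow_patch[..., 1], flow_patch[..., 0])
    hist, _ = np.histogram(ang, bins=n_bins, range=(-np.pi, np.pi))
    return hist / max(hist.sum(), 1)

def motion_features(flow, box, n_bins=20):
    """Mean-magnitude differences with the four neighbors
    (left, right, above, below) plus direction histograms of
    all five patches; a minimal sketch of the described features."""
    H, W = flow.shape[:2]
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1

    def patch(b):
        a, t, c, d = max(0, b[0]), max(0, b[1]), min(W, b[2]), min(H, b[3])
        return flow[t:d, a:c]

    def mean_mag(p):
        return float(np.linalg.norm(p, axis=-1).mean())

    center = patch(box)
    neighbors = [(x1 - w, y1, x1, y2), (x2, y1, x2 + w, y2),
                 (x1, y1 - h, x2, y1), (x1, y2, x2, y2 + h)]
    diffs = [mean_mag(center) - mean_mag(patch(n)) for n in neighbors]
    hists = ([direction_histogram(center, n_bins)] +
             [direction_histogram(patch(n), n_bins) for n in neighbors])
    return np.concatenate([np.array(diffs)] + hists)
```

With N=20 bins, the resulting vector has 4 magnitude differences plus 5 × 20 histogram entries, i.e., 104 values per candidate box.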
Then, classification can be conducted. In preferred methods, an SVM (Support Vector Machine) is used for classification, and the CamVid dataset (The Cambridge-driving Labeled Video Database) is used as a training set. Other classifiers can be used, for example, Adaboost, MLP (Multi-Layer Perceptron), and regression classifiers. Ground truth bounding boxes for the target objects (vehicles) are needed and are provided during a training phase. In a training set, features extracted from those ground truth boxes are taken as positive samples. For the negative samples, the method first applies hard negative mining. The method generates candidate windows with decreasing scores using a windowing method such as EdgeBoxes. See, P. Dollar and C. L. Zitnick, “Edge boxes: Locating object proposals from edges,” ECCV, 2014. Only the windows which have less than 30% IOU (intersection over union) with any ground truth are considered as negative samples. As with R-CNN [R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” CVPR, 2014], the method also sets the negative to positive sample ratio at around 3:1. Then, the method learns an SVM classifier with an RBF (radial basis function) kernel. Other windowing methods include selective search, objectness, etc. The present invention is not a deep learning method like Girshick, et al.; the preferred method merely uses the ratio of negative to positive samples from that technique.
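For illustration only, the IOU test and the negative-sample selection described above can be sketched as follows; the 30% threshold follows the description above. An SVM with an RBF kernel could then be trained on the resulting positive and negative feature vectors using any standard machine-learning library.

```python
def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def mine_negatives(candidates, ground_truth, max_iou=0.3):
    """Keep candidate windows whose IOU with every ground-truth
    box is below the threshold (the negative-sample rule above)."""
    return [c for c in candidates
            if all(iou(c, g) < max_iou for g in ground_truth)]
```

The negatives returned by `mine_negatives` would then be subsampled to roughly the 3:1 negative-to-positive ratio before classifier training.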
The preferred method was simulated in experiments, repeating the same box and positive and negative sample generation process in the test set. As we cannot control the number of negative samples in this step, the negative to positive sample ratio can reach 7:1. With the features we designed, the overall classification accuracy is 81.4%. As EdgeBoxes still generates many overlapping boxes, non-maximum suppression (NMS) is applied to remove those overlapping boxes and keep only the box with the largest area in one region. After non-maximum suppression, remaining boxes with more than 50% IOU with ground truth are taken as true detections. With this criterion, we can achieve a 66.2% detection rate.
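For illustration only, the non-maximum suppression variant described above — keeping the box with the largest area within each overlapping group — can be sketched as follows; the 0.5 overlap threshold used for suppression is an illustrative assumption.

```python
def nms_largest_area(boxes, iou_thresh=0.5):
    """Greedy NMS that keeps, within each overlapping group, the
    box with the largest area (rather than the highest score)."""
    def box_iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        union = area(a) + area(b) - inter
        return inter / union if union > 0 else 0.0

    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    kept = []
    # Visit boxes from largest to smallest; keep a box only if it
    # does not overlap an already-kept (larger) box too much.
    for b in sorted(boxes, key=area, reverse=True):
        if all(box_iou(b, k) <= iou_thresh for k in kept):
            kept.append(b)
    return kept
```

Sorting by area before the greedy pass guarantees that, among mutually overlapping boxes, the largest one survives.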
With regard to
The experimental results showed a satisfactory detection rate even with a simple SVM (support vector machine) classifier and the example set of features. Other classifiers can be used, for example, Adaboost, MLP (Multi-Layer Perceptron), and regression classifiers. Preferred embodiments avoid deep learning techniques and their required computing power. The preferred embodiments can enable or enhance a broad range of applications for driver assistance systems, such as general object alert, general collision avoidance, etc. Additional features will be apparent to artisans from the additional description following the example claims.
While specific embodiments of the present invention have been shown and described, it should be understood that other modifications, substitutions and alternatives are apparent to one of ordinary skill in the art. Such modifications, substitutions and alternatives can be made without departing from the spirit and scope of the invention, which should be determined from the appended claims.
Various features of the invention are set forth in the appended claims.
This application claims priority under 35 U.S.C. §119 and all applicable statutes and treaties from prior U.S. provisional application Ser. No. 62/446,152, which was filed Jan. 13, 2017.
Number | Date | Country
---|---|---
62446152 | Jan 2017 | US