This invention relates generally to free space estimation, and more particularly to estimating the free space from images acquired by a camera system.
In the field of autonomous navigation and computer vision, free space is defined as the area around a moving object, e.g., in front of or behind a vehicle, boat, or robot, where the object can manoeuvre without colliding with other objects. Another name for the free space is drivable space.
With the use of accurate maps and localization systems, autonomous navigation provides incremental navigation directions to the moving object to travel from point A to point B without colliding with any obstacles along its path. To do this, it is necessary to know the critical information needed to avoid obstacles, and to obtain that information in the most cost-effective way.
The most critical information for the autonomous navigation is the free space. It is well-known that the free space can be estimated using stereo cameras. For example, it is possible to estimate a ground plane and obstacles above the ground plane using a stereo camera system. The concept of occupancy grids is closely related to free space estimation. An occupancy grid refers to a two-dimensional (2D) grid where every cell models the occupancy evidence of the environment, and is typically estimated using a three-dimensional (3D) sensor that measures distances on a planar slice of the environment, such as a scanning LIDAR and an array of ultrasound sensors.
A stixel world representation has been used for the free space estimation problem. The stixel world refers to a simplified model of the world using a ground plane and a set of vertical sticks on the ground representing the obstacles. The model can compactly represent an image using two curves, where a first curve runs on the ground plane enclosing a largest free space in front of the camera and a second curve indicates the height (vertical coordinates) of all the vertical obstacles at a boundary of the free space. The stixel world can be determined using depth maps obtained from stereo cameras. There are several algorithms that determine the depth maps from stereo images, such as the semi-global matching (SGM) method. Stixels can also be determined without explicitly estimating the depth maps from stereo images using dynamic programming (DP). Those techniques either implicitly or explicitly determine depth using a stereoscopic or 3D sensor.
To reduce the system complexity and cost, it is desired to determine the free space from a sequence of images, i.e., a video, acquired by a monocular camera mounted on the moving object. There are several challenges in solving this problem using monocular videos instead of stereo videos. In contrast to other segmentation problems, it is not possible to rely completely on color or edges. For example, in videos of roads, strong gradients from cross-walks and lane markings are often present. In the case of water, there are often reflections from nearby boats, buildings, or the sky. Features based on homography, which rely on planar roads, may not be accurate for non-flat roads. Furthermore, moving objects in the scene pose additional challenges for monocular free space estimation.
It is known how to perform geometric layout estimation from single images. It is possible to classify the pixels in a given image into ground, buildings, and sky. This classification has been used to obtain pop-up 3D models of buildings. A scene can be modeled using two horizontal curves that partition an image into top, middle, and bottom regions. It has been shown that this problem of segmenting an image into regions can be solved using a globally optimal method.
The general idea of using dynamic programming for column-wise matching has been used for estimating the 3D models of buildings, and generalized to work with several layers of height-maps for modeling urban scenes.
Monocular videos have been used by simultaneous localization and mapping (SLAM) methods. Most of those methods provide a sparse point cloud and do not explicitly estimate the free space, which is the most critical information for autonomous navigation.
To the best of our knowledge, there is no prior art computer vision method for estimating the free space for boats in water. The segmentation of water in an image of a scene is particularly challenging due to its specular properties, such as the reflection of nearby obstacles or the sky on the water. Features such as color and edges perform poorly in such cases.
The embodiments of the invention provide a method for estimating free space near moving objects from a video, i.e., a sequence of images, acquired of a scene by a monocular camera. The method can be used for autonomous navigation of on-road vehicles and in-water boats. The method uses dynamic programming (DP) to segment the images acquired in front of or behind the moving objects into free space and obstacles.
In contrast to prior approaches using stereoscopic or 3D sensors, we make it possible to solve this problem using a video acquired by a monocular camera. Our method relies on several image and geometric features, such as the appearance of water and road, edge information, homographies between consecutive images, and smoothness. The features are used to determine a set of potentials for an objective or energy function.
We formulate the free space estimation problem as an optimization problem in a Markov random field (MRF). We define the MRF as a one-dimensional (1D) graph, where each node corresponds to a column of pixels in the image. As our graph forms a 1D chain, exact inference can be done using DP. Parameters of an energy function of the MRF can be manually set, or automatically determined from a sequence of training images using a structured support vector machine (SVM).
Our method estimates the free space reliably, while only employing monocular cues. Our method can also incorporate other monocular cues, such as optical flow determined from the monocular video, and depth maps determined by estimating the motion of the camera and using multiple frames in the monocular video as multiple images. Although our main focus is on monocular camera systems, if stereo or multiple cameras are available, then we can also incorporate stereo or multi-camera cues, such as depth maps determined from the multiple cameras.
We are interested in scenes with challenging traffic conditions rather than buildings, and also maritime scenes. Furthermore, we obtain the free space segmentation that varies smoothly across a video sequence.
As shown in
The free space as determined by the embodiments of the invention can be used in a number of applications, such as autonomous navigation of vehicles moving from one location to another, manoeuvring a boat during berthing or mooring, parking a vehicle, and moving a robot such as an automated vacuum cleaner.
First, we construct 210 a 1D graph 301, see
The energy function is defined as a sum of potential functions determined by the features with their corresponding weight parameters. The weight parameters can be manually set, or automatically determined from a sequence of training images having ground truth annotations 250 by using a learning algorithm, as shown in
Monocular Free Space Estimation
Specifically, as shown in
The 1D graph 301 includes a set of nodes denoted by yi corresponding to the w columns of the image. Each node represents a discrete variable whose value comes from a set of labels given by the h rows in the image. The labels are ordered such that the bottom pixel has label 1 and the top pixel has label h. By solving the energy maximization problem on the graph and labeling these nodes, we obtain the 1D curve 310. The area under the curve defines the free space 101 near the vehicle or boat. The 1D curve can be used for estimating the nearby free space in front of, or behind, vehicles, robots, and boats.
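For illustration only, the following minimal Python sketch (the function name and data layout are assumptions, not part of the embodiments) shows how a labeling of the nodes can be converted into a free space mask under the curve, assuming label 1 corresponds to the bottom image row and label h to the top row:

```python
import numpy as np

def free_space_mask(labels, height):
    """labels[i] in {1, ..., height}; label 1 is the bottom row, label height the top row."""
    curve_rows = height - np.asarray(labels)   # convert labels to array rows (row 0 = top)
    rows = np.arange(height)[:, None]          # row index of every pixel
    return rows >= curve_rows[None, :]         # True for pixels under the curve (free space)
```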
Below, we provide details on the energy function, potentials, inference, and parameter learning.
Energy Function
Let It denote the image at time t in the video. The dimensions of the image are w×h, where w and h are the width and height respectively. We model the problem so that we have w discrete variables yi,i∈{1, . . . , w} and each variable can take a value from h discrete labels, yi∈{1, . . . , h}.
Let us consider the 1D graph 301, G={V,E}, where the vertices are V={1, . . . , w} and the edges are (i,i+1)∈E, i∈{1, . . . , w−1}. We can further restrict the states of yi to never be above the horizon 320, which is always above the ground plane. We can determine an estimate of the horizon manually based on the configuration of the camera system (e.g., the position and field of view of the camera), or automatically using the set of training images, and use the estimate to restrict the labels in the inference procedure. Hence, the estimate of the horizon can be used to restrict the vertical coordinates of obstacles in the scene.
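As an illustrative sketch (the data layout, with unary scores stored per image row and row 0 at the top, is an assumption), the horizon estimate can be applied by disallowing all labels above the horizon before inference:

```python
import numpy as np

def apply_horizon_constraint(unary, horizon_row):
    """unary[r, i]: score of placing the curve at image row r in column i (row 0 = top)."""
    constrained = unary.copy()
    constrained[:horizon_row, :] = -np.inf     # labels above the horizon are never selected
    return constrained
```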
To determine the curve 310 for image It, we also use features from image It−1. Our energy function E is
E(y,It,It−1)=wTφ(y,It,It−1), (1)
where T is a transpose operator, y=(y1, . . . , yw), and the potentials φ of the energy function decompose into unary and pairwise terms:

wTφ(y, It, It−1)=Σu∈U wu Σi∈V φu(yi)+wp Σ(i,i+1)∈E φp(yi, yi+1). (2)

The unary potentials are φu(yi), and the pairwise potentials are φp(yi, yj). The weight parameters of the unary and pairwise terms are wu and wp, respectively. The weight parameters can be written as w={wu, wp}, and are set manually or learned from the set of training images using structured prediction techniques. Here,
U={appearance,edge,homography,geometric_context}, (3)
refers to the set of unary potentials, which are now described in greater detail.
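For concreteness, a minimal sketch of evaluating the energy of a candidate curve from the weighted unary and pairwise terms follows; the dictionary-based data layout is an assumption for illustration:

```python
def energy(rows, unary_potentials, w_unary, w_pair, pairwise):
    """rows[i]: image row of the curve in column i; unary_potentials: dict name -> array[h, w]."""
    E = 0.0
    for name, phi in unary_potentials.items():             # weighted unary terms over all columns
        E += w_unary[name] * sum(phi[rows[i], i] for i in range(len(rows)))
    for i in range(len(rows) - 1):                         # pairwise terms along the 1D chain
        E += w_pair * pairwise(rows[i], rows[i + 1])
    return E
```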
Appearance
We use two Gaussian mixture models (GMMs), each with five components, to model the appearance of road or water, and of the background. Herein, the terms road and water can be used interchangeably. The GMMs are used to determine the probability for each pixel to be road or water, or background. The GMMs that are specific to target scenes can be determined from the training images with the ground truth annotations 250 using an expectation-maximization (EM) method. The GMMs can also be determined for more general scenes by using images including road/water and background with ground truth annotations.
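A minimal sketch of this appearance model follows, using the scikit-learn GaussianMixture class as a stand-in for the EM-based fitting; the function names and the choice of a per-pixel posterior under equal class priors are assumptions for illustration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_appearance_models(road_pixels, background_pixels, n_components=5, seed=0):
    """Fit one five-component GMM per class on pixel color samples (N x 3 arrays)."""
    gmm_road = GaussianMixture(n_components=n_components, random_state=seed).fit(road_pixels)
    gmm_bg = GaussianMixture(n_components=n_components, random_state=seed).fit(background_pixels)
    return gmm_road, gmm_bg

def road_probability(image, gmm_road, gmm_bg):
    """Per-pixel probability of road/water versus background under equal priors."""
    h, w, c = image.shape
    pixels = image.reshape(-1, c).astype(np.float64)
    log_road = gmm_road.score_samples(pixels)              # log-likelihood under the road GMM
    log_bg = gmm_bg.score_samples(pixels)                  # log-likelihood under the background GMM
    return (1.0 / (1.0 + np.exp(log_bg - log_road))).reshape(h, w)
```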
Our goal is to estimate the free space so that the curve lies on the boundary between road and non-road (or water and obstacles). Towards this goal, we derive a potential that considers the entropy of the distribution in patches around the labels as follows:
where the entropy H(i, j) is determined in terms of the distribution of road/non-road (or water/non-water) pixels in a patch centered at each pixel location (i, j). The entropy H(i, j) should be high near the boundary between road/non-road pixels. Because the curve passes through the boundary between the closest set of obstacles and the road, we use a cumulative sum that attains a maximum for the curve that passes through the bottom of the image. The function also ensures that the curve passes through a pixel that has a non-zero H(i, k) value. This estimate can be quite noisy.
To smooth the curve, we determine a location prior, which expresses, for each pixel, the probability of being road; this prior is estimated from empirical counts of the road pixels below the ground truth curves in the training data. We then determine the pixels that are road in all training images, and force their entropy to be zero. This is a conservative estimate, but nonetheless it smooths the results considerably.
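Because the exact expression of this potential is not reproduced above, the following sketch is only one plausible reading of the description: the entropy H(i, j) is computed from the local road/non-road fraction in a patch, and the potential accumulates entropy from the top of the image down to each candidate row so that it is largest at the bottom of the image:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def entropy_map(road_prob, patch=11):
    """H(i, j): binary entropy of the road/non-road fraction in a patch around each pixel."""
    road_mask = (road_prob > 0.5).astype(np.float64)
    p = np.clip(uniform_filter(road_mask, size=patch), 1e-6, 1.0 - 1e-6)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def appearance_potential(road_prob, patch=11):
    """Assumed cumulative form: entropy summed from the top of the image down to each row."""
    return np.cumsum(entropy_map(road_prob, patch), axis=0)
```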
Edge
The ground truth curve is at the boundary between the ground plane and the obstacles. We derive an edge potential, which encodes the fact that we prefer the curve to be aligned with contours. There may be many contours in the image; however, the curve should prefer edges located near the bottom of the image, which correspond to the closest edges in the 3D space of the scene. To take this into account, we define the following potential, which accumulates edge evidence as follows:
with e(i, j)=1 when there is an edge at the (i, j) pixel, and zero otherwise.
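A hedged sketch of the edge potential follows; the Canny detector and the cumulative accumulation from the top of the image down to each candidate row are assumptions consistent with, but not reproduced from, the potential described above:

```python
import cv2
import numpy as np

def edge_potential(gray_uint8, low=50, high=150):
    """Accumulate edge evidence e(i, j) over the rows above each candidate row (assumed form)."""
    e = (cv2.Canny(gray_uint8, low, high) > 0).astype(np.float64)   # e(i, j) in {0, 1}
    return np.cumsum(e, axis=0)
```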
Geometric Context
Given a single image, it is possible to classify the pixels in the image to ground, buildings (horizontally oriented), and sky. Let G(i, j) refer to a Boolean label corresponding to whether a pixel is ground, or not, based on this classification. Because we are interested in getting the curve that encloses the largest free space, we use the following potential:
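As a hedged illustration (the exact potential is not reproduced above), the geometric context term can be sketched as the count of ground-labelled pixels below each candidate row, which grows with the size of the enclosed free space:

```python
import numpy as np

def geometric_context_potential(ground_mask):
    """Number of ground-labelled pixels G(i, j) at or below each candidate row (assumed form)."""
    G = ground_mask.astype(np.float64)
    return np.cumsum(G[::-1, :], axis=0)[::-1, :]   # per column, sum from each row to the bottom
```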
Homography
It is important to obtain a free space that is smooth across the image sequence. One possibility is to estimate the curves in two consecutive images jointly by considering pairwise connections between nodes in one image to the nodes in a previous image. This would result in constraining the labeling of a pixel p(i, j) in image It with neighboring pixels of p(i′, j′) in image It−1, leading to a 2D graph that is neither a 1D chain nor a tree-structured graph. As a result, the inference would become NP-hard.
Therefore, we use a homography to impose smoothness across images while still maintaining the 1D graph during inference. Instead of using smoothness across nearby pixels, we determine a homography matrix based on the ground plane. This gives us a one-to-one mapping from a pixel on the ground in one image to its corresponding pixel on the ground in the previous image. This also provides a mapping between the free space curve in one image and the curve in another. Let H(t,t−1) be the homography matrix that maps a pixel location (i, j) in image It to a pixel location (i′, j′) in image It−1, as given by

(i′, j′, 1)T∝H(t,t−1)(i, j, 1)T. (7)
In this way, the potential can be written as
φhomography(yi=j)=φu(yi′=j′), (8)
where u∈U\homography, and φu(yi′=j′) is the unary potential in the previous image It−1.
The different unary potentials in image It−1 map to homography potentials in It. We determine the homographies using scale-invariant feature transform (SIFT) in a RANdom SAmple Consensus (RANSAC) framework. In the case of water, we detect correspondences only near the previous free space curve because there are too many spurious correspondences on the water. Note that the edges from reflections of objects on water will not match using the homography matrix, and thus we can filter the reflections from our curve. In the case of road scenes, we use the SIFT matches below the previous free space curve to determine the homography matrix, as reflections are not a problem.
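A minimal sketch of the homography estimation between consecutive images follows, using OpenCV SIFT features, the Lowe ratio test, and RANSAC; restricting the correspondences to the region near (for water) or below (for roads) the previous free space curve is simplified here to a caller-provided mask, which is an assumption for illustration:

```python
import cv2
import numpy as np

def estimate_ground_homography(img_t, img_prev, region_mask=None, ratio=0.75):
    """Estimate H(t, t-1) from SIFT matches restricted by an optional mask on img_t."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img_t, region_mask)
    kp2, des2 = sift.detectAndCompute(img_prev, None)
    matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
    good = [m for m, n in matches if m.distance < ratio * n.distance]  # Lowe ratio test
    src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)               # robust to spurious matches
    return H   # maps pixels in I_t to pixels in I_(t-1)
```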
Smoothness
We employ a truncated quadratic penalty to encourage the curve to be smooth. Note that this curve is non-smooth only when there are obstacles, which happens only at a few columns. Thus
where α, λd, and T are constants.
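Because the exact expression is not reproduced above, the following one-line sketch shows a generic truncated quadratic smoothness term; the sign convention and the placement of the constants α, λd, and T are assumptions:

```python
def smoothness_potential(y_i, y_j, alpha=1.0, lambda_d=1.0, T=25.0):
    """Truncated quadratic smoothness score for neighbouring column labels (assumed form)."""
    return -alpha * min(lambda_d * (y_i - y_j) ** 2, T)
```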
Inference
The MAP estimate, or maximum energy configuration, can be determined by solving
max wTφ(y, It, It−1). (10)
Our graph forms a 1D chain, and thus exact inference can be done using dynamic programming, with a complexity of O(wn²), where w is the width of the image and n is the number of labels for each variable after imposing the horizon constraint.
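The following is a minimal sketch of the exact inference by dynamic programming on the 1D chain; the unary scores (with the horizon constraint already applied) and the pairwise function are assumed to be precomputed as illustrated above:

```python
import numpy as np

def dp_inference(unary, pairwise):
    """Exact MAP labeling of the 1D chain; unary[j, i] is the total unary score for y_i = j."""
    n_labels, width = unary.shape
    P = np.array([[pairwise(j, k) for j in range(n_labels)] for k in range(n_labels)])
    score = np.empty((width, n_labels))
    backptr = np.zeros((width, n_labels), dtype=np.int64)
    score[0] = unary[:, 0]
    for i in range(1, width):                      # O(w n^2) overall
        cand = score[i - 1][None, :] + P           # cand[k, j]: transition from label j to label k
        backptr[i] = np.argmax(cand, axis=1)
        score[i] = np.max(cand, axis=1) + unary[:, i]
    labels = np.zeros(width, dtype=np.int64)
    labels[-1] = int(np.argmax(score[-1]))
    for i in range(width - 1, 0, -1):              # backtrack the optimal curve
        labels[i - 1] = backptr[i, labels[i]]
    return labels
```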
The weight parameters w can be manually set, or automatically determined using the training images 250 as described below.
Parameter Learning
We use the structured SVM to learn the weight parameters using the training images:
Here, y(i) is the ground-truth curve for the i-th training instance, Δ(y, y(i)) is the loss function, and N is the total number of training examples. The loss function is a truncated version of the relative gap, as shown below:
where T is a constant. We use a cutting plane procedure, where at each iteration we solve the following loss-augmented inference problem to determine the most violated constraint:
∀i, max wT(φi(y)−φi(y(i)))+Δ(y,y(i)). (13)
As the loss decomposes into unary potentials, the loss-augmented inference can be solved exactly via dynamic programming.
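For illustration, a hedged sketch of the loss-augmented inference step follows; it folds an assumed per-column truncated relative gap (standing in for the loss defined above) into the unary scores and reuses the dp_inference sketch:

```python
import numpy as np

def loss_augmented_inference(unary, pairwise, gt_rows, height, T=0.3):
    """Most violated labeling: DP over unary scores augmented with a column-wise loss (assumed form)."""
    n_labels, width = unary.shape
    augmented = unary.copy()
    for i in range(width):
        gap = np.abs(np.arange(n_labels) - gt_rows[i]) / float(height)   # relative vertical gap
        augmented[:, i] += np.minimum(gap, T)                            # truncated (assumed)
    return dp_inference(augmented, pairwise)
```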
Our method can also incorporate other features by defining a unary potential for each additional feature, and adding the potential to the set of unary potentials in Eq. (3). The weight parameters for the other features can be manually set or automatically determined by the learning method formulated in Eq. (11). Other features include optical flow and depth maps determined from the monocular video. If stereo or multiple cameras are available, features obtained from stereo or multi-camera cues can also be incorporated, such as depth maps determined from the multiple cameras. Note that any number and combination of features can be used in our method by changing the set of unary potentials in Eq. (3) accordingly.
Although the invention has been described by way of examples of preferred embodiments, it is to be understood that various other adaptations and modifications can be made within the spirit and scope of the invention. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the invention.