The present disclosure relates to the field of computer vision, and in particular, to a lane detection method integratedly using image enhancement and a deep convolutional neural network.
Bad weather, such as rain, fog, and sand, and complicated imaging conditions, such as backlight, glare, and low illumination, may affect the camera sensor of an advanced driver assistance system (ADAS) or an automatic driving system, and greatly degrade the quality of the captured images. With both conventional edge-detection based methods and deep-learning based methods, the quality of input images may greatly affect the performance of the detection system. In order to cope with the difficulties of lane detection under complex imaging conditions, in embodiments of the present disclosure, on the assumption that lanes are solid or dashed lines of locally constant width, and a marker line can be segmented into several image blocks, each of which contains a lane marking in the center, a method based on a deep convolutional neural network is provided to detect marking blocks in the image. Input to the model includes road images captured by a camera as well as a set of enhanced images generated by the CLAHE (contrast limited adaptive histogram equalization) algorithm.
An objective of the present disclosure is to provide a lane detection method integratedly using image enhancement and a deep convolutional neural network, so as to solve the difficulties of the prior art under complicated imaging conditions.
The present disclosure specifically adopts the following technical solution: a lane detection method integratedly using image enhancement and a deep convolutional neural network, where the method includes:
Step (1), acquiring a color image I containing lanes, the color image including three component images I(0), I(1), and I(2) corresponding to the red, green, and blue color components of I, respectively; and performing the CLAHE algorithm to enhance the contrast of I and generate K enhanced images, where the kth enhanced image, k = 0, 1, ..., K - 1, is formed by using the cth channel image I(c) as the input, and c is the remainder of k divided by 3.
Step (2), constructing the deep convolutional neural network for lane detection, which consists of an input module, a spatial attention module, a feature extraction module, and a detection module, and stacking the three component images of the color image as well as the K enhanced images generated by the CLAHE algorithm in step (1) as a tensor including K + 3 channels to serve as the input to the deep convolutional neural network.
Step (3), passing the input data through a convolutional layer containing 64 7 × 7 kernels with stride 2, performing a batch normalization and a ReLU activation operation, using a max pooling layer with a 3 × 3 sampling kernel and with a stride of 2 as the final part of the input module, and feeding the output of the input module, which is an M1 × N1 × C feature map, to the spatial attention module, where M1, N1 and C denote the height, width and the number of channels of the feature map.
Step (4), performing, by the spatial attention module, two pooling operations on the feature map input to the module: one is an average pooling and the other is a max pooling; in these two pooling operations, the size of the sampling kernel is 1 × 1 × C and the stride is 1; the two M1 × N1 × 1 feature maps formed by the pooling operations are concatenated into an M1 × N1 × 2 feature map, which is then fed to a convolutional layer with a 7 × 7 kernel and a stride of 1; finally, a spatial attention map of size M1 × N1 × 1 is calculated using a Sigmoid function.
Step (5), taking elements in the spatial attention map as weights, multiplying the values at all positions of each channel of the output feature map of the input module by the weights at the corresponding positions of the spatial attention map to form a feature map, and feeding the feature map to the feature extraction module.
Step (6), taking the Stage 2, Stage 3, and Stage 4 convolutional layer groups of the ResNet50 network as the feature extraction module, where the output of Stage 3 serves as the input to Stage 4 as well as the input to a convolutional layer consisting of 5nB kernels of size 1 × 1 with a stride of 1, where nB denotes a preset number of detection boxes for each anchor point, and this convolutional layer outputs a feature map denoted by F1; the output of Stage 4 of ResNet50 passes through a convolutional layer consisting of 5nB kernels of size 1 × 1 with a stride of 1, and the generated feature map is up-sampled and then summed element-wise with F1 to generate an M2 × N2 × 5nB feature map F; the height and width of the feature map F are M2 and N2, respectively, and the number of channels is 5nB.
Step (7), in the feature map obtained in step (6), each point on the M2 × N2 plane corresponds to an anchor point. For an anchor point (m, n), the detection module evaluates the values on all 5nB channels to determine whether a lane marking block exists at the anchor point, as well as the size and shape of the marking block, as follows: let i be an integer with 1 ≤ i ≤ nB; the value of the ith channel represents the probability that a lane marking block is detected by using the ith preset detection box; from the (nB + 1)th to the 5nBth channels, every four channels correspond to a set of position parameters of a detected marking block, where the values of channels nB + 4(i - 1) + 1 and nB + 4(i - 1) + 2 represent the offsets in the width and height directions between the center of the ith preset detection box and the center of the actual detected marking block, respectively, the value of channel nB + 4(i - 1) + 3 represents the ratio of the width of the preset detection box to that of the actual detected block, and the value of channel nB + 4i represents the ratio of the height of the preset detection box to that of the detected block.
Step (8), determining the lane model by the Hough transform algorithm using center coordinates of the marking blocks detected by the deep convolutional neural network.
Further, in step (1), a specific process of performing the CLAHE algorithm to enhance the contrast of the image involves: first, processing the image I(c) by using a sliding window, where the height and width of the window are Mb + kΔ and Nb + kΔ, respectively, and Mb, Nb, and Δ are preset constants set according to the size of the image and the expected number of sliding windows; second, calculating the histogram H of the block image covered by the sliding window, clipping a histogram bin Hi as Hi = h if Hi exceeds a specified limit h, accumulating the clipped amplitude differences, and distributing the accumulated differences uniformly to all bins of H; next, taking the modified histogram as input and calculating the mapping function for each gray level by the histogram equalization algorithm; and further, setting the sliding steps in the height and width directions to half of the height and width of the sliding window, and taking the mean value of the mapping functions calculated by all the sliding windows covering a pixel in I(c) as the value of that pixel in the enhanced image.
Further, in step (2), parameters of the input module, the spatial attention module, the feature extraction module, and the detection module of the deep convolutional neural network are determined by learning, and the method includes the following sub-steps:
Sub-step A, manually labeling lane markings in the images, and segmenting a labeled lane into image blocks, where each image block contains a lane marking in the center and also overlaps some background regions on both the left and right sides.
Sub-step B, preparing expected output for training images: if the height and width of the training image are M and N, respectively, the expected output corresponding to the image is an M′ × N′ × C′ feature map, where M′ = ⌊M/8⌋ and N′ = ⌊N/8⌋, ⌊a⌋ represents the largest integer no greater than a, C′ = 5nB is the number of channels of the feature map, and nB denotes a preset number of detection boxes for each anchor point; all values of the expected feature map are set according to the coverage of labeled marking regions.
Sub-step C, training: inputting a training image to the deep convolutional neural network to generate a corresponding output feature map by the detection module, calculating a loss function according to the output feature map and the expected feature map corresponding to the training image, loading training images in batches to minimize the sum of the loss functions of all training samples in a batch, and updating the network parameters by a stochastic gradient descent optimization algorithm.
The present disclosure has the following beneficial effects. The method according to the present disclosure can effectively overcome difficulties of lane detection in challenging scenarios, such as poor image quality and small lane marking targets, so as to achieve better robustness.
Specific embodiments of the present disclosure are described in further detail below with reference to the drawings.
The present disclosure is further elaborated below in conjunction with the drawings and specific embodiments to enable those skilled in the art to better understand the essence of the present disclosure.
As shown in the accompanying drawings, the lane detection method according to an embodiment of the present disclosure includes the following steps.
Step (1), I is set as a to-be-processed color image, including three component images I(0), I(1), and I(2), corresponding to red, green, and blue, respectively, and the CLAHE algorithm is performed K times on I to enhance the contrast of the input image and generate K enhanced images, where the kth enhanced image, k = 0, 1, ..., K - 1, is formed by using the cth channel image I(c) as the input. In one embodiment of the present disclosure, K = 6, and c is equal to the remainder of k divided by 3. Steps of the algorithm are as follows. First, the image I(c) is processed by using a sliding window. The height and the width of the sliding window are Mb + kΔ and Nb + kΔ, respectively, where Mb, Nb, and Δ are preset constants, which may be Mb = 18, Nb = 24, and Δ = 4. Second, the histogram of the block image covered by the sliding window is calculated and denoted as H; if any histogram bin Hi exceeds a specified limit h, it is clipped as Hi = h, and the clipped amplitude differences are accumulated according to the following formula:

T = Σi max(Hi − h, 0)  (1)
Then, T/L is added back to all elements of the histogram H to form a modified histogram H̃, where L is the number of gray levels in the histogram. Next, taking H̃ as input, the mapping functions for the gray levels are calculated by using the histogram equalization algorithm. Further, in one embodiment of the present disclosure, the sliding steps in the height and width directions are set to half of the height and the width of the sliding window. A pixel (x, y) in I(c) may be covered by n sliding windows, where n = 1, 2, or 4; therefore, the mean value of the mapping functions calculated by all the sliding windows covering (x, y) is taken as the value of the pixel in the enhanced image.
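As a minimal sketch of the clipping and redistribution step above, the following function operates on the histogram H of a single block image, with h the clip limit and L the number of gray levels; it is illustrative and not tied to a particular image or window size.

```python
import numpy as np

def clip_and_redistribute(hist, h):
    """Clip histogram bins at the limit h and spread the clipped excess T
    uniformly over all L bins, as in formula (1) and the step that follows it."""
    hist = hist.astype(np.float64)
    excess = np.maximum(hist - h, 0.0)   # amplitude differences Hi - h where Hi > h
    T = excess.sum()                     # accumulated excess
    clipped = np.minimum(hist, h)        # Hi <- h wherever Hi exceeds h
    L = hist.size                        # number of gray levels
    return clipped + T / L               # modified histogram H~
```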
Step (2), the three component images of the to-be-processed color image and the K enhanced images generated by using the CLAHE algorithm in the above step are stacked as a tensor including K + 3 channels to serve as the input to the deep convolutional neural network in the embodiment of the present disclosure.
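As an illustration of steps (1) and (2), the sketch below generates K enhanced images and stacks them with the three color channels into a (K + 3)-channel input tensor. OpenCV's tile-based CLAHE (cv2.createCLAHE) is used here as a stand-in for the sliding-window variant described above, and the clip limit and tile grid (varied with k only so that the K enhancements differ) are illustrative assumptions rather than values from the disclosure.

```python
import cv2
import numpy as np

def build_network_input(img_bgr, K=6, clip_limit=2.0):
    """Generate K CLAHE-enhanced images and stack them with the three color
    channels into a (K + 3) x H x W tensor for the network input."""
    # OpenCV stores images as BGR; reorder to R, G, B to match I(0), I(1), I(2).
    channels = [img_bgr[:, :, 2], img_bgr[:, :, 1], img_bgr[:, :, 0]]
    enhanced = []
    for k in range(K):
        c = k % 3                       # c is the remainder of k divided by 3
        clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=(8 + k, 8 + k))
        enhanced.append(clahe.apply(channels[c]))
    # Channels-first stacking, normalized to [0, 1] for training.
    return np.stack(channels + enhanced, axis=0).astype(np.float32) / 255.0
```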
Step (3), the deep convolutional neural network for lane detection includes an input module, a spatial attention module, a feature extraction module, and a detection module. According to the data flow of the input module during forward propagation, input data first passes through a convolutional layer with 64 7 × 7 kernels and a stride of 2, and then a batch normalization operation and a ReLU activation operation are performed. The final part of the input module is a max pooling layer with a 3 × 3 sampling kernel and with a stride of 2.
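A minimal PyTorch sketch of the input module is shown below; the padding values and the number of input channels (K + 3 = 9 when K = 6) are assumptions chosen to be consistent with the description.

```python
import torch
import torch.nn as nn

class InputModule(nn.Module):
    """Conv(64, 7x7, stride 2) -> BN -> ReLU -> MaxPool(3x3, stride 2)."""
    def __init__(self, in_channels=9):  # K = 6 enhanced images + 3 color channels
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        return self.pool(self.relu(self.bn(self.conv(x))))
```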
Step (4), the output x of the input module is an M1 × N1 × C feature map, where M1 and N1 denote the height and the width, respectively, and C denotes the number of channels of the feature map. The spatial attention module performs two pooling operations on this input: one is a mean-pooling operation and the other is a max-pooling operation. In both pooling operations, the size of the sampling kernel is 1 × 1 × C and the stride is 1. The two resulting M1 × N1 × 1 feature maps are concatenated into an M1 × N1 × 2 feature map, which is fed to a convolutional layer with a 7 × 7 kernel and a stride of 1; finally, a spatial attention map of size M1 × N1 × 1 is calculated using a Sigmoid function.
Step (5), elements in the spatial attention map are taken as weights. The values at all positions of each channel of the output feature map x of the input module are multiplied by the weights at the corresponding positions of the spatial attention map to form a new feature map, which is then fed to the feature extraction module in the embodiment of the present disclosure.
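Steps (4) and (5) can be sketched in PyTorch as follows; the padding of 3 for the 7 × 7 convolution is an assumption made so that the attention map keeps the M1 × N1 size.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Channel-wise mean and max pooling, concatenation, a 7x7 convolution,
    a Sigmoid, and element-wise reweighting of the input feature map."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, stride=1, padding=3, bias=False)

    def forward(self, x):                                # x: (B, C, M1, N1)
        avg_map = torch.mean(x, dim=1, keepdim=True)     # 1x1xC mean pooling over channels
        max_map, _ = torch.max(x, dim=1, keepdim=True)   # 1x1xC max pooling over channels
        attn = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))  # (B, 1, M1, N1)
        return x * attn                                  # broadcast multiply over all channels
```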
Step (6), the Stage 2, Stage 3, and Stage 4 convolutional layer groups of ResNet50 are taken as the feature extraction module, and the output of Stage 3 serves as the input to Stage 4 as well as the input to a convolutional layer consisting of 5nB kernels of size 1 × 1 with a stride of 1, where nB denotes a preset number of detection boxes for each anchor point, and this convolutional layer outputs a feature map denoted by F1. The output of Stage 4 passes through a convolutional layer consisting of 5nB kernels of size 1 × 1 with a stride of 1, and the generated feature map is up-sampled and then summed element-wise with F1 to generate an M2 × N2 × 5nB feature map F.
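The following sketch illustrates the fusion in step (6); the Stage 3 and Stage 4 output channel counts (512 and 1024, as in a standard ResNet50), nB = 3, and nearest-neighbor up-sampling are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHead(nn.Module):
    """1x1 convolutions map the Stage 3 and Stage 4 outputs to 5*nB channels;
    the Stage 4 branch is up-sampled and summed element-wise with F1 to give F."""
    def __init__(self, nB=3, c_stage3=512, c_stage4=1024):
        super().__init__()
        self.lateral3 = nn.Conv2d(c_stage3, 5 * nB, kernel_size=1, stride=1)
        self.lateral4 = nn.Conv2d(c_stage4, 5 * nB, kernel_size=1, stride=1)

    def forward(self, f_stage3, f_stage4):
        f1 = self.lateral3(f_stage3)                                # F1: M2 x N2 x 5nB
        f4 = self.lateral4(f_stage4)
        f4 = F.interpolate(f4, size=f1.shape[-2:], mode="nearest")  # up-sample to M2 x N2
        return f1 + f4                                              # feature map F
```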
Step (7), the height and the width of the feature map F are M2 and N2, respectively, and the number of channels is 5nB. Each point on the M2 × N2 plane in the feature map corresponds to an anchor point. The detection module judges, according to the values of an anchor point (m, n) on all the channels, whether a lane marking block exists at the anchor point, as well as the size and the shape of the marking block. Let i denote an integer, where 1 ≤ i ≤ nB. The value of the ith channel represents the probability that a lane marking block is detected at the anchor point by using the ith preset detection box. From the (nB + 1)th to the 5nBth channels, every four channels correspond to a set of position parameters of a lane marking block detected by a given detection box. Specifically, the values of channels nB + 4(i - 1) + 1 and nB + 4(i - 1) + 2 represent the offsets in the width and the height directions between the center of the ith preset detection box and the center of the actual detection box, respectively, the value of channel nB + 4(i - 1) + 3 represents the ratio of the width of the preset detection box to the width of the actual detection box, and the value of channel nB + 4i represents the ratio of the height of the preset detection box to the height of the actual detection box.
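A possible decoding of the 5nB values at one anchor point is sketched below; the score threshold and the (cx, cy, w, h) format of the preset boxes are assumptions, while the offset and ratio definitions follow the description above.

```python
import numpy as np

def decode_anchor(values, presets, score_thresh=0.5):
    """Decode the 5*nB values at one anchor point into detected marking blocks.

    `values` is the length-5*nB vector for anchor (m, n); `presets` is a list of
    nB preset boxes given as (cx, cy, w, h) in image coordinates."""
    nB = len(presets)
    detections = []
    for i in range(1, nB + 1):                       # channels are 1-indexed in the text
        score = values[i - 1]                        # probability for the i-th preset box
        if score < score_thresh:
            continue
        base = nB + 4 * (i - 1)                      # 0-based index of channel nB + 4(i-1) + 1
        dx, dy = values[base], values[base + 1]      # center offsets (width, height directions)
        rw, rh = values[base + 2], values[base + 3]  # preset/actual width and height ratios
        cx0, cy0, w0, h0 = presets[i - 1]
        detections.append((cx0 + dx, cy0 + dy, w0 / rw, h0 / rh, score))
    return detections
```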
Step (8), the output of the detection module is a set of detected marking blocks, and a lane model is determined by the Hough transform algorithm using the center coordinates of all the blocks in the set as inputs. Specifically, let the center coordinates of a detected marking block be (υ, ν); a lane is written as a straight line expressed in the polar coordinate system:

ρ = υ cos θ + ν sin θ  (2)
where ρ denotes the distance from the origin to the line in the Cartesian coordinate system, and θ denotes the angle between the x-axis and the normal vector represented by ρ. For a given point (υ, ν), θ is taken as the independent variable and successively takes values in the range 0° ≤ θ < 180° with a preset step. A sequence of ρ values is calculated by substituting these θ values into the above formula, forming a curve on the ρ-θ plane. The center of each detected marking block thus corresponds to a curve on the ρ-θ plane, and curves corresponding to the points that belong to a particular lane in the image space intersect at a single point in the ρ-θ plane. If a point (ρ′, θ′) accumulates a large number of curves, a straight line on the image plane can be determined according to formula (2).
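Step (8) can be sketched as a simple accumulator over the block centers returned by the detection module; the angular step, the quantization of ρ, and the vote threshold are illustrative assumptions.

```python
import numpy as np

def hough_lines(centers, img_shape, theta_step_deg=1.0, min_votes=20):
    """Each detected block center (u, v) votes along rho = u*cos(theta) + v*sin(theta)
    for 0 <= theta < 180 degrees; accumulator peaks give the lane lines as (rho, theta)."""
    h, w = img_shape[:2]
    rho_max = int(np.hypot(h, w))
    thetas = np.deg2rad(np.arange(0.0, 180.0, theta_step_deg))
    acc = np.zeros((2 * rho_max + 1, len(thetas)), dtype=np.int32)
    for (u, v) in centers:
        rhos = np.round(u * np.cos(thetas) + v * np.sin(thetas)).astype(int) + rho_max
        acc[rhos, np.arange(len(thetas))] += 1       # one vote per theta for this curve
    peaks = np.argwhere(acc >= min_votes)
    return [(float(r - rho_max), float(np.rad2deg(thetas[t]))) for r, t in peaks]
```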
According to the technical solution in the present disclosure, parameters of the input module, the spatial attention module, the feature extraction module, and the detection module of the deep convolutional neural network in step (3) are determined by learning, including:
Sub-step A, preparing images for training: as shown in the drawings, lane markings in the images are manually labeled, and each labeled lane is segmented into image blocks, where each image block contains a lane marking in the center and also overlaps some background regions on both the left and right sides.
Sub-step B, preparing expected output for training images: each training image corresponds to an expected feature map. If the height and the width of a given training image are M and N, respectively, the expected output corresponding to the image is an M′ × N′ × C′ feature map, where M′ = ⌊M/8⌋ and N′ = ⌊N/8⌋, ⌊a⌋ represents the largest integer no greater than a, C′ = 5nB, and nB denotes a preset number of detection boxes for each anchor point. All values of the expected feature map are set according to the coverage of labeled marking regions.
Sub-step C, training: a training image is input to the deep convolutional neural network to generate the corresponding output feature map by the detection module, and a loss function is calculated according to the output feature map and the expected feature map of the training image; training images are loaded in batches to minimize the sum of the loss functions of all training samples in a batch, and the network parameters are updated by the stochastic gradient descent optimization algorithm.
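Sub-step C may be sketched as the following PyTorch training loop; the loss function, learning rate, momentum, and epoch count are placeholders rather than values specified in the disclosure.

```python
import torch

def train(model, loader, loss_fn, epochs=50, lr=1e-2):
    """Forward a batch, compare the detection module's output with the expected
    feature maps, and update the parameters with stochastic gradient descent."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    model.train()
    for _ in range(epochs):
        for images, expected_maps in loader:             # (input tensors, expected M' x N' x 5nB maps)
            optimizer.zero_grad()
            output_maps = model(images)
            loss = loss_fn(output_maps, expected_maps)   # sum of per-sample losses in the batch
            loss.backward()
            optimizer.step()
```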
According to the technical solution in the present disclosure, in step (6), the Stage 2, Stage 3, and Stage 4 convolutional layer groups of the ResNet50 network serve as the feature extraction module, where Stage 2 includes 3 residual blocks, denoted as ResBlock2_i, Stage 3 includes 4 residual blocks, denoted as ResBlock3_i, and Stage 4 includes 6 residual blocks, denoted as ResBlock4_i, where i = 1, 2, ..., nR, and nR denotes the number of residual blocks in the stage. The first residual blocks in Stage 2, Stage 3, and Stage 4 are ResBlock2_1, ResBlock3_1, and ResBlock4_1, respectively. Their structures include two branches. The main branch includes 3 convolutional layers, where the first convolutional layer has C 1 × 1 kernels, the second has C 3 × 3 kernels, and the third has 4C 1 × 1 kernels. Each convolutional layer is followed by a batch normalization and a ReLU operation. The 3 convolutional layers of ResBlock2_1 all have a stride of 1, while in ResBlock3_1 or ResBlock4_1, the first convolutional layer has a stride of 2 and the others have a stride of 1. The other branches of ResBlock2_1, ResBlock3_1, and ResBlock4_1 are shortcut branches, each of which includes a convolutional layer followed by a batch normalization operation. The convolutional layer of the shortcut branch of ResBlock2_1 has 4C 1 × 1 kernels with a stride of 1. The convolutional layers of the shortcut branches of ResBlock3_1 and ResBlock4_1 have 4C 1 × 1 kernels with a stride of 2. The outputs of the last convolutional layer of the main branch and of the shortcut branch are fused via an element-wise sum to serve as the output of the residual block. In Stage 2, Stage 3, and Stage 4 of the ResNet50 network, any residual block other than ResBlock2_1, ResBlock3_1, and ResBlock4_1 has a structure consisting of two branches. The main branch has 3 convolutional layers: the first convolutional layer has C 1 × 1 kernels, the second has C 3 × 3 kernels, and the third has 4C 1 × 1 kernels. Each convolutional layer is followed by a batch normalization and a ReLU operation, and all these convolutional layers have a stride of 1. The other branch directly copies the feature map input to the residual block, and its output is summed element-wise with the output of the last convolutional layer of the main branch to serve as the output of the residual block.
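The residual blocks described above may be sketched as follows; `projection` selects the convolutional shortcut used by ResBlock2_1, ResBlock3_1, and ResBlock4_1, and `stride` is 2 for ResBlock3_1 and ResBlock4_1 (1 elsewhere). Following the description, a ReLU is applied after the third convolution of the main branch before the element-wise sum.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Residual block with a 1x1 / 3x3 / 1x1 main branch (C, C, 4C kernels) and
    either a convolutional (projection) or identity shortcut branch."""
    def __init__(self, in_channels, C, stride=1, projection=False):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(in_channels, C, 1, stride=stride, bias=False),  # stride 2 in ResBlock3_1/4_1
            nn.BatchNorm2d(C), nn.ReLU(inplace=True),
            nn.Conv2d(C, C, 3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(C), nn.ReLU(inplace=True),
            nn.Conv2d(C, 4 * C, 1, stride=1, bias=False),
            nn.BatchNorm2d(4 * C), nn.ReLU(inplace=True),
        )
        if projection:  # ResBlock2_1 / ResBlock3_1 / ResBlock4_1: convolutional shortcut
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, 4 * C, 1, stride=stride, bias=False),
                nn.BatchNorm2d(4 * C),
            )
        else:           # remaining blocks: identity shortcut
            self.shortcut = nn.Identity()

    def forward(self, x):
        return self.main(x) + self.shortcut(x)
```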
The above are only preferred embodiments of the present disclosure, but the scope of protection of the present disclosure is not limited thereto. Any modification or replacement made within the spirit and principle of the present disclosure shall fall within the scope of protection of the present disclosure.
Number: 202110717975.6; Date: Jun. 2021; Country: CN; Kind: national.
The present application is a continuation of International Application No. PCT/CN2022/078677, filed on Mar. 1, 2022, which claims priority to Chinese Application No. 202110717975.6, filed on Jun. 28, 2021, the contents of both of which are incorporated herein by reference in their entireties.
Parent: PCT/CN2022/078677, Mar. 2022, WO; Child: 18107514, US.