The present disclosure relates to the technical field of concrete crack pattern classification, and in particular to a three-stage modularized convolutional neural network (CNN) for rapidly classifying concrete cracks in images. The proposed network can also serve as a backbone for efficient feature extraction in object detection algorithms.
Concrete buildings are inevitably subject to damage from both man-made and environmental factors. One of the most common types of damage is the occurrence of cracks. Therefore, there is a need for efficient and accurate classification of these cracks. The advancement of unmanned aerial vehicles (UAVs), crawling robots, and wireless transmission technology has paved the way for the collection of large-scale data on concrete buildings. This, in turn, opens up possibilities for the development of intelligent classification systems for apparent cracks in concrete structures.
Compared to traditional manual classification, crack classification using deep learning offers several advantages, including high accuracy and fast detection speed. However, deep learning neural networks originating from the computer field are characterized by their large size for classifying thousands of classes and have a substantial number of convolutional layers with similar structural characteristics. As a result, they are unsuitable for rapidly classifying concrete cracks.
The present disclosure proposes a three-stage modularized CNN for rapidly classifying concrete cracks in images, comprising the following steps.
A concrete crack dataset is built for training the CNN.
The structure of the three-stage modularized CNN, which could be called Stairnet, consists of an input layer, blocks of stair1 in shallow layers, a convolutional block attention module (CBAM), blocks of stair2 in mid-layers, another CBAM, blocks of stair3 in deep layers, and a fully connected layer.
Once the Stairnet model has been successfully trained, it can be employed to classify concrete cracks in images by inputting the concrete crack images into the model.
The shallow layers of the model can be referred to as stair1 and are constructed using inverted residual blocks that exclusively consist of convolutions (Convs).
The mid-layers of the model can be referred to as stair2. When the stride is set to 1, the stair2 structure performs a split operation on the input channels. One part of the channels passes through an inverted residual structure that includes a depthwise separable convolution (DConv), while the other part does not undergo any operation; the two parts are then concatenated and a channel shuffle operation is performed. When the stride is set to 2, the stair2 structure copies the input channels into three parts. One part is reduced in dimension through an inverted residual structure with a depthwise separable convolution, another part is reduced in dimension through a depthwise separable convolution, and the third part is reduced in dimension through maximum pooling. Finally, the three dimension-reduced parts are concatenated and a channel shuffle operation is performed.
The deep layer of the model can be referred to as stair3, including inverted residual structures containing depthwise separable convolutions and efficient channel attention (ECA) modules.
Preferably, the stair1 structure has two variants, depending on whether its expansion factor is equal to 1.
Preferably, the input layer includes a convolution layer, a batch normalization (BN) layer, and an activation function (AF) layer.
Preferably, the normalization processing of the BN layer is shown in the following formulas:
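In the standard batch-normalization form assumed here (ε denotes the usual small constant for numerical stability):

$$\mu_B=\frac{1}{m}\sum_{i=1}^{m}x_i,\qquad \sigma_B^2=\frac{1}{m}\sum_{i=1}^{m}(x_i-\mu_B)^2,\qquad \hat{x}_i=\frac{x_i-\mu_B}{\sqrt{\sigma_B^2+\varepsilon}},\qquad y_i=\gamma\hat{x}_i+\beta$$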
where x_i is a feature map before being input to the BN layer; y_i is the feature map output from the BN layer; m is the number of feature maps input to the layer in the current training batch; and γ and β are learnable parameters updated with the network gradients.
Preferably, the AF layer performs non-linear processing via ReLU6:
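In its standard form, ReLU6 clips the activation between 0 and 6:

$$f(x_i)=\min\big(\max(0,\,x_i),\,6\big)$$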
where xi is a feature map before inputting the ReLU6, and f(xi) is a feature map after outputting the ReLU6.
Preferably, another AF layer performs non-linear processing via the Hardswish function:
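In its standard form, the Hardswish activation is:

$$f(x)=x\cdot\frac{\mathrm{ReLU6}(x+3)}{6}$$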
where x is a feature map before inputting the Hardswish, and f(x) is a feature map after outputting the Hardswish.
Preferably, the ECA attention mechanism performs cross-channel interaction on data to obtain an enhanced concrete crack feature extraction map;
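In the standard ECA form assumed here, consistent with the definitions below:

$$k=\left|\frac{\log_2(C)}{\gamma}+\frac{b}{\gamma}\right|_{odd},\qquad E_s(F)=\sigma\big(f^{k\times k}[\mathrm{AvgPool}(F)]\big)\otimes F$$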
where |t|odd represents the nearest odd number to t; C represents the number of channels of the data input into the ECA attention mechanism; γ and b are two hyper-parameters, with γ set to 2 and b set to 1; E_s(F) is the ECA attention mechanism; σ is a sigmoid operation; f^{k×k}[·] represents performing a k×k convolution operation; F is the input feature map; and AvgPool( ) is the average pooling operation.
Preferably, in the CBAM attention mechanism, the average pooling and maximum pooling are used to aggregate spatial information of the feature map, compress spatial dimensions of the input feature map, and sum and merge element by element to generate a channel attention map:
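In the standard CBAM form assumed here, consistent with the definitions below:

$$M_c(F)=\sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F))+\mathrm{MLP}(\mathrm{MaxPool}(F))\big)$$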
where Mc represents the channel attention; MLP( ) is composed of fully connected layer 1 + ReLU6 + fully connected layer 2; σ is the sigmoid operation; F is the input feature map; AvgPool( ) is the average pooling; and MaxPool( ) is the maximum pooling; and
the average pooling and the maximum pooling methods are used to compress the input feature map in a spatial attention module, to obtain a feature extraction map containing more crack information:
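In the standard CBAM form assumed here:

$$M_s(F)=\sigma\big(f^{7\times 7}\big[\mathrm{AvgPool}(F);\,\mathrm{MaxPool}(F)\big]\big)$$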
where Ms represents the spatial attention mechanism, σ is the sigmoid operation, f^{7×7}[·] represents performing a 7×7 convolution operation, F is the input feature map, AvgPool( ) is the average pooling, and MaxPool( ) is the maximum pooling.
Preferably, the method further includes:
sparsifying data passing through a dropout layer in each layer to avoid network over-fitting:
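In the standard dropout form assumed here, a binary mask sampled from the Bernoulli distribution multiplies the output of the previous layer element-wise:

$$r_j^{(l)}\sim\mathrm{Bernoulli}(p),\qquad \tilde{y}^{(l)}=r^{(l)}\odot y^{(l)}$$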
where the Bernoulli(p) function is used to generate a probability vector r_j^(l), so that a neuron stops working with the probability p; y^(l) is the output feature map of the previous layer; and ỹ^(l) is the feature map output after passing through the dropout layer.
Preferably, the method further includes:
optimizing the network internal parameters using the following Adam algorithm:
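In the standard Adam form assumed here (ε denotes a small constant preventing division by zero):

$$g_t=\nabla_\theta f(\theta_{t-1}),\qquad m_t=\beta_1 m_{t-1}+(1-\beta_1)g_t,\qquad v_t=\beta_2 v_{t-1}+(1-\beta_2)g_t^2$$
$$\hat{m}_t=\frac{m_t}{1-\beta_1^t},\qquad \hat{v}_t=\frac{v_t}{1-\beta_2^t},\qquad \theta_t=\theta_{t-1}-\alpha\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\varepsilon}$$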
where Loss(y_{o,c}, p_{o,c}) is the loss function between a predicted value and a true value of the network; θ is a parameter to be updated in the model; g_t is the gradient obtained by differentiating the loss function f(θ) with respect to θ; β1 is a first-moment attenuation coefficient; β2 is a second-moment attenuation coefficient; m_t is an expectation of the gradient g_t; v_t is an expectation of g_t²; m̂_t is the bias correction of m_t; v̂_t is the bias correction of v_t; θ_{t-1} is the parameter before the network update; θ_t is the parameter after the network update; and α is the learning rate.
The advantageous effects of the present disclosure are as follows:
The present disclosure proposes a three-stage modularized CNN aimed at rapidly classifying concrete cracks in images. CNN models such as AlexNet, vgg16, resnet50, GoogLeNet, and mobilenet_v3_large have similar structures across their layers. However, these models tend to be large and relatively slow in classifying concrete cracks. In contrast, the proposed model, named Stairnet, exhibits distinct feature characteristics in its shallow, mid, and deep layers. The three-stage modularized structure of Stairnet is specifically designed to be smaller in size, require shorter training time, and achieve the highest classification accuracy among the compared models. Furthermore, Stairnet can serve as a backbone for efficient feature extraction in object detection algorithms.
The present disclosure is described in detail in combination with the drawings and embodiments. The specific embodiments described herein are intended only to explain the present disclosure and are not intended to limit it.
The three-stage modularized CNN in the present disclosure is implemented using PyTorch and further details can be found in Table 1:
Step 1, a concrete crack dataset is built for training the CNN;
Step 2, stair1 is utilized as the shallow layers of the network;
Step 3, stair2 is utilized as the mid-layers of the network;
Step 4, stair3 is utilized as the deep layer of the network;
Step 5, based on the three stairs (stair1 to stair3) and deep learning components such as attention mechanisms, the Stairnet is formed, and the dataset is used for training the Stairnet until the model converges.
Step 6, multiple concrete crack images can be fed into the well-trained Stairnet to obtain the crack classes in the images.
To build the dataset in step 1, the concrete crack images are manually classified. The crack classes include transverse crack, vertical crack, oblique crack, mesh crack, irregular crack, hole, and no crack (background), as shown in
In Step 2, stair1 is composed of inverted residual structures that exclusively utilize convolutions. There are two variations in stair1, depending on whether the expansion factor is 1 or not. The structure of stair1 is depicted in
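The following is a minimal PyTorch sketch of a stair1-style inverted residual block built only from standard convolutions, with the two variants selected by the expansion factor. The channel counts, kernel sizes, and the use of ReLU6 are illustrative assumptions, not the patented configuration.

```python
import torch.nn as nn


class Stair1Block(nn.Module):
    """Inverted residual block using standard convolutions only.

    When expansion == 1 the 1x1 expansion convolution is skipped,
    giving the second variant mentioned in the text.
    """

    def __init__(self, in_ch, out_ch, stride=1, expansion=1):
        super().__init__()
        hidden = in_ch * expansion
        self.use_residual = stride == 1 and in_ch == out_ch
        layers = []
        if expansion != 1:
            # 1x1 expansion convolution (only when the expansion factor is not 1)
            layers += [nn.Conv2d(in_ch, hidden, 1, bias=False),
                       nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True)]
        layers += [
            # 3x3 standard convolution (no depthwise separation in stair1)
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            # 1x1 projection convolution
            nn.Conv2d(hidden, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        ]
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out
```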
In step 3, the structure of stair2 is shown in
When the stride is set to 1, the stair2 structure performs a split operation on the input channels. One part of the channels passes through an inverted residual structure that includes a depthwise separable convolution (DConv), while the other part does not undergo any operation; the two parts are then concatenated and a channel shuffle operation is performed. The structure of the depthwise separable convolution is shown in
When the stride is set to 2, the stair2 structure copies the input channels into three parts. One part is reduced in dimension through an inverted residual structure with a depthwise separable convolution, another part is reduced in dimension through a depthwise separable convolution, and the third part is reduced in dimension through maximum pooling. Finally, the three dimension-reduced parts are concatenated and a channel shuffle operation is performed.
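The following is a minimal PyTorch sketch of the stair2 block described in the two paragraphs above, assuming a ShuffleNet-style channel split and channel shuffle. The channel bookkeeping (each stride-2 branch keeping the input width, so the block triples the channels) and the expansion factor are illustrative assumptions.

```python
import torch
import torch.nn as nn


def channel_shuffle(x, groups):
    """Interleave channels coming from the concatenated branches."""
    n, c, h, w = x.size()
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)


def dwsep_conv(in_ch, out_ch, stride):
    """Depthwise separable convolution: 3x3 depthwise + 1x1 pointwise."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU6(inplace=True),
    )


def inverted_residual_dw(in_ch, out_ch, stride, expansion=2):
    """1x1 expansion -> 3x3 depthwise -> 1x1 projection."""
    hidden = in_ch * expansion
    return nn.Sequential(
        nn.Conv2d(in_ch, hidden, 1, bias=False), nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
        nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1, groups=hidden, bias=False),
        nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
        nn.Conv2d(hidden, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
    )


class Stair2Block(nn.Module):
    def __init__(self, channels, stride):
        super().__init__()
        self.stride = stride
        if stride == 1:
            half = channels // 2
            self.branch = inverted_residual_dw(half, half, 1)     # processed half
        else:
            self.branch1 = inverted_residual_dw(channels, channels, 2)
            self.branch2 = dwsep_conv(channels, channels, 2)
            self.branch3 = nn.MaxPool2d(3, stride=2, padding=1)   # pooling branch keeps the channel width

    def forward(self, x):
        if self.stride == 1:
            a, b = x.chunk(2, dim=1)                  # split the input channels
            out = torch.cat((a, self.branch(b)), 1)   # identity half + processed half
            return channel_shuffle(out, 2)
        out = torch.cat((self.branch1(x), self.branch2(x), self.branch3(x)), 1)
        return channel_shuffle(out, 3)                # shuffle the three reduced branches
```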
In step 4, the structure of stair3 is as shown in
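The following is a minimal PyTorch sketch of a stair3-style block: an inverted residual structure with a depthwise separable convolution followed by an ECA attention module. The kernel-size rule follows the ECA formula quoted earlier (γ = 2, b = 1); the ECA module is implemented with the customary 1-D convolution of size k across channels, and the expansion factor and Hardswish activation are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn


class ECA(nn.Module):
    """Efficient channel attention: global average pooling + 1-D convolution + sigmoid."""

    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        t = int(abs(math.log2(channels) / gamma + b / gamma))
        k = t if t % 2 else t + 1          # nearest odd kernel size
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):
        w = x.mean((2, 3), keepdim=True)               # average pooling over spatial dims -> N,C,1,1
        w = self.conv(w.squeeze(-1).transpose(1, 2))   # 1-D convolution across channels
        w = torch.sigmoid(w.transpose(1, 2).unsqueeze(-1))
        return x * w                                    # reweight the crack feature maps


class Stair3Block(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1, expansion=2):
        super().__init__()
        hidden = in_ch * expansion
        self.use_residual = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False), nn.BatchNorm2d(hidden), nn.Hardswish(),
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.Hardswish(),
            ECA(hidden),                               # channel attention inside the block
            nn.Conv2d(hidden, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out
```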
In step 5, the structure of the Stairnet is shown in
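The following is a minimal sketch of how the Stairnet pieces might be assembled, reusing the Stair1Block, Stair2Block, and Stair3Block sketches above (assumed to be in scope). The CBAM module follows the channel and spatial attention formulas quoted earlier; the stage widths, block counts, and the 7-class head are illustrative assumptions, not necessarily the configuration in Table 1.

```python
import torch
import torch.nn as nn


class CBAM(nn.Module):
    """Channel attention (avg + max pooled MLP) followed by spatial attention (7x7 conv)."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU6(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, x):
        n, c, _, _ = x.shape
        avg = self.mlp(x.mean((2, 3)))                       # average-pooled branch
        mx = self.mlp(x.amax((2, 3)))                        # max-pooled branch
        x = x * torch.sigmoid(avg + mx).view(n, c, 1, 1)     # channel attention Mc
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))            # spatial attention Ms


class Stairnet(nn.Module):
    def __init__(self, num_classes=7):
        super().__init__()
        self.stem = nn.Sequential(                           # input layer: Conv + BN + AF
            nn.Conv2d(3, 32, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(32), nn.Hardswish(),
        )
        self.stage1 = nn.Sequential(Stair1Block(32, 32),
                                    Stair1Block(32, 64, stride=2, expansion=2))
        self.cbam1 = CBAM(64)
        self.stage2 = nn.Sequential(Stair2Block(64, stride=2), Stair2Block(192, stride=1))
        self.cbam2 = CBAM(192)
        self.stage3 = nn.Sequential(Stair3Block(192, 256, stride=2), Stair3Block(256, 256))
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Dropout(0.2), nn.Linear(256, num_classes))

    def forward(self, x):
        x = self.cbam1(self.stage1(self.stem(x)))
        x = self.cbam2(self.stage2(x))
        return self.head(self.stage3(x))
```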
The normalization processing of the BN layer is shown in the following formulas:
where x_i is a feature map before being input to the BN layer; y_i is the feature map output from the BN layer; m is the number of feature maps input to the layer in the current training batch; and γ and β are learnable parameters updated with the network gradients.
The AF layer performs non-linear processing via ReLU6:
where xi is a feature map before inputting the ReLU6, and f(xi) is a feature map after outputting the ReLU6.
The AF layer performs non-linear processing via the Hardswish function:
where x is a feature map before inputting the Hardswish, and f(x) is a feature map after outputting the Hardswish.
Specifically, the ECA attention mechanism performs cross-channel interaction on data to obtain an enhanced concrete crack feature extraction map;
where |t|odd represents the nearest odd number to t; C represents the number of channels of the data input into the ECA attention mechanism; γ and b are two hyper-parameters, with γ set to 2 and b set to 1; E_s(F) is the ECA attention mechanism; σ is a sigmoid operation; f^{k×k}[·] represents performing a k×k convolution operation; F is the input feature map; and AvgPool( ) is the average pooling operation.
In the CBAM attention mechanism, the average pooling and maximum pooling are used to aggregate spatial information of the feature map, compress spatial dimensions of the input feature map, and sum and merge element by element to generate a channel attention map:
where Mc represents the channel attention; MLP( ) is composed of fully connected layer 1 + ReLU6 + fully connected layer 2; σ is the sigmoid operation; F is the input feature map; AvgPool( ) is the average pooling; and MaxPool( ) is the maximum pooling.
The average pooling and the maximum pooling methods are used to compress the input feature map in a spatial attention module, to obtain a feature extraction map containing more crack information:
where Ms represents the spatial attention mechanism, σ is the sigmoid operation, f^{7×7}[·] represents performing a 7×7 convolution operation, F is the input feature map, AvgPool( ) is the average pooling, and MaxPool( ) is the maximum pooling.
The data passing through the dropout layer in each layer is sparsely processed to avoid network over-fitting:
where the Bernoulli(p) function is used to generate a probability vector r_j^(l), so that a neuron stops working with the probability p; y^(l) is the output feature map of the previous layer; and ỹ^(l) is the feature map output after passing through the dropout layer.
The following Adam algorithm is used to optimize the network internal parameters:
where Loss(y_{o,c}, p_{o,c}) is the loss function between a predicted value and a true value of the network; θ is a parameter to be updated in the model; g_t is the gradient obtained by differentiating the loss function f(θ) with respect to θ; β1 is a first-moment attenuation coefficient; β2 is a second-moment attenuation coefficient; m_t is an expectation of the gradient g_t; v_t is an expectation of g_t²; m̂_t is the bias correction of m_t; v̂_t is the bias correction of v_t; θ_{t-1} is the parameter before the network update; θ_t is the parameter after the network update; and α is the learning rate.
Stairnet and the commonly used neural network models AlexNet, GoogLeNet, vgg16_bn, resnet34, and Mobilenet_v3_large are trained and validated in this embodiment. The training process is illustrated in
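In the form assumed here, consistent with the definitions below, the accuracy over a data set is:

$$\mathrm{Accuracy}=\frac{1}{N}\sum_{o=1}^{N}\mathrm{eq}\big(y_{o,c},\,\max(p_{o,c})\big)$$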
where y_{o,c} is the true value (label) of a single image in a data set (training set/validation set); p_{o,c} is the predicted value of the network, consisting of 7 probabilities corresponding to the 7 crack categories; max( ) extracts the category corresponding to the highest probability in p_{o,c}; eq( ) verifies whether the true value (label) y_{o,c} is equal to max(p_{o,c}); Σ_N( ) counts, over all images in the data set, the number of images whose true value (label) y_{o,c} is equal to max(p_{o,c}); and N is the number of all the crack images in the data set.
The loss is calculated as follows:
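In the form assumed here, consistent with the definitions below, the per-image cross entropy and the number of training steps per epoch are:

$$\mathrm{Loss}(y_{o,c},p_{o,c})=-\sum_{c=1}^{M}y_{o,c}\log(p_{o,c}),\qquad N_{steps}=\frac{N}{N_{batch}}$$

with the reported training loss taken as the mean of the per-image losses over the N_batch images of each step and over the N_steps steps of an epoch.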
where Loss(y_{o,c}, p_{o,c}) is the error between the predicted value and the true value of the network, calculated using cross entropy for a single image; M is the number of classes, taken as 7 in this embodiment; N_steps is the number of training steps; N is the number of all crack images in the data set; and N_batch is the number of images in a batch, taken as 16 in this embodiment.
In addition, precision and recall for each crack type are calculated on the test sets and summarized in Table 4. Compared to the general CNNs, Stairnet has higher precision and recall for most crack types; for example, for mesh cracks it reaches 0.90 and 0.94, versus 0.70 and 0.88 for VGG16_bn.
Precision is the proportion of the samples predicted to be positive that are actually positive; the higher the precision, the lower the probability of network false positives. Precision is calculated as follows:
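In the standard form (with TP, FP, FN, and TN as defined below):

$$\mathrm{Precision}=\frac{TP}{TP+FP}$$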
Recall, the true positive (TP) rate, is the proportion of the actual positive samples that are correctly predicted as positive; the higher the recall, the lower the probability of network false negatives. Recall is calculated as follows:
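In the standard form:

$$\mathrm{Recall}=\frac{TP}{TP+FN}$$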
Specificity, the true negative (TN) rate, is the proportion of the actual negative samples that are correctly predicted as negative, and is calculated as follows:
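In the standard form:

$$\mathrm{Specificity}=\frac{TN}{TN+FP}$$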
where TP, TN, false positive (FP), and false negative (FN) are shown in Table 5; the second letter, P (Positive) or N (Negative), indicates the predicted class, and the first letter, T (True) or F (False), indicates whether the prediction matches the actual class. The explanation is as follows:
TP: The network judges that the sample is positive, and the judgment is true (in fact, the sample is positive).
TN: The network judges that the sample is negative, and the judgment is true (in fact, the sample is negative).
FP: The network judges that the sample is positive, and the judgment is false (in fact, the sample is negative).
FN: The network judges that the sample is negative, and the judgment is false (in fact, the sample is positive).
In conclusion, the Stairnet model proposed in this embodiment exhibits superior classification accuracy for concrete cracks compared to other comparative CNN models, all while maintaining a significantly smaller size.
The above is only an embodiment of the present disclosure and is not intended to limit the present disclosure. Any modifications, equivalent substitutions, and the like made within the spirit and principles of the present disclosure shall be included in the scope of protection of the present disclosure.
Foreign application priority data: No. 2022117054944, Dec 2022, CN (national).