An Improved YOLOv5 Crack Detection Method Combined With Transformer
Transportation network maintenance is crucial for urban construction and development. Poor pavement conditions raise safety concerns and cause economic losses, making crack detection and repair an essential task.
Crack detection methods have gradually evolved from manual scanning to techniques like 3D lidar and sound waves to more effective alternatives involving imaging sensors and computer vision algorithms.
Automatic feature extraction with deep learning improves crack detection and removes the need for the hand-crafted features that existing systems rely on. Deep learning methods for crack detection typically use either an encoder-decoder architecture for pixel-level classification or an object detection approach that regresses bounding boxes and classifies the cracks inside them.
The paper proposes a hybrid pavement crack detection network that combines the prowess of YOLOv5 and a Transformer, leveraging one-stage architecture and long-range dependency capture for efficient and reliable crack detection.
The YOLO (You Only Look Once) series of real-time object detection systems uses a one-stage architecture and has powerful feature extraction capabilities. The Transformer’s ability to capture the long-range dependence of cracks helps acquire contextual information from the crack region.
Pavement damage, mainly longitudinal and lateral cracks, is a significant issue in practical applications. Convolution alone struggles to cover an entire crack region because its receptive field is local, and simply stacking more convolutional layers is still insufficient for extracting features of thin cracks that extend over long distances. Transformer structures are therefore introduced to capture the long-range dependence of crack objects and extract contextual semantics.
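To illustrate what long-range dependence means here, the following is a minimal sketch of single-head self-attention over a feature map with height H, width W, and C channels. It is not the paper's implementation: the function names are hypothetical, and the query, key, and value projections are taken to be the identity for brevity, whereas a real Transformer layer learns them and adds multiple heads, positional information, and a feed-forward sublayer. The point it shows is that every spatial position is mixed with every other position in a single step, regardless of distance.

```python
import math


def flatten_feature_map(fmap):
    """Turn an H x W x C nested list into a list of H*W tokens of length C."""
    return [pixel for row in fmap for pixel in row]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product attention with identity Q/K/V (a simplification)."""
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    out = []
    for query in tokens:
        # Score this position against every position, near or far.
        scores = [sum(a * b for a, b in zip(query, key)) / scale for key in tokens]
        weights = softmax(scores)
        # The output is a weighted mix of all tokens in the map.
        out.append([sum(w * v[c] for w, v in zip(weights, tokens)) for c in range(dim)])
    return out


def attend(fmap):
    """Apply self-attention to an H x W x C map and restore its H x W x C shape."""
    height, width = len(fmap), len(fmap[0])
    tokens = self_attention(flatten_feature_map(fmap))
    return [tokens[r * width:(r + 1) * width] for r in range(height)]
```

Calling `attend` on a nested list of shape H x W x C returns a map of the same shape. The cost grows with the square of H*W, which is one reason such layers are usually applied to low-resolution feature maps.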
The pavement crack detection network stacks three Transformer layers to construct Block4 of the Backbone, which expands the receptive field of each Transformer structure. The Transformer module accepts feature maps of shape (H, W, C), representing height, width, and number of channels, respectively.
The network also uses TTA (test-time augmentation) and ensemble learning to enhance crack detection accuracy. TTA applies three scaling rates and horizontal flipping to the input image so that objects in a given region are detected in different forms. Ensemble learning runs multiple models trained with different hyperparameter settings and merges their results to avoid the limitations of any single model: the pipeline averages the predictions and votes on the category predictions from all models. TTA can significantly improve detection accuracy while maintaining the basic inference speed.
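The TTA and ensemble steps can be sketched as follows. This is an illustration under stated assumptions rather than the paper's pipeline: the three scale values are hypothetical (the source does not list them), pairing every scale with and without a flip is an assumption, and `merge` assumes the predictions passed to it have already been matched to the same object, a step that real pipelines handle with techniques such as non-maximum suppression. All function names are made up for this example.

```python
from collections import Counter

# Hypothetical scaling rates; the source only says that three are used.
SCALES = (0.67, 0.83, 1.0)


def tta_variants(scales=SCALES):
    """Enumerate (scale, flipped) pairs: each scale with and without a horizontal flip."""
    return [(scale, flipped) for scale in scales for flipped in (False, True)]


def to_original(box, scale, flipped, orig_width):
    """Map an (x1, y1, x2, y2) box from an augmented image back to the original image."""
    x1, y1, x2, y2 = box
    if flipped:
        # Undo the horizontal flip inside the scaled image first.
        aug_width = orig_width * scale
        x1, x2 = aug_width - x2, aug_width - x1
    # Then undo the scaling.
    return (x1 / scale, y1 / scale, x2 / scale, y2 / scale)


def merge(predictions):
    """Average box coordinates and majority-vote the label for one matched object.

    `predictions` is a list of (box, label) pairs, one per model or TTA variant.
    """
    boxes = [box for box, _ in predictions]
    labels = [label for _, label in predictions]
    averaged = tuple(sum(coords) / len(boxes) for coords in zip(*boxes))
    return averaged, Counter(labels).most_common(1)[0][0]
```

For example, a box detected at (10, 20, 30, 40) in a flipped, half-size copy of a 200-pixel-wide image maps back to (140, 40, 180, 80) in the original, after which it can be merged with the boxes that other variants or models produced for the same crack.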
The proposed hybrid model was evaluated experimentally on a public dataset for detecting specific types of crack damage, such as longitudinal, lateral, and alligator cracks. The network proved more practical and reliable than the two-stage and YOLOv5x-based methods it was compared with. Its performance is highly sensitive to the input image size, so the size can be chosen to suit different demands, especially when only coarse detection is required. For crack object detection, the Transformer structure performs better than u-YOLO and identifies more cracks than the baseline.
The enhanced YOLOv5 network, enriched with TTA and Transformer architectures, emerges as a quick, effective, and economical solution for urban pavement damage detection, and it opens avenues for expanding datasets to address diverse pavement problems in future applications.