High-Performance Segmentation of Power Lines in Aerial Images Using a Wavelet-Guided Hybrid Transformer Network

Burhan Baraklı; Ahmet Küçüker

doi:10.32604/cmes.2026.077872

icon Open Access

ARTICLE

High-Performance Segmentation of Power Lines in Aerial Images Using a Wavelet-Guided Hybrid Transformer Network

Burhan Baraklı, Ahmet Küçüker^*

Department of Electrical and Electronics Engineering, Sakarya University, Sakarya, Türkiye

* Corresponding Author: Ahmet Küçüker. Email: email

(This article belongs to the Special Issue: Advanced Artificial Intelligence and Machine Learning Methods Applied to Energy Systems)

Computer Modeling in Engineering & Sciences 2026, 146(2), 26 https://doi.org/10.32604/cmes.2026.077872

Received 18 December 2025; Accepted 04 February 2026; Issue published 26 February 2026

Abstract

Inspections of power transmission lines (PTLs) conducted using unmanned aerial vehicles (UAVs) are complicated by the fine structure of the lines and complex backgrounds, making accurate and efficient segmentation challenging. This study presents the Wavelet-Guided Transformer U-Net (WGT-UNet) model, a new hybrid network that combines Convolutional Neural Networks (CNNs), Discrete Wavelet Transform (DWT), and Transformer architectures. The model’s primary contribution is based on spatial and channel attention mechanisms derived from wavelet subbands to guide the Transformer’s self-attention structure. Thus, low and high frequency components are separated at each stage using DWT, suppressing structural noise and making linear objects more prominent. The developed design is supported by multi-component hybrid cost functions that simultaneously solve class imbalance, edge sharpness, structural integrity, and spatial regularity issues. Furthermore, high segmentation success has been achieved in producing sharp boundaries and continuous line structures with the DWT-guided attention mechanism. Experiments conducted on the TTPLA dataset reveal that the version using the ConvNeXt backbone outperforms the current state-of-the-art approaches with an F1-Score of 79.33% and an Intersection over Union (IoU) value of 68.38%. The models and visual outputs of the developed method and all compared models can be accessed at https://github.com/burhanbarakli/WGT-UNET.

Keywords

Salient object detection; superpixel segmentation; transformers; attention mechanism; multi-level fusion; edge-preserving refinement; model-driven

1 Introduction

Reliable energy transmission is a critical component of modern industrial and societal infrastructure [1,2]. Power transmission lines (PTLs), which are the most crucial among such systems, often traverse difficult and hazardous terrains for miles; therefore, their regular inspection becomes a necessity [3,4]. Traditional manual inspection methods are not only costly but also time-consuming, inefficient, and potentially hazardous for workers [5,6]. In recent years, the increasing use of Unmanned Aerial Vehicles (UAVs) has significantly advanced inspection processes and created new opportunities for computer vision-based automation systems [7,8]. The automatic detection and segmentation of PTLs from high-resolution aerial images captured by UAVs has become an important task for ensuring both the operational safety of the network and the flight safety of the UAVs [9,10].

However, segmentation from aerial images presents unique challenges. PTLs are thin, elongated objects that occupy a very small pixel area compared to the entire image, as shown in Fig. 1 [11–13]. Such conditions lead to a serious class imbalance problem between the background (vegetation, buildings, roads) and the foreground (power lines) [14,15]. Additionally, complex and variable backgrounds, differences in lighting conditions, and even weather conditions (fog, rain, etc.) weaken the visual characteristics of the lines, making the detection process even more complex [16,17].

images

Figure 1: Power transmission line image samples from transmission towers and power lines (TTPLA) dataset.

Literature has proposed several methods to address the above challenges. Early studies relied on traditional image processing techniques for edge and line detection, such as the Hough transform, Radon transform, and Line Segment Detector, among others [18,19]. Although such techniques are straightforward and computationally inexpensive, they often exhibit limited robustness in complex backgrounds and are sensitive to noise [20]. With the rise of deep learning, research has increasingly focused on convolutional neural networks (CNN)-based semantic segmentation models [21]. Encoder–decoder architectures such as Fully Convolutional Network (FCN), U-Net, and DeepLabv3+ enable accurate pixel-wise classification and have been widely adopted for semantic segmentation [11,22–24]. Furthermore, fundamental architectures have been adapted for PTL detection in various works, and loss functions such as Focal Loss have been employed to better handle class imbalance.

Recent advanced approaches go beyond standard CNN architectures. Hybrid models that combine the local feature extraction power of CNNs with the global context modeling capability of transformers have become increasingly influential in the field [21]. DUFormer proposed a structure where CNN-based rich features feed a transformer block using the heavy token encoder concept [10]. Conversely, the combination of signal processing methods with deep learning has resulted in promising outcomes. Generative Adversarial Networks for Power-Line Segmentation (PLGAN) forces linear consistency in the output by using the Hough transform as a loss function [7], while PLNet uses information obtained from Hough space as an attention mechanism [9]. WaveGNet uses wavelet transform to suppress low-frequency background noise and highlight high-frequency (HF) details [4]. Collectively, the studies highlight the importance of integrating domain-specific prior knowledge into deep learning models.

The existing literature on PTL segmentation largely focuses on wavelet/frequency-based approaches, transformers, and CNN-transformer hybrids. Wavelet-based studies typically aim to suppress the background and strengthen fine line/edge cues using HF components; for example, WaveGNet uses HF coefficients to weaken background information and emphasize edge regions via wavelet decomposition [4]. Similarly, the Wavelet Feature Enhancement (WFE) module in remote sensing segmentation can be seen as a representative feature enhancement approach that feeds detail back to network layers by enhancing HF sub-bands with multiscale Discrete Wavelet Transform (DWT) [25]. However, these wavelet techniques are most often applied at the feature-map level (e.g., enhancement or convolutional/gating attention), rather than being incorporated as an explicit bias within the transformer’s self-attention scoring that governs token-to-token interactions. On the other hand, transformer-based/hybrid methods are effective at modeling the long-range persistence of PTLs. For instance, DUFormer performs global context modeling by converting CNN-based features into tokens with a heavy token encoder and feeding them into a transformer block [10]. Nevertheless, in most transformer designs, attention is primarily shaped by learned content representations and tokenization, and it is less common to use frequency-domain fine-line/edge characteristics as an explicit structural premise to guide self-attention. Overall, frequency information is often introduced as input/feature refinement, while PTL segmentation approaches that directly inject a frequency-derived bias into transformer self-attention scoring appear less explored [11].

The Wavelet-Guided Transformer U-Net (WGT-UNet) proposed in this study aims to address this gap in the literature not only by incorporating a transformer component, but also by establishing a connection at the mechanism level. A spatial attention map is produced with the high frequency components obtained from the wavelet transform and this map is injected directly into the self-attention scores of the windowed transformer block as a bias. Thus, while maintaining the global context power of DUFormer-like hybrids [10], the model enhances the focus on fine structures by bringing the fine-edge cues provided by wavelet-based methods [4] into the self-attention decision; this integration is realized with the Wavelet-Guided Transformer (WGT) blocks and Wavelet Attention Module (WAM) architectures that form the core of WGT-UNet. The key contributions of the study are outlined below:

• A novel hybrid encoder-decoder network is proposed, which synergistically integrates a multi-scale CNN backbone with a decoder built from WGT blocks.

• A Wavelet Attention Module is introduced, which generates Spatial Attention from HF wavelet sub-bands and uses it to directly bias the self-attention mechanism of a windowed transformer block, thereby improving focus on relevant fine structures with low overhead.

• A five-component hybrid loss function is designed to jointly address class imbalance, regional consistency, and line continuity. The loss combines a primary segmentation term (Dice + BCE/Focal) with an auxiliary edge supervision loss, a wavelet-domain loss matching HF components, a low-frequency structural similarity (SSIM) loss, and a Total Variation (TV) regularization term for output smoothness.

• Through extensive experiments, it is demonstrated that the proposed model, particularly using the ConvNeXt backbone, surpasses state-of-the-art (SOTA) methods on the TTPLA dataset, achieving an F1-Score of 79.33%, an IoU of 68.38%, and 43.84 giga floating-point operations per second (GFLOPs), indicating a favorable accuracy-efficiency trade-off.

The structure of the paper is as follows: Related work is discussed in Section 2. The proposed WGT-UNet methodology is explained in detail in Section 3. Section 4 presents the experiment design, dataset, evaluation metrics, and the results obtained, and finally, Section 5 summarizes the work and suggests future research lines.

2 Related Works

The automatic detection and segmentation of PTLs from aerial images has become an important research topic in computer vision in recent years. Research in the area spans a broad range, from traditional image processing techniques to advanced deep learning-based architectures. The following section examines the fundamental approaches in the existing literature, discusses their strengths and weaknesses, and defines the position of the proposed methodology within that framework.

Before deep learning became the dominant approach, PTL detection was largely dependent on handcrafted feature extractors and geometric models. Various techniques like the Hough transform, Radon transform, and Line Segment Detector were often utilized as they could efficiently pick out linear forms in the pictures [18–20,26]. Traditional methods typically operate on images that have been preprocessed using edge detection algorithms such as Canny or Sobel [1]. Approaches like Otsu thresholding and k-means clustering have been used to separate the image into simple background and foreground [1,18]. The main advantage of such techniques is their low computational cost and simple applicability [18]. However, they are quite sensitive to lighting changes, shadows, complex background textures (e.g., tree branches, road edges), and image noise [13,22]. Thus, their accuracy in practical use is restricted and they generally produce many false positives [20].

The rise of deep learning has brought about a paradigm shift in segmentation [21]. Encoder–decoder architectures such as FCN, U-Net, and DeepLabv3+ have become standard approaches in the field [11,18,22,23]. In addition, the Pyramid Scene Parsing Network (PSPNet) [27] introduced a pyramid pooling module to aggregate global context information at multiple scales, improving the segmentation of objects with varying sizes. Similarly, the Feature Pyramid Network [28] architecture, originally developed for object detection, has been widely adopted in segmentation for its effective mechanism of combining semantically strong, low-resolution features with semantically weak, high-resolution features via a top-down pathway and lateral connections. These models enable precise segmentation of lines by classifying the image at the pixel level [22]. The encoder layer extracts hierarchical semantic features, while the decoder layer upsamples the extracted features to create a segmentation map at the original image resolution [11]. The skip connections introduced by the U-Net architecture have been quite effective in improving the segmentation accuracy of fine structures, by transferring low-level fine details (such as edges) from the encoder to the decoder [12,14].

Basic architectures have been specifically improved for the similar tasks in various studies. For example BASNet [29], which, although primarily designed for salient object detection, introduces a predict–refine framework with boundary quality awareness as its central focus. The emphasis on sharp edge definition is highly relevant to the challenge of segmenting thin power lines. For instance, ResNet-Ghost-SIMAM-UNet (RGS-UNet) enhances the U-Net backbone with Residual Network and achieves a lightweight structure by reducing the number of model parameters with Ghost Modules [12]. PL-Deeplab extends the DeepLabv3+ architecture with a multi-branch concatenation network for stronger feature extraction and a one-shot concatenation feature pyramid to provide a wider receptive field [8]. PLE-Net captures rich contextual relationships by adding a multi-scale feature enhancement block to the U-Net architecture [5]. Collectively, the studies demonstrate how effective basic encoder–decoder structures can be in PTL detection with the right modifications.

The literature has extensively utilized attention mechanisms to enhance the performance of models based on CNNs [20,21]. The introduction of attention mechanisms allows the model to focus on relevant regions of the image, thereby improving the detection of objects that are often obscured by complex backgrounds [15,30].

PLE-Net has highlighted regions using a self-attention block [5]. RGS-UNet has used a Simple-Parameter-Free Attention (SimAM) mechanism, which handles channel and spatial dimensions together and does not require additional parameters [12]. PL-UNet has proposed a multi-scale attention gate to improve feature extraction in key regions [11]. More recently, the Axial-UNet++ adopted a gated axial-attention method which applies the attention function one after the other to the x and y axes to reduce the computational cost of the regular attention mechanisms [31]. All the research findings show that the power of attention mechanisms can be exploited fully by virtue of the models’ increased representational strength which consequently leads to better accuracy in segmentation tasks.

Several studies suggested to use hybrid or multi-stage methods that would involve combining different algorithms rather than relying on a single end-to-end model. Taking one study as an instance, components were initially located in bounding boxes through an efficient object detection model like YOLO (You Only Look Once), and subsequently, the center points of those boxes were merged in order to reconstruct the whole lines in a post-processing step [17]. Another hybrid scenario starts with the extraction of a region of interest by means of classical methods like Canny/Sobel, and then a more robust deep learning model like Mask Region-based Convolutional Neural Network is applied to that region only [18]. Similarly, another study proposed a technique that integrates line detection and semantic segmentation [26]. Generally, such methods are directed towards a practical compromise between speed and accuracy; however, because of their multi-stage nature, they not only have a risk of error propagation but also cannot fully exploit the benefits of end-to-end learning [18].

Some innovative studies inspired by the linear and HF nature of PTLs have integrated signal processing techniques into deep learning pipelines. PLGAN and PLNet have utilized the Hough transform to ensure the global linear consistency. While PLGAN regulates the model’s outputs with a loss function defined in Hough space, PLNet uses the information obtained from Hough space as an attention mechanism [7,9]. On the other hand, WaveGNet and a recent Amplitude Stretch Transform–based approach use the Wavelet Transform and the Amplitude Stretch Transform, respectively, to emphasize HF details [4,32]. Such methods effectively suppress low-frequency background noise while highlighting edge details. These approaches suggest that integrating domain-specific information into deep learning models can improve robustness and accuracy compared to standard architectures.

In recent years, transformer architectures have become popular in computer vision because of their strong ability to model global dependencies. DUFormer is among the first studies to apply the potential of transformers to PTL detection [10]. The proposed model is a hybrid architecture that merges the strong local feature extraction capability of CNNs with the broad contextual understanding of transformers. By generating rich semantic tokens using a CNN-based module called the ”heavy token encoder”, the subsequent transformer block operates more efficiently [10]. Hybrid approaches offer potential for preserving local details while maintaining global consistency.

Although the studies summarized above have made significant progress in the field, there are still some gaps that need to be addressed. Hough transform-based methods may experience difficulties in modeling curved lines based on the assumption that lines are straight [9]. Approaches using wavelet transform have not yet fully leveraged the powerful global context modeling capability offered by transformer architectures [4]. Hybrid models combining CNN and transformer do not make use of the extensive edge information that the frequency domain offers to direct the attention mechanism of the transformer, which could possibly result in the loss of both efficiency and accuracy [10].

3 Proposed Method: Wavelet-Guided Transformer U-Net

The WGT-UNet is built around an encoder–decoder architecture. The CNN backbone extracts hierarchical multi-scale features, and the WGT decoder progressively recovers spatial details at full resolution to obtain precise segmentation masks. The general structure is shown in Fig. 2.

images

Figure 2: Overall architecture of the proposed WGT-UNet. A CNN backbone extracts four multi-scale feature maps P0–P3. Each Pi is first passed through a Projection Block (1×1 Conv, BN, ReLU) to obtain the aligned feature Fi, which is then refined by the corresponding wavelet-guided block (WGBi). The decoder consists of a bottleneck and a cascade of wavelet-gated up-sampling blocks (WGUi) that progressively recover spatial resolution while fusing the refined encoder features through skip pathways. A dual-head output produces the segmentation mask (SEG Head, yoC) and an auxiliary edge map (Edge Head, yeC).

A pre-trained CNN backbone receives an image of size H×W×3 as input and generates multi-scale feature maps at four different levels. Modern architectures were evaluated (ConvNeXt, Swin Transformer, and EfficientNet) to select one for this role.

To account for variations in channel dimensions across backbones, 1×1 convolutional projection layers map the four-stage feature maps to common channel dimensions of (96,192,384,768), ensuring a consistent input interface for the decoder regardless of the selected backbone.

The core component is a decoder that receives the projected multi-scale features and progressively upsamples them to produce the final segmentation mask. Skip connections from the encoder are leveraged, and wavelet-derived attention is used to modulate these skip pathways to emphasize PTL structures.

The decoder ends with two output heads: a primary segmentation head that predicts the mask and an auxiliary edge head that predicts an edge map. The dual-head design provides explicit boundary supervision, thereby improving contour sharpness. The auxiliary edge head is used only during training to enforce boundary-aware representations via deep supervision; it is removed at inference, leaving only the primary segmentation head. As a result, the edge branch introduces no additional FLOPs or latency during deployment on UAV hardware.

3.1 Multi-Scale Backbone Encoder

The encoder serves as the primary feature extractor, mapping an input image I∈RH×W×3 into a hierarchy of multi-scale feature representations. Three backbone candidates are compared to characterize trade-offs among convolutional inductive bias, attention-based global context modeling, and computational efficiency. The comparison informs the final encoder selection.

EfficientNet [33] is included as an efficient convolutional baseline. The network is composed of mobile inverted bottleneck blocks with depthwise separable convolutions, reducing parameters while preserving feature expressiveness. In addition, compound scaling jointly adjusts depth, width, and input resolution. EfficientNet provides a purely convolutional baseline under a lightweight computational budget for the target task.

An attention-based backbone is considered to capture long-range context for continuous PTLs. Swin Transformer [34] employs hierarchical window-based self-attention (W-MSA), where attention is computed within non-overlapping windows and propagated across regions through shifted windows in deeper layers. The resulting features combine local texture cues with broader structural context, with a manageable computational cost.

ConvNeXt [35] is also considered a modernized CNN backbone with transformer-inspired design choices. The architecture retains a purely convolutional formulation while adopting elements such as 7×7 depthwise convolution, inverted bottlenecks, and Layer Normalization (LN). Such design choices preserve key convolutional inductive biases while improving scalability and training stability. As a result, ConvNeXt provides an effective feature extractor for separating fine line structures from cluttered backgrounds.

Regardless of the chosen backbone, the encoder processes the input image I and outputs four multi-scale feature maps P={Pi}i=03, where Pi∈RCi×Hi×Wi. As i increases, the spatial resolution (Hi×Wi) decreases, while the channel dimension Ci depends on the selected backbone.

To provide a backbone-agnostic interface to the decoder, each Pi is passed through a lightweight projection block composed of a 1×1 convolution, Batch Normalization (BN), and Rectified Linear Unit (ReLU), yielding aligned features F={Fi}i=03 with Fi∈RNi×Hi×Wi. The projection preserves spatial dimensions and aligns channel dimensions to predefined values Ni, ensuring consistent decoder inputs across backbones.

3.2 Wavelet-Guided Decoder

In the main pipeline (Fig. 2), the decoder is instantiated as a sequence of WGBi and WGUi modules. The detailed WGT workflow within these modules is illustrated in Fig. 3. The decoder reconstructs the full-resolution segmentation map from the encoded hierarchical features. The design integrates wavelet decomposition with attention and transformer-based refinement.

images

Figure 3: Overview of a single wavelet-guided decoder stage at level i. The projected encoder feature Fi is decomposed by a 2D Haar DWT into (LLi,LHi,HLi,HHi), and the HF components are consolidated as HFi to derive attention signals in WAM. WAM outputs channel attention Ac,i and spatial attention As,i, which gate the low-frequency stream and guide the WGT block. The wavelet-gated up-sampling block (WGU) upsamples the decoder feature and fuses it with the corresponding projected encoder feature via a skip pathway, using As,i for residual-gain modulation and optionally injecting HFi via channel-wise concatenation. A dual-head design outputs the segmentation mask (SEG Head) and an auxiliary edge map (Edge Head) during training.

Notation

In this section, W⋅(⋅) denotes a learnable linear mapping; the subscript indicates its role. Parameter definitions are omitted for brevity.

At each decoder stage, the 2D Haar DWT is applied to the corresponding feature map Fi to obtain frequency components. The transform decomposes Fi∈RNi×Hi×Wi into four sub-bands of size Ni×Hi2×Wi2: a low-frequency approximation component (LLi) and three HF detail components capturing horizontal (LHi), vertical (HLi), and diagonal (HHi) structures, as illustrated in Fig. 3. Wavelet decomposition is particularly relevant for PTLs, which appear as localized HF responses against smoother background regions.

For each stage i, the resulting sub-bands (LLi,LHi,HLi,HHi) are used in complementary ways within the decoder. The following subsections describe how these wavelet-derived signals are integrated into WAM, WGT, and the up-sampling path.

3.2.1 Wavelet Attention Module

WAM derives complementary channel (Ac,i) and spatial (As,i) attention signals from the DWT high-frequency (HF) sub-bands to guide the decoder. Given the detail components (LHi,HLi,HHi) at stage i, the consolidated HF map HFi is defined in Eq. (1).

HFi=|LHi|+|HLi|+|HHi|(1)

Here, HFi∈RNi×Hi2×Wi2 highlights regions with strong edge and texture responses.

WAM produces two attention maps. The Ac,i aims to emphasize channels that are informative for power-line localization. It adaptively re-calibrates the weights of different feature channels by explicitly modeling their inter-dependencies. The mechanism is important because different channels in a deep feature map respond to various semantic attributes, and not all are equally relevant for the task. The implementation follows the principles of the Squeeze-and-Excitation (SE) block. The process begins with a Squeeze operation, where the global spatial information of the HF map is aggregated into a channel descriptor vector. The operation is achieved using Global Average Pooling (GAP), which reduces each channel map of HFi to a single scalar value. The resulting vector, zi∈RNi, summarizes the global distribution of HF responses for each channel. The vector is then processed by an Excitation operation—a two-layer Multi-Layer Perceptron (MLP) with a bottleneck structure—to capture channel-wise dependencies. A final sigmoid activation function, σ, normalizes the MLP’s output to produce the attention weights, as shown in Eq. (2).

zi=GAP(HFi),Ac,i=σ(Wca,2ReLU(Wca,1zi))(2)

where Wca,1 and Wca,2 are the weights of the MLP layers.

Complementary to Ac,i, the As,i highlights spatial locations that are most relevant for power-line segmentation. Instead of operating on the channel axis, it generates a 2D map that highlights the most salient spatial regions within the features. The mechanism is particularly important for the task, as power lines occupy a very sparse and specific set of pixel locations. The input to the module is the consolidated HF map, which already contains a strong signal at the locations of edges and fine details.

The procedure for generating the As,i is direct and efficient. A single 3×3 convolution is applied to the HF map. The convolutional layer acts as a spatial feature detector, learning to identify local patterns and spatial contexts that are characteristic of PTLs from the raw HF signals. The 3×3 kernel allows it to aggregate information from a local neighborhood, making the decision more robust to noise than a simple pixel-wise operation. A subsequent sigmoid activation function, σ, normalizes the output of the convolution into a probabilistic attention map, As,i∈R1×Hi2×Wi2, where each pixel value ranges between 0 and 1, as expressed in Eq. (3).

As,i=σ(Ws(HFi))(3)

The resulting map As,i provides a pixel-wise guidance signal, with high values indicating regions likely containing targets.

3.2.2 Wavelet-Guided Transformer Block

The transformer block is used to model global context and long-range dependencies. Unlike standard transformers, the WGT block is explicitly guided by the As,i. The input to the WGT block is first the low-frequency subband LLi gated by both attention maps, i.e., LLi′=LLi⊙Ac,i⊙As,i, where ⊙ denotes element-wise multiplication. Here, Ac,i is broadcast over spatial dimensions and As,i is broadcast over channels.

The full architecture of the WGT block follows a standard transformer block structure, augmented with a wavelet-guided attention mechanism.

First, the input feature map LLi′ is projected to the internal transformer dimension D using a 1×1 convolutional layer, denoted as Wp in Eq. (4). Here, Hc=Hi2 and Wc=Wi2 due to the DWT down-sampling.

LLproj=Wp(LLi′)whereLLproj∈RD×Hc×Wc(4)

To improve computational efficiency, the projected feature map LLproj and the As,i are divided into non-overlapping windows of size ws×ws, as shown in the diagram. Since As,i is single-channel, it is broadcast along the channel dimension when applied to LLproj. The resulting windowed input is denoted as Xw. Subsequently, the tokens within each window pass through two main sub-layers of a standard transformer block, as expressed in Eq. (5).

Xw′=W-MHA(LN(Xw))+XwXw′′=PW-FFN(LN(Xw′))+Xw′(5)

W-MHA represents Wavelet-Guided Multi-Head Attention, and PW-FFN represents the position-wise feed-forward network in Eq. (6). The output of each sublayer is added to its input via a residual connection.

The PW-FFN is a two-layer MLP applied independently to each token (each position), providing additional nonlinearity and channel mixing. The computation is given in Eq. (6).

PW-FFN(Z)=(GELU(ZWffn,1))Wffn,2(6)

Z=LN(Xw′) denotes the previous layer’s normalized output and GELU (Gaussian Error Linear Unit) is the activation function in Eq. (6).

The attention score connecting two tokens p and q is modulated by a learnable bias component bpq derived from the As,i. For stable training, the spatial attention values are normalized before computing the bias term, as in Eq. (7).

A^s,i(p)=As,i(p)−μAsσAs+ε(7)

Here, μAs and σAs are computed over the spatial positions within each non-overlapping window. The bias term and the final attention score are defined in Eq. (8).

Attention(Q,K)pq=softmax(QpKqTdk+bpq)wherebpq=σ(γ)⋅(A^s,i(p)+A^s,i(q))(8)

Here, Q and K denote the query and key matrices, respectively, and dk is the head dimension. As,i(p) denotes the spatial attention value at token position p. γ is a learnable scalar and is mapped to (0,1) via σ(γ). The additive bias encourages attention between token pairs located in high-saliency regions. Non-overlapping windows are used to keep the computation manageable.

Wavelet features are used in the WGT block in two complementary roles. First, the wavelet-derived channel and As,i are used for feature-level gating (the step before Eq. (7)). Then, the spatial attention signal is injected as an additive bias at the attention-logit level (Eq. (8)), which influences the self-attention weighting. Rather than using wavelet information only for feature scaling, this design aims to shape the attention distribution.

In approaches that rely only on multiplicative gating/masking, the signal may be attenuated when the gating map suppresses relevant regions (e.g., low-contrast or partially occluded thin line segments), which can reduce the continuity of linear structures such as PTLs. The additive bias in Eq. (8) provides a soft structural cue by shifting the scores before the softmax operation. Content-based similarity is retained, while the attention distribution is encouraged toward line/edge regions indicated by the wavelet response. Even when wavelet cues are weak in some areas, global context can still help maintain long-range PTL continuity, and the bias can improve stability with respect to line integrity.

The final output of the block, fe, is obtained by recombining the transformer-processed features using the remove-padding and inverse-windowing step.

In the decoder synthesis path (Fig. 3), each WGU block increases the spatial resolution by a factor of 2 and fuses the upsampled decoder stream with the corresponding projected encoder feature via a skip pathway. Specifically, the decoder feature fe is upsampled to match the spatial size of the skip feature Fi+1. The As,i is also resized to the target resolution.

The contribution of the skip feature Fi+1 is modulated using a residual gain controlled by As,i. The gated skip feature Fi+1′ is computed as in Eq. (9).

Fi+1′=Fi+1⊙(1+σ(δ)⋅Upsample(As,i))(9)

The term δ represents a learnable parameter. The sigmoid function σ(⋅) is applied to this parameter, constraining the resulting value σ(δ) to the range (0,1). The structure in Eq. (9) ensures that the attention map does not entirely suppress the original features. Instead, it enhances the important regions by applying a factor greater than one, while diminishing the influence of less significant areas.

To support the reconstruction of fine details (e.g., edges and textures), the HF wavelet component HFi is incorporated via channel-wise concatenation. Specifically, three feature maps are concatenated along the channel dimension: the upsampled decoder feature (Upsample(fe)), the gated skip feature (Fi+1′), and the processed HF feature (Upsample(HFi)). The block output fi is then obtained by applying two consecutive Conv-BN-ReLU layers to the concatenated tensor.

The UpModule denotes a sequence of operations that upsamples an input tensor x by a factor of r, as defined in Eq. (10).

y=ReLU(BN(𝒫𝒮r(Wup(x))))(10)

Here y is the output, and 𝒫𝒮r denotes the pixel-shuffle operation with scale factor r, which redistributes channel information into the spatial dimensions.

3.3 Hybrid Loss Function

Two different head layers have been implemented in the final stage of the decoder for the generation of the final segmentation and edge maps. The outputs produced by the two heads have the same spatial resolution as the input image and share the same architectural template (Conv–BN–ReLU–Conv), but are generated by separate parameter sets.

The segmentation mask M is obtained from the shared feature map fi as in Eq. (11).

M=σ(Wseg,2(ReLU(BN(Wseg,1(fi)))))(11)

The edge map Medge is obtained with Eq. (12) just the same way as the previous one from the same input feature map fi, but through a parallel set of layers with distinct learnable parameters.

Medge=σ(Wedge,2(ReLU(BN(Wedge,1(fi)))))(12)

For efficient supervision during training, a composite total loss is defined, Ltotal, consisting of five weighted components. Each loss term targets a different aspect of power-line segmentation, including class imbalance, regional consistency, boundary sharpness, and structural integrity. The total loss is formulated in Eq. (13).

Ltotal=wsegLseg+wedgeLedge+wwvlLwvl+wllLll+wtvLtv(13)

where wseg,wedge,wwvl,wll,wtv are scalar hyper-parameters controlling the contribution of each loss term.

The main segmentation loss combines two widely used terms to jointly encourage pixel-wise accuracy and region overlap. It consists of Binary Cross-Entropy (BCE) loss and Dice loss. The BCE term penalizes per-pixel classification errors, while the Dice term measures region-level overlap between the predicted and ground-truth masks and is generally more robust to class imbalance. Let M denote the predicted mask after sigmoid activation and Y the ground-truth mask. The resulting loss is given in Eq. (14).

Lseg=LBCE(M,Y)+LDice(M,Y)(14)

where LDice is defined with Eq. (15),

LDice(M,Y)=1−2∑i=1NMiYi∑i=1NMi+∑i=1NYi+ε(15)

with N being the total number of pixels and ε a small constant for numerical stability.

To enforce the generation of sharp and precise boundaries, an auxiliary edge head is supervised with an explicit edge loss. The ground truth edge map, Yedge, is generated on-the-fly by applying a Sobel operator to the ground truth mask Y. The edge loss is a BCE loss between the predicted edge map Medge and the ground truth Yedge calculated with Eq. (16). Yedge is obtained by applying a Sobel filter to Y and normalizing (and optionally thresholding) the gradient magnitude to form an edge target.

Ledge=LBCE(Medge,Yedge)(16)

To ensure sharpness at the signal level, the loss operates in the frequency domain. It penalizes discrepancies between the HF details of the prediction and the ground truth. A Haar DWT is applied to both the predicted mask M and the ground truth Y to obtain their respective HF sub-bands {LH, HL, HH}.

The loss is defined as the L1 distance between the absolute values of the components, as shown in Eq. (17).

Lwvl=E[||LHM|−|LHY||]+E[||HLM|−|HLY||]+E[||HHM|−|HHY||](17)

Lwvl loss encourages the model to accurately reconstruct the fine, directional details characteristic of power lines where E[⋅] denotes the mean over spatial locations (and batch samples).

In addition to the HF term, Lll encourages consistency in the low-frequency content, helping preserve global structure and reducing potential over-sharpening artifacts. SSIM is applied to the low-frequency (LL) sub-bands from the DWT of the prediction (LLM) and the ground truth (LLY).

Lll=1−SSIM(LLM,LLY)(18)

Finally, a TV loss is used as a regularization term to favor spatial smoothness in the predicted mask. By doing so, it punishes large differences between neighboring pixels in the output which aids in getting rid of noise and reducing spurious isolated predictions. It is described with Eq. (19) as the total of the absolute differences over the neighbor pixels in the horizontal and vertical lines.

Ltv=∑i,j(|Mi+1,j−Mi,j|+|Mi,j+1−Mi,j|)(19)

Having detailed the architectural innovations of WGT-UNet, including its multi-scale encoder design, the wavelet-guided decoder, and the comprehensive hybrid loss function, the subsequent section is dedicated to the empirical validation of the proposed method. The following experiments quantify the performance of WGT-UNet, analyze the contribution of its key components, and benchmark it against existing models on the challenging TTPLA dataset.

4 Experiments and Results

The section outlines the experimental setup used to evaluate the WGT-UNet model. We describe the dataset, evaluation metrics, implementation details, and present a comprehensive comparison of the model’s performance against several methods.

4.1 Dataset and Evaluation Metrics

All experiments are conducted on two public benchmarks: the TTPLA dataset [36] and the VLIR (Ground Truth of Powerline) dataset [37]. Both are challenging and widely used benchmarks for power line segmentation. TTPLA dataset comprises 1260 high-resolution aerial images captured by UAVs under diverse and complex conditions, including varying backgrounds (mountains, forests, urban areas), different lighting, and weather. The fine, sparse nature of the power lines in the images makes it an ideal testbed for evaluating the robustness and precision of segmentation models. For the TTPLA experiments, the standard split was followed, using 1008 images for training and 252 images for validation and testing. To further assess the model’s robustness and generalization capability on unseen data domains, the VLIR dataset was additionally employed as an external test set. This dataset consists of 200 aerial images and was used exclusively for testing purposes without any training involvement.

To provide a comprehensive assessment of the model’s performance, a set of standard and task-specific metrics is employed:

• Intersection over Union (IoU): IoU, also referred to as the Jaccard index, serves as the primary metric for assessing semantic segmentation performance. It measures the ratio between the intersection and the union of the predicted and ground truth masks. As a stringent metric that penalizes both false positives and false negatives, it offers a reliable indication of the model’s capability to accurately delineate power lines. For a binary segmentation task, it is computed using Eq. (20) for the foreground class (power lines), and is defined in terms of True Positives (TP), False Positives (FP), and False Negatives (FN).

IoU=TPTP+FP+FN(20)

where TP represents the pixels correctly identified as power lines, FP represents background pixels incorrectly classified as power lines, and FN represents power line pixels that were missed by the model.

• F1-Score: The F1-Score is the harmonic mean of Precision and Recall, and it is a crucial metric for tasks with significant class imbalance, such as power line segmentation where line pixels are a small fraction of the total image area. It provides a more balanced assessment than accuracy alone, as it requires both high precision (the model’s predictions are correct) and high recall (the model finds most of the actual lines).

First, Precision and Recall are defined with Eqs. (21) and (22).

Precision=TPTP+FP(21)

Recall=TPTP+FN(22)

The F1-Score is then calculated with Eq. (23):

F1-Score=2⋅Precision⋅RecallPrecision+Recall(23)

• Parameters: The total number of trainable parameters in the model (in millions, M), indicating its size and complexity.

• FLOPs: Floating Point Operations per second (in Giga-FLOPs, G), measuring the computational cost of a single forward pass.

4.2 Implementation Details

All models were implemented using the PyTorch framework and trained on a single NVIDIA RTX 3090 GPU. Input images were resized to 512×512 pixels. The AdamW optimizer was employed with a base learning rate of 3×10−4 and a weight decay of 0.05. A cosine annealing learning rate scheduler with a linear warm-up of 5 epochs was used to stabilize training. All models were trained for 80 epochs with a batch size of 8. For the hybrid loss function, the weights were empirically set to wseg=1.0, wedge=0.3, wwvl=0.4, wll=0.2, and wtv=0.05 to balance the contributions of each component.

For the DWT calculation, it is implemented as a standard CNN operation. In the implementation, a separable 2D Haar filter bank—corresponding to the LL (low-low), LH (low-high), HL (high-low), and HH (high-high) subbands—is formulated as a single convolutional layer configured with stride = 2. The single Conv2d operation executes both the filtering and 2D subsampling (decimation) steps of the transformation. Setting the number of groups (groups = C) to match the number of input channels changes the operation to a depthwise convolution. Thus, each input channel is processed independently of the others and decomposed into four subband outputs. Fig. 4 shows the DWT analysis of the features in the encoder layers. At the lowest encoder level (L0), all subbands (LL, LH, HL, HH) are saturated with HF texture and noise information from the input image. As the hierarchy deepens, reaching the highest encoder level L3, it is observed that the model effectively filters irrelevant textural information and extracts semantic content. At the deepest level (L3), the low-frequency LL subband successfully isolates a concentrated, low-resolution representation of the salient object, while the detail subbands (LH, HL) are largely suppressed, preserving only the object’s fundamental structural orientations. In preliminary experiments, we compared Haar with Daubechies-2 and Symlet-4 wavelets. Haar provided a similar or slightly better IoU (+0.3–0.5%) at a lower computational cost due to its shorter filters, so we adopted Haar as the default choice in all reported experiments.

images

Figure 4: Visual analysis of 2D Haar DWT subbands at different network levels.

4.3 Performance Comparison

To validate the effectiveness of WGT-UNet, a comprehensive comparison was conducted against several SOTA segmentation models. All methods were re-executed using identical configuration and dataset settings, employing the original code repositories provided in their respective studies. The evaluation includes both quantitative and qualitative analyses, which is common practice in segmentation studies.

4.3.1 Quantitative Evaluation

The detailed quantitative results are summarized in Table 1. The results indicate that WGT-UNet with the ConvNeXt backbone achieves competitive performance on the TTPLA dataset among the evaluated methods. Specifically, proposed model reaches an IoU of 68.38% and an F1-Score of 79.33%, exceeding the next best competitor, U-Net++, by more than 5 percentage points in IoU.

Table 2 presents the evaluation results on the external VLIR dataset, which was used exclusively for testing to assess generalization to an unseen domain. Despite the domain shift and challenging conditions in VLIR (e.g., blur and partially invisible lines), the proposed WGT-UNet (ConvNeXt) maintains robust performance, achieving an IoU of 58.75% and an F1-Score of 71.83%. The model exceeds the closest competitors, PLNet and U-Net++, by approximately 5% in IoU. Notably, WGT-UNet attains a high Recall of 97.62%, suggesting improved sensitivity to faint and disconnected power lines that may be missed by other models. Overall, these results support the effectiveness of the wavelet-guided attention mechanism in capturing structural cues that transfer across domains.

images

The observed accuracy can be attributed to the combination of the wavelet-guided attention mechanism and the hybrid loss function. By leveraging the wavelet transform, the model can better emphasize HF details associated with power lines, while the transformer block helps capture their global and continuous structure.

A detailed comparison of inference latency, Frames Per Second (FPS), and memory usage is reported for both GPU and CPU settings in Table 3. For UAV applications that target real-time processing, a frame rate of around 30 FPS is often considered a practical minimum. WGT-UNet (ConvNeXt) achieves 74.51 FPS with a latency of 13.42 ms on the GPU setup described in the implementation details, suggesting its suitability for real-time processing in practice. While models such as FPN and PSPNet can achieve faster inference due to their lightweight designs, their segmentation accuracy is notably lower in the evaluation (Table 1). Overall, the proposed model offers a favorable accuracy efficiency trade-off by maintaining high inference speed while achieving competitive segmentation accuracy on the TTPLA benchmark.

images

4.3.2 Qualitative Evaluation

This section provides a qualitative evaluation to visually demonstrate the segmentation quality and behavior of proposed WGT-UNet model in challenging scenarios. The analysis aims to highlight the model’s ability to suppress complex background noise and maintain line continuity. Fig. 5 presents a visual comparison of the segmentation results obtained by WGT-UNet variants against several classic and SOTA models ((c) PSPnet, (d) FPN, (e) DeepLabV3, (f) BASNet, (g) PLNET, (h) U-Net, (i) U-Net++) on challenging samples from the TTPLA and VLIR dataset. The input images (column a) effectively illustrate the core challenges of the task: complex and visually noisy backgrounds, often containing linear features (e.g., road edges, shadows), and the fine, low-contrast nature of the power lines themselves.

images

Figure 5: Qualitative results on the TTPLA dataset: (a) Input Image, (b) Ground Truth, (c) PSPnet, (d) FPN (e) DeepLabV3, (f) BASNet, (g) PLNET, (h) U-Net, (i) U-Net++, (j) WGT-Unet-EfficientNet, (k) WGT-Unet-Swin-T, (l) WGT-Unet-ConvNeXt Prediction.

When considering the input images in column (a) and the ground truth masks in column (b) of Fig. 5, it is observed that the compared models (c–i) exhibit various failure modes. Segmentation models such as PSPNet (c), FPN (d), and DeepLabV3 (e) experience pronounced line fragmentation and false-positive detections, especially in noisy backgrounds such as the vegetation in the 5th column and the complex urban texture in the 7th column. Although advanced encoder–decoder architectures such as U-Net (h) and U-Net++ (i) offer better continuity, they still misclassify deceptive linear features that are structurally similar to power lines, such as road edges in the 2nd column and roof edges in the 7th column. The WGT-UNet models (j–l), particularly the one with the ConvNeXt backbone (l), show consistently improved predictions across the entire set of challenging scenarios, in line with the quantitative comparison. These qualitative gains are attributable to the core architectural components of the proposed model.

To validate the generalization capability of the proposed model on unseen data, we further evaluated it on the VLIR dataset (Fig. 6). This dataset introduces environmental challenges that are not fully represented in the training set. Specifically, the images in columns 1, 2, and 3 suffer from significant blur, which degrades edge-related cues. While baseline models struggle to delineate boundaries under these low-sharpness conditions, WGT-UNet preserves the structural integrity of the lines more consistently. In columns 6 and 7, where the power lines are extremely faint and barely visible, the proposed model recovers line topology that competing models miss. Furthermore, the grassy textures in columns 4 and 5 create repetitive linear noise patterns that mimic power lines. Unlike standard models that tend to produce false positives under this vegetative clutter, the frequency-selective nature of proposed method distinguishes the target PTLs from repetitive background patterns.

images

Figure 6: Qualitative results on the VLIR dataset: (a) Input Image, (b) Ground Truth, (c) PSPnet, (d) FPN, (e) DeepLabV3, (f) BASNet, (g) U-Net, (h) U-Net++, (i) WGT-Unet-EfficientNet, (j) WGT-Unet-Swin-T, (k) WGT-Unet-ConvNeXt Prediction.

The proposed attention-based decoder is an important component of the model’s overall performance. Spatial attention is generated by processing the HF components obtained via 2D-DWT. The resulting map is injected as a bias into the self-attention mechanism of the transformer block. This design guides the transformer’s context modeling capability toward HF cues, reducing attention to irrelevant background structures such as road signs in the second column or roof edges in the seventh column. Through multi-head self-attention, the WGT block can capture global context and long-range structural dependencies. As a result, it produces more continuous line segmentation in low-contrast images (e.g., the 3rd and 4th columns) and in partially occluded cases (e.g., the 5th column with tree branches), where competing models tend to fragment the lines. In addition, the five-component hybrid loss function contributes to the sharp boundaries observed in row (l) of Fig. 5 and their high structural similarity to the ground truth in column (b). Auxiliary edge supervision with Ledge, frequency-domain constraints with Lwvl and Lll, and spatial smoothness regularization with Ltv encourage the model to produce not only pixel-level accuracy but also topologically consistent predictions with sharp boundaries.

In contrast, WGT-UNet variants (columns j–l), particularly the ConvNeXt-T backbone model (l), show improved performance in these cases. With the proposed WAM, the model suppresses background clutter and maintains the structural continuity of the lines, producing masks that closely match the Ground Truth (b). The ability to extract line structures with both precision and continuity supports the effectiveness of the proposed hybrid architecture and the five-component loss function used for supervision.

Fig. 7 qualitatively analyzes the incremental refinement process of the model’s multi-stage attention mechanism. The rows show the hierarchical progression from the model’s deepest and coarsest level (S0) to the final refined prediction level (S3). The columns visualize different aspects of the process: Column (a) shows the raw attention maps generated at each stage. It can be observed how the regions of interest to the model evolve from a scattered and noisy structure in S0 to a singular focus concentrated on the salient object in S3. Column (b) presents the attention maps overlaid on the original input image to interpret their spatial locations. Column (c) is a visualization of the attention-weighted feature representations at the relevant stage, reduced to a 3-channel image using principal component analysis. It shows the structure of the processed information extracted from the regions by the model and how it becomes more semantic and structural from S0 to S3.

images

Figure 7: The model’s attention gradually evolves from a coarse level (S0) to a refined level (S3). (a) Attention Map, (b) Overlay, (c) Feature Representation.

The visual representation of the qualitative results in Fig. 5 illustrates that the WGT-UNet model (j–l) outperforms the others (c–i) under difficult conditions. The architectural innovations of the model appear to contribute substantially to this improvement. Despite the overall high performance, the model exhibits limitations in certain specific and challenging scenarios, as illustrated in Fig. 8. Such failures are particularly observed in images captured from a direct top-down perspective, where the PTLs blend with complex ground textures (e.g., dense vegetation or urban structures). In such cases, the distinguishing features of the lines are weakened by the ground complexity, which can confuse the model. The issue likely stems from the insufficient number of such rare and challenging top-down examples in the TTPLA dataset. The model may not have generalized adequately for these specific orientations and ground conditions, highlighting an area for improvement in future data collection and augmentation strategies.

images

Figure 8: Failure cases of the WGT-Unet-ConvNeXt model in some challenging scenarios. (a) Original Image, (b) Ground Truth, (c) Model Prediction.

Although the proposed WGT architecture provides high accuracy, some characteristic failure modes are encountered in challenging field conditions (Fig. 7). To systematically evaluate the limitations, the observed errors are analyzed in terms of viewpoint imbalance and texture bias, low contrast/low signal-to-noise ratio (SNR), partial occlusion, curvilinear topology, and perspective distortion.

In overhead or near-vertical scenes, PTLs are visually integrated with complex ground textures such as dense vegetation, road lines, and building edges. Due to the texture bias in the visual models, thin and long edges in the background can be confused with PTLs, increasing both false negatives and false positives. The under-representation of extreme viewpoints in the long-tailed distribution in the TTPLA dataset makes it difficult to generalize the model to these variations.

Discontinuities in PTL estimates are observed in fog, backlight/glare, or low-contrast conditions. Although DWT-based features enhance edges, when SNR is significantly reduced, HF components are mixed with noise, and discrimination is reduced. Partial obscuration by vegetation or structural obstructions forces mask continuity, resulting in fragmented outputs.

Perspective effects and the curvature of the PTL geometry (e.g., deflection curves) in the image plane create a tendency for over-smoothing or refraction of the estimates. The fact that common pixel-based loss functions do not directly target topological integrity and centerline continuity causes the model to tend towards piecewise or flat solutions instead of following the curvature.

Fig. 9 illustrates a detailed performance comparison of the models on a specific zoomed-in area. Both images used as examples show how power transmission lines are perceived at different levels of complexity in natural and urban environments. In the first example, PTLs with weak contrast and a broken appearance are located among dense tree branches. In the second example, road lines and intersecting lines on the asphalt create structurally similar but semantically different edges for the model. In both cases, classical CNN-based methods (DeepLabV3, PSPNet, BASNet, etc.) exhibited noticeable erratic behavior, mixing the PTL with the background or mistakenly labeling road lines as PTL.

images

Figure 9: Comparative analysis of model performance on zoomed-in regions.

WGT-UNet (Swin) and WGT-UNet (Efficient-B0) offered wider context but had difficulty with the continuity and especially with the thin or intersecting lines. Meanwhile, the WGT-UNet (ConvNeXt) model showed the best performance in both cases, consistently capturing the PTL structure more accurately, even in the complex backgrounds (F1-score = 0.705/0.863, IoU = 0.544/0.768).

4.3.3 High-Frequency Detail Sensitivity Evaluation

The visualization presented in Fig. 10 demonstrates the robustness of the proposed WGT module against HF noise. Located in column 3, it captures not only the wavelet appearance distribution and power distribution, but also hidden HF details such as building sections, road lines, and vegetation. Therefore, the model’s false positive predictions may increase.

images

Figure 10: Visual performance analysis of the proposed model including spatial attention and HF features. Sample visualizations: Input, Ground Truth, HF Features, Spatial Attention, and Prediction.

However, when examining the learned spatial attention in the fourth column, it is evident that the model successfully suppresses these complex background noises. The attention mechanism has specifically concentrated the modulation weights (red/hot regions) on PTLs. Other HF features, such as building and road edges, have been filtered out by being given low weights (blue/cold regions) in the attention map. This result indicates that the proposed gamma-controlled sigmoid mechanism enables the model to focus primarily on relevant fine structural details and reduces its sensitivity to environmental noise.

4.4 Ablation Study

In order to confirm the efficiency of the proposed WGT-UNet and to measure the contribution of its essential components, a thorough ablation study was performed. The study methodically evaluates the influence of the two major contributions: the architectural design of WGT block and the role of each component within the multi-component hybrid loss function. The results are summarized in Table 4.

images

4.4.1 Impact of the Wavelet-Guided Architecture

First, an analysis was performed regarding the contribution of the new architectural elements in the proposed decoder. and stepwise generated the model, starting with a basic one to the complete model structure, in order to distinguish the advantages of every segment.

Model A (Baseline): The model serves as the baseline, representing a standard U-Net architecture. It uses the ConvNeXt-T backbone as an encoder, while the decoder is constructed with standard convolutional and up-sampling layers, without any transformer or wavelet guidance mechanisms. The configuration establishes the performance foundation upon which the proposed innovations are built.

Model B (Baseline + Standard Transformer): To the decoder of Model A, a standard, non-guided transformer block (e.g., a vanilla Swin Transformer block) was added. The purpose of the model is to measure the performance gain attributable solely to the global context modeling capability of a transformer.

Model C (Baseline + WGT Block): The standard transformer was replaced with the proposed WGT block. The effectiveness of the attention map produced by the WAM in guiding the self-attention mechanism of the transformer was directly evaluated under the given configuration. During the corresponding training stage, only the main segmentation loss was applied to all models.

The results in Table 4 demonstrate the architectural advantages. The baseline (Model A) achieves a respectable IoU of 64.31%. Incorporating a standard transformer (Model B) improves the IoU to 65.89%, confirming the value of long-range dependency modeling. A notable improvement is observed with Model C, where the proposed WGT block performance raises up to 66.15% in IoU. The improvement of more than 1.2 percentage points is consistent with our main assumption. Guiding the transformer’s focus with HF data from the wavelet domain appears more effective than a standard transformer for segmenting fine structures.

4.4.2 Impact of Transformer Block

A controlled ablation study was conducted to assess the contribution of the WGT block to overall segmentation performance. The WGT block, which employs window-based self-attention to capture long-range dependencies guided by wavelet-derived cues, was removed and replaced with an Atrous Spatial Pyramid Pooling (ASPP) based context module. Unlike a standard CNN block with a limited local receptive field, the ASPP baseline leverages multi-scale atrous convolutions to provide a wider effective receptive field and better capture elongated line structures. All other components of the architecture (Haar Wavelet Encoder, HF Attention Module, and the remaining decoder structure) were kept identical.

f0(x)=Conv1×1(x),fk(x)=Conv3×3(rk)(x),k∈{1,2,3},fg(x)=Up(Conv1×1(GAP(x)))(24)

Specifically, the WGT block was replaced with a residual ASPP block that matches the input and output dimensions. The ASPP block aggregates multi-scale contextual information through parallel atrous convolutions with different dilation rates, together with a global pooling branch, as defined in Eq. (24). The resulting multi-branch features are concatenated and projected back to the original channel dimension. A residual connection is preserved to ensure a fair comparison with the residual nature of the original transformer block. The overall structure of the residual ASPP ablation block is given in Eq. (25). Here, the fe in Eq. (25) corresponds to the same block output feature, fe, defined in the main architecture description.

fe=x+ϕ(Conv1×1(Concat[f0(x),f1(x),f2(x),f3(x),fg(x)]))(25)

We use dilation rates rk∈{6,12,18} in the atrous branches. This baseline model was trained under the exact same conditions as the proposed method (batch size, learning rate, optimizer, and 512×512 resolution). The performance of the ASPP-based baseline was compared with the proposed transformer-based architecture. The aim was to isolate and quantitatively measure the specific advantage provided by the self-attention mechanism in modeling the topological continuity of power lines.

As seen in Fig. 11a and 11b, the proposed model consistently performs better than the ASPP-based baseline. When the WGT block is used, the F1-score increases by about 10%. The transformer-based variant also converges faster and reaches a lower final validation loss, indicating improved generalization and reduced segmentation error. Moreover, the higher F1-score scores obtained with the proposed method are important for preserving the structural integrity of long and thin objects such as PTLs, where capturing global context via self-attention is challenging to achieve with convolution-only baselines, even with multi-scale atrous receptive fields.

images

Figure 11: Validation loss and validation F1-score curves over training epochs.

4.4.3 Impact of the Hybrid Loss Function Components

After demonstrating the benefits of the proposed WGT architecture, the incremental benefit of each component in the hybrid loss function was evaluated. The experiments started with Model C (the best architecture) and progressively added the loss terms.

Model C (Lseg): The architectural model trained only with the primary segmentation loss (Lseg=LBCE+LDice).

Model D (+Ledge): An auxiliary edge supervision loss (Ledge) is added to encourage sharper and more precise boundaries.

Model E (+Lwvl+Lll): Frequency-domain losses are then incorporated: the HF wavelet loss (Lwvl) to promote accurate reconstruction of fine details, and the low-frequency structural similarity loss (Lll) to preserve overall structural integrity.

Model F (+Ltv): Finally, the Total Variation loss (Ltv) was incorporated as a regularization term to promote spatial smoothness and reduce noise in the final output mask. The configuration represents the complete WGT-UNet model.

All components of the loss function contribute measurably to performance improvement. The inclusion of edge loss in Model D results in an IoU increase to 66.94%, indicating that explicit boundary supervision is effective. Adding the frequency-domain losses in Model E further raises the IoU to 67.82%, revealing a strong synergy between the wavelet-guided architecture and the wavelet-domain loss functions.

Leave-One-Out Analysis: To address concerns regarding potential redundancy, we performed a subtractive analysis starting from the Model F.

• Removing Ltv caused a slight drop to 68.18%, indicating its role in suppressing background noise.

• Removing Lll resulted in a deeper drop to 67.45%, confirming its necessity for maintaining the topological fidelity (sharpness) of the line profiles.

• Removing Lwvl and Ledge led to further degradation, validating that without these components, the model fails to capture HF details.

This dual-validation confirms that all components contribute uniquely to the final performance.

Finally, the inclusion of the TV loss as a regularizer provides a final boost to 68.38 IoU, suggesting that all five components work in concert to achieve the reported performance level, which is competitive with state-of-the-art methods on the TTPLA dataset. These analyses indicate that the strong performance of WGT-UNet is closely related to its carefully designed architecture and supervision strategy.

4.4.4 Sensitivity Analysis of the Learnable Parameters: γ and δ

A controlled ablation study was performed to quantitatively evaluate the contribution of the parameters γ and δ. γ is a learnable scale that adjusts the magnitude of the wavelet-derived bias term bpq in the self-attention logits (Eq. (8)), and δ is a learnable scale that modulates the residual gain of the decoder skip connections (Eq. (9)). In the implementation, the initial values are set to γ=0 and δ=0.5. Due to sigmoid scaling, setting γ=0 does not disable the corresponding component (since σ(0)=0.5). Therefore, disabling is implemented by directly removing the contribution of the relevant term: (i) for γ, the wavelet bias is fixed as bpq←0; (ii) for δ, the skip-gating multiplier in Eq. (9) is fixed to 1. Sigmoid scaling keeps σ(γ) and σ(δ) within (0,1), reducing the risk of training instability caused by saturation.

Table 5 shows how the performance changes under individual and combined disabling scenarios. The results suggest that both γ and δ contribute to the overall performance, with the best scores obtained when they are enabled together. Disabling δ causes a larger drop than disabling γ, indicating that the skip-gating modulation is more influential for preserving the continuity of thin linear structures. In addition, enabling the logit-level wavelet bias (γ) provides a consistent improvement, supporting the role of wavelet-derived structural cues in shaping the attention distribution.

images

4.4.5 Impact of Low Image Resolution

The resolution of 512×512 was used for training the model while testing it on various sizes to find out how sensitive the model is to mismatch in resolution. The results are summarized in Table 6. The highest performance was recorded when the test input resolution was the same as the training size, which was 68.38% IoU, and lower input resolutions resulted in performance decrease. On the other hand, the 256×256 inputs being scaled to the training size of 512×512 before inference led to an increase in the IoU value from 41.07% to 53.33%. This absolute gain of 12.26% indicates that the model supports the 512×512 spatial domain and that careful scaling can partially mitigate the drawbacks associated with low-resolution inputs. Even if the image received from the UAV is only 256×256 pixels in size, scaling it up to 512×512 pixels and sending it to the model will improve performance.

images

5 Conclusion and Future Work

In this work, we present a hybrid deep learning architecture, WGT-UNet, for PTL segmentation in aerial images acquired by UAVs. The proposed model combines the feature extraction capability of CNN-based multiscale backbones with a wavelet-guided transformer decoder. The key innovation, the Wavelet Attention Module, explicitly guides the self-attention mechanism by utilizing HF details from the wavelet domain, enhancing focus on fine, linear structures that are often obscured by complex backgrounds. This architectural design is further supported by a five-component hybrid loss function that jointly addresses class imbalance, edge sharpness, and structural coherence.

Experimental results on the TTPLA and external VLIR datasets indicate that WGT-UNet achieves strong performance, particularly with the ConvNeXt backbone, reaching an IoU of 68.38% and an F1-Score of 79.33%. The evaluation across different backbones also suggests that the framework is adaptable. The EfficientNet variant provides a competitive and lightweight alternative for resource-constrained scenarios. However, there are also limitations to this work. The model exhibits performance degradation in certain top-down perspectives where complex ground textures (e.g., road signs) structurally mimic power lines, and the current supervised approach relies on large-scale labeled datasets that are labor-intensive to acquire.

Future research will focus on overcoming the limitations of the existing work and broadening the scope of the proposed framework. To enhance the depiction of the scenes with different viewpoints and difficult textures, we will apply specific data augmentation techniques, which will involve stronger perspective distortions and background complexity as well. Furthermore, exploiting multi-view temporal fusion from consecutive UAV frames and integrating geometric regularizers into the loss function will be explored to strengthen signal coherence and topological stability in curvilinear segments. Beyond these algorithmic enhancements, priority will be given to the optimization of lightweight variants for real-time processing in embedded UAV systems using hardware acceleration libraries such as NVIDIA TensorRT and quantization techniques. In addition, the integration of self-supervised learning techniques will be investigated in order to minimize the dependency on large-scale labeled data sets. Finally, the universal nature of the proposed frequency-based attention mechanism suggests that it has strong application potential in similar linear segmentation problems, such as crack detection in civil structures or vessel segmentation in medical imaging.

Acknowledgement: Not applicable.

Funding Statement: The authors received no specific funding for this study.

Author Contributions: The authors confirm their contributions to the paper as follows: Ahmet Küçüker was responsible for investigation, visualization, and writing (original draft, review and editing). Burhan Baraklı contributed to conceptualization, methodology, software, validation, and writing (original draft). All authors reviewed and approved the final version of the manuscript.

Availability of Data and Materials: The source code and the masks generated in this study will be publicly available at https://github.com/burhanbarakli/WGT-UNET. All other datasets used in the experiments are publicly available and have been referenced in the experimental section of the manuscript.

Ethics Approval: Not applicable.

Conflicts of Interest: The authors declare no conflicts of interest.

References

1. Faisal MAA, Mecheter I, Qiblawey Y, Fernandez JH, Chowdhury MEH, Kiranyaz S. Deep learning in automated power line inspection: a review. Appl Energy. 2025;385:125507. doi:10.1016/j.apenergy.2025.125507. [Google Scholar] [CrossRef]

2. Yang L, Fan J, Liu Y, Li E, Peng J, Liang Z. A review on state-of-the-art power line inspection techniques. IEEE Trans Instrum Meas. 2020;69(12):9350–65. doi:10.1109/tim.2020.3031194. [Google Scholar] [CrossRef]

3. Zou K, Jiang Z. Power line extraction framework based on edge structure and scene constraints. Remote Sens. 2022;14(18):4575. doi:10.3390/rs14184575. [Google Scholar] [CrossRef]

4. An D, Chao J, Li T, Fang L. WaveGNet: wavelet-guided deep learning for efficient aerial power line detection. IEEE Sens J. 2024;24(14):22037–44. doi:10.1109/jsen.2023.3330331. [Google Scholar] [CrossRef]

5. Afanaseva OV, Tulyakov TF. Comparative analysis of image segmentation methods in power line monitoring systems. Int J Eng Trans A Basics. 2026;39(1):1–11. doi:10.5829/ije.2026.39.01a.01. [Google Scholar] [CrossRef]

6. Nguyen VN, Jenssen R, Roverso D. Automatic autonomous vision-based power line inspection: a review of current status and the potential role of deep learning. Int J Elect Power Energy Syst. 2018;99:107–20. [Google Scholar]

7. Abdelfattah R, Wang X, Wang S. PLGAN: generative adversarial networks for power-line segmentation in aerial images. IEEE Trans Image Process. 2023;32:6248–59. doi:10.1109/TIP.2023.3321465. [Google Scholar] [PubMed] [CrossRef]

8. Chen G, Hao K, Wang B, Li Z, Zhao X. A power line segmentation model in aerial images based on an efficient multibranch concatenation network. Expert Syst Appl. 2023;228:120359. doi:10.1016/j.eswa.2023.120359. [Google Scholar] [CrossRef]

9. Wei N, Jin Y, Dan Z, Liu N, Peng F, Sun S. Power line segmentation by multilevel attention from hough domain. IEEE Trans Instrum Meas. 2025;74:5005711. doi:10.1109/tim.2024.3522395. [Google Scholar] [CrossRef]

10. An D, Zhang Q, Chao J, Li T, Qiao F, Bian Z, et al. DUFormer: solving power line detection task in aerial images using semantic segmentation. In: Pattern recognition and computer vision (PRCV 2023). Singapore: Springer; 2023. p. 54–66. doi:10.1007/978-981-99-8543-2_5. [Google Scholar] [CrossRef]

11. Zhao Q, Ji T, Liang S, Yu W, Yan C. Real-time power line segmentation detection based on multi-attention with strong semantic feature extractor. J Real Time Image Process. 2023;20:117. doi:10.1007/s11554-023-01367-8. [Google Scholar] [CrossRef]

12. Zhu W, Ding H, Han G, Wang W, Li M, Qin L. Power line segmentation algorithm based on lightweight network and residue-like cross-layer feature fusion. Sensors. 2025;25(11):3551. doi:10.3390/s25113551. [Google Scholar] [PubMed] [CrossRef]

13. Son HS, Kim DK, Yang SH, Choi YK. Recognition of the shape and location of multiple power lines based on deep learning with post-processing. IEEE Access. 2023;11:57895–904. doi:10.1109/access.2023.3283613. [Google Scholar] [CrossRef]

14. Yang L, Fan J, Huo B, Li E, Liu Y. PLE-Net: automatic power line extraction method using deep learning from aerial images. Expert Syst Appl. 2022;198:116771. [Google Scholar]

15. Yang L, Fan J, Xu S, Li E, Liu Y. Vision-based power line segmentation with an attention fusion network. IEEE Sens J. 2022;22(8):8196–205. doi:10.1109/jsen.2022.3157336. [Google Scholar] [CrossRef]

16. Ardis P, Voulodimos A, Doulamis N. Detecting cables and power lines in Small-UAS (Unmanned Aircraft Systems) images through deep learning. Drones. 2021;5(4):115. doi:10.1109/dasc52595.2021.9594405. [Google Scholar] [CrossRef]

17. Yetgin OE, Gerek ON. Automatic recognition of scenes with power line wires in real life aerial images using DCT-based features. Digit Signal Process. 2018;77:102–19. doi:10.1016/j.dsp.2017.10.012. [Google Scholar] [CrossRef]

18. Zhao W, Dong Q, Zuo Z. A method combining line detection and semantic segmentation for power line extraction from unmanned aerial vehicle images. Remote Sens. 2022;14(6):1367. doi:10.3390/rs14061367. [Google Scholar] [CrossRef]

19. Nguyen VN, Jenssen R, Roverso D. LS-Net: fast single-shot line-segment detector. Mach Vision Appl. 2021;32(1):12. doi:10.1007/s00138-020-01138-6. [Google Scholar] [CrossRef]

20. Li Y, Pan C, Cao X, Wu D. Power line detection by pyramidal patch classification. IEEE Trans Emerg Topics Comput Intell. 2019;3(6):416–26. doi:10.1109/tetci.2018.2849414. [Google Scholar] [CrossRef]

21. Huang L, Jiang B, Lv S, Liu Y, Fu Y. Deep-learning-based semantic segmentation of remote sensing images: a survey. IEEE J Select Topics Appl Earth Observ Remote Sens. 2024;17:8370–98. doi:10.1109/jstars.2023.3335891. [Google Scholar] [CrossRef]

22. Zhao Q, Fang H, Pang Y, Zhu G, Qian Z. PL-UNet: a real-time power line segmentation model for aerial images based on adaptive fusion and cross-stage multi-scale analysis. J Real Time Image Process. 2025;22(1):39. doi:10.1007/s11554-024-01615-5. [Google Scholar] [CrossRef]

23. Hota M, Kumar U. Power lines detection and segmentation in multi-spectral UAV images using convolutional neural network. In: 2020 IEEE India Geoscience and Remote Sensing Symposium (InGARSS). Piscataway, NJ, USA: IEEE; 2020. p. 154–7. [Google Scholar]

24. Jaffari R, Hashmani MA, Reyes-Aldasoro CC. A novel focal phi loss for power line segmentation with auxiliary classifier U-Net. Sensors. 2021;21(8):2803. doi:10.3390/s21082803. [Google Scholar] [PubMed] [CrossRef]

25. Li Y, Liu Z, Yang J, Zhang H. Wavelet transform feature enhancement for semantic segmentation of remote sensing images. Remote Sens. 2023;15(24):5644. doi:10.3390/rs15245644. [Google Scholar] [CrossRef]

26. Premachandra MH, Salaan C, Funada MRD, Vea MT, Kawanaka H. Detecting power lines using point instance network for distribution line inspection. Appl Sci. 2021;11(21):10174. doi:10.1109/access.2021.3101490. [Google Scholar] [CrossRef]

27. Zhao H, Shi J, Qi X, Wang X, Jia J. Pyramid scene parsing network. In: Proceedings of the IEEE/CVF Conference on Computer Vision. Piscataway, NJ, USA: IEEE; 2017. p. 2881–90. [Google Scholar]

28. Lin TY, Dollár P, Girshick R, He K, Hariharan B, Belongie S. Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ, USA: IEEE; 2017. p. 2117–25. [Google Scholar]

29. Qin X, Zhang Z, Huang C, Gao C, Dehghan M, Jagersand M. Basnet: boundary-aware salient object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ, USA: IEEE; 2019. p. 7479–89. [Google Scholar]

30. Han G, Zhang M, Li Q, Liu X, Li T, Zhao L, et al. A lightweight aerial power line segmentation algorithm based on attention mechanism. Machines. 2022;10(10):881. doi:10.3390/machines10100881. [Google Scholar] [CrossRef]

31. Hu D, Zheng Z, Liu Y, Liu C, Zhang X. Axial-UNet++ power line detection network based on gated axial attention mechanism. Remote Sens. 2024;16(23):4585. doi:10.3390/rs16234585. [Google Scholar] [CrossRef]

32. Xu P, Sulaiman NAA, Ding Y, Zhao J. Innovative segmentation technique for aerial power lines via amplitude stretching transform. Sci Rep. 2025;15(1):2468. doi:10.1038/s41598-025-86753-x. [Google Scholar] [PubMed] [CrossRef]

33. Tan M, Le QV. Efficientnet: rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning (ICML). London, UK: PMLR; 2019. p. 6105–14. [Google Scholar]

34. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, et al. Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway, NJ, USA: IEEE; 2021. p. 10012–22. [Google Scholar]

35. Liu Z, Mao H, Wu CY, Feichtenhofer C, Darrell T, Xie S. A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ, USA: IEEE; 2022. p. 11976–86. [Google Scholar]

36. Abdelfattah R, Wang X, Wang S. TTPLA: an aerial-image dataset for detection and segmentation of transmission towers and power lines. arXiv:2010.10032. 2020. [Google Scholar]

37. Yetgin OE, Gerek ON. Ground truth of powerline dataset (Infrared-IR and Visible Light-VL). Mendeley Data. 2019. [cited 2026 Jan 21]. Available from: https://data.mendeley.com/datasets/twxp8xccsw/9. [Google Scholar]

38. Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL. Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans Pattern Anal Mach Intell. 2017;40(4):834–48. [Google Scholar] [PubMed]

Cite This Article

APA Style

Baraklı, B., Küçüker, A. (2026). High-Performance Segmentation of Power Lines in Aerial Images Using a Wavelet-Guided Hybrid Transformer Network. Computer Modeling in Engineering & Sciences, 146(2), 26. https://doi.org/10.32604/cmes.2026.077872

Vancouver Style

Baraklı B, Küçüker A. High-Performance Segmentation of Power Lines in Aerial Images Using a Wavelet-Guided Hybrid Transformer Network. Comput Model Eng Sci. 2026;146(2):26. https://doi.org/10.32604/cmes.2026.077872

IEEE Style

B. Baraklı and A. Küçüker, “High-Performance Segmentation of Power Lines in Aerial Images Using a Wavelet-Guided Hybrid Transformer Network,” Comput. Model. Eng. Sci., vol. 146, no. 2, pp. 26, 2026. https://doi.org/10.32604/cmes.2026.077872

BibTex EndNote RIS

Copyright © 2026 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Table of Content

High-Performance Segmentation of Power Lines in Aerial Images Using a Wavelet-Guided Hybrid Transformer Network

Abstract

Keywords

References

Cite This Article

366

184

0

Related articles

Further Information

Guidelines

Follow Us

Join Us

Contact Us

WhatsApp:

Share Link