Open Access
ARTICLE
Superpixel-Aware Transformer with Attention-Guided Boundary Refinement for Salient Object Detection
1 Department of Electrical and Electronics Engineering, Sakarya University, Sakarya, 54050, Türkiye
2 Department of Computer Engineering, Sakarya University, Sakarya, 54050, Türkiye
3 Department of Information Systems Engineering, Sakarya University, Sakarya, 54050, Türkiye
* Corresponding Author: Burhan Baraklı. Email:
(This article belongs to the Special Issue: Advanced Image Segmentation and Object Detection: Innovations, Challenges, and Applications)
Computer Modeling in Engineering & Sciences 2026, 146(1), 36 https://doi.org/10.32604/cmes.2025.074292
Received 08 October 2025; Accepted 10 December 2025; Issue published 29 January 2026
Abstract
Salient object detection (SOD) models struggle to simultaneously preserve global structure, maintain sharp object boundaries, and sustain computational efficiency in complex scenes. In this study, we propose SPSALNet, a task-driven two-stage (macro–micro) architecture that restructures the SOD process around superpixel representations. In the proposed approach, a “split-and-enhance” principle, introduced, to our knowledge, for the first time in the SOD literature, hierarchically classifies superpixels and then applies targeted refinement only to ambiguous or error-prone regions. At the macro stage, the image is partitioned into content-adaptive superpixel regions, and each superpixel is represented by a high-dimensional region-level feature vector. These representations define a regional decomposition problem in which superpixels are assigned to three classes: background, object interior, and transition regions. Superpixel tokens interact with a global feature vector from a deep network backbone through a cross-attention module and are projected into an enriched embedding space that jointly encodes local topology and global context. At the micro stage, the model employs a U-Net-based refinement process that allocates computational resources only to ambiguous transition regions. The image and distance–similarity maps derived from superpixels are processed through a dual-encoder pathway. Subsequently, channel-aware fusion blocks adaptively combine information from these two sources, producing sharper and more stable object boundaries. Experimental results show that SPSALNet achieves high accuracy with lower computational cost compared to recent competing methods. On the PASCAL-S and DUT-OMRON datasets, SPSALNet exhibits a clear performance advantage across all key metrics, and it ranks first on accuracy-oriented measures on HKU-IS. On the challenging DUT-OMRON benchmark, SPSALNet reaches an MAE of 0.034. Across all datasets, it preserves object boundaries and regional structure in a stable and competitive manner.
Identifying the most salient objects in an image is called Salient Object Detection (SOD) [1,2]. Methods developed for this purpose aim to separate the background region of an image from the foreground object using a saliency map. This map typically highlights visually important regions and is often used for masking operations in various visual data processing tasks. It has also been widely applied in tasks such as object recognition [3,4], segmentation [5], compression [6], classification [7], object tracking [8], content cropping [9], resizing [10], and object detection [11,12].
SOD methods play significant roles in many areas beyond traditional computer vision tasks. For example, in underwater imaging and autonomous underwater robotics, highlighting meaningful regions under low contrast and spectral distortion enhances object detection, navigation, and mapping processes [13]. Saliency-guided feature extraction is important in remote sensing applications, as it reduces shadow-induced false positives and improves change detection accuracy by highlighting critical change regions in difference maps [14]. The importance and use of saliency analysis in medical imaging applications are steadily increasing. Approaches developed for this domain aim to make ambiguous tissue boundaries more distinct, highlight small structures and lesions, preserve the structural integrity of the image, and significantly improve segmentation accuracy [15,16]. In visible-infrared transmission line inspection, saliency-based attention mechanisms that highlight thin and low-contrast structures facilitate the detection of fault areas [17]. In intelligent transportation and autonomous driving, SOD enhances the reliability of perception modules by prioritizing vulnerable road users even under low visibility conditions [18]. Furthermore, in road defect detection studies (such as potholes and cracks), saliency models serve as a critical pre-attention component in highlighting both linear, delicate structures and multi-scale deformations [19]. These applications demonstrate that SOD is an effective and general-purpose tool for identifying structurally and semantically important regions across multiple modalities.
Existing SOD methods can be divided into two main groups: traditional and deep learning-based approaches. Traditional methods are heuristic and statistical approaches that extract low-level features such as color, intensity, contrast, texture, and location from visual data [20–24]. These methods have low computational cost; however, they are now largely outdated and often provide inadequate performance [1,2,25]. Deep learning-based methods, which overcome the weaknesses of traditional methods, have been developed in four main categories: fully supervised [26–30], weakly supervised [31,32], unsupervised [33,34], and adversarial [35,36].
Recent advances in deep learning-based SOD models have greatly improved performance. However, several key limitations still affect the accuracy and practicality of saliency maps. In particular, the inability to preserve high-level semantic information, insufficient integration of contextual relationships across layers, difficulties in simultaneously detecting objects of varying scales, lack of saturation at object edges, and the generation of blurry contours remain unresolved in existing models. The challenge of accurately detecting thin and detailed edge regions, the limitations of attention mechanisms in ambiguous transition areas, and the high computational load (especially problematic for mobile or real-time applications) are other issues requiring attention. Moreover, Convolutional Neural Network (CNN)- and Transformer-based methods still operate either on pixel grids or regular patch sequences, lacking explicit modeling of region topology. When superpixels are used, they typically serve only as auxiliary masks rather than as primary representational units. As a result, long-range dependencies are learned without awareness of object boundaries, and transition regions are handled only implicitly, which leads to inconsistent contour modeling and ambiguous boundary responses. Studies addressing these challenges are summarized below.
In early studies, multi-context CNNs that jointly evaluate global and local context were used and yielded effective results [37]. However, CNN-based models cannot fully extract global contextual information due to their limited receptive fields [1]. For example, while Fully Convolutional Network (FCN)-based methods such as PiCANet [38] and BASNet [29] provide strong local feature extraction, they are insufficient in modeling long-range dependencies and can produce incorrect saliency maps in complex images. In recent studies, various methods have been developed to use global context information more effectively. For example, large receptive-field modules have been designed in ICON [39] and BBRF [40]. In addition, CGDINet [41] and CFFNet [42], hybrid networks combining CNN and Transformer architectures, have effectively performed wide-area perspective modeling.
The ability of a model to integrate contextual relationships across different layers while preserving high-level semantic information contributes to accurate object discrimination in complex scenes [1]. To achieve this, Feature Pyramid Network (FPN)-based structures [43] and attention modules (such as CBAM and SENet) have been widely adopted to combine multi-level features and enrich context [38,44]. However, these methods often compromise semantic integrity due to direct feature aggregation, which can weaken high-level representations. Approaches that maintain channel-level coherence and guide information flows more effectively across layers further enhance semantic consistency [39,42].
One challenge in SOD is the simultaneous and accurate detection of objects at varying scales. Transformer-based methods effectively capture global context and long-range dependencies but often struggle to detect small objects. In contrast, CNN-based models generally perform better in detecting small objects [45,46]. In order to benefit from the advantages of these approaches, CGDINet [41] proposed the Depth Auxiliary Module (DAM) structure that aims to represent details better at different scales by using depth information and multi-channel images. Similarly, BBRF [40] provided a balanced representation by processing deep semantic and high-resolution detail information over separate branches. MENet [47], which uses the iterative feature enhancement strategy, reduced the performance imbalances in multi-scale object segmentation.
The lack of saturation and blurry borders that occur at object edges are among the most important problems in the SOD field [48–50]. Edge-supervised methods such as SCRN [51] and EGNet [52] learn only limited edge information because they rely solely on pre-generated fine edge labels during model training. ICON [39] developed a learning approach to preserve edge sharpness and internal structural integrity in saliency maps. The Transformer-based UGRAN [53] finds regions of low model confidence using uncertainty maps and applies adaptive refinement, especially in edge areas. DC-Net [54] reduces blurry contours by learning edge and object regions with two separate encoders and combining them in a fuse-decoder structure. PiNet [55] processes multi-level features in separate branches, reducing the erroneous highlighting of unremarkable regions and the loss of internal details. OLER [56] enriches boundary representation and increases edge sensitivity using thin and thick edge maps.
The growing use of mobile and embedded systems has driven the need for lightweight and efficient SOD models. The ICON [39] and BBRF [40] models are unsuitable for practical applications due to their high computational costs. ICON incorporates complex mechanisms such as multi-layer attention calculations and directed information flow to preserve overall structural integrity and fine details. BBRF uses a dual-branch encoder structure that includes iterative decision paths. Such complex structures in both methods incur additional costs in terms of memory usage, limiting their usability in resource-constrained applications. Lightweight convolutional structures reduce the computational load but suffer performance losses due to their limited representation capacity [57]. With 0.84 million parameters, ADMNet [58] offers an architecture that integrates multi-scale contextual features with densely connected modules and includes two-stage attention mechanisms, providing real-time processing capabilities while maintaining high accuracy. However, performance degradation in terms of structural consistency is observed, particularly when it is applied to images containing multiple objects or low-contrast scenes. A complementary lightweight adversarial model [59] further enhances spatial consistency by combining adversarial refinement with a multi-scale contrast module, achieving competitive accuracy with significantly reduced computational overhead.
To address the limitations outlined above, a holistic macro- and micro-level salient object detection framework is proposed that explicitly exploits superpixel (SP) structure. The overall design aims to preserve global scene information while performing detail-oriented boundary refinement. At the macro stage, a Transformer-based SP classification module processes content-adaptive SP representations and partitions the image into background, salient object, and boundary regions. At the micro stage, a lightweight U-Net-like refinement module focuses solely on uncertain edge areas, thereby concentrating computational resources on the most ambiguous regions and improving the efficiency of the learning process.
With the emergence of deep learning, the use of traditional SOD approaches based solely on handcrafted features has declined significantly. Nevertheless, the explainability, computational efficiency, and semantically meaningful separation capabilities of such features still hold significant potential. Hybrid strategies that combine classical feature-based analysis and deep representations are particularly effective in exploiting this potential [57]. In the proposed framework, various statistical, structural, and color-based descriptors computed over SP regions are first aggregated into compact feature vectors. These vectors are then projected into a deep embedding space and used as SP tokens, preserving the interpretability of traditional features while providing information-rich yet compact representations that strengthen regional discrimination. The resulting tokens guide the Transformer-based SP classifier, especially in contour-intensive areas, encouraging attention mechanisms to focus on perceptually meaningful regions and facilitating more effective learning of long-range correlations.
At the macro level, images are divided into content-consistent SP regions that are subsequently classified into three categories: background, salient object, and boundary. Each region is described by semantic feature vectors, including color distributions, edge densities, frequency components, geometric structures, and neighborhood contrasts. These representations are used directly as tokens in the Transformer encoder and offer advantages in terms of computational efficiency and regional consistency compared to grid-based patch segmentations commonly used in the literature. In parallel, a ResNet-18 backbone extracts deep convolutional features, which are queried by the SP tokens through an SP–CNN cross-attention module so that each SP selectively aggregates fine-grained semantic cues from the feature map. The refined SP tokens are combined with a global context token obtained by pooling CNN features and fed into a graph-aware Transformer encoder, where adjacency relationships and pairwise spatial distances between SPs are injected into the attention logits via a learnable bias term. By taking the SP topology into account, the model reduces information loss and scale mismatch, allowing the attention module to utilize local SP features, CNN features, and global context jointly and more effectively.
Following the macro level, a micro model comes into play that focuses only on the SPs at the edge contours of the salient object. This model performs a focused refinement process that responds only to specific regions by processing the input image and SP-based contextual distance maps in two separate encoder branches. The micro model utilizes a U-Net-like architecture, facilitating the hierarchical extraction and reconstruction of features across multiple levels.
Each encoder uses the attention-based Squeeze-and-Excitation-Fusion (SEFuse) mechanism, which combines image features and SP information streams meaningfully by weighting different feature maps at the channel level. During the compression step, the statistical summary of each channel is extracted while the excitation step learns the inter-channel contextual relationships to highlight prominent information. This enables the relative importance of information from different sources to be learned at each encoder stage, resulting in more consistent and semantically rich representations. This provides a holistic learning process supporting detail accuracy and semantic continuity. Additionally, by focusing only on transition areas based on edge SP regions, the model’s sensitivity to unnecessary areas is reduced.
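To make the SEFuse idea concrete, the following is a minimal sketch of such a channel-attentive fusion block, assuming a squeeze-and-excitation weighting over the concatenated image and SP feature maps; the reduction ratio, layer choices, and names are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class SEFuse(nn.Module):
    """Channel-attentive fusion of image-branch and SP-branch features (sketch)."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        fused = 2 * channels
        self.excite = nn.Sequential(                 # squeeze -> excitation
            nn.AdaptiveAvgPool2d(1),                 # per-channel statistics (squeeze)
            nn.Conv2d(fused, fused // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(fused // reduction, fused, 1), nn.Sigmoid())
        self.merge = nn.Conv2d(fused, channels, kernel_size=1)

    def forward(self, f_img, f_sp):
        x = torch.cat([f_img, f_sp], dim=1)          # channel-level concatenation
        x = x * self.excite(x)                       # re-weight channels by learned importance
        return self.merge(x)                         # project back to the branch width

# usage sketch: fuse 128-channel features from both encoders
out = SEFuse(128)(torch.randn(1, 128, 44, 44), torch.randn(1, 128, 44, 44))
```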
As a result, the proposed system not only integrates global and local contexts successfully but also offers a clear architecture that processes visual and structural information at every level, carefully executes boundary-focused detail enhancements, and maintains parameter efficiency.
This study offers the following original contributions:
• It is the first SOD framework that treats SPs as the primary representational units by encoding each SP into a multi-dimensional feature vector and using these vectors directly as tokens in a Transformer encoder.
• A dedicated SP–CNN cross-attention module is introduced, enabling each SP token to selectively aggregate fine-grained semantic cues from CNN features and improving macro-level semantic consistency.
• A graph-aware Transformer encoder is proposed, where SP adjacency and spatial distances are encoded as a learnable bias added to the attention logits, enforcing boundary continuity and structure-aware information propagation.
• A boundary-focused micro refinement stage is developed using a dual-branch U-Net that processes both raw image content and SP-based contextual distance maps. The proposed SEFuse module adaptively combines these streams at each level.
• An SP-aware edge loss is designed by computing the loss only over SPs located in salient-region boundaries, reducing unnecessary gradient propagation and improving training efficiency.
• The architecture is modular and flexible: the macro SPSeg stage can operate independently for fast inference, while the full macro–micro pipeline achieves superior accuracy–complexity balance suitable for practical applications.
• Extensive experiments on five public SOD benchmarks demonstrate that the proposed method achieves state-of-the-art or superior performance compared to recent Transformer- and CNN-based models, while maintaining a favorable balance between accuracy, parameter count, and computational cost.
The study is organized as follows. Section 2 comprehensively discusses existing SOD approaches. Section 3 explains the proposed method in detail. In Section 4, the effectiveness and performance of the proposed model are demonstrated through experimental analyses conducted on various datasets, comparisons with existing methods, and numerous ablation studies.
In recent years, numerous methods for SOD have been developed, yielding effective results across various datasets. Most contemporary SOD approaches follow a two-stage processing architecture: the first stage captures the global context while the second stage focuses on making detailed refinements in local regions.
Integrating SP representations with deep learning-based methods is an effective strategy for improving semantic consistency and structural accuracy. Approaches that evaluate features extracted by deep learning at the SP level offer advantages in learning with limited data, edge sensitivity, and explainability. For example, in the SuperCNN [60] method, each SP is normalized along with its contextual environment and fed into the CNN as input, directly estimating saliency probabilities at the SP level. In another study, a coarse saliency map of the scene is first extracted using a VGG16-based FCN, and then the map is aligned with SLIC SP regions and refined in detail using local CNN modules [61]. Additionally, it was proposed to model foreground and background probabilities at the SP level by fusing semantic features with classical visual features such as color, texture, and intensity [57].
In newer Transformer-based approaches that employ SPs, SP regions have been made to interact directly with attention mechanisms. A local–global token interaction strategy has been proposed to improve regional consistency and enhance long-range modeling [62]. In multimodal settings, SP representations have also been employed in visual–language alignment to provide semantically coherent region-level cues for pre-trained models [63]. More recently, CCTNet [64] utilized SP maps to guide attention toward delicate structures, where SPs act as auxiliary masks that modulate feature dependencies. Unlike these attention-guided formulations, our method treats SPs as the fundamental token units of the Transformer: each SP is encoded as a semantic region-level feature vector and fed directly into the Transformer encoder, effectively replacing the conventional grid-based patch embedding with a content-adaptive tokenization strategy.
SPs have also been used together with traditional methods to model low-contrast images. Multi-scale SP segmentation allows regional contrast, center priors, and integrity criteria to be extracted at every scale, and the resulting criteria are probabilistically combined to compute the saliency probability of each region [65]. Supporting these structures with attention maps has provided balanced results, especially in regions with weak contrast and scattered objects [66]. In the SPCont [67] method, high-edge-sensitivity maps are produced by evaluating the contrast relationship between each SP’s local neighbors and the scene-wide context. In another study, SP scales are also modeled as graph layers; in each layer, the saliencies of the foreground and background are calculated separately using manifold ranking, and the global context is utilized in an integrated manner through inter-layer information flow [68].
SP-based traditional learning approaches offer advantages, particularly in terms of explainability and low data requirements. For example, an adaptive metric learning-based method can semantically distinguish between foreground and background regions by learning the distance relationship between SP pairs [69]. The HDHF (Hybrid Deep and Handcrafted Features) method [70] combines multi-scale CNN features with classical features to create hybrid SP representations, which are then classified using the random forest method and converted into saliency maps.
2.2 Deep Learning-Based Methods
In deep learning-based SOD methods, in addition to high prediction accuracy, contextual information modeling, structural consistency, and uncertainty management are also targeted. Current methods use different design strategies to extract semantic integrity and structural details in the saliency map. These methods are commonly grouped into three main categories: multi-scale approaches that integrate hierarchical features, boundary-focused methods that preserve object contours and structural consistency, and uncertainty-guided strategies that refine ambiguous regions. Beyond RGB-only formulations, a growing body of work investigates multimodal SOD by incorporating auxiliary modalities such as depth or thermal infrared. These multimodal approaches aim to improve robustness under low illumination, cluttered backgrounds, or weak texture conditions by leveraging cross-modal complementarity. The following subsections summarize these existing methods in detail.
2.2.1 Multi-Scale & Feature-Level Methods
This group of methods aims to produce powerful saliency representations by effectively integrating multi-level and multi-scale features obtained from different layers. For example, in PoolNet [71], context information is condensed through pooling, while edge details are enhanced in auxiliary branches. Similarly, in the MiNet [72] model, features at different scales are combined through mutual interaction modules to reduce regional inconsistencies.
In recent years, several transformer-based methods have emerged that effectively highlight global context. One noteworthy example is the GateNet-RGB model [73], which enhances context correction in low-confidence areas by incorporating supervised side branches into the Transformer’s encoder architecture. Similarly, another approach detailed in [74] combines the local detail-capturing capabilities of CNNs with the global context extraction strengths of transformers. More recently, GLSTR [75] adopts a pure transformer encoder with a densely connected decoder, enabling joint global–local representation learning across layers while reducing redundancy and improving spatial precision.
Focusing on the comprehensive integration of multi-level features, the new generation model Multi-scale Feature Enhancement [76] processes local and global contexts together with inter-layer interactions based on VGG16. FLICSP [77] aims to separate ambiguous features in complex content using deep CNNs and multi-feature fusion. MDFANet [78] enhances sensitivity in small objects by emphasizing salient regions with contextual attention mechanisms and edge-based structural discrimination. DC-Net [54] achieves wider receptive fields more efficiently with a two-level parallel encoder structure.
In recent studies, multi-scale modeling has been further strengthened through more adaptive and structurally flexible feature integration strategies. GPONet [79] enhances multi-level fusion with gated units that selectively transmit informative features while suppressing noise, improving stability in scenes requiring a balance between fine details and global semantics. NASAL [80] approaches the problem from an architectural perspective by introducing a U-shaped neural architecture search space that automatically discovers scale-aware encoder–decoder designs, demonstrating that multi-scale integration can also be optimized through automated search. Decoder-side refinements have also gained attention. SalFAU-Net [81] strengthens cross-scale coherence by combining intermediate saliency predictions through dedicated fusion modules within each decoder stage. Meanwhile, SODAWideNet++ [82] expands receptive fields using a hybrid design that couples dilated convolutions with attention-based long-range feature extraction, enabling effective global–local representation learning without relying on classification backbones.
2.2.2 Boundary & Structural Integrity Methods
The methods in this category focus on preserving object boundaries and ensuring internal structural integrity. BASNet [29], which increases detail quality using residual feature transfer and hybrid losses, is the first method in this class. Later, PiCANet [38] produced more accurate boundaries using attention structures that account for local and global contexts at the pixel level. In addition, R3Net [83], which reduces the difference between predictions and labels with iterative learning, was developed.
Part-whole relationships are modeled in TSPOANet [84] to ensure structural integrity. Building on this idea, TCGNet [85] integrates contrast cues from CNNs with part–whole relational semantics from Capsule Networks, using correlation matrices to guide their interaction and enhance object completeness in structurally complex regions. GCPANet [86] was developed to integrate semantic, detailed, and contextual information with roaming module structures and to increase consistency in multi-object scenes. The F3Net [50] method utilizes feature fusion and a feedback decoder structure. In addition, object boundaries are improved by using loss functions sensitive to pixel positions. SCRN [51] improves the boundaries with cross-task learning, and LDF [87] reduces the blurring in the boundary regions with label decomposition. Recent studies have also revisited boundary refinement under challenging conditions. DASOD [88] addresses low-contrast, cluttered, and partially occluded regions by generating coarse body masks and refining them with detail cues, resulting in more reliable boundary preservation in structurally ambiguous areas.
2.2.3 Uncertainty-Guided Learning Methods
The methods in this category have been developed to prevent overconfident estimations in salient object detection. The learning process is guided by identifying regions where the model is uncertain. UGRAN [53] improved uncertain areas using adaptive segmentation mechanisms and multi-level attention modules. Similarly, UDNet [89] reduced boundary ambiguities by modeling internal and external contour information in separate layers. RCSBNet [90] is one of the most remarkable methods in this class. It improves structural consistency in uncertain regions and accuracy in low-confidence regions with its recursive CNN architecture and contour-saliency blending module.
2.2.4 Multimodal SOD Methods (RGB-D/RGB-T)
Multimodal SOD extends RGB-based detection by incorporating depth (RGB-D) or thermal infrared (RGB-T) cues to improve robustness under low illumination, cluttered backgrounds, and regions with weak texture. In RGB-D studies, cross-modal and cross-level fusion is commonly employed to combine geometric depth structure with RGB appearance features [91]. At the same time, hybrid Transformer–CNN designs further strengthen multi-scale correspondence and structural reasoning [92]. Representative examples such as CCAFNet [93], which integrates crossflow and scale-adaptive fusion, demonstrate the effectiveness of depth-guided refinement in improving localization and boundary completeness. RGB-T methods, reviewed comprehensively in recent surveys [94], focus on leveraging the complementary behavior of visible and thermal signals through correlation learning, modality-consistent fusion, and illumination-invariant feature extraction. The alignment-free progressive correlation network [95] avoids pixel-level registration by modeling cross-modal relations directly, while ECFFNet [96] and WaveNet [97] enhance thermal–visible integration using consistency-driven fusion blocks and wavelet-domain knowledge distillation.
3 The Proposed Algorithm—SPSALNet
The proposed method, whose block diagram is shown in Fig. 1, comprises two models. The first model, called Superpixel Segmentation (SPSeg), performs SP classification. It uses a transformer-based approach that incorporates general image features, SP-level features, and the contextual relationships among SPs to separate them into three groups: background, salient region, and salient object boundary region. The second model, named SP-based Saliency Detection (SPSal), focuses on the pixels of the SPs in the salient object boundary region identified in the SPSeg step. Its primary goal is to determine the local saliency status of these pixels using a U-Net-based approach with attention mechanisms.

Figure 1: Block diagram for SPSALNet. X is the input image. Black (B), White (W), and Gray (G) regions are assumed to have labels
The background and salient region information provided by SPSeg is combined with the pixel-level salient object boundary information produced by SPSal to achieve a clear separation between the salient object and the background. Each model uses different features to represent an SP.
3.1 SPSeg: Superpixel Segmentation & Classification
As illustrated in Fig. 2, the proposed SPSeg architecture introduces a novel SP-based transformer framework that synergistically fuses deep semantic features with hand-crafted structural representations. The workflow processes the input image through two parallel streams: a ResNet backbone extracting high-level CNN tokens and an SP generation module deriving feature-rich SP tokens. To bridge the semantic gap between these representations, we introduce an SP–CNN Cross-Attention mechanism, allowing SP nodes to selectively attend to fine-grained CNN details. Subsequently, a Graph-Aware Transformer Encoder processes the refined representations by explicitly incorporating the spatial topology defined by the adjacency relations and pairwise spatial distances between SPs.

Figure 2: Detailed illustration of the SPSeg architecture. CNN-derived pixel tokens and SP-derived feature tokens are projected into a shared embedding space and fused through the proposed SP–CNN cross-attention mechanism. This module enables each SP token to selectively attend to the most informative CNN tokens, forming refined representations
Let
The transformer model
where the set of SPs whose predicted class is
In Fig. 2, SP sets
In the SPSeg model, both SP features and image features are jointly used. Each SP has an F-dimensional feature vector defined by Eq. (2) (see also Table 1).

The feature vector
The
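Since the exact feature list of Table 1 is not reproduced here, the sketch below only illustrates how region-level descriptors of this kind (mean color, per-channel histograms, edge density, center location, and area) might be aggregated per SP; the selection and naming of features are assumptions, not the paper's exact definition of Eq. (2).

```python
import cv2
import numpy as np

def superpixel_features(img_bgr, sp_labels, hist_bins=8):
    """Aggregate simple region-level descriptors for each superpixel (illustrative).

    img_bgr   : HxWx3 uint8 image
    sp_labels : HxW int array of superpixel ids (0..N-1)
    Returns an (N, F) float array; the feature list is a sketch, not Table 1.
    """
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200) > 0            # crude edge map for edge density
    h, w = sp_labels.shape
    ys, xs = np.mgrid[0:h, 0:w]
    n_sp = sp_labels.max() + 1
    feats = []
    for k in range(n_sp):
        m = sp_labels == k
        pix = img_bgr[m].astype(np.float32)
        mean_rgb = pix.mean(axis=0) / 255.0          # mean color
        hist = np.concatenate([
            np.histogram(pix[:, c], bins=hist_bins, range=(0, 255), density=True)[0]
            for c in range(3)
        ])                                           # per-channel color histogram
        edge_density = edges[m].mean()               # fraction of edge pixels in the SP
        cy, cx = ys[m].mean() / h, xs[m].mean() / w  # normalized center coordinates
        area = m.mean()                              # relative region size
        feats.append(np.concatenate([mean_rgb, hist, [edge_density, cy, cx, area]]))
    return np.stack(feats)
```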
To capture both global context and fine-grained visual cues, the input image
where
A global feature vector
In traditional transformer architectures, processing sequential information requires encoding the position of each element explicitly. In the proposed model, a learnable positional encoding mechanism is used to preserve the positional relationships of SP tokens. The 2-D center coordinates (
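As a hedged illustration of this learnable positional encoding, the sketch below projects the normalized 2-D SP center coordinates through a small trainable network and adds the result to the SP tokens; the layer sizes and module name are assumptions.

```python
import torch
import torch.nn as nn

class SPPositionalEncoding(nn.Module):
    """Learnable positional encoding from 2-D superpixel centers (illustrative)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(2, d_model), nn.GELU(),
                                  nn.Linear(d_model, d_model))

    def forward(self, sp_tokens: torch.Tensor, centers: torch.Tensor) -> torch.Tensor:
        # sp_tokens: (B, N, d_model); centers: (B, N, 2) with coordinates in [0, 1]
        return sp_tokens + self.proj(centers)

# usage sketch
tokens = torch.randn(2, 196, 128)
centers = torch.rand(2, 196, 2)
tokens = SPPositionalEncoding(128)(tokens, centers)
```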
SP–CNN cross-attention: The SP tokens
Given
where
The cross-attention response of the
where the similarity matrix
After concatenating all heads and applying the output projection, we obtain the cross-attention output
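The cross-attention itself follows the standard multi-head formulation, with SP tokens acting as queries and flattened CNN tokens as keys and values. A minimal PyTorch sketch is shown below; the embedding width, head count, and residual/normalization placement are assumptions.

```python
import torch
import torch.nn as nn

class SPCNNCrossAttention(nn.Module):
    """SP tokens (queries) attend to CNN feature tokens (keys/values) - sketch."""
    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, sp_tokens, cnn_tokens):
        # sp_tokens: (B, N_sp, d); cnn_tokens: (B, H*W, d) flattened CNN feature map
        out, _ = self.attn(query=sp_tokens, key=cnn_tokens, value=cnn_tokens)
        return self.norm(sp_tokens + out)   # residual + norm, as in standard blocks

# usage sketch
sp = torch.randn(2, 196, 128)
cnn = torch.randn(2, 11 * 11, 128)          # e.g., an 11x11 backbone feature map
refined = SPCNNCrossAttention()(sp, cnn)
```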
Graph-aware transformer encoder:
The sequence
To explicitly model the SP graph, we incorporate an attention bias that depends on the adjacency and spatial distance between SPs. Let
The bias term
The outputs of the
Thus, the proposed graph-aware encoder preserves the standard transformer structure, but the SA operation is explicitly biased according to the SP graph topology. This design enables information propagation predominantly along meaningful SP neighborhoods and improves boundary consistency and structure-aware aggregation in the SPSeg stage.
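As an illustration of this biasing scheme, the sketch below adds a learnable combination of the adjacency matrix and the pairwise center distances to the attention logits before the softmax; the single-head formulation and the scalar parameterization of the bias are simplifying assumptions, not the exact bias function used in SPSeg.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphBiasedSelfAttention(nn.Module):
    """Single-head self-attention with an additive graph bias (illustrative)."""
    def __init__(self, d_model=128):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.scale = d_model ** -0.5
        self.alpha = nn.Parameter(torch.tensor(1.0))   # weight for adjacency
        self.beta = nn.Parameter(torch.tensor(1.0))    # weight for distance decay

    def forward(self, x, adj, dist):
        # x: (B, N, d); adj: (B, N, N) binary SP adjacency; dist: (B, N, N) center distances
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        logits = (q @ k.transpose(-2, -1)) * self.scale
        bias = self.alpha * adj - self.beta * dist      # favor neighbors, penalize far SPs
        attn = F.softmax(logits + bias, dim=-1)
        return attn @ v
```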
Let
where
Following classification, the SP sets (

Figure 3: Classification results for the SPSeg structure. (a) Original image, (b) SP boundaries, (c) ground truth mask, and (d) representation of the region formed by the SP sets (
In summary, the estimated
3.2 SPSal (Superpixel-Based Salient Detection)
In the SPSeg stage, a coarse saliency map is generated at the SP level. The accuracy of this map improves as the number of SPs increases. However, a higher number of SPs makes region selection more challenging for the SP algorithms and increases the model’s processing load. In addition, boundary areas must be assessed separately in all cases. To handle this, the SPSal model (
Each pixel
The SPSal model is based on a U-Net architecture enhanced with SA modules, as illustrated in Fig. 4. Similar to SPSeg, it adopts a dual-input design that utilizes both masked images and SP features. However, unlike SPSeg, only spatial and color features are utilized in this stage. Spatial features are extracted at the pixel level, while color features are aggregated at the SP level.

Figure 4: Block diagram of the SPSal architecture. The masked boundary image
The center coordinates of
For each pixel

Figure 5: Visualization of the spatial–appearance priors used in SPSal. The top row shows the input image, the SP regions predicted as boundary class
The visual similarity of each SP in region
A class-based similarity score
Continuous similarity maps at the pixel level
These features enable the model to better evaluate SP class similarity at both semantic and visual levels, enhancing location-aware contextual attention learning, especially in boundary regions and mixed-class areas. As a result, the model generates more stable and detail-sensitive predictions in the class transition regions. The calculated features are combined and denoted as
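Because the exact formulations of these distance and similarity maps are not reproduced above, the following is only a rough sketch of the underlying idea: for every pixel, a spatial distance to the nearest SP center of each predicted class and a color similarity to that class's SPs are turned into pixel-level prior maps. All normalizations, thresholds, and names are assumptions.

```python
import numpy as np

def class_prior_maps(sp_labels, sp_classes, sp_centers, sp_mean_rgb, img_rgb):
    """Pixel-level spatial-distance and color-similarity priors per SP class (sketch).

    sp_labels   : HxW int superpixel ids
    sp_classes  : (N,) predicted class per SP (0=background, 1=object, 2=boundary)
    sp_centers  : (N, 2) (y, x) centers; sp_mean_rgb: (N, 3) mean colors in [0, 1]
    img_rgb     : HxWx3 float image in [0, 1]
    """
    h, w = sp_labels.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([ys, xs], axis=-1).astype(np.float32)
    diag = np.hypot(h, w)
    dist_maps, sim_maps = [], []
    for c in range(3):
        idx = np.where(sp_classes == c)[0]
        if idx.size == 0:                      # class absent: neutral priors
            dist_maps.append(np.ones((h, w), np.float32))
            sim_maps.append(np.zeros((h, w), np.float32))
            continue
        # distance from every pixel to the nearest SP center of class c (normalized)
        d = np.linalg.norm(pix[:, :, None, :] - sp_centers[idx][None, None], axis=-1)
        dist_maps.append(d.min(axis=-1) / diag)
        # color similarity to the closest-colored SP of class c
        cd = np.linalg.norm(img_rgb[:, :, None, :] - sp_mean_rgb[idx][None, None], axis=-1)
        sim_maps.append(1.0 - cd.min(axis=-1) / np.sqrt(3.0))
    return np.stack(dist_maps), np.stack(sim_maps)
```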
The masked image
The encoder structure applied for both inputs is given in Table 2. The feature maps obtained in the deepest layers of the image and SP branches are combined at the channel level as

The
Fig. 4 shows the layer where features from the encoder layers are combined, represented by SEFuse blocks. Each SEFuse block combines image and SP features from the corresponding encoder level using the steps in Table 3 (

The input of the first decoder is formed as
In the final stage, SM for the
Fig. 6 shows the performance of the decoder output. In Fig. 6b, the SP properties input is calculated with

Figure 6: Results of salient detection for the SPSal structure. (a) Original image, (b) SEEDS SP boundaries, (c)
The flow diagram in Fig. 7 provides detailed visualizations of the proposed method. The ground-truth set

Figure 7: Overall workflow of the proposed SPSALNet. In the diagram above, the red line represents the SPSeg flow and the blue line represents the SPSal flow. The diagram below is provided for verification. The input image is first segmented into SPs and fed into the SPSeg module together with global features extracted by ResNet-18. The SPSeg module classifies SPs into three sets: background
The ground-truth map
The class assignment rule in Eq. (22) is a parameter-free decision mechanism determined directly by the pixel composition of each SP. SPs that contain only background or only object pixels are assigned to definitive interior classes, whereas those containing a mixture of both labels naturally correspond to ambiguous transition regions. As a result, interior regions are cleanly separated into two stable classes, while high-frequency contour transitions are modeled as a dedicated category that is later re-evaluated by the SPSal module. The three-class formulation enhances training stability and enables more precise recovery of fine-grained boundary details in the final saliency estimation.
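Expressed procedurally, the rule reduces to a simple per-SP check against the ground-truth mask; the class identifiers in the sketch below are illustrative.

```python
import numpy as np

BACKGROUND, OBJECT, BOUNDARY = 0, 1, 2   # illustrative class ids

def assign_sp_classes(sp_labels, gt_mask):
    """Parameter-free SP labeling from the binary ground-truth mask (sketch).

    sp_labels: HxW int superpixel ids; gt_mask: HxW binary saliency mask (0/1).
    """
    n_sp = sp_labels.max() + 1
    classes = np.empty(n_sp, dtype=np.int64)
    for k in range(n_sp):
        fg = gt_mask[sp_labels == k]
        if fg.all():
            classes[k] = OBJECT        # only object pixels
        elif not fg.any():
            classes[k] = BACKGROUND    # only background pixels
        else:
            classes[k] = BOUNDARY      # mixture -> ambiguous transition region
    return classes
```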
The target ground truth image for the SPSal model is obtained using
The SPSeg model uses the cross-entropy loss function given by Eq. (24) to predict SP labels, where
The loss function used for pixel-level predicted saliency maps consists of a combination of the binary cross-entropy (BCE) loss and the Dice loss, which are defined in Eqs. (26) and (25), where M denotes the total number of pixels.
By combining the loss functions in a balanced manner as in Eq. (27) and using the
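As a hedged illustration of the pixel-level objective, the sketch below combines BCE and Dice terms and, in line with the SP-aware edge loss described among the contributions, evaluates them only on pixels inside the boundary band; the weights shown are placeholders rather than the exact coefficients of Eq. (27).

```python
import torch
import torch.nn.functional as F

def masked_bce_dice(pred_logits, target, band_mask, w_bce=1.0, w_dice=1.0, eps=1e-6):
    """BCE + Dice evaluated only on boundary-band pixels (illustrative weights).

    pred_logits, target, band_mask: tensors of shape (B, 1, H, W);
    band_mask is 1 inside boundary SPs and 0 elsewhere.
    """
    pred = torch.sigmoid(pred_logits)
    m = band_mask.float()
    bce = F.binary_cross_entropy_with_logits(pred_logits, target, reduction="none")
    bce = (bce * m).sum() / m.sum().clamp(min=1.0)
    inter = (pred * target * m).sum()
    dice = 1.0 - (2.0 * inter + eps) / ((pred * m).sum() + (target * m).sum() + eps)
    return w_bce * bce + w_dice * dice
```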
The optimal values of the model parameters
The proposed model SPSALNet is trained on the DUTS-TR dataset [31], which is the largest and most diverse training set in the SOD literature. Its wide range of object categories, background clutter, and scale variations provides sufficient variability to stabilize the learning of SP–semantic relations during the SPSeg stage. The performance of the model is evaluated on the pixel-wise annotated ECSSD [101], HKU-IS [102], DUT-OMRON [103], PASCAL-S [104], and DUTS-TE [31] benchmark datasets. These datasets collectively cover a broad spectrum of visual challenges and therefore allow a comprehensive assessment of both the macro-level SP classification module and the micro-level boundary refinement component of SPSALNet.
ECSSD contains 1000 semantically meaningful images with complex structural patterns. HKU-IS consists of 4447 images that frequently include multiple disconnected salient regions and boundary-touching objects. DUT-OMRON includes 5168 images characterized by highly cluttered backgrounds and multiple object instances, providing a challenging setting for evaluating the robustness of SP grouping. PASCAL-S provides 850 images with cognitively grounded saliency annotations and diverse object layouts. The DUTS dataset includes two subsets: DUTS-TR with 10,553 images used for training, and DUTS-TE with 5019 images used for testing; both subsets exhibit substantial variability in object scale, illumination, and background complexity. This combination of datasets forms a well-established evaluation protocol in SOD research and enables reliable testing of the proposed architecture under diverse structural, contextual, and scale-related conditions.
4.2 Model Implementation Details
All experiments were performed on an Ubuntu workstation with an NVIDIA RTX 3090 GPU. Images and masks were resized to 352
In SPSeg, a ResNet-18 model pretrained on ImageNet is used for global context inference and produces feature maps with 512 channels. These feature maps are reduced to 128 channels through adaptive average pooling followed by a linear transformation. Only the final layers of ResNet-18 were included in the training.
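A minimal sketch of this configuration, assuming a torchvision ResNet-18 truncated before its classification head, is given below; the pooling to a single global vector follows the description above, while module and variable names are illustrative.

```python
import torch
import torch.nn as nn
from torchvision import models

class GlobalContextBackbone(nn.Module):
    """ResNet-18 feature extractor with 512 -> 128 channel reduction (sketch)."""
    def __init__(self, d_model: int = 128):
        super().__init__()
        resnet = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
        self.body = nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool + fc
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.proj = nn.Linear(512, d_model)

    def forward(self, x):
        fmap = self.body(x)                              # (B, 512, H/32, W/32)
        g = self.proj(self.pool(fmap).flatten(1))        # (B, 128) global feature vector
        return fmap, g

fmap, g = GlobalContextBackbone()(torch.randn(1, 3, 352, 352))
```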
A fixed number (
The SP–CNN cross-attention module in SPSeg uses
The SPSal module generates edge-sensitive saliency maps using a dual-encoder, attention-enhanced encoder–decoder architecture. The RGB encoder extracts image features, while the SP encoder processes the SP-derived inputs; both encoders follow the same progression of channel dimensions: 32, 64, 128, 256, and 512. Each encoder block consists of convolution, batch normalization, and ReLU operations, with two consecutive
Features from the final encoder layers are concatenated into a
The graph-aware attention mechanism incorporates structural priors using a learnable bias function
The decoder consists of five sequential stages and uses transposed convolutions for learnable upsampling. The channel widths of these stages are 512, 256, 128, 64, and 32, with a final refinement stage also operating at 32 channels. The first two decoder stages incorporate spatial attention blocks to preserve long-range dependencies. At each stage, additional 3
The model was trained for 100 epochs with a batch size of 16. A linear warm-up was applied for the first five epochs at a learning rate of
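One common way to realize such a linear warm-up is a LambdaLR schedule that ramps the learning rate over the first five epochs; the optimizer choice, base learning rate, and post-warm-up behavior in the sketch below are placeholders, since the exact values are not reproduced here.

```python
import torch

model = torch.nn.Linear(10, 1)                      # stand-in for SPSALNet
base_lr = 1e-4                                      # placeholder, not the paper's value
optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr)   # optimizer is assumed

warmup_epochs, total_epochs = 5, 100
def lr_lambda(epoch):
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs          # linear ramp over the first 5 epochs
    return 1.0                                      # keep base LR afterwards (assumption)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
for epoch in range(total_epochs):
    # ... one training epoch over DUTS-TR with batch size 16 ...
    scheduler.step()
```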
We evaluate the proposed model using four widely used metrics in salient object detection: S-measure (
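Two of these metrics have compact standard definitions; the sketch below computes MAE and the F-measure with the conventional β² = 0.3, using the common adaptive threshold of twice the mean saliency (the thresholding convention is an assumption, not necessarily the exact protocol used here).

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a saliency map and a binary mask (both in [0, 1])."""
    return np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean()

def f_measure(pred, gt, beta2=0.3):
    """F-measure with the conventional beta^2 = 0.3 and an adaptive threshold."""
    thr = min(2.0 * pred.mean(), 1.0)               # common adaptive-threshold convention
    binary = pred >= thr
    tp = np.logical_and(binary, gt > 0.5).sum()
    precision = tp / max(binary.sum(), 1)
    recall = tp / max((gt > 0.5).sum(), 1)
    if precision + recall == 0:
        return 0.0
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)
```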
4.4 Comparison with State-of-the-Art Methods
We compare our model against recent state-of-the-art (SOTA) methods, including NASAL [80], UGRAN [53], SODAWideNet++ [82], PiNet [55], SalFAU-Net [81], DASOD [88], GLSTR [75], TCGNet [85], GPONet [79], ADMNet [58], OLER-R [56], MENet [47], BBRF [40], ICON-R [39], VST [45], and U2Net [30]. Outdated or underperforming studies are not included; the comparison is restricted to studies that satisfy the criteria of high accuracy, widespread usage, and architectural diversity.
1) Performance and Efficiency Evaluation
As summarized in Table 4 (ECSSD and PASCAL-S) and Table 5 (DUTS-TE, HKU-IS, and DUT-OMRON), SPSALNet performs competitively against recent methods in terms of the computational cost–accuracy trade-off on five benchmark datasets. The observed cost–accuracy behavior is supported both by the overall architectural design (see the MACs and parameter counts reported in Table 4) and by the rich set of attributes computed for each SP. The multi-level integration of SP priors and image features enables attentive processing that preserves local details while enhancing global context awareness.


On ECSSD, SPSALNet achieves the best alignment score with
On DUTS-TE, SPSALNet ranks within the top two methods across all metrics
On HKU-IS, the model delivers the best alignment and F-measure scores
On DUT-OMRON, SPSALNet clearly outperforms existing methods, achieving the best scores in all metrics with
The achieved accuracy levels are directly attributable to the modular attention mechanisms and multi-feature fusion embedded in the SPSALNet architecture. While SP-level features are learned through categorical grouping and semantic coverage, spatial relationships and appearance similarity are modeled jointly via global context integration. The SEFuse blocks enable seamless blending of SP-based and image-based encoder outputs within the SPSal stage without losing fine structural cues, resulting in sharper boundaries and more reliable predictions in mixed-class and transition regions.
In terms of efficiency, SPSALNet operates with only
The curves in Fig. 8 further demonstrate that SPSALNet consistently produces high-quality saliency maps and achieves a favorable precision–recall trade-off with low false-positive rates. In particular, the method yields the most favorable PR and F-measure profiles on the DUT-OMRON dataset and remains highly competitive on ECSSD, HKU-IS, and DUTS-TE, often matching or surpassing high-performance architectures such as BBRF, GLSTR, and GPONet in the medium and high recall ranges. The fact that the PR curves for some datasets begin only at high recall values (

Figure 8: Comparison of Precision–Recall (PR) and F-measure curves for SPSALNet against state-of-the-art methods. (a–c) PR curves on DUT-OMRON, DUTS-TE, and ECSSD; (d,e) PR curves on HKU-IS and PASCAL-S; (f–h) F-measure curves on DUT-OMRON, DUTS-TE, and ECSSD; and (i,j) F-measure curves on HKU-IS and PASCAL-S
Threshold-based F-measure analyses in Fig. 8 support these observations. SPSALNet attains the highest or near-highest
Overall, SPSALNet is a highly efficient SOTA candidate that achieves
2) Visual Comparison
Table 6 shows the visual performance of the proposed method compared to existing methods. To assess performance across varied scenarios, we group the images into five categories: small, medium, and large objects, multiple-object scenes, and challenging cases.

Due to their limited pixel area and low contrast, small objects are quite difficult to segment. However, the proposed model detects these objects accurately: the globally narrowed search regions provided by the SPSeg model and the local enhancement performed by the SPSal model enable precise detection of fine salient details.
Our model competes with effective methods such as ICON-S and BBRF, especially for medium and large objects. Thanks to SEFuse blocks, low-level structural details and high-level semantic representations are balanced and integrated. This reduces detail loss in large object segmentation while increasing contour accuracy in medium-sized objects.
In scenes with multiple salient objects, the proposed model successfully detects discrete regions. Object co-occurrence, background blending, or missing object problems observed in other methods have been minimized in SPSALNet. The main reason for this is that global context learning based on SP neighborhood relationships can clearly distinguish visual regions in the scene.
The proposed model provides high accuracy in challenging cases (e.g., low contrast, fine details, noisy scenes). In regions with complex edge structures and texture clutter, sharp contours are obtained and segmentation integrity is preserved. Three main technical approaches are behind this success: First, edge-aware SP encoding preserves contour continuity at the representation level. Second, SEFuse blocks suppress low-confidence responses and emphasize boundary regions. Third, an attention-guided decoder structure prevents detail loss and preserves region homogeneity.
Overall, SPSALNet not only detects salient objects accurately in images of varying complexity but also produces sharper, more homogeneous, and cleaner segmentation. These findings confirm that SPSALNet effectively balances accuracy and robustness, providing reliable results even in challenging visual conditions.
Comprehensive ablation studies were conducted to analyze the contributions of the components of SPSALNet. The effectiveness of different structures and design choices in the model architecture was systematically evaluated. All ablation experiments were performed on the ECSSD dataset, with 880 randomly selected samples used for training and the rest for testing.
First, ResNet-18 and ResNet-50 networks were compared to evaluate the effect of the deep network used in global feature extraction. Then, the features calculated at the SP level were divided into three categories, and each of the RGB, histogram, and frequency feature groups was used alone in the model, and the relative contribution of each group was examined.
In addition, cases where the model used only image-based features or only SP features were tested separately, allowing us to investigate how effective each source of information is on its own. Furthermore, the SEFuse blocks were removed to examine how the model performs when feature fusion is disabled.
Finally, experiments were conducted with different SP counts (
These ablation studies quantitatively demonstrate how much each model component contributes, clarifying why the proposed integrated structure achieves high accuracy and precise edge sensitivity.
The proposed model employs ResNet-18 for global context extraction. This backbone’s low parameter count and fast inference make it well suited for real-time applications. Replacing ResNet-18 with a ResNet-50 based extractor revealed how a larger feature extractor affects overall performance.
ResNet-50 delivers richer contextual information through its deeper architecture and higher-dimensional feature representations. Experiments on the ECSSD dataset, however, indicate that the resulting performance gains are modest. Table 7 presents a comparison of the two backbone architectures.

With other components kept unchanged, these experiments confirm that the more compact ResNet-18 backbone provides an effective balance of accuracy and efficiency. The model’s main results were obtained using ResNet-18 on the DUTS-TR training set.
In order to analyze the contribution of SP features to model performance, ablation experiments were carried out for each feature category.
The contribution of each feature is shown in Table 8. Using the full feature set yielded the best overall results. When the RGB-based statistics were removed, the F-measure dropped and the MAE increased. Omitting the histogram and frequency features likewise caused a noticeable drop in accuracy, highlighting their contribution to the model’s contextual and structural sensitivity.

4.5.3 Relative Contribution of Image and SP Information
In order to analyze how effectively the model uses image- and SP-based information sources, two basic ablation experiments were performed, testing each component separately, as shown in Table 9.

In the first scenario, SP features were excluded, and the saliency map was estimated solely based on contextual features extracted by the ResNet backbone. This led to slight decreases in the F-measure and E-measure and a small rise in MAE, highlighting the contribution of SP representations to better accuracy and edge precision.
In the second scenario, only SP features were used and global context extraction from the image was disabled. This led to a noticeable drop in the F-measure and S-measure scores and a slight increase in MAE, indicating that the absence of image-based context reduced visual integrity and boundary sensitivity.
These two experiments show that SP- and image-based information sources are complementary; the highest performance is achieved with the full structure where both sources are used together.
SEFuse blocks used in the decoder stage of the model are designed to perform channel-level adaptive feature fusion between image and SP encoder outputs. After combining different information sources on the channel axis, these blocks re-weight them with a squeeze–excitation mechanism, thus providing channel-wise content adaptivity.
To analyze the contribution of these structures, SEFuse blocks were completely removed from the model and only direct feature fusion was performed on the relevant layers. After this change, the model’s F-measure dropped by 2.9 pp, while the S-measure and E-measure decreased by around 3.8 pp each.
4.5.5 The Effect of SP Count and SP Segmentation Algorithms on Performance
To analyze the effects of different SP segmentation algorithms and SP counts on model performance, systematic ablation experiments were conducted using both the SEEDS [98] and LSC [99] algorithms. SP counts of 98, 196, and 256 were tested for both algorithms, while keeping all other model components unchanged. The numerical results are provided in Table 10, and the visual examples can be seen in Fig. 9.
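For reference, both SP algorithms are available in OpenCV's ximgproc module (opencv-contrib); the sketch below generates approximately K superpixels with each, with parameter values that are illustrative rather than the exact settings used in these experiments.

```python
import cv2
import numpy as np

def seeds_labels(img_bgr, num_sp=196, num_levels=4, iters=10):
    """SEEDS superpixels via opencv-contrib (cv2.ximgproc)."""
    h, w = img_bgr.shape[:2]
    seeds = cv2.ximgproc.createSuperpixelSEEDS(w, h, img_bgr.shape[2], num_sp, num_levels)
    seeds.iterate(img_bgr, iters)
    return seeds.getLabels(), seeds.getNumberOfSuperpixels()

def lsc_labels(img_bgr, num_sp=196, ratio=0.075, iters=10):
    """LSC superpixels; region size is derived from the requested superpixel count."""
    h, w = img_bgr.shape[:2]
    region_size = int(np.sqrt(h * w / num_sp))
    lsc = cv2.ximgproc.createSuperpixelLSC(img_bgr, region_size, ratio)
    lsc.iterate(iters)
    return lsc.getLabels(), lsc.getNumberOfSuperpixels()

img = cv2.imread("example.jpg")                      # hypothetical input path
labels, n = seeds_labels(img, num_sp=196)
```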


Figure 9: Comparison of superpixel segmentations obtained at different superpixel levels (
The results for the SEEDS algorithm show a slight but consistent improvement in model accuracy as the number of SPs increases. Notably, the highest S-measure and F-measure values were obtained when
In the experiments performed with the LSC algorithm, the effect of the SP count on performance was much more pronounced. With a low number of SPs, model performance dropped significantly, particularly in terms of the F-measure and MAE. However, as the number of SPs increased, especially at the
In the comparison carried out for
SPSALNet employs a two-level SP-based segmentation strategy; however, the architectural design introduces specific limitations. The method relies on the hierarchical clustering of SPs, the stability of their semantic composition across the image space, and the consistency of the transition regions formed by SPs. In complex scenes, the interaction among these factors reveals practical constraints inherent to the approach. Fig. 10 shows several representative cases where the performance of SP classification, granularity, and localized refinement is suboptimal.

Figure 10: Representative failure cases. Each column corresponds to one image example, showing the original image (top), ground truth (middle), and the SPSALNet prediction (bottom)
A recurring limitation arises in scenes containing multiple objects with highly similar shapes and appearances, for example, airplanes in aerial imagery (Fig. 10, row 4). The SPSeg module relies on SP tokens whose class assignments are determined according to Eq. (22), which assumes that each SP exhibits a dominant semantic identity. When several visually similar objects lie in close proximity, the Transformer encoder in SPSeg tends to pull neighboring SP tokens toward the same class cluster due to strong feature homogeneity. As a result, the separation between
When SPs become very small, their semantic composition becomes unstable because even minor pixel-level fluctuations can shift a region from predominantly foreground to predominantly background, or vice versa. In scenes such as Fig. 10 (row 4), boundary-aligned SPs may contain nearly equal amounts of object and background pixels, making their class assignment highly ambiguous. Such SPs are frequently pushed into the
Another category of failures arises when thin structural details or closely adjacent object parts are present (Fig. 10, row 3). The SPSeg module often merges SPs across narrow gaps because such regions lack a clear dominant label under Eq. (7). As a result, boundary SPs in
Similar limitations appear in multi-scale scenarios where small objects are represented by only a few SPs (Fig. 10, row 2). Their
Lastly, elongated or filament-like structures present a persistent challenge (Fig. 10, row 1). Along such shapes, SP decomposition often becomes fragmented or low-contrast, increasing classification ambiguity within the
Overall, the failure cases indicate that SPSALNet tends to be mainly challenged by similarity-induced token fusion, adjacency-driven merging, scale imbalance, and structural thinness. These shortcomings arise from fundamental architectural trade-offs: (i) SP-based macro grouping vs. semantic purity, (ii) lightweight local refinement vs. global context modeling, and (iii) transition-focused optimization vs. holistic saliency preservation.
In this paper, we propose SPSALNet, a two-stage salient object detection framework that simultaneously prioritizes structural preservation and computational efficiency. The proposed model performs global-level segmentation using a superpixel-based Transformer classifier, while refining details through a U-Net architecture that focuses on uncertain and transition regions at the local level. In the macro stage, superpixel features and global contextual cues are jointly processed through an SP–CNN cross-attention block and a graph-aware Transformer encoder, which together enhance semantic coherence and boundary awareness. In the micro stage, the SPSal module refines only the boundary band using dual encoders and SEFuse-based feature fusion, allowing the network to allocate most of its capacity to structurally ambiguous regions.
Comprehensive experimental analyses on five public SOD benchmarks have shown that SPSALNet can provide satisfactory results in terms of both structural and content accuracy with a low computational load, achieving state-of-the-art or highly competitive performance while maintaining a favorable accuracy–efficiency trade-off. Therefore, the model is considered suitable for use in real-time or resource-constrained applications. Nevertheless, SPSALNet still exhibits limitations in scenes where salient objects are extremely small, filament-like, or represented by very few or unstable superpixels, and in cases where the initial SP segmentation quality is poor. In such settings, the fixed SP granularity and the reliance on a predefined SP decomposition may enlarge the ambiguous boundary band and reduce the effectiveness of the refinement stage.
Possible future research directions include making the superpixel segmentation stage more flexible (e.g., adaptively choosing the SP granularity or jointly learning SP partitions with the backbone), integrating uncertainty-focused learning strategies more deeply into both SPSeg and SPSal, and evaluating the applicability of the proposed architecture to different input modalities and tasks (such as RGB-D, RGB-T, or medical image segmentation). These extensions would further strengthen the robustness of SPSALNet and help mitigate the observed failure cases in highly complex or structurally extreme scenes.
Acknowledgement: The authors would like to thank Professor Cabir Vural for his valuable feedback and guidance during the development of this work, which helped improve the quality of this manuscript.
Funding Statement: The authors received no specific funding for this study.
Author Contributions: The authors confirm contribution to the paper as follows: Conceptualization, Methodology and Software, Burhan Baraklı and Can Yüzkollar; Validation and Software, Tuğrul Taşçı; Writing—original draft preparation and Ablation studies, İbrahim Yıldırım. All authors reviewed the results and approved the final version of the manuscript.
Availability of Data and Materials: The source code and the masks generated in this study will be publicly available at https://github.com/burhanbarakli/SPSALNet. All other datasets used in the experiments are publicly available and have been referenced in Section 4 of the manuscript.
Ethics Approval: Not applicable. This study did not involve human participants or animals, and no ethical approval was required.
Conflicts of Interest: The authors declare no conflicts of interest to report regarding the present study.
¹ GMACs are computed using only forward convolution and linear-layer MAC operations for a single input; activation and batch normalization operations are excluded.
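One way to obtain such a count is to hook the convolution and linear layers and accumulate their multiply-accumulate operations during a single forward pass. The sketch below (PyTorch, with an assumed 352x352 input resolution and a model whose forward takes a single image tensor) follows the stated convention but is not the authors' measurement script.

```python
import torch
import torch.nn as nn

def count_gmacs(model: nn.Module, input_shape=(1, 3, 352, 352)) -> float:
    """Accumulate MACs of Conv2d and Linear layers over one forward pass and
    return the total in GMACs; activation and normalization ops are ignored,
    matching the convention in the footnote. Input resolution is an assumption."""
    macs = 0

    def hook(module, inputs, output):
        nonlocal macs
        if isinstance(module, nn.Conv2d):
            # MACs per output element = kernel_h * kernel_w * (in_channels / groups)
            k = module.kernel_size[0] * module.kernel_size[1]
            macs += output.numel() * k * (module.in_channels // module.groups)
        elif isinstance(module, nn.Linear):
            macs += output.numel() * module.in_features

    handles = [m.register_forward_hook(hook)
               for m in model.modules()
               if isinstance(m, (nn.Conv2d, nn.Linear))]
    with torch.no_grad():
        model(torch.zeros(*input_shape))
    for h in handles:
        h.remove()
    return macs / 1e9
```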
References
1. Wang X, Yu S, Lim EG, Wong MLD. Salient object detection: a mini review. Front Signal Process. 2024;4:1356793. doi:10.3389/frsip.2024.1356793. [Google Scholar] [CrossRef]
2. Ullah I, Jian M, Hussain S, Guo J, Yu H, Wang X, et al. A brief survey of visual saliency detection. Multimed Tools Appl. 2020;79(45–46):34605–45. doi:10.1007/s11042-020-08849-y. [Google Scholar] [CrossRef]
3. Ren Z, Gao S, Chia LT, Tsang IWH. Region-based saliency detection and its application in object recognition. IEEE Trans Circ Syst Video Technol. 2014;24(5):769–79. doi:10.1109/tcsvt.2013.2280096. [Google Scholar] [CrossRef]
4. Hussain N, Khan MA, Sharif M, Khan SA, Albesher AA, Saba T, et al. A deep neural network and classical features based scheme for objects recognition: an application for machine inspection. Multimed Tools Appl. 2020;83(5):14935–57. doi:10.1007/s11042-020-08852-3. [Google Scholar] [CrossRef]
5. Shelhamer E, Long J, Darrell T. Fully convolutional networks for semantic segmentation. IEEE Trans Pattern Anal Mach Intell. 2017;39(4):640–51. doi:10.1109/TPAMI.2016.2572683. [Google Scholar] [PubMed] [CrossRef]
6. Itti L. Automatic foveation for video compression using a neurobiological model of visual attention. IEEE Trans Image Process. 2004;13(10):1304–18. doi:10.1109/TIP.2004.834657. [Google Scholar] [PubMed] [CrossRef]
7. Ma S, Yang JJ. Image-based vehicle classification by synergizing features from supervised and self-supervised learning paradigms. Eng. 2023;4(1):444–56. doi:10.3390/eng4010027. [Google Scholar] [CrossRef]
8. Chen B, Li P, Sun C, Wang D, Yang G, Lu H. Multi attention module for visual tracking. Pattern Recognit. 2019;87:80–93. doi:10.1016/j.patcog.2018.10.005. [Google Scholar] [CrossRef]
9. Wang W, Shen J, Ling H. A deep network solution for attention and aesthetics aware photo cropping. IEEE Trans Pattern Anal Mach Intell. 2019;41(7):1531–44. doi:10.1109/TPAMI.2018.2840724. [Google Scholar] [PubMed] [CrossRef]
10. Achanta R, Süsstrunk S. Saliency detection for content-aware image resizing. In: 2009 16th IEEE International Conference on Image Processing (ICIP). Piscataway, NJ, USA: IEEE; 2009. p. 1005–8. doi:10.1109/ICIP.2009.5413815. [Google Scholar] [CrossRef]
11. Hussain T, Anwar S, Ullah A, Muhammad K, Baik SW. Densely deformable efficient salient object detection network. arXiv:2102.06407. 2021. [Google Scholar]
12. Zhang D, Tian H, Han J. Few-cost salient object detection with adversarial-paced learning. arXiv:2104.01928. 2021. [Google Scholar]
13. Elmezain M, Saad Saoud L, Sultan A, Heshmat M, Seneviratne L, Hussain I. Advancing underwater vision: a survey of deep learning models for underwater object recognition and tracking. IEEE Access. 2025;13:17830–67. doi:10.1109/access.2025.3534098. [Google Scholar] [CrossRef]
14. Zhang H, Wang S, Liu T. Remote sensing image change detection combined with saliency. IEEE Sens J. 2024;24(11):18108–21. doi:10.1109/jsen.2024.3390674. [Google Scholar] [CrossRef]
15. Joshi A, Saquib Khan M, Choi KN. Medical image segmentation using combined level set and saliency analysis. IEEE Access. 2024;12:102016–26. doi:10.1109/access.2024.3431995. [Google Scholar] [CrossRef]
16. Bougourzi F, Hadid A. Recent advances in medical imaging segmentation: a survey. arXiv:2505.09274. 2025. [Google Scholar]
17. Zhou W, Wang Y, Qian X. Knowledge distillation and contrastive learning for detecting visible-infrared transmission lines using separated stagger registration network. IEEE Trans Circ Syst I Regul Pap. 2025;72(8):4140–52. doi:10.1109/tcsi.2024.3521933. [Google Scholar] [CrossRef]
18. Arvanitis G, Stagakis N, Zacharaki EI, Moustakas K. Cooperative saliency-based pothole detection and AR rendering for increased situational awareness. IEEE Trans Intell Transp Syst. 2024;25(5):3588–604. doi:10.1109/tits.2023.3327494. [Google Scholar] [CrossRef]
19. Zhou W, Ju Z, Cong R, Yan W. RCNet: dual-network resonance collaboration via mutual learning for RGB-D road defect detection. IEEE Trans Circ Syst Video Technol. 2025:1. doi:10.1109/tcsvt.2025.3617769. [Google Scholar] [CrossRef]
20. Li X, Li Y, Shen C, Dick A, Van Den Hengel A. Contextual hypergraph modeling for salient object detection. In: 2013 IEEE International Conference on Computer Vision. Piscataway, NJ, USA: IEEE; 2013. p. 3328–35. doi:10.1109/iccv.2013.413. [Google Scholar] [CrossRef]
21. Filali I, Allili MS, Benblidia N. Multi-scale salient object detection using graph ranking global-local saliency refinement. Signal Process Image Commun. 2016;47:380–401. doi:10.1016/j.image.2016.07.007. [Google Scholar] [CrossRef]
22. Wang X, Ma H, Chen X. Geodesic weighted Bayesian model for saliency optimization. Pattern Recognit Lett. 2016;75:1–8. doi:10.1016/j.patrec.2016.02.008. [Google Scholar] [CrossRef]
23. Li H, Lu H, Lin Z, Shen X, Price B. Inner and inter label propagation: salient object detection in the wild. IEEE Trans Image Process. 2015;24(10):3176–86. doi:10.1109/TIP.2015.2440174. [Google Scholar] [PubMed] [CrossRef]
24. Yang J, Yang MH. Top-down visual saliency via joint CRF and dictionary learning. IEEE Trans Pattern Anal Mach Intell. 2017;39(3):576–88. doi:10.1109/TPAMI.2016.2547384. [Google Scholar] [PubMed] [CrossRef]
25. Wang L, Chen R, Zhu L, Xie H, Li X. Deep sub-region network for salient object detection. IEEE Trans Circ Syst Video Technol. 2021;31(2):728–41. doi:10.1109/TCSVT.2020.2988768. [Google Scholar] [CrossRef]
26. Zhang J, Sclaroff S, Lin Z, Shen X, Price B, Mech R. Unconstrained salient object detection via proposal subset optimization. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ, USA: IEEE; 2016. p. 5733–42. doi:10.1109/cvpr.2016.618. [Google Scholar] [CrossRef]
27. Su J, Li J, Zhang Y, Xia C, Tian Y. Selectivity or invariance: boundary-aware salient object detection. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway, NJ, USA: IEEE; 2019. p. 3798–807. doi:10.1109/ICCV.2019.00390. [Google Scholar] [CrossRef]
28. He S, Jiao J, Zhang X, Han G, Lau RWH. Delving into salient object subitizing and detection. In: 2017 IEEE International Conference on Computer Vision (ICCV). Piscataway, NJ, USA: IEEE; 2017. p. 1059–67. doi:10.1109/ICCV.2017.120. [Google Scholar] [CrossRef]
29. Qin X, Zhang Z, Huang C, Gao C, Dehghan M, Jagersand M. BASNet: boundary-aware salient object detection. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ, USA: IEEE; 2019. p. 7471–81. [Google Scholar]
30. Qin X, Zhang Z, Huang C, Dehghan M, Zaiane OR, Jagersand M. U2-Net: going deeper with nested U-structure for salient object detection. Pattern Recognit. 2020;106:107404. doi:10.1016/j.patcog.2020.107404. [Google Scholar] [CrossRef]
31. Wang L, Lu H, Wang Y, Feng M, Wang D, Yin B, et al. Learning to detect salient objects with image-level supervision. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ, USA: IEEE; 2017. p. 3796–805. doi:10.1109/cvpr.2017.404. [Google Scholar] [CrossRef]
32. Zhang J, Yu X, Li A, Song P, Liu B, Dai Y. Weakly-supervised salient object detection via scribble annotations. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ, USA: IEEE; 2020. p. 12543–52. doi:10.1109/cvpr42600.2020.01256. [Google Scholar] [CrossRef]
33. Nguyen DT, Dax M, Mummadi CK, Ngo TPN, Nguyen THP, Lou Z, et al. DeepUSPS: deep robust unsupervised saliency prediction with self-supervision. In: 33rd Conference on Neural Information Processing Systems (NeurIPS 2019); 2019 Dec 8–14; Vancouver, BC, Canada. p. 1–11. [Google Scholar]
34. Zhou H, Chen P, Yang L, Lai J, Xie X. Activation to saliency: forming high-quality labels for completely unsupervised salient object detection. IEEE Trans Circ Syst Video Technol. 2023;33(2):743–55. doi:10.1109/tcsvt.2022.3203595. [Google Scholar] [CrossRef]
35. Tang Y, Wu X. Salient object detection using cascaded convolutional neural networks and adversarial learning. IEEE Trans Multimed. 2019;21(9):2237–47. doi:10.1109/TMM.2019.2900908. [Google Scholar] [CrossRef]
36. Cai Y, Dai L, Wang H, Chen L, Li Y. A novel saliency detection algorithm based on adversarial learning model. IEEE Trans Image Process. 2020;29:4489–504. doi:10.1109/TIP.2020.2972692. [Google Scholar] [PubMed] [CrossRef]
37. Zhao R, Ouyang W, Li H, Wang X. Saliency detection by multi-context deep learning. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ, USA: IEEE; 2015. p. 1265–74. doi:10.1109/CVPR.2015.7298731. [Google Scholar] [CrossRef]
38. Liu N, Han J, Yang MH. PiCANet: pixel-wise contextual attention learning for accurate saliency detection. IEEE Trans Image Process. 2020;29:6438–51. doi:10.1109/TIP.2020.2988568. [Google Scholar] [PubMed] [CrossRef]
39. Zhuge M, Fan DP, Liu N, Zhang D, Xu D, Shao L. Salient object detection via integrity learning. IEEE Trans Pattern Anal Mach Intell. 2023;45(3):3738–52. doi:10.1109/TPAMI.2022.3179526. [Google Scholar] [PubMed] [CrossRef]
40. Ma M, Xia C, Xie C, Chen X, Li J. Boosting broader receptive fields for salient object detection. IEEE Trans Image Process. 2023;32:1026–38. doi:10.1109/TIP.2022.3232209. [Google Scholar] [PubMed] [CrossRef]
41. Hu C, Guo J, Xie H, Zhu Q, Yuan B, Gao Y, et al. CGDINet: a deep learning-based salient object detection algorithm. IEEE Access. 2025;13:4697–723. doi:10.1109/ACCESS.2024.3525303. [Google Scholar] [CrossRef]
42. Xia X, Ma Y. Cross-stage feature fusion and efficient self-attention for salient object detection. J Vis Commun Image Rep. 2024;104:104271. doi:10.1016/j.jvcir.2024.104271. [Google Scholar] [CrossRef]
43. Zhao T, Wu X. Pyramid feature attention network for saliency detection. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ, USA: IEEE; 2019. p. 3080–9. [Google Scholar]
44. Hussain T, Anwar A, Anwar S, Petersson L, Wook Baik S. Pyramidal attention for saliency detection. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Piscataway, NJ, USA: IEEE; 2022. p. 2877–87. [Google Scholar]
45. Liu N, Zhang N, Wan K, Shao L, Han J. Visual saliency transformer. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway, NJ, USA: IEEE; 2021. p. 4702–12. [Google Scholar]
46. Liu N, Luo Z, Zhang N, Han J. VST++: efficient and stronger visual saliency transformer. IEEE Trans Pattern Anal Mach Intell. 2024;46(11):7300–16. doi:10.1109/tpami.2024.3388153. [Google Scholar] [CrossRef]
47. Wang Y, Wang R, Fan X, Wang T, He X. Pixels, regions, and objects: multiple enhancement for salient object detection. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ, USA: IEEE; 2023. p. 10031–40. [Google Scholar]
48. Hou Q, Cheng MM, Hu X, Borji A, Tu Z, Torr PHS. Deeply supervised salient object detection with short connections. IEEE Trans Pattern Anal Mach Intell. 2019;41(4):815–28. doi:10.1109/TPAMI.2018.2815688. [Google Scholar] [PubMed] [CrossRef]
49. Luo Z, Mishra A, Achkar A, Eichel J, Li S, Jodoin PM. Non-local deep features for salient object detection. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ, USA: IEEE; 2017. p. 6593–601. doi:10.1109/CVPR.2017.698. [Google Scholar] [CrossRef]
50. Wei J, Wang S, Huang Q. F3Net: fusion, feedback and focus for salient object detection. Proc AAAI Conf Artif Intell. 2020;34(7):12321–8. doi:10.1609/aaai.v34i07.6916. [Google Scholar] [CrossRef]
51. Wu Z, Su L, Huang Q. Stacked cross refinement network for edge-aware salient object detection. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway, NJ, USA: IEEE; 2019. p. 7263–72. [Google Scholar]
52. Zhao J, Liu JJ, Fan DP, Cao Y, Yang J, Cheng MM. EGNet: edge guidance network for salient object detection. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway, NJ, USA: IEEE; 2019. p. 8778–87. doi:10.1109/ICCV.2019.00887. [Google Scholar] [CrossRef]
53. Yuan Y, Gao P, Dai Q, Qin J, Xiang W. Uncertainty-guided refinement for fine-grained salient object detection. IEEE Trans Image Process. 2025;34:2301–14. doi:10.1109/TIP.2025.3557562. [Google Scholar] [PubMed] [CrossRef]
54. Zhu J, Qin X, Elsaddik A. DC-Net: divide-and-conquer for salient object detection. Pattern Recognit. 2025;157:110903. doi:10.1016/j.patcog.2024.110903. [Google Scholar] [CrossRef]
55. Wang X, Liu Z, Liesaputra V, Huang Z. Feature specific progressive improvement for salient object detection. Pattern Recognit. 2024;147:110085. doi:10.1016/j.patcog.2023.110085. [Google Scholar] [CrossRef]
56. Yao Z, Wang L. Object localization and edge refinement network for salient object detection. Expert Syst Appl. 2023;213(B):118973. doi:10.1016/j.eswa.2022.118973. [Google Scholar] [CrossRef]
57. Chi J, Wu C, Yu X, Chu H, Ji P. Saliency detection via integrating deep learning architecture and low-level features. Neurocomputing. 2019;352:75–92. doi:10.1016/j.neucom.2019.03.070. [Google Scholar] [CrossRef]
58. Zhou X, Shen K, Liu Z. ADMNet: attention-guided densely multi-scale network for lightweight salient object detection. IEEE Trans Multimed. 2024;26:10828–41. doi:10.1109/TMM.2024.3413529. [Google Scholar] [CrossRef]
59. Huang L, Li G, Li Y, Lin L. Lightweight adversarial network for salient object detection. Neurocomputing. 2020;381:130–40. doi:10.1016/j.neucom.2019.09.100. [Google Scholar] [CrossRef]
60. He S, Lau RWH, Liu W, Huang Z, Yang Q. SuperCNN: a superpixelwise convolutional neural network for salient object detection. Int J Comput Vis. 2015;115(3):330–44. doi:10.1007/s11263-015-0822-0. [Google Scholar] [CrossRef]
61. Li Y, Cui F, Xue X, Cheung-Wai CJ. Coarse-to-fine salient object detection based on deep convolutional neural networks. Signal Process Image Commun. 2018;64:21–32. doi:10.1016/j.image.2018.01.012. [Google Scholar] [CrossRef]
62. Zhang A, Ren W, Liu Y, Cao X. Lightweight image super-resolution with superpixel token interaction. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway, NJ, USA: IEEE; 2023. p. 12682–91. [Google Scholar]
63. Zhang S, Chen Y, Sun Y, Wang F, Yang J, Bai L, et al. Superpixel semantics representation and pre-training for vision-language tasks. Neurocomputing. 2025;615:128895. doi:10.1016/j.neucom.2024.128895. [Google Scholar] [CrossRef]
64. Chen J, Zhang H, Gong M, Gao Z. Collaborative compensative transformer network for salient object detection. Pattern Recognit. 2024;154:110600. doi:10.1016/j.patcog.2024.110600. [Google Scholar] [CrossRef]
65. Tong N, Lu H, Zhang L, Ruan X. Saliency detection with multi-scale superpixels. IEEE Signal Process Lett. 2014;21(9):1035–9. doi:10.1109/LSP.2014.2323407. [Google Scholar] [CrossRef]
66. Zhang M, Wu Y, Du Y, Fang L, Pang Y. Saliency detection integrating global and local information. J Vis Commun Image Rep. 2018;53:215–23. doi:10.1016/j.jvcir.2018.03.019. [Google Scholar] [CrossRef]
67. Qiu Y, Mei J, Xu J. Superpixel-wise contrast exploration for salient object detection. Knowl Based Syst. 2024;292:111617. doi:10.1016/j.knosys.2024.111617. [Google Scholar] [CrossRef]
68. Wang S, Ning Y, Li XM, Zhang C. Saliency detection via manifold ranking on multi-layer graph. IEEE Access. 2024;12:6615–27. doi:10.1109/access.2023.3347812. [Google Scholar] [CrossRef]
69. Li S, Lu H, Lin Z, Shen X, Price B. Adaptive metric learning for saliency detection. IEEE Trans Image Process. 2015;24(11):3321–31. doi:10.1109/TIP.2015.2440755. [Google Scholar] [PubMed] [CrossRef]
70. Li G, Yu Y. Visual saliency detection based on multiscale deep CNN features. IEEE Trans Image Process. 2016;25(11):5012–24. doi:10.1109/TIP.2016.2602079. [Google Scholar] [PubMed] [CrossRef]
71. Liu JJ, Hou Q, Cheng MM, Feng J, Jiang J. A simple pooling-based design for real-time salient object detection. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ, USA: IEEE; 2019. p. 3912–21. doi:10.1109/cvpr.2019.00404. [Google Scholar] [CrossRef]
72. Pang Y, Zhao X, Zhang L, Lu H. Multi-scale interactive network for salient object detection. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ, USA: IEEE; 2020. p. 9410–9. [Google Scholar]
73. Zhao X, Pang Y, Zhang L, Lu H, Zhang L. Suppress and balance: a simple gated network for salient object detection. In: Computer Vision—ECCV 2020. Vol. 12347. Cham, Switzerland: Springer; 2020. p. 35–51. doi:10.1007/978-3-030-58536-5_3. [Google Scholar] [CrossRef]
74. Peng Z, Guo Z, Huang W, Wang Y, Xie L, Jiao J, et al. Conformer: local features coupling global representations for recognition and detection. IEEE Trans Pattern Anal Mach Intell. 2023;45(8):9454–68. doi:10.1109/tpami.2023.3243048. [Google Scholar] [PubMed] [CrossRef]
75. Ren S, Zhao N, Wen Q, Han G, He S. Unifying global-local representations in salient object detection with transformers. IEEE Trans Emerg Topics Comput Intell. 2024;8(4):2870–9. doi:10.1109/tetci.2024.3380442. [Google Scholar] [CrossRef]
76. Li S, Wang R, Zhou F, Wang Y, Guo N. Multi-scale feature enhancement for saliency object detection algorithm. IEEE Access. 2023;11:103511–20. doi:10.1109/ACCESS.2023.3317901. [Google Scholar] [CrossRef]
77. Li Z, Wu J, Wang G. The application of FLICSP algorithm based on multi-feature fusion in image saliency detection. IEEE Access. 2024;12:2100–12. doi:10.1109/ACCESS.2023.3347583. [Google Scholar] [CrossRef]
78. Yang Q, Zheng J, Chen J. Multilevel diverse feature aggregation network for salient object detection. Neurocomputing. 2025;628:129648. doi:10.1016/j.neucom.2025.129648. [Google Scholar] [CrossRef]
79. Yi Y, Zhang N, Zhou W, Shi Y, Xie G, Wang J. GPONet: a two-stream gated progressive optimization network for salient object detection. Pattern Recognit. 2024;150:110330. doi:10.1016/j.patcog.2024.110330. [Google Scholar] [CrossRef]
80. Liu ZA, Liu JJ. Towards efficient salient object detection via U-shape architecture search. Knowl Based Syst. 2025;318:113515. doi:10.1016/j.knosys.2025.113515. [Google Scholar] [CrossRef]
81. Mulat KA, Feng Z, Eshetie TS, Hasen AE. SalFAU-Net: saliency fusion attention U-Net for salient object detection. arXiv:2405.02906. 2024. [Google Scholar]
82. Dulam RVS, Kambhamettu C. SODAWideNet++: combining attention and convolutions for salient object detection. In: Antonacopoulos A, Chaudhuri S, Chellappa R, Liu CL, Bhattacharya S, Pal U, editors. Pattern recognition. Vol. 15310. Cham, Switzerland: Springer Nature; 2025. p. 210–26. doi:10.1007/978-3-031-78192-6_14. [Google Scholar] [CrossRef]
83. Deng Z, Hu X, Zhu L, Xu X, Qin J, Han G, et al. R³Net: recurrent residual refinement network for saliency detection. In: IJCAI'18: Proceedings of the 27th International Joint Conference on Artificial Intelligence. Palo Alto, CA, USA: AAAI Press; 2018. p. 684–90. doi:10.24963/ijcai.2018/95. [Google Scholar] [CrossRef]
84. Liu Y, Zhang Q, Zhang D, Han J. Employing deep part-object relationships for salient object detection. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV); 2019 Oct 27–Nov 2; Seoul, Republic of Korea. Piscataway, NJ, USA: IEEE; 2019. p. 1232–41. [Google Scholar]
85. Liu Y, Zhou L, Wu G, Xu S, Han J. TCGNet: type-correlation guidance for salient object detection. IEEE Trans Intell Transp Syst. 2024;25(7):6633–44. doi:10.1109/tits.2023.3342811. [Google Scholar] [CrossRef]
86. Chen Z, Xu Q, Cong R, Huang Q. Global context-aware progressive aggregation network for salient object detection. Proc AAAI Conf Artif Intell. 2020;34(7):10599–606. doi:10.1609/aaai.v34i07.6633. [Google Scholar] [CrossRef]
87. Wei J, Wang S, Wu Z, Su C, Huang Q, Tian Q. Label decoupling framework for salient object detection. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ, USA: IEEE; 2020. p. 13022–31. [Google Scholar]
88. Asheghi B, Salehpour P, Khiavi AM, Hashemzadeh M, Monajemi A. DASOD: detail-aware salient object detection. Image Vis Comput. 2024;148:105154. doi:10.1016/j.imavis.2024.105154. [Google Scholar] [CrossRef]
89. Fang Y, Zhang H, Yan J, Jiang W, Liu Y. UDNet: uncertainty-aware deep network for salient object detection. Pattern Recognit. 2023;134:109099. doi:10.1016/j.patcog.2022.109099. [Google Scholar] [CrossRef]
90. Ke YY, Tsubono T. Recursive contour-saliency blending network for accurate salient object detection. In: 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). Piscataway, NJ, USA: IEEE; 2022. p. 1360–70. doi:10.1109/wacv51458.2022.00143. [Google Scholar] [CrossRef]
91. Peng Y, Zhai Z, Feng M. RGB-D salient object detection based on cross-modal and cross-level feature fusion. IEEE Access. 2024;12:45134–46. doi:10.1109/access.2024.3381524. [Google Scholar] [CrossRef]
92. Yao C, Feng L, Kong Y, Xiao L, Chen T. Transformers and CNNs fusion network for salient object detection. Neurocomputing. 2023;520:342–55. doi:10.1016/j.neucom.2022.10.081. [Google Scholar] [CrossRef]
93. Zhou W, Zhu Y, Lei J, Wan J, Yu L. CCAFNet: crossflow and cross-scale adaptive fusion network for detecting salient objects in RGB-D images. IEEE Trans Multimed. 2022;24:2192–204. doi:10.1109/tmm.2021.3077767. [Google Scholar] [CrossRef]
94. Hou J, Wen H, Zhao Q, Ma J, Li Z, Wang S, et al. A comprehensive survey on the research and development of RGB-T salient object detection. In: Computational visual media. Beijing, China: Tsinghua University Press; 2025. p. 1–31. doi:10.26599/CVM.2025.9450501. [Google Scholar] [CrossRef]
95. Wang K, Chen K, Li C, Tu Z, Luo B. Alignment-Free RGB-T salient object detection: a large-scale dataset and progressive correlation network. arXiv:2412.14576. 2024. [Google Scholar]
96. Zhou W, Guo Q, Lei J, Yu L, Hwang JN. ECFFNet: effective and consistent feature fusion network for RGB-T salient object detection. IEEE Trans Circ Syst Video Technol. 2022;32(3):1224–35. doi:10.1109/tcsvt.2021.3077058. [Google Scholar] [CrossRef]
97. Zhou W, Sun F, Jiang Q, Cong R, Hwang JN. WaveNet: wavelet network with knowledge distillation for RGB-T salient object detection. IEEE Trans Image Process. 2023;32:3027–39. doi:10.1109/tip.2023.3275538. [Google Scholar] [PubMed] [CrossRef]
98. Van Den Bergh M, Boix X, Roig G, Van Gool L. SEEDS: superpixels extracted via energy-driven sampling. Int J Comput Vis. 2015;111(3):298–314. doi:10.1007/s11263-014-0744-2. [Google Scholar] [CrossRef]
99. Li Z, Chen J. Superpixel segmentation using linear spectral clustering. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ, USA: IEEE; 2015. p. 1356–63. [Google Scholar]
100. Maneewongvatana S, Mount D. Analysis of approximate nearest neighbor searching with clustered point sets. In: Data structures, near neighbor searches, and methodology: fifth and sixth DIMACS implementation challenges. Providence, RI, USA: American Mathematical Society; 2002. p. 105–23. doi:10.1090/dimacs/059/06. [Google Scholar] [CrossRef]
101. Yan Q, Xu L, Shi J, Jia J. Hierarchical saliency detection. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ, USA: IEEE; 2013. p. 1155–62. doi:10.1109/CVPR.2013.153. [Google Scholar] [CrossRef]
102. Li G, Yu Y. Visual saliency based on multiscale deep features. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ, USA: IEEE; 2015. p. 5455–63. doi:10.1109/CVPR.2015.7299184. [Google Scholar] [CrossRef]
103. Yang C, Zhang L, Lu H, Ruan X, Yang MH. Saliency detection via graph-based manifold ranking. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ, USA: IEEE; 2013. p. 3166–73. doi:10.1109/CVPR.2013.407. [Google Scholar] [CrossRef]
104. Li Y, Hou X, Koch C, Rehg JM, Yuille AL. The secrets of salient object segmentation. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ, USA: IEEE; 2014. p. 280–7. doi:10.1109/CVPR.2014.43. [Google Scholar] [CrossRef]
105. Kim A. Fast Slic. 2019 [cited 2025 Nov 23]. Available from: https://github.com/Algy/fast-slic. [Google Scholar]
Copyright © 2026 The Author(s). Published by Tech Science Press. This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

