Open Access
ARTICLE
FSS: Focusing on Suboptimal Samples for Detector-Agnostic Label Assignment in Object Detection
1 Hunan Intelligent Rehabilitation Robot and Auxiliary Equipment Engineering Technology Research Center, Changsha, China
2 School of Computer and Artificial Intelligence (School of Software), Huaihua University, Huaihua, China
3 School of Business, Hunan Normal University, Changsha, China
4 College of Information Science and Engineering, Hunan Normal University, Changsha, China
* Corresponding Authors: Jinping Liu. Email: ; Yimei Yang. Email:
(This article belongs to the Special Issue: Development and Application of Deep Learning based Object Detection)
Computers, Materials & Continua 2026, 88(1), 61 https://doi.org/10.32604/cmc.2026.077655
Received 14 December 2025; Accepted 11 March 2026; Issue published 08 May 2026
Abstract
Many occluded and ambiguous ground truths exist in object detection, making detectors unable to obtain optimal training samples. In this article, we revisit the suboptimal-sample issue in label assignment for object detection and propose a novel detector-agnostic strategy, termed FSS, to address it. FSS reformulates label assignment as the process of selecting high-quality suboptimal samples and progressively transforming them into optimal ones. Specifically, for each candidate, we estimate the probability of being an optimal sample by jointly considering localization quality and classification confidence, thereby constructing an instance-wise probability matrix. Based on the spatial distribution of potentially optimal samples, we introduce a Gaussian prior to adaptively determine the number of suboptimal samples per instance. We then assign weights to these suboptimal samples according to their optimality probabilities, enforcing consistent ranking between classification and localization and promoting the emergence of truly optimal samples. Extensive experiments on MS-COCO demonstrate the effectiveness and plug-and-play nature of FSS: when integrated into a modern one-stage detector, FSS achieves 50.8 AP under single-model, single-scale testing, without introducing any additional inference overhead.
1 Introduction
Object detection, a fundamental yet still challenging task in computer vision, aims to localize and classify objects in images while suppressing irrelevant background interference. With the rapid development of deep learning, object detection has achieved remarkable progress. Current object detection approaches can be categorized into multi-stage and one-stage methods.
Multi-stage detectors typically follow a proposal-driven pipeline: candidate regions are generated to separate foreground from background, pruned to remove redundancy, and then refined by subsequent detection heads. Owing to progressive refinement and explicit control of the positive/negative (Pos/Neg) ratio, they often outperform one-stage methods, albeit with higher architectural complexity and computational cost. In contrast, one-stage detectors predict classification and box regression densely on feature maps, without an explicit proposal stage. Anchor-based variants use predefined anchors with a single refinement step, offering high efficiency. However, dense feature pyramid network (FPN) [1] predictions generate large numbers of candidates and induce severe class imbalance, with positives rare relative to negatives. This imbalance largely accounts for the accuracy gap to multi-stage detectors: the latter explicitly regulates the Pos/Neg ratio via proposals, whereas one-stage methods must rely on loss design and sampling.
To alleviate the above-mentioned problem, RetinaNet introduces Focal Loss [2] to down-weight abundant negatives and emphasize hard positives. While effective, it does not resolve the fundamental scarcity of positive samples. Fully Convolutional One-Stage object detection (FCOS) [3] increases the number of positives by labeling points near each ground-truth center as positives across FPN levels, but the resulting set may include low-quality or ambiguous samples, potentially hindering convergence and final accuracy.
These observations raise a central question in dense detection: how to select informative candidates and assign them as positive or negative with respect to each ground-truth object, a process commonly termed label assignment. Recent studies [4,5] show that assignment design—spanning matching metrics, decision thresholds, and spatial/predictive priors—is a key, yet often overlooked, determinant of detection performance. Existing approaches broadly fall into two paradigms: static and dynamic label assignment. Static strategies [5,6] label anchors as positive using fixed IoU thresholds or hand-crafted spatial priors (e.g., grid cell centers), assigning the rest to negatives or discarding them. They are simple and efficient, but often brittle to variations in object scale, shape, and density, leading to suboptimal matches for irregular, small, or crowded targets. In contrast, dynamic label assignment methods [7] adapt criteria to the model’s current predictions (e.g., classification confidence and localization quality), enabling more context-aware assignments in diverse scenes.
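To make the static paradigm concrete, the following is a minimal NumPy sketch of Max-IoU assignment in the RetinaNet style; the 0.5/0.4 thresholds are the commonly used defaults in that line of work, not values taken from this paper.

```python
import numpy as np

def iou_matrix(boxes_a, boxes_b):
    """Pairwise IoU between two sets of [x1, y1, x2, y2] boxes."""
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    lt = np.maximum(boxes_a[:, None, :2], boxes_b[None, :, :2])  # intersection top-left
    rb = np.minimum(boxes_a[:, None, 2:], boxes_b[None, :, 2:])  # intersection bottom-right
    wh = np.clip(rb - lt, 0, None)
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area_a[:, None] + area_b[None, :] - inter + 1e-9)

def static_assign(anchors, gts, pos_thr=0.5, neg_thr=0.4):
    """RetinaNet-style static assignment: 1 = positive, 0 = negative, -1 = ignore."""
    ious = iou_matrix(anchors, gts)
    best_iou = ious.max(axis=1)        # best overlap per anchor
    best_gt = ious.argmax(axis=1)      # matched instance index
    labels = np.full(len(anchors), -1) # default: ignore (between the two thresholds)
    labels[best_iou < neg_thr] = 0
    labels[best_iou >= pos_thr] = 1
    return labels, best_gt
```

The single fixed threshold pair is exactly the brittleness the text describes: anchors whose best IoU falls between 0.4 and 0.5 are simply ignored, regardless of how informative they are.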
However, an important issue has received relatively little attention: in realistic detection scenarios, truly optimal samples may be absent or extremely rare. Due to occlusions, extreme aspect ratios, small object sizes, and cluttered backgrounds, many candidate samples may exhibit a mismatch between classification confidence and localization quality. In other words, the sample with the highest classification score is not necessarily the one with the best IoU, and vice versa. Moreover, it is often impossible to determine a priori whether a given sample is globally optimal.
To formalize this notion, we define a label-metric score that jointly couples classification confidence with localization quality (IoU), so that a higher score indicates a candidate more likely to be the optimal sample for its ground-truth instance.
Fig. 1 illustrates the presence of uncertain samples in object detection, which can induce inconsistencies between the classification and localization rankings. We therefore collect these high-performing yet uncertain suboptimal samples and harmonize their task-specific rankings. It is worth noting that, if a truly optimal sample exists, it should maximize any reasonable label metric irrespective of the specific choice of weighting parameters.
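The ranking mismatch can be reproduced in a few lines. The sketch below assumes a multiplicative label metric t = s^alpha * u^beta over classification confidence s and IoU u; the exact functional form and exponents used by FSS are not shown here, so these are illustrative choices only.

```python
import numpy as np

def label_metric(cls_score, iou, alpha=1.0, beta=3.0):
    """Illustrative joint label metric t = s^alpha * u^beta coupling
    classification confidence s and localization quality u (IoU).
    The form and exponents are assumptions, not the paper's exact choice."""
    return cls_score ** alpha * iou ** beta

# Three candidates for one instance: the rankings by s and by u disagree.
s = np.array([0.9, 0.6, 0.5])  # classification confidence
u = np.array([0.5, 0.8, 0.9])  # IoU with the ground truth
t = label_metric(s, u)         # joint metric resolves the conflict
```

Here the candidate with the highest confidence (index 0) is not the one with the best IoU (index 2); a joint metric gives a single consistent ranking over such ambiguous candidates.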

Figure 1: Illustrative uncertain samples existing in object detection, which lead to inconsistencies in classification and localization rankings. To address this, we collect these well-performing yet ambiguous suboptimal samples and align their rankings across tasks.
In summary, truly optimal samples rarely occur in real-world scenes, yet whenever they do, they are almost surely labeled as positives. Consequently, the detector is largely shaped by the abundant suboptimal samples. Our approach, Focusing on Suboptimal Samples (FSS), therefore consists of two steps: (1) Selecting suboptimal samples. Rather than using static heuristics, FSS builds an instance-wise probability matrix to estimate how likely each candidate is to be optimal, jointly accounting for classification confidence and regression quality. (2) Transforming suboptimal samples into optimal ones. FSS assigns each selected candidate an instance-specific weight derived from its probability score. These weights encourage consistent ordering between classification and localization by amplifying the learning signal for higher-ranked candidates. The main contributions of this article are summarized as follows:
• The underexplored role of suboptimal positives in dense label assignment is identified and formalized, with emphasis placed on regimes where truly optimal samples are absent or unreliable.
• A unified probability score coupling classification confidence and localization quality is introduced, based on which a Gaussian-prior-guided dynamic-k strategy adaptively determines the number of suboptimal samples per instance.
• The detector-agnostic applicability of the proposed FSS with no additional inference overhead is validated on MS-COCO and DOTA benchmarks, where consistent gains over representative baselines and competitive performance are achieved.
The remainder of this article is organized as follows. Section 2 reviews related work on one-stage object detection and label assignment. Section 3 presents the proposed label assignment strategy, FSS. Section 4 reports extensive experiments on the MS-COCO and DOTA benchmark datasets and compares our method with state-of-the-art approaches. Section 5 concludes the paper and outlines directions for future research.
2 Related Work
This section briefly reviews related works on one-stage object detection and label assignment strategies for dense detectors.
2.1 One-Stage Object Detection
Depending on the design, one-stage detectors can be anchor-based or anchor-free. Anchor-based detectors rely on a set of predefined anchor boxes (priors) with different scales and aspect ratios as regression references, which are often designed using statistics (e.g., clustering) over the training set. Anchor-free methods dispense with explicit anchors and instead predict bounding boxes from points or keypoints on feature maps, leading to simpler designs and often better robustness to extreme aspect ratios and small objects. OverFeat [8] is among the earliest deep learning-based one-stage detectors, introducing a unified framework for joint classification, localization, and detection. YOLO [5] formulates object detection as a single regression problem: it partitions the final feature map into an S × S grid, where each grid cell directly predicts bounding boxes, objectness scores, and class probabilities for the objects whose centers fall inside it.
Anchor-free detectors further reduce reliance on handcrafted priors. CornerNet [9] casts detection as paired keypoint prediction by producing heatmaps for the top-left and bottom-right corners. FCOS [3] treats each pixel on feature maps as a candidate location and regresses distances to the four box sides, while an additional centerness branch suppresses low-quality predictions. DETR [10] introduces Transformers and reformulates detection as a set-prediction problem, using Hungarian matching between a fixed set of object queries and ground-truth boxes, thereby removing the need for anchor design and non-maximum suppression.
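As a concrete reference for the FCOS-style scheme just described, the sketch below computes the (l, t, r, b) regression targets and the standard centerness score for a set of candidate points; centerness follows the published FCOS definition.

```python
import numpy as np

def fcos_targets(points, box):
    """FCOS-style regression targets: distances (l, t, r, b) from each
    location (x, y) to the four sides of a [x1, y1, x2, y2] box, an
    inside-the-box mask, and the centerness score used to suppress
    low-quality predictions."""
    x, y = points[:, 0], points[:, 1]
    l, t = x - box[0], y - box[1]
    r, b = box[2] - x, box[3] - y
    ltrb = np.stack([l, t, r, b], axis=1)
    inside = ltrb.min(axis=1) > 0  # candidate lies strictly inside the box
    centerness = np.sqrt(
        (np.minimum(l, r) / np.maximum(l, r)) *
        (np.minimum(t, b) / np.maximum(t, b)))
    return ltrb, inside, centerness
```

A point at the box center gets centerness 1, while off-center points are progressively down-weighted, which is exactly how FCOS suppresses the low-quality positives mentioned above.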
These developments have significantly improved the accuracy and simplicity of one-stage detectors, enabling their benchmarking and application across diverse dense detection scenarios. However, they also highlight a key bottleneck: how to effectively assign labels to the large number of dense candidates.
2.2 Label Assignment
During training, each candidate (anchor or point) must be assigned to a ground-truth instance or to the background prior to loss computation; this positive/negative assignment shapes optimization and largely determines detector performance.
Early detectors such as Faster R-CNN [11] and RetinaNet mainly rely on the anchor-to-ground-truth Intersection over Union (IoU): candidates above a preset threshold are labeled as positives, whereas those below a lower threshold are labeled as negatives. In contrast, FCOS [3] and YOLO [5] incorporate spatial constraints. YOLO assigns responsibility to anchors whose centers fall in the grid cell containing the ground-truth center, while FCOS and FoveaBox [12] expand the positive set by treating points within a region around each ground truth as positives.
Despite their differences, these approaches share a key limitation: positives and negatives are separated by a single hand-crafted criterion (e.g., IoU threshold, scale, or spatial rule). Such fixed heuristics can yield noisy or ambiguous supervision and fail to exploit richer context among candidates, thereby limiting adaptivity and attainable performance.
To improve adaptivity, a range of dynamic assignment strategies has been proposed. ATSS [13] derives instance-specific IoU thresholds from the mean and standard deviation of candidate IoUs. FreeAnchor [14] casts assignment as maximum-likelihood estimation, allowing anchors to select ground truths via learned likelihood scores. Zhang et al. [15] model quality scores with a Gaussian mixture and use EM to probabilistically separate positives from negatives. AutoAssign [16] introduces instance-wise labeling via central/confidence weighting modules, and DW [17] assigns task-aware weights to both positive and negative samples.
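ATSS's adaptive criterion is simple enough to sketch directly: for each instance, the IoU threshold is the mean plus the standard deviation of its candidate IoUs (ATSS's center-distance pre-selection of candidates is omitted here for brevity).

```python
import numpy as np

def atss_threshold(candidate_ious):
    """ATSS-style instance-specific threshold: mean + std of the IoUs of
    the pre-selected candidates for one ground-truth instance. Candidates
    at or above the threshold become positives."""
    ious = np.asarray(candidate_ious, dtype=float)
    thr = ious.mean() + ious.std()   # adapts to the instance's IoU statistics
    positives = ious >= thr
    return thr, positives
```

Because the threshold tracks the instance's own IoU distribution, a hard instance with uniformly low overlaps still receives positives, which a fixed 0.5 threshold would not provide.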
More recently, several methods explicitly combine classification confidence and localization quality for matching and/or weighting. SimOTA (and OTA) [18] performs dynamic-k matching, treating assignment as an optimal-transport-style problem in which each ground truth selects its top candidates according to a combined classification-localization cost.
3 Proposed Method
The overall pipeline of the proposed FSS framework is illustrated in Fig. 2. For each spatial location, the detector outputs a classification score, an objectness score, and a bounding-box offset. Based on the label metric defined above, FSS first selects high-quality suboptimal samples for each instance and then reweights their classification and localization losses to progressively promote them toward optimal ones.

Figure 2: The pipeline of the proposed FSS. The model consists of a CNN-based backbone and a detection head; for each location, the head produces a classification score, an objectness score, and a bounding-box offset, from which FSS derives the label assignment.
3.1 Choosing Suboptimal Samples
3.1.1 Probability Matrix: The Likelihood of Being an Optimal Sample
Conventional sample-quality metrics typically rely on IoU thresholds or spatial constraints as a proxy for geometric alignment with the assigned ground truth. In dynamic label assignment, each ground-truth instance is usually matched to multiple candidates (positive samples), among which at most one can be truly optimal.
For such suboptimal samples, the rankings induced by IoU and by classification confidence should be as consistent as possible with respect to the corresponding ground truth. From the viewpoint of the joint label metric, we therefore estimate, for each candidate, the likelihood of being the instance's optimal sample by combining its classification confidence with its localization quality. The resulting scores form an instance-wise probability matrix over candidates, which serves as the basis for suboptimal-sample selection and weighting.
Training stability. Eq. (3) is used only to generate supervision (sample selection and loss reweighting) and is not treated as a differentiable objective. In implementation, we compute the score on detached predictions (stop-gradient) and stabilize it with instance-wise normalization and smoothing, so that early-stage fluctuations have a bounded effect on optimization.
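A minimal sketch of the instance-wise probability matrix described above, reusing the illustrative product form s^alpha * u^beta; the per-instance normalization follows the text, while the exponents are placeholder assumptions. In a real detector, the inputs would first be detached from the computation graph (stop-gradient), as stated above.

```python
import numpy as np

def optimality_matrix(cls_scores, ious, alpha=1.0, beta=3.0, eps=1e-9):
    """Instance-wise probability matrix P (instances x candidates): each row
    estimates how likely every candidate is to be that instance's optimal
    sample. The multiplicative form and exponents are illustrative
    assumptions; inputs are assumed to be already detached (stop-gradient)."""
    raw = (cls_scores ** alpha) * (ious ** beta)          # joint quality score
    return raw / (raw.sum(axis=1, keepdims=True) + eps)   # each row sums to ~1
```

Each row is a proper distribution over candidates, so the argmax of a row identifies the instance's most plausible optimal sample.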
3.1.2 Gaussian-Prior Dynamic-k Estimation
Determining how many suboptimal samples should be assigned to each instance is crucial for stable and effective training. Many existing methods control this quantity with a fixed hyperparameter or a static threshold, overlooking substantial instance-level variability: heavily occluded objects may provide only a few reliable candidates, whereas large, well-defined objects can support many. Since this factor is difficult to model analytically, OTA [18] proposed a simple yet effective heuristic, termed dynamic-k estimation.
Given the candidate set of an instance, dynamic-k estimation derives the number of positives from the sum of the top-q IoU values between the candidates and the ground-truth box, rounded and clipped to be at least one. Consequently, candidates with larger overlaps contribute more to the estimated number of positives, so well-localized instances naturally receive more samples.
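The IoU-based dynamic-k heuristic can be sketched as follows; q = 10 is a typical choice in OTA-style implementations, not necessarily the value used here.

```python
import numpy as np

def dynamic_k(ious_per_instance, q=10):
    """OTA-style dynamic-k: estimate how many positives an instance should
    receive as the rounded sum of its top-q candidate IoUs, clipped to at
    least 1 so every instance keeps at least one positive."""
    top_q = np.sort(ious_per_instance)[::-1][:q]  # q largest IoUs
    return max(1, int(round(float(top_q.sum()))))
```

An instance with several high-IoU candidates receives more positives, while a heavily occluded instance with uniformly low IoUs still keeps one.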
However, the localization-quality landscape over feature locations can be discrete and irregular: many regions inside an object may yield high IoU yet remain weakly discriminative. As a result, IoU-only dynamic-k estimation can scatter the selected positives across such regions instead of concentrating them around the most promising location.

Figure 3: Visualization of the Gaussian-prior dynamic-k selection.
To encourage such compactness, we introduce a 2-D Gaussian prior centered at the potentially optimal sample, whose covariance scales with the object size.
We emphasize that the 2-D Gaussian prior is a lightweight heuristic that regularizes the spatial distribution of candidate centers rather than assuming that an object itself strictly follows a Gaussian shape. As a soft, instance-wise re-ranking term, it is primarily intended to suppress spatially distant candidates and encourage compact selection around the potentially optimal sample. Potential failure cases include highly elongated or irregular objects, fragmented instances under heavy occlusion, and crowded scenes with overlapping objects, where the optimal candidate region may be non-elliptical. In such cases, the prior may be less accurate, but its effect remains bounded because the final selection is still jointly governed by localization and classification quality (via
To visually analyze potential failure cases, we present an asymmetric airplane case in Fig. 4. The airplane in the image has an irregular shape with an incomplete left wing. The classification peak (b) is localized on the fuselage, causing the center-focused prior (c) to neglect the wide wings. Consequently, the detection box (red) is suppressed and clipped by the rigid prior weights, even when the manual ground truth (green) is accurately defined. This confirms that unimodal priors struggle with non-convex or protruding geometries.

Figure 4: Failure analysis on an irregular object. (a) Input; (b) Classification heatmap; (c) Gaussian prior; (d) Ground truth (GT) vs. suppressed detection.
This size-dependent covariance yields a scale-adaptive prior: small (large) objects naturally induce a narrower (broader) spatial support, which is consistent with the typical extent of reliable candidates. Moreover, since the prior is only used to re-rank candidates within each instance, it mitigates sensitivity to absolute object scale.
The prior value at each candidate center multiplicatively re-ranks that candidate's score within its instance. By down-weighting spatially distant candidates, this formulation suppresses distracting regions and reduces the chance of selecting noisy samples, thereby concentrating the assignment around more plausible locations.
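A sketch of the Gaussian-prior re-ranking under the stated assumptions: a diagonal covariance whose standard deviation is a fixed fraction (tau, a hypothetical value) of the object's width and height, multiplied onto each candidate's score.

```python
import numpy as np

def gaussian_prior(centers, mu, box_wh, tau=0.125):
    """2-D Gaussian prior over candidate centers: mu is the center of the
    potentially optimal sample, and the diagonal covariance scales with
    the object size box_wh. tau is an illustrative scale factor."""
    sigma = tau * np.asarray(box_wh, dtype=float)  # size-adaptive std per axis
    d = (centers - mu) / sigma                     # normalized offsets
    return np.exp(-0.5 * (d ** 2).sum(axis=1))     # prior in (0, 1]

def rerank(scores, centers, mu, box_wh):
    """Multiplicative re-ranking: distant candidates are softly down-weighted."""
    return scores * gaussian_prior(centers, mu, box_wh)
```

In the test below, the candidate farther from the prior center starts with the higher raw score but is overtaken after re-ranking, which is exactly the compactness effect described above.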
Overall, the Gaussian-prior dynamic-k scheme adaptively determines how many suboptimal samples each instance receives while keeping their spatial distribution compact around the potentially optimal sample.
3.2 Transforming Suboptimal Samples into Optimal Ones
Intuitively, each ground-truth object should correspond to one and only one optimal sample. Therefore, preserving a meaningful ranking among suboptimal samples is crucial: only when their task rankings are properly ordered can the truly optimal sample emerge during training.
In FSS, we explicitly focus on the suboptimal samples of each instance and aim to “excavate” the optimal one from these suboptimal candidates. To this end, the probability associated with each suboptimal sample is used to guide the allocation of its learning weight, which in turn reweights the classification and localization losses to encourage consistent task rankings.
FSS mitigates cross-task inconsistency among suboptimal samples by assigning instance-aware weights and gradually transforming high-quality suboptimal samples into optimal ones. The weighting design follows three principles:
1. Preserve intra-instance ranking. For suboptimal samples of the same instance, a larger optimality score should always yield a larger weight.
2. Maintain inter-instance fairness. Potentially optimal samples across different instances should have comparable weight scales, preventing the detector from overfitting to a few instances.
3. Respect score gaps. Within an instance, larger gaps in the optimality score should translate into proportionally larger gaps in weight.
Formally, the weight of each suboptimal sample is obtained from its optimality score through a mean-normalized power transform (Eq. (8)).
Eq. (8) can be viewed as a mean-normalized power transform of the optimality score. It satisfies the desired properties: (i) monotonicity and intra-instance ranking preservation, since the transform is strictly increasing in the score for any fixed instance; (ii) inter-instance fairness, since normalizing by the instance mean keeps weight scales comparable across instances; and (iii) gap sensitivity, since the power exponent amplifies differences between scores.
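Under the mean-normalized power-transform reading of Eq. (8), the weighting can be sketched as below; the exponent gamma is an assumed placeholder for the paper's actual setting.

```python
import numpy as np

def instance_weights(scores, gamma=2.0, eps=1e-9):
    """Mean-normalized power transform w_i = (s_i / mean(s))^gamma over one
    instance's suboptimal-sample scores. gamma is an assumed value. The
    transform preserves the intra-instance ranking, keeps an 'average'
    candidate near weight 1 (inter-instance fairness), and amplifies
    score gaps through the exponent."""
    s = np.asarray(scores, dtype=float)
    return (s / (s.mean() + eps)) ** gamma
```

Because the weights depend only on ratios to the instance mean, scaling all of an instance's scores by a constant leaves its weights unchanged, which is the fairness property (ii) in code form.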
Finally, we apply the learned weights to both classification and localization losses to promote the emergence of optimal samples. The overall FSS loss is defined as:
where the classification and localization losses of each selected sample are scaled by its instance-aware weight.
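A schematic version of the reweighted training loss for one instance's positives, assuming a binary cross-entropy classification term and a (1 − IoU) localization term; the balancing coefficient lam and the exact loss functions are illustrative assumptions, not the paper's precise formulation.

```python
import numpy as np

def fss_loss(cls_pred, cls_target, ious, weights, lam=2.0):
    """Sketch of a reweighted loss: per-sample BCE classification loss plus
    an IoU-based localization loss, each scaled by the instance-aware
    weight. lam is an assumed balancing coefficient."""
    p = np.clip(cls_pred, 1e-7, 1 - 1e-7)
    bce = -(cls_target * np.log(p) + (1 - cls_target) * np.log(1 - p))
    loc = 1.0 - ious                              # IoU loss for positives
    return float((weights * (bce + lam * loc)).sum())
```

Samples with weight zero contribute nothing, while higher-ranked candidates dominate the gradient, which is how the weighting promotes the emergence of the optimal sample.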
4 Experimental Validation and Result Discussions
This section reports the confirmatory and comparative experimental results on the MS-COCO and DOTA datasets.
4.1 Datasets
(1) MS-COCO dataset
The MS-COCO benchmark [23] contains approximately 118k training images, 5k validation images, and 20k test-dev images. Following standard practice, we adopt the trainval135k/minival split: models are trained on trainval135k (the union of the train set and a 35k subset of the val set, totaling 135k images), and validated on the remaining 5k images (minival). Final results are reported on test-dev by submitting predictions to the official MS-COCO evaluation server, ensuring fair comparisons with state-of-the-art (SOTA) detectors.
(2) DOTA dataset (a large-scale dataset of object detection in aerial images)
The DOTA dataset [24] is an open-source benchmark for object detection in remote-sensing imagery. Unlike natural-image datasets, objects in aerial images appear with arbitrary orientations due to the overhead viewing geometry. DOTA-v1.5 extends DOTA-v1.0 by expanding the label space from 10 to 16 categories. It contains over 2800 images collected from diverse platforms and online sources, each with a resolution of approximately 4000 × 4000 pixels.
Given the extremely high resolution of DOTA images, we adopt a tiling-based preprocessing strategy inspired by YOLT (You Only Look Twice) [25], which splits each image into overlapping tiles and merges tile-level predictions with non-maximum suppression (NMS) to remove duplicates. While tiling is effective for small, densely packed objects, performing it online during inference substantially increases runtime because each high-resolution image must be processed into many patches. To avoid additional inference cost, we apply tiling offline as a data-augmentation procedure, thereby improving sensitivity to small/occluded objects without sacrificing inference efficiency. After extensive experiments, we crop each image into overlapping 1024 × 1024 patches.
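The offline tiling step can be sketched as follows; the 256-pixel overlap is an assumed value, since only the 1024-pixel tile size is specified above.

```python
def tile_coords(img_w, img_h, tile=1024, overlap=256):
    """Offline tiling for high-resolution aerial images: returns top-left
    corners of overlapping tile x tile crops covering the whole image
    (YOLT-style). The overlap value is an assumption."""
    step = tile - overlap
    xs = list(range(0, max(img_w - tile, 0) + 1, step))
    ys = list(range(0, max(img_h - tile, 0) + 1, step))
    # ensure the right/bottom borders are covered by a final shifted tile
    if xs[-1] + tile < img_w:
        xs.append(img_w - tile)
    if ys[-1] + tile < img_h:
        ys.append(img_h - tile)
    return [(x, y) for y in ys for x in xs]
```

For a 4000 × 4000 DOTA image this yields a 5 × 5 grid of crops; at inference time no tiling is performed, so the runtime cost stays with training-data preparation only.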
4.2 Implementation Details and Evaluation Criteria
Training is performed on Ubuntu 18.04.3 with an NVIDIA Tesla T4 GPU (16 GB) and an Intel(R) Xeon(R) Silver 4110 CPU @ 2.10 GHz (16 cores). We use SGD with an initial learning rate of 0.01, momentum of 0.937, a 3-epoch learning-rate warmup, and a weight decay of 0.0005, and we further adopt an exponential moving average (EMA) of model parameters. The test environment is Microsoft Windows 10 (19043.1348) with an NVIDIA GeForce RTX 3060 Laptop GPU (80 W) and an 11th Gen Intel(R) Core(TM) i7-11800H CPU @ 2.30 GHz (8 cores).
We evaluate both accuracy and efficiency. For accuracy, we report the standard COCO-style metrics, with Average Precision (AP) as the primary measure. The model input resolution is 1024 × 1024.
The base learning rate follows a cosine annealing schedule preceded by the 3 warmup epochs. Unless otherwise specified, models are trained from scratch for 200 epochs on 4 GPUs with a mini-batch size of 16 per GPU, using a strong data augmentation pipeline (without pre-training). The hyperparameters of FSS are set to the best-performing values identified in the ablation study (Section 4.3.3).
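The described schedule (base learning rate 0.01, cosine annealing, 3 warmup epochs over 200 total) can be sketched as a per-epoch function; the linear warmup shape and a minimum learning rate of 0 are assumptions.

```python
import math

def lr_at_epoch(epoch, total_epochs=200, base_lr=0.01, warmup_epochs=3,
                min_lr=0.0):
    """Cosine-annealed learning rate with a linear warmup, matching the
    training schedule described in the text; warmup shape and min_lr are
    assumed details."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs   # linear warmup
    t = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))
```

The rate ramps up to 0.01 by epoch 3, then decays smoothly to the minimum at the final epoch.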
Since FSS is a detector-agnostic label assignment strategy, it can be seamlessly integrated into a wide range of modern detectors. In our experiments, we instantiate two representative architectures based on commonly used backbones: ResNet-101 [26] and CSPDarkNet [27]. In line with mainstream object detection frameworks, both architectures comprise a backbone, a neck, and a detection head. For the CSPDarkNet backbone, we adopt a PANet-style neck and a decoupled detection head; for the ResNet-101 backbone, we use an FPN neck coupled with the same decoupled head. The primary difference between the two instantiations thus lies in the choice of backbone and neck, while the head and the proposed FSS label assignment remain identical.
Although FSS introduces no additional inference overhead, it adds extra computation during training due to the Gaussian-prior dynamic-k estimation and instance-wise reweighting; as quantified in Table 1, this training-time overhead is minor.

4.3 Ablation Studies
All ablation experiments are conducted on the COCO minival set. Unless otherwise specified, we use CSPDarkNet [27] as the backbone, train for 42 epochs under the same settings as in Section 4.2, and keep all other implementation details identical to ensure fair comparisons.
4.3.1 Effectiveness of FSS for Label Assignment
To verify that FSS selects higher-quality suboptimal samples and progressively promotes them toward the optimal ones, we compare it with five representative label assignment strategies under the same baseline. Results are reported in Table 2.

All results in Table 2 are averaged over 10 independent runs with different random seeds and reported as mean ± standard deviation.
4.3.2 Contribution Analysis of Suboptimal Selection and Transformation
It is worth noting that FSS consists of two key stages: suboptimal sample selection (SubOptSel) and optimal sample transformation (SubOptTrans). To quantify the contribution of these two stages, we adopt Max-IoU as the baseline label assignment strategy and progressively introduce suboptimal selection and optimal transformation. As reported in Table 3, replacing Max-IoU with suboptimal selection improves AP by 2.0. Adding optimal transformation on top of suboptimal selection yields an additional 0.3 AP gain. Notably, these improvements are achieved purely through training-time assignment and reweighting, introducing no extra inference overhead; hence, they can be regarded as “free” performance gains for the existing detector.

4.3.3 Sensitivity to Hyperparameters
FSS involves a small number of hyperparameters, chiefly those in the label-metric score and in the weight transform.
To evaluate the sensitivity of FSS to these hyperparameters, we vary each of them over a range of values while keeping the others fixed and report the results in Table 4.

4.4 Comparison with State-of-the-Art Methods
The comparative quantitative results of FSS with representative label assignment strategies on detectors on the MS-COCO test-dev dataset are summarized in Table 5. The compared strategies include ATSS [13], which calculates adaptive IoU thresholds based on statistical properties of object fits; GFL [20], which optimizes a generalized focal loss to jointly model classification and localization quality; PAA [15], which separates positive and negative anchors using a probabilistic Gaussian mixture model; TOOD [19], which aligns tasks via an explicit task-aligned head and learning mechanism; and DW [17], which introduces dual weighting to dynamically refine label importance. FSS, in contrast, focuses on progressively promoting high-quality suboptimal samples through unified probability scoring.

Since detection accuracy is strongly influenced by model capacity (e.g., the backbone/neck), we primarily draw conclusions from backbone-matched comparisons and report results under multiple backbone settings to demonstrate generality. As shown in Table 5, FSS consistently improves performance across diverse detector architectures (YOLOv3, Faster R-CNN, RetinaNet, and modern Decoupled detection head (DDH)-based models) while maintaining high inference efficiency.
Improvements on Standard Detectors. On the classic YOLOv3 (DarkNet-53), FSS boosts the baseline from 33.0 to 36.3 AP (+3.3 AP), outperforming advanced assignment methods such as TOOD [19] (36.0 AP) and PAA [15] (36.0 AP) without the speed drop observed in TOOD (35 vs. 37 FPS). Similarly, on the two-stage Faster R-CNN (ResNet-50), FSS achieves the highest AP of 44.5, surpassing GFL (44.2) and TOOD (43.9). Notably, on the anchor-based RetinaNet (ResNet-101), FSS delivers a substantial gain of +5.0 AP (39.1 → 44.1).
Validation on Modern Decoupled Heads. To verify effectiveness on stronger baselines, we apply FSS to detectors with Decoupled Detection Heads (DDH). With a ResNet-101 backbone, FSS achieves 49.0 AP, matching the top-performing TOOD but with a distinct speed advantage (37.6 FPS vs. 34.4 FPS). More importantly, FSS yields superior localization quality, achieving higher AP at the stricter IoU thresholds (e.g., AP75).
These results confirm that (i) the proposed suboptimal-sample–focused label assignment can be effectively integrated into different detector architectures (ResNet-101+FPN and CSPDarkNet+PANet), and (ii) FSS provides a competitive or superior accuracy–speed trade-off compared with strong SOTA detectors, without introducing any additional inference overhead.
4.5 Generalization on Remote Sensing Benchmark
To further evaluate the generalization of FSS in scenarios dominated by tiny, densely packed, and arbitrarily oriented objects, we conduct additional experiments on the DOTA dataset. DOTA is a large-scale aerial-image benchmark featuring extreme aspect ratios, cluttered backgrounds, and dense object layouts, where “perfectly aligned” optimal samples are often scarce. This makes DOTA a particularly suitable testbed for assessing whether our strategy can reliably mine high-quality suboptimal positives and improve label assignment under challenging conditions.
The quantitative results are summarized in Table 6. Since the table contains multiple detectors and assignment variants, we highlight the main takeaway here: FSS consistently improves AP and AP50 across the evaluated detectors.
Furthermore, by integrating our proposed FSS strategy (denoted as SADet+FSS), the detection performance is boosted to 48.2 AP and 71.8 AP50.
To intuitively demonstrate this capability, Fig. 5 presents the qualitative visualization of FSS on the DOTA dataset compared with ground-truth annotations. As observed, even in extreme scenarios with densely packed vehicles and varying-scale ships, FSS maintains exceptional localization precision. The predicted bounding boxes closely align with the ground truth, effectively distinguishing adjacent tiny instances without introducing significant false positives. The successful application on DOTA confirms that FSS is not limited to natural scenes but generalizes well to complex remote sensing tasks, further validating its cross-domain robustness.

Figure 5: Visualization of detection results on the DOTA dataset. FSS achieves high localization accuracy on densely packed objects, such as vehicles and ships, generating bounding boxes that closely match ground truths.
Conceptual comparison with related methods. Although several recent methods also couple classification confidence with localization quality, their primary focus differs from ours. ATSS [13] adaptively sets an IoU threshold by exploiting the statistics of candidate IoUs, whereas FSS explicitly models the learning value of suboptimal candidates via a unified probability score and then uses a Gaussian-prior-guided dynamic-k scheme to decide how many such candidates each instance retains.
Limitations and applicability. FSS leverages the correlation between classification confidence and localization quality to construct the probability score. In the early epochs, both classification confidence and IoU can be poorly calibrated, which may lead to unstable score rankings and fluctuating selected samples/weights. To mitigate this early-stage instability, the score is used only for sample selection/weighting via stop-gradient, and it is further stabilized by instance-wise normalization and smoothing, thereby bounding its impact on optimization. In extremely sparse scenes, the pool of informative suboptimal candidates may be limited, and in extremely dense/crowded scenes, multiple nearby instances can produce highly ambiguous candidates, potentially increasing assignment uncertainty. In practice, we find FSS is stable across a wide range of hyperparameter settings.
Performance-efficiency trade-off and deployment feasibility. FSS modifies only the training-time label assignment and reweighting procedure, and it does not introduce any additional layers or computations in the inference graph. Therefore, the deployed detector preserves the same model complexity (Params/FLOPs) as the corresponding baseline, and the inference throughput/latency reported in our experiments reflects practical deployment behavior under fixed hardware and precision settings (e.g., batch size 1 and FP32 as stated in the table captions). From the perspective of computational and energy constraints, this means that inference-time compute and the associated energy cost remain essentially unchanged when adopting FSS. The only extra cost occurs during training; as quantified in Table 1, the added overhead is minor, indicating that FSS is feasible for practical training and deployment pipelines.
Complexity and scaling. The dominant additional computation in FSS comes from constructing the instance-wise score/matching statistics between candidate locations and ground-truth instances. Let N denote the number of candidates (which grows with input resolution and feature-map density) and M denote the number of ground-truth objects; the associated matrix-style operations scale roughly with O(N × M), which in practice remains a small fraction of the overall training cost.
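The N × M statistics themselves are plain matrix operations; a common way to bound their training-time memory is to evaluate them in candidate chunks, as in this sketch (the chunked result is identical to the full computation, so only peak memory changes).

```python
import numpy as np

def pairwise_scores_chunked(cand_feats, gt_feats, chunk=4096):
    """Assignment statistics are matrix-style ops of size N x M (candidates
    x instances). Computing them in candidate chunks bounds peak memory at
    chunk x M while leaving the O(N*M) arithmetic unchanged."""
    out = []
    for i in range(0, len(cand_feats), chunk):
        out.append(cand_feats[i:i + chunk] @ gt_feats.T)  # (chunk, M) block
    return np.concatenate(out, axis=0)
```

Because N grows with input resolution while M (objects per image) stays small, this chunked evaluation keeps the training-time overhead modest even at high resolutions.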
5 Conclusions
In this article, we proposed FSS, an adaptive and detector-agnostic label assignment scheme that explicitly focuses on suboptimal samples. For each instance, FSS identifies high-quality suboptimal candidates using a unified probability score that couples classification confidence with localization quality. It then combines IoU with a Gaussian prior centered at the potentially optimal sample to adaptively determine the number of positives for each ground truth. The resulting optimality probability is further mapped to instance-wise weights applied to both classification and localization heads, preserving the ranking structure and progressively promoting truly optimal samples during training. Extensive experiments on MS-COCO and DOTA datasets demonstrate that FSS is effective and generalizes well across detectors, achieving a competitive accuracy-speed trade-off with no additional inference overhead. In future work, we will integrate FSS into fully end-to-end detection pipelines by eliminating hand-crafted NMS and explore its extension to broader detection and instance-level recognition tasks.
Acknowledgement: Not applicable.
Funding Statement: This research was funded by the National Natural Science Foundation of China under Grant No. 62371187 and the Open Program of Hunan Intelligent Rehabilitation Robot and Auxiliary Equipment Engineering Technology Research Center under Grant No. 2024JS101.
Author Contributions: The authors confirm contribution to the paper as follows: Conceptualization, Jinping Liu and Kunyi Zheng; methodology, Lijuan Huang, Kunyi Zheng, Xinyu Zhou, Jinping Liu and Yimei Yang; software, Yimei Yang and Zhixian Liu; validation, Lijuan Huang and Zhixian Liu; investigation, Lijuan Huang, Zhixian Liu and Jinping Liu; resources, Zhixian Liu; writing—original draft preparation, Jinping Liu and Yimei Yang; writing—review and editing, Yimei Yang and Jinping Liu; visualization, Lijuan Huang, Xinyu Zhou, Zhixian Liu and Yimei Yang; supervision, Jinping Liu; funding acquisition, Jinping Liu. All authors reviewed and approved the final version of the manuscript.
Availability of Data and Materials: The publicly accessible datasets MS-COCO and DOTA are used in this study.
Ethics Approval: Not applicable.
Conflicts of Interest: The authors declare no conflicts of interest.
Copyright © 2026 The Author(s). Published by Tech Science Press. This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.