Open Access

ARTICLE


Face-Pedestrian Joint Feature Modeling with Cross-Category Dynamic Matching for Occlusion-Robust Multi-Object Tracking

Qin Hu, Hongshan Kong*

The School of Cryptography Engineering, Information Engineering University, Zhengzhou, 450001, China

* Corresponding Author: Hongshan Kong.

(This article belongs to the Special Issue: Secure & Intelligent Cloud-Edge Systems for Real-Time Object Detection and Tracking)

Computers, Materials & Continua 2026, 86(1), 1-31. https://doi.org/10.32604/cmc.2025.069078

Abstract

To address the issues of frequent identity switches (IDs) and degraded identification accuracy in multi-object tracking (MOT) under complex occlusion scenarios, this study proposes an occlusion-robust tracking framework based on face-pedestrian joint feature modeling. By constructing a joint tracking model centered on “intra-class independent tracking + cross-category dynamic binding”, designing a multi-modal matching metric with spatio-temporal and appearance constraints, and introducing a cross-category feature mutual verification mechanism together with a dual matching strategy, this work effectively resolves the performance degradation that traditional single-category tracking methods suffer under short-term occlusion, cross-camera tracking, and crowded environments. Experiments on the Chokepoint_Face_Pedestrian_Track test set demonstrate that, in complex scenes, the proposed method improves Face-Pedestrian Matching F1 area under the curve (F1 AUC) by approximately 4 to 43 percentage points compared to several traditional methods. The joint tracking model achieves overall performance metrics of IDF1: 85.1825% and MOTA: 86.5956%, representing improvements of 0.91 and 0.06 percentage points, respectively, over the baseline model. Ablation studies confirm the effectiveness of key modules such as the Intersection over Area (IoA)/Intersection over Union (IoU) joint metric and dynamic threshold adjustment, validating the significant role of the cross-category identity matching mechanism in enhancing tracking stability. Our_model shows a 16.7% drop in frames per second (FPS) compared with fairness of detection and re-identification in multiple object tracking (FairMOT), with its cross-category binding module adding about 10% overhead, yet it maintains near-real-time performance for essential face-pedestrian tracking at small resolutions.

Keywords

Cross-category dynamic binding; joint feature modeling; face-pedestrian association; multi-object tracking; occlusion robustness

1  Introduction

With the rapid advancement of intelligent surveillance and smart city construction, MOT technology has emerged as a core research focus within the computer vision domain. While existing deep learning-based tracking methods (e.g., DeepSORT [1], FairMOT [2]) have achieved significant progress on public datasets, they still encounter substantial challenges such as frequent ID switches and degraded identity recognition performance in practical complex scenarios, particularly when pedestrians wear similar attire and occlude one another [3]. Consequently, multi-object tracking in heavily occluded environments requires further investigation. Traditional pedestrian tracking methods primarily rely on pedestrian appearance features and motion models for data association [4]. However, their performance is constrained in the following scenarios:

(1) Short-term occlusion: Occlusion leads to severe degradation of appearance features, reducing discriminability among pedestrians wearing similar clothing during intersections, as shown in Fig. 1a,b.

images

Figure 1: Illustration of complex scene. (a, b) Examples of short-term occlusion causing appearance feature degradation and reduced discriminability among pedestrians in similar clothing. (c, d) Examples of cross-camera tracking challenges due to significant appearance disparities under different viewpoints, highlighting the difficulty of establishing stable associations with unimodal features

(2) Cross-camera tracking: Significant disparities in pedestrian appearance under different viewpoints make it difficult to establish stable associations using only unimodal features, illustrated in Fig. 1c,d.

(3) Dynamic environments: Dynamic factors such as fluctuating pedestrian density and target deformation can invalidate fixed matching thresholds.

In crowded scenes, pedestrians are often partially or completely occluded, with the face typically remaining the only continuously visible part [5]. Pedestrian tracking, reliant on full-body or upper-body detection, is highly susceptible to detection failures under heavy occlusion. Conversely, face tracking focuses solely on the facial region, enabling consistent tracking even when most of the body is occluded, as depicted in Fig. 2. However, most existing works treat faces and pedestrians as independent detection targets, lacking an effective cross-category association mechanism.

images

Figure 2: Illustration of Target ID correction. (a) Tracking example of pedestrian bounding box (ID 38) and face bounding box (ID 38), annotated by a pink box. (b) Tracking example of pedestrian bounding box (ID 38) with occluded face bounding box, annotated by a pink box. (c) Tracking example of face bounding box (ID 541) with occluded pedestrian bounding box, annotated by a red box. (d) Tracking example after ID correction, showing the correct association of the pedestrian bounding box (ID 38) and face bounding box (ID 38), annotated by a pink box

To address these limitations, this study proposes a face-pedestrian joint tracking approach featuring cross-category dynamic binding and occlusion-robust multi-target tracking, with main contributions including:

(1) Face-Pedestrian Joint Tracking Model: An enhanced model based on the FairMOT framework, centered on the principle of “intra-class independent tracking + cross-category dynamic binding.” This significantly mitigates tracking instability issues (frequent ID switches and trajectory fragmentation) encountered by traditional single-category pedestrian trackers in complex scenes.

(2) Multi-Modal Joint Matching Metric with Spatio-temporal and Appearance Constraints: ① Effectively suppresses cross-category mismatches in complex scenes by fusing historical centroid distance and spatial constraints (IoA/IoU); ② Introduces a cross-category feature mutual verification mechanism, utilizing historical feature pools for bidirectional face-pedestrian feature validation to enhance association reliability; ③ Proposes an occlusion-aware dynamic weighting strategy that adaptively adjusts the weights of spatio-temporal and appearance constraints based on environmental complexity.

(3) Face-Pedestrian Association via Dual Matching Strategy: ① Formulates a matching matrix integrating feature similarity and spatio-temporal constraints, transforming the cross-category association task into a rapidly solvable maximum-weighted bipartite matching problem; ② Establishes a bidirectional historical mapping mechanism with ID consistency correction and feature inheritance strategies, effectively overcoming frequent ID switches and degradation in identification rate caused by occlusion, thereby ensuring spatio-temporal continuity of cross-category trajectories.

(4) Dynamic Threshold Filtering Based on Environmental Complexity: Designs an environmental complexity-based assessment framework to realize adaptive dynamic threshold filtering, effectively resolving the adaptability limitations of fixed thresholds in dynamic scenes.

To validate the efficacy of the proposed method, comprehensive experiments were conducted on the annotated Chokepoint_Face_Pedestrian_Track test set.

(1) Face-Pedestrian Matching Experiment: In complex scenarios, our method demonstrated substantial improvement over traditional geometric matching and pure feature-based methods (FP, IoA, IoU), with F1 AUC enhancement ranging between approximately 4 and 43 percentage points, indicating stronger robustness and matching precision.

(2) Face-Pedestrian Matching Ablation Study: By incrementally incorporating functional modules (IoA-IoU joint metric, cross-category feature retrieval, spatial constraint, dynamic threshold adjustment, historical ID correction), the effectiveness of each module in optimizing face-pedestrian matching performance was individually verified.

(3) Multi-Face-Pedestrian Joint Tracking Experiment: Achieved overall performance metrics of IDF1: 85.1825% and MOTA: 86.5956%. Compared to the baseline model without the joint matching mechanism (IDF1: 84.2776%, MOTA: 86.5413%), this represents increases of 0.91 percentage points and 0.06 percentage points, respectively. The synchronized improvement in IDF1 and MOTA validates the contribution of the proposed cross-category identity matching mechanism to enhancing tracking performance.

(4) Real-Time Performance and Computational Cost Analysis: The computational efficiency analysis reveals that Our_model (without matching) exhibits a 16.7% FPS reduction compared to FairMOT’s single-category tracking, with the cross-category dynamic binding module introducing approximately 10% additional computational overhead. Nevertheless, Our_model maintains near-real-time performance with small-resolution inputs, achieving its essential cross-category face-pedestrian tracking capability at controlled computational costs.

2  Related Work

As one of the core tasks in computer vision, MOT primarily revolves around detection, association, and feature representation. In recent years, with the advancement of deep learning techniques, MOT methods have gradually evolved into a diversified research framework centered on detection while integrating multiple technologies. The following discussion classifies and reviews existing methods from the perspectives of MOT technical paradigms and specific target tracking, while also summarizing commonly used object tracking datasets.

2.1 Classification of MOT Methods

2.1.1 Detection-Based MOT Methods

Tracking by Detection (TBD): Constructs trajectories via frame-wise detection and association. SORT [6] uses IoU for matching but suffers under occlusion. DeepSORT reduces ID switches via cascade matching and appearance features. ByteTrack [7] leverages low-confidence boxes for occlusion handling, while BoT-SORT [8] fuses motion/appearance data. TBD methods remain vulnerable to detection errors causing trajectory fragmentation.

Joint Detection and Tracking (JDT): End-to-end joint optimization. FairMOT employs a CenterNet-based dual-branch (detection/ReID) architecture. Chained-Tracker [9] formulates association as regression. JDT mitigates pipeline fragmentation but requires extensive training data.

2.1.2 Segmentation-Based MOT Methods

MOTS [10] pioneers mask-level tracking via Track-RCNN. PointTrack [11] treats pixels as point clouds, improving occlusion robustness through instance embeddings. However, segmentation methods rely on costly pixel-level annotations and exhibit high computational complexity.

2.1.3 Transformer-Based MOT Methods

In recent years, Transformer [12] has emerged as a research focus in MOT due to its global modeling capability. TransTrack [13] achieves unified modeling of detection and tracking through a query mechanism, while TrackFormer [14] employs an encoder-decoder framework to handle trajectory initialization and association. Although some Transformer-based MOT methods demonstrate superior performance in complex motion scenarios, their accuracy on conventional datasets remains inferior to optimized TBD methods such as OC-SORT [15]. Moreover, high-performance Transformer-based MOT approaches suffer from substantial computational overhead, while lightweight models exhibit significant accuracy degradation in complex environments, making them difficult to deploy on edge devices.

2.2 Face Tracking

Face tracking aims to detect and track facial targets in video sequences, leveraging facial key points to enhance identity preservation. Shi et al. [16] proposed a fusion framework based on the MTCNN detector and an improved KCF tracker, establishing a “detection-tracking-redetection” loop mechanism. However, its multi-stage detection structure leads to insufficient real-time performance in high-density scenarios. Lin et al. [17] adopted RetinaFace to replace MTCNN and combined it with KCF for fast multi-face tracking, significantly improving processing efficiency. Qi et al. [18] enhanced YOLOv5 to design the YOLO5Face detector, introducing a Stem module to strengthen generalization capability and improving occlusion stability through landmark supervision. Jöchl and Uhl [19] proposed the FaceSORT model, integrating facial biometric features with visual appearance features, thereby enhancing robustness in occlusion and side-view scenarios. However, the sensitivity to biometric feature quality and adaptive parameter tuning remain areas for improvement. Current face tracking methods generally suffer from high dependency on frontal faces, performance degradation under extreme poses, and frequent ID switches in low-resolution scenarios.

2.3 Pedestrian Head Tracking

Pedestrian head tracking focuses on detecting and continuously tracking head targets to address challenges such as occlusion and scale variation. Stewart et al. [20] developed an end-to-end detection framework based on Faster R-CNN but did not resolve high-density association issues. Sundararaman et al. [21] constructed the first large-scale head tracking dataset, CroHD, and proposed the HeadHunter-T framework. It employs a context-sensitive feature pyramid for optimized detection and combines particle filtering with histogram matching for tracking, achieving an IDF1 of 57.1% in occlusion scenarios. However, it underutilizes biometric features and relies on handcrafted features for appearance similarity modeling, leading to frequent ID switches under complex lighting conditions. Sun et al. [22] designed a multi-source information fusion network (MIFN) that integrates five sources (RGB + optical flow + depth + frame difference + density map) for end-to-end training, achieving a MOTA of 76.7% on the Cchead dataset (a 5.7% improvement over FairMOT). Overall, existing head trackers primarily rely on appearance features, lacking the strong discriminative power of facial biometrics. They exhibit significant shortcomings in occlusion robustness and adaptability to dynamic scenes, with prediction errors notably increasing during nonlinear motions such as sharp turns or evasive maneuvers in crowds.

2.4 Multimodal Joint Tracking

Multi-modal joint tracking technology aims to enhance persistent target identity tracking in complex surveillance scenarios by fusing multi-modal biometric features. Current research suggests that collaborative modeling of target identity, behavior, and intention improves performance in multi-target tracking tasks under challenging conditions. Multi-modal identity inference (e.g., cross-camera re-identification based on skeletal pose or clothing attributes) serves as the foundation for resolving target ID switches. Behavior recognition (e.g., classifying walking/stationary states) provides dynamic constraints for trajectory motion models, while pedestrian intention prediction (e.g., crossing decisions) critically influences the reliability of long-term trajectory forecasting. Sharma et al. [23] highlight that integrating visual (RGB/IR), geometric (point cloud), and behavioral modalities (gait/hand gestures) enhances intention reasoning accuracy, with spatio-temporal graph networks further optimizing group trajectory modeling by incorporating scene semantics. Huang et al. [24] proposed a feature-level fusion approach, employing a concatenation strategy for facial and pedestrian features along with a maximum-value-based decision mechanism. However, their method requires manual alignment due to feature dimensionality discrepancies and exhibits limited robustness in occlusion scenarios. Li [25] developed a cross-camera system that leverages head features to assist ReID and face recognition, significantly reducing the missed detection rate through a parallel matching mechanism (where a match is confirmed upon the success of any single feature). Nevertheless, the field still faces challenges, including insufficient feature complementarity under occlusion, bottlenecks in cross-modal fusion, and high computational costs that hinder real-time performance.

2.5 Object Tracking Datasets

Object tracking datasets can be categorized into single-object tracking (SOT) datasets and multi-object tracking (MOT) datasets based on the number of tracked targets.

Commonly used single-object tracking datasets include OTB, VOT, GOT-10K, and LaSOT. The OTB [26,27] dataset pioneered this field, propelling early advancements in target tracking algorithms. The VOT [28,29] dataset, updated annually for the Visual Object Tracking competition, has expanded to include real-time, long-term, and RGBD tracking challenges. More recent contributions include GOT-10K [30], a large-scale dataset covering diverse scenarios with 1.5 million annotated bounding boxes, and LaSOT [31], featuring 1400 high-quality sequences with over 3.5 million densely annotated frames. Each frame in every sequence was annotated manually, making LaSOT the largest annotation-intensive tracking benchmark of its time.

In the realm of multi-object tracking, the MOT series [32–34] has been instrumental. This series, evolving from MOT15 to MOT20, progressively introduced more complex scenarios, increased occlusions, and faster movement speeds. The STEP-ICCV21 dataset [35] advanced the field further by focusing on Multi-Object Tracking and Segmentation tasks. Additionally, the TAO VOS dataset [36] represents a significant contribution to large-scale video object segmentation and tracking, aimed at enhancing research in video understanding, particularly for long videos and extensive category ranges.

Table 1 presents a summary of the basic information for these datasets.

images

2.6 Face Tracking Datasets

Publicly available face tracking datasets are limited, with MobiFace and Chokepoint being the most commonly used (see Table 2).

images

MobiFace, proposed by Lin et al. [37] in 2019, is a novel dataset for mobile face tracking and grouping, containing 80 unedited videos recorded via smartphone live streaming. It provides over 95,000 annotated bounding boxes.

The Chokepoint dataset, introduced by Wong et al. [38] in 2019, is designed for multi-face tracking. It comprises two scenarios recorded one month apart: Scenario 1 includes 25 individuals (19 male, 6 female), while Scenario 2 has 29 (23 male, 6 female). Three cameras were mounted above an office doorway, capturing 78 video sequences with 64,204 annotated face images (excluding 6 unlabeled sequences). The first 100 frames of each sequence are reserved for background modeling, with no foreground objects present.

While datasets for pedestrians, vehicles, and animals are relatively mature, face tracking datasets and evaluation metrics remain scarce. Even the widely used Chokepoint dataset only annotates eye trajectories for simple scenarios (single pedestrian per frame). Notably, its six complex sequences (“P2E_S5” and “P2L_S5”), featuring multiple pedestrians with mutual occlusion, lack annotations, highlighting the need for further refinement in face tracking datasets.

3  Method

3.1 Face-pedestrian Joint Tracking Framework

3.1.1 Model Architecture Design

The proposed Face-Pedestrian Joint Tracking framework (Fig. 3) implements “intra-class independent tracking + cross-category dynamic binding”. Figs. 4 and 5 present the detailed architectural diagrams of intra-class independent tracking and cross-category dynamic binding, respectively. Based on an enhanced FairMOT architecture, it extends the detection head and increases feature channels to convert single-category tracking into multi-category tracking for faces and pedestrians. Cross-category dynamic binding then enables joint tracking through a joint matching mechanism. The architecture comprises two stages:

images

Figure 3: Face-pedestrian joint tracking framework

images

Figure 4: Intra-class independent tracking framework

images

Figure 5: Cross-category dynamic binding framework

(1) Intra-Class Independent Tracking

Face Tracking Branch: Facial features are extracted via the DLA-34 backbone network, outputting face bounding boxes $B_f \in \mathbb{R}^{N_f \times 4}$ and facial features $F_f \in \mathbb{R}^{128}$. The JDETracker generates face trajectories $T_f = \{(ID_f, BBOX_f, F_f)\}$.

Pedestrian Tracking Branch: Using the shared DLA-34 backbone network, pedestrian features are extracted, outputting pedestrian bounding boxes $B_p \in \mathbb{R}^{N_p \times 4}$ and ReID features $F_p \in \mathbb{R}^{128}$. The JDETracker generates pedestrian trajectories $T_p = \{(ID_p, BBOX_p, F_p)\}$.

(2) Cross-Category ID Binding

Based on the spatio-temporal and appearance joint metric (detailed in Section 3.2), trajectories $T_f$ and $T_p$ are dynamically associated, outputting the bound joint trajectory $T_{joint} = \{(ID_{joint}, BBOX_f, BBOX_p, F_f, F_p)\}$.

3.1.2 Face-Pedestrian Tracking Dataset Construction

Building upon the Chokepoint dataset, this work employs a combination of “automatic tracking” and “manual annotation” to label “face tracking” and “pedestrian tracking” data across 78 videos from Chokepoint, forming the Chokepoint_Face_Pedestrian_Track dataset. Of these, 72 videos represent simple scenes with minimal inter-pedestrian occlusion, as shown in Fig. 6a,b. The remaining 6 videos depict complex scenes with significant inter-pedestrian occlusion, as shown in Fig. 6c,d. The dataset comprises approximately 130,000 face tracking annotations and 160,000 pedestrian tracking annotations. These are divided into training, validation, and test sets using a 70%/15%/15% split. The statistical information of the Chokepoint_Face_Pedestrian_Track dataset is shown in Table 3. To evaluate the zero-shot tracking performance of our joint tracking model, we constructed the Challenging Face_Pedestrian_Track Dataset. Compared to the Chokepoint_Face_Pedestrian_Track Dataset, the Challenging dataset contains numerous side-view and back-view scenarios where faces are partially visible or completely occluded, as illustrated in Fig. 7. The dataset statistics are presented in Table 4.

images

Figure 6: Scene examples from chokepoint dataset. (a, b) Examples of simple scenes in the chokepoint dataset. (c, d) Examples of complex scenes in the chokepoint dataset

images

images

Figure 7: Scene examples from challenging Face_Pedestrian_Track Dataset. (a) Example of multiple pedestrian targets in Scene 1 (front view). (b) Example of multiple pedestrian targets in Scene 1 (rear view), where the red bounding box represents the simulated face bounding box used for performance evaluation. (c) Example of multiple pedestrian targets in Scene 2 (front view). (d) Example of pedestrian targets in Scene 2 (side view), where the red bounding box represents the simulated face bounding box used for performance evaluation

images

Annotation Format:

cls_id obj_id center_x center_y width height

cls_id: Target category (0: “Face”, 1: “Pedestrian”)

obj_id: Unique identifier for the target instance.

center_x: Normalized x-coordinate of the bounding box center.

center_y: Normalized y-coordinate of the bounding box center.

width: Normalized width of the bounding box.

height: Normalized height of the bounding box.
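For reference, the sketch below shows how one line of this annotation format could be parsed into a simple record; the TrackAnnotation class and parse_annotation_line helper are illustrative names, not part of any released toolkit.

```python
from dataclasses import dataclass

@dataclass
class TrackAnnotation:
    cls_id: int      # 0: "Face", 1: "Pedestrian"
    obj_id: int      # unique identifier for the target instance
    center_x: float  # normalized x-coordinate of the bounding-box center
    center_y: float  # normalized y-coordinate of the bounding-box center
    width: float     # normalized bounding-box width
    height: float    # normalized bounding-box height

def parse_annotation_line(line: str) -> TrackAnnotation:
    """Parse one whitespace-separated line:
    'cls_id obj_id center_x center_y width height'."""
    cls_id, obj_id, cx, cy, w, h = line.split()
    return TrackAnnotation(int(cls_id), int(obj_id),
                           float(cx), float(cy), float(w), float(h))

# Example: a pedestrian (cls_id = 1) with obj_id = 38 centered in the frame.
print(parse_annotation_line("1 38 0.5 0.5 0.2 0.6"))
```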

3.1.3 Loss Function Calculation

(1) Detection Task Loss Calculation

Heatmap Prediction: Detection head output channels match class count (e.g., one per pedestrian/face). CTFocalLoss is computed via pixel-wise cross-entropy over all classes, summed to a scalar.

$$L_{\text{heatmap}} = \frac{1}{N} \sum_{c=0}^{1} \sum_{i,j} \text{FocalLoss}\left(P_{c,i,j},\, Y_{c,i,j}\right) \tag{1}$$

$$\text{FocalLoss}(P, Y) = -\alpha (1 - P)^{\gamma}\, Y \log P - \beta P^{\gamma} (1 - Y) \log(1 - P) \tag{2}$$

where $c \in \{0, 1\}$ is the class index (0: “Face”, 1: “Pedestrian”); $P_{c,i,j}$ is the confidence score after Sigmoid at position $(i, j)$ for class $c$ predicted by the model; $Y_{c,i,j}$ is the ground-truth heatmap label (0 or 1) at position $(i, j)$; $N$ represents the total number of pixels marked as targets across all classes; $\alpha$, $\beta$, $\gamma$ are FocalLoss hyperparameters.
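As a point of reference, the following PyTorch-style sketch implements Eqs. (1)–(2) as written above; the default values of alpha, beta, and gamma, as well as the tensor shapes in the example, are illustrative assumptions rather than the paper's training settings.

```python
import torch

def ct_focal_loss(pred, target, alpha=1.0, beta=1.0, gamma=2.0):
    """Pixel-wise focal loss over the class heatmaps (Eqs. (1)-(2)).

    pred   -- predicted heatmap after Sigmoid, shape (B, C, H, W), C = 2 classes
    target -- binary ground-truth heatmap of the same shape (1 at target pixels)
    The per-pixel losses are summed and normalized by the number of positive
    pixels N, as in Eq. (1).
    """
    eps = 1e-6
    pred = pred.clamp(eps, 1.0 - eps)
    pos = -alpha * (1 - pred) ** gamma * target * torch.log(pred)
    neg = -beta * pred ** gamma * (1 - target) * torch.log(1 - pred)
    num_pos = target.sum().clamp(min=1.0)  # N in Eq. (1)
    return (pos + neg).sum() / num_pos

# Example with random tensors for the two classes (face, pedestrian):
pred = torch.sigmoid(torch.randn(1, 2, 152, 272))
gt = (torch.rand(1, 2, 152, 272) > 0.999).float()
print(ct_focal_loss(pred, gt).item())
```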

Size/Offset Regression Loss: The regression losses for target size and center offset are calculated directly based on the locations of all positive samples:

$$L_{\text{size}} = \frac{1}{M} \sum_{k=1}^{M} \left\| \hat{S}_k - S_k \right\|_1 \tag{3}$$

$$L_{\text{offset}} = \frac{1}{M} \sum_{k=1}^{M} \left\| \hat{O}_k - O_k \right\|_1 \tag{4}$$

where $M$ denotes the total number of positive samples for both faces and pedestrians; $\hat{S}_k$ represents the predicted target size; $S_k$ is the ground-truth target size; $\hat{O}_k$ represents the predicted center offset; $O_k$ is the ground-truth center offset.

(2) ReID Task Loss Calculation

ID Feature Hybrid Learning: Pedestrian and face IDs share a unified global pool (e.g., pedestrians: 1–9, faces: 11–19). Cross-entropy loss uniformly distinguishes all ID features as independent classes.

$$L_{\text{reid}} = L_{\text{face}} + L_{\text{pedestrian}} \tag{5}$$

$$L_c = \frac{1}{N_c} \sum_{i=1}^{N_c} \text{CrossEntropy}\left(W_c\, \phi(x_i^c),\, y_i^c\right) \tag{6}$$

where $c \in \{\text{face}, \text{pedestrian}\}$ is the category identifier; $\phi(x_i^c)$ denotes the L2-normalized feature vector of sample $x_i^c$; $W_c$ represents the parameter matrix of the independent classifier for category $c$; $y_i^c$ is the identity label corresponding to sample $x_i^c$; $N_c$ is the number of valid samples for category $c$ in the current batch; CrossEntropy indicates the cross-entropy loss function.

(3) Total Loss Calculation

The total loss function is the weighted sum of the individual loss components:

$$L_{\text{total}} = \lambda_1 L_{\text{heatmap}} + \lambda_2 L_{\text{size}} + \lambda_3 L_{\text{offset}} + \lambda_4 L_{\text{reid}} \tag{7}$$

where $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$ are the respective loss weighting coefficients.
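A minimal sketch of the regression and total-loss terms (Eqs. (3), (4), and (7)) is shown below, assuming the positive samples have already been gathered into (M, 2) tensors; the lambda values are placeholders, not the coefficients used in the paper's training.

```python
import torch
import torch.nn.functional as F

def size_offset_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Eqs. (3)-(4): mean L1 distance over the M positive samples.
    pred/gt have shape (M, 2): (w, h) for the size head, (dx, dy) for offsets."""
    return F.l1_loss(pred, gt, reduction="sum") / max(pred.shape[0], 1)

def total_loss(l_heatmap, l_size, l_offset, l_reid,
               lambdas=(1.0, 0.1, 1.0, 1.0)):
    """Eq. (7): weighted sum of the task losses (placeholder lambda values)."""
    l1, l2, l3, l4 = lambdas
    return l1 * l_heatmap + l2 * l_size + l3 * l_offset + l4 * l_reid
```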

The model training pipeline is illustrated in Fig. 8, and the training and validation loss curves are shown in Fig. 9. The primary focus of this study is to validate the effectiveness of the face-pedestrian joint matching method, rather than pursuing state-of-the-art (SOTA) model performance. Due to the relative scarcity of training datasets for face tracking tasks, the proposed method may exhibit limited generalization capability in scenarios with significant domain discrepancies.

images

Figure 8: Training pipeline for face-pedestrian tracking model

images

Figure 9: Training and validation loss curves for face-pedestrian tracking model

3.2 Face-Pedestrian Matching Method

To mitigate ID switches and trajectory fragmentation issues inherent in traditional single-category pedestrian tracking algorithms within complex scenarios like similar attire and mutual occlusion, this paper further optimizes the face-pedestrian matching strategy within the proposed joint detection and tracking framework. This is achieved by designing a multi-modal joint matching metric, establishing a dynamic threshold filtering mechanism, and utilizing the Hungarian algorithm to achieve global optimal matching for face-pedestrian pairs within a single frame.

3.2.1 Multi-Modal Joint Matching Metric

In complex MOT scenarios, occlusion and similar clothing degrade ReID features, causing ID switches. We propose a multi-modal matching metric with three mechanisms: (1) Spatio-Temporal Constraint: Combines centroid motion with IoA/IoU to suppress mismatches. (2) Cross-Category Feature Verification: Uses historical feature pool for bidirectional validation. (3) Dynamic Weighting: Adapts constraint weights to scene complexity.

$$C_{i,j} = \left[ \omega_1 \left( \text{IOA}(b_i^f, b_j^p) + \text{IOU}(b_i^f, b_j^p) \right) + \omega_2 S_{\text{ReID}}(f_j^p, \bar{f}_i^p) + \omega_3 S_{\text{Face}}(f_i^f, \bar{f}_j^f) \right] \cdot D_m \tag{8}$$

After normalization:

$$C_{i,j}^{\text{norm}} = \frac{\omega_1 \left( \text{IOA}(b_i^f, b_j^p) + \text{IOU}(b_i^f, b_j^p) \right) + \omega_2 S_{\text{ReID}}(f_j^p, \bar{f}_i^p) + \omega_3 S_{\text{Face}}(f_i^f, \bar{f}_j^f)}{2\omega_1 + \omega_2 + \omega_3} \cdot D_m \tag{9}$$

where $\text{IOA}(b_i^f, b_j^p) = \frac{\text{Area}(b_i^f \cap b_j^p)}{\text{Area}(b_i^f)} \in [0, 1]$ is the ratio of the intersection area between face box $i$ and pedestrian box $j$ to the area of face box $i$. $\text{IOU}(b_i^f, b_j^p) = \frac{\text{Area}(b_i^f \cap b_j^p)}{\text{Area}(b_i^f \cup b_j^p)} \in [0, 1]$ is the standard Intersection over Union between face box $i$ and pedestrian box $j$. $S(\cdot) \in [-1, 1]$ denotes the cosine similarity metric. $\bar{f}$ represents the historical feature mean. $D_m \in [0, 1]$ is the face-pedestrian centroid motion constraint term.

(1) Weight Calculation Method

$\omega_1 + \omega_2 + \omega_3 = 1$. $\omega_1$ is an adaptive weight parameter adjusted dynamically based on occlusion level:

$$\omega_1 = \frac{\alpha}{\alpha + \beta \cdot \text{Occl\_Level}} \tag{10}$$

The empirical coefficients are set as $\alpha = 0.6$ and $\beta = 0.3$. Occl_Level indicates the occlusion level of pedestrian bounding boxes. $\omega_2$ and $\omega_3$ represent the weights for pedestrian ReID and facial feature similarities, determined by their proportion in the total similarity score.

(2) Face-Pedestrian Centroid Motion Constraint

$$D_m = \exp\left( -\frac{\left| d_{\text{current}} - \bar{d} \right|^2}{2\sigma^2} \right) \tag{11}$$

Here, $d_{\text{current}}$ is the current face-pedestrian centroid distance, $\bar{d}$ the historical mean distance, and $\sigma$ the motion difference tolerance. Smaller $\sigma$ enforces stricter motion matching constraints.
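The sketch below computes the normalized metric of Eq. (9), the occlusion-adaptive weight of Eq. (10), and the motion constraint of Eq. (11) for a single face-pedestrian candidate pair. It is a simplified illustration: the equal split of the remaining weight between omega_2 and omega_3, the way historical features and distances are supplied, and the default sigma are assumptions, since the paper derives the weights from the similarity proportions and the historical feature pools.

```python
import numpy as np

def ioa_iou(face_box, ped_box):
    """Boxes are (x1, y1, x2, y2). Returns (IoA, IoU) as used in Eq. (9)."""
    x1, y1 = max(face_box[0], ped_box[0]), max(face_box[1], ped_box[1])
    x2, y2 = min(face_box[2], ped_box[2]), min(face_box[3], ped_box[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_f = (face_box[2] - face_box[0]) * (face_box[3] - face_box[1])
    area_p = (ped_box[2] - ped_box[0]) * (ped_box[3] - ped_box[1])
    return inter / max(area_f, 1e-6), inter / max(area_f + area_p - inter, 1e-6)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-6))

def joint_metric(face_box, ped_box, f_face, f_ped, hist_face, hist_ped,
                 d_current, d_hist, occl_level,
                 alpha=0.6, beta=0.3, sigma=10.0):
    """Normalized joint matching metric C_ij^norm (Eqs. (9)-(11))."""
    ioa, iou = ioa_iou(face_box, ped_box)
    s_reid = cosine(f_ped, hist_ped)    # pedestrian ReID vs. historical mean
    s_face = cosine(f_face, hist_face)  # face feature vs. historical mean
    w1 = alpha / (alpha + beta * occl_level)      # Eq. (10)
    w2 = w3 = (1.0 - w1) / 2.0                    # equal split (simplification)
    d_m = np.exp(-abs(d_current - d_hist) ** 2 / (2 * sigma ** 2))  # Eq. (11)
    score = w1 * (ioa + iou) + w2 * s_reid + w3 * s_face
    return score / (2 * w1 + w2 + w3) * d_m
```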

(3) Occlusion Level Assessment Method

The occlusion level $\text{Occl\_Level}_i^t \in [0, 1]$ for target bounding box $b_i^t$ in frame $t$ is calculated as follows:

Spatial Occlusion Detection: Compute the maximum IoU ($\text{IoU}_{\max}$) of $b_i^t$ with other bounding boxes $\{b_j^t\}_{j \neq i}$ in the same frame. Utilize a sigmoid-like nonlinear mapping to enhance sensitivity:

$$occ_{\text{raw}} = \frac{1}{1 + \exp\left( -k \left( \text{IoU}_{\max} - \tau \right) \right)} \tag{12}$$

where $k = 15$ is the slope factor, $\tau = 0.3$ is the effective occlusion threshold. When $\text{IoU}_{\max} > \tau$, $occ_{\text{raw}}$ rapidly approaches 1.

Temporal Smoothing: Maintain a temporal smoothing window of length $L = 5$ frames ($\{occ_{\text{raw}}^i(t - l)\}_{l=0}^{L-1}$). Apply a moving average to suppress the influence of transient noise:

$$\text{Occl\_Level}_i^t = \frac{1}{L} \sum_{l=0}^{L-1} occ_{\text{raw}}^i(t - l) \tag{13}$$

For new tracks, the queue is initialized by repeating the current frame's $occ_{\text{raw}}^i(t)$ value $L$ times to avoid bias before the window is full.
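A small sketch of this occlusion assessment (Eqs. (12)–(13)) is given below: a sigmoid-shaped mapping of the maximum IoU with neighboring boxes, followed by a five-frame moving average. The deque-based history and the class wrapper are implementation assumptions.

```python
from collections import deque
import math

K, TAU, L_WIN = 15.0, 0.3, 5  # slope factor, occlusion threshold, window length

def raw_occlusion(iou_max: float) -> float:
    """Eq. (12): sigmoid-like mapping of the maximum IoU with other boxes."""
    return 1.0 / (1.0 + math.exp(-K * (iou_max - TAU)))

class OcclusionSmoother:
    """Eq. (13): moving average over the last L_WIN raw occlusion values.
    New tracks are initialized by repeating the first value L_WIN times."""
    def __init__(self, first_value: float):
        self.window = deque([first_value] * L_WIN, maxlen=L_WIN)

    def update(self, occ_raw: float) -> float:
        self.window.append(occ_raw)
        return sum(self.window) / len(self.window)

# Example: a track whose overlap with neighboring boxes gradually increases.
smoother = OcclusionSmoother(raw_occlusion(0.05))
for iou_max in (0.1, 0.35, 0.6):
    print(round(smoother.update(raw_occlusion(iou_max)), 3))
```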

3.2.2 Face-Pedestrian Association via Dual Matching Strategy

We propose a robust face-pedestrian matching method combining Hungarian algorithm-based maximum-weighted bipartite matching (using feature similarity and spatio-temporal constraints) with dynamic threshold filtering. The system adaptively adjusts thresholds via environmental complexity assessment to mitigate occlusion-induced ID switches, while employing bidirectional historical mapping for ID consistency and feature inheritance for unmatched targets to maintain trajectory continuity.

(1) Face-Pedestrian Matching via Hungarian Algorithm

Construct a cost matrix $M^f \in \mathbb{R}^{M \times N}$, where element $m_{ij}^f = 1 - C_{i,j}$ and $C_{i,j}$ is the joint matching metric between face $i$ and pedestrian $j$. The goal is to find a Boolean matrix $X$ that minimizes:

$$\min \sum_{i=1}^{M} \sum_{j=1}^{N} m_{ij}^f x_{ij} \tag{14}$$

Subject to the constraints:

$$\sum_i x_{ij} \leq 1, \quad \sum_j x_{ij} \leq 1, \quad x_{ij} \in \{0, 1\} \tag{15}$$

The solution matrix $X$ corresponds to the frame's face and pedestrian indices. $x_{ij} = 1$ indicates a preliminary match between face $i$ and pedestrian $j$.
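Eqs. (14)–(15) form a standard linear assignment problem. The sketch below solves it with SciPy's linear_sum_assignment on the cost m_ij = 1 − C_ij; the example matrix values are illustrative, and the dynamic threshold filtering of Eq. (16) is assumed to be applied to the returned pairs afterwards.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_faces_to_pedestrians(C: np.ndarray):
    """Solve Eqs. (14)-(15) with cost m_ij = 1 - C_ij.

    C -- (M, N) joint matching metric between M faces and N pedestrians.
    Returns a list of (face_idx, ped_idx) preliminary matches.
    """
    rows, cols = linear_sum_assignment(1.0 - C)
    return list(zip(rows.tolist(), cols.tolist()))

# Example: 2 faces vs. 3 pedestrians (values are illustrative).
C = np.array([[0.82, 0.10, 0.05],
              [0.15, 0.07, 0.91]])
print(match_faces_to_pedestrians(C))  # [(0, 0), (1, 2)]
```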

(2) Dynamic Joint Threshold Filtering Mechanism

An initial face-pedestrian match pair $(f_i, p_j)$ must simultaneously satisfy the following conditions to enter the ID correction stage and be confirmed as a valid match:

$$\begin{cases} C_{f_i, p_j} \geq \tau(t) \\ \text{Occl\_Levels}[f_i][p_j] \leq \theta_{\text{occl}} \end{cases} \tag{16}$$

where $C_{f_i, p_j}$ is the joint matching metric value for the pair $(f_i, p_j)$, $\tau(t)$ is the current frame's dynamic matching threshold, $\text{Occl\_Levels}[f_i][p_j]$ is the occlusion assessment value for the pair, and $\theta_{\text{occl}}$ is the dynamic occlusion threshold. The calculation methods for both thresholds are detailed in Section 3.2.3.

(3) Historical ID Correction Mechanism

Maintain two bidirectional mapping tables to record historical associations: Face_to_Pedestrian (F2P): maps Face IDs to Pedestrian IDs. Pedestrian_to_Face (P2F): maps Pedestrian IDs to Face IDs.

Face-Dominated Correction: For a current matched pair $(f_{\text{curr}}, p_{\text{curr}})$, if $p_{\text{hist}} = \text{F2P}(f_{\text{curr}})$ and $\text{P2F}(p_{\text{hist}}) = f_{\text{curr}}$, update the pedestrian ID: $p_{\text{corrected}} = p_{\text{hist}}$.

Pedestrian-Dominated Correction: For a current matched pair $(f_{\text{curr}}, p_{\text{curr}})$, if $f_{\text{hist}} = \text{P2F}(p_{\text{curr}})$ and $\text{F2P}(f_{\text{hist}}) = p_{\text{curr}}$, update the face ID: $f_{\text{corrected}} = f_{\text{hist}}$.
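The bidirectional mapping and the two correction rules can be sketched with two dictionaries, as below; the class name, the policy for updating the tables after each confirmed match, and the example IDs are assumptions made for illustration.

```python
class HistoricalIDCorrector:
    """Bidirectional historical mapping (F2P / P2F) for ID correction."""

    def __init__(self):
        self.f2p = {}  # Face_to_Pedestrian: face ID -> pedestrian ID
        self.p2f = {}  # Pedestrian_to_Face: pedestrian ID -> face ID

    def correct(self, face_id: int, ped_id: int):
        """Apply the face-dominated and pedestrian-dominated correction rules."""
        p_hist = self.f2p.get(face_id)
        if p_hist is not None and self.p2f.get(p_hist) == face_id:
            ped_id = p_hist                      # face-dominated correction
        else:
            f_hist = self.p2f.get(ped_id)
            if f_hist is not None and self.f2p.get(f_hist) == ped_id:
                face_id = f_hist                 # pedestrian-dominated correction
        self.f2p[face_id] = ped_id               # record the confirmed association
        self.p2f[ped_id] = face_id
        return face_id, ped_id

# Example: face 23 was historically bound to pedestrian 24; a later frame
# reports a switched pedestrian ID (31), which is corrected back to 24.
corrector = HistoricalIDCorrector()
corrector.correct(23, 24)
print(corrector.correct(23, 31))  # -> (23, 24)
```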

The pseudo-code for the face-pedestrian association algorithm using the dual matching strategy is presented in Algorithm 1.

images

3.2.3 Dynamic Threshold Mechanism

To address the adaptability limitations of fixed matching thresholds in dynamic environments, this study designs an environmental complexity-based assessment framework to construct a threshold adaptation model. This model dynamically adjusts the multi-modal joint matching metric threshold based on real-time tracking conditions (specifically, pedestrian density and mutual occlusion within the current frame), thereby avoiding fragmented trajectories caused by excessively high thresholds in complex scenes or mismatches resulting from overly low thresholds in simple scenes.

(1) Dynamic Matching Threshold

The dynamic matching threshold $\tau(t)$ is adaptively adjusted based on the scene complexity factor:

$$\tau(t) = \tau_{\text{base}} \left( 1 + \alpha \tanh\left( \frac{C(t) - \mu}{\sigma} \right) \right) \tag{17}$$

where $\tau_{\text{base}}$ is the base matching threshold. $\alpha = 0.2$ is a tuning coefficient. $\mu = 0.6$ is the normalized center value for scene complexity. $\sigma = 0.2$ is the standard deviation parameter.

The scene complexity factor $C(t)$, comprehensively reflecting the tracking difficulty of the current frame, is calculated as follows:

$$C(t) = \frac{1}{N} \sum_{i=1}^{N} \left( \text{Density}_i + \lambda \cdot \text{Occl\_Level}_i \right) \tag{18}$$

where $N$ is the total number of targets in the current frame. $\text{Density}_i$ is the local density around target $i$. $\text{Occl\_Level}_i$ is the occlusion level estimate based on the bounding box of the pedestrian corresponding to target $i$ in the current frame. $\lambda = 0.5$ is the weighting coefficient for the occlusion term, balancing the influence of density and occlusion.

The local density $\text{Density}_i$ for the center point of each pedestrian target is computed as:

$$\text{Density}_i = \frac{N_i - 1}{\pi r^2} \tag{19}$$

where $N_i$ denotes the number of neighboring targets within a radius $r$ of target $i$ (computed via KDTree query). $r$ is the density calculation radius, related to scene resolution:

$$r = \beta \cdot \frac{\text{avg\_height} + \text{avg\_width}}{2} \tag{20}$$

where $\beta = 0.5$ is an empirically determined coefficient. avg_height and avg_width represent the average height and width of targets within the dataset. For online tracking, these values can be replaced by the average height and width of pedestrian targets in the current frame.
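The following sketch puts Eqs. (17)–(20) together using a KD-tree neighborhood query; the base threshold tau_base, the example box sizes, and the way occlusion levels are supplied are placeholders for illustration rather than the paper's settings.

```python
import numpy as np
from scipy.spatial import cKDTree

def local_density(centers: np.ndarray, r: float) -> np.ndarray:
    """Eq. (19): neighbors within radius r around each target center."""
    tree = cKDTree(centers)
    counts = np.array([len(idx) for idx in tree.query_ball_point(centers, r)])
    return (counts - 1) / (np.pi * r ** 2)  # the query includes the point itself

def dynamic_threshold(centers, occl_levels, avg_h, avg_w,
                      tau_base=0.5, alpha=0.2, mu=0.6, sigma=0.2,
                      lam=0.5, beta=0.5):
    """Eqs. (17)-(20): scene-complexity-adaptive matching threshold.
    tau_base is a placeholder value, not the setting used in the paper."""
    r = beta * (avg_h + avg_w) / 2.0                        # Eq. (20)
    density = local_density(np.asarray(centers, float), r)  # Eq. (19)
    c_t = np.mean(density + lam * np.asarray(occl_levels))  # Eq. (18)
    return tau_base * (1 + alpha * np.tanh((c_t - mu) / sigma))  # Eq. (17)

# Example: four pedestrians, two of them close together and partly occluded.
centers = [[100, 200], [120, 210], [400, 220], [600, 230]]
occl = [0.6, 0.7, 0.1, 0.0]
print(round(dynamic_threshold(centers, occl, avg_h=160, avg_w=60), 3))
```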

(2) Dynamic Occlusion Threshold

The dynamic occlusion threshold $\theta_{\text{occl}}$ is determined based on the average occlusion level among potential face-pedestrian match pairs in the current frame:

$$\theta_{\text{occl}} = \delta \cdot \frac{1}{N_f \times N_p} \sum_{i=1}^{N_f} \sum_{j=1}^{N_p} \text{Occl\_Levels}[i][j] \tag{21}$$

where $N_f$ and $N_p$ are the number of detected faces and pedestrians in the current frame, respectively. $\text{Occl\_Levels}[i][j]$ represents the occlusion assessment value between face $i$ and pedestrian $j$. $\delta$ is a scaling coefficient, configured based on the specific context.

4  Experiments

To validate the effectiveness of the proposed method, this study designs four experimental components: (1) Face-Pedestrian Matching Accuracy Test, (2) Face-Pedestrian Matching Ablation Study, (3) Face-Pedestrian Tracking Performance Evaluation, and (4) Real-Time Performance and Computational Cost Analysis. The first experiment evaluates the accuracy of the face-pedestrian matching component, while the second quantifies the contribution of individual modules in the matching approach. The third assesses the overall performance of the joint face-pedestrian tracking model, and the fourth systematically examines the computational efficiency through runtime and frame rate measurements. Experiments 1 and 2 adopt F1 AUC as the primary evaluation metric, Experiment 3 employs IDF1 and MOTA as key performance indicators, and Experiment 4 focuses on processing time and average FPS comparisons.

4.1 Face-Pedestrian Matching Accuracy Testing

To validate the effectiveness of the proposed face-pedestrian association matching method, an evaluation framework based on dual IoU thresholds was designed to quantify the face-pedestrian matching results.

4.1.1 Association Matching Strategy

This evaluation framework employs a dual-constraint Hungarian matching algorithm for association determination. A successful match is defined as follows:

$$\text{Match}(P, T) = \begin{cases} 1, & \text{if } \text{IoU}(P_f, T_f) \geq \theta \ \text{and} \ \text{IoU}(P_p, T_p) \geq \theta \\ 0, & \text{otherwise} \end{cases} \tag{22}$$

where $P_f$ and $P_p$ represent the predicted face and pedestrian bounding boxes, respectively. $T_f$ and $T_p$ represent the ground-truth face and pedestrian bounding boxes, respectively. $\theta$ is the IoU threshold.
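A minimal sketch of this dual-IoU criterion (Eq. (22)) is given below; the box format and the default threshold are assumptions, and in the full evaluation the check is applied to the pairs produced by the Hungarian association.

```python
def box_iou(a, b):
    """Standard IoU of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / max(area_a + area_b - inter, 1e-6)

def dual_iou_match(pred_face, pred_ped, gt_face, gt_ped, theta=0.5):
    """Eq. (22): a predicted face-pedestrian pair counts as correct only if
    BOTH boxes overlap their ground-truth counterparts with IoU >= theta."""
    return int(box_iou(pred_face, gt_face) >= theta and
               box_iou(pred_ped, gt_ped) >= theta)

# Example: face box matches well, pedestrian box drifts slightly.
print(dual_iou_match((10, 10, 40, 50), (0, 0, 60, 200),
                     (12, 12, 42, 52), (5, 0, 65, 195)))
```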

4.1.2 Evaluation Metrics

Precision: Reflects the proportion of correct associations among all predicted associations, primarily measuring the control capability over false positives.

Recall: Characterizes the proportion of correct associations among all ground-truth associations, evaluating the method’s ability to suppress missed detections.

F1 Score: A balanced metric combining Precision and Recall. F1 is set to zero when P+R=0.

4.1.3 Performance Quantitative Evaluation

To overcome the limitations of single-threshold evaluation, this paper conducted multi-threshold testing within the range $\theta \in \{0.1, 0.2, \ldots, 0.9\}$, comprehensively reflecting the matching adaptability of the association matching algorithm under different application scenarios.

The experiment was tested on 12 video sequences from the Chokepoint_Face_Pedestrian_Track test set. Among these, 9 represent simple scenes with minimal inter-pedestrian occlusion, and 3 represent complex scenes with significant inter-pedestrian occlusion. Test results are aggregated separately for each scene type. For an intuitive comparison of different methods’ performance across various metrics and thresholds, the AUC was employed for global performance assessment:

$$\text{AUC} = \int_{0.1}^{0.9} M(\theta)\, d\theta \tag{23}$$

where $M(\theta)$ is the measured value of the corresponding metric at threshold $\theta$. The AUC, computed via numerical integration, represents the area under the metric curve; a higher AUC indicates better performance for that metric.
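In practice, Eq. (23) can be approximated with the trapezoidal rule over the nine tested thresholds, as in the sketch below; the F1 values in the example are illustrative, not measured results.

```python
import numpy as np

def metric_auc(thresholds, values):
    """Eq. (23): area under the metric-vs-threshold curve (trapezoidal rule)."""
    return np.trapz(values, thresholds)

# Example: F1 measured at theta = 0.1, ..., 0.9 (illustrative values).
thetas = np.arange(0.1, 1.0, 0.1)
f1 = np.array([0.92, 0.91, 0.90, 0.88, 0.85, 0.80, 0.72, 0.60, 0.45])
print(round(metric_auc(thetas, f1), 3))
```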

4.1.4 Performance Comparison of Face-Pedestrian Matching Methods

We evaluate four methods (IoU, IoA, FP, Our) across simple, complex, and hybrid scenarios using Precision, Recall, F1 Score at different IoU thresholds and their respective AUC values (Figs. 10–12). Analysis of the face-pedestrian matching test results across the three scenarios reveals:

images

Figure 10: Face-pedestrian matching performance comparison in simple scenes

images

Figure 11: Face-pedestrian matching performance comparison in complex scenes

images

Figure 12: Face-pedestrian matching performance comparison in hybrid scenes

Simple Scenes: Performance differences between methods are minimal. Our method achieves the highest F1 AUC (0.706), slightly outperforming IoA (0.705), FP (0.674), and IoU (0.697). All methods effectively balance Precision and Recall in simple scenes. Notably, IoA attains the best Recall AUC (0.726), while Our method demonstrates superior overall performance with balanced Precision (0.687) and Recall (0.725), highlighting its adaptability.

Complex Scenes: Our method exhibits significant advantages, with Precision AUC (0.676), Recall AUC (0.633), and F1 AUC (0.654) substantially exceeding other methods (e.g., FP achieves only around 0.225 for all metrics). Traditional methods (FP, IoA, IoU) perform poorly, with IoA (0.611) and IoU (0.511) outperforming FP but still lagging behind Our method, demonstrating its robustness in complex scenes.

Hybrid Scenes: Our method achieves the best performance across all metrics: Precision AUC (0.681), Recall AUC (0.670), and F1 AUC (0.675). While IoA shows competitive Recall AUC (0.658), its Precision (0.642) and F1 (0.650) remain inferior. Traditional methods (IoU, FP) exhibit inconsistent performance, performing well in simple but poorly in complex scenes, revealing their limited adaptability. This underscores the effectiveness of Our method’s multi-modal joint matching, dual matching strategy, and scene adaptation mechanisms for cross-scenario applications.

The results demonstrate Our method’s consistent superiority in simple, complex, and hybrid scenes. In complex scenes, it significantly outperforms geometric (IoA, IoU) and feature-based (FP) methods, with F1 AUC improvements ranging from 4 to 43 percentage points, validating its robustness and accuracy for face-pedestrian matching across diverse environments.

4.2 Face-Pedestrian Matching Ablation Study

The novelty of the proposed face-pedestrian matching method lies mainly in five key mechanisms: (1) Spatial matching based on IoA-IoU joint metric, denoted IoA-IoU. (2) Cross-category reverse feature retrieval strategy, denoted FPF (Face-Pedestrian Features). (3) Spatial constraint mechanism based on centroid distance, denoted CD (Centroid Distance). (4) Environment-aware dynamic threshold filtering method, denoted DT (Dynamic Threshold). (5) Target ID correction mechanism based on historical matching, denoted IDC (ID Correction). To validate the effectiveness of each module, an ablation analysis was performed using comprehensive testing on 9 simple scenes and 3 complex scenes. The five core components were systematically verified. Specific experimental results are shown in Fig. 13.

images

Figure 13: Ablation study on face-pedestrian matching in hybrid scenes

Analysis of the ablation results reveals:

IoA-IoU Joint Metric: Introducing this module significantly improved performance, increasing F1 AUC from the baseline of 0.58747 to 0.668036 (relative increase of 13.7%). This demonstrates that the IoA mechanism effectively alleviates the high sensitivity of single IoU to target deformation and occlusion, robustly validating the necessity of the joint metric strategy for spatial alignment.

FPF Module: Adding this module resulted in a marginal performance increase, with F1 AUC rising by only 0.00033 (0.668036 → 0.668366). This suggests limitations of the cross-category feature retrieval strategy within the existing feature space, potentially due to insufficient model feature extraction capability.

CD Module: Introducing this constraint increased F1 AUC by 0.001301 (0.668366 → 0.669667). Precision AUC and Recall AUC also improved simultaneously. This indicates the effectiveness of the centroid distance constraint module in suppressing incorrect associations of spatially shifted targets, enhancing robustness in complex motion scenarios.

DT Module: Enabling this mechanism improved F1 AUC and Recall AUC, but caused a slight decrease in Precision AUC (−0.000198, 0.675589 → 0.675391). This reveals that the dynamic threshold filtering mechanism moderately relaxes matching conditions to increase recall rate. Although this slightly reduces precision, the overall F1 AUC improvement validates its effectiveness in balancing detection sensitivity and false positive suppression.

IDC Module: Finally, adding this module for identity correction using historical matching information further enhances long-term identity consistency for face-pedestrian matching. Built upon the accumulation of all previous modules, it raises the final model’s F1 AUC to 0.67533 (+0.82% compared to without IDC), providing a solution to mitigate ID drift issues in cross-category tracking.

The proposed face-pedestrian matching framework systematically optimizes performance by integrating the IoA-IoU joint metric, FPF cross-category feature retrieval, CD spatial constraint, DT dynamic threshold, and IDC historical correction mechanisms. The ablation study indicates: IoA-IoU serves as the foundational module, contributing the primary performance gain (F1 AUC increase of 13.7%), robustly validating the joint metric’s effectiveness for spatial alignment. CD and IDC further enhance performance through spatial constraints and historical consistency optimization, enabling the final model to achieve an F1 AUC of 0.67533 (15% improvement over the baseline). FPF and DT modules, while providing smaller contributions, offer optimization direction for cross-category alignment and environmental adaptability.

4.3 Face-Pedestrian Tracking Model Performance Testing

This experiment is also conducted on the 12 video sequences from the Chokepoint_Face_Pedestrian_Track test set, systematically evaluating two model architectures:

Baseline Model: The “Face-Pedestrian Multi-Category Object Independent Tracking Model” without integrated identity association information. Its performance is tested separately for pedestrian and face tracking.

Our Model: The joint tracking model integrating the “Face-Pedestrian Cross-category Matching Mechanism.” Recognizing that pedestrian targets generally maintain detection and tracking continuity more effectively than faces in real-world scenarios, while also being more susceptible to occlusion in complex settings, this experiment focuses solely on evaluating the corrected pedestrian tracking results output by Our Model. This targeted evaluation aims to specifically validate the effectiveness of the proposed occlusion-handling ID correction mechanism.

The experimental design serves two primary verification objectives: (1) Assess the feasibility of the multi-category object independent tracking framework via performance evaluation of the baseline model. (2) Quantitatively analyze the optimization effect of the cross-category identity matching mechanism on tracking performance through comparative experiments with the baseline model.

Performance evaluation follows the MOT Challenge standard evaluation protocol, employing a multi-dimensional assessment system. Core metrics include: IDF1: Comprehensively measures ID precision and recall. IDs: Specifically assesses target identity stability (number of identity switches). MOTA: Integrally reflects detection and tracking capability. MOTP: Quantifies trajectory localization precision. All metrics were calculated using the motmetrics toolkit. Results are presented in Table 5.

images

4.3.1 Validation of Multi-Category Object Independent Tracking Framework Feasibility

Results indicate that the baseline model (Baseline Model) achieved the following pedestrian tracking performance across different scenarios: Simple Scenes: IDF1: 89.4749%, MOTA: 82.7753%. Complex Scenes: IDF1: 80.1621%, MOTA: 89.4503%. Hybrid Scenes: IDF1: 84.2776%, MOTA: 86.5413%. Its face tracking performance was: Simple Scenes: IDF1: 94.1164%, MOTA: 88.7521%. Complex Scenes: IDF1: 92.3022%, MOTA: 89.8148%. Hybrid Scenes: IDF1: 92.9690%, MOTA: 89.4326%. These results validate the fundamental tracking capability of the proposed multi-category object independent tracking framework for both pedestrian and face targets. Note: The performance evaluation of Our Model is conducted only on the pedestrian tracking trajectories output by its cross-category association correction mechanism. Accordingly, for direct comparison, the performance metrics reported for the Baseline Model in the following sections pertain specifically to its pedestrian tracking results.

Comparison of Our Model results reveals that in all three scenarios, its IDF1 (89.5370%, 81.7344%, 85.1825%) and MOTA (82.8376%, 89.4984%, 86.5956%) metrics are superior to the corresponding values of the Baseline Model. This suggests room for improvement in the baseline model’s ability to maintain target identity consistency and trajectory continuity, thereby confirming the necessity of introducing the cross-category association mechanism.

It should be specifically noted that in the experimental test dataset, pedestrian targets predominantly exhibit frontal faces throughout the sequence, and the ground truth annotations for evaluating face tracking performance are based on ground truth face tracking data rather than ground truth object tracking data. Consequently, the face tracking results appear favorable. However, in cross-camera scenarios or situations where face detection and tracking are more challenging, standalone face tracking performance would decline significantly. For tracking pedestrian-like targets, whether employing face tracking or pedestrian tracking, the primary objective is to maintain target identity consistency and continuity. Since pedestrian tracking is less affected by camera angles (compared to face tracking, which requires near-frontal views), using ground truth pedestrian tracking annotations to evaluate algorithm performance is more reliable. Therefore, in the experiments, Our Model only presents test results based on the corrected pedestrian tracking data derived from the proposed method.

4.3.2 Optimization Effect of Cross-Category Identity Matching Mechanism

The core innovation of Our Model lies in effectively correcting trajectory interruptions and ID switches caused by single-modal occlusion or tracker limitations via the face-pedestrian association mechanism. To validate this ID correction mechanism, parameter combination experiments were conducted using face_matrix_tau (association confidence threshold) and occl_threshold (occlusion level threshold), with the filtering logic: reject a match pair if its association confidence is below face_matrix_tau or its occlusion level exceeds occl_threshold.

Analysis shows: When adjusting only face_matrix_tau, IDF1 exhibits an initial increase followed by a decrease as the threshold rises from 0.1 to 0.9 (e.g., in Hybrid Scenes, IDF1 peaked at 85.4087% when face_matrix_tau = 0.1 but declined towards baseline levels when face_matrix_tau > 0.6). This is attributed to low thresholds retaining excessive mismatches causing error propagation, while high thresholds over-filter valid matches. MOTA showed the opposite trend, peaking at 86.5413% (baseline level) when face_matrix_tau = 0.9. This highlights the inherent limitation of a single parameter in balancing identity consistency (IDF1) and tracking accuracy (MOTA).

Given that most mismatches occur in heavily occluded scenes, occl_threshold was introduced for joint filtering. In Complex Scenes, setting face_matrix_tau = 0.1 or 0.3 and occl_threshold = 0.1 reduced IDs from the baseline 91 to 89 and increased MOTA from 89.4503% to 89.4743%. The same settings in Hybrid Scenes reduced IDs by 6 (102 → 96) and increased MOTA from 86.5413% to 86.5821%. Simple Scenes, with low occlusion frequency, were insensitive to occl_threshold; applying only face_matrix_tau = 0.1 reduced IDs from 11 to 7 and increased MOTA from 82.7753% to 82.8376%. However, applying occl_threshold in Simple Scenes resulted in some correct matches being filtered, causing a slight drop in IDF1.

Our Model, utilizing a collaborative filtering mechanism that dynamically adjusts face_matrix_tau and occl_threshold, achieved an IDF1 of 85.1825% and MOTA of 86.5956% in Hybrid Scenes. This improves upon the Baseline Model by +0.91pp and +0.06pp on IDF1 and MOTA respectively, while reducing IDs to 94. Despite its weighted sum (IDF1 + MOTA = 171.7781%) being marginally lower than the best manual combination (171.8211%), the latter’s lower MOTA (86.4124% compared to baseline 86.5413%) compromised overall balance, whereas Our Model achieved a synchronous improvement in both IDF1 and MOTA over the baseline, demonstrating superior balance capability. These results demonstrate that the dynamic adaptation mechanism effectively tailors filtering strategies to task priorities: combinations favoring low thresholds enhance tracking performance, while single low face_matrix_tau filtering benefits long-term identity consistency.

Fig. 14 presents the tracking results on a subset of the test dataset. Specifically, Fig. 14a,c,e displays the tracking results of the Baseline Model in one simple and two complex scenarios, while Fig. 14b,d,f shows the corresponding results of Our Model. For clarity in evaluating tracking performance, relevant targets are annotated with color-coded bounding boxes. As seen in Fig. 14(a2,a4,a5), the Baseline Model assigns inconsistent IDs (21, 23, 24) to the same pedestrian, indicating severe ID_SWITCH. In contrast, Our Model maintains a consistent pedestrian ID (24) in the corresponding frames Fig. 14(b2,b4,b5). This is because, in the joint tracking framework, the face ID remains stable (e.g., ID 23 in Fig. 14(b1,b3)). When pedestrian tracking suffers from ID_SWITCH, the face ID consistency is leveraged to correct it. In Fig. 14(b2,b4,b5), when face and pedestrian detections are successfully matched, the system displays the pedestrian ID (24) by default. If matching fails or either detection is missing, the initial tracking ID (uncorrected) is shown instead. Thus, face IDs in Fig. 14b appear as both 23 and 24, though the underlying face ID is consistently 23; this visualization is used solely to illustrate the correction mechanism. Similarly, in complex scenarios (Fig. 14c–f), Our Model demonstrates clear advantages. For instance, in Fig. 14c, the Baseline Model assigns pedestrian IDs 31, 33, and 34, whereas Our Model corrects these to 49 and 52 (Fig. 14d). In Fig. 14e, the Baseline Model produces IDs 12 and 15, while Our Model maintains ID 13 (Fig. 14f). These observations confirm that face information serves as a critical identity anchor in tracking scenarios. By associating stable face IDs with pedestrian tracking results, Our Model robustly mitigates frequent ID_SWITCH in the baseline approach, improving continuous tracking accuracy and identity consistency under challenges such as occlusion and deformation.

images

Figure 14: The side-by-side comparison visualizations of tracking results from the baseline model and our model

The core value of the proposed face-pedestrian joint framework lies in its complementary and synergistic mechanisms. Body tracking features exhibit superior robustness to viewpoint variations (e.g., side/back views) and partial occlusions compared to face tracking. When facial information degrades or fails due to occlusion or pose changes, body cues maintain tracking continuity and identity consistency. Compared to pedestrian tracking alone, in crowded environments, human bodies are often partially or fully occluded, whereas faces typically remain the only consistently visible feature. Crucially, the cross-category dynamic matching strategy adaptively selects optimal matching clues (face, body, or fused features) based on real-time confidence metrics (e.g., face detection quality, occlusion severity). This enables the model to leverage high-discriminability facial features when reliable, while seamlessly switching to robust body features or weighted fusion in challenging scenarios (e.g., severe occlusion or cross-camera views). Consequently, the framework achieves consistently stable tracking performance across diverse conditions, particularly in occlusion-prone and viewpoint-varying environments—a capability unattainable by unimodal approaches.

4.4 Zero-Shot Performance Evaluation of Face-Pedestrian Tracking Models

To further validate the advantages of our joint tracking model over standalone pedestrian tracking or standalone face tracking models in handling pedestrian viewpoint variations (e.g., side-view/back-view) and occlusion scenarios, this paper evaluates three models, namely Baseline Model (pedestrian), Baseline Model (face), and Our Model, on the Challenging Face_Pedestrian_Track Dataset. It should be noted that in the Challenging dataset, targets exhibit significant viewpoint variations in camera footage, with frequent side-view and back-view scenarios, resulting in some targets' faces being undetectable or untrackable. To ensure a fair comparison of tracking performance across models, for side-view and back-view scenarios, even if faces cannot be detected or tracked by the model, simulated face tracking annotations are used as ground-truth face tracking annotations during performance evaluation, as shown in Fig. 7b,d. This ensures that the total number of targets (faces or pedestrians) in the ground-truth annotations remains consistent for all three models during evaluation. The test results for each model are presented in Table 6.

images

(1) Severe Limitations of Standalone Face Models

In critical scenarios such as side-view and back-view, the standalone face-tracking model (Baseline Model (face)) reveals fundamental flaws. It relies heavily on visible frontal faces; once the target turns sideways or away from the camera and face information is lost, the model fails. This directly leads to a very large number of false negatives (FN = 1096, far exceeding the other two models), reflected in an extremely low identification recall rate (IDR = 50.7761%). Although its precision (IDP) is acceptable, the poor recall drags down its overall identification performance (lowest IDF1) and accuracy (MOTA = 68.3204%), indicating that the model cannot stably track targets under pose variations and lacks robustness.

(2) Significant Advantages of the Joint Model

The proposed face-pedestrian joint tracking model (Our Model) effectively overcomes the limitations of single-modality approaches by integrating face and pedestrian appearance information. It demonstrates exceptional performance in handling side-view and back-view scenarios: false negatives (FN = 103) are significantly lower than those of the face model and approach the level of the standalone pedestrian model; the identification recall rate (IDR = 60.8093%) is notably higher than that of the face model and slightly better than the pedestrian model, proving its ability to maintain tracking when targets are not facing the camera. Its comprehensive identification performance (IDF1 = 61.4480%) and overall accuracy (MOTA = 95.2051%) are the highest among the three. The key lies in the model’s ability to intelligently switch or fuse information sources (using faces for frontal views and pedestrian data for non-frontal views), ensuring trajectory continuity.

(3) Conclusions and Value

Experiments on the Challenging Dataset, which is rich in side-view and back-view challenges, clearly demonstrate that standalone face-tracking models suffer from severe flaws when target poses vary, lacking the robustness required for practical applications. In contrast, the proposed joint tracking model, through complementary fusion of face and pedestrian information, significantly improves tracking stability and continuity under viewpoint variations and occlusions, achieving optimal comprehensive performance. Its design strategy holds substantial practical value.

4.5 Real-Time Performance and Computational Cost Analysis

To evaluate the practical deployment costs of the proposed method, we conducted comparative experiments between Our_model and two representative tracking models: FairMOT (a JDE-based approach) and ByteTrack (an SDE-based approach). The evaluation was performed on the Chokepoint_Face_Pedestrian_Track dataset, consisting of 12 video sequences (9 simple scenes and 3 complex scenes). For each method, we tested both large-resolution and small-resolution model variants. Each video sequence was processed three times, with the total runtime calculated as the average across these trials.

We recorded two key metrics:

(1) Total runtime across all video sequences, including all computational overhead (system initialization, model loading, parameter initialization, data loading, preprocessing, model inference, post-processing, and result output).

(2) Average FPS, computed across all frames (including frames without targets).

For ablation analysis, Our_model (Without Face-Pedestrian Matching) refers to a modified version where the cross-category dynamic binding module was removed, retaining only the trained McFairmot model for separate face and pedestrian tracking.
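
A minimal timing harness following the measurement protocol described above might look as follows. The tracker interface (tracker_factory, tracker.update) and the sequence loader (seq.frames) are hypothetical placeholders, since the actual evaluation code is not part of the paper.

import time

def benchmark(tracker_factory, sequences, trials=3):
    """Measure total runtime and average FPS over a set of video sequences.

    Total runtime includes model loading and pre/post-processing, as in the
    protocol above; FPS is averaged over all frames, including frames with no
    targets. The runtime is averaged over the given number of trials.
    """
    runtimes = []
    total_frames = 0
    for _ in range(trials):
        start = time.perf_counter()
        tracker = tracker_factory()        # model loading and initialization are counted
        frames = 0
        for seq in sequences:
            for frame in seq.frames():     # per-frame data loading
                tracker.update(frame)      # preprocessing, inference, association, output
                frames += 1
        runtimes.append(time.perf_counter() - start)
        total_frames = frames
    avg_runtime = sum(runtimes) / len(runtimes)
    return avg_runtime, total_frames / avg_runtime   # (total runtime, average FPS)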

Experimental Hardware Configuration:

CPU: Intel(R) Core(TM) i7-14700KF

GPU: NVIDIA GeForce RTX 4060 Ti (16GB VRAM)

System Memory: 48GB RAM

The comparative results (total runtime, average FPS, and performance across different scene types) are summarized in Table 7.

[Table 7]

This study evaluates the real-time deployment performance of Our_model against representative tracking models (FairMOT and ByteTrack) on the Chokepoint_Face_Pedestrian_Track dataset. As shown in Table 7, the key results are as follows. With the small-resolution model (576 × 320), Our_model achieves an average processing speed of 26.86 FPS (28.52 FPS in simple scenes, 20.75 FPS in complex scenes). This average frame rate meets the commonly used real-time threshold (≥25 FPS), although it drops below that threshold in complex scenes, indicating that the method can satisfy basic real-time requirements on mid-range hardware such as the tested RTX 4060 Ti GPU. However, Our_model still exhibits a speed gap relative to the baseline models: FairMOT’s small-resolution model is the fastest (36.12 FPS), followed by ByteTrack (31.67 FPS). Resolution also has a significant impact on speed, with Our_model’s small-resolution variant running at approximately twice the speed of its large-resolution counterpart (1088 × 608, 13.54 FPS).

Further analysis shows that the cross-category dynamic binding module unique to our model incurs a quantifiable computational overhead. When this module is removed (denoted Our_model (without matching)), the small-resolution model’s speed increases to 30.12 FPS (an improvement of approximately 10.4%), indicating that this feature currently adds roughly 10% computational load. Compared with FairMOT’s small-resolution model at 36.12 FPS, Our_model (without matching) exhibits a 16.7% speed reduction. Notably, scene complexity affects Our_model markedly (frame rates drop by approximately 27.2% from simple to complex scenes), a more pronounced decline than FairMOT’s 3.5% reduction, suggesting room for improving algorithmic robustness in challenging environments. In summary, Our_model delivers its distinctive cross-category face-pedestrian tracking capability at a controlled speed cost, maintaining near-real-time performance with small-resolution inputs and demonstrating a reasonable trade-off between functional richness and computational expense. Moreover, since the reported runtimes include system initialization, model loading, and other one-time overheads, the steady-state frame rate leaves further room for optimization.
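
As a quick sanity check on the relative figures quoted above, the short snippet below recomputes them from the FPS values cited in the text; small discrepancies with the percentages reported in the paper can arise from rounding of the underlying measurements.

fps_with, fps_without, fps_fairmot = 26.86, 30.12, 36.12   # small-resolution models
fps_simple, fps_complex = 28.52, 20.75                      # Our_model by scene type

matching_overhead = 1 - fps_with / fps_without    # about 0.108, consistent with the roughly 10% module overhead
gap_vs_fairmot = 1 - fps_without / fps_fairmot    # about 0.166, matching the reported 16.7% gap up to rounding
scene_drop = 1 - fps_complex / fps_simple         # about 0.272, the 27.2% drop from simple to complex scenes
print(matching_overhead, gap_vs_fairmot, scene_drop)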

5  Conclusion

This study systematically addresses the challenge of multi-object tracking in complex occlusion scenarios by proposing an innovative solution based on joint face-pedestrian feature modeling. The developed cross-category dynamic binding framework and multi-modal joint matching metric system effectively overcome the performance limitations of conventional single-modality tracking methods in challenging scenarios involving occlusions and viewpoint variations. Experimental validation demonstrates that in face-pedestrian matching tasks, the proposed spatio-temporal-appearance joint constraint mechanism achieves an improvement of up to 43 percentage points in cross-category matching F1 AUC over several conventional methods. Notably, ablation studies reveal that the IoA-IoU joint metric contributes 13.7% of the F1 AUC gain, while the dynamic threshold adjustment mechanism reduces IDs by 4.3% in complex scenes. In the object tracking experiments, the dual matching strategy combined with dynamic threshold adjustment improves both IDF1 and MOTA compared to the baseline model, with IDF1 reaching 85.1825% in mixed scenarios, verifying the effectiveness of the cross-category identity correction mechanism. The computational efficiency analysis shows that Our_model (without matching) exhibits a 16.7% FPS reduction compared to FairMOT’s single-category tracking, with the cross-category dynamic binding module introducing approximately 10% additional computational overhead. Nevertheless, Our_model maintains near-real-time performance with small-resolution inputs, achieving its crucial cross-category face-pedestrian tracking capability at a controlled speed cost. These results provide practical technical support for intelligent surveillance and cross-camera tracking applications, demonstrating particular value for continuous identity verification in crowded areas. Future research will focus on optimizing the cross-category feature extraction network and exploring multi-dimensional feature fusion mechanisms to address more extreme occlusion scenarios.

Acknowledgement: Not applicable.

Funding Statement: This work was supported by the confidential research grant No. a8317.

Author Contributions: Qin Hu conceived the research, designed the study framework, collected the data, conducted the experiments, and wrote the manuscript. Hongshan Kong performed the data analysis and interpretation, as well as revised the manuscript. All authors reviewed the results and approved the final version of the manuscript.

Availability of Data and Materials: The data that support the findings of this study are available from the corresponding author upon reasonable request.

Ethics Approval: Not applicable.

Conflicts of Interest: The authors declare no conflicts of interest to report regarding the present study.

References

1. Wojke N, Bewley A, Paulus D. Simple online and realtime tracking with a deep association metric. In: 2017 IEEE International Conference on Image Processing (ICIP); 2017 Sep 17–20; Beijing, China. New York: IEEE; 2017. p. 3645–9.

2. Zhang Y, Wang C, Wang X, Zeng W, Liu W. FairMOT: on the fairness of detection and re-identification in multiple object tracking. Int J Comput Vis. 2021;129(11):3069–87. doi:10.1007/s11263-021-01513-4.

3. Hu Y, Niu A, Sun J, Zhu Y, Yan Q, Dong W, et al. Dynamic center point learning for multiple object tracking under severe occlusions. Knowl Based Syst. 2024;300(1–2):112130. doi:10.1016/j.knosys.2024.112130.

4. Hassan S, Mujtaba G, Rajput A, Fatima N. Multi-object tracking: a systematic literature review. Multimed Tools Appl. 2024;83(14):43439–92. doi:10.1007/s11042-023-17297-3.

5. Barquero G, Hupont I, Fernández Tena C. Rank-based verification for long-term face tracking in crowded scenes. IEEE Trans Biomet Behav Identity Sci. 2021;3(4):495–505. doi:10.1109/tbiom.2021.3099568.

6. Bewley A, Ge Z, Ott L, Ramos F, Upcroft B. Simple online and realtime tracking. In: 2016 IEEE International Conference on Image Processing (ICIP); 2016 Sep 25–28; Phoenix, AZ, USA. New York: IEEE; 2016. p. 3464–8.

7. Zhang Y, Sun P, Jiang Y, Yu D, Weng F, Yuan Z, et al. ByteTrack: multi-object tracking by associating every detection box. In: Computer Vision–ECCV 2022. Cham: Springer Nature Switzerland; 2022. p. 1–21. doi:10.1007/978-3-031-20047-2_1.

8. Aharon N, Orfaig R, Bobrovsky BZ. BoT-SORT: robust associations multi-pedestrian tracking. arXiv:2206.14651. 2022.

9. Peng J, Wang C, Wan F, Wu Y, Wang Y, Tai Y, et al. Chained-tracker: chaining paired attentive regression results for end-to-end joint multiple-object detection and tracking. In: Computer Vision–ECCV 2020. Cham: Springer International Publishing; 2020. p. 145–61. doi:10.1007/978-3-030-58548-8_9.

10. Voigtlaender P, Krause M, Osep A, Luiten J, Sekar BBG, Geiger A, et al. MOTS: multi-object tracking and segmentation. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2019 Jun 15–20; Long Beach, CA, USA. New York: IEEE; 2019. p. 7934–43.

11. Xu Z, Zhang W, Tan X, Yang W, Huang H, Wen S, et al. Segment as points for efficient online multi-object tracking and segmentation. In: Proceedings of the Computer Vision–ECCV 2020: 16th European Conference; 2020 Aug 23–28; Glasgow, UK. p. 264–81.

12. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. Adv Neural Inf Process Syst. 2017;30:5998–6008.

13. Sun P, Cao J, Jiang Y, Zhang R, Xie E, Yuan Z, et al. TransTrack: multiple object tracking with transformer. arXiv:2012.15460. 2020.

14. Meinhardt T, Kirillov A, Leal-Taixé L, Feichtenhofer C. TrackFormer: multi-object tracking with transformers. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2022 Jun 18–24; New Orleans, LA, USA. New York: IEEE; 2022. p. 8834–44.

15. Cao J, Pang J, Weng X, Khirodkar R, Kitani K. Observation-centric SORT: rethinking SORT for robust multi-object tracking. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2023 Jun 17–24; Vancouver, BC, Canada. New York: IEEE; 2023. p. 9686–96.

16. Shi C, Li SC, Li L. Research on face detection and tracking algorithm based on MTCNN and improved KCF. In: Proceedings of the 7th China Command and Control Conference; 2019; Beijing, China. (In Chinese).

17. Lin ZB, Huang ZQ, Yan LM. Research on face detection and tracking algorithm based on RetinaFace and KCF. Chin Electron Qual. 2021;9:59–64.

18. Qi D, Tan W, Yao Q, Liu J. YOLO5Face: why reinventing a face detector. In: Computer Vision–ECCV 2022 Workshops. Cham, Switzerland: Springer; 2023. p. 228–44.

19. Jöchl R, Uhl A. FaceSORT: a multi-face tracking method based on biometric and appearance features. arXiv:2501.11741. 2025.

20. Stewart R, Andriluka M, Ng AY. End-to-end people detection in crowded scenes. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2016 Jun 27–30; Las Vegas, NV, USA. New York: IEEE; 2016. p. 2325–33.

21. Sundararaman R, De Almeida Braga C, Marchand E, Pettre J. Tracking pedestrian heads in dense crowd. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2021 Jun 20–25; Nashville, TN, USA. p. 3865–75.

22. Sun K, Wang X, Liu S, Zhao Q, Huang G, Liu C. Toward pedestrian head tracking: a benchmark dataset and an information fusion network. arXiv:2408.05877. 2024.

23. Sharma N, Dhiman C, Indu S. Pedestrian intention prediction for autonomous vehicles: a comprehensive survey. Neurocomputing. 2022;508(5):120–52. doi:10.1016/j.neucom.2022.07.085.

24. Huang X, Cai XD, Hu YL, Cao Y, Liu YZ. Target dynamic identification method based on feature fusion. Chin Video Eng. 2020;44(6):6–10,38.

25. Jin Li Y. Research on cross-camera target recognition and tracking method based on face and pedestrian features [Ph.D. dissertation]. Beijing, China: People’s Public Security University of China; 2022.

26. Wu Y, Lim J, Yang MH. Online object tracking: a benchmark. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition; 2013 Jun 23–28; Portland, OR, USA. New York: IEEE; 2013. p. 2411–8.

27. Wu Y, Lim J, Yang MH. Object tracking benchmark. IEEE Trans Pattern Anal Mach Intell. 2015;37(9):1834–48. doi:10.1109/TPAMI.2014.2388226.

28. Kristan M, Matas J, Leonardis A, Felsberg M, Pflugfelder R, Kamarainen JK, et al. The seventh visual object tracking VOT2019 challenge results. In: 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW); 2019; Seoul, Republic of Korea. p. 2206–41.

29. Kristan M, Leonardis A, Matas J, Felsberg M, Pflugfelder R, Kämäräinen JK, et al. The eighth visual object tracking VOT2020 challenge results. In: Proceedings of the Computer Vision–ECCV 2020 Workshops; 2020 Aug 23–28; Glasgow, UK. Cham, Switzerland: Springer; 2020. p. 547–601.

30. Huang L, Zhao X, Huang K. GOT-10k: a large high-diversity benchmark for generic object tracking in the wild. IEEE Trans Pattern Anal Mach Intell. 2019;43(5):1562–77. doi:10.1109/tpami.2019.2957464.

31. Fan H, Lin L, Yang F, Chu P, Deng G, Yu S, et al. LaSOT: a high-quality benchmark for large-scale single object tracking. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2019 Jun 15–20; Long Beach, CA, USA. New York: IEEE; 2019. p. 5369–78.

32. Leal-Taixé L, Milan A, Reid I, Roth S, Schindler K. MOTChallenge 2015: towards a benchmark for multi-target tracking. arXiv:1504.01942. 2015.

33. Milan A, Leal-Taixé L, Reid I, Roth S, Schindler K. MOT16: a benchmark for multi-object tracking. arXiv:1603.00831. 2016.

34. Dendorfer P, Rezatofighi H, Milan A, Shi J, Cremers D, Reid I, et al. MOT20: a benchmark for multi object tracking in crowded scenes. arXiv:2003.09003. 2020.

35. Weber M, Xie J, Collins M, Zhu Y, Voigtlaender P, Adam H, et al. STEP: segmenting and tracking every pixel. arXiv:2102.11859. 2021.

36. Voigtlaender P, Luo L, Yuan C, Jiang Y, Leibe B. Reducing the annotation effort for video object segmentation datasets. In: 2021 IEEE Winter Conference on Applications of Computer Vision (WACV); 2021 Jan 3–8; Waikoloa, HI, USA. New York: IEEE; 2021. p. 3059–68.

37. Lin Y, Cheng S, Shen J, Pantic M. MobiFace: a novel dataset for mobile face tracking in the wild. In: 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019); 2019 May 14–18; Lille, France. New York: IEEE; 2019. p. 1–8.

38. Wong Y, Chen S, Mau S, Sanderson C, Lovell BC. Patch-based probabilistic image quality assessment for face selection and improved video-based face recognition. In: CVPR 2011 Workshops; 2011 Jun 20–25; Colorado Springs, CO, USA. New York: IEEE; 2011. p. 74–81.




Copyright © 2026 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.