Open Access

ARTICLE

Segment-Conditioned Latent-Intent Framework for Cooperative Multi-UAV Search

Gang Hou1,#, Aifeng Liu1,#, Tao Zhao1, Wenyuan Wei2, Bo Li1, Jiancheng Liu3,*, Siwen Wei4,5,*

1 Northwest Institute of Mechanical and Electrical Engineering, Xianyang, 712099, China
2 Department of Railway Transportation Operations Management, Baotou Railway Vocational & Technical College, Baotou, 014060, China
3 School of Mechanical Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China
4 Shaanxi Key Laboratory of Antenna and Control Technology, Xi’an, 710076, China
5 39th Research Institute of China Electronics Technology Group Corporation, Xi’an, 710076, China

* Corresponding Authors: Jiancheng Liu. Email: email; Siwen Wei. Email: email
# These authors contributed equally to this work

(This article belongs to the Special Issue: Cooperation and Autonomy in Multi-Agent Systems: Models, Algorithms, and Applications)

Computers, Materials & Continua 2026, 87(1), 96 https://doi.org/10.32604/cmc.2026.073202

Abstract

Cooperative multi-UAV search requires jointly optimizing wide-area coverage, rapid target discovery, and endurance under sensing and motion constraints. Resolving this coupling enables scalable coordination with high data efficiency and mission reliability. We formulate this problem as a discounted Markov decision process on an occupancy grid with a cellwise Bayesian belief update, yielding a Markov state that couples agent poses with a probabilistic target field. On this belief–MDP we introduce a segment-conditioned latent-intent framework, in which a discrete intent head selects a latent skill every K steps and an intra-segment GRU policy generates per-step control conditioned on the fixed intent; both components are trained end-to-end with proximal updates under a centralized critic. On the 50×50 grid, coverage and discovery convergence times are reduced by up to 48% and 40% relative to a flat actor-critic benchmark, and the aggregated convergence metric improves by about 12% compared with a state-of-the-art hierarchical method. Qualitative analyses further reveal stable spatial sectorization, low path overlap, and fuel-aware patrolling, indicating that segment-conditioned latent intents provide an effective and scalable mechanism for coordinated multi-UAV search.

Keywords

Multi-agent reinforcement learning; Markov decision process; multi-UAV cooperative search

1  Introduction

Cooperative search with multiple unmanned aerial vehicles (UAVs) enables rapid, wide-area situational awareness for surveillance, humanitarian response, and defense operations, where parallel sensing and timely localization of mission-relevant targets are critical [1,2]. These demands are further amplified by the proliferation of IoT devices and the emergence of next-generation (6G-ready) network architectures, in which UAV-assisted infrastructures are increasingly deployed as agile aerial nodes for real-time sensing, surveillance, and data collection in dense IoT environments [3–5]. In such settings, decision policies must simultaneously promote broad spatial coverage, high-probability target discovery, and fuel-aware persistence under partial observability and dynamic interaction among agents, which renders joint planning inherently high dimensional and nonstationary.

Conventional approaches to multi-UAV search span graph-based planning, swarm heuristics, game-theoretic coordination, and evolutionary/metaheuristic optimizers [6–8]. While these methods provide valuable baselines, they typically assume static or fully observable environments, rely on strong prior modeling or hand-crafted heuristics, and offer limited support for online adaptation, multi-agent credit assignment, and principled trade-offs between coverage, discovery, and endurance. Deep reinforcement learning has recently advanced UAV navigation by learning directly from interaction and coping with uncertainty [9–11]. In multi-agent settings, centralized training with decentralized execution (CTDE) alleviates nonstationarity [12–14], yet flat (single-level) DRL still struggles with long-horizon exploration, joint action-space blowup, and ambiguous credit assignment as team size grows [15].

Hierarchical reinforcement learning (HRL) provides temporal abstraction and subgoal structure through manager–worker decompositions and value-function factorizations [16,17]. Canonical architectures such as FeUdal Networks and the Option-Critic framework instantiate these principles via goal-conditioned managers and learnable options [18], but typically require delicate intrinsic-reward design or learned termination functions that are prone to option collapse and training instability. Recent multi-agent extensions further exploit hierarchical organization to facilitate cooperative decision-making [19,20], yet often rely on hand-crafted subtask taxonomies, predefined communication patterns, or centralized coordinators that do not transfer seamlessly to fully cooperative, partially observed settings. Surveys underscore the potential of HRL for scalable aerial coordination [21,22]; nevertheless, these designs can incur additional variance and computational overhead when applied to belief-based multi-UAV search with tightly coupled objectives. Maximum-entropy regularization has been shown to enhance exploration and coordination [23], yet systematically aligning exploratory behavior with mission-level coverage objectives and energy constraints remains a central open challenge.

This work introduces a segment-conditioned latent-intent framework for cooperative multi-UAV search (SCLI–CMUS) that unifies temporal abstraction, coordinated exploration, and endurance awareness within a single CTDE policy. The environment is modeled as a discounted Markov decision process on a discretized workspace endowed with a Bayesian cellwise update of the occupancy field. Within one end-to-end differentiable policy, a discrete intent head selects a latent skill every K steps to guide medium-horizon behavior, while an action head driven by an intra-segment GRU issues per-step yaw increments conditioned on the fixed intent and local features. Compared with FeUdal-style and Option-Critic architectures, this fixed-horizon, discrete-intent design retains sufficient temporal expressiveness while avoiding termination-related instabilities and keeping per-step computation close to that of a recurrent actor-critic with a single additional categorical head. To reconcile heterogeneous signal scales and stabilize training, we employ a three-parameter, scale-calibrated saturated reward that jointly accounts for information gain, coverage efficiency, and energy–time cost. The principal contributions of this work are summarized as follows:

1.   We propose Segment–Conditioned Latent–Intent for Cooperative Multi–UAV Search (SCLI–CMUS), a CTDE framework that couples a discrete intent selector—updated at fixed segment boundaries—with an intra-segment recurrent controller in a single end-to-end differentiable policy.

2.   We develop a three-coefficient, scale-calibrated saturated reward that jointly balances information gain, coverage efficiency, and energy–time costs.

3.   We conduct comprehensive experiments demonstrating faster learning, improved coverage/discovery convergence times, and robust qualitative behaviors across representative UAV team sizes.

The remainder of this paper is structured as follows. Section 2 reviews related work on cooperative multi-UAV search, planning-based coordination, and (hierarchical) multi-agent reinforcement learning. Section 3 presents the belief-based MDP formulation and the proposed segment-conditioned latent-intent framework, while Section 4 details the experimental setup, benchmarks, and ablation studies. Finally, Section 5 summarizes the findings and outlines directions for future research.

2  Related Work

Conventional planning approaches for UAV search largely build on graph-search and shortest-path heuristics, which perform well in static and fully observable environments [6,24]. However, such methods do not naturally accommodate multi-agent credit assignment, online replanning under sensing uncertainty, or principled division of labor among multiple vehicles. Swarm-style schemes based on artificial potential fields, flocking rules, or pheromone deposition provide lightweight, decentralized coordination with low communication burden [7,25], and learned heuristics can amortize local perception-to-action mappings [26]. These approaches, though attractive for their simplicity, are susceptible to local minima, lack mechanisms to globally optimize coupled coverage–discovery–endurance trade-offs, and often require extensive manual retuning when environment statistics change.

Game-theoretic and metaheuristic frameworks offer alternative tools for cooperative search and routing. Potential games and market-based task allocation furnish equilibrium concepts and scalable assignment rules [27,28], while differential games capture adversarial or pursuit–evasion interactions in continuous time [28]. Evolutionary and metaheuristic optimizers traverse nonconvex search spaces and can handle multiple objective criteria in path planning and routing [29,30], with recent advances in motion-encoded multi-parent crossovers improving solution diversity [8]. Nonetheless, their reliance on strong prior modeling, offline optimization, and substantial computational budgets limits their ability to adapt online in uncertain, time-critical environments.

Deep reinforcement learning (DRL) has achieved notable success in UAV navigation by learning policies directly from interaction, thereby coping with uncertainty and enabling online adaptation [9]. Single-vehicle studies have demonstrated obstacle-aware, threat-aware maneuvering and memory-augmented exploration in complex environments [10,11,31,32]. In multi-agent settings, centralized training with decentralized execution (CTDE) has become a standard paradigm: decentralized actors are trained against a centralized critic to mitigate nonstationarity and stabilize learning [12], and recent work emphasizes resilience and meta-adaptation under distribution shift [13,14]. However, flat (single-level) DRL typically struggles to align long-horizon exploration with local motion control, suffers from joint action-space blowup as team size grows, and faces ambiguous multi-agent credit assignment [15]. Maximum-entropy and entropy-regularized formulations can improve exploration and coordination [23], but designing reward structures that explicitly couple information gain, coverage efficiency, and energy–time costs remains challenging in belief-based multi-UAV search.

Hierarchical reinforcement learning (HRL) introduces temporal abstraction and subgoal structure through manager–worker decompositions and value-function factorizations [16,17]. Canonical architectures such as FeUdal Networks and the Option-Critic framework realize these principles via goal-conditioned managers and learnable options [18,33], but typically require carefully designed intrinsic rewards or learned termination functions that are prone to option collapse, premature termination, and training instability. Against this backdrop, the segment-conditioned latent-intent framework developed in this work is intended to preserve the advantages of temporal abstraction while mitigating practical difficulties associated with option termination and hand-crafted subtask hierarchies.

3  Method

Fig. 1 provides an overview of the proposed framework. The environment yields a probabilistic occupancy map $b(t)$, local observations $o_t^u$, and agent coordinates $p_t^u$. These signals are fused by an encoder (GRU) into global information features that condition two heads within a single policy: a discrete skill head $\pi_\phi$ that selects a segment intent $z_{t_k}^u$ every $K$ steps, and an action head $\pi_\theta$ that issues per-step yaw-increment commands conditioned on the fixed intent and an autoregressive hidden state. A centralized critic $V_\psi$ evaluates behaviour on the belief state and supplies advantages for step-level and segment-level updates; experience tuples are stored in a replay buffer.


Figure 1: Overview of the segment-conditioned latent-intent framework for cooperative multi-UAV search (SCLI-CMUS). Left: environment and trajectories of multiple UAVs (colours identify agents; stars indicate targets; orange dots denote end points). Middle: inputs to the encoder comprising the probability map $b(t)$, local observation $o_t^u$, and UAV coordinates $p_t^u$; the encoder (GRU) produces global information features. Right: policy heads and learning signals. The skill head $\pi_\phi$ selects a discrete intent every $K$ steps; the action head $\pi_\theta$ outputs per-step yaw increments conditioned on the intent and an autoregressive state

The main symbols used in the formulation and policy parameterization are summarized in Table 1.


3.1 MDP Formulation for Cooperative Multi-UAV Search

We model cooperative multi-UAV search as a discounted Markov decision process on a discretized workspace. Let the two-dimensional workspace be discretized as

$$\mathcal{D}=\{(x,y)\mid x=1,\dots,D_X,\ y=1,\dots,D_Y\},\tag{1}$$

where each cell $(x,y)$ either contains a target or is empty, and $|\mathcal{D}|=D_X D_Y$ denotes the total number of grid cells. Time is discrete, $t=0,1,2,\dots$. The cellwise occupancy posterior is

$$b_{x,y}(t)=P\left(\text{cell }(x,y)\text{ contains a target}\mid\mathcal{Z}_{1:t}\right),\tag{2}$$

with $\mathcal{Z}_{1:t}$ the $\sigma$-algebra generated by all measurements up to time $t$. The initial prior $b_{x,y}(0)$ is specified from domain knowledge (uniform). Denote by $b(t)=\{b_{x,y}(t)\}_{(x,y)\in\mathcal{D}}$ the full occupancy field.

We now specify the agent set, action space, kinematics, joint control, and state representation. Let $\mathcal{U}=\{1,\dots,N\}$ index the UAVs. UAV $u$ has planar position $p_t^u\in\mathbb{R}^2$ and heading $\psi_t^u\in(-\pi,\pi]$. Each UAV selects a discrete yaw-increment action

$$a_t^u\in\mathcal{A}=\{-\alpha,\,0,\,+\alpha\}\ \ \text{(degrees)},\tag{3}$$

where $\alpha=45^{\circ}$, and the vehicle pose evolves under constant-speed kinematics with sampling step $\Delta t>0$ and speed $v>0$.

Given a chosen yaw–increment action atu and constant-speed motion, the heading and planar position of UAV u evolve according to

$$\begin{cases}\psi_{t+1}^u=\psi_t^u+\dfrac{\pi}{180}\,a_t^u,\\[4pt] p_{t+1}^u=p_t^u+v\,\Delta t\begin{bmatrix}\cos\psi_{t+1}^u\\ \sin\psi_{t+1}^u\end{bmatrix},\end{cases}\tag{4}$$

where $v$ is the constant forward speed and $\Delta t$ is the sampling interval.
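As a concrete illustration of Eq. (4), the following minimal NumPy sketch advances a team of UAV poses by one step; the function name, array layout, and angle-wrapping convention are illustrative choices, not the authors' implementation.

```python
import numpy as np

def step_kinematics(p, psi, a_deg, v=1.0, dt=1.0):
    """Advance UAV poses one step under Eq. (4).

    p     : (N, 2) planar positions
    psi   : (N,)   headings in radians
    a_deg : (N,)   yaw increments in degrees, each in {-45, 0, +45}
    """
    psi_next = psi + np.deg2rad(a_deg)                    # heading update
    psi_next = np.mod(psi_next + np.pi, 2 * np.pi) - np.pi  # wrap to [-pi, pi)
    step = v * dt * np.stack([np.cos(psi_next), np.sin(psi_next)], axis=-1)
    return p + step, psi_next                             # constant-speed translation
```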

Collecting all per-agent yaw–increment actions yields the joint action vector

$$a_t=\left(a_t^1,\dots,a_t^N\right)\in\mathcal{A}^N.\tag{5}$$

The MDP state collects agent poses and the occupancy field, namely

$$S_t=\left(\{p_t^u,\psi_t^u\}_{u\in\mathcal{U}},\ b(t)\right).\tag{6}$$

Each UAV carries an omnidirectional sensing disc of radius $R_{\mathrm{sen}}>0$. A grid cell $(x,y)$ is classified as fully observed by UAV $u$ at time $t$ if all four vertices of the cell, collected in $\mathcal{D}(x,y)$, lie within the disc centered at $p_t^u$:

$$\mathrm{Cov}^u(x,y;t)=\begin{cases}1,&\max_{(x',y')\in\mathcal{D}(x,y)}\left\lVert (x',y')-p_t^u\right\rVert\le R_{\mathrm{sen}},\\ 0,&\text{otherwise}.\end{cases}\tag{7}$$

The instantaneous covered set is defined by

$$\mathcal{V}_t=\left\{(x,y)\in\mathcal{D}:\ \sum_{u=1}^{N}\mathrm{Cov}^u(x,y;t)\ge 1\right\},\tag{8}$$

where the coverage ratio at time $t$ is given by $\mathrm{CovRate}(t)=|\mathcal{V}_t|/|\mathcal{D}|$. The cumulative exploration status of a cell is recorded by

$$\mathrm{Visited}(x,y;t)=\mathbb{I}\left\{\exists\,\tau\le t:\ \sum_{u=1}^{N}\mathrm{Cov}^u(x,y;\tau)\ge 1\right\}.\tag{9}$$
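The vertex-based coverage test of Eqs. (7)–(9) can be realized, for example, as below; the sketch assumes unit grid cells whose vertices lie at integer coordinates, and all names are illustrative rather than taken from the paper.

```python
import numpy as np

def coverage_mask(p, R_sen, DX, DY):
    """Binary mask Cov(x, y; t): a cell is covered iff all four of its
    vertices lie within the sensing disc of at least one UAV (Eq. (7))."""
    covered = np.zeros((DX, DY), dtype=bool)
    for x in range(1, DX + 1):
        for y in range(1, DY + 1):
            verts = np.array([[x - 1, y - 1], [x, y - 1],
                              [x - 1, y], [x, y]], dtype=float)
            for pu in p:  # any UAV whose disc contains all four vertices
                if np.all(np.linalg.norm(verts - pu, axis=1) <= R_sen):
                    covered[x - 1, y - 1] = True
                    break
    return covered

# cumulative visitation (Eq. (9)) and coverage ratio, accumulated over time:
#   visited |= coverage_mask(p, R_sen, DX, DY);  cov_rate = visited.mean()
```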

Let $Z_t^u$ denote the measurement field collected by UAV $u$ at time $t$. For any covered cell $(x,y)$ with $\mathrm{Cov}^u(x,y;t)=1$, the binary detector on UAV $u$ obeys

$$P\left(Z_{x,y}^u(t)=1\mid T_{x,y}\right)=\begin{cases}P_D^u,&\text{if }T_{x,y}=1,\\ P_{FA}^u,&\text{if }T_{x,y}=0,\end{cases}\tag{10}$$

where $Z_{x,y}^u(t)\in\{0,1\}$ is the binary measurement at cell $(x,y)$ and time $t$, $T_{x,y}\in\{0,1\}$ denotes the hidden occupancy of cell $(x,y)$, and $P_D^u$, $P_{FA}^u$ are the detection and false-alarm probabilities of UAV $u$, respectively. The set of UAVs that cover $(x,y)$ at time $t$ is

$$\mathcal{U}_{x,y}(t)=\left\{u\in\mathcal{U}:\ \mathrm{Cov}^u(x,y;t)=1\right\}.\tag{11}$$

Under conditional independence given $T_{x,y}$, the joint likelihoods for the measurements on $(x,y)$ at time $t$ are

$$\begin{cases}\mathcal{L}_1=\prod_{u\in\mathcal{U}_{x,y}(t)}\left(P_D^u\right)^{Z_{x,y}^u(t)}\left(1-P_D^u\right)^{1-Z_{x,y}^u(t)},\\[6pt] \mathcal{L}_0=\prod_{u\in\mathcal{U}_{x,y}(t)}\left(P_{FA}^u\right)^{Z_{x,y}^u(t)}\left(1-P_{FA}^u\right)^{1-Z_{x,y}^u(t)}.\end{cases}\tag{12}$$

The cellwise Bayesian update is then

$$b_{x,y}(t+1)=\begin{cases}\dfrac{b_{x,y}(t)\,\mathcal{L}_1}{b_{x,y}(t)\,\mathcal{L}_1+\left(1-b_{x,y}(t)\right)\mathcal{L}_0},&\mathcal{U}_{x,y}(t)\neq\emptyset,\\[8pt] b_{x,y}(t),&\mathcal{U}_{x,y}(t)=\emptyset.\end{cases}\tag{13}$$
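A minimal vectorized sketch of the detection likelihoods and cellwise posterior update of Eqs. (10)–(13) is given below; array shapes and function names are assumptions for illustration, and uncovered cells keep their prior as in Eq. (13).

```python
import numpy as np

def bayes_update(b, Z, Cov, PD, PFA):
    """Cellwise posterior update of Eqs. (12)-(13).

    b       : (DX, DY)    prior occupancy probabilities b_{x,y}(t)
    Z       : (N, DX, DY) binary measurements per UAV
    Cov     : (N, DX, DY) coverage indicators Cov^u(x, y; t)
    PD, PFA : (N,)        detection / false-alarm probabilities per UAV
    """
    PD = PD[:, None, None]
    PFA = PFA[:, None, None]
    # per-UAV likelihood factors; UAVs not covering a cell contribute a factor of 1
    l1 = np.where(Cov == 1, PD ** Z * (1 - PD) ** (1 - Z), 1.0).prod(axis=0)
    l0 = np.where(Cov == 1, PFA ** Z * (1 - PFA) ** (1 - Z), 1.0).prod(axis=0)
    num = b * l1
    den = num + (1 - b) * l0
    updated = num / np.maximum(den, 1e-12)
    # leave cells with no covering UAV unchanged
    return np.where(Cov.any(axis=0), updated, b)
```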

The transition kernel induced by the deterministic kinematics and the cellwise Bayesian update is

$$P\left(S_{t+1}\mid S_t,a_t\right)=\prod_{u=1}^{N}\delta\!\left(p_{t+1}^u-p_t^u-v\,\Delta t\left[\cos\psi_{t+1}^u,\ \sin\psi_{t+1}^u\right]^{\top}\right)\prod_{(x,y)\in\mathcal{D}}\delta\!\left(b_{x,y}(t+1)-\mathrm{Bayes}\!\left(b_{x,y}(t),Z_{x,y}(t)\right)\right).\tag{14}$$

The team reward adopts a scale-calibrated saturated form. First, define the analytic normalizers

$$U_{\max}=N\pi R_{\mathrm{sen}}^2,\qquad E_{\max}=c_0 N\Delta t+c_\psi\frac{\pi}{180}N\alpha,\tag{15}$$

with fixed coefficients $c_0=1.0\times10^{-3}$ and $c_\psi=1.0\times10^{-2}$, and construct the dimensionless components

$$\hat{\mathcal{I}}_t=\frac{H(b(t))-H(b(t+1))}{U_{\max}\ln 2},\qquad \hat{\mathcal{C}}_t=\mathcal{C}_t,\qquad \hat{\mathcal{E}}_t=\frac{\mathcal{E}_t}{E_{\max}}.\tag{16}$$

The saturated team reward is then

$$R_t=\lambda_{\mathcal{I}}\tanh\!\left(\hat{\mathcal{I}}_t\right)+\lambda_{\mathcal{C}}\tanh\!\left(\hat{\mathcal{C}}_t\right)-\lambda_{\mathcal{E}}\tanh\!\left(\hat{\mathcal{E}}_t\right),\tag{17}$$

where $\lambda_{\mathcal{I}}$, $\lambda_{\mathcal{C}}$, and $\lambda_{\mathcal{E}}$ are positive weighting coefficients that control the relative importance of information gain, coverage efficiency, and energy–time cost, respectively.

The reward components are defined as follows. The entropy of the belief field is

$$H(b(t))=-\sum_{(x,y)\in\mathcal{D}}\left[b_{x,y}(t)\ln b_{x,y}(t)+\left(1-b_{x,y}(t)\right)\ln\!\left(1-b_{x,y}(t)\right)\right],\tag{18}$$

the coverage efficiency term is

$$\mathcal{C}_t=\frac{1}{|\mathcal{D}|}\sum_{(x,y)\in\mathcal{D}}b_{x,y}(t)\,\Delta V_{x,y}(t)\cdot\frac{\displaystyle\sum_{(x,y)\in\mathcal{D}}\mathbb{I}\left\{\sum_{u=1}^{N}\mathrm{Cov}^u(x,y;t)\ge 1\right\}b_{x,y}(t)}{\max\left\{1,\ \displaystyle\sum_{(x,y)\in\mathcal{D}}\sum_{u=1}^{N}\mathrm{Cov}^u(x,y;t)\,b_{x,y}(t)\right\}},\tag{19}$$

and the energy–time cost term is defined as

$$\mathcal{E}_t=c_0 N\Delta t+c_\psi\frac{\pi}{180}\sum_{u=1}^{N}\left|a_t^u\right|,\tag{20}$$

with $\Delta V_{x,y}(t)=\mathrm{Visited}(x,y;t)-\mathrm{Visited}(x,y;t-1)$ denoting the incremental visitation indicator at cell $(x,y)$, and $\mathbb{I}\{\cdot\}$ the indicator function. The coefficients $c_0>0$ and $c_\psi>0$ control the time-related and turning-related energy costs, respectively.
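For clarity, the saturated team reward of Eqs. (15)–(17), with the entropy term of Eq. (18), can be sketched as follows. The coverage and energy terms $\mathcal{C}_t$ and $\mathcal{E}_t$ are assumed to be supplied by the environment, and the default coefficient values simply mirror those quoted in this section; function and argument names are illustrative.

```python
import numpy as np

def field_entropy(b, eps=1e-12):
    """Entropy of the belief field, Eq. (18)."""
    b = np.clip(b, eps, 1 - eps)
    return -np.sum(b * np.log(b) + (1 - b) * np.log(1 - b))

def saturated_reward(b_prev, b_next, C_t, E_t, N, R_sen, alpha=45.0, dt=1.0,
                     c0=1e-3, c_psi=1e-2, lam_I=1.0, lam_C=1.0, lam_E=0.1):
    """Three-coefficient saturated team reward, Eqs. (15)-(17).
    C_t and E_t are the coverage-efficiency and energy-time terms of
    Eqs. (19)-(20), here assumed to be computed by the environment."""
    U_max = N * np.pi * R_sen ** 2
    E_max = c0 * N * dt + c_psi * np.pi / 180.0 * N * alpha
    I_hat = (field_entropy(b_prev) - field_entropy(b_next)) / (U_max * np.log(2))
    E_hat = E_t / E_max
    return lam_I * np.tanh(I_hat) + lam_C * np.tanh(C_t) - lam_E * np.tanh(E_hat)
```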

The planning objective is the discounted return

$$J=\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^t R_t\right],\qquad \gamma\in(0,1),\tag{21}$$

which fully specifies the discounted MDP on the state $S_t$ and underpins the two-timescale latent-skill policy in Section 3.2.

3.2 Segment-Conditioned Latent-Intent Policy Parameterization

On the discounted MDP of Section 3.1, each agent is equipped with a slow latent skill that governs behaviour over a fixed segment of $K$ steps, while fast per-step actions are generated conditionally on the current skill and an autoregressive hidden state.

Let segment boundaries be $t_k=kK$ with $K\in\mathbb{N}$. At each segment start $t_k$, agent $u$ samples a categorical latent skill from a skill head conditioned on a summary of global and local information,

$$z_{t_k}^u\sim\pi_\phi^u\!\left(z\mid G_{t_k}^u\right),\qquad z_t^u\equiv z_{t_k}^u,\tag{22}$$

where $z\in\{1,\dots,M\}$ denotes the discrete intent index, and the skill remains fixed within the current segment $t\in[t_k,\,t_k+K-1]$. The summary $G_t^u$ aggregates a spatial feature of the belief field with local encodings and team statistics,

$$G_t^u=\mathrm{Agg}\!\left(g_t^{\mathrm{map}},\ g_t^{\mathrm{loc},u},\ \{p_t^v\}_{v\in\mathcal{U}}\right),\tag{23}$$

where $g_t^{\mathrm{map}}=\mathrm{CNN}(b(t))$ encodes the global belief map, and $g_t^{\mathrm{loc},u}=\mathrm{enc}(o_t^u)$ encodes the local observation of UAV $u$. $\mathrm{Agg}(\cdot)$ is a permutation-invariant aggregation function over team states, and $\mathrm{Emb}(z)\in\mathbb{R}^{d_z}$ denotes a learnable embedding associated with intent $z$.

For each agent u, a hidden state evolves within the segment via a GRU driven by local features and the current skill embedding,

$$h_t^u=\mathrm{GRU}_\omega\!\left(h_{t-1}^u,\ \left[g_t^{\mathrm{loc},u}\,\Vert\,\mathrm{Emb}(z_t^u)\right]\right),\tag{24}$$

with parameters $\omega$ and concatenation operator $\Vert$. The resulting state, coupled with the belief feature, parameterizes the action head:

$$\ell_t^u=\mathrm{MLP}_\theta\!\left(\left[h_t^u\,\Vert\,g_t^{\mathrm{map}}\right]\right)\in\mathbb{R}^3,\qquad \pi_\theta^u\!\left(a_t^u=k\mid h_t^u,\mathrm{Emb}(z_t^u),g_t^{\mathrm{map}}\right)=\frac{\exp\!\left(\ell_{t,k}^u\right)}{\sum_{k'\in\{-\alpha,0,+\alpha\}}\exp\!\left(\ell_{t,k'}^u\right)},\tag{25}$$

with parameters $\theta$.

Collecting parameters $\Theta=(\phi,\theta,\omega)$, the joint policy over a segment factorizes into skill selections at $t_k$ and per-step action selections conditioned on the fixed skill and the evolving hidden state,

$$\Pi_\Theta\!\left(z_{t_k},a_{t_k:t_k+K-1}\mid S_{t_k:t_k+K-1}\right)=\prod_{u=1}^{N}\pi_\phi^u\!\left(z_{t_k}^u\mid G_{t_k}^u\right)\ \prod_{t=t_k}^{t_k+K-1}\ \prod_{u=1}^{N}\pi_\theta^u\!\left(a_t^u\mid h_t^u,\mathrm{Emb}(z_{t_k}^u),g_t^{\mathrm{map}}\right),\tag{26}$$

where $h_t^u$ evolves according to (24).
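The two-head parameterization of Eqs. (22)–(25) can be sketched in PyTorch roughly as follows; layer sizes, the number of skills, and module names are illustrative assumptions rather than the reported architecture.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class SegmentConditionedPolicy(nn.Module):
    """Minimal sketch of the two-head policy of Eqs. (22)-(25): a categorical
    skill head queried every K steps and a GRU-driven action head over the
    three yaw increments. Feature dimensions are illustrative."""

    def __init__(self, d_global, d_local, d_map, n_skills=4, d_emb=16, d_hid=64):
        super().__init__()
        self.skill_head = nn.Linear(d_global, n_skills)   # pi_phi, Eq. (22)
        self.skill_emb = nn.Embedding(n_skills, d_emb)    # Emb(z)
        self.gru = nn.GRUCell(d_local + d_emb, d_hid)     # Eq. (24)
        self.action_head = nn.Linear(d_hid + d_map, 3)    # logits for {-a, 0, +a}

    def select_skill(self, G):                            # called at segment starts t_k
        dist = Categorical(logits=self.skill_head(G))
        z = dist.sample()
        return z, dist.log_prob(z)

    def act(self, g_loc, g_map, z, h):                    # called every step, Eq. (25)
        h = self.gru(torch.cat([g_loc, self.skill_emb(z)], dim=-1), h)
        dist = Categorical(logits=self.action_head(torch.cat([h, g_map], dim=-1)))
        a = dist.sample()
        return a, dist.log_prob(a), h
```

In use, `select_skill` would be queried only at segment boundaries $t_k=kK$, with the sampled intent reused for the following $K$ calls to `act`.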

Learning objectives are matched to these timescales. The segment return starting at tk is

$$G_{t_k}^{(K)}=\sum_{\tau=0}^{K-1}\gamma^{\tau}R_{t_k+\tau},\tag{27}$$

where $\gamma\in(0,1)$ is the discount factor, and the stepwise (infinite-horizon) return is given by $G_t=\sum_{\tau=0}^{\infty}\gamma^{\tau}R_{t+\tau}$. A centralized value function on the belief state augmented by latent and hidden summaries is introduced as

$$\Xi_t=\left(S_t,\ \{z_{t_k}^u\}_{u\in\mathcal{U}},\ \{h_t^u\}_{u\in\mathcal{U}}\right),\qquad V_\psi(\Xi_t)\approx\mathbb{E}\left[G_t\mid\Xi_t\right],\tag{28}$$

which supports low-variance advantage estimates at both levels. With temporal-difference residuals and generalized advantage estimation,

$$\delta_t=R_t+\gamma V_\psi(\Xi_{t+1})-V_\psi(\Xi_t),\qquad \hat{A}_t=\sum_{\ell=0}^{\infty}(\gamma\lambda)^{\ell}\,\delta_{t+\ell},\qquad \lambda\in[0,1],\tag{29}$$

the segment-level advantage at tk is

$$\hat{A}_{t_k}^{\mathrm{skill}}=G_{t_k}^{(K)}-\bar{b}_\varphi(\Upsilon_{t_k}),\qquad \Upsilon_{t_k}=\left(S_{t_k},\ \{G_{t_k}^u\}_{u\in\mathcal{U}}\right),\tag{30}$$

where $\bar{b}_\varphi$ is a $K$-step baseline with parameters $\varphi$.

Optimization is carried out by proximal policy updates at both timescales. For the action head, define the probability ratio

$$r_t^u(\theta)=\frac{\pi_\theta^u\!\left(a_t^u\mid h_t^u,\mathrm{Emb}(z_{t_k}^u),g_t^{\mathrm{map}}\right)}{\pi_{\theta_{\mathrm{old}}}^u\!\left(a_t^u\mid h_t^u,\mathrm{Emb}(z_{t_k}^u),g_t^{\mathrm{map}}\right)},\tag{31}$$

and minimize the clipped surrogate aggregated over agents,

$$\mathcal{L}_{\text{PPO-act}}(\theta,\psi)=-\sum_{u=1}^{N}\mathbb{E}\left[\min\!\left(r_t^u\hat{A}_t,\ \mathrm{clip}\!\left(r_t^u,1\pm\epsilon\right)\hat{A}_t\right)\right]+c_v\,\mathbb{E}\left[\left(V_\psi(\Xi_t)-\hat{V}_t\right)^2\right]-c_{\mathrm{ent}}\sum_{u=1}^{N}\mathbb{E}\left[\mathcal{H}\!\left(\pi_\theta^u\right)\right],\tag{32}$$

with clip parameter $\epsilon>0$, weights $c_v,c_{\mathrm{ent}}>0$, and bootstrap target $\hat{V}_t$. For the skill head, the segment-start ratio

$$r_{t_k}^{(z),u}(\phi)=\frac{\pi_\phi^u\!\left(z_{t_k}^u\mid G_{t_k}^u\right)}{\pi_{\phi_{\mathrm{old}}}^u\!\left(z_{t_k}^u\mid G_{t_k}^u\right)}\tag{33}$$

leads to the segment–level surrogate

$$\mathcal{L}_{\text{PPO-skill}}(\phi)=-\sum_{u=1}^{N}\mathbb{E}\left[\min\!\left(r_{t_k}^{(z),u}\hat{A}_{t_k}^{\mathrm{skill}},\ \mathrm{clip}\!\left(r_{t_k}^{(z),u},1\pm\epsilon\right)\hat{A}_{t_k}^{\mathrm{skill}}\right)\right]-c_{\text{ent-z}}\sum_{u=1}^{N}\mathbb{E}\left[\mathcal{H}\!\left(\pi_\phi^u\right)\right],\tag{34}$$

with $c_{\text{ent-z}}>0$. Combining both timescales yields the overall objective

$$\min_{\theta,\phi,\psi}\ \mathcal{L}_{\mathrm{total}}=\mathcal{L}_{\text{PPO-act}}(\theta,\psi)+\mathcal{L}_{\text{PPO-skill}}(\phi).\tag{35}$$
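A compact sketch of the two-timescale objective in Eqs. (31)–(35) is shown below; tensor shapes, coefficient values, and how log-probabilities, entropies, and advantages are gathered from rollouts are illustrative assumptions.

```python
import torch

def clipped_surrogate(ratio, adv, eps=0.2):
    """Clipped PPO surrogate used at both timescales, Eqs. (32) and (34)."""
    return -torch.min(ratio * adv,
                      torch.clamp(ratio, 1 - eps, 1 + eps) * adv).mean()

def total_loss(logp_a, logp_a_old, adv_step, value, value_target,
               logp_z, logp_z_old, adv_skill, ent_a, ent_z,
               eps=0.2, c_v=0.5, c_ent=0.01, c_ent_z=0.01):
    """Overall objective of Eq. (35): step-level actor/critic terms plus the
    segment-level skill surrogate. Coefficient values are illustrative."""
    r_a = torch.exp(logp_a - logp_a_old)          # Eq. (31)
    r_z = torch.exp(logp_z - logp_z_old)          # Eq. (33)
    loss_act = (clipped_surrogate(r_a, adv_step, eps)
                + c_v * (value - value_target).pow(2).mean()
                - c_ent * ent_a.mean())           # Eq. (32)
    loss_skill = (clipped_surrogate(r_z, adv_skill, eps)
                  - c_ent_z * ent_z.mean())       # Eq. (34)
    return loss_act + loss_skill
```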

4  Experiments

4.1 Parameter Setting

All experiments strictly follow the discounted MDP and sensing specification described above, and adopt the conventional benchmark setting used in prior cooperative search studies [28,34], with stationary targets and ideal, latency-free inter-UAV communication. The basic setup consists of a 50×50 grid with three UAVs and ten stationary targets. The sensor footprint is set to $R_{\mathrm{sen}}=0.8$ (in grid-cell units), which determines the per-step observable area and thereby the rate of uncertainty reduction. Optimization proceeds with PPO using discount $\gamma=0.99$, GAE parameter $\lambda=0.95$, and clipping coefficient $\epsilon=0.20$; actor and critic learning rates are both $3\times10^{-4}$. Reward shaping follows the three-parameter saturated design in Eq. (17), with $(\lambda_{\mathcal{I}},\lambda_{\mathcal{C}},\lambda_{\mathcal{E}})=(1.0,1.0,0.1)$ to balance information gain and coverage against energy–time penalization on a common scale.
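For reference, the nominal configuration described above can be collected in a small dictionary; the keys are illustrative and do not correspond to the authors' code.

```python
# Nominal experimental configuration from Section 4.1 (values as stated there).
CONFIG = {
    "grid": (50, 50), "n_uavs": 3, "n_targets": 10,
    "R_sen": 0.8,                        # sensing radius (grid-cell units)
    "gamma": 0.99, "gae_lambda": 0.95, "clip_eps": 0.20,
    "lr_actor": 3e-4, "lr_critic": 3e-4,
    "reward_weights": {"info": 1.0, "coverage": 1.0, "energy": 0.1},
}
```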

All baseline networks are configured with comparable capacity and are trained using similar batch sizes. For the PPO-based SCLI–CMUS, we use on-policy trajectory batches without long-term replay, whereas the MADDPG-based methods rely on a replay buffer of fixed capacity with minibatch sampling.

4.2 Performance Benchmark

1.   DQN method [35]: Each UAV runs an independent Deep Q-Network on its local observation to estimate action values for the discrete yaw increments. The absence of explicit coordination limits information sharing and typically degrades scalability in larger or cluttered maps.

2.   ACO method [36]: Ant Colony Optimization governs motion through a pheromone field over the grid, where each UAV behaves as an “ant” that deposits and follows trails. The induced three-dimensional pheromone tensor encodes heading preferences per cell and per agent.

3.   MADDPG [28]: A canonical CTDE actor–critic baseline on the same Markov decision formulation. It employs decentralized actors with a centralized critic and experience replay. Using identical observation and reward interfaces enables a direct assessment of the gains attributable to hierarchical intent mechanisms and difference-reward shaping.

4.   Maximum-Entropy RL (ME-RL) [23]: An entropy-regularized extension of MADDPG that incorporates spatial entropy and fuzzy logic to encourage exploration and coordination under communication and energy considerations.

5.   DTH–MADDPG [34]: A hierarchical reinforcement-learning framework with a slow strategic controller and a set of fast decentralized executors. The strategic layer updates intermittently to assign high-level intents (region/waypoint directives) to the team, while the executor layer implements per-UAV control via MADDPG under CTDE with replay.

4.3 Evaluation Metrics

We report task performance through spatial coverage and target discovery, and we quantify search efficiency via convergence times to fixed performance levels. Let $\mathcal{D}$ denote the grid, $\mathcal{V}_t\subseteq\mathcal{D}$ the set of cells visited at least once by time $t$, $N$ the total number of targets, and $N_{\mathrm{det}}(t)$ the number detected by time $t$. Define the instantaneous fractions

$$\kappa(t)=\frac{|\mathcal{V}_t|}{|\mathcal{D}|},\qquad \delta(t)=\frac{N_{\mathrm{det}}(t)}{N}.\tag{36}$$

To capture the speed at which operational effectiveness is achieved, introduce coverage and discovery convergence times as first hitting times of prescribed thresholds $\rho_{\mathrm{cov}},\rho_{\mathrm{det}}\in(0,1]$:

$$\tau_{\mathrm{cov}}=\inf\left\{t\in\mathbb{N}_0:\ \kappa(t)\ge\rho_{\mathrm{cov}}\right\},\qquad \tau_{\mathrm{det}}=\inf\left\{t\in\mathbb{N}_0:\ \delta(t)\ge\rho_{\mathrm{det}}\right\}.\tag{37}$$

In all experiments we set $\rho_{\mathrm{cov}}=\rho_{\mathrm{det}}=0.85$. Smaller values of $\tau_{\mathrm{cov}}$ and $\tau_{\mathrm{det}}$ indicate faster attainment of wide-area exploration and target acquisition, respectively, and correlate with reduced flight time and energy expenditure.
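The convergence-time metrics of Eqs. (36)–(37) reduce to a first-hitting-time computation over logged per-step fractions; a minimal sketch (names illustrative) follows.

```python
import numpy as np

def first_hitting_time(series, threshold):
    """tau = inf{t : series(t) >= threshold}, Eq. (37); np.inf if never reached."""
    idx = np.nonzero(np.asarray(series) >= threshold)[0]
    return int(idx[0]) if idx.size else np.inf

# With kappa[t] = |V_t| / |D| and delta[t] = N_det(t) / N logged per step (Eq. (36)):
#   tau_cov = first_hitting_time(kappa, 0.85)
#   tau_det = first_hitting_time(delta, 0.85)
```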

4.4 Performance Evaluation

This section quantifies learning efficiency and asymptotic performance of the proposed method relative to a strong baseline. Fig. 2 reports episode–wise learning curves, where the horizontal axis denotes the training episode index and the vertical axis denotes the total episode reward computed under the reward design in Section 3.1. Curves correspond to the mean over repeated runs, and shaded bands depict variability across runs.


Figure 2: Episode reward vs. training episode for MADDPG (blue) and SCLI–CMUS (red). The horizontal axis denotes episode index; the vertical axis denotes total reward per episode; shaded regions indicate variability across runs. (a) N=3 UAVs, (b) N=5 UAVs, (c) N=7 UAVs

A consistent pattern emerges across all subplots. The proposed SCLI–CMUS (red) rises sharply at early episodes and reaches a high plateau with markedly reduced dispersion, whereas MADDPG (blue) exhibits slower ascent, a lower steady level, and wider fluctuations. This behaviour is most pronounced in Fig. 2a with three agents, where SCLI–CMUS achieves a visibly higher steady reward and converges in substantially fewer episodes. The gap persists in Fig. 2b, indicating that the advantage is robust when scaling to five agents. The reduced variance of SCLI–CMUS is consistent with the segment–conditioned intent mechanism and the saturated, scale–balanced reward, which together suppress redundant exploration and stabilize gradient updates.

The scaling trend with agent count is also informative. Moving from three to seven agents, both methods display a gradual reduction in asymptotic reward, which is consistent with fixed–horizon evaluation: faster attainment of high coverage leaves a longer terminal phase dominated by energy–time penalization. Despite this shift in absolute level, SCLI–CMUS maintains a persistent margin and tighter confidence bands in Fig. 2c, indicating improved coordination under higher platform density.

In Table 2 (Search Area = 50 × 50), the proposed SCLI–CMUS achieves the best coverage and discovery convergence across all team sizes. For $N=3$, SCLI–CMUS reduces $\tau_{\mathrm{cov}}$ to 1102 and $\tau_{\mathrm{det}}$ to 1232, yielding improvements of approximately 36% and 29% relative to MADDPG (1718/1723) and 32% and 23% relative to ME–RL (1611/1598). Against the strongest hierarchical baseline (DTH–MADDPG), SCLI–CMUS still provides 21% faster coverage and 9% faster discovery (1397/1348 vs. 1102/1232). As the team scales, the margins persist. For $N=5$, $\tau_{\mathrm{cov}}$ and $\tau_{\mathrm{det}}$ fall to 860/1137, improving over MADDPG by 43%/20% and over DTH–MADDPG by 13%/10%. At $N=7$, SCLI–CMUS attains 710/997, exceeding MADDPG by 48%/40% and DTH–MADDPG by 11%/6%. The aggregate “Total” column confirms the trend: 6038 for SCLI–CMUS vs. 6853 for DTH–MADDPG (12% gain) and 9400 for MADDPG (36% gain). These gains are attributable to segment-conditioned intent selection and scale-calibrated reward saturation, which jointly suppress redundant footprint overlap, prioritize high-entropy regions, and stabilize critic estimates under CTDE.


The scaling behavior with $N$ is also consistent and informative. All methods exhibit decreasing $\tau_{\mathrm{cov}}$ and $\tau_{\mathrm{det}}$ as the team grows, reflecting the intrinsic parallelism of multi-UAV coverage. However, SCLI–CMUS shows the steepest decline, indicating that additional agents are efficiently utilized rather than inducing interference. In particular, the improvement from $N=3$ to $N=7$ is 36% for coverage (1102→710) and 19% for discovery (1232→997), whereas MADDPG improves by 21% and 3% over the same range. The hierarchical baseline DTH–MADDPG narrows the gap relative to flat actor–critic learners, yet it remains consistently behind SCLI–CMUS, suggesting that segment-consistent skill conditioning and the three-parameter saturated reward yield more effective division of labor and faster attainment of operational performance.

To further probe this behaviour, we extended the comparison between SCLI–CMUS and the strongest hierarchical baseline (DTH–MADDPG) from the 50×50 workspace to larger search areas of 60×60 and 70×70. The 50×50 case already shows that DTH–MADDPG is the closest competitor in terms of coverage and discovery convergence, so these larger maps provide a more stringent test of scalability. As the search area grows to 60×60, the advantage of SCLI–CMUS becomes most pronounced: the total convergence-score gap between the two methods increases to 6797 vs. 7698, i.e., an absolute difference of 901, larger than the corresponding gap on the 50×50 grid (6038 vs. 6853). This indicates that on moderately larger workspaces, segment-conditioned intents and belief-based reward shaping yield more efficient spatial partitioning and reduce redundant coverage more effectively than the dual-timescale controller in DTH–MADDPG.

4.5 Sensitivity Analysis

We next examine the sensitivity of SCLI–CMUS to the three reward weights $\lambda_{\mathcal{I}}$, $\lambda_{\mathcal{C}}$, and $\lambda_{\mathcal{E}}$ in (17). Since information gain is the primary driver of target discovery in belief-based search, we fix $\lambda_{\mathcal{I}}=1$ throughout and vary $\lambda_{\mathcal{C}}$ and $\lambda_{\mathcal{E}}$ around the nominal setting $(\lambda_{\mathcal{C}},\lambda_{\mathcal{E}})=(1.0,0.1)$ used in Section 4.1.

Representative results for $N=5$ are summarized in Table 3. The coverage weight $\lambda_{\mathcal{C}}$ controls how strongly the policy prioritizes expanding the visited set: a low value ($\lambda_{\mathcal{C}}=0.5$) yields markedly larger $\tau_{\mathrm{cov}}$ and $\tau_{\mathrm{det}}$ (up to 1150/1450), whereas a high value ($\lambda_{\mathcal{C}}=1.5$) achieves the fastest coverage (851 steps at $\lambda_{\mathcal{E}}=0.05$) but consistently slower target discovery ($\tau_{\mathrm{det}}\approx$ 1230–1280). The energy–time coefficient $\lambda_{\mathcal{E}}$ regulates motion aggressiveness: with $\lambda_{\mathcal{C}}=1.0$, the setting $(\lambda_{\mathcal{C}},\lambda_{\mathcal{E}})=(1.0,0.10)$ attains the best overall trade-off, with $\tau_{\mathrm{cov}}=860$ and the globally minimal $\tau_{\mathrm{det}}=1137$, while both smaller and larger $\lambda_{\mathcal{E}}$ slightly degrade either coverage or discovery.


4.6 Case Study

This case study provides a qualitative examination of cooperative behaviour under the proposed policy in a 50×50 workspace with three UAVs and ten fixed targets. Fig. 3 depicts colour–coded trajectories at three representative time stamps. At t=300 (Fig. 3a), the team has already established a clear spatial allocation: trajectories exhibit strong inter-agent separation and limited crossovers. Large uncovered areas are partitioned implicitly, and each UAV conducts frontier-seeking sweeps within its assigned sector. The resulting footprints cover disjoint corridors with small overlap, which accelerates global coverage while preventing early concentration around the same cells.

images

Figure 3: Trajectories of three UAVs in a 50×50 workspace with ten fixed targets at representative time stamps. Dashed paths are colour coded by agent (red, blue, green); black stars mark target locations. Panels: (a) t=300, early exploration with clear sector separation; (b) t=1000, intensified sampling around informative regions with limited boundary crossings; (c) t=1500, steady patrolling within sectors with low redundant coverage

At t=1000 (Fig. 3b), the belief map has concentrated around multiple target locations, and paths become denser in those neighbourhoods. Agents maintain sector integrity while adapting their local loops to repeatedly interrogate high-probability cells. Boundary incursions are rare and occur only where adjacent sectors meet, indicating stable intent selection and limited handover cost. The joint pattern reflects a balanced exploration–exploitation regime: residual unexplored pockets are swept, and detected vicinities receive increased sampling frequency.

At t=1500 (Fig. 3c), the team enters a persistent monitoring phase. Each UAV continues to patrol its sector with short, recurrent loops centred on previously informative regions. The path overlap remains low and the blank areas show no redundant revisits, which is consistent with the energy–time penalization in the reward and the segment-consistent action generation.

5  Conclusion

This work has presented a segment-conditioned latent-intent framework for cooperative multi-UAV search that formulates the problem as a discounted MDP on an occupancy grid with a cellwise Bayesian belief update. Decision making is parameterized by a single end-to-end policy that combines a discrete intent head, updated every $K$ steps, with an intra-segment GRU action head trained under a centralized critic, together with a three-coefficient, scale-calibrated saturated reward balancing information gain, coverage efficiency, and energy–time cost. Across grids of size 50×50, 60×60, and 70×70, the proposed method consistently outperforms strong flat and hierarchical reinforcement-learning baselines: on the 50×50 workspace, coverage and discovery convergence times are reduced by up to 48% and 40% relative to a flat actor–critic method, and the aggregated convergence metric improves by about 12% compared with a state-of-the-art hierarchical baseline, with the largest total improvement observed on the 60×60 grid. Future work will extend the framework to adaptive intent durations and heterogeneous platforms, incorporate bandwidth-limited communication and collision-avoidance constraints, model moving targets and three-dimensional kinematics, and pursue field deployment with sim-to-real transfer and formal performance guarantees.

Acknowledgement: None.

Funding Statement: The authors received no specific funding for this study.

Author Contributions: Gang Hou, Aifeng Liu, Tao Zhao: Investigation, Data Curation, Writing—Original Draft. Siwen Wei, Wenyuan Wei, Bo Li: Review and Editing, Visualization. Jiancheng Liu, Siwen Wei: Writing—Review and Editing, Supervision. All authors reviewed the results and approved the final version of the manuscript.

Availability of Data and Materials: Not applicable.

Ethics Approval: Not applicable.

Conflicts of Interest: The authors declare no conflicts of interest to report regarding the present study.

References

1. Yanmaz E. Joint or decoupled optimization: multi-UAV path planning for search and rescue. Ad Hoc Netw. 2023;138(3):103018. doi:10.1016/j.adhoc.2022.103018. [Google Scholar] [CrossRef]

2. Liu X, Su Y, Wu Y, Guo Y. Multi-conflict-based optimal algorithm for multi-UAV cooperative path planning. Drones. 2023;7(3):217. doi:10.3390/drones7030217. [Google Scholar] [CrossRef]

3. Chen Z, Zhang T, Hong T. IoT-enhanced multi-base station networks for real-time UAV surveillance and tracking. Drones. 2025;9(8):558. doi:10.3390/drones9080558. [Google Scholar] [CrossRef]

4. Andreou A, Mavromoustakis CX, Markakis E, Bourdena A, Mastorakis G. UAV-assisted IoT network framework with hybrid deep reinforcement and federated learning. Sci Rep. 2025;15(1):37107. doi:10.1038/s41598-025-21014-5. [Google Scholar] [PubMed] [CrossRef]

5. Nguyen DC, Ding M, Pathirana PN, Seneviratne A, Li J, Niyato D, et al. 6G Internet of Things: a comprehensive survey. IEEE Internet Things J. 2021;9(1):359–83. doi:10.1109/jiot.2021.3103320. [Google Scholar] [CrossRef]

6. Qu L, Fan J. Unmanned combat aerial vehicle path planning in complex environment using multi-strategy sparrow search algorithm with double-layer coding. J King Saud Univ—Comput Inf Sci. 2024;36(10):102255. doi:10.1016/j.jksuci.2024.102255. [Google Scholar] [CrossRef]

7. Elmokadem T, Savkin AV. Computationally-efficient distributed algorithms of navigation of teams of autonomous UAVs for 3D coverage and flocking. Drones. 2021;5(4):124. doi:10.3390/drones5040124. [Google Scholar] [CrossRef]

8. Alanezi MA, Bouchekara HR, Apalara TAA, Shahriar MS, Sha’aban YA, Javaid MS, et al. Dynamic target search using multi-UAVs based on motion-encoded genetic algorithm with multiple parents. IEEE Access. 2022;10:77922–39. doi:10.1109/access.2022.3190395. [Google Scholar] [CrossRef]

9. Wang F, Zhu XP, Zhou Z, Tang Y. Deep-reinforcement-learning-based UAV autonomous navigation and collision avoidance in unknown environments. Chin J Aeronaut. 2024;37(3):237–57. doi:10.1016/j.cja.2023.09.033. [Google Scholar] [CrossRef]

10. Chen J, Liu L, Zhang Y, Chen W, Zhang L, Lin Z. Research on UAV path planning based on particle swarm optimization and soft actor-critic. In: Proceedings of the 2024 China Automation Congress (CAC); 2024 Nov 1–3; Qingdao, China. p. 6166–71. [Google Scholar]

11. Xue D, Lin Y, Wei S, Zhang Z, Qi W, Liu J, et al. Leveraging hierarchical temporal importance sampling and adaptive noise modulation to enhance resilience in multi-agent task execution systems. Neurocomputing. 2025;637(1–2):130134. doi:10.1016/j.neucom.2025.130134. [Google Scholar] [CrossRef]

12. Lee J, Friderikos V. Interference-aware path planning optimization for multiple UAVs in beyond 5G networks. J Commun Netw. 2022;24(2):125–38. doi:10.23919/jcn.2022.000006. [Google Scholar] [CrossRef]

13. Wang K, Gou Y, Xue D, Liu J, Qi W, Hou G, et al. Resilience augmentation in unmanned weapon systems via multi-layer attention graph convolutional neural networks. Comput Mater Contin. 2024;80(2):2941–62. doi:10.32604/cmc.2024.052893. [Google Scholar] [CrossRef]

14. Wang K, Xue D, Gou Y, Qi W, Li B, Liu J, et al. Meta-path-guided causal inference for hierarchical feature alignment and policy optimization in enhancing resilience of UWSoS. J Supercomput. 2025;81(2):358. doi:10.1007/s11227-024-06848-6. [Google Scholar] [CrossRef]

15. Wang N, Li Z, Liang X, Li Y, Zhao F. Cooperative target search of UAV swarm with communication distance constraint. Math Probl Eng. 2021;2021(1):3794329. doi:10.1155/2021/3794329. [Google Scholar] [CrossRef]

16. Gou Y, Wei S, Xu K, Liu J, Li K, Li B, et al. Hierarchical reinforcement learning with kill chain-informed multi-objective optimization to enhance resilience in autonomous unmanned swarm. Neural Netw. 2025;195(2):108255. doi:10.1016/j.neunet.2025.108255. [Google Scholar] [PubMed] [CrossRef]

17. Huh D, Mohapatra P. Multi-agent reinforcement learning: a comprehensive survey. arXiv:2312.10256. 2023. [Google Scholar]

18. Vezhnevets AS, Osindero S, Schaul T, Heess N, Jaderberg M, Silver D, et al. Feudal networks for hierarchical reinforcement learning. In: Proceedings of the 34th International Conference on Machine Learning; 2017 Aug 6–11; Sydney, NSW, Australia, pp. 3540–9. [Google Scholar]

19. Ahilan S, Dayan P. Feudal multi-agent hierarchies for cooperative reinforcement learning. arXiv:1901.08492. 2019. [Google Scholar]

20. Tang H, Hao J, Lv T, Chen Y, Zhang Z, Jia H, et al. Hierarchical deep multiagent reinforcement learning with temporal abstraction. arXiv:1809.09332. 2018. [Google Scholar]

21. Lyu M, Zhao Y, Huang C, Huang H. Unmanned aerial vehicles for search and rescue: a survey. Remote Sens. 2023;15(13):3266. doi:10.3390/rs15133266. [Google Scholar] [CrossRef]

22. Rahman M, Sarkar NI, Lutui R. A survey on multi-UAV path planning: classification, algorithms, open research problems, and future directions. Drones. 2025;9(4):263. doi:10.3390/drones9040263. [Google Scholar] [CrossRef]

23. Zhao L, Gao Z, Hawbani A, Zhao W, Mao C, Lin N. Fuzzy-MADDPG based multi-UAV cooperative search in network-limited environments. In: Proceedings of the 2024 International Conference on Information and Communication Technologies for Disaster Management (ICT-DM); 2024 Nov 19–21; Setif, Algeria, p. 1–7. [Google Scholar]

24. Kelner JM, Burzynski W, Stecz W. Modeling UAV swarm flight trajectories using rapidly-exploring random tree algorithm. J King Saud Univ—Comput Inf Sci. 2024;36(1):101909. doi:10.1016/j.jksuci.2023.101909. [Google Scholar] [CrossRef]

25. Zhang X, Ali M. A bean optimization-based cooperation method for target searching by swarm UAVs in unknown environments. IEEE Access. 2020;8:43850–62. doi:10.1109/access.2020.2977499. [Google Scholar] [CrossRef]

26. Chaves AN, Cugnasca PS, Jose J. Adaptive search control applied to search and rescue operations using unmanned aerial vehicles (UAVs). IEEE Latin Am Trans. 2014;12(7):1278–83. doi:10.1109/tla.2014.6948863. [Google Scholar] [CrossRef]

27. Qamar RA, Sarfraz M, Rahman A, Ghauri SA. Multi-criterion multi-UAV task allocation under dynamic conditions. J King Saud Univ—Comput Inf Sci. 2023;35(9):101734. doi:10.1016/j.jksuci.2023.101734. [Google Scholar] [CrossRef]

28. Hou Y, Zhao J, Zhang R, Cheng X, Yang L. UAV swarm cooperative target search: a multi-agent reinforcement learning approach. IEEE Trans Intell Veh. 2023;9(1):568–78. doi:10.1109/tiv.2023.3316196. [Google Scholar] [CrossRef]

29. Hou K, Yang Y, Yang X, Lai J. Distributed cooperative search algorithm with task assignment and receding horizon predictive control for multiple unmanned aerial vehicles. IEEE Access. 2021;9:6122–36. doi:10.1109/access.2020.3048974. [Google Scholar] [CrossRef]

30. Phung MD, Ha QP. Safety-enhanced UAV path planning with spherical vector-based particle swarm optimization. Appl Soft Comput. 2021;107(2):107376. doi:10.1016/j.asoc.2021.107376. [Google Scholar] [CrossRef]

31. Tang J, Liang Y, Li K. Dynamic scene path planning of uavs based on deep reinforcement learning. Drones. 2024;8(2):60. doi:10.3390/drones8020060. [Google Scholar] [CrossRef]

32. Sabzekar S, Samadzad M, Mehditabrizi A, Tak AN. A deep reinforcement learning approach for UAV path planning incorporating vehicle dynamics with acceleration control. Unmanned Syst. 2024;12(03):477–98. doi:10.1142/s2301385024420044. [Google Scholar] [CrossRef]

33. Bacon PL, Harb J, Precup D. The option-critic architecture. In: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence; 2017 Feb 4–9; San Francisco, CA, USA. p. 1726–34. [Google Scholar]

34. Liu J, Wei S, Li B, Wang T, Qi W, Han X, et al. Dual-timescale hierarchical MADDPG for Multi-UAV cooperative search. J King Saud Univ Comput Inf Sci. 2025;37(6):1–17. doi:10.1007/s44443-025-00156-6. [Google Scholar] [CrossRef]

35. Harikumar K, Senthilnath J, Sundaram S. Multi-UAV oxyrrhis marina-inspired search and dynamic formation control for forest firefighting. IEEE Trans Autom Sci Eng. 2018;16(2):863–73. doi:10.1109/tase.2018.2867614. [Google Scholar] [CrossRef]

36. Perez-Carabaza S, Besada-Portas E, Lopez-Orozco JA, de la Cruz JM. Ant colony optimization for multi-UAV minimum time search in uncertain domains. Appl Soft Comput. 2018;62(4):789–806. doi:10.1016/j.asoc.2017.09.009. [Google Scholar] [CrossRef]




Copyright © 2026 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.