Open Access
ARTICLE
Motion In-Betweening via Frequency-Domain Diffusion Model
1 Department of Computer Science and Engineering, Northwest Normal University, Lanzhou, 730070, China
2 Department of Computer and Information Science, Southwest University, Chongqing, 400715, China
* Corresponding Author: Ying Qi. Email:
Computers, Materials & Continua 2026, 86(1), 1-22. https://doi.org/10.32604/cmc.2025.068247
Received 23 May 2025; Accepted 17 July 2025; Issue published 10 November 2025
Abstract
Human motion modeling is a core technology in computer animation, game development, and human-computer interaction. In particular, generating natural and coherent in-between motion using only the initial and terminal frames remains a fundamental yet unresolved challenge. Existing methods typically rely on dense keyframe inputs or complex prior structures, making it difficult to balance motion quality and plausibility under conditions such as sparse constraints, long-term dependencies, and diverse motion styles. To address this, we propose a motion generation framework based on a frequency-domain diffusion model, which aims to better model complex motion distributions and enhance generation stability under sparse conditions. Our method maps motion sequences to the frequency domain via the Discrete Cosine Transform (DCT), enabling more effective modeling of low-frequency motion structures while suppressing high-frequency noise. A denoising network based on self-attention is introduced to capture long-range temporal dependencies and improve global structural awareness. Additionally, a multi-objective loss function is employed to jointly optimize motion smoothness, pose diversity, and anatomical consistency, enhancing the realism and physical plausibility of the generated sequences. Comparative experiments on the Human3.6M and LaFAN1 datasets demonstrate that our method outperforms state-of-the-art approaches across multiple performance metrics, showing stronger capabilities in generating intermediate motion frames. This research offers a new perspective and methodology for human motion generation and holds promise for applications in character animation, game development, and virtual interaction.
With the rapid development of virtual reality, computer animation, and the digital game industry, the importance of human motion modeling has become increasingly prominent [1,2]. Accurately and efficiently generating realistic and natural human motion sequences has emerged as a key technology for enhancing the immersive experience of digital content.
Human motion modeling can be categorized into several directions based on different research objectives, including motion synthesis, motion prediction, and motion interpolation. Among these, motion interpolation constitutes a fundamental yet challenging task [3]: it aims to generate continuous and natural in-between motion sequences given only sparse conditions, such as the initial and terminal frames. This task is widely employed in practical applications like character animation and motion completion. For instance, in animation production, animators typically design only the key poses [4], with the remaining frames generated via interpolation algorithms and subsequent manual refinement to complete the desired motion transitions. In game development, interpolation techniques are often used to connect predefined motion clips, enabling diverse character behaviors and seamless transitions, thereby enhancing overall immersion and expressiveness.
To address the aforementioned challenges, researchers have proposed various interpolation strategies, from rule-driven methods to deep generative models, yet the quality bottleneck in sparse keyframe scenarios remains unresolved. Early methods like linear interpolation [5,6] and trajectory planning are simple and low-cost, but their underlying linear assumptions prevent them from capturing the non-linear coordination between joints, resulting in transitional motions that lack dynamic variation and expressiveness. With the advancement of deep learning, data-driven generative models have become the mainstream approach. Although Generative Adversarial Networks (GANs) [7,8] can produce sharper details, their training process is unstable and prone to mode collapse, and they lack fine-grained control over the overall motion structure. Variational Autoencoders (VAEs) [9,10] aim to improve in-betweening by modeling the latent distribution of motion, but their generated results often lack sufficient dynamic variation and rhythmic hierarchy, leading to indistinct dynamic features. Transformer-based temporal models [11,12] leverage global attention to capture long-range dependencies and excel at full-sequence prediction. However, when anchored only by start and end frames, the attention mechanism lacks reliable constraints, often resulting in style convergence, significant local deformations, or discontinuities in motion. In summary, these methods struggle to simultaneously guarantee naturalness, stability, and structural coherence in interpolation tasks characterized by sparse keyframes, long sequences, and high diversity demands.
Synthesizing existing research reveals that although various generative models have made progress in specific areas, they commonly face a bottleneck when handling sparse keyframe interpolation tasks. The fundamental reason lies in the coexistence of information sparsity inherent to the task and the high complexity of motion data itself. The task typically relies on only sparse keyframe inputs, requiring the generation of complete, natural motion sequences from extremely limited information. Human motion data, often represented as the 3D coordinates of multiple joints per frame, is characterized by high dimensionality, complex multi-joint coordination, and intricate temporal dependencies. This requires a generative model not only to reconstruct spatial structures but also to capture dynamic changes across frames. For longer sequences or more complex actions, problems such as noise propagation and motion discontinuities become more pronounced, placing even greater demands on the model’s capabilities. Therefore, the core challenge of motion interpolation lies in simultaneously achieving naturalness, coherence, and structural plausibility under these sparse conditions.
To better address the challenges of human motion interpolation under sparse conditions, this paper constructs a suitable framework by combining frequency-domain modeling with a diffusion-based generative mechanism. As an emerging generative paradigm, diffusion models learn the data distribution through a progressive denoising process, offering inherent training stability and high-quality generation capabilities [13,14], which are well-suited for interpolation tasks that demand strong structural coherence. Our approach, grounded in diffusion modeling, progressively recovers an intermediate motion sequence from random noise while adhering to keyframe constraints, demonstrating the ability to model high-dimensional and complex motion distributions. Given that human motion data is predominantly composed of low-frequency components, with high-frequency parts often corresponding to local perturbations and noise, we introduce the Discrete Cosine Transform to map motion sequences into the frequency domain. This approach not only compresses redundant dimensions but also reinforces the modeling of global motion trends, thereby enhancing the coherence and stability of the generated results. Furthermore, a denoising network built with a self-attention mechanism is employed to improve the model’s perception of sequential structure and dynamics. Concurrently, a multi-objective loss function is designed to optimize for smoothness, diversity, and structural consistency. The entire generation process, constrained by keyframes, progressively samples from noise to create a complete sequence, enabling the generation of natural, stable, and structurally plausible human motion even under sparse conditions, as visually demonstrated in Fig. 1.

Figure 1: Demonstration of diverse motion in-betweening. Given a single pair of start and end frames (left and right), our model synthesizes multiple, physically plausible and distinct motion sequences (center)
The main contributions of this paper are as follows:
• We propose a human motion interpolation framework based on a frequency-domain diffusion model that can generate natural, coherent, and physically plausible intermediate motion sequences given only the start and end frames.
• We integrate the Discrete Cosine Transform for frequency-domain modeling with a self-attention-based denoising network, enhancing the model's ability to capture long-term dependencies and improving the stability of generated sequences.
• We design a multi-objective loss function that jointly optimizes the generation process for motion smoothness, pose diversity, and anatomical consistency.
Experiments on the Human3.6M and LaFAN1 datasets demonstrate that our proposed method outperforms current state-of-the-art approaches on several key metrics, exhibiting excellent generation quality and structural control. This work provides a new modeling paradigm for motion generation under sparse conditions and is broadly applicable to practical scenarios such as virtual human animation, game character synthesis, and human-computer interaction systems.
2.1 Models for Motion Generation
In recent years, data-driven deep generative models have made significant strides in the field of human motion synthesis. Different model paradigms have demonstrated distinct characteristics and limitations in generation tasks.
GANs [15] are renowned for producing sharp, detail-rich samples through a zero-sum game between a generator and a discriminator [7,8]. Their main advantage lies in avoiding the blurring effect caused by traditional loss functions like L2 loss. However, the training process of GANs is notoriously challenging, often facing issues such as gradient vanishing and mode collapse, which leads to insufficient diversity in the generated results [16]. Furthermore, their one-shot generation nature makes it difficult to directly capture and maintain complex long-term temporal dependencies.
VAEs [17] learn a regularized latent space through an encoder-decoder architecture, from which new motions can be generated by sampling [18,19]. The training of VAEs is relatively stable, and they can learn smooth latent representations. However, their core Evidence Lower Bound (ELBO) loss, in its effort to ensure a regularized latent distribution, often comes at the cost of reconstruction fidelity, leading to overly-smoothed outputs that lack high-frequency details and dynamic expressiveness [10].
Flow-Based Models [20] utilize a series of invertible transformations to precisely fit the true data distribution, thereby enabling the computation of exact likelihoods. While this gives them a theoretical advantage in probability density estimation, their architectural design (e.g., the invertibility constraint) is typically more complex than that of GANs and VAEs, and they face significant computational overhead when dealing with high-dimensional data.
Transformer-Based Models [21,22] have come to dominate sequence modeling tasks thanks to their powerful self-attention mechanism. They can effectively capture long-range dependencies in sequences, making them excel in motion generation tasks that require understanding global context. However, their performance is highly dependent on the quality and density of the input sequence; when faced with sparse conditions, how to effectively constrain their powerful attention mechanism becomes a new challenge.
2.2 Motion In-Betweening
Motion in-betweening is a fundamental yet highly challenging branch of motion generation [1]. The task is essentially to find a plausible path on a high-dimensional, non-linear pose manifold that connects two sparse keypoints (the start and end frames). The existence of an infinite number of possible transitions between two points makes this a classic ill-posed problem.
Early research primarily employed geometric or physics-based methods, such as linear interpolation [23] or dynamic planning [5,6]. Although simple to implement, these methods typically operate in Euclidean space, ignoring the rotational properties of poses and skeletal constraints, which often leads to visual artifacts like foot-skating and unnatural limb lengths.
With the rise of deep learning, researchers have attempted to apply mainstream generative models to this task. However, the inherent weaknesses of these models are amplified when faced with the sparse constraints of motion in-betweening. For GANs, the weak conditioning signal (only start and end frames) is insufficient to effectively guide the generation process, leading to intermediate frames that are nearly unconstrained and often resulting in chaotic or incoherent motions. For VAEs, their inherent tendency to regress towards the mean is particularly detrimental in interpolation tasks, as it erases all potential vivid details and generates a simplistic, averaged transition that lacks expressiveness.
To address these challenges, researchers have designed various specialized models. For example, Qin et al. [12] leveraged the powerful context modeling capabilities of Transformers, but their attention mechanism can still lose focus without intermediate anchors, leading to motion drift. The CVAE-based scheme by Ren et al. [9] can enhance diversity but remains limited by the inherent smoothing problem of VAEs. Hong et al. [24] tackled the long-horizon problem by predicting intermediate keyframes—an effective strategy, but one that introduces an additional prediction stage and may restrict the freedom of non-key frames.
In summary, existing methods for motion in-betweening have not yet found an ideal balance, struggling to simultaneously satisfy the three core requirements of physical realism, motion diversity, and structural coherence over long-horizon transitions.
2.3 Diffusion Models for Motion Generation
Denoising Diffusion Probabilistic Models (DDPMs) [25] offer a novel path to overcome the aforementioned difficulties. These models generate data by learning the reverse of a diffusion process, progressively denoising a sample from pure Gaussian noise back to a target sample. This mechanism grants them several key advantages: firstly, their training process is highly stable, avoiding the adversarial training challenges of GANs; secondly, they can fit complex data distributions with high fidelity, achieving superior generation quality and diversity compared to traditional VAEs.
The effectiveness of diffusion models in the domain of human motion generation has been widely validated. From general-purpose high-quality motion synthesis [13], to text-driven motion generation [14], and physics-constrained realistic motion simulation [26], diffusion models have consistently demonstrated State-of-the-Art (SOTA) performance. Therefore, with their powerful distribution-fitting capabilities and robustness to sparse conditions, diffusion models offer a highly promising technical path to solve the long-standing challenges of generation quality and diversity in motion in-betweening tasks.
2.4 Frequency-Domain Diffusion Models
Beyond architectural advancements, another emerging research direction seeks to enhance the performance of diffusion models by incorporating a frequency-domain perspective. Recently, a multitude of works have explored this direction within the fields of image and video generation. For example, FreeU [27] introduces a frequency-domain reweighting mechanism during inference to modulate the U-Net’s feature maps for improved image fidelity. DNI [28] proposes a spectral decomposition of the initial noise via adaptive filters to enable more flexible video editing. FlexiEdit [29] utilizes frequency-aware optimization to suppress high-frequency components in specific regions for better image layout editing. Furthermore, to enhance the temporal consistency of generated videos, FRAG [30] and FreeInit [31] have explored approaches from the perspectives of frequency-domain grouping and optimizing the low-frequency components of noise, respectively. Collectively, these works demonstrate the immense potential of frequency-domain operations in improving the quality and temporal coherence of generated content.
However, these cutting-edge explorations are largely confined to editing or optimizing existing visual content. A research gap remains in applying frequency-domain modeling as a core mechanism for end-to-end sequence generation from pure noise, especially for the task of motion in-betweening under extremely sparse conditions. The work presented in this paper aims to fill this gap, systematically investigating how frequency-domain representations can empower diffusion models to tackle the core challenges of motion interpolation.
Given the initial frame and the terminal frame of a motion sequence, denoted $x_1$ and $x_T$, the goal of motion in-betweening is to generate the intermediate frames $\{x_2, \dots, x_{T-1}\}$ so that the completed sequence is natural, temporally coherent, and consistent with the two keyframe constraints.
The proposed diffusion process consists of two main phases, as illustrated in Fig. 2. A key aspect of our framework is performing the diffusion process in the frequency domain, which allows for a more robust handling of long-term temporal structures in motion data. During the training phase, we apply noise addition and removal processes in the frequency domain after performing DCT on the complete action sequences. At each denoising time step, we calculate multiple losses on the spatiotemporal-domain sequence obtained through the iDCT. In the inference phase, we implement conditional generation using implicit guidance: we manually add noise of the intensity corresponding to the current time step to the start and end frames, while the intermediate frames are obtained from the previous denoising network step. For each time step, the denoising network's input is obtained by concatenating these two components. We have named our model FreqDiffusionBridge.

Figure 2: Overview of the proposed diffusion model. Left panel shows the training stage, right panel illustrates the sampling stage, and the central section details the denoising network architecture
DCT/iDCT Transformation: As motion data is inherently a time-series signal with strong structural properties, analyzing it from a frequency-domain perspective offers significant advantages. Human motion is naturally composed of dominant, low-frequency components that define the overall trajectory and primary actions, alongside high-frequency components that represent subtle gestures or noise. By transforming the motion sequence into the frequency domain, we can effectively decouple these elements. This allows the model to focus on learning the crucial low-frequency structures, which is essential for maintaining long-term coherence and smoothness, while simultaneously being more robust to high-frequency noise. This principle is why we adopt the DCT as a foundational step in our pipeline.
Given the high dimensionality of motion sequences, dimensionality reduction is essential for efficient processing. The adoption of the DCT not only addresses this challenge but also brings multiple advantages to motion sequence processing. DCT effectively decomposes temporal sequences into frequency components, allowing the model to focus on essential motion patterns while naturally filtering out high-frequency noise. This frequency-domain representation is particularly beneficial for human motion, as it captures both the dominant low-frequency movement patterns and subtle motion details at different temporal scales. Moreover, by processing signals in the frequency domain, the model can better handle the periodic nature of human movements and long-term dependencies in action sequences. Specifically, for each coordinate component of every joint, we perform a one-dimensional DCT along the time dimension. Given an action sequence $X = [x_1, x_2, \dots, x_T]$, where $x_i$ denotes the pose at frame $i$, the DCT coefficients of each joint-coordinate channel are computed as
$$c_k = \alpha_k \sum_{i=1}^{T} x_i \cos\!\left[\frac{\pi}{T}\Big(i - \frac{1}{2}\Big)k\right], \qquad k = 0, 1, \dots, T-1,$$
with $\alpha_0 = \sqrt{1/T}$ and $\alpha_k = \sqrt{2/T}$ for $k \geq 1$. In this way, the original action sequence is converted into a frequency-domain representation $Z = [c_0, c_1, \dots, c_{T-1}]$, in which the low-order coefficients encode the global motion trend and the high-order coefficients correspond to fine-grained variation and noise.
During generation, we invert the DCT coefficients from the frequency domain to reconstruct the motion sequence in the time domain. For this purpose, we perform the inverse DCT (iDCT) on the DCT coefficients. For each dimension, the time-domain signal is recovered as
$$x_i = \sum_{k=0}^{T-1} \alpha_k\, c_k \cos\!\left[\frac{\pi}{T}\Big(i - \frac{1}{2}\Big)k\right], \qquad i = 1, 2, \dots, T,$$
through which the frequency-domain representation $Z$ is converted back into the original spatiotemporal sequence $X$.
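For illustration, this transform pair can be realized per channel with SciPy's orthonormal DCT-II along the time axis (a minimal sketch; the array shapes are illustrative assumptions):

```python
import numpy as np
from scipy.fft import dct, idct

def motion_to_freq(motion: np.ndarray) -> np.ndarray:
    """Apply an orthonormal DCT-II along the time axis.

    motion: array of shape (T, J, 3) -- T frames, J joints, 3D coordinates.
    Returns coefficients of the same shape; index 0 along the first axis
    holds the lowest-frequency component of each joint-coordinate channel.
    """
    return dct(motion, type=2, norm="ortho", axis=0)

def freq_to_motion(coeffs: np.ndarray) -> np.ndarray:
    """Invert the transform (iDCT) back to the time domain."""
    return idct(coeffs, type=2, norm="ortho", axis=0)

if __name__ == "__main__":
    x = np.random.randn(64, 22, 3)           # a 64-frame sequence of 22 joints
    z = motion_to_freq(x)                     # frequency-domain representation
    x_rec = freq_to_motion(z)                 # reconstruction
    print(np.allclose(x, x_rec))              # orthonormal pair: exact inverse
```

Because the pair is orthonormal, no information is lost; truncating or down-weighting the high-order coefficients is what yields the noise-suppression effect described above.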
Self-Attention Mechanism: Conventional diffusion approaches commonly adopt a U-Net architecture for the denoising network. However, this convolution-based architecture requires padding or cropping when applied to variable-length sequence data like human motion. To effectively capture long-term temporal dependencies in action sequences, we instead adopt a denoising network composed of linear layers and self-attention mechanisms. Compared to traditional deep network structures, these linear layers incur lower computational complexity, while the self-attention mechanism captures correlations across the entire sequence globally.
The denoising network takes the noisy DCT coefficients produced by the forward diffusion process (detailed below), together with the corresponding timestep embedding, as input, and outputs a prediction of the noise they contain.
To make the model aware of the current noise level, the integer timestep $t$ is mapped to a continuous embedding vector using a sinusoidal encoding:
$$\mathrm{emb}(t)_{2i} = \sin\!\big(t / 10000^{2i/d}\big), \qquad \mathrm{emb}(t)_{2i+1} = \cos\!\big(t / 10000^{2i/d}\big),$$
where $d$ is the embedding dimension and $i$ indexes the embedding channels. The resulting embedding is added to the input features so that the network can adapt its denoising behavior to the noise level of each step.
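A minimal sketch of such a timestep embedding, assuming the standard sinusoidal encoding used in common DDPM implementations:

```python
import math
import torch

def timestep_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Map integer diffusion timesteps to sinusoidal embeddings.

    t:   tensor of shape (batch,) holding integer timesteps.
    dim: embedding dimension (assumed even here for simplicity).
    """
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) *
                      torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]                     # (batch, half)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)   # (batch, dim)

# Example: embeddings for timesteps 0, 250, and 999 with dimension 128.
emb = timestep_embedding(torch.tensor([0, 250, 999]), 128)
print(emb.shape)  # torch.Size([3, 128])
```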
The self-attention process begins by projecting the input sequence $H$ into three matrices through learned linear transformations: the Query $Q = HW_Q$, the Key $K = HW_K$, and the Value $V = HW_V$. In this mechanism, these three matrices serve distinct roles. The Query represents the frame that is currently seeking contextual information, the Key encodes what each frame offers for matching against the queries, and the Value carries the content that is aggregated once the relevant frames have been identified.
Next, we compute the attention scores, which determine how much focus each frame should place on every other frame. This is done by calculating the dot product of the Query and Key matrices. The result is scaled and passed through a softmax function to obtain the final attention weights. These weights are then used to create a weighted sum of the Value vectors:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$
where $d_k$ is the dimensionality of the Key vectors and the scaling factor $\sqrt{d_k}$ prevents the dot products from growing too large and saturating the softmax.
Finally, this contextually-enriched representation is passed through an output linear projection:
$$H' = \mathrm{Attention}(Q, K, V)\,W_O,$$
where $W_O$ is a learnable weight matrix that maps the aggregated features back to the model dimension; the resulting features are consumed by the subsequent layers of the denoising network to predict the noise component.
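A minimal single-head version of this attention layer in PyTorch (a simplified sketch; the actual denoising network may stack several such layers together with the timestep embedding):

```python
import torch
import torch.nn as nn

class FrameSelfAttention(nn.Module):
    """Single-head self-attention over the frame axis (illustrative only)."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.to_out = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, frames, dim) -- noisy features of the motion sequence.
        q, k, v = self.to_q(h), self.to_k(h), self.to_v(h)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return self.to_out(attn @ v)          # contextually enriched features

x = torch.randn(2, 64, 128)                   # batch of 2 sequences, 64 frames
print(FrameSelfAttention(128)(x).shape)       # torch.Size([2, 64, 128])
```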
Diffusion Process: In diffusion frameworks, input data is progressively corrupted with noise during the forward pass, and the model is trained to iteratively remove this noise in the reverse pass, restoring the original data distribution. In training, entire motion sequences undergo noise perturbation and subsequent denoising to approximate the target data distribution.
The forward diffusion process is defined as:
$$q(Z_t \mid Z_{t-1}) = \mathcal{N}\!\big(Z_t;\ \sqrt{1-\beta_t}\,Z_{t-1},\ \beta_t\mathbf{I}\big),$$
where $Z_t$ denotes the noisy frequency-domain representation at diffusion step $t$ (with $Z_0 = Z$ being the clean DCT representation), $\beta_t \in (0, 1)$ is the variance of the Gaussian noise injected at that step, and the schedule $\{\beta_t\}_{t=1}^{N}$ gradually transforms the data into approximately standard Gaussian noise after $N$ steps.
During the training phase, our objective is to guide the model to capture the data distribution so that it can generate realistic action sequences from noise, as illustrated in Fig. 3. Adhering to the DDPM protocol, we sample Gaussian noise $\epsilon \sim \mathcal{N}(0, \mathbf{I})$ together with a random timestep $t$ and corrupt the clean frequency-domain representation $Z_0$ using the closed-form forward process
$$Z_t = \sqrt{\bar{\alpha}_t}\,Z_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,$$
where $\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$ accumulates the noise schedule up to step $t$. The denoising network $\epsilon_\theta$ is then trained to recover the injected noise by minimizing
$$\mathcal{L}_{\text{noise}} = \mathbb{E}_{Z_0,\,\epsilon,\,t}\big[\,\|\epsilon - \epsilon_\theta(Z_t, t)\|_2^2\,\big],$$
where $\epsilon_\theta(Z_t, t)$ denotes the noise predicted by the network given the noisy coefficients $Z_t$ and the timestep $t$.

Figure 3: In the training phase, we perform DCT operation on the complete action sequence to achieve spectrum conversion, and then perform denoising steps at each time step to capture the overall distribution of motion
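As a concrete illustration, the sketch below implements one such training step in PyTorch (a simplified version assuming a linear noise schedule and an abstract `denoiser(z, t)` callable; it mirrors the closed-form noising and ε-prediction objective rather than reproducing the paper's exact training code):

```python
import torch
import torch.nn.functional as F

N = 1000                                           # total diffusion steps
betas = torch.linspace(1e-4, 0.02, N)              # assumed linear schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)     # \bar{alpha}_t

def q_sample(z0: torch.Tensor, t: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Closed-form forward process: z_t = sqrt(abar_t) z_0 + sqrt(1 - abar_t) eps."""
    a = alphas_bar[t].view(-1, 1, 1)
    return a.sqrt() * z0 + (1.0 - a).sqrt() * noise

def training_step(denoiser, z0: torch.Tensor) -> torch.Tensor:
    """z0: clean DCT coefficients of shape (batch, frames, features)."""
    t = torch.randint(0, N, (z0.shape[0],))        # random timestep per sample
    eps = torch.randn_like(z0)                     # target noise
    zt = q_sample(z0, t, eps)                      # noised input
    eps_pred = denoiser(zt, t)                     # network predicts the noise
    return F.mse_loss(eps_pred, eps)               # L_noise

# Dummy usage with a placeholder denoiser, just to show the call pattern.
loss = training_step(lambda z, t: torch.zeros_like(z), torch.randn(4, 64, 66))
print(loss.item())
```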
In most conditional diffusion models, the denoising process is explicitly guided by external control signals, such as text, pose, or image features. These signals are injected via architectural mechanisms like cross-attention or FiLM layers. The conditional denoising sampling process typically follows:
$$Z_{t-1} \sim p_\theta\big(Z_{t-1} \mid Z_t, c\big) = \mathcal{N}\!\big(Z_{t-1};\ \mu_\theta(Z_t, t, c),\ \Sigma_\theta(Z_t, t, c)\big),$$
where $c$ denotes the conditioning signal that is fed into the denoising network at every step to steer generation toward the desired content.
In contrast, our method tackles a more challenging scenario: we utilize an unconditional diffusion model, yet aim to achieve conditionally guided action sequence generation. That is, we do not embed the given start and end frames ($x_1$ and $x_T$) into the network as explicit conditioning inputs; instead, they influence generation only through the way the network's input sequence is constructed at each denoising step.
This implicit conditioning approach is inherently more difficult than conventional methods, as the model must infer the temporal evolution from sparse keyframe constraints without access to explicit embeddings or attention mechanisms. To clarify the distinction, we contrast the conventional and our implicit guidance approaches:
• Explicit conditioning (conventional): $Z_{t-1} \sim p_\theta(Z_{t-1} \mid Z_t, c)$, where the conditioning vector $c$ (for example, a text, pose, or keyframe embedding) is injected into the denoising network through dedicated modules such as cross-attention or FiLM layers.
• Implicit guidance (ours): The denoising network remains unconditional, i.e., $Z_{t-1} \sim p_\theta(Z_{t-1} \mid Z_t)$; the start and end frames steer generation only through the construction of the network input, in which the boundary positions are replaced at every step by versions of the given keyframes noised to the current level.
This design allows us to retain the simplicity and generalization power of unconditional diffusion models, while still producing coherent, physically plausible sequences aligned with the sparse keyframe constraints.
The detailed inference process proceeds as follows:
As our method utilizes an unconditional diffusion backbone, the key to conditional generation lies in carefully constructing the network input at each denoising step, so that the boundary frames encode the given keyframes while remaining statistically consistent with the noise level of the intermediate frames.
To achieve this consistency, we apply noise to the original clean start and end frames in accordance with the diffusion schedule, precisely matching the noise level of the current timestep. We accomplish this by applying the official forward process (q-sample) formula from the DDPM framework [25]. This step is not an arbitrary perturbation; it is a mathematically grounded procedure that guarantees the resulting noised boundary frames have the correct statistical distribution for a given timestep.
The noised boundary frames obtained in this way replace the first and last positions of the current noisy sequence, while the intermediate positions retain the estimates produced by the previous denoising step.
This data manipulation step is the core of our implicit guidance strategy. By consistently providing the network with a fully noised sequence where the boundaries are anchored to the ground truth, we effectively steer the unconditional model to generate a coherent intermediate trajectory that honors the given conditions.
For the intermediate frames, the process is initialized at step N by sampling from a standard Gaussian distribution, after which they are progressively refined by the reverse diffusion process.
Since we choose to perform diffusion inference in the frequency domain, the denoised sequences must undergo an iDCT before being concatenated. At each denoising time step, the network input is the full sequence formed by concatenating the start frame, the intermediate frames, and the end frame; the denoising network then predicts the noise contained in this concatenated sequence, and the corresponding reverse-diffusion update yields the estimate used at the next step, until the clean intermediate motion is recovered at the final step.
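The sketch below illustrates this implicit-guidance sampling loop in simplified form (plain DDPM ancestral updates with a linear schedule, single start and end keyframes, and an assumed `denoiser(x, t)` callable; the exact placement of the DCT/iDCT conversions is abstracted away):

```python
import torch

N = 1000                                             # total denoising steps
betas = torch.linspace(1e-4, 0.02, N)                # assumed linear schedule
alphas = 1.0 - betas
alphas_bar = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample_inbetween(denoiser, x_start, x_end, frames, dim):
    """Implicit-guidance sampling sketch.

    denoiser(x, t) -> predicted noise for x of shape (1, frames, dim).
    x_start, x_end: clean feature vectors of the two keyframes, shape (dim,).
    """
    x = torch.randn(1, frames, dim)                  # step N: pure noise everywhere
    for t in reversed(range(N)):
        # Re-noise the clean keyframes to the current noise level (q-sample),
        # so the boundaries are statistically consistent with the middle frames.
        a_bar = alphas_bar[t]
        x[0, 0] = a_bar.sqrt() * x_start + (1 - a_bar).sqrt() * torch.randn(dim)
        x[0, -1] = a_bar.sqrt() * x_end + (1 - a_bar).sqrt() * torch.randn(dim)

        # Predict the noise for the whole concatenated sequence and take one
        # DDPM reverse step; only the intermediate frames are truly generated.
        eps = denoiser(x, torch.full((1,), t, dtype=torch.long))
        mean = (x - betas[t] / (1 - a_bar).sqrt() * eps) / alphas[t].sqrt()
        x = mean + betas[t].sqrt() * torch.randn_like(x) if t > 0 else mean
    return x                                         # keyframes + generated middle
```

In practice the returned tensor would be mapped back to joint positions (via the iDCT when the loop operates on DCT coefficients) before evaluation or rendering.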
The entire inference process is summarized visually in Fig. 4, highlighting the temporal integration of keyframes with intermediate frames at each denoising step.

Figure 4: At each denoising time step, the noise in the middle frame is predicted by the denoising network. The noise intensity of the starting and ending frames is manually controlled to maintain consistency with the middle frame. These frames are then concatenated (denoted by © in the figure) to form the input for the next time step of the denoising network
While the simple noise prediction loss ($\mathcal{L}_{\text{noise}}$) drives the model to match the overall data distribution, it does not by itself guarantee that the generated motion is smooth, anatomically valid, or temporally consistent. We therefore complement it with structural objectives that explicitly target these properties.
Beyond plausibility, another key goal is to ensure generative diversity and combat mode collapse, a common failure mode in generative models. To this end, we incorporate a diversity loss ($\mathcal{L}_{\text{div}}$) that explicitly encourages distinct samples generated from the same pair of keyframes to differ from one another.
Noise Prediction: We employ the DDPM's standard $\epsilon$-prediction objective, which trains the denoising network to recover the Gaussian noise injected by the forward process:
$$\mathcal{L}_{\text{noise}} = \mathbb{E}_{Z_0,\,\epsilon,\,t}\big[\,\|\epsilon - \epsilon_\theta(Z_t, t)\|_2^2\,\big],$$
where $\epsilon$ is the sampled Gaussian noise and $\epsilon_\theta(Z_t, t)$ is the network's prediction given the noisy coefficients $Z_t$ and the timestep $t$.
We incorporate a diversity-promoting regularizer that penalizes similarity among generated motion sequences, thereby discouraging mode collapse and ensuring a wide range of outputs. It does this by calculating the average difference between different generated samples, ensuring that the generated results have sufficient diversity:
$$\mathcal{L}_{\text{div}} = -\frac{1}{N(N-1)} \sum_{i=1}^{N} \sum_{j \neq i} \big\| \hat{X}^{(i)} - \hat{X}^{(j)} \big\|_2,$$
in this context, N represents the total number of synthesized samples, and $\hat{X}^{(i)}$ denotes the $i$-th generated motion sequence; the negative sign turns the average pairwise distance into a quantity to be minimized.
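A simplified sketch of such a pairwise-diversity regularizer (the L2 distance and the negative-sign convention are assumptions; minimizing the returned value pushes samples apart):

```python
import torch

def diversity_loss(samples: torch.Tensor) -> torch.Tensor:
    """samples: (N, T, D) -- N generated sequences for the same keyframe pair.
    Returns the negative mean pairwise L2 distance, so that minimizing the
    loss spreads the samples apart and discourages mode collapse."""
    n = samples.shape[0]
    flat = samples.reshape(n, -1)
    dists = torch.cdist(flat, flat, p=2)             # (N, N) pairwise distances
    mean_dist = dists.sum() / (n * (n - 1))           # exclude the zero diagonal
    return -mean_dist

print(diversity_loss(torch.randn(8, 64, 66)))
```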
The smoothness loss is used to ensure that the generated action sequences remain smooth and continuous over time, preventing unnatural abrupt changes in the generated sequences. It does this by calculating the differences between adjacent time frames, encouraging the frames in the sequence to transition gradually.
The bone length loss is used to ensure that the generated action sequences are physically plausible by maintaining the consistency of human bone lengths. It does this by constraining the distances between pairs of joints connected by bones, ensuring they are close to predefined bone lengths:
$$\mathcal{L}_{\text{bone}} = \frac{1}{T\,|\mathcal{B}|} \sum_{i=1}^{T} \sum_{(p,q) \in \mathcal{B}} \Big|\, \| x_{i,p} - x_{i,q} \|_2 - l_{pq} \,\Big|,$$
In this equation, T indicates the total frame count of the motion sequence, $\mathcal{B}$ is the set of joint pairs connected by bones, $x_{i,p}$ denotes the 3D position of joint $p$ at frame $i$, and $l_{pq}$ is the reference length of the bone connecting joints $p$ and $q$.
To ensure biomechanically plausible joint rotations and prevent unrealistic joint twisting, we introduce a bone angle loss based on joint-wise maximum angular limits:
$$\mathcal{L}_{\text{angle}} = \frac{1}{T\,|\mathcal{J}|} \sum_{i=1}^{T} \sum_{j \in \mathcal{J}} \max\!\big(0,\ \theta_{i,j} - \theta_j^{\max}\big),$$
Here, T indicates the sequence's frame count, $\mathcal{J}$ is the set of joints, $\theta_{i,j}$ is the rotation angle of joint $j$ at frame $i$, and $\theta_j^{\max}$ is the maximum anatomically permissible angle for that joint; only rotations exceeding the limit are penalized.
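Simplified sketches of the three structural terms described above, under assumed representations (3D joint positions for the smoothness and bone-length terms, per-joint rotation angles for the angular limit):

```python
import torch

def smoothness_loss(x: torch.Tensor) -> torch.Tensor:
    """x: (T, J, 3) joint positions; penalize large frame-to-frame changes."""
    return (x[1:] - x[:-1]).pow(2).mean()

def bone_length_loss(x: torch.Tensor, bones, ref_len: torch.Tensor) -> torch.Tensor:
    """bones: list of (parent, child) joint-index pairs; ref_len: (B,) target lengths."""
    p = torch.tensor([b[0] for b in bones])
    c = torch.tensor([b[1] for b in bones])
    lengths = (x[:, p] - x[:, c]).norm(dim=-1)        # (T, B) bone lengths per frame
    return (lengths - ref_len).abs().mean()

def bone_angle_loss(theta: torch.Tensor, theta_max: torch.Tensor) -> torch.Tensor:
    """theta: (T, J) joint rotation angles; theta_max: (J,) anatomical limits.
    Only the excess beyond each limit is penalized."""
    return torch.relu(theta - theta_max).mean()

x = torch.randn(64, 22, 3)
print(smoothness_loss(x),
      bone_length_loss(x, [(0, 1), (1, 2)], torch.tensor([0.3, 0.4])),
      bone_angle_loss(torch.rand(64, 22), torch.full((22,), 0.8)))
```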
In the end, the model's training aims to optimize every facet of the synthesized motion sequences by combining all the loss functions with appropriate weights:
$$\mathcal{L} = \mathcal{L}_{\text{noise}} + \lambda_{\text{div}}\,\mathcal{L}_{\text{div}} + \lambda_{\text{smooth}}\,\mathcal{L}_{\text{smooth}} + \lambda_{\text{bone}}\,\mathcal{L}_{\text{bone}} + \lambda_{\text{angle}}\,\mathcal{L}_{\text{angle}},$$
where the coefficients $\lambda_{\text{div}}$, $\lambda_{\text{smooth}}$, $\lambda_{\text{bone}}$, and $\lambda_{\text{angle}}$ balance the contribution of each term.
Our implementation is based on the PyTorch framework. During training, we set the number of denoising steps to 1000, the learning rate to
In evaluating the performance of generated motion sequences, five key metrics are employed to comprehensively analyze the model’s capabilities. L2P quantifies spatial discrepancies between generated and reference motions by computing the Euclidean distance for each corresponding joint, thereby gauging the model’s accuracy in reproducing motion trajectories. L2Q focuses on the evaluation of motion poses, utilizing quaternions to represent joint rotations and calculating the error in each frame’s pose between generated and ground-truth sequences [34]. NPSS (Normalized Power Spectrum Similarity) [1,35] emphasizes frequency-domain analysis by comparing the dynamic characteristics of generated motions with real ones, revealing whether the rhythm and variation of the generated movements are natural and smooth. APD (Average Pairwise Diversity) highlights the diversity among generated motions, assessing the model’s ability to avoid mode collapse by evaluating the differences between multiple generated sequences. Finally, ADE (Average Displacement Error) [36] offers a global perspective, measuring the overall deviation of generated motion sequences relative to reference trajectories. To harmonize datasets with differing scales and feature units, we apply normalization to the motion data prior to evaluation, thereby enhancing the generality and comparability of the results.
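For clarity, the sketch below shows simplified computations of two of these metrics from 3D joint positions (shapes and averaging conventions are assumptions; benchmark implementations may differ in details):

```python
import numpy as np

def ade(pred: np.ndarray, gt: np.ndarray) -> float:
    """Average Displacement Error: mean per-joint Euclidean distance.
    pred, gt: (T, J, 3) generated and reference joint positions."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def apd(samples: np.ndarray) -> float:
    """Average Pairwise Diversity over N generated samples of shape (N, T, J, 3)."""
    n = samples.shape[0]
    flat = samples.reshape(n, -1)
    diffs = flat[:, None, :] - flat[None, :, :]       # (N, N, T*J*3)
    dists = np.linalg.norm(diffs, axis=-1)            # (N, N) pairwise distances
    return float(dists.sum() / (n * (n - 1)))

gt = np.random.randn(64, 22, 3)
samples = gt[None] + 0.05 * np.random.randn(5, 64, 22, 3)
print(ade(samples[0], gt), apd(samples))
```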
We compare our work against a suite of classical and state-of-the-art baselines, which are grouped by their underlying architecture. These include the GAN-based Two-stage [8], which employs a conditional GAN for pose sequence generation. The Transformer-based methods we evaluate are
Our approach is benchmarked against multiple top-tier baselines on two large-scale datasets: Human3.6M and LAFAN1. The quantitative results for generating intermediate frames of varying lengths (30, 50, and 100) are summarized in Tables 1 and 2, respectively. In these tables, we highlight the best-performing method for each metric in bold and the second-best in underline. For deterministic models like


The results across both datasets demonstrate the superior performance of our proposed method. On Human3.6M, our model achieves the best results on nearly all error-based metrics (ADE, L2P, L2Q) across all sequence lengths, indicating high accuracy in pose and trajectory generation. While Stitching-CVAE shows higher diversity (APD), our model provides a better balance of diversity and accuracy. This strength is particularly evident in the NPSS metric, where our results are consistently among the top, signifying that the generated motions are spectrally similar to real human movements. On the LAFAN1 dataset, which features more diverse and complex motions, our model continues to excel, particularly in pose accuracy (L2P, L2Q), significantly outperforming other methods. These comprehensive results confirm that our framework can generate motion sequences that are not only physically plausible and accurate but also diverse and natural, maintaining high effectiveness across different datasets and sequence lengths.
As shown in the tables, for the task of intermediate frame generation, our method outperforms existing methods on most metrics, especially on the NPSS metric. This demonstrates that our model can more accurately capture the original data distribution and generate more realistic intermediate frame sequences. Meanwhile, our results on APD are second only to the current best model, indicating that our method can generate diverse results while maintaining authenticity, as illustrated in Fig. 5. Notably, we also tested our model under a scenario with 100 frames, revealing that it remains highly effective across sequences of moderate to extended length.

Figure 5: Visualization results. We selected four actions from the Human3.6M dataset, each of which generates 118 intermediate frames through a start frame and a stop frame. We sampled every 5 frames and presented two different generated sequences for each action. Our framework synthesizes long-horizon motion sequences that are both diverse and plausible
In addition to quantitative metrics, we provide a qualitative analysis to visually assess the quality and practical utility of our generated motions. To validate our method’s applicability in standard animation pipelines, the output motion data from our model was directly imported into the industry-standard software Autodesk MotionBuilder and rendered on a 3D character, as shown in Fig. 6. These results highlight our model’s ability to produce high-fidelity and physically coherent motions for various activities.

Figure 6: Qualitative results of our motion in-betweening method on the LaFAN1 dataset. To demonstrate practical applicability, the generated motion trajectories for three actions (Run, Walk, and Jump) were imported into Autodesk MotionBuilder and rendered on a 3D character model. Given only the start and end poses, our model synthesizes smooth and physically plausible intermediate motions
For the Run sequence, our model generates a dynamic and consistent running cycle, where the coordination of limbs and the forward lean of the torso appear natural. In the Walk example, the trajectory shows a stable gait without noticeable artifacts such as foot-sliding, and the subtle vertical oscillations of the body are captured realistically. In the more complex Jump motion, our model successfully synthesizes an explosive take-off and a controlled landing, with the character’s center of mass following a plausible ballistic arc. Overall, these visualizations corroborate our quantitative findings, demonstrating that our framework can generate motion sequences that are not only accurate but also aesthetically pleasing and physically sound.
Ablation experiments were performed on the Human3.6M dataset to isolate and quantify the impact of each model module. Table 3 reports the outcomes of our ablation experiments. The table rows detail these experiments, with each row representing a different variant of the full model where one or more components have been disabled to evaluate their contributions.

The introduction of DCT plays a crucial role in our model. By mapping the action sequences into the spectral domain, DCT helps reduce high-frequency noise and eliminate jitter in the generated sequences, thereby enhancing the smoothness and naturalness of the generated actions. As shown in Table 3, removing DCT leads to a clear degradation across several evaluation metrics. While the slight increase in APD suggests that retaining the unfiltered high-frequency content adds some variability to the generated sequences, the increases in ADE, L2P, and L2Q reflect a decline in the accuracy of position and pose predictions. The rise in NPSS indicates a reduced similarity between the spectral characteristics of the generated and real actions.
To further assess the impact of DCT preprocessing, we examined the power spectral distribution of the DCT-processed sequences. As shown in Fig. 7, after applying DCT, the model’s power spectrum distribution becomes more uniform in the low-frequency region, effectively covering the entire low-frequency range. This indicates that DCT preprocessing aids the model in capturing the characteristics of motion sequences in the low-frequency band, leading to a smoother and more natural overall distribution of the generated results in the frequency domain. It also avoids the issue of local frequency components being overly concentrated. This result underscores the significant role of DCT in boosting the model’s capacity to extract spectral characteristics of motion.

Figure 7: The comparison of the spectrograms of motion sequences generated with and without DCT shows that the sequences generated using DCT exhibit improved performance across different frequency bands
To foster varied outputs, the diversity loss maximizes the mean pairwise distance among generated sequences. After removing the diversity loss, while some accuracy metrics may improve, the significant decrease in APD indicates that this loss is crucial for maintaining the richness and diversity of the generated actions. Without the diversity loss, the model tends to generate more consistent and homogeneous sequences, which, although performing better on certain metrics, limits overall generation quality and diversity. Although removing the bone length loss and bone angle loss does not significantly improve the evaluation metrics, we observed during the model training process that these two losses significantly enhance the convergence speed of the model and improve generation quality in the early stages of training. Fig. 8 compares the motion sequences produced by models incorporating or omitting bone length and angle penalties at the same training epoch. The model incorporating bone length and angle losses generates more stable and natural sequences, whereas the model lacking these losses exhibits more anomalies and unreasonable poses in the generated sequences.

Figure 8: After incorporating the physical structure losses (bone length and bone angle), the model generates more stable and natural poses at the same training epoch, whereas the variant trained without these losses exhibits more anomalous and implausible poses
The ablation study provides clear insights into the distinct role of each component and their interplay. The single-component results confirm foundational contributions: removing the DCT transformation significantly degrades motion accuracy and spectral consistency (NPSS), highlighting its crucial role in establishing a stable feature space, while removing the diversity loss term leads to a drastic drop in APD, confirming its necessity for preventing mode collapse.
Furthermore, the study reveals complex interactions when components are removed simultaneously. For instance, removing both bone length and angle losses results in a disproportionate increase in pose rotation errors (L2Q), demonstrating that these physical constraints work synergistically to ensure anatomical plausibility. Conversely, other combinations highlight critical trade-offs; removing both the diversity and angle constraints, for example, yields poses that, while less varied, lose significant rotational realism, underscoring the delicate balance required to generate outputs that are both diverse and physically correct.
While the proposed frequency-domain diffusion model achieves competitive results on the motion in-betweening task, we acknowledge that it has several limitations, which in turn point to promising directions for future research. First, the model’s generalization capability requires further investigation. Although it performs well on the Human3.6M and LaFAN1 datasets, its generation quality may degrade for out-of-distribution (OOD) motions not seen during training, such as professional acrobatics or highly stylized dance forms. Second, regarding inference efficiency, our method shares the inherent limitations of diffusion models. Although we already employ DDIM [32] to accelerate the sampling process by reducing the number of denoising steps from 1000 to 100, this iterative procedure still incurs considerable computational cost. This makes the current model challenging to deploy in real-time applications that require instantaneous feedback, such as interactive games or live virtual avatar driving. Furthermore, the current model’s generation process is conditioned solely on the start and end keyframes, lacking higher-level controllability; for instance, it cannot be guided by semantic information such as textual descriptions or musical rhythms. Finally, the quality of the generated motion is, to some extent, dependent on the quality of the input keyframes, and the model’s output may be unpredictable when given unreasonable or physically impossible start and end poses. Future work could explore avenues to address these limitations, such as enhancing the model’s generalization, investigating more efficient sampling techniques like consistency models, and integrating multi-modal conditioning.
In this work, we introduced a novel frequency-domain diffusion model for the task of human motion in-betweening. By operating in the frequency domain via the Discrete Cosine Transform, employing a self-attention-based denoising network, and designing a multi-objective loss function, our method effectively generates high-quality, natural, and physically plausible motion sequences from only start and end frames. The superiority of this framework is validated by comprehensive experiments on the Human3.6M and LaFAN1 datasets. The quantitative results demonstrate that our method achieves state-of-the-art performance on key error-based metrics, such as L2P and L2Q, while maintaining a strong balance between generation diversity (APD) and spectral realism (NPSS). This research confirms the significant potential of leveraging frequency-domain representations within diffusion models and offers a robust solution for creative applications in animation and virtual reality.
Acknowledgement: The authors thank the anonymous reviewers for their valuable suggestions.
Funding Statement: This work was supported by the National Natural Science Foundation of China (Grant No. 72161034).
Author Contributions: Conceptualization, Qiang Zhang, Shuo Feng and Ying Qi; methodology, Shuo Feng, Shanxiong Chen and Teng Wan; software, Shuo Feng and Shanxiong Chen; validation, Shanxiong Chen and Teng Wan; formal analysis, Shuo Feng and Shanxiong Chen; investigation, Shuo Feng, Shanxiong Chen and Teng Wan; resources, Qiang Zhang and Ying Qi; data curation, Shuo Feng and Shanxiong Chen; writing—original draft preparation, Shuo Feng and Teng Wan; writing—review and editing, Qiang Zhang, Shanxiong Chen and Ying Qi; visualization, Shuo Feng and Teng Wan; supervision, Qiang Zhang and Ying Qi; project administration, Qiang Zhang and Ying Qi; funding acquisition, Qiang Zhang and Ying Qi. All authors reviewed the results and approved the final version of the manuscript.
Availability of Data and Materials: The Human3.6M dataset used in this study is publicly available at http://vision.imar.ro/human3.6m/ (accessed on 11 July 2025). The LAFAN1 dataset is publicly available at https://github.com/ubisoft/ubisoft-laforge-animation-dataset (accessed on 11 July 2025). Other data generated during the study are available from the corresponding author upon reasonable request.
Ethics Approval: Not applicable.
Conflicts of Interest: The authors declare no conflicts of interest to report regarding the present study.
References
1. Holden D, Saito J, Komura T. A deep learning framework for character motion synthesis and editing. ACM Trans Graph (TOG). 2016;35(4):1–11. doi:10.1145/2897824.2925975.
2. Li H, Liu H, Zhao W, Liu H. The human motion behavior recognition by deep learning approach and the internet of things. Int J Interact Multimed Artif Intell. 2024;8(7):55–65.
3. Pavllo D, Feichtenhofer C, Auli M, Grangier D. Modeling human motion with quaternion-based neural networks. Int J Comput Vis. 2020;128(4):855–72. doi:10.1007/s11263-019-01245-6.
4. Holden D, Komura T, Saito J. Phase-functioned neural networks for character control. ACM Trans Graph (TOG). 2017;36(4):1–13. doi:10.1145/3072959.3073663.
5. Chai J, Hodgins JK. Performance animation from low-dimensional control signals. ACM Trans Graph (TOG). 2005;24(3):686–96. doi:10.1145/1073204.1073248.
6. Harvey FG, Pal C. Recurrent transition networks for character locomotion. In: SIGGRAPH Asia 2018 Technical Briefs; 2018 Dec 4–7; Tokyo, Japan. p. 1–4. doi:10.1145/3283254.3283277.
7. Barsoum E, Kender J, Liu Z. HP-GAN: probabilistic 3D human motion prediction via GAN. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops; 2018 Jun 18–23; Salt Lake City, UT, USA. p. 1418–27.
8. Cai H, Bai C, Tai YW, Tang CK. Deep video generation, prediction and completion of human action sequences. In: Proceedings of the European Conference on Computer Vision (ECCV); 2018 Sep 8–14; Munich, Germany. p. 366–82.
9. Ren T, Yu J, Guo S, Ma Y, Ouyang Y, Zeng Z, et al. Diverse motion in-betweening from sparse keyframes with dual posture stitching. IEEE Trans Vis Comput Graph. 2024;31(2):1402–13. doi:10.1109/tvcg.2024.3363457.
10. Cai Y, Wang Y, Zhu Y, Cham TJ, Cai J, Yuan J, et al. A unified 3D human motion synthesis model via conditional variational auto-encoder. In: Proceedings of the IEEE/CVF International Conference on Computer Vision; 2021 Oct 11–17; Montreal, BC, Canada. p. 11645–55.
11. Oreshkin BN, Valkanas A, Harvey FG, Ménard LS, Bocquelet F, Coates MJ. Motion in-betweening via deep Δ-interpolator. IEEE Trans Vis Comput Graph. 2023;30(8):5693–704. doi:10.1109/tvcg.2023.3309107.
12. Qin J, Zheng Y, Zhou K. Motion in-betweening via two-stage transformers. ACM Trans Graph. 2022;41(6):184–1. doi:10.1145/3550454.3555454.
13. Tevet G, Raab S, Gordon B, Shafir Y, Cohen-Or D, Bermano AH. Human motion diffusion model. arXiv:2209.14916. 2022.
14. Zhang M, Cai Z, Pan L, Hong F, Guo X, Yang L, et al. Motiondiffuse: text-driven human motion generation with diffusion model. IEEE Trans Pattern Anal Mach Intell. 2024;46(6):4115–28. doi:10.1109/tpami.2024.3355414.
15. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, et al. Generative adversarial networks. Commun ACM. 2020;63(11):139–44. doi:10.1145/3422622.
16. Gui LY, Wang YX, Liang X, Moura JM. Adversarial geometry-aware human motion prediction. In: Proceedings of the European Conference on Computer Vision (ECCV); 2018 Sep 8–14; Munich, Germany. p. 786–803.
17. Kingma DP, Welling M. Auto-encoding variational bayes. In: 2nd International Conference on Learning Representations (ICLR 2014); 2014 Apr 14–16; Banff, AB, Canada.
18. Komura T, Habibie I, Holden D, Schwarz J, Yearsley J. A recurrent variational autoencoder for human motion synthesis. In: The 28th British Machine Vision Conference; 2017 Sep 4–7; London, UK. 119 p.
19. Walker J, Doersch C, Gupta A, Hebert M. An uncertain future: forecasting from static images using variational autoencoders. In: Computer Vision-ECCV 2016: 14th European Conference. Amsterdam, The Netherlands: Springer; 2016. p. 835–51.
20. Yin W, Yin H, Kragic D, Björkman M. Graph-based normalizing flow for human motion generation and reconstruction. In: 2021 30th IEEE International Conference on Robot & Human Interactive Communication (RO-MAN); 2021 Aug 8–12; Vancouver, BC, Canada. p. 641–8.
21. Petrovich M, Black MJ, Varol G. Temos: generating diverse human motions from textual descriptions. In: European Conference on Computer Vision; 2022 Oct 23–27; Tel Aviv, Israel. p. 480–97.
22. Ahn H, Ha T, Choi Y, Yoo H, Oh S. Text2action: generative adversarial synthesis from language to action. In: 2018 IEEE International Conference on Robotics and Automation (ICRA); 2018 May 21–25; Brisbane, QLD, Australia. p. 5915–20.
23. Harvey FG, Yurick M, Nowrouzezahrai D, Pal C. Robust motion in-betweening. ACM Trans Graph (TOG). 2020;39(4):60–1. doi:10.1145/3386569.3392480.
24. Hong S, Kim H, Cho K, Noh J. Long-term motion in-betweening via keyframe prediction. Comput Graph Forum. 2024;43:e15171.
25. Ho J, Jain A, Abbeel P. Denoising diffusion probabilistic models. Adv Neural Inf Proc Syst. 2020;33:6840–51.
26. Yuan Y, Song J, Iqbal U, Vahdat A, Kautz J. Physdiff: physics-guided human motion diffusion model. In: Proceedings of the IEEE/CVF International Conference on Computer Vision; 2023 Oct 1–6; Paris, France. p. 16010–21.
27. Si C, Huang Z, Jiang Y, Liu Z. FreeU: free lunch in diffusion U-Net. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2024 Jun 16–22; Seattle, WA, USA. p. 4733–43.
28. Yoon S, Koo G, Hong JW, Yoo CD. DNI: dilutional noise initialization for diffusion video editing. In: European Conference on Computer Vision; 2024 Sep 29–Oct 4; Milan, Italy. p. 180–95.
29. Koo G, Yoon S, Hong JW, Yoo CD. Flexiedit: frequency-aware latent refinement for enhanced non-rigid editing. In: European Conference on Computer Vision; 2024 Sep 29–Oct 4; Milan, Italy. p. 363–79.
30. Yoon S, Koo G, Kim G, Yoo CD. FRAG: frequency adapting group for diffusion video editing. arXiv:2406.06044. 2024.
31. Wu T, Si C, Jiang Y, Huang Z, Liu Z. FreeInit: bridging initialization gap in video diffusion models. In: European Conference on Computer Vision; 2024 Sep 29–Oct 4; Milan, Italy. p. 378–94.
32. Song J, Meng C, Ermon S. Denoising diffusion implicit models. arXiv:2010.02502. 2020.
33. Ionescu C, Papava D, Olaru V, Sminchisescu C. Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Trans Pattern Anal Mach Intell. 2013;36(7):1325–39. doi:10.1109/tpami.2013.248.
34. Martinez J, Black MJ, Romero J. On human motion prediction using recurrent neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017 Jul 21–26; Honolulu, HI, USA. p. 2891–900.
35. Gopalakrishnan A, Mali A, Kifer D, Giles L, Ororbia AG. A neural temporal model for human motion prediction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2019 Jun 16–17; Long Beach, CA, USA. p. 12116–25.
36. Ghosh P, Song J, Aksan E, Hilliges O. Learning human motion models for long-term predictions. In: 2017 International Conference on 3D Vision (3DV); 2017 Oct 10–12; Qingdao, China. p. 458–66.
Copyright © 2026 The Author(s). Published by Tech Science Press. This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

