H-LoRA: Rethinking Rank Selection for Controllable Knowledge Retention in Edge AI
Centre for Business Incubation and Entrepreneurial Ventures, Tunku Abdul Rahman University of Management and Technology, Jalan Genting Kelang, Setapak, Kuala Lumpur, Malaysia
* Corresponding Author: Lim Tong Ming. Email:
(This article belongs to the Special Issue: Advanced Edge Computing and Artificial Intelligence in Smart Environment)
Computers, Materials & Continua 2026, 88(1), 87 https://doi.org/10.32604/cmc.2026.080068
Received 02 February 2026; Accepted 30 March 2026; Issue published 08 May 2026
Abstract
The deployment of specialized language models in resource-constrained edge environments (

Keywords
1.1 The Imperative for Specialized AI on the Edge
The rapid proliferation of edge computing has created an urgent demand for deploying specialized Artificial Intelligence (AI) models directly on resource-constrained devices, where memory, latency, and communication bandwidth are tightly limited. Unlike cloud-based AI systems that can leverage virtually unlimited computational resources, edge AI applications operate under stringent constraints that demand a fundamental rethinking of model architecture and optimization strategies. These deployment scenarios impose strict resource limitations: memory footprints must typically remain below
However, the inherent trade-off between model compactness and task-specific performance presents a significant challenge. While general-purpose compact models can provide broad functionality, they often lack the specialized knowledge and fine-grained understanding required for domain-specific applications such as medical diagnosis, legal document analysis, or technical support systems for a specific company or region-specific knowledge. This limitation creates an urgent need for novel fine-tuning methodologies that can instill deep domain expertise into compact models without compromising the efficiency demanded by edge environments.
While recent large-scale language models demonstrate impressive performance, their computational and memory requirements remain prohibitive for many edge deployments (e.g., Internet of Things (IoT) gateways, embedded systems, and mobile devices). This work therefore focuses on sub-
1.2 The Fine-Tuning Dilemma for Compact Models
Supervised Fine-Tuning (SFT) has long been regarded as the gold standard for achieving domain specialization in language models, demonstrating exceptional capability in adapting general-purpose models to specific application domains. However, when applied to compact models in edge scenarios, SFT reveals a critical vulnerability: it induces irreversible catastrophic forgetting that fundamentally compromises the model’s general capabilities. This phenomenon, first systematically studied in connectionist networks [6,7], becomes particularly acute in compact models, where the reduced parameter space amplifies interference effects [8] and makes the trade-off between specialization and retention more pronounced; the resulting knowledge destruction renders the model unsuitable for dynamic deployment scenarios that require adaptability across multiple domains or the ability to recover general capabilities post-deployment.
Traditional Low-Rank Adaptation (LoRA), while successful in mitigating catastrophic forgetting in larger models, faces substantial limitations when applied to compact architectures. Crucially, existing LoRA-based studies treat the rank primarily as a tunable compression hyperparameter, without examining its role in governing the nature of forgetting and knowledge retention under edge deployment constraints. The conventional approach of using conservative rank settings (
The current state of research presents a fundamental gap: no existing methodology enables compact models to achieve SFT-level domain performance while avoiding the destructive forgetting patterns that compromise their utility in dynamic edge environments. This gap represents not merely a technical limitation, but a barrier to understanding the fundamental trade-offs between specialization and retention that govern parameter-efficient fine-tuning.
1.3 A New Paradigm: Destructive Forgetting vs. Controllable Knowledge Retention
To address this gap, we introduce a new theoretical paradigm centered on the distinction between destructive forgetting and controllable knowledge retention. We observe that existing approaches exhibit two fundamentally different modes of capability change: destructive forgetting, characterized by irreversible degradation of the original knowledge structure, and controllable knowledge retention, where apparent performance degradation masks preserved underlying knowledge structures that remain recoverable and adjustable.
SFT exemplifies destructive forgetting through its direct manipulation of pre-trained weights, which overwrites the carefully learned representations that encode general knowledge. This process fundamentally destroys the model’s original knowledge architecture, rendering any capability loss irreversible and eliminating the possibility of post-deployment adaptation. In contrast, our investigation reveals that properly configured high-rank adaptation can achieve controllable knowledge retention, where domain specialization occurs through additive low-rank modifications that maintain the structural integrity of the original knowledge subspace, providing an architectural guarantee for capability recovery and adjustment.
This paradigm shift builds upon recent advances in understanding LoRA’s low-rank constraints but fundamentally challenges the conventional wisdom of rank limitations, particularly in the context of compact model deployment. The theoretical foundations of catastrophic forgetting [6] and neural network representation learning [10] provide crucial insights into why existing approaches fail to achieve controllable knowledge retention. To realize this vision, we introduce H-LoRA (High-Rank LoRA), a novel approach that combines strategic high-rank adaptation with simplified scaling mechanisms to unlock the full potential of compact models while maintaining knowledge recoverability and achieving superior specialization performance.
This work makes the following key contributions to parameter-efficient fine-tuning for compact models under edge deployment constraints:
1. H-LoRA for Edge AI Specialization. We propose H-LoRA, a high-rank adaptation framework with a simplified scaling mechanism, specifically designed to enable efficient and stable specialization of compact models in resource-constrained edge environments.
2. Destructive Forgetting vs. Controllable Knowledge Retention. We introduce a clear conceptual distinction between destructive forgetting induced by supervised fine-tuning and controllable knowledge retention achieved through additive adaptation, supported by quantitative retention analysis.
3. An Empirical Rank Selection Principle. We identify a consistent empirical guideline showing that the optimal adaptation rank for compact models occurs at approximately two-thirds of the hidden dimension, providing practical guidance across different architectures.
4. Practical Deployment Feasibility. We demonstrate that H-LoRA achieves specialization performance comparable to full fine-tuning with significantly improved parameter efficiency, enabling multi-domain adaptation and reliable capability recovery for real-world edge deployments.
This section critically examines three core areas of literature that situate the contributions of our work: the evolution and limitations of parameter-efficient fine-tuning for compact models, the prevailing understanding of catastrophic forgetting, and the need for a systematic framework to navigate specialization-retention trade-offs in resource-constrained environments.
We deliberately adopt standard LoRA as the primary baseline to isolate the effect of rank selection on knowledge retention. Introducing adaptive rank allocation (e.g., Adaptive Low-Rank Adaptation (AdaLoRA)) or quantization-based methods (e.g., Quantized Low-Rank Adaptation (QLoRA)) would introduce additional confounding factors such as dynamic sparsity patterns or quantization noise, making it difficult to attribute observed effects solely to rank. Therefore, this work does not aim to outperform all Parameter-Efficient Fine-Tuning (PEFT) variants, but to answer a more fundamental question: “How does rank selection affect knowledge retention in compact models?”
2.1 Parameter-Efficient Fine-Tuning for Compact Models
Parameter-Efficient Fine-Tuning (PEFT) has emerged as a computationally efficient alternative to full fine-tuning, building upon early work in transfer learning and domain adaptation [11,12]. Representative PEFT approaches include adapter-based methods such as Houlsby et al. [13], prompt tuning variants [14,15], and selective weight updates [16]. These methods aim to adapt pre-trained models with minimal parameter modifications, as summarized in recent surveys [17–19].
Among these approaches, Low-Rank Adaptation (LoRA) has become a de facto standard due to its simplicity and zero-overhead inference. LoRA is based on the hypothesis that task-specific weight updates lie in a low-rank subspace, allowing the update matrix to be decomposed as
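As a back-of-the-envelope illustration of this factorization (the update decomposed as ΔW = B·A, with B of shape d_out × r and A of shape r × d_in), the parameter savings can be computed directly. The dimensions below are hypothetical and chosen only for illustration, not taken from the paper's models:

```python
def lora_param_counts(d_out: int, d_in: int, r: int):
    """Compare trainable parameters of a full update dW (d_out x d_in)
    against its LoRA factorization dW = B @ A, with B: d_out x r and
    A: r x d_in."""
    full = d_out * d_in          # parameters if dW is learned directly
    lora = d_out * r + r * d_in  # parameters of the two low-rank factors
    return full, lora

# Example: a 512 x 512 projection adapted at rank 8
full, lora = lora_param_counts(512, 512, 8)
print(full, lora)  # 262144 vs 8192: a 32x reduction in trainable parameters
```

The ratio shrinks linearly in r, which is exactly why rank selection (the subject of this paper) directly controls the efficiency/capacity trade-off.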
Several LoRA variants have been proposed to address different efficiency dimensions, including adaptive rank allocation (AdaLoRA; [20]), quantization-aware fine-tuning (QLoRA; [21]), weight decomposition (DoRA; [22]), vector-based random adaptation (VeRA; [23]), and learning-rate reparameterization (LoRA+; [24]). These approaches target complementary efficiency axes such as parameter budgeting, quantization, optimization stability, or decomposition strategies.
AdaLoRA dynamically redistributes a fixed parameter budget across layers based on importance estimation, focusing on budget allocation efficiency under constrained resources. QLoRA, in contrast, reduces memory consumption by applying low-bit quantization to pretrained weights while maintaining adaptation through low-rank updates.
While highly effective in their respective domains, these methods introduce additional mechanisms (e.g., dynamic sparsity patterns or quantization noise) that are orthogonal to the present study. The objective of this work is not to optimize parameter compression or allocation under a fixed budget, but to investigate a more fundamental question: How does absolute rank magnitude influence the structural preservation or destruction of pretrained knowledge in compact models?
To isolate this effect, we deliberately adopt standard LoRA as a controlled baseline. This ensures that observed differences arise from rank scaling itself rather than adaptive allocation strategies or quantization-induced regularization. Therefore, this study does not aim to outperform all PEFT variants in overall efficiency, but to uncover the mechanistic role of rank selection in governing controllable knowledge retention under edge deployment constraints.
2.2 Catastrophic Forgetting in Fine-Tuning
Catastrophic forgetting has long been recognized as a core challenge in neural network training, particularly in continual learning settings where new tasks interfere with previously acquired knowledge [7,25–27]. The phenomenon arises from distributed parameter representations, where gradient updates for new objectives can disrupt existing knowledge structures [6,28].
Prior work shows that forgetting severity is influenced by model scale, with compact models being more vulnerable to interference due to limited representational redundancy [29]. In domain adaptation, PEFT methods such as LoRA have been shown to reduce forgetting relative to full fine-tuning [9]. However, most studies quantify forgetting primarily through performance degradation metrics without distinguishing between fundamentally destructive overwriting and structurally controllable retention.
In contrast, our work emphasizes this distinction: destructive forgetting irreversibly alters pretrained representations, whereas controllable retention preserves the underlying knowledge structure and enables capability recovery. This structural perspective is particularly relevant for compact models deployed under edge constraints, where reliable recovery and update flexibility are essential.
2.3 The Specialization-Retention Trade-off: A Missing Framework
The trade-off between domain specialization and general knowledge retention is a well-documented but poorly formalized aspect of model fine-tuning. While numerous studies report varying performance-efficiency trade-offs for different PEFT methods, the field lacks a unified framework for systematic evaluation and decision-making. Practitioners are often left with ad-hoc heuristics for selecting a fine-tuning strategy, making it difficult to optimize for specific deployment scenarios.
This challenge is exacerbated in the context of Edge AI. The stringent resource constraints of edge devices [1,2] create a unique and unforgiving trade-off landscape, as comprehensively analyzed in recent surveys [4,5]. Decisions about parameter allocation, rank selection, and training strategy have direct and significant impacts on latency, memory usage, and energy consumption [30]. A method that offers a slight performance gain at the cost of a large increase in adapter size might be acceptable for a cloud deployment but entirely infeasible on a mobile device. The absence of a systematic framework for characterizing and navigating these fundamental trade-offs in domain adaptation represents a significant barrier to deploying robust and efficient specialized models at the edge.
This section presents our H-LoRA framework and comprehensive experimental methodology. We first introduce the theoretical foundation and architectural design of H-LoRA in Section 3.1, followed by the detailed experimental setup and training procedures in Section 3.2. Section 3.3 describes the three domain-specific datasets and our domain specialization approach. Section 3.4 outlines our systematic investigation of rank selection principles across different model architectures. Finally, Section 3.5 details our evaluation framework, including metrics for both domain performance and knowledge retention assessment.
Problem Formulation
We consider the problem of adapting a pretrained compact language model
3.1 H-LoRA: High-Rank Adaptation Framework
Traditional LoRA assumes that effective adaptation occurs within a low-dimensional subspace, motivated by the intrinsic dimensionality hypothesis. While this assumption is reasonable for large-scale models, it can impose a severe adaptation bottleneck in compact architectures with limited parameter capacity. Recent theoretical analyses suggest that the optimal adaptation rank depends on task complexity and model structure rather than a universal low-rank constraint.
Motivated by this insight, H-LoRA explores the high-rank regime to restore sufficient adaptation capacity while preserving the pretrained knowledge structure. To simplify deployment and avoid rank-dependent tuning, we adopt a direct scaling mechanism that maintains the additive form of LoRA without relying on complex
where
H-LoRA is applied uniformly to both attention and feed-forward layers within the Transformer architecture. Specifically, low-rank adaptation matrices
where
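A minimal sketch of such an adapted layer is given below. Because the paper's exact scaling formula is not reproduced here, we model the "direct scaling mechanism" as a single rank-independent constant `s`, and we adopt the conventional LoRA initialization (B = 0, A random) that satisfies the zero-update-at-initialization condition described in Section 3.2; both choices are assumptions, and `HLoRALinear` is an illustrative name, not the paper's implementation:

```python
import random

def matmul(X, Y):
    """Naive matrix product for small illustrative matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

class HLoRALinear:
    """Sketch of an H-LoRA-adapted linear layer with effective weight
    W + s * (B @ A). The pretrained weight W stays frozen; only A and B
    would receive gradients during fine-tuning."""

    def __init__(self, W, r, scale=1.0):
        d_out, d_in = len(W), len(W[0])
        self.W = W            # frozen pretrained weight, d_out x d_in
        self.scale = scale    # direct, rank-independent scaling (assumed form)
        # Conventional LoRA init: A random Gaussian, B all-zero,
        # so the effective update B @ A is exactly zero at the start.
        self.A = [[random.gauss(0, 0.02) for _ in range(d_in)] for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d_out)]

    def delta_w(self):
        """The additive update s * (B @ A) applied on top of W."""
        BA = matmul(self.B, self.A)
        return [[self.scale * v for v in row] for row in BA]

layer = HLoRALinear(W=[[1.0, 2.0], [3.0, 4.0]], r=2)
assert layer.delta_w() == [[0.0, 0.0], [0.0, 0.0]]  # no drift at initialization
```

In the high-rank regime explored by H-LoRA, `r` is set much larger than conventional LoRA defaults while the base weights `W` remain untouched, preserving the additive structure that makes recovery possible.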
3.2 Experimental Design and Implementation
We conduct controlled experiments on compact decoder-only Transformer models to evaluate the impact of rank selection under edge deployment constraints. Our primary model is MiniMind, a
All methods are trained under identical optimization settings to ensure fair comparison. We use the AdamW optimizer with a learning rate of
For H-LoRA, the low-rank adaptation matrices are initialized such that the effective weight update is zero at initialization:
3.3 Dataset Configuration and Domain Specialization
We evaluate H-LoRA across three heterogeneous domains to assess robustness under diverse specialization requirements: mathematical reasoning, professional knowledge, and rule-based enterprise scenarios. For mathematics, we use GSM8K, a benchmark dataset for multi-step numerical reasoning. For the medical domain, we adopt MedQA-USMLE-4-options, which evaluates domain-specific expert knowledge. In addition, we construct a domain-specific HR dataset comprising approximately
All datasets follow a unified training and evaluation protocol to ensure fair comparison across domains and architectures. This multi-domain design reduces the risk of domain-specific bias and enables a comprehensive assessment of controllable knowledge retention in edge deployment settings.
3.4 Rank Selection Investigation Methodology
This subsection outlines our systematic approach to investigating optimal rank selection, including the range of ranks tested and the methodology for identifying the empirical relationship between rank and model hidden dimension.
3.4.1 Systematic Rank Exploration Protocol
We conduct a systematic exploration of rank selection to characterize the performance landscape of H-LoRA. Specifically, we evaluate eight rank configurations,
Based on this design, we test the hypothesis that near-optimal performance emerges when the rank is set to approximately two-thirds of the hidden dimension. For the MiniMind model (
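The two-thirds hypothesis can be expressed as a one-line helper. Rounding the result to a multiple of 8 is our own assumption for hardware friendliness and is not prescribed by the paper:

```python
def suggested_rank(hidden_dim: int, fraction: float = 2 / 3, multiple: int = 8) -> int:
    """Empirical rank heuristic: roughly two-thirds of the hidden
    dimension, rounded to a hardware-friendly multiple (our assumption)."""
    raw = hidden_dim * fraction
    return max(multiple, round(raw / multiple) * multiple)

print(suggested_rank(512))  # 344 (512 * 2/3 = 341.3, rounded to a multiple of 8)
print(suggested_rank(64))   # 40
```

Such a helper would replace an exhaustive rank sweep with a single deterministic starting point per architecture, which is the practical value of the heuristic.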
3.4.2 Cross-Architecture Validation Protocol
To assess cross-architecture generalizability, we validate the identified rank heuristic on Qwen-
3.5 Evaluation Framework and Metrics
This subsection presents our comprehensive evaluation framework, including domain-specific performance metrics (precision, recall, F1-score) and knowledge retention assessment metrics (topic retention analysis) used to quantify the trade-off between specialization and forgetting.
3.5.1 Domain Specialization Performance Metrics
Primary Performance Indicators.
We evaluate domain specialization performance using standard classification and generation metrics. Specifically, we report precision (
Performance is evaluated separately for the HR, medical, and mathematics domains. We further report the mean and standard deviation across domains to assess aggregate performance and analyze cross-domain consistency.
3.5.2 Controllable Knowledge Retention Assessment
To assess controllable knowledge retention, we adopt a multi-dimensional evaluation framework designed to mitigate known biases in AI-based assessment, following best practices in experimental reporting. For each evaluation, we generate
Evaluation is conducted using Gemini 2.5 Pro as a consistent automated evaluator. Responses are scored on a five-point scale ranging from
3.5.3 Degradation Pattern Classification
To analyze knowledge degradation patterns, we define a concise taxonomy comprising topic drift, factual errors, logic errors, and other non-specific quality degradation, along with a maintained category indicating preserved or improved performance (score
Degradation categories are identified using a hybrid approach, combining keyword-based automated detection for topic drift with manual expert verification to ensure classification accuracy. To improve reliability, multiple evaluators independently assess samples, and inter-rater consistency is verified across evaluations.
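A toy version of the automated portion of this pipeline might look as follows. The keyword rule and the score threshold for the maintained category are illustrative assumptions; as noted above, the paper additionally applies manual expert verification and inter-rater checks:

```python
def classify_degradation(response: str, expected_keywords, score: int) -> str:
    """Toy classifier mirroring the taxonomy of Section 3.5.3.
    score is the 5-point quality rating; the >= 4 threshold for
    'maintained' is an assumption, not the paper's exact cutoff."""
    if score >= 4:
        return "maintained"
    tokens = set(response.lower().split())
    if not tokens & {k.lower() for k in expected_keywords}:
        return "topic_drift"        # answer no longer mentions the topic at all
    return "other_degradation"      # on-topic, but quality has dropped

assert classify_degradation("Paris is the capital of France",
                            {"capital", "France"}, score=3) == "other_degradation"
assert classify_degradation("I like turtles",
                            {"capital", "France"}, score=2) == "topic_drift"
```

Separating "off-topic" from "on-topic but degraded" is exactly what lets the analysis distinguish destructive forgetting (topic drift) from controllable retention.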
3.6 Computational Efficiency Analysis
3.6.1 Training Time Measurement Protocol
We measure computational efficiency using wall-clock training time, defined as the total duration from training start to convergence. All measurements are conducted on an identical NVIDIA RTX 3090 setup under consistent system load and thermal conditions. To reduce variance, reported results are averaged across multiple runs.
Efficiency is analyzed along three dimensions: training time scalability with respect to rank expansion, peak GPU memory usage across different rank settings, and convergence behavior measured by the number of epochs required to reach optimal performance.
3.6.2 Parameter Efficiency Calculation
Parameter efficiency is quantified by analyzing the number of trainable parameters updated during fine-tuning relative to the total model size, including frozen pretrained weights. We report the parameter ratio, defined as the percentage of parameters requiring gradient updates.
We define the efficiency score as:

E = P / ρ

where E denotes the efficiency score, P represents the downstream task precision, and ρ denotes the parameter ratio, i.e., the percentage of parameters requiring gradient updates.
To enable fair comparison across parameter-efficient methods, we compute an efficiency score defined as the ratio between the task performance metric and the parameter ratio. This normalized measure captures the trade-off between performance gains and parameter update cost.
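A minimal sketch of this computation follows. The precision value is hypothetical; the 25M trainable / 123M total parameter counts echo the figures reported for H-LoRA on the 123M-parameter model later in the paper:

```python
def efficiency_score(precision: float, trainable_params: int, total_params: int) -> float:
    """Efficiency score E = P / rho, where rho is the percentage of
    parameters that require gradient updates."""
    rho = 100.0 * trainable_params / total_params   # parameter ratio in percent
    return precision / rho

# Hypothetical precision of 94.0 with 25M of 123M parameters trainable
e = efficiency_score(94.0, 25_000_000, 123_000_000)
print(round(e, 2))  # 4.62
```

Full fine-tuning has ρ = 100, so its efficiency score collapses to P/100; any method that matches P with a smaller ρ scores proportionally higher.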
3.7 Experimental Controls and Validation
3.7.1 Controlled Comparison Setup
To ensure a fair and controlled comparison, all methods are evaluated under identical training conditions, including the same hyperparameters, hardware, and software environments. We use consistent training, validation, and test splits across methods, and apply uniform evaluation metrics and assessment protocols. All models are trained with an equal computational budget and the same number of training epochs.
We compare the proposed H-LoRA against three baselines: the unmodified pretrained base model as a reference, a standard supervised fine-tuning (SFT) baseline, and traditional low-rank LoRA using conventional rank settings.
3.7.2 Statistical Validation and Reproducibility
Reproducibility is ensured through deterministic seeding with a fixed random seed of
3.7.3 Ethical Considerations and Limitations
We adopt several measures to mitigate evaluation bias, including diverse domain coverage to reduce task-specific overfitting, automated evaluation to limit human subjectivity, and transparent reporting of all experimental procedures.
We acknowledge several limitations of this study. First, our primary evaluation focuses on compact models (
This section presents a comprehensive evaluation of H-LoRA, systematically dissecting its performance across different domain specializations, rank sensitivity, computational efficiency, and knowledge retention. The results collectively demonstrate that H-LoRA not only matches the specialization performance of Supervised Fine-Tuning (SFT) but does so with vastly superior knowledge retention and parameter efficiency, establishing it as a viable solution for edge deployment.
4.1 Cross-Domain and Cross-Architecture Performance
H-LoRA demonstrates consistently high performance across diverse domains and model architectures, confirming its robustness and generalizability. Fig. 1 provides a visual summary, while Table 1 details the comprehensive performance metrics.

Figure 1: H-LoRA cross-domain performance analysis.

Figure 2: H-LoRA rank effect analysis.

Domain-Specific Analysis
H-LoRA’s effectiveness is evident across distinct reasoning types. In the Human Resources domain, it achieves an exceptional
Cross-Architecture Validation
To test the architectural generalizability of our findings, we applied H-LoRA to Qwen-0.5B. Using the theoretically predicted rank (
Statistically, H-LoRA maintains an average precision of
4.2 The Effect of Rank: Discovery of the 2/3 Empirical Principle
Our systematic exploration of LoRA rank reveals a predictable, three-phase performance pattern, as illustrated in Fig. 2. This pattern challenges the conventional low-rank assumption and leads to the discovery of a practical empirical principle for rank selection.
Phase I: Rapid Growth (
Phase II: Diminishing Returns (
Phase III: Overfitting (
This analysis yields our most significant finding for practitioners: optimal performance is consistently achieved when the rank is approximately two-thirds of the model’s hidden dimension (
4.3 Computational Efficiency for Edge Deployment
H-LoRA is not only effective but also remarkably efficient, making it ideally suited for resource-constrained edge environments. Despite a
As shown in Table 2, H-LoRA achieves a

4.4 Distinguishing Forgetting: Controllable Retention vs. Destructive Forgetting
Our analysis reveals the fundamental mechanistic difference between H-LoRA’s and SFT’s impact on a model’s general knowledge. While both induce performance degradation on out-of-domain tasks, H-LoRA facilitates controllable knowledge retention, whereas SFT causes destructive, irreversible forgetting.
Topic Coherence as the Key Differentiator: The most striking evidence is the

Figure 3: Catastrophic forgetting analysis (100 samples). This figure provides the central evidence for our theoretical distinction, demonstrating that H-LoRA preserves knowledge structure (90% topic coherence) while SFT destructively overwrites it (1% coherence).
Fundamentally Different Degradation Patterns: Table 3 shows that SFT’s failure mode is almost exclusively topic drift (

Human Validation of Topic Drift
To verify that the observed destructive forgetting pattern is not an artifact of automated evaluation, we conducted an independent human validation study.
Two annotators independently evaluated 40 randomly sampled outputs (80 total judgments). Each response was classified into one of two categories: Topic Drift or On-topic Degradation.
Table 4 reports the aggregated results.

The findings confirm that SFT exhibits substantially higher topic drift compared to H-LoRA, consistent with the automated evaluation results. Specifically, SFT shows topic drift in 90% of cases, whereas H-LoRA reduces topic drift to 45%, representing a 45-percentage-point improvement in topical coherence.
This human assessment reinforces the conclusion that SFT primarily induces destructive forgetting, while H-LoRA preserves the underlying knowledge structure.
4.5 The Specialization-Retention Trade-off Framework
By synthesizing our findings, we can construct a comprehensive framework to visualize the trade-off landscape between domain specialization and knowledge retention.
As illustrated in Fig. 4, SFT and H-LoRA occupy distinct regions in the performance space. Both achieve the same high level of specialization (

Figure 4: Performance trade-off analysis. The framework visually confirms H-LoRA’s optimal positioning, achieving SFT-level specialization while mitigating severe forgetting, thereby placing it at the edge of the ‘Ideal Zone’.
Fig. 5 provides a practical demonstration of this framework, showing that applying H-LoRA to the general-purpose Qwen model results in performance improvement for

Figure 5: Qwen model specialization enhancement analysis.
4.6 Edge Deployment Analysis: Memory Efficiency and Operational Flexibility
H-LoRA provides quantifiable advantages for practical edge AI deployment by addressing key constraints in memory usage, storage footprint, and update efficiency.
4.6.1 Memory and Storage Efficiency
Inference memory is a primary constraint for edge hardware. As shown in Table 5, activating an H-LoRA adapter (

More significant gains emerge when considering storage and update requirements for practical-scale models. As summarized in Table 6, domain specialization via SFT requires storing a full

This dramatic reduction directly impacts over-the-air (OTA) updates in network-constrained environments, where transmitting a
4.6.2 Operational Flexibility and Deployment Cost Implications
H-LoRA’s non-destructive, additive architecture enables flexible and reliable deployment strategies that are difficult to achieve with full fine-tuning. Because the base model weights remain unchanged, unloading an adapter guarantees immediate recovery of the model’s general capabilities without empirical validation.
Furthermore, a single base model can support multiple domain-specific adapters stored on disk and dynamically loaded at runtime, enabling true multi-domain capability on resource-constrained devices with minimal VRAM overhead.
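This adapter-switching pattern can be sketched with scalar stand-ins for the weight matrices. `AdapterRuntime` and its methods are hypothetical names for illustration, not an API from the paper:

```python
class AdapterRuntime:
    """Sketch of runtime adapter switching over a shared frozen base model.
    Because H-LoRA is additive, unloading an adapter restores the base
    weights exactly; no retraining or re-validation is needed."""

    def __init__(self, base_weight):
        self.base = base_weight      # frozen scalar stand-in for W0
        self.adapters = {}           # domain -> delta (stand-in for s * B @ A)
        self.active = None

    def register(self, domain, delta):
        self.adapters[domain] = delta   # stored on disk in a real deployment

    def load(self, domain):
        self.active = domain

    def unload(self):
        self.active = None

    def effective_weight(self):
        delta = self.adapters[self.active] if self.active else 0.0
        return self.base + delta        # additive form: W0 + dW

rt = AdapterRuntime(base_weight=1.0)
rt.register("hr", 0.25)
rt.register("medical", -0.10)
rt.load("hr")
assert rt.effective_weight() == 1.25
rt.unload()
assert rt.effective_weight() == 1.0   # exact recovery of the base model
```

The same guarantee does not hold for SFT: once the base weights are overwritten, there is no stored delta to subtract, so "unloading" a specialization is impossible.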
To contextualize the practical impact of adapter size, we provide a simple transmission estimate under realistic network conditions. Assuming a conservative edge connectivity bandwidth of 5 Mbps (e.g., constrained LTE or rural deployment scenarios), transmitting a 96 MB adapter requires approximately 150–160 s. In contrast, distributing a full 1.4 GB fine-tuned model would require over 30 min under the same conditions.
This order-of-magnitude difference substantially improves the feasibility of frequent over-the-air (OTA) updates, particularly in bandwidth-limited or intermittently connected environments. While 96 MB is larger than conventional low-rank LoRA adapters, it remains within a practically deployable range for modern edge networks, whereas full-model replacement becomes operationally prohibitive.
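The transmission estimates above follow directly from the bandwidth arithmetic (megabytes converted to megabits, divided by link speed):

```python
def transfer_seconds(size_mb: float, bandwidth_mbps: float) -> float:
    """Wall-clock time to push an artifact over the air: MB -> megabits,
    then divide by link speed in Mbps (protocol overhead ignored)."""
    return size_mb * 8 / bandwidth_mbps

adapter = transfer_seconds(96, 5)             # 153.6 s, the ~150-160 s estimate
full_model = transfer_seconds(1.4 * 1024, 5)  # ~2294 s, i.e. well over 30 min
```

Real links add protocol overhead and retries, so these figures are lower bounds; the order-of-magnitude gap between the two artifacts is what matters operationally.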
Beyond memory and bandwidth efficiency, this design mitigates hidden operational costs. Unlike SFT, which often requires repeated retraining to balance specialization and retention, H-LoRA’s controllable adaptation reduces retraining cycles and simplifies deployment at scale, particularly in large fleets of edge devices with limited connectivity.
4.7 Key Findings and Implications
Our comprehensive evaluation establishes several critical findings that advance both theoretical understanding and practical deployment:
1. Performance Equivalence with Superior Efficiency: H-LoRA matches SFT’s specialization performance while requiring
2. Architectural Generalizability: The
3. Controllable vs. Destructive Forgetting: The
4. Edge Deployment Viability: With a lightweight
5. Framework for Trade-off Navigation: Our specialization-retention framework provides practitioners with quantitative tools for method selection based on deployment requirements, moving beyond ad-hoc heuristics to principled decision-making.
6. Computational Feasibility of High-Rank Adaptation: The minimal training overhead (
These results collectively demonstrate that H-LoRA successfully bridges the critical gap between high-performance domain specialization and knowledge preservation, establishing a new paradigm for parameter-efficient adaptation specifically suited for edge AI deployment scenarios where both specialization performance and operational flexibility are paramount.
This section examines the broader implications of our findings, positioning H-LoRA within the evolving landscape of parameter-efficient fine-tuning and edge AI deployment. We analyze the theoretical contributions, mechanistic insights, and practical ramifications that extend beyond the immediate experimental results to influence future research directions and deployment strategies.
We evaluate H-LoRA across diverse task types, including logical reasoning (GSM8K), domain-specific expert knowledge (MedQA), and rule-driven enterprise scenarios (Human Resources (HR)), demonstrating robustness across heterogeneous edge application settings.
5.1 Rethinking the Low-Rank Paradigm in Compact Models
Our findings challenge the conventional assumption that effective adaptation necessarily occurs in a strictly low-rank subspace. While low-rank adaptation has proven sufficient for large-scale models where even modest ranks provide ample capacity, our results indicate that this assumption becomes a limiting factor for compact models under edge constraints. In such settings, overly restrictive rank choices can severely hinder domain adaptation and exacerbate catastrophic forgetting.
Importantly, the observed
Implications for PEFT Design under Edge Constraints
The success of H-LoRA suggests that parameter efficiency should not be equated with aggressive parameter reduction. Instead, effective adaptation in compact models requires allocating sufficient capacity to preserve critical subspaces while maintaining deployment feasibility. From this perspective, rank selection becomes a controllable design parameter rather than a fixed hyperparameter.
The proposed heuristic offers a practical guideline for reducing extensive hyperparameter search in edge deployments. Future work may explore integrating such heuristics into automated or adaptive rank selection mechanisms tailored to heterogeneous edge devices.
5.2 Mechanistic Insight: Additive vs. Overwrite Adaptation
The stark contrast in knowledge retention between the two approaches stems from fundamentally different update mechanisms:
Overwrite Paradigm (SFT). Supervised fine-tuning updates all pretrained parameters directly: $\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}_{\text{domain}}(\theta_t)$.
Gradients flow through every pretrained weight, so minimizing the domain loss can directly overwrite the representations that encode general knowledge.
Additive Paradigm (H-LoRA). In contrast, H-LoRA constrains adaptation to a dedicated low-rank subspace: $W' = W_0 + \Delta W = W_0 + BA$, with $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$.
Gradients are restricted to the adapter parameters ($A$, $B$), while the pretrained weights $W_0$ remain frozen.
From a representational perspective, SFT modifies the original knowledge space itself, whereas H-LoRA augments it with an additional subspace (the column space of $BA$), leaving the pretrained representations intact and recoverable.
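The additive paradigm above can be sketched in a few lines of plain Python (toy 2×2 shapes and illustrative values, not the paper's model): the effective weight is $W_0 + BA$, and deactivating the adapter recovers $W_0$ exactly, with no destructive overwrite.

```python
def matmul(X, Y):
    """Naive matrix product for small nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def add(X, Y):
    """Elementwise sum of two matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


W0 = [[1.0, 0.0], [0.0, 1.0]]   # frozen pretrained weight (2x2)
B = [[0.5], [0.0]]              # learned adapter factor, rank 1 (2x1)
A = [[0.0, 2.0]]                # learned adapter factor, rank 1 (1x2)

# Additive adaptation: the effective weight is W0 + B @ A.
W_eff = add(W0, matmul(B, A))
# W0 itself is never written to, so dropping the adapter restores it exactly.
```

Here `W_eff` equals `[[1.0, 1.0], [0.0, 1.0]]` while `W0` is untouched; an SFT-style update would instead have overwritten `W0` in place, with no way to recover the original.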
5.3 Gradient Flow Analysis and Optimization Dynamics
The distinction between these paradigms becomes particularly clear when analyzing optimization dynamics:
Unconstrained Gradient Flow (SFT): With gradients flowing through all 123M parameters simultaneously, the optimization process faces conflicting objectives: minimizing domain-specific loss while preserving general capabilities. Such gradient conflicts are well documented in the multi-task optimization literature and typically resolve in favor of the immediate training objective, sacrificing general knowledge for task performance [31].
Constrained Gradient Flow (H-LoRA): By restricting gradients to only 25M adapter parameters, H-LoRA eliminates this optimization conflict [32,33]. The base model’s representational capacity remains untouched, while domain-specific knowledge is encoded in dedicated parameters designed specifically for this purpose [13,34].
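Constrained gradient flow can be illustrated with a toy SGD step over named parameters; the parameter names, values, and gradients below are hypothetical, chosen only to show the mechanism. Updates are applied solely to parameters in the adapter set, so base-model gradients are discarded by construction rather than by a tuned regularizer.

```python
# Toy parameter store: one frozen base weight, two trainable adapter weights.
params = {"base.W": 1.0, "adapter.A": 0.0, "adapter.B": 0.0}
trainable = {name for name in params if name.startswith("adapter.")}


def sgd_step(params, grads, lr=0.1):
    """One SGD step where only adapter parameters receive updates.

    Gradients for frozen base weights are ignored, which is the structural
    guarantee behind H-LoRA's constrained gradient flow.
    """
    return {name: (value - lr * grads[name] if name in trainable else value)
            for name, value in params.items()}


# Illustrative gradients; note the (ignored) large gradient on the base weight.
grads = {"base.W": 5.0, "adapter.A": 2.0, "adapter.B": -1.0}
params = sgd_step(params, grads)
```

After the step, `base.W` is still 1.0 despite its large gradient, while the adapter parameters have moved. In a framework such as PyTorch the same effect is typically achieved by setting `requires_grad = False` on base parameters, but the isolation principle is identical.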
5.4 Implications for Neural Network Adaptation Theory
This mechanistic understanding establishes a theoretical foundation for controllable neural adaptation where specific capabilities can be modified without disrupting the broader knowledge ecosystem within the model. The additive paradigm suggests that effective adaptation does not require access to all model parameters; instead, it requires architectural guarantees that adaptation occurs in dedicated parameter subspaces.
This insight challenges conventional assumptions in neural network fine-tuning and suggests that future adaptation methods should prioritize architectural preservation over parameter accessibility. The success of high-rank additive updates demonstrates that adaptation capacity is not fundamentally limited by rank constraints, but rather by the intelligent allocation of adaptation parameters within preserved representational structures.
Architectural Adaptation vs. Inference-Time Reflection.
Recent work has explored inference-time self-correction mechanisms, such as self-reflection and dual-loop reasoning strategies, to improve response quality without modifying model parameters. These approaches enhance reasoning reliability by iteratively critiquing and refining generated outputs during inference.
While effective for improving answer accuracy, reflection-based methods operate at the inference level and do not alter the underlying parameter structure of the model. Consequently, they do not address the structural overwriting that occurs during supervised fine-tuning, where pretrained representations may be irreversibly modified.
In contrast, the focus of this work lies in training-time architectural adaptation. H-LoRA prevents destructive forgetting by isolating domain-specific updates within dedicated parameter subspaces, thereby preserving the pretrained representational geometry. Reflection improves reasoning behavior post hoc, whereas architectural isolation governs whether foundational knowledge is structurally preserved during adaptation.
These two directions are therefore complementary rather than competing: architectural preservation ensures recoverability of general knowledge, while inference-time reflection may further enhance reasoning quality on top of a structurally preserved backbone.
Broader Implications: This framework extends beyond LoRA-based methods to inform the design of any parameter-efficient adaptation technique [15,18]. Methods that preserve foundational representational spaces while enabling targeted capability enhancement will likely achieve superior controllability and knowledge retention compared to approaches that modify pre-trained parameters directly [15,18].
5.5 Edge AI Paradigm: From Cloud-Centric to Edge-Native Design
The results of this study highlight the importance of designing adaptation mechanisms that explicitly account for the constraints of edge deployment. Unlike cloud-centric fine-tuning assumptions, edge environments require frequent updates, limited storage, and reliable fallback mechanisms.
By decoupling domain-specific adaptation from the base model, H-LoRA enables lightweight specialization while preserving a shared pretrained backbone. This design facilitates practical multi-domain deployment on resource-constrained devices, where maintaining multiple full models would be infeasible.
Moreover, the ability to activate or deactivate adapters without modifying the base model provides a robust mechanism for capability recovery. Such architectural separation is particularly valuable in autonomous or safety-critical edge scenarios, where preserving general-purpose functionality is essential.
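The activate/deactivate mechanism described above can be sketched as a forward pass with an adapter switch (toy shapes and values, assumed for illustration): with the adapter off, the model computes $W_0 x$ and exhibits its preserved general behavior; with the adapter on, it computes $(W_0 + BA)x$ for the specialized behavior.

```python
def forward(x, W0, B, A, use_adapter=True):
    """Toy linear forward pass with an optional LoRA-style adapter branch.

    use_adapter=False falls back to the frozen base path W0 @ x, which is
    the capability-recovery mechanism: the general model is always intact.
    """
    y = [sum(w * xi for w, xi in zip(row, x)) for row in W0]
    if use_adapter:
        # Low-rank branch: compute A @ x first, then add B @ (A @ x).
        Ax = [sum(a * xi for a, xi in zip(row, x)) for row in A]
        y = [yi + sum(b * axj for b, axj in zip(row, Ax))
             for yi, row in zip(y, B)]
    return y


W0 = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight
B, A = [[1.0], [0.0]], [[0.0, 3.0]]  # rank-1 adapter
x = [2.0, 1.0]

forward(x, W0, B, A, use_adapter=False)  # general path: [2.0, 1.0]
forward(x, W0, B, A, use_adapter=True)   # specialized path: [5.0, 1.0]
```

Because the switch is a pure routing decision with no weight mutation, fallback to the general model is instantaneous and exact, which is the property the text identifies as valuable in safety-critical edge scenarios.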
5.6 Broader Insights into Neural Adaptation
While this study focuses on compact models under edge constraints, the findings provide broader insight into the nature of neural adaptation. The distinction between destructive and controllable forgetting suggests that adaptation mechanisms should be evaluated not only by downstream task performance but also by their structural impact on pretrained representations.
This architectural perspective may inform research in continual learning and adaptive systems, where preserving representational geometry can be more valuable than minimizing short-term performance degradation. The results highlight that parameter isolation, rather than aggressive compression, may serve as a foundational principle for controllable model specialization.
More broadly, the findings encourage a shift from purely performance-centric evaluation toward structural analysis of knowledge preservation during adaptation.
5.7 Limitations and Future Research Directions
This study focuses on controllable adaptation in compact models under edge deployment constraints. Our empirical validation is therefore limited to models of at most one billion parameters, which represent the practical operating regime for memory- and bandwidth-constrained edge devices.
Importantly, the observed rank–retention behavior is therefore established only within this compact-model regime.
While preliminary cross-architecture validation on Qwen-0.5B suggests generality within the compact regime, extending the analysis to larger-scale models would provide valuable insight into potential scaling transitions in rank–retention dynamics and remains an important direction for future research.
In addition, this work isolates the effect of absolute rank magnitude under standard LoRA to study knowledge retention mechanisms. Integrating high-rank adaptation with quantization-aware techniques such as QLoRA represents a promising future direction to jointly optimize structural knowledge preservation and memory efficiency in practical edge deployment.
Finally, this work intentionally addresses system- and application-level adaptation mechanisms. We do not model physical-layer communication effects or long-term deployment dynamics, as our goal is to study controllable specialization and deployment efficiency rather than wireless transmission or security optimization.
This work introduced H-LoRA, a high-rank adaptation framework designed to balance domain specialization and knowledge retention for compact models under edge deployment constraints. By challenging the conventional low-rank assumption in parameter-efficient fine-tuning, we demonstrated that strategically increasing rank enables SFT-level specialization while preserving foundational knowledge structures.
We established a clear distinction between destructive overwrite (SFT) and controllable additive adaptation (H-LoRA). Empirical results show that H-LoRA maintains 90% topic coherence, an 89 percentage point improvement over SFT, while matching SFT's 99.81% precision by training only the 25M adapter parameters rather than all 123M model weights.
The proposed rank-selection heuristic further reduces hyperparameter search effort, making H-LoRA a practical foundation for controllable specialization in edge AI deployment.
Acknowledgement: Not applicable.
Funding Statement: The authors received no specific funding for this study.
Author Contributions: The authors confirm contribution to the paper as follows: Conceptualization, Darren Chai Xin Lun; Methodology, Darren Chai Xin Lun; Software, Darren Chai Xin Lun; Validation, Darren Chai Xin Lun; Formal analysis, Darren Chai Xin Lun, Lim Tong Ming; Investigation, Darren Chai Xin Lun; Data curation, Darren Chai Xin Lun; Writing—original draft preparation, Darren Chai Xin Lun; Writing—review and editing, Lim Tong Ming; Visualization, Darren Chai Xin Lun; Supervision, Lim Tong Ming; Project administration, Lim Tong Ming. All authors reviewed and approved the final version of the manuscript.
Availability of Data and Materials: The code and training scripts that support the findings of this study are openly available in GitHub at https://github.com/darrencxl0301/Lora-SLM. The Mathematics domain dataset is openly available at Hugging Face (GSM8K) at https://huggingface.co/datasets/openai/gsm8k. The Medical domain dataset is openly available at Hugging Face (MedQA-USMLE-4-options) at https://huggingface.co/datasets/GBaker/MedQA-USMLE-4-options. The Human Resources domain dataset was synthetically generated by the authors for this study and is available from the corresponding author upon reasonable request.
Ethics Approval: Not applicable.
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. Deng L, Li G, Li S, Liu L, Yu FR. Model compression and hardware acceleration for neural networks: a comprehensive survey. Proc IEEE. 2020;108(4):485–532. doi:10.1109/JPROC.2020.2976475. [Google Scholar] [CrossRef]
2. Li E, Zeng L, Zhou Z, Chen X. Edge AI: on-demand accelerating deep neural network inference via edge computing. IEEE Trans Wirel Commun. 2018;19(1):447–57. doi:10.1109/TWC.2019.2946140. [Google Scholar] [CrossRef]
3. Shi W, Cao J, Zhang Q, Li Y, Xu L. Edge computing: vision and challenges. IEEE Internet Things J. 2016;3(5):637–46. doi:10.1109/JIOT.2016.2579198. [Google Scholar] [CrossRef]
4. Wang X, Han Y, Leung VC, Niyato D, Yan X, Chen X. Convergence of edge computing and deep learning: a comprehensive survey. IEEE Commun Surv Tutor. 2020;22(2):869–904. doi:10.1109/COMST.2020.2970550. [Google Scholar] [CrossRef]
5. Zhou Z, Chen X, Li E, Zeng L, Luo K, Zhang J. Edge intelligence: paving the last mile of artificial intelligence with edge computing. Proc IEEE. 2019;107(8):1738–62. doi:10.1109/JPROC.2019.2918951. [Google Scholar] [CrossRef]
6. French RM. Catastrophic forgetting in connectionist networks. Trends Cogn Sci. 1999;3(4):128–35. doi:10.1016/S1364-6613(99)01294-2. [Google Scholar] [PubMed] [CrossRef]
7. McCloskey M, Cohen NJ. Catastrophic interference in connectionist networks: the sequential learning problem. Psychol Learn Motiv. 1989;24:109–65. doi:10.1016/S0079-7421(08)60536-8. [Google Scholar] [CrossRef]
8. Robins A. Catastrophic forgetting, rehearsal and pseudorehearsal. Conn Sci. 1995;7(2):123–46. doi:10.1080/09540099550039318. [Google Scholar] [CrossRef]
9. Hu EJ, Shen Y, Wallis P, Allen-Zhu Z, Li Y, Wang S, et al. Low-rank adaptation of large language models. arXiv:2106.09685. 2021. [Google Scholar]
10. Bengio Y, Courville A, Vincent P. Representation learning: a review and new perspectives. IEEE Trans Pattern Anal Mach Intell. 2013;35(8):1798–828. doi:10.1109/TPAMI.2013.50. [Google Scholar] [PubMed] [CrossRef]
11. Ben-David S, Blitzer J, Crammer K, Pereira F. Analysis of representations for domain adaptation. In: NIPS’06: Proceedings of the 20th International Conference on Neural Information Processing Systems; 2006 Dec 4–7; Vancouver, BC, Canada. Cambridge, MA, USA: MIT Press; 2006. p. 137–44. doi:10.5555/2976456.2976474. [Google Scholar] [CrossRef]
12. Pan SJ, Yang Q. A survey on transfer learning. IEEE Trans Knowl Data Eng. 2010;22(10):1345–59. doi:10.1109/TKDE.2009.191. [Google Scholar] [CrossRef]
13. Houlsby N, Giurgiu A, Jastrzebski S, Morrone B, De Laroussilhe Q, Gesmundo A, et al. Parameter-efficient transfer learning for NLP. In: Proceedings of the 36th International Conference on Machine Learning (ICML). London, UK: PMLR; 2019. p. 2790–9. [Google Scholar]
14. Li XL, Liang P. Prefix-tuning: optimizing continuous prompts for generation. arXiv:2101.00190. 2021. doi:10.48550/arXiv.2101.00190. [Google Scholar] [CrossRef]
15. Liu X, Zheng Y, Du Z, Ding M, Qian Y, Yang Z, et al. GPT understands, too. AI Open. 2022;3(120):40–53. doi:10.1016/j.aiopen.2022.10.001. [Google Scholar] [CrossRef]
16. Ben Zaken E, Goldberg Y, Ravfogel S. BitFit: simple parameter-efficient fine-tuning for transformer-based masked language-models. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL). Stroudsburg, PA, USA: ACL; 2022. p. 1–9. doi:10.18653/v1/2022.acl-short.1. [Google Scholar] [CrossRef]
17. Lialin V, Deshpande V, Yao X, Rumshisky A. Scaling down to scale up: a guide to parameter-efficient fine-tuning. arXiv:2303.15647. 2023. doi:10.48550/arXiv.2303.15647. [Google Scholar] [CrossRef]
18. Ding N, Qin Y, Yang G, Wei F, Yang Z, Su Y, et al. Parameter-efficient fine-tuning of large-scale pre-trained language models. Nat Mach Intell. 2023;5(3):220–35. doi:10.1038/s42256-023-00626-4. [Google Scholar] [CrossRef]
19. Liu P, Yuan W, Fu J, Jiang Z, Hayashi H, Neubig G. Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing. ACM Comput Surv. 2023;55(9):1–35. doi:10.1145/3560815. [Google Scholar] [CrossRef]
20. Zhang R, Zhang A, Dettmers T, Zettlemoyer L. AdaLoRA: adaptive budget allocation for parameter-efficient fine-tuning. arXiv:2303.10512. 2023. [Google Scholar]
21. Dettmers T, Pagnoni A, Holtzman A, Zettlemoyer L. QLoRA: efficient finetuning of quantized LLMs. arXiv:2305.14314. 2023. doi:10.48550/arXiv.2305.14314. [Google Scholar] [CrossRef]
22. Liu S, Han C, Zhao Z, Zhang Z, Zhou H. DoRA: weight-decomposed low-rank adaptation. arXiv:2402.09353. 2024. doi:10.48550/arXiv.2402.09353. [Google Scholar] [CrossRef]
23. Kopiczko D, Blankevoort T, Welling M. VeRA: vector-based random matrix adaptation. arXiv:2310.11454. 2024. [Google Scholar]
24. Hayou S, Ghosh N, Yu B. LoRA+: efficient low rank adaptation of large models. In: Proceedings of the 41st International Conference on Machine Learning (ICML). London, UK: PMLR; 2024. p. 17783–806. [Google Scholar]
25. Kirkpatrick J, Pascanu R, Rabinowitz N, Veness J, Desjardins G, Rusu AA, et al. Overcoming catastrophic forgetting in neural networks. Proc Natl Acad Sci USA. 2017;114(13):3521–6. doi:10.1073/pnas.1611835114. [Google Scholar] [PubMed] [CrossRef]
26. Luo Y, Yang Z, Meng F, Li Y, Zhou J, Zhang Y. An empirical study of catastrophic forgetting in large language models during continual fine-tuning. arXiv:2308.08747. 2023. doi:10.48550/arXiv.2308.08747. [Google Scholar] [CrossRef]
27. Parisi GI, Kemker R, Part JL, Kanan C, Wermter S. Continual lifelong learning with neural networks: a review. Neural Netw. 2019;113(2):54–71. doi:10.1016/j.neunet.2019.01.012. [Google Scholar] [PubMed] [CrossRef]
28. Goodfellow IJ, Mirza M, Xiao D, Courville A, Bengio Y. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv:1312.6211. 2013. doi:10.48550/arXiv.1312.6211. [Google Scholar] [CrossRef]
29. Ramasesh VV, Lewkowycz A, Dyer E. Effect of scale on catastrophic forgetting in neural networks. 2022 [cited 2026 Jan 23]. Available from: https://openreview.net/forum?id=GhVS8_yPeEa. [Google Scholar]
30. Xu J, Chen Z, Quach TT, Yang K, Chiang P. EdgeBERT: optimizing on-device deep NLP inference. arXiv:1907.00019. 2019. doi:10.48550/arXiv.1907.00019. [Google Scholar] [CrossRef]
31. Yu T, Kumar S, Gupta A, Hausman K, Levine S, Finn C. Gradient surgery for multi-task learning. arXiv:2001.06782. 2020. [Google Scholar]
32. He J, Zhou C, Ma X, Berg-Kirkpatrick T, Neubig G. Towards a unified view of parameter-efficient transfer learning. arXiv:2110.04366. 2022. [Google Scholar]
33. Pfeiffer J, Vulić I, Gurevych I, Ruder S. MAD-X: an adapter-based framework for multi-task cross-lingual transfer. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Stroudsburg, PA, USA: ACL; 2020. p. 7654–73. doi:10.18653/v1/2020.emnlp-main.617. [Google Scholar] [CrossRef]
34. Kalajdzievski D. A rank stabilization scaling factor for fine-tuning with LoRA. arXiv:2312.03732. 2023. doi:10.48550/arXiv.2312.03732. [Google Scholar] [CrossRef]
Copyright © 2026 The Author(s). Published by Tech Science Press. This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.