Open Access
ARTICLE
LiRA-CLIP: Training-Free Posterior-Predictive Uncertainty for Few-Shot CLIP Classification
1 School of Computer Science and Engineering, Central South University, Changsha, China
2 EIAS Data Science Lab, College of Computer and Information Sciences, Prince Sultan University, Riyadh, Saudi Arabia
* Corresponding Author: Zuping Zhang. Email:
Computers, Materials & Continua 2026, 88(1), 10 https://doi.org/10.32604/cmc.2026.077556
Received 11 December 2025; Accepted 13 February 2026; Issue published 08 May 2026
Abstract
Large Vision-Language Models (VLMs) such as Contrastive Language-Image Pretraining (CLIP) have transformed open-world image recognition. Nevertheless, few-shot classification, particularly in the extremely low-shot regime, requires not only high accuracy but also reliably calibrated uncertainty for high-confidence decisions. Existing training-free CLIP adapters are designed primarily to increase accuracy and efficiency: they integrate the zero-shot text logits with few-shot feature caches, but they do not explicitly model predictive uncertainty and therefore often exhibit considerable miscalibration and weak selective performance. Bayesian adapters move toward probabilistic modeling by placing priors over adapter parameters and employing task-specific variational training; however, this requires gradient-based optimization for every new task, increases computational cost, and becomes fragile when only one or two labeled examples per class are available. Starting from this observation, we introduce a training-free posterior-predictive Likelihood Ratio Adapter (LiRA-CLIP) for few-shot CLIP classification, which directly addresses probabilistic reliability under strict low-shot and deployment constraints. LiRA-CLIP extends the frozen CLIP head with a text-conditioned generative model in feature space that produces heavy-tailed posterior-predictive likelihood ratios, fused with the CLIP logits via a small, reliability-driven calibration layer. This layer is optimized to minimize the negative log-likelihood under an explicit accuracy side constraint, which leads to calibrated probabilities and dependable selective decisions without any gradient-based task-specific training. Extensive experiments show that LiRA-CLIP matches or slightly surpasses strong CLIP adapters in top-1 accuracy, while reducing calibration error by roughly 40%–50% and significantly increasing 95% and 99% reliable coverage in the low-shot regime, thus establishing a new state of the art in probabilistic reliability for training-free few-shot CLIP models.
Keywords
Large pre-trained Vision–Language Models (VLMs) such as CLIP [1] have provided transferable representations and a prompt-based zero-shot encoder. Web-scale variants [2] trained with noisy text supervision improve transfer under distribution shift. Adaptation methods have been applied in few-shot image classification to improve over zero-shot prompting, but the reported gains can depend strongly on task-specific model selection and may collapse under distribution shift in low-data settings [3]. Most adaptation methods are gradient-based, including prompt-learning methods that optimize continuous context tokens while keeping the CLIP backbone frozen [4,5], and lightweight tuning methods that optimize a minimal set of task parameters on top of frozen CLIP encoders [6]. In addition, feature-adapter approaches introduce bottleneck modules downstream of the encoders, where these modules are the only parameters updated during training so that the CLIP backbone remains frozen [7]. To bridge domain shifts and exploit limited labeled data, few-shot adapters refine or extend CLIP with limited supervision [8,9]. Training-free caching approaches such as Tip-Adapter [10] combine few-shot visual features with text prompts to obtain robust performance gains. Subsequent nonparametric and prototype-based adapters show that cache-like adaptation can rival or surpass traditional fine-tuning under strict time and compute budgets [11,12], and kernel-based analyses provide theoretical grounding and starting points for further improvements [13]. Recent methods have explored multimodal fusion, attention mechanisms, and prototype mechanisms to improve generalization and robustness [14,15].
Despite these advances, current few-shot CLIP adapters exhibit a gap between accuracy and probabilistic reliability, in particular in the extreme low-shot regime. Training-free cache and prototype adapters [8,13] are optimized for discriminative performance. They reweight zero-shot text logits or fuse them with support-set similarities of feature representations, but they treat these scores as deterministic functionals in the embedding space and do not model predictive uncertainty in a probabilistic manner. As a consequence, they often exhibit substantial miscalibration and weak performance under selective classification metrics, even when their top-1 accuracy is high. Few-shot OOD detectors generally target detection metrics instead of calibrated in-domain confidence [16]. Although meta-learned online adapters [9] avoid task-specific fine-tuning, they still rely on extensive offline training and are optimized for discriminative accuracy rather than calibrated uncertainty.
By contrast, Bayesian adapters such as BayesAdapter [17] do introduce probabilistic structure; however, they place priors over adapter parameters and require gradient-based, task-specific training by means of variational inference. This improves the quality of uncertainty estimation in moderate data regimes, but it increases adaptation time and reportedly becomes brittle or unstable in the extreme low-shot regime, especially when only one or two labeled examples per class are available. To the best of our knowledge, there currently exists no training-free CLIP adapter that, under a strict few-shot protocol with only a handful of labeled examples per class, simultaneously performs adaptation in closed form without task-specific gradient-based optimization, explicitly targets calibrated probabilities and high-confidence selective metrics, and operates under strict latency and memory constraints suitable for deployment. Existing training-free adapters prioritize discriminative accuracy without a well-founded probabilistic uncertainty model, whereas Bayesian adapters trade training-free adaptation for task-specific variational optimization.
We address this gap by proposing LiRA-CLIP, a training-free posterior-predictive Likelihood Ratio Adapter for few-shot CLIP classification. LiRA-CLIP extends the frozen CLIP classifier with a text-conditioned Bayesian generative model over background-whitened image features, whose posterior-predictive likelihood ratios with respect to a pooled background define a single heavy-tailed generative evidence stream. This generative stream is fused with the CLIP logits through a unified, reliability-driven mechanism, consisting of a margin-based confidence gate and a lightweight calibration layer, which together act as a two-stream probabilistic adapter and, in a fully training-free, deterministic setting, produce calibrated probabilities and reliable high-confidence selective decisions. Across six standard benchmarks for few-shot adaptation and two CLIP backbones, our training-free, text-conditioned Bayesian decision rule consistently improves calibration metrics (ECE and AECE) and high-confidence selective coverage at 95% and 99% target accuracy in the low-shot regime relative to plain CLIP, fine-tuned, and trained Bayesian adapters. As the budget of labeled data grows, LiRA-CLIP remains competitive in top-1 accuracy on all benchmarks while preserving its reliability advantages. These gains persist across a broad range of recent state-of-the-art adapters and support posterior-predictive likelihood-ratio fusion as a well-founded and practical path toward trustworthy few-shot CLIP adaptation.
Our contributions can be summarized as follows:
• We introduce LiRA-CLIP, a training-free CLIP adapter that extends CLIP with a text-conditioned, posterior-predictive generative model over whitened image features. This model produces a reliable heavy-tailed Student-t likelihood-ratio (t-PLLR) stream, specifically in the extreme low-shot regime, and can be fused with CLIP without any gradient-based, task-specific optimization.
• We formulate few-shot CLIP adaptation as a single reliability-driven calibration problem in a compact probabilistic adapter that unifies stream-specific temperatures and a global generative fusion coefficient, leading to calibrated probabilities and reliable selective decisions at 95% and 99% target accuracy while retaining competitive top-1 accuracy.
• Through extensive experiments on six few-shot benchmarks with two CLIP backbones, we show that LiRA-CLIP consistently improves probabilistic reliability, with clearly reduced calibration errors (ECE and AECE) and higher reliable coverage at 95% and 99% in the low-shot regime, while matching or closely trailing the best existing adapters in top-1 accuracy. Additionally, targeted ablation studies on the prior hyperparameters and a lightly fine-tuned variant (LiRA-CLIP-F) show that these reliability gains are robust and that the fully training-free formulation is particularly advantageous under extreme data scarcity.
Few-shot adaptation to new classes is commonly categorized into three groups: prompt learning, which optimizes continuous tokens while the CLIP encoders remain frozen [4,5]; adapters in the embedding space, which connect zero-shot and few-shot evidence with small residual modules [3,6,7]; and training-free and key-value cache models [10,13]. Although gradient-based prompt learning has demonstrated strong gains [5], its reliance on gradient-based optimization constrains its use in scenarios in which CLIP is available only as a frozen, forward-only (black-box) feature extractor. Adapter-based strategies instead operate in the CLIP embedding space, either by fine-tuning a lightweight linear head or a shallow MLP on frozen features [3,17,18], or by relying on training-free key-value caches initialized from the support set [8,10]. LiRA-CLIP belongs to the family of adapter-based procedures, but it is fully training-free and designed for a protocol with inaccessible weights and purely forward inference, in which both vision and text encoders of CLIP remain frozen, no gradients are computed, and adaptation proceeds entirely via a posterior-predictive decision rule and reliability-driven calibration on frozen features.
Cache-based CLIP adaptation combines support-set similarities with zero-shot text logits, as in Tip-Adapter and its variants [8,10,13]. APE [8] refines such cache priors by analyzing inter-class discrepancies and leveraging the three-way interaction between image, cache, and text; it offers both a training-free variant (APE) and a lightweight trained variant (APE-T) for high accuracy with few parameters. ProKeR [13] formalizes Tip-Adapter as a Nadaraya-Watson local estimator and highlights the benefit of global information via a proximal RKHS regularizer that is solved in closed form, leading to significant performance improvements. These cache-based methods, however, treat cache scores as deterministic similarity functionals in feature space and do not directly target calibrated uncertainty or selective decisions in the extreme low-data regime. Building on the low-overhead cache logic, LiRA-CLIP replaces deterministic cache weightings with text-anchored Bayesian posterior-predictive scoring in a background-whitened CLIP space, and uses a pooled Student-t background model to form posterior-predictive likelihood ratios.
2.3 Bayesian Uncertainty and Calibration in CLIP Adapters
Recent work investigates uncertainty and calibration in VLM adaptation methods [19–21]. BayesAdapter [17] in particular shows that a strong adapter can be interpreted as a maximum a posteriori solution in a probabilistic framework, and that the transition from a point estimate to a Bayesian posterior over adapter parameters improves calibration and selective classification. These Bayesian approaches for few-shot learning employ priors over parameters and refine them through task-specific training. Yet they have limitations, specifically in the extreme low-shot regime, where their advantages diminish or disappear in the 1-shot setting, and they also demand considerable optimization cost.
LiRA-CLIP is complementary to this Bayesian parameter-space category, but instead of placing priors over adapter weights and updating them through task-specific training, it performs training-free Bayesian modeling directly in feature space. It uses a text-conditioned generative model over background-whitened CLIP representations to provide a posterior-predictive likelihood-ratio score with respect to a pooled background, and then fuses this score with the CLIP logits via a small, accuracy-protected calibration layer. This leads to calibrated probabilities and reliable high-confidence selective decisions in a fully training-free, posterior-predictive setting that, to our knowledge, is not directly addressed by existing CLIP adaptation methods.
Our study focuses on the gap between accuracy and probabilistic reliability in the extreme low-shot regime for few-shot CLIP classification, with frozen encoders and deployment constraints. We target calibrated point probabilities and high-confidence, accuracy-constrained selective classification using a training-free adapter that extends the frozen CLIP head without gradient-based, task-specific training.
We study training-free few-shot CLIP classification with frozen encoders. LiRA-CLIP performs closed-form posterior-predictive updates in feature space and applies a fixed two-stream fusion rule, producing class probabilities over fused logits (Algorithm 1, Eqs. (20) and (21)). We select the fusion parameters by reliability-driven global calibration on an auxiliary task pool and reuse them unchanged across tasks (Section 3.9, Eq. (19)). We evaluate point-probability reliability using Expected Calibration Error (ECE) and Adaptive ECE (AECE), and we assess selective classification by measuring coverage under accuracy constraints at 95% (Sel@95) and 99% (Sel@99) target accuracy, as defined in Section 4 (Tables 1 and 2). The baseline methods are restricted to CLIP adapters evaluated under the same strict few-shot protocol with frozen CLIP encoders and logits-to-probabilities outputs, making them directly comparable (Section 4). Other uncertainty quantification approaches target different objectives: conformal prediction [22] outputs prediction sets with coverage guarantees rather than point-probability calibration, and related probabilistic extensions of frozen VLMs study uncertainty under different downstream tasks or uncertainty representations, including generalized few-shot semantic segmentation [23] and post-hoc probabilistic embeddings via GPLVM [24]. These methods are complementary and not directly comparable to our point-probability calibration (ECE) and selective classification (Sel@99) results under the evaluation protocol in Section 4.

2.4 LiRA-CLIP’s Key Distinction from Prior Work
We distinguish LiRA-CLIP from prior probabilistic VLM adaptation not by the presence of uncertainty modeling but by where uncertainty is represented and how it is computed at deployment. BayesAdapter [17] models uncertainty in parameter space, placing priors over adapter parameters and performing Bayesian posterior inference via task-time optimization such as variational updates, rather than closed-form prediction. ProbVLM [20] instead learns a probabilistic embedding adapter, trained offline via gradient-based optimization, to produce output distributions over frozen VLM embeddings. By contrast, LiRA-CLIP is training-free and performs closed-form posterior-predictive inference directly in the frozen CLIP feature space at task time, with no gradient-based updates. Specifically, given only support-set sufficient statistics in a background-whitened representation, Eq. (2), and a text-conditioned diagonal NIG prior, Eqs. (4) and (5), it obtains a Student-t posterior-predictive model in closed form.
In this section, we present LiRA-CLIP, a training-free posterior-predictive likelihood-ratio adapter for CLIP. LiRA-CLIP couples a text-conditioned Student-t posterior-predictive generative stream with the frozen CLIP discriminative stream.

Figure 1: Overview of LiRA-CLIP framework, a training-free posterior-predictive likelihood-ratio adapter for few-shot CLIP classification. Fig. 1a, a frozen CLIP image encoder
We consider an N-way few-shot classification task with a class set
where
Whitening is a technique in signal and image processing, used to transform correlated background or clutter into approximately white noise so as to enhance target detectability in challenging environments such as in SAR speckle whitening [25]. Related ideas appear in deep networks through switchable whitening and normalization layers, which decorrelate features and improve optimization stability across tasks [26]. In LiRA-CLIP we adopt a lightweight variant at the representation level. We standardize all features using the pooled support statistics
Eq. (2) performs pooled-statistics, per-dimension standardization using
To express the text-conditioned generative model in the same background-whitened coordinates as
We have used
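To make the whitening step concrete, the following is a minimal numpy sketch of pooled-statistics, per-dimension standardization as described above; the function name, the use of the pooled support mean and standard deviation, and the epsilon stabilizer are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def whiten_features(support_feats, query_feats, text_embeds, eps=1e-6):
    """Per-dimension standardization with pooled support statistics (hypothetical
    sketch of Eq. (2)): image and text features are mapped into the same
    background-whitened coordinates."""
    mu = support_feats.mean(axis=0)            # pooled support mean, shape (D,)
    sigma = support_feats.std(axis=0) + eps    # pooled per-dimension std, shape (D,)
    whiten = lambda x: (x - mu) / sigma
    return whiten(support_feats), whiten(query_feats), whiten(text_embeds)
```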
3.3 Text Conditioned Diagonal NIG Class Priors
The Normal Inverse Gamma (NIG) prior is a standard conjugate prior for the mean and variance of a normal likelihood and underpins a wide range of Bayesian regression and hierarchical models [27,28]. Integrating out the latent mean and variance yields Student-t predictive distributions.
with positive hyperparameters
We use the same class embedding that defines CLIP's frozen zero-shot linear head, Eq. (12). Since CLIP defines the image feature
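As an illustration of the closed-form update implied by a diagonal NIG prior, the sketch below applies the textbook conjugate formulas per feature dimension, anchoring the prior mean at the whitened class text embedding as described in this section; the hyperparameter defaults and the function name are assumptions, not the paper's settings.

```python
import numpy as np

def nig_posterior_predictive(support_w, text_w, kappa0=1.0, alpha0=2.0, beta0=1.0):
    """Standard diagonal Normal-Inverse-Gamma conjugate update for one class.
    support_w: (K, D) whitened support features; text_w: (D,) whitened text
    embedding used as the prior mean. Returns per-dimension Student-t
    predictive parameters (df, loc, scale)."""
    n = support_w.shape[0]
    xbar = support_w.mean(axis=0)
    S = ((support_w - xbar) ** 2).sum(axis=0)             # within-class scatter
    kappa_n = kappa0 + n
    m_n = (kappa0 * text_w + n * xbar) / kappa_n           # text-conditioned posterior mean
    alpha_n = alpha0 + n / 2.0
    beta_n = beta0 + 0.5 * S + kappa0 * n * (xbar - text_w) ** 2 / (2.0 * kappa_n)
    df = 2.0 * alpha_n                                     # predictive degrees of freedom
    scale = np.sqrt(beta_n * (kappa_n + 1.0) / (alpha_n * kappa_n))
    return df, m_n, scale
```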
3.4 Student-t Posterior-Predictive Likelihood Ratios
Log-likelihood ratios are central to statistical decision theory, hypothesis testing, and modern likelihood-based machine learning [30,31]. LiRA-CLIP leverages this principle in a posterior-predictive setting. For each class, we form the log-ratio between its Student-
where
We denote the resulting generative log-likelihood by
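A possible implementation of the per-class posterior-predictive log-likelihood ratio follows, assuming a diagonal (per-dimension independent) Student-t predictive for both the class and the pooled background; the function name and input conventions are illustrative.

```python
from scipy.stats import t as student_t

def t_pllr(query_w, class_pred, bg_pred):
    """Log-likelihood ratio of whitened query features under a class Student-t
    predictive versus the pooled background predictive (sketch of Section 3.4).
    query_w: (M, D); class_pred / bg_pred: (df, loc, scale) per-dimension tuples."""
    df_c, loc_c, scale_c = class_pred
    df_b, loc_b, scale_b = bg_pred
    log_c = student_t.logpdf(query_w, df=df_c, loc=loc_c, scale=scale_c).sum(axis=-1)
    log_b = student_t.logpdf(query_w, df=df_b, loc=loc_b, scale=scale_b).sum(axis=-1)
    return log_c - log_b
```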
3.5 Background Predictive Model
We construct a background Student-
corresponding to the whitened mean of the class text embeddings, and take a unit background variance
with degrees of freedom
To stabilize the scale of generative scores across images, we apply per-sample
These standardized posterior-predictive LLRs constitute the generative stream used by the adapter.
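The per-sample standardization mentioned above could be realized as a simple z-scoring of each sample's N-class LLR vector; this particular form is an assumption made for illustration.

```python
import numpy as np

def standardize_llr(llr, eps=1e-6):
    """Per-sample standardization of the (M, N) LLR matrix so generative scores
    share a comparable scale across images (assumed z-scoring over classes)."""
    return (llr - llr.mean(axis=-1, keepdims=True)) / (llr.std(axis=-1, keepdims=True) + eps)
```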
3.6 CLIP Discriminative Stream
The CLIP zero-shot classifier provides the discriminative stream via a linear head whose weights are given by the text embeddings
where
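For reference, the discriminative stream is the standard CLIP zero-shot head: cosine similarity between L2-normalized image features and class text embeddings, scaled by CLIP's logit scale. A minimal sketch follows (the default logit-scale value is an assumption).

```python
import numpy as np

def clip_zero_shot_logits(image_feats, text_embeds, logit_scale=100.0):
    """Frozen CLIP zero-shot logits: scaled cosine similarity between normalized
    image features (M, D) and class text embeddings (N, D)."""
    img = image_feats / np.linalg.norm(image_feats, axis=-1, keepdims=True)
    txt = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)
    return logit_scale * img @ txt.T
```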
3.7 Margin Based Confidence Gate
To modulate the contribution of the generative stream, we construct a scalar confidence gate from the CLIP logits. Let
The margin is large and positive when CLIP is confident and small when it is ambiguous.
We map this margin to a raw gate via a squashing nonlinearity:
where
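The gate can be sketched as follows, taking the gap between the two largest CLIP logits and squashing it into (0, 1); the choice of tanh as the squashing nonlinearity and the slope parameter are assumptions made for illustration.

```python
import numpy as np

def confidence_gate(clip_logits, slope=1.0):
    """Margin-based confidence gate (sketch of Section 3.7): large when CLIP's
    top-1 logit clearly dominates the runner-up, small when CLIP is ambiguous."""
    sorted_logits = np.sort(clip_logits, axis=-1)
    margin = sorted_logits[..., -1] - sorted_logits[..., -2]   # top-1 minus top-2
    return np.tanh(slope * margin)                             # raw gate in [0, 1)
```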
For each sample, LiRA-CLIP fuses the CLIP discriminative stream with the standardized generative LLR stream. Let
where
and then we calculate class probabilities with:
When
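A minimal sketch of the two-stream fusion and the softmax over fused logits is given below; the functional form (temperature-scaled CLIP logits plus a gated, weighted generative stream) follows the description in this section, but the exact parameterization is an assumption rather than the paper's Eqs. (20) and (21) verbatim.

```python
import numpy as np

def fuse_and_predict(clip_logits, llr_std, gate, tau_d=1.0, tau_g=1.0, lam=1.0):
    """Fuse CLIP logits (M, N) with standardized generative LLRs (M, N) using a
    per-sample gate (M,), then convert fused logits to class probabilities."""
    fused = clip_logits / tau_d + lam * gate[..., None] * (llr_std / tau_g)
    fused = fused - fused.max(axis=-1, keepdims=True)   # numerical stability
    probs = np.exp(fused)
    return probs / probs.sum(axis=-1, keepdims=True)
```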
3.9 Reliability-Driven Global Calibration
We define LiRA-CLIP as the solution of a single reliability-driven calibration problem over
Auxiliary Calibration Pool Composition
We built the auxiliary pool from five datasets that span different visual domains to reduce the risk of overfitting to specific task characteristics: Caltech101 (generic objects), DTD (textures), FGVC-Aircraft (fine-grained), OxfordPets (fine-grained), and UCF101 (actions). For each dataset we performed a fixed, class-disjoint split into calibration classes and evaluation classes (80/20), and we sample auxiliary few-shot tasks exclusively from the calibration classes.
where
Formally, we select
where
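The reliability-driven calibration can be approximated by a coarse grid search over the fusion parameters that minimizes the negative log-likelihood on the auxiliary pool subject to an accuracy guard relative to the CLIP-only baseline; the grid values, the guard margin, and the reuse of fuse_and_predict from the earlier sketch are illustrative assumptions.

```python
import numpy as np
from itertools import product

def calibrate_fusion(clip_logits, llr_std, gate, labels, acc_guard=0.0,
                     grid=(0.5, 1.0, 2.0, 4.0)):
    """Pick (tau_d, tau_g, lam) minimizing NLL on the auxiliary pool, rejecting
    settings whose accuracy drops below the CLIP-only accuracy minus acc_guard."""
    base_acc = (clip_logits.argmax(-1) == labels).mean()
    best, best_nll = None, np.inf
    for tau_d, tau_g, lam in product(grid, grid, grid):
        probs = fuse_and_predict(clip_logits, llr_std, gate, tau_d, tau_g, lam)
        if (probs.argmax(-1) == labels).mean() < base_acc - acc_guard:
            continue                                      # accuracy guard violated
        nll = -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()
        if nll < best_nll:
            best, best_nll = (tau_d, tau_g, lam), nll
    return best
```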
3.10 Test-Time Prediction with Frozen Fusion Parameters
Algorithm 1 describes how LiRA-CLIP adapts itself to a new few-shot task exclusively by means of closed-form computations and a globally calibrated, frozen fusion rule. Given the cached streams and the gate, the final phase applies the globally calibrated fusion parameters
and then, converts these fused logits by a softmax into class probabilities
where the vector
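Chaining the sketches above, a toy end-to-end pass for one new task might look as follows. Synthetic random features stand in for frozen CLIP outputs, the pooled-background parameters and fusion values are placeholders, and in the actual method the fusion parameters come from the frozen global calibration rather than being set by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, D = 5, 2, 512                                    # 5-way, 2-shot, CLIP dim 512
support = rng.normal(size=(N * K, D)); labels_s = np.repeat(np.arange(N), K)
query, text = rng.normal(size=(20, D)), rng.normal(size=(N, D))

sup_w, qry_w, txt_w = whiten_features(support, query, text)
class_preds = [nig_posterior_predictive(sup_w[labels_s == c], txt_w[c]) for c in range(N)]
bg_pred = (np.full(D, 4.0), txt_w.mean(axis=0), np.ones(D))   # placeholder background (df, loc, scale)
llr = np.stack([t_pllr(qry_w, class_preds[c], bg_pred) for c in range(N)], axis=-1)
logits = clip_zero_shot_logits(query, text)
probs = fuse_and_predict(logits, standardize_llr(llr), confidence_gate(logits),
                         tau_d=1.0, tau_g=1.0, lam=0.5)       # frozen fusion parameters
```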

In line with earlier work on CLIP adapters [6,7,10,17], we evaluate on six established vision benchmarks: Oxford Pets [32], Caltech101 [33], FGVC-Aircraft [34], DTD [35], EuroSAT [36], and UCF101 [37]. Taken together, these datasets cover generic object recognition, fine-grained categories, textures, and remote sensing. We adopt the strict few-shot adaptation protocol from [3,6,17]: for each task and each class, we draw K labeled support examples uniformly from the training split, with
We compare LiRA-CLIP with eight recent CLIP adapters: Linear Probing (LP) [1], TIP-Adapter and TIP-Adapter-f [10], TaskRes [6], CrossModal [38], BayesAdapter [17], LP++ [18], and CLAP [3]. It is noteworthy that all baseline results reported in this work are taken from [17].
We compute CLIP features with two common visual encoders, ResNet-50 [39] and ViT-B/16 [40]; unless indicated otherwise, ablation studies are carried out on ResNet-50. During feature extraction we apply random zoom, crop, and horizontal flip augmentations, following [3,6,17], and we reuse the same text prompt templates. For the fine-tuning ablation LiRA-CLIP-F, in which only the fusion parameters are updated while CLIP and the generative head remain frozen, we adopt the training configuration from [17]: 300 epochs, batch size 256, and SGD with momentum 0.9 and a learning rate of 0.1. All experiments for LiRA-CLIP and LiRA-CLIP-F are averaged over three random seeds. We report the mean performance in the main text, while standard errors and dataset-specific results are provided in the appendices.
4.2 Analysis of the Experimental Results
We first investigate how well confidence values reflect actual correctness. In accordance with common practice, we report the Expected Calibration Error (ECE) [17,41] as well as its adaptive variant AECE, which reduces the bias that arises when standard ECE is influenced by bins with insufficient or zero sample size [17,42]. ECE partitions the predictions into B confidence bins and averages the absolute difference between empirical accuracy and mean confidence in each bin; we use
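For concreteness, a standard equal-width-bin ECE implementation is sketched below; the bin count of 15 is a common default and is only an assumption here.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """ECE: average |accuracy - confidence| over equal-width confidence bins,
    weighted by the fraction of samples falling in each bin."""
    conf = probs.max(axis=-1)
    correct = (probs.argmax(axis=-1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece
```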
4.2.2 Overall Accuracy and Calibration
We report means over six datasets and six shot regimes (
4.2.3 Selective Classification at High Confidence
We evaluate selective classification under high-confidence conditions, a central requirement in safety-critical deployment scenarios [17,43]. Given a confidence threshold
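One common way to compute coverage under an accuracy constraint (Sel@95 / Sel@99) is to retain samples in order of decreasing confidence and report the largest retained fraction whose accuracy still meets the target; this operationalization is a sketch and may differ in detail from the protocol in Section 4.

```python
import numpy as np

def reliable_coverage(probs, labels, target_acc=0.95):
    """Largest fraction of samples (most confident first) whose accuracy on the
    retained subset is at least target_acc."""
    conf = probs.max(axis=-1)
    correct = (probs.argmax(axis=-1) == labels).astype(float)
    order = np.argsort(-conf)                                   # most confident first
    running_acc = np.cumsum(correct[order]) / np.arange(1, len(labels) + 1)
    valid = np.where(running_acc >= target_acc)[0]
    return 0.0 if len(valid) == 0 else (valid[-1] + 1) / len(labels)
```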
4.3 Per-Dataset Selective Classification
In the low-shot regime (

Figure 2: 99%-reliable coverage (%) on ResNet-50 across low-shot regimes (
Our main method, LiRA-CLIP, is completely training-free. To assess whether the same posterior-predictive fusion architecture remains effective under limited supervised fine-tuning, we additionally consider LiRA-CLIP-F, which fine-tunes exclusively the three scalar fusion parameters by gradient descent, while all CLIP and generative-model parameters remain frozen. Table 3 reports ablation results on EuroSAT. The training-free LiRA-CLIP already attains accuracy comparable to or slightly higher than recent adapters (64.9% vs. 64.7% for TaskRes and 64.6% for BayesAdapter) and at the same time reduces ECE from 4.8–10.6 to 2.9. Permitting mild fine-tuning of the fusion parameters (LiRA-CLIP-F) further increases accuracy to 69.1%, while maintaining competitive calibration (ECE 4.5%). This suggests that posterior-predictive likelihood-ratio fusion remains effective even when it is deployed as a small trainable adapter. In the extreme low-shot regime (1–2 shots), however, the labeled support set provides only very limited information to reliably re-estimate even a small number of fusion parameters. In this setting, task-specific fine-tuning of

We conduct another ablation to assess the sensitivity of LiRA-CLIP to the normal-inverse-gamma prior hyperparameters

4.4.1 Computational Complexity and Runtime
To analyze computational complexity, let N be the number of classes (ways), K the number of shots per class, D the CLIP feature dimension, and
4.4.2 Paired Non-Parametric Significance Analysis
The main objective of LiRA-CLIP is to improve probabilistic reliability under extreme low-shot and deployment constraints while preserving few-shot accuracy. We evaluated its statistical stability at the level of dataset
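The paired non-parametric comparison can be carried out with a Wilcoxon signed-rank test over matched (dataset, shot) settings, for example via scipy; the numbers below are purely illustrative placeholders and are not results from the paper.

```python
import numpy as np
from scipy.stats import wilcoxon

# Paired per-setting scores (e.g., ECE) for LiRA-CLIP and a baseline adapter.
# Values are illustrative placeholders, not the paper's measurements.
lira_scores = np.array([2.9, 3.1, 2.5, 4.0, 3.3, 2.8])
baseline_scores = np.array([5.2, 6.0, 4.8, 7.1, 5.5, 4.9])
stat, p_value = wilcoxon(lira_scores, baseline_scores)
print(f"Wilcoxon statistic = {stat:.3f}, p-value = {p_value:.4f}")
```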
4.4.3 Implementation and Measured Cost
LiRA-CLIP performs no task-time optimization (its training time is zero). We therefore report in Table 5 its test-time inference cost as average latency per image (ms/img): the adapter-only cost, which excludes CLIP image encoding and captures only the closed-form scoring and fusion stage, and the end-to-end latency, which includes CLIP encoding plus the evaluation stage. For BayesAdapter we report the average adaptation time over three independent seeds (in seconds). All LiRA-CLIP experiments were run on an NVIDIA Tesla T4 GPU (16 GB VRAM) with PyTorch 2.9.0, CUDA 12.6, and cuDNN 9.1.

4.4.4 LiRA-CLIP Seed-Sensitivity Analysis
We verified the sensitivity of LiRA-CLIP results to the choice of a three-seed protocol by conducting a robustness check on Caltech101 (ResNet50) using five random seeds. The five-seed estimates (mean

4.4.5 Sensitivity to the Accuracy Guard
The reliability-driven calibration in Eq. (19) includes an accuracy-guard constraint

In this article, we introduced LiRA-CLIP, a training-free, text-conditioned, posterior-predictive likelihood-ratio adapter for few-shot CLIP classification, designed to improve probabilistic reliability in the extremely low-shot regime. LiRA-CLIP operates in a background-whitened CLIP feature space and places diagonal Normal Inverse Gamma priors over both the class-conditional distributions and a pooled background, leading to a Student-t posterior-predictive likelihood-ratio stream (t-PLLR) that captures heavy distributional outliers and data scarcity. A two-stream fusion mechanism combines the generative t-PLLR scores with the zero-shot CLIP logits through a lightweight global calibration layer, without gradient-based, task-specific tuning and without access to the CLIP weights. For a new few-shot task, LiRA-CLIP performs training-free adaptation via closed-form posterior-predictive updates and evaluation of a frozen fusion rule. Experimental results on six standard benchmarks confirm the robust adaptability and reliability of LiRA-CLIP; notably, ablation studies validate that these reliability gains stem from the proposed posterior-predictive likelihood-ratio architecture rather than from brittle hyperparameter tuning, establishing LiRA-CLIP as a simple, efficient route to training-free, reliably calibrated few-shot CLIP adaptation. By design, LiRA-CLIP trades a small amount of high-shot accuracy for substantially improved low-shot reliability, making it the preferable choice when labels are scarce and complementary to fully fine-tuned adapters in data-rich regimes. For future work, we will explore lightweight task-aware refinements and richer generative components to narrow the remaining high-shot gap while preserving the method's training-free nature.
Acknowledgement: The authors wish to express their gratitude to Prince Sultan University for their support.
Funding Statement: This research was funded by the National Natural Science Foundation of China, grant numbers U23A20321 and 62272490. The authors would also like to thank Prince Sultan University for paying the APC of this article.
Author Contributions: The authors confirm contribution to the paper as follows: Conceptualization, Mustafa Qaid Khamisi, Zuping Zhang and Mohammed Al-Habib; methodology, Mustafa Qaid Khamisi; software, Mustafa Qaid Khamisi and Mohammed Al-Habib; validation, Mustafa Qaid Khamisi, Zuping Zhang and Mohammed Al-Habib; formal analysis, Mustafa Qaid Khamisi and Zuping Zhang; investigation, Mustafa Qaid Khamisi, Zuping Zhang and Mohammed Al-Habib; resources, Mustafa Qaid Khamisi and Zuping Zhang; data curation, Mustafa Qaid Khamisi and Zuping Zhang; writing—original draft preparation, Mustafa Qaid Khamisi; writing—review and editing, Mustafa Qaid Khamisi, Zuping Zhang, Mohammed Al-Habib, Muhammad Asim and Sajid Shah; visualization, Mustafa Qaid Khamisi and Mohammed Al-Habib; supervision, Zuping Zhang; project administration, Mustafa Qaid Khamisi and Zuping Zhang; funding acquisition, Muhammad Asim and Sajid Shah. All authors reviewed and approved the final version of the manuscript.
Availability of Data and Materials: The data that support the findings of this study are available from the Corresponding Author, Zuping Zhang, upon reasonable request.
Ethics Approval: Not applicable.
Conflicts of Interest: The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| CLIP | Contrastive Language–Image Pre-Training |
| LiRA | Likelihood Ratio Adapter |
| t-PLLR | Student-t Posterior-Predictive Likelihood Ratio |
| NLL | Negative Log-Likelihood |
| ECE | Expected Calibration Error |
| AECE | Adaptive Expected Calibration Error |
| NIG | Normal Inverse Gamma |
Appendix A Detailed Numerical Values for the Reported Metrics in Manuscript
Appendix A.1 Per-Dataset AECE (%, Lower Is Better) on ResNet50. Results Are Reported over Three Random Seeds:

Appendix A.2 Detailed ECE Results of LiRA-CLIP in Comparison with Baseline Methods:

Appendix A.3 Detailed Accuracy Results of LiRA-CLIP in Comparison with Baseline Methods:

Appendix A.4 99%-Reliable Prediction of LiRA-CLIP in Comparison with Baseline Methods:

Appendix B Detailed Numerical Values for Ablation Study
Appendix B.1 Ablation Results for LiRA-CLIP-F in Comparison with Baseline Methods:

Appendix C Statistical Significance and Robustness Analyses
Appendix C.1 Paired Non-Parametric Significance Test:

References
1. Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, et al. Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. London, UK: PMLR; 2021. p. 8748–63. [Google Scholar]
2. Jia C, Yang Y, Xia Y, Chen YT, Parekh Z, Pham H, et al. Scaling up visual and vision-language representation learning with noisy text supervision. In: International Conference on Machine Learning. London, UK: PMLR; 2021. p. 4904–16. [Google Scholar]
3. Silva-Rodriguez J, Hajimiri S, Ben Ayed I, Dolz J. A closer look at the few-shot adaptation of large vision-language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway, NJ, USA: IEEE; 2024. p. 23681–90. [Google Scholar]
4. Zhou K, Yang J, Loy CC, Liu Z. Conditional prompt learning for vision-language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway, NJ, USA: IEEE; 2022. p. 16816–25. [Google Scholar]
5. Zhu B, Niu Y, Han Y, Wu Y, Zhang H. Prompt-aligned gradient for prompt tuning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. Piscataway, NJ, USA: IEEE; 2023. p. 15659–69. [Google Scholar]
6. Yu T, Lu Z, Jin X, Chen Z, Wang X. Task residual for tuning vision-language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway, NJ, USA: IEEE; 2023. p. 10899–909. [Google Scholar]
7. Gao P, Geng S, Zhang R, Ma T, Fang R, Zhang Y, et al. Clip-adapter: better vision-language models with feature adapters. Int J Comput Vis. 2024;132(2):581–95. [Google Scholar]
8. Zhu X, Zhang R, He B, Zhou A, Wang D, Zhao B, et al. Not all features matter: enhancing few-shot clip with adaptive prior refinement. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. Piscataway, NJ, USA: IEEE; 2023. p. 2605–15. [Google Scholar]
9. Song L, Xue R, Wang H, Sun H, Ge Y, Shan Y, et al. Meta-adapter: an online few-shot learner for vision-language model. Adv Neural Inf Process Syst. 2023;36:55361–74. doi:10.52202/075280-2416. [Google Scholar] [CrossRef]
10. Zhang R, Zhang W, Fang R, Gao P, Li K, Dai J, et al. Tip-adapter: training-free adaption of CLIP for few-shot classification. In: Computer vision—ECCV 2022. Cham, Switzerland: Springer Nature; 2022. p. 493–510. doi:10.1007/978-3-031-19833-5_29. [Google Scholar] [CrossRef]
11. Kato N, Nota Y, Aoki Y. Proto-adapter: efficient training-free CLIP-adapter for few-shot image classification. Sensors. 2024;24(11):3624. [Google Scholar] [PubMed]
12. Wang Z, Liang J, Sheng L, He R, Wang Z, Tan T. A hard-to-beat baseline for training-free CLIP-based adaptation. arXiv:2402.04087. 2024. [Google Scholar]
13. Bendou Y, Ouasfi A, Gripon V, Boukhayma A. ProKeR: a kernel perspective on few-shot adaptation of large vision-language models. In: Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition Conference. Piscataway, NJ, USA: IEEE; 2025. p. 25092–102. [Google Scholar]
14. Li D, Wang R. Text-guided dual feature enhancement: a training-free paradigm for few-shot learning with CLIP. In: 2025 IEEE 6th International Seminar on Artificial Intelligence, Networking and Information Technology (AINIT). Piscataway, NJ, USA: IEEE; 2025. p. 1937–40. doi:10.1109/ainit65432.2025.11035349. [Google Scholar] [CrossRef]
15. Guo Z, Zhang R, Qiu L, Ma X, Miao X, He X, et al. Calip: zero-shot enhancement of clip with parameter-free attention. In: Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto, CA, USA: AAAI Press; 2023. p. 746–54. [Google Scholar]
16. Chen X, Li Y, Chen H. Dual-adapter: training-free dual adaptation for few-shot out-of-distribution detection. arXiv:2405.16146. 2024. [Google Scholar]
17. Morales-Álvarez P, Christodoulidis S, Vakalopoulou M, Piantanida P, Dolz J. BayesAdapter: enhanced uncertainty estimation in CLIP few-shot adaptation. arXiv:2412.09718. 2024. [Google Scholar]
18. Huang Y, Shakeri F, Dolz J, Boudiaf M, Bahig H, Ben Ayed I. Lp++: a surprisingly strong linear probe for few-shot clip. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway, NJ, USA: IEEE; 2024. p. 23773–82. [Google Scholar]
19. Yoon HS, Yoon E, Tee JTJ, Hasegawa-Johnson M, Li Y, Yoo CD. C-TPT: calibrated test-time prompt tuning for vision-language models via text feature dispersion. arXiv:2403.14119. 2024. [Google Scholar]
20. Upadhyay U, Karthik S, Mancini M, Akata Z. Probvlm: probabilistic adapter for frozen vison-language models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. Piscataway, NJ, USA: IEEE; 2023. p. 1899–910. [Google Scholar]
21. Oh C, Lim H, Kim M, Han D, Yun S, Choo J, et al. Towards calibrated robust fine-tuning of vision-language models. Adv Neural Inf Process Syst. 2024;37:12677–707. doi:10.52202/079017-0403. [Google Scholar] [CrossRef]
22. Silva-Rodríguez J, Ben Ayed I, Dolz J. Conformal prediction for zero-shot models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ, USA: IEEE; 2025. p. 19931–41. [Google Scholar]
23. Liu J, Shen J, Zhou P, Sonke JJ, Gavves E. Probabilistic prototype calibration of vision-language models for generalized few-shot semantic segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway, NJ, USA: IEEE; 2025. p. 21155–65. [Google Scholar]
24. Venkataramanan A, Bodesheim P, Denzler J. Probabilistic embeddings for frozen vision-language models: uncertainty quantification with gaussian process latent variable models. In: Chiappa S, Magliacane S, editors.Proceedings of the Forty-First Conference on Uncertainty in Artificial Intelligence. Vol. 286 of Proceedings of Machine Learning Research. London, UK: PMLR; 2025. p. 4309–28. [Google Scholar]
25. Alparone L, Arienzo A, Lombardini F. Improved coherent processing of synthetic aperture radar data through speckle whitening of single-look complex images. Remote Sens. 2024;16(16):2955. doi:10.3390/rs16162955. [Google Scholar] [CrossRef]
26. Pan X, Zhan X, Shi J, Tang X, Luo P. Switchable whitening for deep representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. Piscataway, NJ, USA: IEEE; 2019. p. 1863–71. [Google Scholar]
27. Cai M, van Buuren S, Vink G. Joint distribution properties of fully conditional specification under the normal linear model with normal inverse-gamma priors. Sci Rep. 2023;13(1):644. doi:10.1038/s41598-023-27786-y. [Google Scholar] [PubMed] [CrossRef]
28. Griffin JE, Brown PJ. Inference with normal-gamma prior distributions in regression problems. Bayesian Anal. 2010;5(1):171–88. doi:10.1214/10-ba507. [Google Scholar] [CrossRef]
29. Geweke J. Bayesian treatment of the independent student-t linear model. J Appl Econom. 1993;8(S1):S19–40. doi:10.1002/jae.3950080504. [Google Scholar] [CrossRef]
30. Dunn R, Ramdas A, Balakrishnan S, Wasserman L. Gaussian universal likelihood ratio testing. Biometrika. 2023;110(2):319–37. doi:10.1093/biomet/asac064. [Google Scholar] [CrossRef]
31. Yodnual S, Chumnaul J. Signed log-likelihood ratio test for the scale parameter of Poisson Inverse Weibull distribution with the development of PIW4LIFETIME web application. PLoS One. 2025;20(8):e0329293. doi:10.1371/journal.pone.0329293. [Google Scholar] [PubMed] [CrossRef]
32. Parkhi OM, Vedaldi A, Zisserman A, Jawahar C. Cats and dogs. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ, USA: IEEE; 2012. p. 3498–505. [Google Scholar]
33. Fei-Fei L, Fergus R, Perona P. Learning generative visual models from few training examples: an incremental bayesian approach tested on 101 object categories. In: 2004 Conference on Computer Vision and Pattern Recognition Workshop. Piscataway, NJ, USA: IEEE; 2004. [Google Scholar]
34. Maji S, Rahtu E, Kannala J, Blaschko M, Vedaldi A. Fine-grained visual classification of aircraft. arXiv:1306.5151. 2013. [Google Scholar]
35. Cimpoi M, Maji S, Kokkinos I, Mohamed S, Vedaldi A. Describing textures in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ, USA: IEEE; 2014. p. 3606–13. [Google Scholar]
36. Helber P, Bischke B, Dengel A, Borth D. Eurosat: a novel dataset and deep learning benchmark for land use and land cover classification. IEEE J Sel Top Appl Earth Obs Remote Sens. 2019;12(7):2217–26. [Google Scholar]
37. Soomro K, Zamir AR, Shah M. Ucf101: a dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402. 2012. [Google Scholar]
38. Lin Z, Yu S, Kuang Z, Pathak D, Ramanan D. Multimodality helps unimodality: cross-modal few-shot learning with multimodal models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway, NJ, USA: IEEE; 2023. p. 19325–37. [Google Scholar]
39. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ, USA: IEEE; 2016. p. 770–8. [Google Scholar]
40. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, et al. An image is worth 16 × 16 words: transformers for image recognition at scale. arXiv:2010.11929. 2021. [Google Scholar]
41. Guo C, Pleiss G, Sun Y, Weinberger KQ. On calibration of modern neural networks. In: International Conference on Machine Learning. London, UK: PMLR; 2017. p. 1321–30. [Google Scholar]
42. Nixon J, Dusenberry MW, Zhang L, Jerfel G, Tran D. Measuring calibration in deep learning. arXiv:1904.01685. 2019. [Google Scholar]
43. Dadalto Câmara Gomes E, Romanelli M, Pichler G, Piantanida P. A data-driven measure of relative uncertainty for misclassification detection. In: Kim B, Yue Y, Chaudhuri S, Fragkiadaki K, Khan M, Sun Y, editors. International Conference on Learning Representations. Red Hook, NY, USA: Curran Associates, Inc.; 2024. p. 21826–48. [Google Scholar]
44. Geifman Y, El-Yaniv R. Selective classification for deep neural networks. In: NIPS’17: Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY, USA: Curran Associates, Inc.; 2017. p. 4885–94. [Google Scholar]
45. Wu YC, Lyu SH, Shang H, Wang X, Qian C. Confidence-aware contrastive learning for selective classification. In: Salakhutdinov R, Kolter Z, Heller K, Weller A, Oliver N, Scarlett J,editors. Proceedings of the 41st International Conference on Machine Learning. Vol. 235 of Proceedings of Machine Learning Research. London, UK: PMLR; 2024. p. 53706–29. [Google Scholar]
46. Demšar J. Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res. 2006;7:1–30. [Google Scholar]
Copyright © 2026 The Author(s). Published by Tech Science Press. This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

