Open Access

ARTICLE

LiRA-CLIP: Training-Free Posterior-Predictive Uncertainty for Few-Shot CLIP Classification

Mustafa Qaid Khamisi1, Zuping Zhang1,*, Mohammed Al-Habib1, Muhammad Asim2, Sajid Shah2

1 School of Computer Science and Engineering, Central South University, Changsha, China
2 EIAS Data Science Lab, College of Computer and Information Sciences, Prince Sultan University, Riyadh, Saudi Arabia

* Corresponding Author: Zuping Zhang.

Computers, Materials & Continua 2026, 88(1), 10 https://doi.org/10.32604/cmc.2026.077556

Abstract

Large Vision-Language Models (VLMs) such as Contrastive Language-Image Pretraining (CLIP) have transformed open-world image recognition. Nevertheless, few-shot classification, particularly in the extremely low-shot regime, requires not only high accuracy but also reliably calibrated uncertainty for high-confidence decisions. Existing training-free CLIP adapters are primarily designed to increase accuracy and efficiency; they integrate zero-shot text logits with few-shot feature caches but do not explicitly model predictive uncertainty, and therefore often exhibit considerable miscalibration and weak selective performance. Bayesian adapters move toward probabilistic modeling by placing priors over adapter parameters and employing task-specific variational training; however, this requires gradient-based optimization for every new task, increases computational cost, and becomes fragile when only one or two labeled examples per class are available. Starting from this observation, we introduce a training-free posterior-predictive Likelihood Ratio Adapter (LiRA-CLIP) for few-shot CLIP classification, which directly addresses probabilistic reliability under strict low-shot and deployment constraints. LiRA-CLIP extends the frozen CLIP head with a text-conditioned generative model in feature space that produces heavy-tailed posterior-predictive likelihood ratios, fused with the CLIP logits via a small, reliability-driven calibration layer. This layer is optimized to minimize the negative log-likelihood under an explicit accuracy side constraint, which leads to calibrated probabilities and dependable selective decisions without any gradient-based, task-specific training. Extensive experiments show that LiRA-CLIP matches or slightly surpasses strong CLIP adapters in top-1 accuracy, while reducing calibration error by roughly 40%–50% and substantially increasing 95% and 99% reliable coverage in the low-shot regime, thereby establishing a new state of the art in probabilistic reliability for training-free few-shot CLIP models.

Keywords

Vision–language models; few-shot learning; CLIP; training-free; uncertainty calibration; selective classification; posterior predictive modeling

1  Introduction

Large pre-trained Vision–Language Models (VLMs) such as CLIP [1] have provided transferable representations and a prompt-based zero-shot encoder. Web-scale variants [2] trained with noisy text supervision increase transfer under distribution shift. Adaptation methods have been applied in few-shot image classification to improve over zero-shot prompting, but reported gains can strongly depend on task-specific model selection and may collapse under distribution shift in low-data settings [3]. Most adaptation methods are gradient-based, including prompt-learning methods that optimize continuous context tokens while keeping the CLIP backbone frozen [4,5], and lightweight tuning methods that optimize a minimal set of task parameters on top of frozen CLIP encoders [6]. In addition, feature-adapter approaches introduce bottleneck modules downstream of the encoders, where these are the only parameters updated during training, so that the CLIP backbone remains frozen [7]. To bridge domain shifts and exploit limited labeled data, few-shot adapters refine or extend CLIP with limited supervision [8,9]. Training-free caching approaches such as Tip-Adapter [10] combine few-shot visual features with text prompts to obtain robust performance gains. Subsequent nonparametric and prototype-based adapters show that cache-like adaptation can rival or surpass traditional fine-tuning under strict time and compute budgets [11,12], and kernel-based analyses provide theoretical grounding and starting points for further improvements [13]. Recent methods have explored multimodal fusion, attention mechanisms, and prototype mechanisms to improve generalization and robustness [14,15].

Despite these advances, current few-shot CLIP adapters exhibit a gap between accuracy and probabilistic reliability, in particular in the extreme low-shot regime. Training-free cache and prototype adapters [8,13] are optimized for discriminative performance. They reweight zero-shot text logits or fuse them with support-set similarities of feature representations, but they treat these scores as deterministic functionals in the embedding space and do not model predictive uncertainty in a probabilistic manner. As a consequence, they often exhibit substantial miscalibration and weak performance under selective classification metrics, even when their top-1 accuracy is high. Few-shot OOD detectors generally target detection metrics instead of calibrated in-domain confidence [16]. Although meta-learned online adapters [9] avoid task-specific fine-tuning, they still rely on extensive offline training and are optimized for discriminative accuracy rather than for calibrated uncertainty.

By contrast, Bayesian adapters such as BayesAdapter [17] do introduce probabilistic structure; however, they place priors over adapter parameters and require gradient-based, task-specific training by means of variational inference. This improves the quality of uncertainty estimation in moderate data regimes, but increases adaptation time and, according to reports, shows particular brittleness or instability in the extreme low-shot regime, especially when only one or two labeled examples per class are available. To the best of our knowledge, there currently exists no training-free CLIP adapter that, under a strict few-shot protocol with only a handful of labeled examples per class, simultaneously performs adaptation in closed form without task-specific gradient-based optimization, explicitly targets calibrated probabilities and high-confidence selective metrics, and operates under strict latency and memory constraints that are suitable for deployment. Existing training-free adapters prioritize discriminative accuracy without a well-founded probabilistic uncertainty model, whereas Bayesian adapters trade training-free adaptation for task-specific variational optimization.

We address this gap by proposing LiRA-CLIP, a training-free posterior-predictive Likelihood Ratio Adapter for few-shot CLIP classification. LiRA-CLIP extends the frozen CLIP classifier with a text-conditioned Bayesian generative model over background-whitened image features, whose posterior-predictive likelihood ratios with respect to a pooled background define a single heavy-tailed generative evidence stream. This generative stream is fused with the CLIP logits through a unified, reliability-driven mechanism consisting of a margin-based confidence gate and a lightweight calibration layer, which together act as a two-stream probabilistic adapter and, in a fully training-free, deterministic setting, produce calibrated probabilities and reliable high-confidence selective decisions. Across six standard benchmarks for few-shot adaptation and two CLIP backbones, our training-free, text-conditioned Bayesian decision rule consistently improves calibration metrics (ECE and AECE) and high-confidence selective coverage at 95% and 99% target accuracy in the low-shot regime relative to plain CLIP, fine-tuned, and trained Bayesian adapters. As the budget of labeled data grows, LiRA-CLIP remains competitive in top-1 accuracy on all benchmarks while preserving its reliability advantages. These gains persist across a broad range of recent state-of-the-art adapters and support posterior-predictive likelihood-ratio fusion as a well-founded and practical path toward trustworthy few-shot CLIP adaptation.

Our contributions can be summarized as follows:

•   We introduce LiRA-CLIP, a training-free CLIP adapter that extends CLIP with a text-conditioned, posterior-predictive generative model over whitened image features. This model produces a reliable, heavy-tailed Student-t likelihood-ratio (t-PLLR) stream, specifically in the extreme low-shot regime, and can be fused with CLIP without any gradient-based, task-specific optimization.

•   We formulate few-shot CLIP adaptation as a single reliability-driven calibration problem in a compact probabilistic adapter that unifies stream-specific temperatures and a global generative fusion coefficient, leading to calibrated probabilities and reliable selective decisions at 95% and 99% target accuracy while retaining competitive top-1 accuracy.

•   Through extensive experiments on six few-shot benchmarks with two CLIP backbones, we show that LiRA-CLIP consistently improves probabilistic reliability, with clearly reduced calibration errors (ECE and AECE) and higher reliable coverage at 95% and 99% in the low-shot regime, while matching or closely trailing the best existing adapters in top-1 accuracy. Additionally, we conduct targeted ablation studies on prior hyperparameters and a lightly fine-tuned variant (LiRA-CLIP-F), showing that these reliability gains are robust and that the fully training-free formulation is particularly advantageous under extreme data scarcity.

2  Related Work

2.1 Few-Shot CLIP Adaptation

Few-shot adaptation to new classes is commonly categorized into three groups: prompt learning, which optimizes continuous tokens while the CLIP encoders remain frozen [4,5]; adapters in the embedding space, which connect zero-shot and few-shot evidence with small residual modules [3,6,7]; and training-free, key-value cache models [10,13]. Although gradient-based prompt learning has demonstrated strong gains [5], its training requirement constrains its use in scenarios in which CLIP is available only as a frozen, purely forward (black-box) feature extractor. Adapter-based strategies instead operate in the CLIP embedding space, either by fine-tuning a lightweight linear head or a shallow MLP on frozen features [3,17,18], or by relying on training-free key-value caches that are initialized from the support set [8,10]. LiRA-CLIP belongs to the family of adapter-based procedures, but is fully training-free and designed for a protocol with inaccessible weights and purely forward inference: both vision and text encoders of CLIP remain frozen, no gradients are computed, and adaptation proceeds entirely via a posterior-predictive decision rule and reliability-driven calibration on frozen features.

2.2 Training-Free Caches

Cache-based CLIP adaptation combines support-set similarities with zero-shot text logits, as in Tip-Adapter and its variants [8,10,13]. APE [8] refines such cache priors by analyzing inter-class discrepancies and leveraging the three-way interaction between image, cache, and text; it offers both a training-free variant (APE) and a lightweight trained variant (APE-T) for high accuracy with few parameters. ProKeR [13] formalizes Tip-Adapter as a Nadaraya-Watson local estimator and highlights the benefit of global information via a proximal RKHS regularizer that is solved in closed form and leads to significant performance improvements. These cache-based methods, however, treat cache scores as deterministic similarity functionals in feature space and do not directly target calibrated uncertainty or selective decisions in the extreme low-data regime. Building on this low-overhead cache logic, LiRA-CLIP replaces deterministic cache weightings with text-anchored Bayesian posterior-predictive scoring in a background-whitened CLIP space, and uses a pooled-background Student-t posterior-predictive likelihood-ratio measure (t-PLLR) as a global generative reference, without the need for any gradient-based optimization.

2.3 Bayesian Uncertainty and Calibration in CLIP Adapters

Recent work investigates uncertainty and calibration in VLM adaptation methods [19–21]. BayesAdapter [17] in particular shows that a strong adapter can be interpreted as a maximum a posteriori solution in a probabilistic framework, and that the transition from a point estimate to a Bayesian posterior over adapter parameters improves calibration and selective classification. These Bayesian approaches to few-shot learning place priors over parameters and refine them through task-specific training. Yet they have limitations, specifically in the extreme low-shot regime, where their advantages diminish or disappear in the 1-shot setting, and they also incur considerable optimization cost.

LiRA-CLIP is complementary to this Bayesian parameter-space category, but instead of placing priors over adapter weights and updating them through task-specific training, it performs training-free Bayesian modeling directly in feature space. It uses a text-conditioned generative model over background-whitened CLIP representations to provide a posterior-predictive likelihood-ratio score with respect to a pooled background, which is then fused with the CLIP logits via a small, accuracy-protected calibration layer. This leads to calibrated probabilities and reliable high-confidence selective decisions in a fully training-free, posterior-predictive setting that, to our knowledge, is not addressed by existing CLIP adaptation methods.

Our study focuses on the gap between accuracy and probabilistic reliability in the extreme low-shot regime for few-shot CLIP classification, with frozen encoders and deployment constraints. We target calibrated point probabilities and high-confidence, accuracy-constrained selective classification using a training-free adapter that extends the frozen CLIP head without gradient-based, task-specific training.

We study training-free few-shot CLIP classification with frozen encoders. LiRA-CLIP performs closed-form posterior-predictive updates in feature space and applies a fixed two-stream fusion rule, producing class probabilities over fused logits (Algorithm 1, Eqs. (20) and (21)). We select the fusion parameters by reliability-driven global calibration on an auxiliary task pool and reuse them unchanged across tasks (Section 3.9, Eq. (19)). We evaluate point-probability reliability using Expected Calibration Error (ECE) and Adaptive ECE (AECE) and assess selective classification by measuring coverage under accuracy constraints at 95% (Sel@95) and 99% (Sel@99) target accuracy, as defined in Section 4, Tables 1 and 2. The baseline methods are restricted to CLIP adapters evaluated under the same strict few-shot protocol with frozen CLIP encoders and logits-to-probabilities outputs, making them directly comparable (Section 4). Other uncertainty quantification approaches target different objectives: conformal prediction [22] outputs prediction sets with coverage guarantees rather than point-probability calibration. Related probabilistic extensions of frozen VLMs also study uncertainty under different downstream tasks or uncertainty representations, including generalized few-shot semantic segmentation [23] or post-hoc probabilistic embeddings via GPLVM [24]. These methods are complementary and not directly comparable to our point-probability calibration (ECE) and selective classification (Sel@99) results under the evaluation protocol in Section 4.

[Table 1]

[Table 2]

2.4 LiRA-CLIP’s Key Distinction from Prior Work

We distinguish LiRA-CLIP from prior probabilistic VLM adaptation not by the presence of uncertainty modeling but by where uncertainty is represented and how it is computed at deployment. BayesAdapter [17] models uncertainty in parameter space, placing priors over adapter parameters and performing Bayesian posterior inference via task-time optimization such as variational updates, rather than closed-form prediction. ProbVLM [20] instead learns a probabilistic embedding adapter, trained offline via gradient-based optimization to produce output distributions over frozen VLM embeddings. By contrast, LiRA-CLIP is training-free and performs closed-form posterior-predictive inference directly in frozen CLIP feature space at task time, with no gradient-based updates. Specifically, given only support-set sufficient statistics in a background-whitened representation (Eq. (2)), a text-conditioned NIG prior (Eqs. (4) and (5)) leads to a Student-t posterior predictive (Eqs. (6)–(9)) and a pooled-background likelihood-ratio evidence stream (Eqs. (10) and (11)), which is then fused with the CLIP logits via a globally calibrated and frozen scalar rule (Eqs. (16)–(19)). This approach keeps the deployment profile strictly forward-only, with no backpropagation and no per-sample test-time optimization as in C-TPT [19], and it requires no per-task calibration or tuning at test time.

3  Methodology

In this section, we present LiRA-CLIP, a training-free posterior-predictive likelihood–ratio adapter for CLIP. LiRA-CLIP couples a text-conditioned Student-t generative head with the standard CLIP discriminative head via a confidence gate, and formulates few-shot adaptation as a single, reliability-driven calibration problem. An overview of the LiRA-CLIP architecture is shown in Fig. 1.

[Figure 1]

Figure 1: Overview of the LiRA-CLIP framework, a training-free posterior-predictive likelihood-ratio adapter for few-shot CLIP classification. Fig. 1a: a frozen CLIP image encoder $\varphi$ and text encoder $\psi$ produce image features $z=\varphi(x)$ and class prompts $w_c=\psi(c)$. Their dot products yield zero-shot logits $s_c^{\text{clip}}(z)$ and a margin-based confidence gate $g(z)$ (discriminative stream). Fig. 1b: support-whitened features $\tilde z$ feed a text-conditioned diagonal normal-inverse-gamma (NIG) model, leading to Student-t posterior-predictive densities $p(\tilde z \mid c)$ and a pooled background density $p_{bg}(\tilde z)$. Their standardized log-likelihood ratios $\hat r_c(\tilde z)$ form the t-PLLR generative stream. Fig. 1c: a lightweight reliability-driven calibration layer fuses the discriminative and generative streams into final logits, which are converted into calibrated class probabilities $p_\theta(y=c \mid x)$ and high-confidence selective decisions, with no gradient-based tuning or task-specific training.

3.1 Problem Setting

We consider an $N$-way few-shot classification task with class set $\mathcal{N}=\{1,\dots,N\}$. The labeled support set is $\mathcal{D}^{\text{sup}}=\{(x_i^{\text{sup}},y_i^{\text{sup}})\}_{i=1}^{N_{\text{sup}}}$, with $y_i^{\text{sup}} \in \mathcal{N}$, and we evaluate on an unlabeled test set $\mathcal{D}^{\text{test}}=\{x_i^{\text{test}}\}_{i=1}^{N_{\text{test}}}$. We use a frozen CLIP image encoder $\varphi:\mathcal{X}\to\mathbb{R}^D$ and text encoder $\psi:\mathcal{N}\to\mathbb{R}^D$. For an image $x$ and class $c\in\mathcal{N}$, we define:

$$z=\varphi(x), \qquad w_c=\psi(c)\in\mathbb{R}^D, \tag{1}$$

where $w_c$ is obtained from a fixed prompt template, as in standard CLIP zero-shot classification [1]. On the support set we compute empirical per-class means $\bar z_c$, diagonal variances $s_c^2$, and counts $n_c$. We also compute pooled background statistics $(\mu_{bg}, v_{bg})$ across all support samples, with total count $n_{bg}=N_{\text{sup}}$. For one-shot classes ($n_c=1$), we tie the class variance to the background variance for numerical stability.
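As an illustration of these support-set statistics, the following minimal NumPy sketch computes the per-class and pooled background quantities from frozen CLIP features; the small variance floor (1e-6) and the helper name are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def support_statistics(Z_sup, y_sup, num_classes):
    """Per-class and pooled background statistics on frozen CLIP features (Section 3.1).

    Z_sup : (N_sup, D) array of support image features.
    y_sup : (N_sup,) integer labels in {0, ..., num_classes - 1}.
    """
    mu_bg = Z_sup.mean(axis=0)            # pooled background mean
    v_bg = Z_sup.var(axis=0) + 1e-6       # pooled diagonal variance (small floor, illustrative)
    n_bg = Z_sup.shape[0]                 # total support count

    class_stats = []
    for c in range(num_classes):
        Zc = Z_sup[y_sup == c]
        n_c = Zc.shape[0]
        mean_c = Zc.mean(axis=0)
        # one-shot classes: tie the class variance to the background variance
        var_c = Zc.var(axis=0) if n_c > 1 else v_bg.copy()
        class_stats.append((mean_c, var_c, n_c))
    return class_stats, (mu_bg, v_bg, n_bg)
```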

3.2 Background Whitening

Whitening is a technique in signal and image processing used to transform correlated background or clutter into approximately white noise so as to enhance target detectability in challenging environments, as in SAR speckle whitening [25]. Related ideas appear in deep networks through switchable whitening and normalization layers, which decorrelate features and improve optimization stability across tasks [26]. In LiRA-CLIP we adopt a lightweight variant at the representation level: we standardize all features using the pooled support statistics $(\mu_{bg}, v_{bg})$. Let $\sqrt{v_{bg}}$ denote the element-wise square root and $\oslash$ element-wise division. For any feature vector $z\in\mathbb{R}^D$ we define:

$$\tilde z=(z-\mu_{bg})\oslash\sqrt{v_{bg}}. \tag{2}$$

Eq. (2) performs pooled-statistics, per-dimension standardization using $\mu_{bg}$ and $v_{bg}$ computed from the support set (Section 3.1). Since our generative head models $\tilde z$ with a diagonal likelihood and uses closed-form text-conditioned NIG updates (Section 3.3), this standardization controls coordinate-wise scale when computing Student-t posterior predictives and the resulting posterior-predictive log-likelihood ratios, Eqs. (6)–(11). As with any pooled-statistic standardization, $(\mu_{bg}, v_{bg})$ may be higher-variance when the support set is extremely small or class frequencies are highly imbalanced. LiRA-CLIP incorporates stability safeguards in the method itself: for one-shot classes we tie the class variance to the background variance (Section 3.1), we apply a variance floor $\varepsilon>0$ in the per-sample standardization of LLRs, Eq. (11), and we enforce a minimum predictive degrees of freedom in the Student-t posterior predictive (Section 3.3). These choices are designed to mitigate degeneracy of likelihood-ratio scores and keep the posterior-predictive stream well-conditioned in the extreme low-shot setting. Empirically, LiRA-CLIP exhibits non-degenerate behavior in the low-shot regime across benchmarks, as reflected by improved ECE, AECE, and selective coverage (Section 4).

To express the text-conditioned generative model in the same background-whitened coordinates as $\tilde z$, we also whiten the CLIP text embedding using the pooled support background statistics $(\mu_{bg}, v_{bg})$ (Section 3.1). For each class $c$, we define the whitened text anchor as:

$$\tilde w_c = (w_c-\mu_{bg})\oslash\sqrt{v_{bg}} \in \mathbb{R}^D. \tag{3}$$

We use $\tilde w_c$ only in the posterior-predictive (generative) stream, while the standard CLIP discriminative stream remains defined on the original features $z$ and text embeddings $w_c$ (Section 3.6).
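A minimal sketch of the background-whitening step of Eqs. (2) and (3), reusing the pooled statistics computed above; the function name is an illustrative assumption.

```python
import numpy as np

def whiten(v, mu_bg, v_bg):
    """Element-wise standardization with pooled support statistics, Eqs. (2)-(3)."""
    return (v - mu_bg) / np.sqrt(v_bg)

# z_tilde   = whiten(z, mu_bg, v_bg)    # whitened image feature, Eq. (2)
# w_tilde_c = whiten(w_c, mu_bg, v_bg)  # whitened text anchor, Eq. (3)
```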

3.3 Text Conditioned Diagonal NIG Class Priors

The Normal-Inverse-Gamma (NIG) prior is a standard conjugate prior for the mean and variance of a normal likelihood and underpins a wide range of Bayesian regression and hierarchical models [27,28]. Integrating out the latent mean and variance yields Student-t posterior-predictive distributions, which naturally accommodate heavy tails and provide robustness to outliers and model misspecification [29]. In LiRA-CLIP, we model the class-conditional distribution of whitened image features $\tilde z \mid y=c$ as a diagonal Gaussian and place an NIG prior on its parameters $(\mu_c, \sigma_c^2)$. The prior is text-conditioned: its mean is anchored in the whitened text embedding $\tilde w_c$, Eq. (3), and it is therefore defined in the same coordinate system as the likelihood for $\tilde z$. In the whitened space, we place an independent NIG prior on each class-dimension pair $(c,d)$:

$$\sigma_{cd}^2 \sim \text{Inv-Gamma}(\alpha_0,\beta_0), \qquad \mu_{cd}\mid\sigma_{cd}^2 \sim \mathcal{N}\!\left(\mu_{0,cd},\,\sigma_{cd}^2/\kappa_0\right), \tag{4}$$

with positive hyperparameters $(\kappa_0,\alpha_0,\beta_0)$ shared across classes and dimensions. Our key design choice is to set the class prior mean to this whitened text anchor as follows:

$$\mu_{0,c}=\tilde w_c\in\mathbb{R}^D, \qquad \tilde w_c \text{ defined in Eq. (3)}. \tag{5}$$

This uses the same class embedding that defines CLIP's frozen zero-shot linear head, Eq. (12). Since CLIP places the image feature $z=\varphi(x)$ and the class text embedding $w_c=\psi(c)$ in the same $D$-dimensional representation space, Eq. (1), and compares them directly via dot products, applying the same background-whitening reparameterization to both, Eq. (2), simply expresses this class anchor in the coordinate system of the generative likelihood for $\tilde z$. It is also consistent with our pooled-background predictive construction, whose mean is defined as a whitened function of the text embeddings, Eq. (8).

3.4 Student-t Posterior Predictive and Log-Likelihood Ratios

Log-likelihood ratios are central to statistical decision theory, hypothesis testing, and modern likelihood-based machine learning [30,31]. LiRA-CLIP leverages this principle in a posterior-predictive setting. For each class, we form the log-ratio between its Student-t predictive density and a pooled background predictive in the background-whitened CLIP feature space, and then apply per-sample standardization across classes. This yields Student-t posterior-predictive log-likelihood ratios (t-PLLRs), defining a single stream of generative evidence, which we later fuse with the CLIP logits through our reliability-driven calibration mechanism (Section 3.9). For a whitened feature $\tilde z\in\mathbb{R}^D$, the class-$c$ posterior-predictive density factorizes across dimensions as:

$$p(\tilde z\mid c)=\prod_{d=1}^{D} t_{\nu_c}\!\left(\tilde z_d \mid m_{cd},\, s_{cd}^2\right), \tag{6}$$

where $t_\nu(\mu, s^2)$ denotes the univariate Student-t distribution with degrees of freedom $\nu$, location $\mu$, and scale $s$. The predictive variance in dimension $d$ has the usual NIG form:

$$s_{cd}^2=\frac{\beta_{cd}}{\alpha_c}\cdot\frac{\kappa_c+1}{\kappa_c}. \tag{7}$$

We denote the resulting generative log-likelihood by $\ell_c(\tilde z)=\log p(\tilde z\mid c)$.
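The paper relies on closed-form, text-conditioned NIG updates but does not spell them out; the sketch below combines the standard conjugate NIG posterior update (an assumption on the exact form) with the Student-t predictive of Eqs. (6) and (7). The default hyperparameter values and the minimum degrees-of-freedom floor are illustrative.

```python
import numpy as np
from scipy.stats import t as student_t

def class_log_predictive(z_tilde, mean_c, var_c, n_c, w_tilde_c,
                         kappa0=1.0, alpha0=2.0, beta0=1.0, min_dof=3.0):
    """Log p(z_tilde | c) under the diagonal Student-t posterior predictive, Eqs. (6)-(7).

    The NIG posterior update below is the standard conjugate form (assumed),
    anchored at the whitened text embedding w_tilde_c via Eq. (5).
    """
    kappa_c = kappa0 + n_c
    alpha_c = alpha0 + 0.5 * n_c
    m_c = (kappa0 * w_tilde_c + n_c * mean_c) / kappa_c          # posterior location
    beta_c = (beta0 + 0.5 * n_c * var_c
              + 0.5 * (kappa0 * n_c / kappa_c) * (mean_c - w_tilde_c) ** 2)
    nu_c = max(2.0 * alpha_c, min_dof)                           # enforce a minimum dof
    s2_c = (beta_c / alpha_c) * (kappa_c + 1.0) / kappa_c        # predictive variance, Eq. (7)
    # diagonal model: sum of univariate Student-t log-densities, Eq. (6)
    return student_t.logpdf(z_tilde, df=nu_c, loc=m_c, scale=np.sqrt(s2_c)).sum()
```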

3.5 Background Predictive Model

We construct a background Student-t predictive in the same whitened space using the global statistics $(\mu_{bg}, v_{bg}, n_{bg})$ and the text embeddings, with location defined as follows:

$$\mu_{bg}^{(t)}=\left(\frac{1}{C}\sum_{c=1}^{C}w_c-\mu_{bg}\right)\oslash\sqrt{v_{bg}}, \tag{8}$$

corresponding to the whitened mean of the class text embeddings, and take a unit background variance $v_{bg}^{(t)}=\mathbf{1}$. Using $n_{bg}$, we obtain a diagonal Student-t posterior predictive:

$$p_{bg}(\tilde z)=\prod_{d=1}^{D} t_{\nu_{bg}}\!\left(\tilde z_d \mid \mu_{bg,d}^{(t)},\, s_{bg,d}^2\right), \tag{9}$$

with degrees of freedom $\nu_{bg}$ and scale parameters $s_{bg,d}^2$ defined analogously to the class case. We denote the background log-likelihood by $\ell_{bg}(\tilde z)=\log p_{bg}(\tilde z)$. For each class $c$, we form the log-likelihood ratio (LLR) as:

$$r_c(\tilde z)=\ell_c(\tilde z)-\ell_{bg}(\tilde z). \tag{10}$$

To stabilize the scale of generative scores across images, we apply per-sample z-scoring across classes. Let $r(\tilde z)=(r_c(\tilde z))_{c=1}^{C}$ and denote its empirical mean and variance across classes by $\bar r(\tilde z)$ and $v_r(\tilde z)$, with a small variance floor $\varepsilon>0$. We obtain the standardized generative score via:

$$\hat r_c(\tilde z)=\frac{r_c(\tilde z)-\bar r(\tilde z)}{\sqrt{v_r(\tilde z)+\varepsilon}}, \qquad c\in\mathcal{C}. \tag{11}$$

These standardized posterior-predictive LLRs constitute the generative stream used by the adapter.
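A minimal sketch of the likelihood-ratio and per-sample standardization steps of Eqs. (10) and (11); the background log-likelihood is assumed to be computed analogously to the class predictive above (Eqs. (8) and (9)).

```python
import numpy as np

def standardized_llrs(class_loglik, bg_loglik, eps=1e-6):
    """Standardized posterior-predictive log-likelihood ratios (t-PLLR), Eqs. (10)-(11).

    class_loglik : length-C array of log p(z_tilde | c), one entry per class.
    bg_loglik    : scalar log p_bg(z_tilde) from the pooled background predictive.
    """
    r = class_loglik - bg_loglik                  # Eq. (10)
    r_mean, r_var = r.mean(), r.var()
    return (r - r_mean) / np.sqrt(r_var + eps)    # Eq. (11), per-sample z-scoring
```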

3.6 CLIP Discriminative Stream

The CLIP zero-shot classifier provides the discriminative stream via a linear head whose weights are given by the text embeddings $w_c$ [1,4,10]. For a feature vector $z$, we define the CLIP logits as:

$$s_c^{\text{clip}}(z)=\beta_{\text{clip}}\,\langle z, w_c\rangle, \qquad c\in\mathcal{C}, \tag{12}$$

where $\beta_{\text{clip}}>0$ is the standard CLIP logit-scale (temperature) parameter [1]. These logits are converted to probabilities via a softmax, $p_{\text{clip}}(y=c\mid z)\propto\exp(s_c^{\text{clip}}(z))$.

3.7 Margin Based Confidence Gate

To modulate the contribution of the generative stream, we construct a scalar confidence gate from the CLIP logits. Let $s_{(1)}(z)$ and $s_{(2)}(z)$ denote the largest and second-largest components of $s^{\text{clip}}(z)$, and define the top-2 margin as follows:

$$m(z)=s_{(1)}(z)-s_{(2)}(z). \tag{13}$$

The margin is large and positive when CLIP is confident and small when it is ambiguous.

We map this margin to a raw gate via a squashing nonlinearity:

$$\tilde g(z)=\sigma\!\left(k_{\text{gate}}\left(t_0-\tanh m(z)\right)\right), \tag{14}$$

where $k_{\text{gate}}>0$ and $t_0$ are hyperparameters and $\sigma(\cdot)$ is the logistic sigmoid. This parametrization yields gates close to one when CLIP is uncertain (small margin) and close to zero when CLIP is highly confident (large margin). We then normalize and recenter $\tilde g(z)$ using statistics from the defined set, optionally apply a power transform, and clamp the result to a fixed interval $[g_{\min}, g_{\max}]$, obtaining the final gate $g(z)\in[g_{\min}, g_{\max}]$. The gate is thus a deterministic function of the CLIP logits and introduces no additional learned parameters beyond those calibrated in the fusion stage.
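The following sketch illustrates the margin-based gate of Eqs. (13) and (14). The hyperparameter values and the simple clipping used here are illustrative assumptions; the paper additionally normalizes, recenters, and optionally power-transforms the raw gate before clamping.

```python
import numpy as np

def confidence_gate(clip_logits, k_gate=5.0, t0=0.5, g_min=0.1, g_max=1.0):
    """Margin-based confidence gate, Eqs. (13)-(14), for a batch of CLIP logits."""
    top2 = np.sort(clip_logits, axis=-1)[..., -2:]                   # two largest logits per sample
    margin = top2[..., 1] - top2[..., 0]                             # Eq. (13): top-2 margin
    g_raw = 1.0 / (1.0 + np.exp(-k_gate * (t0 - np.tanh(margin))))   # Eq. (14)
    return np.clip(g_raw, g_min, g_max)                              # simplified clamping to [g_min, g_max]
```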

3.8 Two-Stream Fusion

For each sample, LiRA-CLIP fuses the CLIP discriminative stream with the standardized generative LLR stream. Let $s_c^{\text{clip}}(z)$ denote the CLIP logits and $\hat r_c(\tilde z)$ the standardized t-PLLR scores. The fusion parameters are collected into a single vector:

$$\theta=(\gamma_{\text{clip}},\,\gamma_{\text{llr}},\,\alpha), \tag{15}$$

where $\gamma_{\text{clip}}>0$ and $\gamma_{\text{llr}}>0$ are per-stream temperatures and $\alpha\ge 0$ controls the gated generative contribution. The fused logits are computed as:

$$u_c(z,\tilde z;\theta)=\gamma_{\text{clip}}\, s_c^{\text{clip}}(z)+\alpha\, g(z)\,\gamma_{\text{llr}}\,\hat r_c(\tilde z), \qquad c\in\mathcal{C}, \tag{16}$$

and then we calculate class probabilities with:

$$p_\theta(y=c\mid z,\tilde z)=\frac{\exp\!\left(u_c(z,\tilde z;\theta)\right)}{\sum_{c'\in\mathcal{C}}\exp\!\left(u_{c'}(z,\tilde z;\theta)\right)}. \tag{17}$$

When $\alpha=0$, Eqs. (16) and (17) reduce to temperature-scaled CLIP; setting additionally $\gamma_{\text{clip}}=1$ recovers the original CLIP predictions. For $\alpha>0$, the generative t-PLLR stream is adaptively up- or down-weighted by the gate $g(z)$ depending on CLIP confidence. Once background whitening, the priors, and the gate form are fixed, the entire adapter is parameterized only by $\theta$: the CLIP stream, the t-PLLR stream, and the gate influence decisions entirely via the fused logits $u(z,\tilde z;\theta)$ and probabilities $p_\theta$.
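A minimal sketch of the two-stream fusion of Eqs. (16) and (17); it reuses the gate and standardized LLR scores from the sketches above, and the function name is an illustrative assumption.

```python
import numpy as np

def fused_probabilities(clip_logits, llr_scores, gate, theta):
    """Fused logits and class probabilities, Eqs. (16)-(17).

    clip_logits : (N, C) CLIP logits, llr_scores : (N, C) standardized t-PLLRs,
    gate : (N,) confidence gate, theta = (gamma_clip, gamma_llr, alpha).
    """
    gamma_clip, gamma_llr, alpha = theta
    u = gamma_clip * clip_logits + alpha * gate[:, None] * gamma_llr * llr_scores  # Eq. (16)
    u = u - u.max(axis=-1, keepdims=True)          # numerical stability before softmax
    p = np.exp(u)
    return p / p.sum(axis=-1, keepdims=True)       # Eq. (17)
```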

3.9 Reliability-Driven Global Calibration

We define LiRA-CLIP as the solution of a single reliability-driven calibration problem over $\theta$, rather than tuning the gate and per-stream temperatures independently. We treat it as a unified probabilistic mechanism and choose a single global vector $\theta^\star=(\gamma_{\text{clip}},\gamma_{\text{llr}},\alpha)$ by solving a constrained optimization problem on a small pool of auxiliary few-shot classification tasks whose label sets are disjoint from those used in our main evaluation.

Auxiliary Calibration Pool Composition

We build the auxiliary pool from five datasets that span different visual domains to reduce the risk of overfitting to specific task characteristics: Caltech101 (generic objects), DTD (textures), FGVC-Aircraft (fine-grained), OxfordPets (fine-grained), and UCF101 (actions). For each dataset we perform a fixed, class-disjoint split into calibration classes and evaluation classes (80/20) and sample auxiliary few-shot tasks exclusively from the calibration classes. $\theta^\star$ is selected using only the auxiliary calibration pool; no evaluation classes, benchmark datasets, or test episodes are used to choose $\theta^\star$. We exclude EuroSAT from the calibration pool because it contains only 10 classes, which makes class-disjoint calibration splits statistically small and unstable. EuroSAT is therefore used as an out-of-calibration transfer benchmark, evaluated with the same frozen $\theta^\star$ and without any dataset-specific tuning. Each auxiliary task follows the same few-shot protocol as in Section 4: for a chosen shot $K\in\{1,2,4,8,16,32\}$, we sample $K$ labeled support examples per class from the training split and form a labeled development/query set by sampling $Q=16$ additional examples per class from the remaining training data (capped by availability). We generate a total of $T=120$ auxiliary tasks, balanced across datasets and shot regimes (4 random episodes per dataset-shot pair). On this auxiliary pool, we minimize the mean negative log-likelihood:

$$\mathrm{NLL}(\theta)=-\frac{1}{N_{\text{dev}}}\sum_{i=1}^{N_{\text{dev}}}\log p_\theta^{(i)}\!\left(y_i^{\text{dev}}\right), \tag{18}$$

where $N_{\text{dev}}$ is the total number of development/query examples across all auxiliary tasks.

Formally, we select $\theta^\star$ via:

$$\theta^\star\in\operatorname*{arg\,min}_{\theta\in\mathcal{G}:\; A(\theta)\,\ge\, A_{\max}-A_{\text{guard}}} \mathrm{NLL}(\theta), \tag{19}$$

where $\mathcal{G}$ is a fixed grid and $A_{\text{guard}}$ specifies an accuracy slack on the auxiliary pool (i.e., we require $A(\theta)\ge A_{\max}-A_{\text{guard}}$), protecting accuracy while selecting $\theta$ by NLL. The resulting $\theta^\star$ is reused unchanged for all tasks, datasets, backbones, and shot regimes, with no test-time tuning. Although Eq. (19) is a constrained optimization problem, we solve it by grid search because the fusion vector $\theta=(\gamma_{\text{clip}},\gamma_{\text{llr}},\alpha)$ has only three scalar degrees of freedom; a fixed grid $\mathcal{G}$ yields a deterministic and stable selection of $\theta^\star$ and avoids introducing any gradient-based optimization into the calibration stage, in line with our deployment setting. At test time, the adapter remains fully training-free: for each new few-shot task it performs only closed-form posterior-predictive updates in the Student-t head and evaluates Eqs. (16) and (17) with the frozen $\theta^\star$, with no per-task tuning.
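A minimal sketch of the reliability-driven grid search of Eqs. (18) and (19), reusing the fused_probabilities helper from the previous sketch; the grid values, the data structure of the auxiliary pool, and the accuracy-slack value are illustrative assumptions.

```python
import numpy as np
from itertools import product

def calibrate_theta(dev_tasks, grid_clip, grid_llr, grid_alpha, a_guard=0.01):
    """Select theta* on the auxiliary pool by NLL under an accuracy guard, Eqs. (18)-(19).

    dev_tasks: list of (clip_logits, llr_scores, gate, labels) tuples, one per auxiliary task.
    """
    candidates = []
    for theta in product(grid_clip, grid_llr, grid_alpha):
        nll_sum, correct, total = 0.0, 0, 0
        for logits, llrs, gate, y in dev_tasks:
            p = fused_probabilities(logits, llrs, gate, theta)
            nll_sum += -np.log(p[np.arange(len(y)), y] + 1e-12).sum()   # Eq. (18)
            correct += int((p.argmax(axis=-1) == y).sum())
            total += len(y)
        candidates.append((theta, nll_sum / total, correct / total))
    a_max = max(acc for _, _, acc in candidates)
    feasible = [c for c in candidates if c[2] >= a_max - a_guard]       # accuracy guard, Eq. (19)
    return min(feasible, key=lambda c: c[1])[0]                          # lowest NLL wins
```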

3.10 Test-Time Prediction with Frozen Fusion Parameters

Algorithm 1 describes how LiRA-CLIP adapts itself to a new few-shot task exclusively by means of closed-form computations and a globally calibrated, frozen fusion rule. Given the cached streams and the gate, the final phase applies the globally calibrated fusion parameters $\theta^\star$ to each test image. For each $x^{\text{test}}\in\mathcal{D}^{\text{test}}$, the algorithm forms fused logits

$$u_c=\gamma_{\text{clip}}\, s_c^{\text{clip}}(z^{\text{test}})+\alpha\, g(z^{\text{test}})\,\gamma_{\text{llr}}\,\hat r_c(\tilde z^{\text{test}}), \tag{20}$$

and then converts these fused logits by a softmax into class probabilities $p_{\theta^\star}(y=c\mid x^{\text{test}})$; we evaluate $p_\theta$ in Eq. (17) at $\theta=\theta^\star$ and at the features $(z^{\text{test}}, \tilde z^{\text{test}})$ derived from $x^{\text{test}}$, with:

$$p_{\theta^\star}(y=c\mid x^{\text{test}})=\frac{\exp\!\left(u_c(z^{\text{test}},\tilde z^{\text{test}};\theta^\star)\right)}{\sum_{c'\in\mathcal{C}}\exp\!\left(u_{c'}(z^{\text{test}},\tilde z^{\text{test}};\theta^\star)\right)}, \tag{21}$$

where the vector $\theta^\star$ is not updated on the new task; adaptation is completely training-free and reduces to closed-form posterior-predictive updates in the Student-t head together with evaluation of this frozen fusion rule.

[Algorithm 1]

4  Experiments

4.1 Experimental Setting

4.1.1 Datasets and Protocol

In line with earlier work on CLIP adapters [6,7,10,17], we evaluate on six established vision benchmarks: Oxford Pets [32], Caltech101 [33], FGVC-Aircraft [34], DTD [35], EuroSAT [36], and UCF101 [37]. Taken together, these datasets cover generic object recognition, fine-grained categories, textures, remote sensing, and action recognition. We adopt the strict few-shot adaptation protocol from [3,6,17]: for each task and each class we draw uniformly $K$ labeled support examples from the training split, with $K\in\{1,2,4,8,16,32\}$, and use the official test partition for evaluation. All reported LiRA-CLIP results are averaged over three random seeds. Unless stated otherwise, we use a single set of hyperparameters and a globally calibrated fusion vector $\theta^\star$ that is fixed for all tasks; at test time, there is no task-specific tuning or additional supervision.

4.1.2 Baselines

We compare LiRA-CLIP with eight recent CLIP adapters: Linear Probing (LP) [1], TIP-Adapter and TIP-Adapter-f [10], TaskRes [6], CrossModal [38], BayesAdapter [17], LP++ [18], and CLAP [3]. It is noteworthy that all baseline results reported in this work are taken from [17].

4.1.3 Implementation Details

We compute CLIP features with two common visual encoders, ResNet-50 [39] and ViT-B/16 [40]; unless indicated otherwise, ablation studies are carried out on ResNet-50. During feature extraction we apply random zoom, crop, and horizontal flip augmentations, following [3,6,17], and we reuse the same text prompt templates. For the fine-tuning ablation LiRA-CLIP-F, in which only the fusion parameters are updated while CLIP and the generative head remain frozen, we adopt the training configuration from [17]: 300 epochs, batch size 256, and SGD with momentum 0.9 and a learning rate of 0.1. All experiments for LiRA-CLIP and LiRA-CLIP-F are averaged over three random seeds. We report the mean performance in the main text, while standard errors and dataset-specific results are provided in the appendices.

4.2 Analysis of the Experimental Results

4.2.1 Calibration

We first investigate how well confidence values reflect actual correctness. In accordance with common practice, we report the Expected Calibration Error (ECE) [17,41] as well as its adaptive variant AECE, which reduces the bias that arises when standard ECE is influenced by bins with insufficient or zero sample size [17,42]. ECE partitions the predictions into B confidence bins and averages the absolute difference between empirical accuracy and mean confidence in each bin; we use B=10. AECE keeps the same definition but chooses the bin boundaries such that each bin contains approximately the same number of samples, thereby reducing artefacts due to sparsely populated regions of the confidence range.
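For concreteness, a minimal sketch of the ECE and AECE computations described above; the implementation details (bin-edge handling, use of quantiles for equal-mass bins) are standard choices rather than quoted from the paper.

```python
import numpy as np

def calibration_error(probs, labels, n_bins=10, adaptive=False):
    """ECE with equal-width bins (B = 10) or AECE with approximately equal-mass bins."""
    conf = probs.max(axis=-1)                                   # top-1 confidence
    correct = (probs.argmax(axis=-1) == labels).astype(float)   # 1 if the prediction is correct
    if adaptive:
        edges = np.quantile(conf, np.linspace(0.0, 1.0, n_bins + 1))  # equal-mass bins (AECE)
    else:
        edges = np.linspace(0.0, 1.0, n_bins + 1)                     # equal-width bins (ECE)
    err = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():   # weight each bin by its share of samples
            err += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return err
```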

4.2.2 Overall Accuracy and Calibration

We report means over six datasets and six shot regimes ($K\in\{1,2,4,8,16,32\}$), using three random seeds per setting on two backbones. The results are shown in Table 1. On ResNet-50, LiRA-CLIP attains the highest average top-1 accuracy (68.44%), remaining essentially on a par with the strongest previous adapters such as CrossModal, CLAP, and BayesAdapter (67.85%–68.26%). At the same time, LiRA-CLIP exhibits a clearly recognizable gain in probabilistic reliability: its ECE and AECE (2.49% and 2.45%, respectively) correspond to a reduction of about 40%–45% compared to the best-calibrated baseline (BayesAdapter, 4.32% and 4.24%), whereas strongly accuracy-oriented methods such as LP++, CLAP, and Tip-Adapter-f show markedly higher calibration errors. For ViT-B/16, LiRA-CLIP remains competitive in accuracy (74.36%), with a performance that is statistically not distinguishable from the strongest baselines (CrossModal, CLAP, TaskRes; 74.16%–74.42% with overlapping standard deviations). In this setting as well, LiRA-CLIP attains the best calibration, reducing ECE from 3.46%–4.19% (CrossModal, BayesAdapter) to 2.28% and AECE from 3.38%–4.14% to 2.36%. Across both architectures, LiRA-CLIP thus offers a fully training-free adaptation with state-of-the-art calibration and competitive top-1 accuracy and consistently delivers more reliable probability estimates than all considered CLIP-adapter baselines. For more detailed results, see Appendix A.1, Table A1, Appendix A.2, Table A2, and Appendix A.3, Table A3.

4.2.3 Selective Classification at High Confidence

We evaluate selective classification under high-confidence conditions, a central requirement in safety-critical deployment scenarios [17,43]. Given a confidence threshold τ, the classifier only predicts for those points whose maximum class probability exceeds τ, and abstains on all others. Following [17,44,45], we call a method reliable at level X% if its accuracy on the selected subset is at least X%. Under this side constraint, the objective is to maximize coverage, that is, the fraction of test examples for which the system issues a prediction. Table 2 reports overall reliable coverage at reliability levels of 95% and 99% on the test set, where LiRA-CLIP attains the highest coverage consistently on both backbones. On ResNet-50, it improves 99% reliable coverage from 10.8% to 17.6% and 95% reliable coverage from 22.2% to 28.3% relative to BayesAdapter, the strongest previous uncertainty baseline. On ViT-B/16, LiRA-CLIP increases 95% and 99% reliable coverage from 31.1% and 16.6% to 36.0% and 21.9%, respectively. These aggregated results support our central claim that LiRA-CLIP provides a reliability-focused, training-free adaptation with superior coverage at high confidence levels.
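The following sketch shows one simple way to compute reliable coverage at a target accuracy, sweeping confidence thresholds from most to least confident; ties in confidence and the exact threshold convention are handled in a simplified way here and are assumptions, not taken from the paper.

```python
import numpy as np

def reliable_coverage(probs, labels, target_acc=0.95):
    """Largest fraction of test points that can be answered while the accuracy
    on the selected (most-confident) subset stays at or above target_acc."""
    conf = probs.max(axis=-1)
    correct = (probs.argmax(axis=-1) == labels).astype(float)
    order = np.argsort(-conf)                                # most confident first
    cum_acc = np.cumsum(correct[order]) / np.arange(1, len(correct) + 1)
    valid = np.where(cum_acc >= target_acc)[0]
    return 0.0 if len(valid) == 0 else (valid[-1] + 1) / len(correct)
```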

4.3 Per-Dataset Selective Classification

In the low-shot regime ($K\in\{1,2,4\}$), LiRA-CLIP systematically enlarges the set of test points for which high-confidence, accuracy-controlled decisions can be made. For instance, at a target accuracy of 95% and 99% on Caltech101, it increases coverage on ResNet-50 from about 59%–70% (BayesAdapter) to 65%–72% and on ViT-B/16 from 69%–77% to 78%–83%; on OxfordPets, LiRA-CLIP roughly doubles reliable coverage compared to the strongest baselines. By contrast, many prompt- and cache-based adapters either fail to satisfy the 95%–99% accuracy side constraints (which leads to ✗) or attain only marginal coverage in the extremely low-shot regime, whereas LiRA-CLIP remains calibrated and non-degenerate across datasets. As the number of shots increases ($K\ge 8$), LiRA-CLIP continues to perform strongly, but parameter-based methods such as BayesAdapter and CrossModal can match or slightly surpass its high-confidence coverage on some texture and remote-sensing tasks (DescribableTextures, EuroSAT). This pattern is consistent with our design of a training-free adapter that deliberately trades a small amount of high-shot coverage in favor of conservative, well-calibrated selective decisions in the low-data regimes where training-based uncertainty estimates are particularly fragile. To illustrate this behavior, Fig. 2 shows 99%-reliable coverage as a function of the number of shots ($K\in\{1,2,4\}$) for the strongest competing baselines and LiRA-CLIP on six representative benchmarks with the ResNet-50 backbone. Across all datasets, the LiRA-CLIP curves in the low-shot regime lie on or above those of the baselines and translate a strict 99% reliability target into a substantial increase in coverage, while the method remains fully training-free. Full numerical results are given in Appendix A.4, Table A4.

[Figure 2]

Figure 2: 99%-reliable coverage (%) on ResNet-50 across low-shot regimes ($K\in\{1,2,4\}$), broken down per dataset. Each curve shows how many test examples a method can classify while satisfying a 99% accuracy constraint.

4.4 Ablation Study

Our main method, LiRA-CLIP, is completely training-free. To assess whether the same posterior-predictive fusion architecture remains effective under limited supervised fine-tuning, we additionally consider LiRA-CLIP-F, which fine-tunes exclusively the three scalar fusion parameters by gradient descent, while all CLIP and generative-model parameters remain frozen. Table 3 reports ablation results on EuroSAT. The training-free LiRA-CLIP already attains accuracy comparable to or slightly higher than recent adapters (64.9% vs. 64.7% for TaskRes and 64.6% for BayesAdapter) and at the same time reduces ECE from 4.8–10.6 to 2.9. Permitting mild fine-tuning of the fusion parameters (LiRA-CLIP-F) further increases accuracy to 69.1%, while maintaining competitive calibration (ECE 4.5%). This suggests that posterior-predictive likelihood-ratio fusion remains effective even when it is deployed as a small trainable adapter. In the extreme low-shot regime (1–2 shots), however, the labeled support set provides only very limited information to reliably re-estimate even a small number of fusion parameters. In this setting, task-specific fine-tuning of $(\gamma_{\text{clip}},\gamma_{\text{llr}},\alpha)$ tends to improve accuracy, but at the price of overconfident, less stable probability estimates, whereas the fully training-free variant preserves the cross-task posterior-predictive structure calibrated once on auxiliary tasks. In line with this bias-variance intuition, our per-shot ablations (1, 2, 4 shots on EuroSAT) show that LiRA-CLIP-F indeed achieves higher accuracy, but systematically worse calibration and high-confidence coverage than LiRA-CLIP, which underscores that the fully training-free, posterior-predictive formulation is particularly advantageous for reliability under extreme data scarcity. For full numerical results see Appendix B.1, Table A5.

[Table 3]

We conduct another ablation to assess the sensitivity of LiRA-CLIP to the normal-inverse-gamma prior hyperparameters $(\kappa_0,\alpha_0,\beta_0)$. Table 4 illustrates this sensitivity: sweeping the NIG hyperparameters has only a minor effect on accuracy and reliability, confirming that performance is driven by the structure of the posterior-predictive model rather than by fine-tuning of prior scales. We report top-1 accuracy (Accuracy %), ECE (%), AECE (%), and 99% reliable selective coverage (Sel@99%).

[Table 4]

4.4.1 Computational Complexity and Runtime

To characterize computational complexity, let $N$ be the number of classes (ways), $K$ the number of shots per class, $D$ the CLIP feature dimension, and $N_{\text{test}}$ the number of test images in an episode. LiRA-CLIP is training-free at task time, conditioned on frozen CLIP features; it computes only closed-form sufficient statistics and t-predictive parameters. Excluding CLIP feature extraction, the one-off episode setup processes the $NK$ support features in a single pass and costs $\mathcal{O}(NKD)$ time, storing class-wise parameters with $\mathcal{O}(ND)$ memory (or $\mathcal{O}(NKD)$ only when explicitly caching all support features). At inference, each test image evaluates $N$ class-wise diagonal t-predictive scores and fuses them with the CLIP logits, which is $\mathcal{O}(ND)$ time per image and fully vectorized (no backpropagation).

4.4.2 Paired Non-Parametric Significance Analysis

The main objective of LiRA-CLIP is to improve probabilistic reliability under extreme low-shot and deployment constraints while preserving few-shot accuracy. We evaluated its statistical stability at the level of dataset × shot operating points. Specifically, we treated each dataset and shot regime as one paired observation and applied paired Wilcoxon signed-rank and sign tests, complemented by a paired permutation (sign-flip) test on the mean improvement. Following standard multi-dataset comparison practice [46], we controlled family-wise error with Holm correction. We conducted the paired non-parametric tests on ResNet-50. LiRA-CLIP maintains accuracy parity with the strongest baseline adapters; per-setting accuracy differences are small ($p_{\text{Holm}}=0.194$). In contrast, the reliability gains are large and consistent across settings, and LiRA-CLIP significantly improved calibration and selective decision-making: ECE improved in 25/36 dataset × shot settings ($p_{\text{Holm}}=1.9\times10^{-3}$), and selective coverage improved in 17/18 settings for both Sel@95% and Sel@99% ($p_{\text{Holm}}=5.0\times10^{-4}$ and $5.7\times10^{-4}$, respectively). Further details are reported in Appendix C.1, Table A6.
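As a minimal sketch of this analysis, the following code applies paired Wilcoxon signed-rank tests with Holm step-down correction to per-setting differences; the data layout and helper name are hypothetical and serve only to illustrate the procedure.

```python
from scipy.stats import wilcoxon

def holm_corrected_wilcoxon(diffs_per_metric):
    """Paired Wilcoxon signed-rank tests with Holm correction over several metrics.

    diffs_per_metric: dict mapping metric name -> array of per-setting paired
    differences (LiRA-CLIP minus baseline), one entry per dataset x shot setting.
    """
    raw = {name: wilcoxon(d).pvalue for name, d in diffs_per_metric.items()}
    m = len(raw)
    corrected, running_max = {}, 0.0
    for rank, (name, p) in enumerate(sorted(raw.items(), key=lambda kv: kv[1])):
        adj = min(1.0, (m - rank) * p)          # Holm step-down factor
        running_max = max(running_max, adj)     # enforce monotonicity of adjusted p-values
        corrected[name] = running_max
    return corrected
```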

4.4.3 Implementation and Measured Cost

LiRA-CLIP performs no task-time optimization (its training time is zero). We therefore report in Table 5 its test-time inference cost as average latency per image (ms/img): the adapter-only cost, which excludes CLIP image encoding and captures only the closed-form scoring and fusion stage, and the end-to-end latency, which includes CLIP encoding plus the evaluation stage. For BayesAdapter, we report the adaptation time averaged over three independent seeds (in seconds). All LiRA-CLIP experiments were run on an NVIDIA Tesla T4 GPU (16 GB VRAM) with PyTorch 2.9.0, CUDA 12.6, and cuDNN 9.1.

[Table 5]

4.4.4 LiRA-CLIP Seed-Sensitivity Analysis

We verified the sensitivity of the LiRA-CLIP results to the choice of a three-seed protocol by conducting a robustness check on Caltech101 (ResNet-50) using five random seeds. The five-seed estimates (mean ± SE) align closely with LiRA-CLIP's original findings, showing negligible deviations: at most 0.30 pp in accuracy, 0.14–0.18 in calibration (ECE/AECE), and under 0.90 pp in selective coverage. These results confirm the stability of LiRA-CLIP's core trends; details are given in Table 6.

[Table 6]

4.4.5 Sensitivity to the Accuracy Guard

The reliability-driven calibration in Eq. (19) includes an accuracy-guard constraint, $A(\theta)\ge A_{\max}-A_{\text{guard}}$, to protect discriminative performance while optimizing the NLL. To assess how sensitive the resulting global fusion parameters are to this choice, we vary $A_{\text{guard}}\in\{0.005, 0.01, 0.02\}$ on a representative calibration setting (Caltech101, 1-shot, ResNet-50; three random seeds) and report the resulting top-1 accuracy, calibration errors (ECE and AECE), and reliable selective coverage at 95% and 99% target accuracy. Table 7 summarizes the results.

[Table 7]

5  Conclusion

In this article we introduced LiRA-CLIP, a training-free, text-conditioned, posterior-predictive likelihood-ratio adapter for few-shot CLIP classification, designed to improve probabilistic reliability in the extremely low-shot regime. LiRA-CLIP operates in a background-whitened CLIP feature space, placing diagonal Normal-Inverse-Gamma priors over both the class-conditional distributions and a pooled background; this leads to a Student-t posterior-predictive likelihood-ratio stream (t-PLLR) whose heavy tails account for distributional outliers and data scarcity. We build a two-stream fusion mechanism that combines the generative t-PLLR scores with the zero-shot CLIP logits through a lightweight global calibration layer, without gradient-based, task-specific tuning and without access to the CLIP weights. For a new few-shot task, LiRA-CLIP performs training-free adaptation via closed-form posterior-predictive updates and evaluation of a frozen fusion rule. Experimental results on six standard benchmarks confirm the robust adaptability and reliability of LiRA-CLIP; notably, ablation studies confirm that these reliability gains stem from the proposed posterior-predictive likelihood-ratio architecture rather than from brittle hyperparameter tuning. This establishes LiRA-CLIP as a simple, efficient route to training-free, reliably calibrated few-shot CLIP adaptation. By design, LiRA-CLIP trades a small amount of high-shot accuracy for substantially improved low-shot reliability, making it the preferable choice when labels are scarce and complementary to fully fine-tuned adapters in data-rich regimes. For future work, we will explore lightweight task-aware refinements and richer generative components to narrow the remaining high-shot gap while preserving the method's training-free character.

Acknowledgement: The authors wish to express their gratitude to Prince Sultan University for their support.

Funding Statement: This research was funded by the National Natural Science Foundation of China, grant numbers U23A20321 and 62272490. The authors would also like to thank Prince Sultan University for paying the APC of this article.

Author Contributions: The authors confirm contribution to the paper as follows: Conceptualization, Mustafa Qaid Khamisi, Zuping Zhang and Mohammed Al-Habib; methodology, Mustafa Qaid Khamisi; software, Mustafa Qaid Khamisi and Mohammed Al-Habib; validation, Mustafa Qaid Khamisi, Zuping Zhang and Mohammed Al-Habib; formal analysis, Mustafa Qaid Khamisi and Zuping Zhang; investigation, Mustafa Qaid Khamisi, Zuping Zhang and Mohammed Al-Habib; resources, Mustafa Qaid Khamisi and Zuping Zhang; data curation, Mustafa Qaid Khamisi and Zuping Zhang; writing—original draft preparation, Mustafa Qaid Khamisi; writing—review and editing, Mustafa Qaid Khamisi, Zuping Zhang, Mohammed Al-Habib, Muhammad Asim and Sajid Shah; visualization, Mustafa Qaid Khamisi and Mohammed Al-Habib; supervision, Zuping Zhang; project administration, Mustafa Qaid Khamisi and Zuping Zhang; funding acquisition, Muhammad Asim and Sajid Shah. All authors reviewed and approved the final version of the manuscript.

Availability of Data and Materials: The data that support the findings of this study are available from the Corresponding Author, Zuping Zhang, upon reasonable request.

Ethics Approval: Not applicable.

Conflicts of Interest: The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

CLIP Contrastive Language–Image Pre-Training
LiRA Likelihood Ratio Adapter
t-PLLR Student-t Posterior-Predictive Likelihood Ratio
NLL Negative Log-Likelihood
ECE Expected Calibration Error
AECE Adaptive Expected Calibration Error
NIG Normal Inverse Gamma

Appendix A Detailed Numerical Values for the Reported Metrics in Manuscript

Appendix A.1 Per-Dataset AECE (%, Lower Is Better) on ResNet50. Results Are Reported over Three Random Seeds:

[Table A1]

Appendix A.2 Detailed ECE Results of LiRA-CLIP in Comparison with Baseline Methods:

[Table A2]

Appendix A.3 Detailed Accuracy Results of LiRA-CLIP in Comparison with Baseline Methods:

[Table A3]

Appendix A.4 99%-Reliable Prediction of LiRA-CLIP in Comparison with Baseline Methods:

[Table A4]

Appendix B Detailed Numerical Values for Ablation Study

Appendix B.1 Ablation Results for LiRA-CLIP-F in Comparison with Baseline Methods:

[Table A5]

Appendix C Statistical Significance and Robustness Analyses

Appendix C.1 Paired Non-Parametric Significance Test:

[Table A6]

References

1. Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, et al. Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. London, UK: PMLR; 2021. p. 8748–63. [Google Scholar]

2. Jia C, Yang Y, Xia Y, Chen YT, Parekh Z, Pham H, et al. Scaling up visual and vision-language representation learning with noisy text supervision. In: International Conference on Machine Learning. London, UK: PMLR; 2021. p. 4904–16. [Google Scholar]

3. Silva-Rodriguez J, Hajimiri S, Ben Ayed I, Dolz J. A closer look at the few-shot adaptation of large vision-language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway, NJ, USA: IEEE; 2024. p. 23681–90. [Google Scholar]

4. Zhou K, Yang J, Loy CC, Liu Z. Conditional prompt learning for vision-language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway, NJ, USA: IEEE; 2022. p. 16816–25. [Google Scholar]

5. Zhu B, Niu Y, Han Y, Wu Y, Zhang H. Prompt-aligned gradient for prompt tuning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. Piscataway, NJ, USA: IEEE; 2023. p. 15659–69. [Google Scholar]

6. Yu T, Lu Z, Jin X, Chen Z, Wang X. Task residual for tuning vision-language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway, NJ, USA: IEEE; 2023. p. 10899–909. [Google Scholar]

7. Gao P, Geng S, Zhang R, Ma T, Fang R, Zhang Y, et al. Clip-adapter: better vision-language models with feature adapters. Int J Comput Vis. 2024;132(2):581–95. [Google Scholar]

8. Zhu X, Zhang R, He B, Zhou A, Wang D, Zhao B, et al. Not all features matter: enhancing few-shot clip with adaptive prior refinement. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. Piscataway, NJ, USA: IEEE; 2023. p. 2605–15. [Google Scholar]

9. Song L, Xue R, Wang H, Sun H, Ge Y, Shan Y, et al. Meta-adapter: an online few-shot learner for vision-language model. Adv Neural Inf Process Syst. 2023;36:55361–74. doi:10.52202/075280-2416. [Google Scholar] [CrossRef]

10. Zhang R, Zhang W, Fang R, Gao P, Li K, Dai J, et al. Tip-adapter: training-free adaption of CLIP for few-shot classification. In: Computer vision—ECCV 2022. Cham, Switzerland: Springer Nature; 2022. p. 493–510. doi:10.1007/978-3-031-19833-5_29. [Google Scholar] [CrossRef]

11. Kato N, Nota Y, Aoki Y. Proto-adapter: efficient training-free CLIP-adapter for few-shot image classification. Sensors. 2024;24(11):3624. [Google Scholar] [PubMed]

12. Wang Z, Liang J, Sheng L, He R, Wang Z, Tan T. A hard-to-beat baseline for training-free CLIP-based adaptation. arXiv:2402.04087. 2024. [Google Scholar]

13. Bendou Y, Ouasfi A, Gripon V, Boukhayma A. ProKeR: a kernel perspective on few-shot adaptation of large vision-language models. In: Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition Conference. Piscataway, NJ, USA: IEEE; 2025. p. 25092–102. [Google Scholar]

14. Li D, Wang R. Text-guided dual feature enhancement: a training-free paradigm for few-shot learning with CLIP. In: 2025 IEEE 6th International Seminar on Artificial Intelligence, Networking and Information Technology (AINIT). Piscataway, NJ, USA: IEEE; 2025. p. 1937–40. doi:10.1109/ainit65432.2025.11035349. [Google Scholar] [CrossRef]

15. Guo Z, Zhang R, Qiu L, Ma X, Miao X, He X, et al. Calip: zero-shot enhancement of clip with parameter-free attention. In: Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto, CA, USA: AAAI Press; 2023. p. 746–54. [Google Scholar]

16. Chen X, Li Y, Chen H. Dual-adapter: training-free dual adaptation for few-shot out-of-distribution detection. arXiv:2405.16146. 2024. [Google Scholar]

17. Morales-Álvarez P, Christodoulidis S, Vakalopoulou M, Piantanida P, Dolz J. BayesAdapter: enhanced uncertainty estimation in CLIP few-shot adaptation. arXiv:2412.09718. 2024. [Google Scholar]

18. Huang Y, Shakeri F, Dolz J, Boudiaf M, Bahig H, Ben Ayed I. Lp++: a surprisingly strong linear probe for few-shot clip. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway, NJ, USA: IEEE; 2024. p. 23773–82. [Google Scholar]

19. Yoon HS, Yoon E, Tee JTJ, Hasegawa-Johnson M, Li Y, Yoo CD. C-TPT: calibrated test-time prompt tuning for vision-language models via text feature dispersion. arXiv:2403.14119. 2024. [Google Scholar]

20. Upadhyay U, Karthik S, Mancini M, Akata Z. Probvlm: probabilistic adapter for frozen vison-language models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. Piscataway, NJ, USA: IEEE; 2023. p. 1899–910. [Google Scholar]

21. Oh C, Lim H, Kim M, Han D, Yun S, Choo J, et al. Towards calibrated robust fine-tuning of vision-language models. Adv Neural Inf Process Syst. 2024;37:12677–707. doi:10.52202/079017-0403. [Google Scholar] [CrossRef]

22. Silva-Rodríguez J, Ben Ayed I, Dolz J. Conformal prediction for zero-shot models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ, USA: IEEE; 2025. p. 19931–41. [Google Scholar]

23. Liu J, Shen J, Zhou P, Sonke JJ, Gavves E. Probabilistic prototype calibration of vision-language models for generalized few-shot semantic segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway, NJ, USA: IEEE; 2025. p. 21155–65. [Google Scholar]

24. Venkataramanan A, Bodesheim P, Denzler J. Probabilistic embeddings for frozen vision-language models: uncertainty quantification with gaussian process latent variable models. In: Chiappa S, Magliacane S, editors. Proceedings of the Forty-First Conference on Uncertainty in Artificial Intelligence. Vol. 286 of Proceedings of Machine Learning Research. London, UK: PMLR; 2025. p. 4309–28. [Google Scholar]

25. Alparone L, Arienzo A, Lombardini F. Improved coherent processing of synthetic aperture radar data through speckle whitening of single-look complex images. Remote Sens. 2024;16(16):2955. doi:10.3390/rs16162955. [Google Scholar] [CrossRef]

26. Pan X, Zhan X, Shi J, Tang X, Luo P. Switchable whitening for deep representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. Piscataway, NJ, USA: IEEE; 2019. p. 1863–71. [Google Scholar]

27. Cai M, van Buuren S, Vink G. Joint distribution properties of fully conditional specification under the normal linear model with normal inverse-gamma priors. Sci Rep. 2023;13(1):644. doi:10.1038/s41598-023-27786-y. [Google Scholar] [PubMed] [CrossRef]

28. Griffin JE, Brown PJ. Inference with normal-gamma prior distributions in regression problems. Bayesian Anal. 2010;5(1):171–88. doi:10.1214/10-ba507. [Google Scholar] [CrossRef]

29. Geweke J. Bayesian treatment of the independent student-t linear model. J Appl Econom. 1993;8(S1):S19–40. doi:10.1002/jae.3950080504. [Google Scholar] [CrossRef]

30. Dunn R, Ramdas A, Balakrishnan S, Wasserman L. Gaussian universal likelihood ratio testing. Biometrika. 2023;110(2):319–37. doi:10.1093/biomet/asac064. [Google Scholar] [CrossRef]

31. Yodnual S, Chumnaul J. Signed log-likelihood ratio test for the scale parameter of Poisson Inverse Weibull distribution with the development of PIW4LIFETIME web application. PLoS One. 2025;20(8):e0329293. doi:10.1371/journal.pone.0329293. [Google Scholar] [PubMed] [CrossRef]

32. Parkhi OM, Vedaldi A, Zisserman A, Jawahar C. Cats and dogs. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ, USA: IEEE; 2012. p. 3498–505. [Google Scholar]

33. Fei-Fei L, Fergus R, Perona P. Learning generative visual models from few training examples: an incremental bayesian approach tested on 101 object categories. In: 2004 Conference on Computer Vision and Pattern Recognition Workshop. Piscataway, NJ, USA: IEEE; 2004. [Google Scholar]

34. Maji S, Rahtu E, Kannala J, Blaschko M, Vedaldi A. Fine-grained visual classification of aircraft. arXiv:1306.5151. 2013. [Google Scholar]

35. Cimpoi M, Maji S, Kokkinos I, Mohamed S, Vedaldi A. Describing textures in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ, USA: IEEE; 2014. p. 3606–13. [Google Scholar]

36. Helber P, Bischke B, Dengel A, Borth D. Eurosat: a novel dataset and deep learning benchmark for land use and land cover classification. IEEE J Sel Top Appl Earth Obs Remote Sens. 2019;12(7):2217–26. [Google Scholar]

37. Soomro K, Zamir AR, Shah M. Ucf101: a dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402. 2012. [Google Scholar]

38. Lin Z, Yu S, Kuang Z, Pathak D, Ramanan D. Multimodality helps unimodality: cross-modal few-shot learning with multimodal models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway, NJ, USA: IEEE; 2023. p. 19325–37. [Google Scholar]

39. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ, USA: IEEE; 2016. p. 770–8. [Google Scholar]

40. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, et al. An image is worth 16 × 16 words: transformers for image recognition at scale. arXiv:2010.11929. 2021. [Google Scholar]

41. Guo C, Pleiss G, Sun Y, Weinberger KQ. On calibration of modern neural networks. In: International Conference on Machine Learning. London, UK: PMLR; 2017. p. 1321–30. [Google Scholar]

42. Nixon J, Dusenberry MW, Zhang L, Jerfel G, Tran D. Measuring calibration in deep learning. arXiv:1904.01685. 2019. [Google Scholar]

43. Dadalto Câmara Gomes E, Romanelli M, Pichler G, Piantanida P. A data-driven measure of relative uncertainty for misclassification detection. In: Kim B, Yue Y, Chaudhuri S, Fragkiadaki K, Khan M, Sun Y, editors. International Conference on Learning Representations. Red Hook, NY, USA: Curran Associates, Inc.; 2024. p. 21826–48. [Google Scholar]

44. Geifman Y, El-Yaniv R. Selective classification for deep neural networks. In: NIPS’17: Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY, USA: Curran Associates, Inc.; 2017. p. 4885–94. [Google Scholar]

45. Wu YC, Lyu SH, Shang H, Wang X, Qian C. Confidence-aware contrastive learning for selective classification. In: Salakhutdinov R, Kolter Z, Heller K, Weller A, Oliver N, Scarlett J, editors. Proceedings of the 41st International Conference on Machine Learning. Vol. 235 of Proceedings of Machine Learning Research. London, UK: PMLR; 2024. p. 53706–29. [Google Scholar]

46. Demšar J. Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res. 2006;7:1–30. [Google Scholar]




Copyright © 2026 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.