Open Access

ARTICLE

Privacy-Preserving Transformer Inference with Optimized Homomorphic Encryption and Secure Collaborative Computing

Tao Bai1, Yang Tang2, Kuan Shao3, Zhenyong Zhang3,*, Yuanteng Liu4

1 Guizhou Provincial Meteorological Data Center, Guiyang, China
2 Technical Department of the People’s Procuratorate of Guizhou Province, Guiyang, China
3 College of Computer Science and Technology, Guizhou University, Guiyang, China
4 Colorful Guizhou Digital Technology Co., Ltd., Guiyang, China

* Corresponding Author: Zhenyong Zhang.

Computers, Materials & Continua 2026, 88(1), 52 https://doi.org/10.32604/cmc.2026.078473

Abstract

In recent years, the rapid development of artificial intelligence has greatly promoted the application of Machine Learning as a Service (MLaaS). Users upload their requirements through front-end applications, and the server provides model inference services after receiving the user input. However, MLaaS may lead to serious privacy breaches. Large language model services are typical representatives of MLaaS, and the Transformer is the typical structure underlying large language models. Therefore, this paper proposes a privacy-preserving Transformer inference scheme based on the CKKS fully homomorphic encryption scheme, optimized for computational and communication efficiency. First, this paper implements efficient matrix multiplication based on ring multiplication and optimizes the matrix-partitioning parameters to adapt to different multiplication types (ciphertext-plaintext and ciphertext-ciphertext) and different matrix dimensions. Second, this paper designs optimized secure Softmax, LayerNorm, and Gelu protocols based on parameter obfuscation and collaborative computing to perform efficient, secure atomic computations over ciphertexts. Finally, text-classification experiments were conducted on the IMDB and AGNEWS datasets. The results show that, under our experimental settings (an AMD Ryzen 7 5700G CPU with 32 GB RAM and 8-thread parallel computing using the Lattigo library), the proposed scheme completes the inference process within 3 s, keeps communication costs below 1 GB, and achieves computing accuracy comparable to that of plaintext computing.

Keywords

Machine learning as a service; privacy preservation; Transformer; collaborative computing

1  Introduction

In recent years, Artificial Intelligence (AI) has witnessed rapid development. Large models such as DeepSeek [1] and ChatGPT [2] have been widely adopted, and Machine Learning as a Service (MLaaS) is increasingly becoming ubiquitous, achieving significant efficacy in fields such as medicine and economics. However, these applications also introduce concerns regarding the leakage of users’ private information. For instance, in 2017, due to a failure to implement effective encryption, Spira Toys leaked over 2.2 million voice messages collected from parents and children via smart toy devices, resulting in severe privacy infringements [3].

As a general-purpose neural network architecture, the Transformer leverages the attention mechanism [4] to enhance both model performance and interpretability. By focusing on capturing critical information in input sequence data, the attention mechanism enables the model to understand and process the data more effectively. Models based on the Transformer architecture, such as ChatGPT and the BERT series [5], have since emerged. The Transformer and its variants have been proven to possess robust natural language processing capabilities, spanning tasks such as knowledge reasoning [6] and image recognition [7]. However, with repeated model iterations, the total number of training parameters has increased significantly. Data indicates that GPT-3’s parameter count has reached a staggering 175 billion [8], necessitating substantial computational power and rendering model deployment extremely challenging. Consequently, large models are generally deployed by service providers on cloud servers. To use these services, users typically need to upload their data to the cloud for inference. This results in MLaaS providers holding increasing amounts of user information, often including personal data or commercial secrets. Once leaked, this could cause severe losses to users. Italy [9], for example, once announced a ban on the use of ChatGPT. Therefore, protecting users’ private information remains an open problem that must be addressed to realize AI applications based on the Transformer architecture.

Existing research has proposed several solutions; however, these schemes typically incur high communication and computational costs or degrade accuracy due to the substitution of computational processes. For instance, Scheme [10] employs Secure Multi-Party Computation (MPC) [11] to design a secure Transformer inference system. Although this enhances privacy protection, it incurs high communication costs and relies on the non-collusion of multiple parties. Furthermore, to improve computational efficiency, Scheme [12] optimized non-linear activation functions that are unfriendly to MPC. While this reduces computational costs, it requires retraining the model, thereby increasing computational overhead and reducing model accuracy. Consequently, achieving efficient, high-precision, privacy-preserving Transformer inference remains a challenge. Specifically, the challenges are twofold: first, improving the efficiency of the massive matrix operations involved in the inference process while reducing communication costs; and second, ensuring the security of the operation process while maintaining the computational accuracy of non-linear activation functions.

To this end, this paper proposes a privacy-preserving Transformer inference scheme based on Fully Homomorphic Encryption (FHE). We design a secure and efficient matrix multiplication method and propose secure computation protocols for Softmax, LayerNorm, and Gelu. The specific contributions are as follows:

•   We design a privacy-preserving Transformer inference framework based on CKKS FHE, optimized for computational and communication efficiency, which enables Transformer inference services while protecting user privacy.

•   Based on multiplication over rings, we implement efficient matrix multiplication in the ciphertext domain. We optimize matrix partitioning parameters to accommodate different types (including ciphertext-plaintext and ciphertext-ciphertext matrix multiplication) and varying dimensions, thereby achieving optimal matrix partitioning.

•   Leveraging parameter obfuscation and collaborative computing, we optimally design secure protocols for Softmax, LayerNorm, and Gelu to perform these computations in the ciphertext domain.

•   We conduct text classification experiments on the IMDB and AGNEWS datasets. The results demonstrate that the proposed scheme completes the inference process within 3 s, maintains communication overhead below 1 GB, and achieves computational accuracy comparable to that of plaintext calculation.

2  Related Work

Privacy-preserving inference efforts based on Transformers primarily rely on Secure Multi-Party Computation (MPC) frameworks. MPC enables multiple parties to jointly compute a function, ensuring that each party obtains the correct result upon completion without disclosing their private information.

Major works on MPC-based privacy-preserving Transformer inference include the following: Chen et al. [13] developed a matrix multiplication protocol based on customized homomorphic encryption and an optimization scheme for third-order non-linear functions by representing matrix operations as polynomial multiplications. Pang et al. [14] constructed a privacy inference framework that supports efficient matrix operations via dynamic compression strategies and higher-precision nonlinear function optimization, although its performance relies on intensive ciphertext rotation and model retraining. Luo et al. [15] constructed MPC protocols for complex non-linear functions such as Gelu, LayerNorm, and Softmax using numerical methods. Li et al. [16] combined MPC with knowledge distillation to approximate non-linear functions with linear ones. Chen et al. [17] proposed reconstructing the nonlinear parts of the Transformer using approximate substitution strategies to significantly accelerate secure inference. Akimoto et al. [18] effectively improved the computational efficiency of a Transformer variant by replacing the traditional Sigmoid function with the ReLU activation function. Zeng et al. [10] evaluated various attention mechanisms using differentiable algorithms to compose a more efficient one, thereby improving the accuracy and efficiency of privacy-preserving inference by replacing the native attention in Transformers.

However, due to the large parameter size of Transformer models, which involve extensive matrix multiplications (ciphertext-plaintext and ciphertext-ciphertext) and nonlinear function computations (Softmax, Gelu, and LayerNorm), existing methods generally face the dual challenges of high computational complexity and substantial communication overhead. To perform secure operations within the Transformer architecture, it is necessary to design more efficient security protocols. Furthermore, directly approximating nonlinear functions in the Transformer architecture introduces errors, thereby affecting computational accuracy.

To address the aforementioned issues, this paper proposes integrating MPC concepts with Homomorphic Encryption (HE) to achieve privacy protection. HE, proposed by Rivest et al. [19], ensures that results from computations on encrypted data are consistent with those from direct computations on plaintext data, effectively guaranteeing data confidentiality. Early research primarily focused on partially homomorphic encryption schemes, such as additive [20] and multiplicative [19] homomorphisms. Gentry [21] proposed the first Fully Homomorphic Encryption (FHE) framework and utilized bootstrapping to control noise growth, achieving homomorphic decryption for circuits of arbitrary depth for the first time. However, traditional FHE schemes are computationally intensive, and efficiency bottlenecks in parallel processing remain unresolved [22]. Among FHE optimization algorithms, the CKKS (Cheon-Kim-Kim-Song) scheme [23] proposed by Cheon et al. and its variant [24] RNS-CKKS (Residue-Number-System CKKS) have been widely studied for their efficient handling of floating-point numbers. Considering the integration with the Transformer architecture, this paper adopts a variant of the CKKS scheme.

In recent years, numerous works combining FHE with machine learning have emerged. Gilad-Bachrach et al. [25] were the first to combine FHE with traditional Deep Neural Networks (DNN). Subsequent studies [26,27] aimed to improve the communication and computational performance of models within FHE. Xu et al. [28] proposed a scheme dubbed BLB, which decomposes layers into fine-grained operators and fuses adjacent linear operators, reducing the need for HE/MPC conversions. Matrix multiplication is the primary computational process in Transformers; however, prior schemes have not accounted for the impact of matrix partitioning parameters on computational costs. This paper presents analytical models for ciphertext-plaintext and ciphertext-ciphertext matrix multiplications, respectively, to identify optimal matrix partitioning parameters for rapid matrix multiplication. Most previous schemes employ MPC for nonlinear computations, resulting in significant communication costs. To address this, for non-linear computations in Transformers, this paper employs parameter obfuscation to design secure and low-cost non-linear computation protocols based on FHE, specifically for Softmax, LayerNorm, and Gelu.

3  Preliminaries

3.1 Cheon-Kim-Kim-Song

Cheon-Kim-Kim-Song (CKKS) is a leveled FHE scheme that supports multiplications of bounded depth in encrypted form. The input message and the encrypted result of CKKS are elements of the polynomial ring:

R_𝒬 = ℤ_𝒬[X]/(X^N + 1), (1)

where 𝒬 = ∏_{i=0}^{L} q_i and the moduli q_i are pairwise distinct. Once the ciphertext level becomes too low, the Bootstrapping operation must be invoked to refresh it to a higher level, thereby enabling further computations.

CKKS supports Single Instruction Multiple Data (SIMD) [29] processing, which allows a vector z ∈ ℝ^{N/2} to be encrypted and its elements to be processed in batch. This enables the scheme to process encrypted data in parallel, thereby accelerating operations. To encrypt z in SIMD format, the encoding algorithm Ecd first encodes z into a polynomial in R_𝒬, and the encryption algorithm Enc then encrypts that polynomial. In the following, we introduce the basic operations of CKKS, writing [m] for a ciphertext of m:

•   Ecd & Dcd. Encodes a vector z ∈ ℝ^{N/2} into the plaintext ring, or decodes a plaintext element m ∈ R_𝒬 into a vector.

m = Ecd(z) ∈ R_𝒬,  z = Dcd(m) ∈ ℝ^{N/2}. (2)

•   Enc & Dec. Encrypts the plaintext m ∈ R_𝒬 into a ciphertext [m] ∈ R_𝒬², or decrypts the ciphertext [m] ∈ R_𝒬² into a plaintext.

[m] = Enc(m) ∈ R_𝒬²,  m = Dec([m]) ∈ R_𝒬. (3)

•   Add1. Adding the ciphertext [m_1] and the plaintext ring element m_2 yields a new ciphertext, the decryption of which is equivalent to the sum of the plaintext ring elements m_1 and m_2.

Dec(Add1([m_1], m_2)) = m_1 + m_2 ∈ R_𝒬. (4)

•   Add2. Adding the ciphertext [m_1] and the ciphertext [m_2] yields a new ciphertext, the decryption of which is equivalent to the sum of the plaintext ring elements m_1 and m_2.

Dec(Add2([m_1], [m_2])) = m_1 + m_2 ∈ R_𝒬. (5)

•   Mul1. Multiplying the ciphertext [m_1] by the plaintext ring element m_2 yields a new ciphertext, the decryption of which is equivalent to the product of the plaintext ring elements m_1 and m_2.

Dec(Mul1([m_1], m_2)) = m_1·m_2 ∈ R_𝒬. (6)

•   Mul2. Multiplying the ciphertext [m_1] by the ciphertext [m_2] yields a new ciphertext, the decryption of which is equivalent to the product of the plaintext ring elements m_1 and m_2.

Dec(Mul2([m_1], [m_2])) = m_1·m_2 ∈ R_𝒬. (7)

•   Rot. Rotating the ciphertext [m] corresponding to the vector z by i positions yields a new ciphertext, which is equivalent to the ciphertext of the vector z′ obtained by rotating z by i positions.

Rot([m], i) = Enc(Ecd(z′)) ∈ R_𝒬². (8)
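As a plaintext mock of the slot semantics above (no encoding, encryption, or noise; the class and method names are our own illustration, not the API of a real CKKS library such as Lattigo), the following sketch shows how the additions and multiplications act element-wise on slots while Rot cyclically shifts them:

```python
import numpy as np

# Plaintext mock of the CKKS SIMD interface: a "ciphertext" here is just its
# slot vector, so Add1/Add2 and Mul1/Mul2 act element-wise and Rot is a cyclic
# shift. (Real CKKS adds encoding, encryption, rescaling, and noise.)
class MockCKKS:
    def enc(self, z):
        return np.asarray(z, dtype=float).copy()

    def dec(self, ct):
        return ct.copy()

    def add(self, a, b):   # covers Add1 (ct + pt) and Add2 (ct + ct)
        return a + b

    def mul(self, a, b):   # covers Mul1 (ct * pt) and Mul2 (ct * ct)
        return a * b

    def rot(self, ct, i):  # Rot: rotate the slot vector by i positions
        return np.roll(ct, -i)

ckks = MockCKKS()
ct = ckks.enc([1.0, 2.0, 3.0, 4.0])
assert np.allclose(ckks.dec(ckks.rot(ct, 1)), [2.0, 3.0, 4.0, 1.0])
```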

3.2 Transformer Framework

The Transformer is a representative neural network architecture introduced by Google engineers for machine translation. It is widely used because its architecture, with a self-attention mechanism, improves inference accuracy on long-sequence data. For instance, in translation tasks, compared to Recurrent Neural Networks (RNNs), the Transformer requires less time to achieve the same accuracy on identical sequence data. The Transformer is composed of a stack of multiple encoders and decoders; each encoder primarily consists of two components: the Self-Attention Mechanism and the Feed-Forward Neural Network, as illustrated in the Transformer framework diagram in Fig. 1. The decoder’s architecture is fundamentally similar to that of the encoder and will not be elaborated further here.


Figure 1: The framework of Transformer.

As the core component of the Transformer, the self-attention mechanism allows the model to attend to information from all other positions in the sequence while processing each specific position, thereby dynamically computing the degree of association between positions. Through this mechanism, the model can better capture contextual semantic information within sequence elements, overcoming the limitations imposed by the vanishing gradient problem in Recurrent Neural Networks (RNNs) when processing long sequences. Furthermore, Multi-Head Attention, a direct extension of the self-attention mechanism, enables the Transformer to process data in parallel; each attention head can learn information from a different subspace, significantly enhancing computational efficiency. The specific process is as follows: the input sequence undergoes word-embedding and positional-encoding transformations to yield the matrix X. Subsequently, X is multiplied by the corresponding weight matrices W_Q, W_K, and W_V to obtain the Query (Q), Key (K), and Value (V) matrices. The attention output is then calculated via the Softmax function:

Attention(Q, K, V) = Softmax(QK^T/√d_k)·V, (9)

where d_k represents the key vector dimension. The Multi-Head Attention mechanism is obtained by computing multiple attention functions in parallel:

MultiHead(Q, K, V) = [head_1, head_2, …, head_H],  head_i = Attention(QW_Q^i, KW_K^i, VW_V^i). (10)

Here, W_O is a linear projection matrix. For i ranging from 1 to H, upon completion of the computations, the resulting head matrices are concatenated to obtain a new matrix, referred to as the Multi-Head Attention (MHA) matrix, which contains information from the different attention heads:

MHA = Concat(head_1, head_2, …, head_H)·W_O. (11)
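The flow of the attention equations above can be sketched as a plain (non-private) NumPy reference; splitting the projection matrices into per-head column slices is one common convention and is our assumption here:

```python
import numpy as np

def softmax_rows(x):
    # Numerically stable row-wise softmax.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, WQ, WK, WV, WO, H):
    """Plain reference for the attention/MHA equations: H heads computed
    independently on column slices, concatenated, then projected by WO."""
    d_model = X.shape[1]
    d_k = d_model // H
    heads = []
    for h in range(H):
        s = slice(h * d_k, (h + 1) * d_k)
        Q, K, V = X @ WQ[:, s], X @ WK[:, s], X @ WV[:, s]
        A = softmax_rows(Q @ K.T / np.sqrt(d_k))  # attention scores
        heads.append(A @ V)
    return np.concatenate(heads, axis=1) @ WO

rng = np.random.default_rng(0)
n, d, H = 4, 8, 2
X = rng.standard_normal((n, d))
WQ, WK, WV, WO = (rng.standard_normal((d, d)) for _ in range(4))
out = multi_head_attention(X, WQ, WK, WV, WO, H)
assert out.shape == (n, d)
```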

Residual Connection and Layer Normalization (Add & Norm): The matrix resulting from the processing of sequence data by the attention layer is first input into the residual connection and layer normalization layer. Specifically, this layer consists of two components: the residual connection and layer normalization:

LayerNorm(X + sublayer(X)). (12)

As shown in the Transformer framework diagram in Fig. 1, a sublayer can be either an Attention layer or a Feed-Forward layer. The layer normalization itself is computed as:

LayerNorm(x) = ((x − E(x))/√(Var(x) + ε))·γ + β. (13)

In Eq. (13), E(x) represents the mean, and Var(x) represents the variance. The notation ε is introduced to prevent division by zero, while γ and β are two learnable parameters. The residual connection helps mitigate network degradation during training, reducing training errors and improving accuracy. Meanwhile, the objective of normalization is to enhance training stability while simultaneously accelerating training speed.

Feed-Forward Neural Network (FFNN) Layer: This is a fully connected neural network that incorporates a non-linear activation function. Specifically, a linear transformation is first performed from a low dimension to a high dimension, followed by the introduction of non-linearity via the Gelu activation function, and finally, another linear transformation maps the data back to the low dimension. This structure endows the model with non-linearity. Since the matrix multiplications in the preceding layers are purely linear transformations, the model's expressive capacity would otherwise be significantly limited. The non-linear Gelu activation in this layer applies a nonlinearity to the input, thereby enhancing the model's expressive power and enabling it to address complex nonlinear problems. The formula is as follows:

FFN(x) = Gelu(xW_1 + b_1)·W_2 + b_2. (14)

Here, W1 and W2 are two weight matrices, and b1 and b2 are the corresponding biases, representing the parameters for the first and second linear transformations, respectively. As indicated by the model architecture, the output of the first linear transformation serves as the input to the non-linear activation function. The Gelu formula is defined as follows:

Gelu(x) = x·P(X ≤ x), where X ∼ 𝒩(0, 1). (15)

After the input undergoes Gelu non-linear activation, it proceeds to the computations of the subsequent layers.
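The FFN and Gelu formulas above can be written directly in NumPy; the exact Gelu uses the standard normal CDF, computed here via the error function:

```python
import numpy as np
from math import erf, sqrt

def gelu(x):
    # Gelu(x) = x * P(X <= x) for X ~ N(0, 1), via the Gaussian CDF.
    phi = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0))))
    return x * phi(x)

def ffn(x, W1, b1, W2, b2):
    # FFN(x): expand to the hidden dimension, apply Gelu, project back down.
    return gelu(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff, n = 8, 32, 4
x = rng.standard_normal((n, d_model))
W1, b1 = rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)), np.zeros(d_model)
assert ffn(x, W1, b1, W2, b2).shape == (n, d_model)
```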

4  Privacy-Preserving Transformer

This paper proposes a privacy-preserving Transformer inference framework based on Fully Homomorphic Encryption (FHE), as illustrated in Fig. 2. Since the Transformer involves extensive matrix multiplications, directly applying FHE to the Transformer model results in a drastic decline in inference performance [12]. The primary challenges faced are as follows:


Figure 2: An overview of the privacy-preserving Transformer.

Challenge 1: Matrix multiplication is the primary computational process in Transformers and typically involves large dimensions, resulting in an excessive volume of data. The limited size of plaintext slots makes it impossible to accommodate all input data and weight matrices. Consequently, it is necessary to decompose a single large matrix multiplication into multiple smaller ones. Different partitioning methods yield varying communication costs and computational efficiencies. Balancing communication costs and computational efficiency during matrix partitioning remains a challenge.

Challenge 2: Due to the linear computation constraints of FHE, complex non-linear computations cannot be performed directly. Previous works primarily addressed this via polynomial approximation or Garbled Circuits (GC); however, the former results in a loss of model inference accuracy, while the latter incurs substantial communication costs. Consequently, achieving low-cost non-linear computation while guaranteeing both computational accuracy and security remains a significant challenge.

Next, we provide an overview of the privacy-preserving Transformer inference framework proposed in this paper.

4.1 Privacy-Preserving Transformer Inference Framework

Fig. 2 illustrates the inference framework of the privacy-preserving inference scheme. This framework comprises two entities: the client and the server.

•   Client: Holds the data requiring inference. The user uploads the data to the Transformer model server for secure inference, then retrieves the inference results.

•   Server: The server possesses a pre-trained Transformer model. It receives the client’s uploaded data, performs secure Transformer inference, and returns the results to the client.

The client (user) holds private data requiring inference, encrypts the message, and uploads it to the server. The server possesses a pre-trained Transformer model. Upon receiving data from the client, the server invokes the pre-trained model to perform privacy-preserving inference on encrypted data. This paper divides the inference process into two parts: matrix multiplication and non-linear function processing. For the matrix multiplication part, we employ computational decomposition techniques to reduce communication costs and enhance computational efficiency; relevant details will be presented in Section 5. For the nonlinear function part, we design secure operators based on parameter obfuscation via addition and multiplication, and construct an efficient nonlinear collaborative inference process using these secure operators; details will be presented in Section 6. Finally, the server returns the obtained results to the client, who decrypts them using the private key. The final decrypted results are consistent with those obtained using plaintext inference. It is worth noting that the entire inference process in this paper is conducted within an encrypted environment.

4.2 Security Threat

The threat model adopted in this paper is the semi-honest threat model. This assumption, widely employed in previous schemes [10–12,14–19], is well-suited for practical scenarios and specifically encompasses two potential threats: First, the model holder (server) may continuously harvest data uploaded by the user (client), which may contain the user's private information. Second, the client may attempt to extract model parameters through repeated inference. As these parameters have high commercial value to the model holder, it is essential to ensure they remain protected from theft throughout the inference process. This paper guarantees the security of both user data and model weight parameters throughout the entire inference process via the semantic security of Fully Homomorphic Encryption (FHE). Specifically, the security of this paper relies on the hardness of the Ring Learning With Errors (RLWE) problem, a classical hard problem in post-quantum cryptography that is widely believed to be intractable, even for quantum adversaries.

4.3 Design Goal

The objectives of the Transformer-based user data privacy protection proposed in this paper are as follows:

•   Privacy: Throughout the entire inference process, the security of the data uploaded by the user (client) and the model weight parameters must be guaranteed. Specifically, both must be protected against leakage throughout the inference process to safeguard data privacy.

•   Integrity: No one shall be permitted to tamper with the inference results.

•   Real-Time Performance: The server must return the inference results in response to the uploaded user (client) data within a specific timeframe, because in certain scenarios users (clients) cannot tolerate prolonged delays in receiving the inference response.

5  Optimal Partitioning Method for Matrix Multiplication

The Transformer architecture involves extensive matrix multiplications; consequently, achieving rapid matrix multiplication within the ciphertext domain has garnered widespread attention from the research community. The matrix-vector multiplication method proposed by Juvekar et al. [30] requires numerous ciphertext rotations, which incurs substantial computational cost. To address this, Huang et al. [31] proposed embedding data directly in the ring domain, thereby enabling vector-matrix multiplication via underlying ring multiplication. Chen et al. [13] further enhanced this scheme by directly embedding matrices into the ring domain to achieve fast homomorphic matrix-matrix multiplication. For X ∈ ℝ^{a×b} and Y ∈ ℝ^{b×c} with a·b·c ≤ N, this approach embeds X and Y into the ring, respectively, as follows:

x̂ = π_X(X), where x̂[i·bc + (b − 1 − k)·c] = X[i, k],
ŷ = π_Y(Y), where ŷ[k·c + j] = Y[k, j]. (16)

Multiplying the two ring elements x̂ and ŷ yields a new ring element ẑ whose coefficients contain the entries of XY, i.e.:

ẑ[i·bc + (b − 1)·c + j] = Σ_{k=0}^{b−1} X[i, k]·Y[k, j]. (17)
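The ring-embedding idea can be checked in plain Python: embedding both matrices as polynomial coefficient vectors and multiplying the polynomials (here an ordinary convolution, i.e., no wrap-around, assuming the degree bound holds) places each entry of XY at a predictable coefficient. The index convention below follows our reconstruction above; the paper's exact permutations π may differ.

```python
import numpy as np

def embed_lhs(X, a, b, c):
    # x_hat[i*b*c + (b-1-k)*c] = X[i, k]  (column index reversed per row)
    x_hat = np.zeros(a * b * c + b * c, dtype=np.int64)
    for i in range(a):
        for k in range(b):
            x_hat[i * b * c + (b - 1 - k) * c] = X[i, k]
    return x_hat

def embed_rhs(Y, b, c):
    # y_hat[k*c + j] = Y[k, j]  (plain row-major layout)
    y_hat = np.zeros(b * c, dtype=np.int64)
    for k in range(b):
        for j in range(c):
            y_hat[k * c + j] = Y[k, j]
    return y_hat

def extract(z_hat, a, b, c):
    # (XY)[i, j] sits at coefficient i*b*c + (b-1)*c + j of the product.
    Z = np.zeros((a, c), dtype=np.int64)
    for i in range(a):
        for j in range(c):
            Z[i, j] = z_hat[i * b * c + (b - 1) * c + j]
    return Z

rng = np.random.default_rng(0)
a, b, c = 3, 4, 5
X = rng.integers(-5, 5, (a, b))
Y = rng.integers(-5, 5, (b, c))
# Polynomial multiplication = coefficient convolution.
z_hat = np.convolve(embed_lhs(X, a, b, c), embed_rhs(Y, b, c))
assert np.array_equal(extract(z_hat, a, b, c), X @ Y)
```

One multiplication of ring elements thus replaces the many rotations a slot-wise matrix product would need, which is the efficiency gain the text describes.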

Assume two matrices X ∈ ℝ^{A×B} and Y ∈ ℝ^{B×C} are multiplied to yield the matrix Z = XY ∈ ℝ^{A×C}. It is necessary to select parameters a, b, and c to partition the matrices into blocks. This is required because the matrices are too large for the degree of the ring polynomial to accommodate all elements simultaneously, i.e., a·b·c ≤ N < A·B·C. However, different partitioning parameters a, b, and c result in varying computational and communication costs. To address this, this paper analyzes in detail the impact of partitioning parameters on computational and communication costs for both plaintext-ciphertext and ciphertext-ciphertext matrix multiplications, and constructs a model for the optimal-partitioning-parameter problem.

5.1 Ciphertext-Plaintext Matrix Multiplication

First, matrix X needs to be embedded into (A/a)·(B/b) ciphertexts, and the resulting matrix Z is stored in (A/a)·(C/c) ciphertexts. The required number of multiplications is (A/a)(B/b)(C/c), and the number of ciphertexts for communication is (A/a)(B/b + C/c) = (ABc + ACb)/(abc). We minimize the number of multiplications by setting a·b·c = N, which yields a minimum multiplication count of ABC/N and a ciphertext count of (ABc + ACb)/N. Assuming the communication cost of a single input ciphertext is x, and the communication cost of a single ciphertext after multiplication is y, the overall communication cost is formulated as:

(A/a)·(B/b·x + C/c·y) = (AB·x·c + AC·y·b)/N. (18)

Here, A, B, C, a, b, c, x, and y are all constants. Similarly, assuming that the computation times associated with a single input ciphertext and a single output ciphertext are x′ and y′, respectively, the overall computation time is formulated as:

(A/a)·(B/b·x′ + C/c·y′) = (AB·x′·c + AC·y′·b)/N. (19)

By integrating the above equations, we formulate a constrained optimization problem that considers both computation time and communication costs:

min_{a,b,c} α·c + β·b
s.t. α = (AB/N)·[r·x′ + (1 − r)·x],
     β = (AC/N)·[r·y′ + (1 − r)·y],
     a·b·c = N. (20)

Here, r ∈ [0, 1] is a constant representing the degree of emphasis on computational efficiency relative to communication costs. We clarify below how the optimization framework extends beyond the idealized assumption a·b·c = N, and how padding, slot under-utilization, and non-square matrices are handled in practice. The constraint a·b·c = N is used in the analysis to capture the ideal case of full slot utilization, in which all CKKS slots are used. This formulation simplifies the analytical objective function and provides clear intuition for the trade-offs among the partitioning parameters. In practical deployments, matrix dimensions may not align perfectly with the CKKS slot count, resulting in padding or partial slot usage. In these cases, the constraint is relaxed to a·b·c ≤ N, where unused slots are padded with zeros. We note that: (i) padding with zeros does not affect correctness, since zero slots contribute nothing to homomorphic products; (ii) slot under-utilization affects only constant factors in efficiency, not the asymptotic behavior captured by the objective function. Our implementation explicitly accounts for under-utilization by introducing a utilization factor u = a·b·c/N, and the effective cost is scaled accordingly when comparing candidate partitions.
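Because N is a power of two in CKKS and the candidate partitions are few, the optimization can be solved by brute-force enumeration over divisor triples. The sketch below is our own illustration (the function name, the weights alpha_w/beta_w standing for α and β, and the extra block-size bounds a ≤ A, b ≤ B, c ≤ C are assumptions, not the paper's implementation):

```python
def optimal_partition(A, B, C, N, alpha_w, beta_w):
    """Enumerate partitions (a, b, c) with a*b*c == N and return the tuple
    (cost, a, b, c) minimizing alpha_w*c + beta_w*b, the ciphertext-plaintext
    objective. alpha_w and beta_w would be derived from measured per-ciphertext
    costs x, y, x', y' and the weight r."""
    best = None
    divs = [d for d in range(1, N + 1) if N % d == 0]
    for a in divs:
        for b in divs:
            if (N // a) % b:
                continue          # b must divide N/a so that c is integral
            c = N // (a * b)
            if a > A or b > B or c > C:
                continue          # a block cannot exceed the matrix dimension
            cost = alpha_w * c + beta_w * b
            if best is None or cost < best[0]:
                best = (cost, a, b, c)
    return best

# Toy example: N = 64 slots, 16x16 matrices, equal weights.
assert optimal_partition(16, 16, 16, 64, 1.0, 1.0) == (4, 16, 2, 2)
```

Since the objective rewards large a (which appears in neither term), the search drives a up to its bound and balances b against c according to the weights.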

5.2 Ciphertext-Ciphertext Matrix Multiplication

First, matrices X and Y need to be embedded into (A/a)·(B/b) and (B/b)·(C/c) ciphertexts, respectively, and the resulting matrix Z is stored in (A/a)·(C/c) ciphertexts. The required number of multiplications is (A/a)(B/b)(C/c). The number of ciphertexts for communication is (A/a)(B/b) + (B/b)(C/c) + (A/a)(C/c). Similarly, this paper minimizes the number of multiplications by setting a·b·c = N, yielding a minimum multiplication count of ABC/N and a ciphertext count of (ABc + ACb + BCa)/N. Assuming the communication cost of a single input ciphertext is x and the communication cost of a single ciphertext after multiplication is y, the overall communication cost is formulated as:

(A/a)(B/b)·x + (B/b)(C/c)·x + (A/a)(C/c)·y = (AB·x·c + AC·y·b + BC·x·a)/N. (21)

Here, A, B, C, a, b, c, x, y, and N are all constants. Similarly, assuming that the computation times associated with a single input ciphertext and a single output ciphertext are x′ and y′, respectively, the overall computation time is formulated as:

(A/a)(B/b)·x′ + (B/b)(C/c)·x′ + (A/a)(C/c)·y′ = (AB·x′·c + AC·y′·b + BC·x′·a)/N. (22)

By integrating the above equations, we can formulate an optimization problem that simultaneously considers both computation time and communication costs:

min_{a,b,c} α·c + β·a + γ·b
s.t. α = (AB/N)·[r·x′ + (1 − r)·x],
     β = (BC/N)·[r·x′ + (1 − r)·x],
     γ = (AC/N)·[r·y′ + (1 − r)·y],
     a·b·c = N. (23)

6  A Secure Protocol for the Nonlinear Activation Functions

Here, we provide secure protocols for the nonlinear activation functions. Our approach does not rely on complex or task-specific preprocessing.

6.1 Secure Softmax

The softmax is described by:

softmax(X)_i = e^{x_i} / Σ_{j=0}^{n−1} e^{x_j}. (24)

It can be observed that the required computations involve exponentiation, division, and summation. Among these, exponentiation and division cannot be directly performed in CKKS. Previous schemes [15,17] used approximate computations to simplify calculations at the cost of computational accuracy; however, in most scenarios, computational accuracy is also important to users. To this end, by leveraging client interaction, this paper proposes a secure Softmax function protocol without loss of precision.

6.1.1 Exponential Calculation

This paper implements this using parameter obfuscation and collaborative computing; the obfuscation parameter is removed during computation, so computational accuracy is unaffected. First, the Server subtracts a random number r from the computed ciphertext result ct, and the Client decrypts it to obtain x − r. Subsequently, the Client and the Server independently compute e^{x−r} = e^x·e^{−r} and e^r, respectively. Finally, the Client encrypts e^{x−r} to generate its ciphertext and sends it to the Server. The Server then performs a homomorphic multiplication by e^r to recover the ciphertext of e^x.
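The masking step relies only on the identity e^{x−r}·e^r = e^x, which the following plaintext simulation checks (no actual encryption; the roles are indicated in comments):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(8)    # value held (conceptually) inside the ciphertext
r = rng.standard_normal()     # server's random additive mask

blinded = x - r                       # server: homomorphically subtract r
client_reply = np.exp(blinded)        # client: decrypt, exponentiate, re-encrypt
recovered = client_reply * np.exp(r)  # server: homomorphic multiplication by e^r

# The server ends up with e^x without ever seeing x in the clear;
# the client only ever sees the blinded value x - r.
assert np.allclose(recovered, np.exp(x))
```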

6.1.2 Summation Calculation

Previous summation computations required extensive homomorphic rotation operations [14], which incurred substantial computational costs and noise accumulation, thereby degrading computational accuracy. This paper transforms the summation operation into the form of matrix multiplication, i.e.:

X·W = [e^{x_{i,j}}]_{m×n} · [1]_{n×n} = [Σ_{j=0}^{n−1} e^{x_{i,j}}]_{m×n}, (25)

where X holds the exponentials e^{x_{i,j}}, W is the all-ones n×n matrix, and every entry in row i of the product equals the row sum Σ_{j=0}^{n−1} e^{x_{i,j}}.

This method eliminates the need for homomorphic rotation operations; for the specific calculation method, please refer to Section 5.

6.1.3 Division Calculation

Previous schemes [32] employed Goldschmidt division for approximate division, which incurs high computational costs and introduces errors. To address these issues, this paper proposes a solution based on parameter obfuscation and collaborative computing. First, the Server multiplies the computed ciphertext result ct by a random number r, and the Client decrypts this to obtain x·r. Next, the Client performs a plaintext computation to obtain 1/(x·r). Finally, the Client encrypts 1/(x·r) to generate its ciphertext and sends it to the Server. The Server then performs a homomorphic multiplication by r to recover the ciphertext of 1/x. The secure Softmax protocol is as follows:

•   First, the Client decrypts $ct$ to obtain $x_i+r_i$, performs the exponential computation to yield $e^{x_i+r_i}=e^{x_i}e^{r_i}$, and then encrypts and uploads $e^{x_i}e^{r_i}$.

•   Next, the Server computes $e^{-r_i}$ and uses it to recover the ciphertext of $e^{x_i}$. We observe that the ciphertext recovery and summation computation can be performed simultaneously; specifically, by transforming $W$ into $\begin{bmatrix}e^{-r_{0,0}}&\cdots&e^{-r_{m-1,0}}\\ \vdots&\ddots&\vdots\\ e^{-r_{0,n-1}}&\cdots&e^{-r_{m-1,n-1}}\end{bmatrix}$. In this way, both summation and ciphertext recovery can be performed in a single matrix multiplication.

•   Subsequently, the Server applies a multiplicative obfuscation parameter $l_i$ to the obtained result, i.e., $l_i\sum_{j=0}^{n-1}e^{x_{i,j}}$. Similarly, the application of the obfuscation parameter can be performed simultaneously with the summation computation, specifically by transforming $W$ into $\begin{bmatrix}l_0e^{-r_{0,0}}&\cdots&l_{m-1}e^{-r_{m-1,0}}\\ \vdots&\ddots&\vdots\\ l_0e^{-r_{0,n-1}}&\cdots&l_{m-1}e^{-r_{m-1,n-1}}\end{bmatrix}$.

•   Subsequently, the Server sends the calculation result to the Client. The Client decrypts it to obtain $l_i\sum_{j=0}^{n-1}e^{x_{i,j}}$ and computes the reciprocal to yield $\frac{1}{l_i\sum_{j=0}^{n-1}e^{x_{i,j}}}$. The Client then encrypts $e^{x_i}e^{r_i}$ and $\frac{1}{l_i\sum_{j=0}^{n-1}e^{x_{i,j}}}$ to generate ciphertexts $ct_0$ and $ct_1$, respectively, and sends them to the Server.

•   Finally, the Server multiplies ciphertexts $ct_0$ and $ct_1$ and multiplies the resulting product by $l_ie^{-r_i}$ to obtain the result $\mathrm{Softmax}(X)$.
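Putting the steps together, the following plaintext simulation traces the masks through the protocol and confirms that the recovered values equal $\mathrm{Softmax}(X)$. Homomorphic operations are modeled by plain arithmetic, the single masked matrix product is modeled with an einsum, and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 5
X = rng.normal(size=(m, n))            # score matrix (conceptually encrypted)

# Step 1: Server applies additive masks; Client decrypts and exponentiates.
R = rng.normal(size=(m, n))
E_masked = np.exp(X + R)               # e^{x_{ij}} * e^{r_{ij}}, re-encrypted by Client

# Steps 2-3: folding e^{-r_{ij}} and the row masks l_i into W lets one matrix
# product recover, sum, and mask each row; modeled here with an einsum.
l = rng.uniform(1.0, 2.0, size=m)
masked_sums = l * np.einsum('ij,ij->i', E_masked, np.exp(-R))  # l_i * sum_j e^{x_{ij}}

# Step 4: Client decrypts the masked sums and returns their reciprocals.
inv = 1.0 / masked_sums

# Step 5: ct0 * ct1 * l_i * e^{-r_{ij}} yields Softmax(X).
out = E_masked * inv[:, None] * (l[:, None] * np.exp(-R))
```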

6.2 Secure LayerNorm Protocol

LayerNorm is a method for normalizing all features of a single sample within a neural network. In a neural network layer with multiple neurons, LayerNorm computes the mean and variance of each sample’s feature vector, then normalizes each element of the vector. Its formula is expressed as follows:

$$\mathrm{LayerNorm}(x)_i=r\cdot\frac{x_i-u}{\sqrt{\sum_{j=0}^{n-1}(x_j-u)^{2}}}+\beta=r\cdot\frac{z_i}{\sqrt{\sum_{j=0}^{n-1}z_j^{2}}}+\beta.\tag{26}$$

Here, $u=\frac{1}{n}\sum_{i=0}^{n-1}x_i$. It can be observed that the required computations involve summation, squaring, square root, and division. However, square root and division operations cannot be performed directly in CKKS. Scheme [14] performs these computations via MPC, which increases communication costs, whereas Scheme [15] uses approximate computations, sacrificing accuracy to avoid these operations. In contrast, this paper achieves lower communication overhead while maintaining computational accuracy through client interaction.

6.2.1 Summation Calculation

The summation computation here can be implemented directly using collaborative computing. The Server adds a random number $r_i$ to the computed ciphertext result $ct$, and the Client decrypts it to obtain $x_i+r_i$. The Client and Server independently perform plaintext computations to obtain $\sum_{i=0}^{n-1}(x_i+r_i)=nu+\sum_{i=0}^{n-1}r_i$ and $\sum_{i=0}^{n-1}r_i$, respectively. The Server recovers the value by subtracting $\sum_{i=0}^{n-1}r_i$ during the computation, thereby realizing secure collaborative summation.

6.2.2 Square Calculation

This squaring computation requires only a homomorphic multiplication of the recovered ciphertext of $x_i$ with itself.

6.2.3 Square Root Calculation

For the square root computation, the data $x$ can be multiplied by a random number $r$, i.e., $\sqrt{x}=\frac{\sqrt{xr}}{\sqrt{r}}$. Specifically, the Server applies a multiplicative obfuscation parameter $r$ to the ciphertext of $x$. Subsequently, the Client and Server independently compute $\sqrt{xr}$ and $\frac{1}{\sqrt{r}}$, respectively. Finally, the Server recovers the value by multiplying by $\frac{1}{\sqrt{r}}$, thereby realizing a secure square root computation.
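A one-line numeric check of the identity $\sqrt{x}=\sqrt{xr}\cdot\frac{1}{\sqrt{r}}$ underlying this step (a plaintext simulation; the function name is hypothetical):

```python
import math

def sqrt_via_mask(x, r):
    # Server: Enc(x) -> Enc(x * r); Client: decrypts and returns sqrt(x * r).
    client_part = math.sqrt(x * r)
    # Server: multiplies by 1/sqrt(r) to recover sqrt(x) without seeing x.
    return client_part / math.sqrt(r)

val = sqrt_via_mask(2.0, 9.0)
```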

6.2.4 Division Calculation

This division computation is identical to the division computation in Section 6.1.3. In summary, the secure LayerNorm protocol is as follows:

•   First, the Client decrypts $ct$ to obtain $x_i+r_i$. The Client and Server then independently perform summation and division by $n$ to obtain $u+\frac{1}{n}\sum_{j=0}^{n-1}r_j$ and $\frac{1}{n}\sum_{j=0}^{n-1}r_j$, respectively. The Client proceeds with local computations to yield $x_i+r_i-u-\frac{1}{n}\sum_{j=0}^{n-1}r_j=z_i+r_i-\frac{1}{n}\sum_{j=0}^{n-1}r_j$. Finally, the Client encrypts $z_i+r_i-\frac{1}{n}\sum_{j=0}^{n-1}r_j$ and sends the resulting ciphertext.

•   Next, the Server homomorphically adds $\frac{1}{n}\sum_{j=0}^{n-1}r_j-r_i$ to the ciphertext of $z_i+r_i-\frac{1}{n}\sum_{j=0}^{n-1}r_j$ to obtain the ciphertext of $z_i$. Thus, the Server obtains the ciphertext of $z_i^{2}$ via homomorphic multiplication. The Server then adds a fresh additive obfuscation parameter $r_i'$ via homomorphic addition to obtain the ciphertext of $z_i^{2}+r_i'$ and sends it to the Client.

•   The Client decrypts the data to obtain $z_i^{2}+r_i'$ and performs a local summation operation to yield $\sum_{i=0}^{n-1}z_i^{2}+\sum_{i=0}^{n-1}r_i'$. The Client then encrypts $\sum_{i=0}^{n-1}z_i^{2}+\sum_{i=0}^{n-1}r_i'$ to obtain its ciphertext and sends it to the Server.

•   The Server recovers the ciphertext of $\sum_{i=0}^{n-1}z_i^{2}$ by subtracting $\sum_{i=0}^{n-1}r_i'$ via homomorphic addition. Subsequently, it applies the multiplicative obfuscation parameter $l$ through homomorphic multiplication to obtain the ciphertext of $l\sum_{i=0}^{n-1}z_i^{2}$ and sends it to the Client.

•   The Client decrypts the data to obtain $l\sum_{i=0}^{n-1}z_i^{2}$ and performs a square root computation to yield $\sqrt{l}\sqrt{\sum_{i=0}^{n-1}z_i^{2}}$. The Client then proceeds to compute the reciprocal to obtain $\frac{1}{\sqrt{l}}\cdot\frac{1}{\sqrt{\sum_{i=0}^{n-1}z_i^{2}}}$, encrypts the result, and sends it to the Server.

•   The Server multiplies the ciphertext of $z_i$ by the ciphertext of $\frac{1}{\sqrt{l}}\cdot\frac{1}{\sqrt{\sum_{j=0}^{n-1}z_j^{2}}}$ and subsequently multiplies by $\sqrt{l}$ to realize the LayerNorm computation.
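The rounds above can be traced in plaintext as follows; homomorphic additions and multiplications are modeled by plain arithmetic, and the final scale $r$ and bias $\beta$ of Eq. (26) are omitted, since they are straightforward plaintext operations. All names and mask values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8
x = rng.normal(size=n)                  # one feature vector (conceptually encrypted)

# Round 1: additive masks r_i; the Client only ever sees x_i + r_i.
r = rng.normal(size=n)
masked = x + r
u_masked = masked.mean()                # Client: u + mean(r)
client_z = masked - u_masked            # z_i + r_i - mean(r), re-encrypted by the Client
z = client_z + (r.mean() - r)           # Server: homomorphic correction -> z_i

# Round 2: Server squares and re-masks; Client returns only the masked sum.
r2 = rng.normal(size=n)                 # fresh additive masks r_i'
sum_masked = (z ** 2 + r2).sum()        # Client-side plaintext summation
sum_sq = sum_masked - r2.sum()          # Server recovers sum_j z_j^2

# Round 3: multiplicative mask l; Client takes the square root and reciprocal.
l = 3.7
inv_masked = 1.0 / np.sqrt(l * sum_sq)  # Client: (1/sqrt(l)) * (1/sqrt(sum z^2))
out = z * inv_masked * np.sqrt(l)       # Server: z_i / sqrt(sum_j z_j^2)
```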

6.3 Secure Gelu Protocol

The Gelu (Gaussian Error Linear Unit) [33] function enhances the model’s non-linear representational capacity and accelerates model convergence. The computational form of this activation function is expressed as:

$$\mathrm{Gelu}(x)=x\,P(X\le x)=x\int_{-\infty}^{x}\frac{1}{\sqrt{2\pi}\,\sigma}e^{-\frac{(X-u)^{2}}{2\sigma^{2}}}\,dX.\tag{27}$$

Alternatively, an approximation can be employed, with the formula expressed as:

$$\mathrm{Gelu}(x)\approx 0.5x\left(1+\tanh\left(\sqrt{\tfrac{2}{\pi}}\left(x+0.044715x^{3}\right)\right)\right)=0.5x\left(1+\tanh(y)\right)=0.5x\left(1+\frac{e^{2y}-1}{e^{2y}+1}\right).\tag{28}$$

Here, $y=\sqrt{\frac{2}{\pi}}\left(x+0.044715x^{3}\right)$, and the required computations involve polynomial evaluation, exponentiation, and division. Previous schemes [14] used Oblivious Transfer for computation, thereby guaranteeing computational accuracy but increasing communication costs. By leveraging homomorphic encryption and client-server interaction, this paper ensures computational accuracy while reducing communication overhead.
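The accuracy of the tanh approximation in Eq. (28) can be checked numerically against the exact Gaussian-CDF form of Eq. (27) with $u=0$, $\sigma=1$ (a sketch; the sample points and tolerance are chosen loosely for illustration):

```python
import math

def gelu_exact(x):
    # Exact Gelu via the standard normal CDF: x * Phi(x).
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # tanh-based approximation of Eq. (28).
    y = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(y))

errs = [abs(gelu_exact(t) - gelu_tanh(t)) for t in (-3.0, -1.0, 0.0, 0.5, 2.0, 4.0)]
```

At these sample points the two forms agree to well within $10^{-2}$, which is why the protocol evaluates the tanh form.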

6.3.1 Polynomial Calculation

Polynomial computation can be directly implemented via homomorphic multiplication.

6.3.2 Exponential Calculation

The exponentiation calculation method is identical to that in Section 6.1.1.

6.3.3 Division Calculation

The division calculation method is identical to that in Section 6.1.3.

In summary, the secure Gelu protocol is as follows:

•   First, the Client decrypts the data to obtain $x+r_1$, encodes it into the Slot domain, encrypts it, and sends it to the Server.

•   Next, the Server recovers the ciphertext of $x$ via homomorphic addition and obtains the ciphertext of $x^{3}$ via homomorphic multiplication. The Server then obtains the ciphertext of $x^{3}+r_2$ via homomorphic addition and parameter obfuscation, and sends it to the Client.

•   The Client and Server independently perform computations to obtain $\sqrt{\frac{2}{\pi}}\left((x+r_1)+0.044715(x^{3}+r_2)\right)=y+R$ and $R=\sqrt{\frac{2}{\pi}}\left(r_1+0.044715r_2\right)$, respectively. Subsequently, the Client and Server proceed to independently compute the exponentials, yielding $e^{2y+2R}=e^{2y}e^{2R}$ and $e^{-2R}$, respectively. Finally, the Client encrypts $e^{2y}e^{2R}$ and sends it to the Server.

•   The Server recovers the ciphertext of $e^{2y}$ via homomorphic multiplication and obtains the ciphertexts of $e^{2y}+1$ and $e^{2y}-1$ via homomorphic operations. Subsequently, the Server applies a multiplicative obfuscation parameter $l$ via homomorphic multiplication to obtain $l(e^{2y}+1)$ and sends it to the Client.

•   The Client decrypts the data to obtain $l(e^{2y}+1)$, computes $\frac{1}{l}\cdot\frac{1}{e^{2y}+1}$, encrypts the result, and sends it to the Server.

•   Finally, the Server obtains the ciphertext of $\frac{1}{l}\cdot\frac{e^{2y}-1}{e^{2y}+1}$ via homomorphic multiplication and multiplies it by $l$ to yield $\tanh(y)$, subsequently obtaining the ciphertext of $1+\tanh(y)$ via homomorphic addition. The Server then proceeds to multiply the ciphertext of $1+\tanh(y)$ by the ciphertext of $x$ and the constant $0.5$ to obtain the ciphertext of $\mathrm{Gelu}(x)$.
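A plaintext trace of the masks through the Gelu protocol, matching Eq. (28); the mask values and variable names are illustrative, and homomorphic operations are modeled by plain arithmetic:

```python
import math

x = 0.8                                  # input value (conceptually encrypted)
r1, r2, l = 0.31, -1.2, 2.5              # additive masks r1, r2 and multiplicative mask l
c = math.sqrt(2.0 / math.pi)

# Round 1: the Client sees x + r1 and x^3 + r2 and evaluates the masked polynomial.
R = c * (r1 + 0.044715 * r2)             # Server-side offset R
y_masked = c * ((x + r1) + 0.044715 * (x ** 3 + r2))         # = y + R
e2y = math.exp(2 * y_masked) * math.exp(-2 * R)              # Server recovers e^{2y}

# Round 2: multiplicative mask on the denominator e^{2y} + 1.
inv_masked = 1.0 / (l * (e2y + 1.0))     # Client: 1/(l * (e^{2y} + 1))
tanh_y = (e2y - 1.0) * inv_masked * l    # Server: (e^{2y} - 1)/(e^{2y} + 1) = tanh(y)

gelu = 0.5 * x * (1.0 + tanh_y)          # Server: Gelu(x) via Eq. (28)
```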

6.4 Security Analysis

6.4.1 Information Leakage during Interactions

All repeated interactions in our system arise from non-linear protocols (Gelu, Softmax, LayerNorm), which combine CKKS-based FHE with collaborative computation. We analyze leakage across rounds as follows:

•   All messages exchanged during interaction rounds are CKKS ciphertexts encrypted under the client’s public key. Under the IND-CPA security of CKKS, each ciphertext is computationally indistinguishable from random to the server, even across multiple rounds. Consequently, repeated exposure to ciphertexts does not accumulate information leakage about the underlying plaintext values.

•   The number of interaction rounds, message sizes, and execution order are fixed and input-independent for a given model configuration. Therefore, no control-flow, branching, or early-termination behavior leaks information about intermediate activations.

•   In collaborative steps where the client performs plaintext-domain operations (e.g., normalization or polynomial evaluation), the client observes only values that are either: (i) Derived from its own secret key and inputs, or (ii) Explicit protocol outputs that are already implied by the final inference result. No additional side information is revealed to the server during these steps.

•   Each interaction round is secure under the CKKS encryption scheme and the semi-honest model. By the standard sequential composition theorem, the overall protocol remains secure, and the cumulative leakage is limited to the union of per-round leakage, which in our case consists solely of public parameters and input-independent metadata.

6.4.2 Malicious Setting

Our system is analyzed under the semi-honest (honest-but-curious) model, in which all parties are assumed to follow the protocol specification correctly, while possibly attempting to infer additional information from observed transcripts. Under this model, the client performs plaintext exponentiation, division, and square-root operations correctly and as specified.

Even if a client were to deviate from the specified computation, such behavior would: (i) Not reveal additional information about the server’s model parameters, since all values returned to the server remain encrypted under the client’s public key; (ii) Affect correctness, not privacy: an incorrect client computation can only bias its own inference result, not extract extra information about the model.

6.4.3 Security of Random Injections

Exposing masked sums of exponentials warrants careful analysis. We clarify below why this does not leak relative magnitude or sparsity information under our threat model.

In the relevant non-linear protocols (e.g., Softmax), the client receives values of the form:

$$\tilde{S}=S+r,\tag{29}$$

where $S=\sum_i e^{x_i}$ is the true sum of exponentials and $r$ is a fresh, randomly sampled mask drawn from a sufficiently large domain. The client never observes $S$ itself, nor any unmasked $e^{x_i}$.

Because the mask $r$ is additive, independent, and information-theoretically unknown to the client: (i) The distribution of $\tilde{S}$ is statistically independent of the magnitude of $S$; (ii) Given a single masked observation, the client cannot distinguish whether $S$ is large or small relative to other queries; (iii) Across multiple queries, fresh randomness is used, preventing cross-query comparison or normalization. As a result, relative magnitude information is computationally hidden, even if the client chooses inputs adaptively.

Sparsity information (e.g., whether only a few terms dominate the sum) would require access to either individual exponentials $e^{x_i}$ or ratios between partial sums. In our protocol: (i) The client observes only a single aggregated value per interaction round; (ii) No partial sums or per-token contributions are ever revealed; (iii) The masking prevents inference based on absolute value or scale. Therefore, the client cannot infer whether $S$ is composed of many small terms or a few dominant ones.

6.4.4 Security of Repeated Interactive Queries

The repeated queries do not leak model weights. The reasons are as follows:

•   Model weights are never revealed in plaintext: All server-side computations involving model weights are performed either on plaintext weights combined with encrypted activations, or entirely within the encrypted domain. At no point does the client receive plaintext or partially decrypted model parameters.

•   Ciphertext indistinguishability across queries: All messages returned to the client are CKKS ciphertexts encrypted under the client’s public key. Under the IND-CPA security of CKKS, ciphertexts corresponding to different masked linear combinations of model weights are computationally indistinguishable from random ciphertexts, even under multiple adaptive queries.

•   No linear equations exposed to the client: The client never observes multiple plaintext outputs corresponding to linearly independent combinations of the same intermediate values. Thus, adaptive querying does not allow the client to construct solvable systems of equations to recover model weights.

•   Interaction structure is input-independent: The number of interaction rounds, message sizes, and execution order are fixed and independent of the client’s inputs. This prevents adaptive behavior from influencing protocol flow or extracting side information through control-flow leakage.

7  Experimental Evaluation

7.1 Experimental Setting

7.1.1 Dataset

The evaluation datasets are provided as follows.

•   IMDB: This is a sentiment analysis dataset utilized in the fields of Natural Language Processing and Machine Learning. It contains 50,000 highly polarized reviews, with 25,000 allocated to the training set and 25,000 to the test set. The labels are categorized as positive and negative.

•   AGNEWS: This dataset encompasses news from four major categories: World, Sports, Business, and Science/Technology. It aggregates the title and description fields of articles from the AG corpus corresponding to these four categories. The dataset contains 30,000 training samples and 1900 test samples per category.

7.1.2 Training Parameters

We train the Transformer in plaintext on the IMDB and AGNEWS datasets using the PyTorch [34] library, an open-source machine learning framework. Plaintext training is performed across varying numbers of encoders and sequence lengths. The specific model training framework is detailed in Table 1, Transformer Model Framework Information.


7.1.3 Ciphertext and Other Settings

In this paper, we conduct ciphertext inference tests using the Lattigo library [35], adopting the CKKS homomorphic encryption scheme. All experiments are executed on a computer equipped with an AMD R7 5700G CPU and 32GB of RAM, employing 8-thread parallel computing. We select two sets of ciphertext parameters to accommodate different computational processes, as detailed in Table 2.


The two CKKS parameter sets listed in Table 2 are selected to match the multiplicative-depth requirements of different computational components in Transformer inference, while maintaining sufficient numerical precision and minimizing computational overhead. Transformer inference consists of two distinct classes of operations: (i) Linear operations (matrix multiplications, additions), which require low multiplicative depth but dominate runtime; (ii) Non-linear operations (Gelu, Softmax, LayerNorm), which involve multiple ciphertext–ciphertext multiplications and therefore require greater noise budget. Using two specialized parameter sets allows us to decouple efficiency and depth requirements: (i) Parms1 minimizes latency and communication cost for the dominant linear computations; (ii) Parms2 ensures correctness and precision for noise-sensitive non-linear functions without resorting to approximation or bootstrapping.

7.2 Test of Basic Operations

7.2.1 Test of the Matrix Multiplication

To analyze the impact of matrix dimensions on computational efficiency and communication costs, this paper conducts matrix-multiplication tests with varying matrix sizes. Given that matrix multiplication requires only a single multiplication operation and consumes a multiplicative depth of 1, Parms1 is selected as the ciphertext testing parameter set. We compare the pre-optimization and post-optimization scenarios for ciphertext-plaintext and ciphertext-ciphertext matrix multiplications. Prior to optimization, the parameters are set to a=b=c=24. The computation time and communication costs are presented in Table 3. The computation time reported here includes both the Client’s encryption/decryption time and the Server’s computation time.


The results indicate that, in the ciphertext-plaintext scenario, the optimized computation time is reduced by at least a factor of 3, while the communication cost is reduced by a factor of 2. In the ciphertext-ciphertext scenario, both computation time and communication costs are reduced after optimization. This is attributed to the variation in the number of ciphertexts transmitted after optimization, which reduces the cost of ciphertext transmission. Simultaneously, the reduction in ciphertexts also reduces the number of encryption and decryption operations required by the Client. The performance improvement is more pronounced in ciphertext-plaintext multiplication because the transmission cost in this context is highly correlated with the parameters b and c, which aligns with the optimization objective. Conversely, ciphertext-ciphertext matrix multiplication depends on parameters a, b, and c. For certain parameter settings, the communication cost and computational efficiency remain unchanged after optimization because the optimal parameters coincide with the initially set values a=b=c=24. Consequently, the proposed scheme significantly enhances computational efficiency and reduces communication costs.

7.2.2 Test of the Softmax, LayerNorm, and Gelu

To analyze the impact of different data dimensions on computational efficiency and communication costs, this paper conducts tests on Softmax, LayerNorm, and Gelu with varying matrix dimensions. Since the computations for Softmax and LayerNorm involve only a single multiplication operation before decryption by the Client, whereas Gelu requires multiple multiplications, we select Parms1 as the ciphertext testing parameters for Softmax and LayerNorm, and Parms2 for Gelu. The measured computation times and communication costs are presented in Table 4.


The results indicate that both computational and communication costs increase as the data dimension expands. This is because increased dimensions require more ciphertexts for data storage, resulting in a larger volume of ciphertext that must be transmitted and processed by both the Client and the Server.

7.3 Comparison

To analyze the communication performance of the proposed scheme, we compare it with previous schemes; specifically, we compare the communication costs of the linear layers with Scheme [13] and those of the non-linear layers with Schemes [14,15]. This paper also presents the experimental results before and after optimization, as shown in Tables 5 and 6. The reported results for the Iron scheme [13] and the BOLT scheme [14] are taken from the respective original papers rather than re-implemented and re-evaluated in our own environment. The primary reason is that both systems rely on substantially different software stacks and threat models from those of our proposed scheme: IRON and BOLT are MPC-based systems implemented on customized secure computation frameworks, whereas our approach is built on CKKS-based FHE using the Lattigo library. Reproducing these baselines faithfully would require re-implementing their protocols, cryptographic primitives, and communication layers, which is non-trivial and beyond the scope of this work.



As observed in the comparison of communication costs for linear layer computations in Table 5, the communication cost of the proposed scheme is nearly identical to that of the Iron scheme prior to optimization, whereas it is reduced by a factor of nearly 2 after optimization. This is because, before optimization, our linear computation employed the same calculation method as Iron. The optimized method reduces the number of ciphertexts transmitted, thereby demonstrating the effectiveness of the proposed scheme. As indicated by the comparison of communication costs for non-linear layer computations in Table 6, the communication cost of the proposed scheme is reduced by a factor of at least 7 compared to Iron, though it is slightly higher than that of BOLT. This is because the proposed scheme employs FHE for computation, whereas Iron uses MPC, resulting in superior communication cost performance for our scheme. However, the proposed scheme is less efficient than BOLT in this regard because BOLT employs extensive approximate computations, which require only minimal nonlinear computation. Nevertheless, approximate computations reduce the accuracy of model inference. Consequently, the BOLT scheme requires fine-tuning of the existing model, yet it still incurs a 1% accuracy loss.

7.4 Test of the Model

To evaluate the computation time and communication costs of the proposed scheme across different Transformer architectures, we conducted tests with two sequence lengths and three encoder counts, comparing computational and communication efficiency before and after optimization. The experimental results are presented in Table 7. The results indicate that both computation and communication costs scale linearly with the number of encoders. Simultaneously, increasing sequence length also increases computation and communication costs. This is because increasing the number of encoders results in a linear increase in the number of matrix multiplications and other nonlinear computations, thereby raising computational and communication costs; meanwhile, increasing the sequence length expands the dimensions of these operations, thereby increasing these costs.


To evaluate the impact of the proposed scheme on model accuracy, we conducted experiments on the IMDB and AGNEWS datasets, with sequence lengths of 128 and 64, respectively. The experimental results are presented in Table 8. As the number of encoders increases, both plaintext and ciphertext accuracies improve. Furthermore, the computational accuracy in the ciphertext domain is comparable to that in the plaintext domain, demonstrating the feasibility of the proposed ciphertext inference scheme.


To provide a more robust assessment beyond accuracy, we report macro-averaged precision, recall, and F1-score. These confirm that our scheme preserves per-class performance without degradation, even in the multi-class AGNEWS setting, underscoring the fidelity of our optimized FHE operations. In Table 9, the updated results (based on our re-evaluation) show negligible differences from plaintext: For IMDB, plaintext accuracy/F1 91.2%/0.912, ciphertext 91.1%/0.911; for AGNEWS, plaintext 92.5%/0.925, ciphertext 92.4%/0.924 (macro-averaged).


7.5 Discussion

7.5.1 No Use of Bootstrapping

Our proposed pipeline does not use bootstrapping. All homomorphic computations are completed within the initial noise budget provided by the CKKS parameter sets (Parms1 and Parms2 in Table 2). Avoiding bootstrapping is a deliberate design choice, as bootstrapping remains one of the most expensive operations in CKKS and would significantly increase inference latency.

Although multiple Transformer encoders are stacked, the multiplicative depth does not accumulate across encoders due to the following design properties:

•   Each encoder layer is executed independently using a fresh modulus chain segment. At the end of each layer, ciphertexts are rescaled to a fixed scale and reduced to a target modulus level. This “depth reset” ensures that subsequent layers start with a clean and predictable noise budget.

•   Each computational component within an encoder, including linear projections, attention score computation, Softmax approximation, feed-forward networks, and LayerNorm, is implemented using a protocol with fixed and analytically bounded multiplicative depth. Importantly, no component introduces depth that scales with the number of encoders.

•   Intermediate ciphertexts produced at deeper modulus levels are not reused across encoder boundaries. Instead, only the final outputs of each encoder (at a fixed modulus level) are passed to the next encoder, preventing depth accumulation.

•   The CKKS parameters are chosen such that the maximum depth required by the most depth-intensive operation (nonlinear layers using Parms2) fits entirely within the available modulus chain. Since each encoder consumes the same bounded depth, stacking encoders does not increase the per-ciphertext depth requirement.

7.5.2 Memory Use

The memory footprint of our CKKS-based Transformer inference system mainly originates from three sources:

•   Ciphertext storage, including encrypted inputs, intermediate activations, and outputs

•   Evaluation keys, particularly rotation and relinearization keys

•   Temporary buffers required during homomorphic multiplications and rescaling

In our experimental setup (Table 2, Parms1 and Parms2 with logN=12), a single CKKS ciphertext occupies approximately 1–2 MB in memory, depending on the modulus chain length. Evaluation keys require additional memory, typically tens of megabytes, and are generated once and reused for all inference requests.

During inference, the peak memory usage is determined by the maximum number of live ciphertexts held simultaneously. For the evaluated Transformer configurations (up to 4 encoders and a sequence length of 128), our implementation peaks at approximately 2–3 GB, comfortably fitting within the 32 GB of system memory used in our experiments.

7.5.3 Scalability Analysis

The current experiments use small Transformer configurations (2–4 encoders, hidden dimension 128, sequence lengths 64/128) on IMDB and AGNEWS to demonstrate proof-of-concept efficiency (<3 s inference, <1 GB communication), but we recognize this does not explicitly demonstrate scalability. However, the theoretical foundations of our optimizations, such as adaptive matrix partitioning and obfuscation-based protocols for non-linear functions, are designed to scale linearly with model size. For instance:

•   Matrix multiplication costs O(mnk/slots) under CKKS SIMD, and our partitioning minimizes rotations independently of depth/width.

•   Non-linear protocols (Softmax, LayerNorm, Gelu) use fixed-depth operations (e.g., O(1) fuzzing rounds per layer), ensuring noise growth remains manageable via rescaling.

These results extend naturally to deeper/wider models like BERT-base (12 encoders), where preliminary scaling analysis (based on our cost models) predicts 10–20 s latency and 3–5 GB comm. on similar hardware, without accuracy loss.

While our work is motivated by privacy risks in large-scale LLM services, we evaluate on smaller Transformer models for text classification due to FHE’s high computational demands. These models capture essential Transformer dynamics (e.g., attention and feed-forward layers) at a scale that makes end-to-end privacy-preserving inference tractable on commodity hardware (AMD Ryzen 7, 32 GB RAM). This allows us to rigorously benchmark optimizations without incurring prohibitive costs, and the results generalize to larger models because our contributions target universal operations. Future work will explore scaling via distributed FHE or hardware acceleration.

8  Conclusion

This paper proposes a privacy-preserving Transformer inference scheme based on CKKS, a fully homomorphic encryption scheme. First, we implement a fast matrix-matrix product using ring multiplication. By analyzing existing matrix multiplication protocols, we optimize matrix partitioning parameters to accommodate different types (including ciphertext-plaintext and ciphertext-ciphertext) and varying matrix dimensions. Second, to address non-linear activation functions, we design secure protocols for Softmax, LayerNorm, and Gelu that perform these computations in the ciphertext domain. We transform summation operations into matrix multiplication forms to avoid homomorphic rotation operations; simultaneously, we address division computations via parameter obfuscation and collaborative computing, thereby reducing computational costs. Finally, we conduct text classification experiments on the IMDB and AGNEWS datasets. The results demonstrate that the proposed scheme completes computations within 3 s, maintains communication costs below 1 GB, and achieves computational accuracy comparable to that of plaintext calculations. In future work, we suggest integrating zero-knowledge proofs (ZKPs) for ciphertext validity during collaborative steps, preventing malicious tampering (e.g., invalid masks or inputs) with modest overhead. Commitments (e.g., via HE-friendly MACs) can bind client masks before decryption, ensuring no post-hoc manipulation.

Acknowledgement: We express our gratitude to Guizhou Power Grid Co., Ltd., for their invaluable support and assistance in this investigation.

Funding Statement: This paper is supported in part by the Natural Science Foundation of China no. 62362008.

Author Contributions: Conceptualization, Tao Bai, Kuan Shao, Yang Tang, Zhenyong Zhang, and Yuanteng Liu; methodology, Tao Bai, Zhenyong Zhang, and Yuanteng Liu; software, Tao Bai, Zhenyong Zhang, and Yuanteng Liu; validation, Tao Bai and Zhenyong Zhang; formal analysis, Tao Bai and Zhenyong Zhang; investigation, Tao Bai and Zhenyong Zhang; resources, Tao Bai and Zhenyong Zhang; data curation, Tao Bai and Zhenyong Zhang; writing—original draft preparation, Tao Bai and Zhenyong Zhang; writing—review and editing, Tao Bai and Zhenyong Zhang; visualization, Tao Bai and Zhenyong Zhang; supervision, Zhenyong Zhang; project administration, Zhenyong Zhang; funding acquisition, Zhenyong Zhang. All authors reviewed and approved the final version of the manuscript.

Availability of Data and Materials: Not applicable.

Ethics Approval: Not applicable.

Conflicts of Interest: The authors declare no conflicts of interest.


Copyright © 2026 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.