LCF: A Deep Learning-Based Lightweight CSI Feedback Scheme for MIMO Networks

Recently, as deep learning technologies have received much attention for their great potential in extracting the principal components of data, there have been many efforts to apply them to the Channel State Information (CSI) feedback overhead problem, which can significantly limit Multi-Input Multi-Output (MIMO) beamforming gains. Unfortunately, since most compression models can quickly become outdated due to channel variation, timely model updates are essential for reflecting the current channel conditions, resulting in frequent additional transmissions for model sharing between transceivers. In particular, the heavy network models employed by most previous studies to achieve high compression gains exacerbate the impact of this overhead, eventually cancelling out the benefits of deep learning-based CSI compression. To address these issues, in this paper, we propose Lightweight CSI Feedback (LCF), a new lightweight CSI feedback scheme. LCF fully utilizes autoregressive Long Short-Term Memory (LSTM) to generate CSI predictions and uses them to train the autoencoder, so that the compression model can work effectively even in highly dynamic wireless channels. In addition, 3D convolutional layers are directly adopted in the autoencoder to capture diverse types of channel correlations in three dimensions. Extensive experiments show that LCF achieves a lower CSI compression error in terms of the Mean Squared Error (MSE), using only about 10% of the overhead of existing approaches.

Although the aforementioned approaches show that deep learning can be a very effective tool for CSI compression, several critical issues remain regarding how the transceivers can practically share the models. Neural network-based CSI compression schemes are fundamentally premised on sharing a model between a transmitter and a receiver, which means that some transmissions for this model sharing, and their accompanying cost, are unavoidable. In this paper, we refer to this as model sharing overhead. Unfortunately, this overhead has not been thoroughly taken into account in most existing studies; in many cases, it is assumed that the transceivers already share a model or that model sharing will rarely happen. However, as we will see later, model sharing can occur quite often in practice, since the model cannot guarantee a high degree of generalization to wireless channel data. If the model cannot properly cope with CSI values that it has not experienced during training, then the compression and recovery will fail, which leads to an inevitable process of model re-training and sharing. Of course, in some channel environments where a clear pattern is found, as shown in Fig. 1a, the overhead problem may not be serious. However, this situation cannot always be guaranteed; actual channels may look more like the one in Fig. 1b. In addition, due to the strong randomness of wireless channel state changes, simply increasing the amount of training data does not yield a noticeable generalization enhancement. Rather, it is more important to use a proper set of training data that reflects the pattern and tendency of the current channel status well, and for this, an appropriate channel status prediction can be of great help. As mentioned earlier, most of the previous approaches focus only on making the model work at higher compression ratios, and thus they prefer deep and wide networks in their designs.
For example, CSINet [13] uses five convolutional layers and two fully connected layers, and it successfully compresses CSI with a high compression ratio of up to 1/64, clearly outperforming conventional compressed sensing-based approaches [24,25]. However, from the model sharing perspective, the model is still too big to share: roughly calculated, for an 8 × 2 MIMO channel with 8 paths, it needs more than 1,000 decoder parameters in total, which makes the decoder larger than the original CSI itself (i.e., 256 = 8 × 8 × 2 × 2 values, where the last factor accounts for the real and imaginary parts of the complex channel coefficients). In this case, model sharing is practically infeasible, since it cancels out the benefits of compression.
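As a back-of-envelope check of this size comparison, the following sketch computes the raw CSI size and the parameter count of a hypothetical minimal decoder; the single dense layer below is an illustrative assumption, not CSINet's actual decoder architecture.

```python
# Raw CSI size vs. decoder size for an 8x2 MIMO channel with 8 paths,
# as discussed above. Layer sizes are illustrative assumptions.
Nt, Nr, Nd = 8, 2, 8          # transmit antennas, receive antennas, paths
csi_size = Nd * Nt * Nr * 2   # real + imaginary parts -> 256 values

# A hypothetical small decoder: one fully connected layer mapping an
# M-dimensional codeword back to the CSI size (weights + biases).
M = 4                          # codeword length (compression ratio 1/64)
decoder_params = M * csi_size + csi_size

print(csi_size)        # 256
print(decoder_params)  # 1280 -> already larger than the CSI itself
```

Even this deliberately tiny decoder already exceeds the 256 values of the original CSI, which is why sharing a deep, wide decoder can cancel out the compression gains.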
In order to overcome these limitations, in this paper, we propose a lightweight CSI feedback scheme (LCF). Similar to recent approaches, LCF exploits deep neural networks to achieve better CSI compression, but we focus more on ensuring that the model does not impose a substantial burden on the network when being shared. LCF mainly consists of two parts: CSI prediction and CSI compression. First, for CSI prediction, LCF employs a long short-term memory (LSTM) structure to infer future CSI values, which are in turn used to train the CSI compression model. In particular, to generate multiple future CSI predictions effectively, we apply an autoregressive approach to our model. The actual channel compression and reconstruction is conducted by a convolutional autoencoder, where 3D convolution layers are adopted to capture channel correlations in the three dimensions of the transmitting antenna, receiving antenna, and path delays. The resulting compression model appears to be simple compared to recent proposals; however, we will show that this lightweight network structure is still sufficient for achieving high CSI compression performance, with a much cheaper model sharing overhead.
The proposed CSI feedback scheme is developed and evaluated in the TensorFlow [26] deep learning framework. In order to investigate the performance of LCF in various channel environments, we simulate channel coefficients using the WINNER II channel model [27,28]. We provide micro-benchmarks that evaluate the performance of the CSI prediction and compression processes, and we compare the overall performance of LCF with those of AFC [3] and CSINet [13] in terms of CSI recovery accuracy and model sharing overhead. Through extensive experiments, we show that LCF obtains more stable and better CSI compression performance in terms of the mean squared error (MSE), with as little as 10% of the model sharing overhead of the existing approaches. We summarize the main contributions of this paper as follows:
1. We propose a novel deep learning-based CSI feedback scheme, LCF, which effectively reduces the CSI feedback overhead by combining CSI prediction based on autoregressive LSTM with CSI compression using a convolutional autoencoder. We propose the use of CSI predictions to train the autoencoder, so that the compression model remains valid even in highly dynamic wireless channels.
2. We design a CSI feedback algorithm that lets the transmitter and the receiver share the compression model effectively, an aspect that has not been well investigated in previous studies. The proposed algorithm can be applied to existing deep learning-based CSI compression approaches as well.
3. The performance of LCF is evaluated for various wireless channel scenarios using the WINNER II model and compared with those of other approaches through extensive experiments. LCF shows more stable and better CSI compression performance, using only about 10% of the overhead of existing approaches.
The rest of this paper is organized as follows. In Section 2, we review the previous works related to this paper. In Section 3, we provide the preliminaries of this work, and Section 4 describes LCF in detail. Section 5 evaluates the performance of the proposed scheme, and we conclude this paper in Section 6.

Related Work
Numerous schemes have been proposed to address the CSI feedback overhead problem using diverse types of channel correlations. The channel coherence time, during which the channel state remains highly correlated, has been used as a key metric for eliminating unnecessary CSI feedback transmissions in many studies [3,8-10,12]. Huang et al. [8] analyze the effect of time-domain compression, based on a theoretical model of channel correlation over time. Sun et al. [9] simulate the 802.11n single-user MIMO (SU-MIMO) performance in time-varying and frequency-selective channel conditions. AFC [3] computes the expected SINR by comparing the previous and the current CSI values and then utilizes it to determine whether to skip a CSI feedback transmission or not.
Similar ideas can be applied to compressing the frequency-domain CSI values. Since in OFDM systems the channel estimation must be performed on each subcarrier, appropriate subcarrier grouping can reduce the feedback size significantly. In MIMO systems, spatial correlation can also be used for CSI compression. Gao et al. [10] design a channel estimation scheme for a MIMO-OFDM channel using both temporal and spatial correlations. Ozdemir et al. [11] analyze the parameters affecting spatial correlation and its effect on MIMO systems, and Karabulut et al. [12] investigate the spatial and temporal channel characteristics of 5G channel models, considering various user mobility scenarios. These schemes can be further improved with proper quantization schemes that encode the original CSI data with a smaller number of bits. AFC [3] employs an adaptive quantization scheme on top of its integrated time- and frequency-domain compression, and CQNET [14] optimizes codeword quantization using deep learning for massive MIMO wireless transceivers. Among other things, such quantization is actually used for codebook-based CSI reporting in current cellular and Wi-Fi systems [1,2].
Recently, as deep learning technologies have received attention for their powerful performance in extracting the principal components of data, there have been many efforts to use this capability for CSI compression [13-18] and estimation [19-23]. The autoencoder model is widely used in this field since it best fits the problem context. A novel CSI compression scheme, CSINet [13], uses a convolutional autoencoder to solve the CSI compression problem by turning it into a typical 2D image compression problem. Along this line, numerous variants of CSINet have been developed so far [14-18]. RecCsiNet [15] and CSINET-LSTM [16] incorporate LSTM structures into the existing autoencoder model to benefit from the temporal and frequency correlations of wireless channels. The authors of PRVNet [17] employ a variational autoencoder to create generative models for wireless channels. In DualNet [18], channel reciprocity is utilized for CSI feedback in FDD scenarios. Most of these approaches validate the feasibility of deep learning as an effective tool for CSI compression and feedback; however, several practical open issues remain regarding model sharing and generalization, which will be discussed in the following section.

Preliminaries
In this section, we describe the channel model and propagation scenarios used in this paper, and we explain the model sharing overhead problem, which motivates this work, in greater detail.

Channel Model and Propagation Scenarios
We consider an SU-MIMO communication scenario in which a receiver equipped with N_r antennas feeds the estimated CSI back to its transmitter, which is equipped with N_t antennas. Uniform Linear Array (ULA) antennas with 2 cm spacing are assumed for both the transmitter and the receiver. For simplicity, moving network scenarios are not considered. In order to simulate channel coefficients for diverse channel environments, we adopt the WINNER II channel model [27-31], which has been widely used in wireless communication research; it was recommended as a baseline for measuring radio communication performance by the ITU-R (International Telecommunication Union Radiocommunication Sector) [29,30]. According to this model, the channel coefficients are generated based on a Clustered Delay Line (CDL) model (Fig. 2), where the propagation channel is described as being composed of a number of separate clusters with different rays, and each cluster has a number of multipath components that have the same delay values but differ in the Angle-of-Departure (AoD) and Angle-of-Arrival (AoA).
In this paper, we consider two different propagation scenarios, namely "stable" and "dynamic" scenarios, which model indoor office environments and bad urban macro cells (C3 in [28]), respectively. Here, the former, as the name suggests, has a smaller channel status variation than the latter. Tab. 1 shows the basic statistics of the two channels. We use MATLAB to gather channel coefficient data for each case, and CSI data is sampled every 2 ms at the center frequency of 5.25 GHz. Note that MATLAB provides a toolbox for working with the WINNER model [27], which allows us to freely customize the network configuration, such as the sampling rate, center frequency, number of base stations and mobile stations, and their geometry and location information. In particular, since the main channel parameters for the various radio propagation scenarios defined by the WINNER model are already configured, the corresponding channel coefficient values can be easily obtained. For each scenario, the channel coefficient data is expressed as a four-dimensional normalized complex matrix of shape (N_s × N_d × N_t × N_r), where N_s is the length of the sampled data and N_d is the number of path delays. Throughout this paper, we use the terms "CSI" and "channel coefficients" interchangeably.
To show the difference between the two scenarios, we plot the changes over time of the channel coefficients for each scenario in Fig. 1. These channel coefficients correspond to the first path of receiving antennas 1-3 and transmitting antenna 1, and only the real parts of these values are displayed in the plot. From the figure, we can observe the spatial and temporal correlation of the channels for both scenarios, even though they differ in degree. In the case of the stable channel scenario (Fig. 1a), similar signal patterns are repeated quite clearly over time and also among the three receiving antennas. In the case of the dynamic channel scenario (Fig. 1b), it is difficult to find a clear pattern like that in the previous case, yet we can still see correlations in the two domains.
We can see the difference between the two channels in terms of correlation more clearly in Fig. 3. In this figure, we measure the correlation coefficient of any two CSI instances separated by T, using the following formula [3,32]:

ρ(T) = |Σ_{t=1}^{L−T} h(t) h*(t + T)| / ( sqrt(Σ_{t=1}^{L−T} |h(t)|²) · sqrt(Σ_{t=1}^{L−T} |h(t + T)|²) ),

where L is the total length of CSI instances and h(t) is the CSI instance at time t. Note that the above equation can also be applied to computing the correlation in the spatial domain (Fig. 3b) by changing the definition of the separation. As expected, the stable scenario has overall higher temporal correlations than the dynamic channel, as shown in Fig. 3a; their coherence times are 420 and 40 ms, respectively (channel coherence time is defined as the point at which the correlation value drops to 0.5 [33]). Compared to the temporal correlation result, higher spatial channel correlations among receiving antennas are observed in both cases (Fig. 3b), though the degree of correlation in the stable channel is still higher than that in the dynamic channel.
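The correlation measurement above can be sketched in NumPy. The normalized-autocorrelation form used below is an assumption consistent with the surrounding definitions (L, h(t), and the separation T), not a verbatim reproduction of the formula in [3,32].

```python
import numpy as np

def temporal_correlation(h, T):
    # h: 1-D complex array of L CSI instances h(t); T: separation in samples.
    # Normalized autocorrelation magnitude, as plotted in Fig. 3 (assumed form).
    a = h[:len(h) - T] if T > 0 else h
    b = h[T:]
    num = np.abs(np.sum(a * np.conj(b)))
    den = np.sqrt(np.sum(np.abs(a) ** 2) * np.sum(np.abs(b) ** 2))
    return num / den

# A perfectly periodic "channel" stays fully correlated at one period:
t = np.arange(200)
h = np.exp(1j * 2 * np.pi * t / 50)           # period of 50 samples
print(round(float(temporal_correlation(h, 0)), 3))   # 1.0
print(round(float(temporal_correlation(h, 50)), 3))  # 1.0
```

The coherence time reported in the text is then simply the smallest T at which this quantity drops below 0.5.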

Model Sharing Overhead
As we saw earlier, wireless channels are inherently diverse, and their characteristics are thus hard to generalize; some channels remain highly correlated for long periods of time, e.g., the stable channel, while others may experience large channel fluctuations, e.g., the dynamic channel. This aspect, unfortunately, has not been fully taken into account in most of the previous deep learning-based CSI compression schemes, although it leads to a substantial model sharing overhead that eventually limits the gains of deep learning. To identify the model sharing overhead in more detail, we revisit the CSI compression performance of CSINet [13] for three different channel environments, including the two channels described in the previous subsection. As a baseline, we additionally consider a purely random channel, where the channel coefficients are sampled from the normal distribution with zero mean and a variance of 0.1. Note that the stable channel used in this paper represents an ideal case in which we can readily predict how the channel will change in the future, while the random channel can be viewed as the other end of the spectrum. The mean squared error (MSE), defined in Section 5, is used as the performance metric, and we train the model using Adam optimization [34] with 10,000 CSI datasets and a maximum of 1,000 epochs.
First, Fig. 4a shows the compression performance for varying compression ratios. What we pay attention to here is the performance in the dynamic channel: it deteriorates rapidly as the compression ratio increases, and its MSE reaches 0.1 even at the low compression ratio of 1/8, which is too large for practical use. Throughout this paper, the compression failure criterion, denoted as δ_thr, is set to an MSE of 0.1. Considering that, for any wireless channel, the performance of CSINet will lie somewhere between the two curves of the stable (blue) and random (black dotted) channels, we can conclude that it is practically usable only at the compression ratio of 1/8. When compression fails, the model has to be retrained and shared between the transceivers again, which causes additional transmissions, i.e., model sharing overhead. Unfortunately, most of the previous approaches focus only on making the model work at higher compression ratios, and thus they prefer deep and wide networks in their designs, hoping that model sharing will not occur frequently. However, as can be seen, it can occur very frequently. In the case of the dynamic channel, at compression ratios above 1/16, the model will always produce a recovery error above the threshold for every CSI sample; in this case, its heavy network structure will accelerate the increase in feedback overhead. Therefore, proper model sharing must be considered when designing a deep learning-based CSI compression scheme. Fig. 4 also shows the result when 50 consecutive CSI values are compressed with a fixed model. If the channel is quite reliable, as in the stable channel case, we might keep the high compression gains of deep learning, but this is not always guaranteed; as can be seen in the dynamic channel case, high recovery errors can continue over time, leading to additional feedback transmissions.
One might think that this problem can be solved through training the model with more data and thus strengthening the model generalization. Unfortunately, this approach may not be very effective when dealing with wireless channel data, which generally have large irregularities over time. As shown in Fig. 4, even if we increase the number of CSI training data samples, there are only slight performance gains. In this case, the amount of training data may not be that important; rather, it is more important to use a proper set of training data that reflects the pattern and tendency of the current channel status well, and for this, appropriate channel status predictions could be very helpful.
We describe the proposed CSI feedback scheme in detail in the following section.

Overview
In this section, we provide an overview of the proposed scheme. As mentioned before, LCF is composed of two main processes, as shown in Fig. 5, for which different deep learning models are used: 1) autoregressive LSTM for CSI prediction, and 2) a convolutional autoencoder for CSI compression and reconstruction. In the first process, as the name suggests, a receiver generates predictions for future channel states using the accumulated CSI values, which in turn are used as the training dataset for autoencoder optimization. This process is described in detail in Section 4.2. In the second step, the actual CSI compression and recovery is performed. Using the encoder of the autoencoder, the receiver compresses the estimated CSI into an M-byte codeword and then sends it back to the transmitter. Upon receiving the compressed CSI, the transmitter reconstructs the original CSI with the decoder of the autoencoder, which has already been shared by the receiver. More details on the compression model are given in Section 4.3. Ideally, the feedback process of LCF requires only an M-byte data transmission, and therefore the feedback overhead can be significantly reduced. However, as mentioned earlier, such a gain is not always achievable; in some channel environments, the compression model can quickly become less effective and even invalid. To tackle this issue, in LCF, the receiver dynamically updates and shares the compression model depending on the expected CSI recovery error obtained during CSI compression. We explain this in more detail in Section 4.4.

CSI Prediction Using Autoregressive LSTM
Let H_i be the complex channel coefficient matrix of shape (N_d × N_t × N_r) at time slot i, and assume that a receiver keeps a set of accumulated CSI matrices, denoted as H_a. The objective of this step is to generate CSI predictions for the next N_o ≥ 1 time slots, denoted as Ĥ_o = [Ĥ_{i+1}, Ĥ_{i+2}, . . . , Ĥ_{i+N_o}]^T. To do this, we employ an autoregressive LSTM model, as shown in Fig. 6. Let f_LSTM and θ_LSTM be the model and its parameters, respectively. Then, we have Ĥ_o = f_LSTM(H_i; θ_LSTM), where H_i, which is part of H_a, is the vector of the last N_i CSI samples. Note that H_a is updated whenever a new CSI sample is obtained. In this approach, the output of the model is repeatedly fed back into the model as input to generate the next prediction. For example, the prediction for time slot i+1, i.e., Ĥ_{i+1}, is in turn used to predict the next CSI value, Ĥ_{i+2}. This process is repeated until all N_o target CSI predictions are made.
The proposed prediction model has two layers, an LSTM layer and a fully connected layer, as shown in Fig. 6. CSI predictions are made for each combination of path delay, transmitting antenna, and receiving antenna. Additionally, we handle the real and imaginary parts of the complex channel coefficients separately, since complex numbers are inherently not comparable in size and thus cannot be directly used in optimization. That is, we use 2N_d N_t N_r models in total, and each model generates the CSI predictions for its corresponding combination. The input data shape for the models is (batch × N_i × 1), where the last number indicates the number of units (features) in the fully connected layer. One distinct feature of CSI data is that input samples keep arriving sequentially, and relatively old samples can quickly become less useful. In this case, we can use online learning; instead of always training on the entire data set, i.e., H_a, training is in most cases performed only on a small data set containing the new samples, i.e., H_i. In particular, the weights obtained in the previous step can be reused to speed up convergence. We use the MSE as the optimization objective and employ Adam optimization [34].
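The autoregressive loop described above can be sketched as follows. A simple linear-trend predictor stands in for the per-combination LSTM model f_LSTM (an illustrative assumption); the point is the feedback mechanism, in which each prediction is appended to the input window to produce the next one.

```python
import numpy as np

def autoregressive_predict(one_step_model, h_in, n_out):
    """Generate n_out future CSI values autoregressively.

    one_step_model plays the role of f_LSTM: it maps a window of the
    last N_i samples to a single next-step prediction. Each prediction
    is appended to the window so it feeds the next step, as in Fig. 6.
    """
    n_i = len(h_in)
    window = list(h_in)
    preds = []
    for _ in range(n_out):
        nxt = one_step_model(np.asarray(window[-n_i:]))
        preds.append(nxt)
        window.append(nxt)          # prediction becomes future input
    return np.asarray(preds)

# Stand-in predictor: continue a linear trend from the last two samples.
trend_model = lambda w: 2 * w[-1] - w[-2]

h_in = np.array([0.0, 0.1, 0.2, 0.3])   # last N_i = 4 samples
print(autoregressive_predict(trend_model, h_in, 3))
```

Because each step consumes the previous step's output, errors accumulate over the horizon, which is exactly the behavior observed in the Fig. 7 example below.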
For better understanding, we illustrate an example of CSI prediction in Fig. 7. The channel coefficients shown in the figure are sampled from the channel of the first path between transmitting antenna 1 and receiving antenna 1 in the stable channel case. We depict two curves, for the real and imaginary parts of the channel coefficients. This example corresponds to the case of N_i = N_o = 20, which means the channel coefficients from index 10 to index 29 are used to generate the next 20 CSI values, from index 30 to index 49. As expected, the prediction performance depends on the previous prediction results; errors accumulate as predictions are made, and thus the prediction accuracy gradually drops as time advances. Starting from an MSE of 0.0001, the gap between the actual data and the prediction widens steadily, reaching 0.0032 at the last position. However, we note that this level of error is acceptable for training data, as we will see later.

Convolutional Autoencoder-Based CSI Compression and Reconstruction
The proposed CSI compression model has the typical structure of an autoencoder, as shown in Fig. 8. It consists of two parts: an encoder and a decoder. Let f_enc and f_dec be the encoder and the decoder, respectively. The encoder takes the current CSI (i.e., H_i) as input, which is a four-dimensional channel coefficient matrix of shape (N_d × N_t × N_r × 2), where the last element denotes the real and imaginary parts. The first layer in the encoder is a 3D convolution layer, where three-dimensional filters are used to capture the channel correlation in both the spatial (for both the transmitting and receiving antennas) and delay domains. By default, we use 16 (3 × 3 × 3) filters, and the LeakyReLU activation function with a parameter of 0.3 is applied; no striding is used. The feature maps acquired from this layer are then transferred to a fully connected layer with M units through average downsampling with a shape of (2 × 2 × 2) and flattening, and thus the M-byte compressed CSI, denoted as H_M, is obtained as a result, i.e., H_M = f_enc(H_i; θ_enc), where θ_enc is the set of encoder parameters.
The decoder is basically the mirror of the encoder. It first passes the encoded data (i.e., H_M) to a fully connected layer with N_f units, where N_f is the size of the output of the convolution layer in the encoder, and then transfers the outcome to a convolution layer through an upsampling layer. Like the convolution layer in the encoder, the convolution layer in the decoder also uses three-dimensional filters; however, we employ a transposed convolution layer in the decoder to match the input shape and the output shape, i.e., (N_d × N_t × N_r × 2). The same LeakyReLU activation function is applied, and L2 regularizers with a parameter of 0.001 are applied to all layers in the model. As a result, the decoder reconstructs the original CSI data from the compressed data H_M; that is, Ĥ = f_dec(f_enc(H_i; θ_enc); θ_dec) = f_dec(H_M; θ_dec), where θ_dec is the set of decoder parameters. In the following subsection, we explain how the model parameters (θ_enc and θ_dec) are trained and shared between transceivers.
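To make the model sharing cost concrete, the decoder size implied by the structure above can be estimated with a rough count. The 'same'-padding assumption and the choice of the pooled, flattened feature-map size as N_f are ours; the actual layer dimensions are not fully specified in the text.

```python
# Rough decoder parameter count for the autoencoder sketched above,
# assuming 'same' padding and the default 16 (3x3x3) filters.
def decoder_param_count(Nd, Nt, Nr, M, filters=16, k=3):
    # N_f taken as the flattened feature-map size after 2x2x2 pooling
    pooled = (Nd // 2) * (Nt // 2) * (Nr // 2) * filters
    fc = M * pooled + pooled            # dense layer: M -> N_f (+ biases)
    # transposed 3D convolution: 'filters' input channels -> 2 outputs
    conv = k * k * k * filters * 2 + 2
    return fc + conv

# Example: N_d = 16, N_t = N_r = 4, and a 1/8 compression ratio (M = 64)
print(decoder_param_count(16, 4, 4, 64))
```

The count is dominated by the fully connected layer, which is why, as later shown in Fig. 11c, the decoder size shrinks roughly in proportion to the compression ratio while the number of filters adds a smaller, fixed convolutional cost.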

Model Training and Sharing
In LCF, all parameters of the autoencoder (both θ_enc and θ_dec) are trained by the receiver, but only the decoder parameters (θ_dec) are sent to the transmitter, and only when needed. At the very beginning, the receiver trains the autoencoder with (N_o + 1) CSI values, consisting of the current CSI, i.e., H_i, and the newly generated N_o CSI predictions, i.e., Ĥ_o; through this step, it obtains the trained encoder and decoder parameters. Since the decoder model is updated, its parameters need to be sent to the transmitter. The target CSI is then compressed by retraining the model on it. Note that in this step, training is performed with the decoder parameters fixed, since they are already shared with the transmitter. The θ_enc parameters remain trainable, so they may have different values before and after training; however, since they are used only by the receiver and never shared, this has no significant impact on the system as a whole.
Every time the receiver compresses CSI, it obtains, as a result of training, an optimization error, denoted as δ, which corresponds to the expected CSI recovery error at the transmitter. Depending on this value, it decides whether to send the decoder parameters to the transmitter. If δ is less than a predefined threshold δ_thr, the receiver sends only the compressed CSI, i.e., H_M, since this implies that the decoder parameters are still valid for the current target CSI thanks to the CSI prediction. Otherwise, the receiver discards the previous decoder parameters, retrains the entire model, and sends the newly trained decoder parameters (i.e., θ_dec) to the transmitter along with the compressed CSI.

Algorithm 2: Transmitter [listing omitted]
To summarize, we provide the entire proposed CSI feedback algorithm in Algorithms 1 and 2. Basically, the proposed method is designed for two communication entities, but it can be extended to multi-user scenarios as well. However, in this case, the transmitter may have to maintain different models for different users, which can cause additional operational burdens such as increased model sharing overhead. Therefore, we should consider mixing the proposed deep learning-based approach with traditional approaches. We leave this issue for our future work.

Settings
In this section, the performance of LCF is evaluated. We use TensorFlow 2 [26] to develop the proposed deep learning models of LCF and conduct extensive experiments on an Intel i7 machine with 16 GB RAM and an NVIDIA RTX 3080 GPU. Using the MATLAB WINNER II model [27], we generate CSI datasets for both scenarios, sampled every 2 ms. When training the models, we use 70% of the total dataset for training, and the remaining 20% and 10% for validation and testing, respectively. For model optimization, we use the MSE as the objective and employ Adam optimization [34] with a maximum of 1,000 epochs and a learning rate of 0.001. The default parameters used in the experiments are described in Tab. 2. To compare the performance of LCF with those of other approaches, we additionally implement AFC [3] and CSINet [13]. Unfortunately, since these schemes have different features and feedback policies, we make some modifications to them to ensure a fair comparison. The main changes are as follows:
• AFC: As AFC is not a machine learning-based approach, it does not require a training step; it determines the degree of compression by calculating the expected compression error each time it receives CSI. The adaptive bit-quantization scheme is excluded, since it could equally be applied to the other schemes. In the original AFC, the compression ratio can also be dynamically adjusted depending on the channel status, unlike the other two schemes, which use a fixed compression ratio. In this study, for simplicity, we apply a fixed compression ratio to AFC.
• CSINet: CSINet considers only single-antenna receiver cases. To extend it to multi-antenna receiver cases, we apply it repeatedly to the channel of each receiving antenna. We use the same training configuration (both dataset and optimizer) for CSINet and LCF.
• Both: All these schemes can skip a CSI transmission or model (i.e., decoder) parameter transmission if the expected CSI recovery error is less than δ_thr. Note that even if this condition is satisfied, LCF and CSINet must still send the compressed CSI to the transmitter.
Compression ratio α is defined as the ratio of the compressed data size to the original CSI data size, i.e., α = M / (2 N_d N_t N_r), and as a key performance metric, we measure the MSE, defined in the following equation [13]:

MSE = ||H − Ĥ||² / ||H||².

In addition to the MSE, we use the cosine similarity between the original CSI and the reconstructed CSI to determine the value of δ_thr. The imperfect CSI due to compression causes changes in the resulting beam steering vectors, which can be measured as the cosine similarity between the two CSI values [3,13]. Fig. 9 shows the cosine similarity values as a function of the MSE for N_t = N_r = 4 and N_d = 16. To draw this plot, for each MSE value, we generate two sets of CSI matrices: one is randomly generated from the standard normal distribution, and the other is generated by adding random noise of the given MSE value to the first matrix. We then compute the cosine similarity between the two matrices for each MSE value. The result is quite predictable: the cosine similarity decreases as the MSE grows. Based on this result, we set δ_thr to 0.1, where the cosine similarity is around 0.95.
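The two metrics can be sketched in NumPy as follows. The flattened cosine-similarity definition below is a generic assumption; the exact per-vector form used for the beam steering comparison in [3,13] may differ.

```python
import numpy as np

def mse(H, H_hat):
    # Normalized MSE between original and reconstructed CSI, as in [13].
    return np.sum(np.abs(H - H_hat) ** 2) / np.sum(np.abs(H) ** 2)

def cosine_similarity(H, H_hat):
    # Cosine similarity over the flattened CSI matrices (assumed form).
    a, b = H.ravel(), H_hat.ravel()
    return np.abs(np.vdot(a, b)) / (np.linalg.norm(a) * np.linalg.norm(b))

# Mimic the Fig. 9 setup: a random CSI matrix plus noise of a given MSE.
rng = np.random.default_rng(0)
H = rng.standard_normal((16, 4, 4, 2))        # N_d=16, N_t=N_r=4, re/im
noisy = H + 0.1 * rng.standard_normal(H.shape)
print(mse(H, H))                      # 0.0 for a perfect reconstruction
print(cosine_similarity(H, noisy) > 0.95)
```

With small added noise, the similarity stays close to 1, consistent with the observation that an MSE around 0.1 still yields a cosine similarity of roughly 0.95.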
In the following subsections, we first investigate the performance of each model used in LCF through micro-benchmarks, and then we compare the overall performance of LCF with those of the other approaches.

Micro-Benchmarks

CSI Prediction Model
We investigate the impact of the LSTM parameters on the prediction performance, varying the number of LSTM units and the combinations of N_i and N_o. Fig. 10 illustrates the N_o = N_i cases for each scenario. The prediction accuracy decreases as N_o grows, except for the case where N_o = 5 and 256 units are used in the dynamic channel. This result is consistent with the previous observation in Fig. 7: as the model predicts CSI values farther into the future, the prediction error becomes larger. The CSI prediction performance is mainly affected by the value of N_o, but the number of LSTM units also has an effect. In all cases, the prediction becomes more accurate as the number of LSTM units increases, except for the N_o = 5 case in the dynamic channel. This exception is due to overfitting: using more LSTM units increases the model capacity too much, making it difficult to handle unobserved CSI data. Recall that in this evaluation, the dynamic channel exhibits a larger and more rapid channel variation than the stable one, and is thus more likely to suffer from overfitting. Overall, better CSI prediction results are observed in the stable channel than in the dynamic channel; the worst MSE in the stable case is around 0.02, while in the dynamic case it reaches almost 0.9. However, by choosing small N_i and N_o values, we can improve CSI predictions in the dynamic channel as well; when N_o is 5 and the number of LSTM units is 32, the prediction error reaches its lowest value of 0.02.
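For reference, the autoregressive rollout that turns a one-step predictor into an N_o-step forecast can be sketched as follows. Here `step_fn` stands in for the trained LSTM; the helper is illustrative, not the authors' implementation.

```python
import numpy as np

def autoregressive_forecast(step_fn, history, n_out):
    # step_fn maps a window of the N_i most recent CSI samples to the
    # next sample (in LCF, the trained LSTM); history holds those
    # N_i samples. Each prediction is fed back into the window, so
    # errors compound as n_out (N_o) grows -- matching the trend
    # observed in Figs. 7 and 10.
    window = list(history)
    preds = []
    for _ in range(n_out):
        nxt = step_fn(np.stack(window))
        preds.append(nxt)
        window = window[1:] + [nxt]   # slide the window forward
    return np.stack(preds)
```

With a toy predictor that simply adds 1 to the last sample, a history of [1, 2, 3] rolls forward to [4, 5, 6].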

CSI Compression and Recovery Model
In this subsection, we evaluate the performance of the CSI compression model. For each scenario, we first generate 20 CSI predictions with N_i = N_o = 20 and use them to train the autoencoder. We then compress and restore the corresponding CSI label data with the trained model, measuring the difference between the two. Recall that the decoder parameters are fixed once they are trained. We repeat this experiment 100 times and take the average. Fig. 11 shows the results in terms of the MSE and the number of decoder parameters, for various numbers of encoder filters and compression ratios. First, from Figs. 11a and 11b, we observe that the number of filters greatly affects the compression performance: for the same compression ratio, using more filters yields lower recovery errors. Unfortunately, in return for this high performance, using more filters makes the model larger, resulting in a higher model sharing overhead, as shown in Fig. 11c. The number of filters is not the only factor affecting the compression performance; the compression ratio matters as well, since it directly determines the number of units in the fully connected layers of the autoencoder. As expected, the decoder size decreases with the compression ratio, at the expense of a higher compression error. We also find that the recovery errors in the stable channel are overall lower than those in the dynamic channel, although the difference is subtle.
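The dependence of the decoder size on the compression ratio can be illustrated with a back-of-the-envelope count of the decoder's fully connected layer alone. Convolutional layers are ignored here, so the numbers are illustrative rather than the actual Fig. 11c values.

```python
def fc_decoder_params(alpha, n_d=16, n_t=4, n_r=4):
    # The flattened channel has 2 * N_d * N_t * N_r real entries; the
    # code length is M = alpha * 2 * N_d * N_t * N_r, and the decoder's
    # fully connected layer maps the code back to the full size.
    total = 2 * n_d * n_t * n_r
    m = int(alpha * total)
    return m * total + total   # weights + biases

# Halving alpha roughly halves this count, which is why the decoder
# size falls with the compression ratio, at the cost of a higher MSE.
```

For the default N_t = N_r = 4 and N_d = 16, moving from α = 1/8 to α = 1/16 shrinks this layer from 33,280 to 16,896 parameters.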

Macro-Benchmarks
In this evaluation, we compare the performance of LCF with those of AFC [3] and CSINet [13]. We run each scheme for 20 consecutive CSI values and measure the average MSE and feedback size at different compression ratios from 1/8 to 1/128. Here, the feedback size is defined as the combined sizes of the model and the compressed CSI that are sent to the transmitter. We repeat this evaluation for both the stable and dynamic channel scenarios, and Fig. 12 shows the results.
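The feedback accounting used in this comparison can be sketched as follows, applying the δ_thr skip rule described earlier. The function and its units are illustrative (AFC, which shares no model, would simply use decoder_len = 0).

```python
def avg_feedback_size(expected_errors, code_len, decoder_len, delta_thr=0.1):
    # Per CSI sample: the compressed code is always sent (LCF/CSINet),
    # while the decoder parameters are re-sent only when the expected
    # recovery error reaches delta_thr.
    total = 0.0
    for err in expected_errors:
        total += code_len
        if err >= delta_thr:
            total += decoder_len
    return total / len(expected_errors)
```

For example, with a code length of 32, a decoder size of 100, and expected errors of 0.05 and 0.2 over two samples, the average feedback size is (32 + 132) / 2 = 82.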
From the results, we can see that AFC benefits from a small feedback size; for both scenarios, its maximum overhead is only 32, much smaller than those of the deep learning-based schemes. However, it suffers from high CSI recovery errors. As shown in Figs. 12a and 12d, its MSE values are all larger than 1, which makes it impractical. We note, however, that the actual AFC can perform better than this modified AFC because of its adaptive compression ratio and quantization, which were excluded in this evaluation. Compared to AFC, CSINet and LCF both achieve lower MSE results. When the compression ratio is 1/8, CSINet obtains its minimum MSE values of 0.07 and 0.24 for the stable and dynamic cases, respectively. Although these values are better than those of AFC, they are still somewhat unstable. In particular, at a higher compression ratio such as 1/128, or in the highly dynamic wireless channel scenario, the CSI recovery error increases significantly. Its MSE values are around 0.5 in the dynamic case, as shown in Fig. 12d, which verifies our hypothesis that CSI compression quickly becomes less effective without proper model updates.
To make matters worse, this result eventually incurs a substantial model sharing overhead, as shown in Figs. 12b and 12e. The two curves of the two scenarios show different patterns (one rises while the other falls) because the major factor affecting the feedback size differs. In the stable channel case, model sharing rarely occurs at low compression ratios, and thus the feedback overhead decreases with the compression ratio; from Fig. 12c, we can see that only 1-2 model sharing transmissions happen when the compression ratio is lower than 1/32. However, a higher compression ratio results in more frequent model sharing, causing the decoder size to take up most of the feedback overhead; as a result, when the compression ratio is 1/128, model feedback occurs for every CSI sample (Fig. 12c). Conversely, in the dynamic channel case, model sharing occurs for every CSI sample, as shown in Fig. 12f, so the number of model transmissions no longer dominates the results. In this case, the higher the compression ratio, the smaller the model size, which in turn reduces the feedback overhead.

Overall, LCF outperforms the other approaches in terms of MSE. Even at the highest compression ratio of 1/128, it achieves an MSE of 0.05, much lower than those of the other schemes. More surprisingly, LCF obtains this result with a lower feedback overhead; the average feedback overhead is only around 40 and 120 for the stable and dynamic cases, respectively. From Fig. 12b, we can see that LCF has a higher feedback overhead than CSINet in the stable channel case at low compression ratios (e.g., 1/8 and 1/16). This is because LCF directly takes three-dimensional channel data as input, so the number of units in its fully connected layers is inherently larger than that of CSINet for the same compression ratio.
As shown here, the gains of LCF may not be noticeable in these special situations, where model updates are rarely required. In most cases, however, compared to the existing CSI feedback approaches, LCF achieves more stable and higher CSI compression performance with only 10% of their model sharing overhead.

In this paper, we proposed LCF, which addresses the issues of conventional autoencoder-based CSI feedback schemes, namely that CSI compression quickly becomes less effective and incurs an excessive model sharing overhead over time. Employing an autoregressive LSTM model, LCF generates CSI predictions and exploits them to train the autoencoder, so that the compression model remains valid even in highly dynamic wireless channels. To fully capture the channel correlations and achieve higher CSI compression, three-dimensional convolutional layers are applied directly in the autoencoder. As a result, compared to the existing CSI feedback approaches, LCF obtains more stable and better CSI compression performance in terms of the MSE, with only 10% of the model sharing overhead of the other approaches.
The LSTM model in LCF forecasts time-series CSI data well, but unfortunately it has the well-known drawback of a long training time. Several approaches can be considered to remedy this issue. First, instead of training the prediction model on all of the data, we could use an ensemble learning strategy that updates the model with the new data and combines it with the existing model [35,36]. We could also consider using a different type of network. Gated Recurrent Units (GRUs) [37] could be a good alternative, since they require less computation with fewer parameters than LSTM. More generally, Convolutional Neural Networks (CNNs) are computationally cheaper than the models in the Recurrent Neural Network (RNN) family, and thus they could be used for this task as well. In that case, it would be easier to combine the two currently separate models of LCF into one, resulting in higher efficiency. These alternatives should be evaluated not only with the two channel models used in this paper, but also in more realistic and diverse channel environments. We leave these issues for future work.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.