Efficient Gauss-Seidel Precoding with Parallel Calculation in Massive MIMO Systems

A number of requirements for 5G mobile communication are satisfied by adopting multiple input multiple output (MIMO) systems. The inter user interference (IUI) which is an inevitable problem in MIMO systems becomes controllable when the precoding scheme is used. In this paper, the horizontal Gauss-Seidel (HGS) method is proposed as precoding scheme in massive MIMO systems. In massive MIMO systems, the exact inversion of channel matrix is impractical due to the severe computational complexity. Therefore, the conventional Gauss-Seidel (GS) method is used to approximate the inversion of channel matrix. The GS has good performance by using previous calculation results as feedback. However, the required time for obtaining the precoding symbols is too long due to the sequential process of GS. Therefore, the HGS with parallel calculation is proposed in this paper to reduce the required time. The rows of channel matrix are eliminated for parallel calculation inHGSmethod. In addition,HGSuses the ordered channel matrix to prevent performance degradation which is occurred by parallel calculation. The HGS with proper number of parallelly computed symbols has better performance and reduced required time compared to the traditional GS.

MIMO systems promise significant improvements in terms of spectral efficiency, reliability, and data rate compared to multi-user (MU) MIMO systems by using large channel matrix [9][10][11]. The significant improvement of performances for the massive MIMO is shown mathematically in [12]. If the channel matrix is extremely large, the effect of fast fading and non-correlated noise becomes extinct [13]. However, computational complexity at BS is extremely increased because of large array of antennas at BS. The inversion of the channel matrix is critical task for massive MIMO signal processing [14]. It means that the precoding schemes which use exact inversion of channel matrix are impractical in massive MIMO systems. Therefore, the schemes which use approximated matrix inversion have been studied for achieving low computational complexity in massive MIMO systems.
The match filter (MF) precoding is the simplest precoding scheme. However, the performance of the MF precoding is much poorer than the ZF scheme [9]. The ZF scheme at MU-MIMO systems has sub-optimal performance and the lowest computational complexity. In contrast, the ZF in massive MIMO systems has optimal performance because of the large channel matrix [15]. However, in massive MIMO systems, since the inversion of large gram matrix has to be calculated, ZF scheme has impractically large computational complexity which is cubic order with respect to the number of users. Therefore, in [16], the approximate inversion of gram matrix is calculated by using Neumann Series (NS). The computational complexity of approximate inversion using NS is lower than exact inversion and the calculation of NS is parallelized. However, when the iteration of NS is more than 2, NS has also computational complexity which is cubic order.
In contrast, precoding scheme based on Gauss-Seidel (GS) which was proposed in [17] keeps the computational complexity which is square order. In addition, since the result of the previous precoding symbol is used as the feedback of the following calculation of next precoding symbol, the performance of the scheme based on GS is better than the scheme based on NS. It means that the calculation at GS is not parallel processing and requires longer time for obtaining precoding symbol. Therefore, in this paper, horizontal GS (HGS) which reduces required time for obtaining precoding symbol is proposed. The HGS calculates some precoding symbols without feedback of the previous precoding symbol for parallel processing. The conventional schemes use calculation result of previous symbol for calculating present symbol. It means that conventional schemes have to wait for result of the (n − 1)th symbol to calculate the nth symbol [18]. However, parallel calculation at HGS does not need to wait for result of previous symbol. Parallel calculation technique calculates some symbols simultaneously. The calculation without feedback at GS results in the performance degradation. In HGS, the channel matrix is sorted to overcome performance degradation. The calculation without feedback at GS results in the performance degradation. In HGS, the channel matrix is sorted to overcome performance degradation. Thus, the required time for obtaining precoding symbol is reduced at HGS by calculating precoding symbol without feedback. In addition, the performance of HGS is better than GS by sorting the channel matrix.
In Section 2, system model which is used in this paper is explained. The conventional schemes which are compared with the proposed scheme are explained in Section 3. In Section 4, HGS scheme is proposed. Section 5 shows the performance comparison between the conventional schemes and the proposed scheme.

System Model
In Fig. 1, massive MIMO broadcasting system which is composed of one base station with N T transmit antennas and K users is considered. Rayleigh flat fading channel is assumed and the kth user has one receive antenna. The number of transmit antennas is much larger than the total number of receive antennas N R (N T N R ). The transmit signal vector for the kth user is x k . The received signal y k for the kth device can be expressed as follows, where H k ∈ C N R ×N T , P k ∈ C N T ×1 , j and n k ∈ C N k ×1 are the channel matrix of the kth user, precoding matrix for the kth user, index of other users except for the kth user and the additive white Gaussian noise vector of the kth user which has zero mean and variance σ 2 n , respectively. The set of entire received signal vectors y ∈ C N R ×1 can be expressed as follows, whereH ∈ C N R ×K is the effective channel. The effective channelH is obtained as follows, Diagonal components in the effective channel matrix are MIMO channels for each user and off-diagonal components express the IUI.

Zero Forcing Precoding
The ZF precoding scheme uses exact inversion of the gram matrix to eliminate IUI and inter antenna interference (IAI). The precoding matrix of ZF can be expressed as follows, where β = K tr(Z −1 ) and Z = HH H are the normalization factor and the gram matrix, respectively. The precoded signal vector can be expressed as follows, wherex = Z −1 x. In massive MIMO systems, since the channel matrix has large array, the ZF can achieve optimal performance. However, the exact inversion of gram matrix is hard to be calculated because of high computational complexity of O(K 3 ). Therefore, the method to calculate approximate inversion of the gram matrix is needed.

Gauss-Seidel Precoding
For downlink massive MIMO systems, the columns of channel matrix H are asymptotically orthogonal [9]. It means that the gram matrix Z of the channel is the Hermitian positive definite which is the condition to exploit GS method. Therefore, the approximate inversion of gram matrix can be obtained by GS method.
The GS method is an iterative technique for solving linear equation Ax = b when A, x and b are the K × K Hermitian positive definite matrix, the K × 1 unknown vector and the K × 1 measurement vector, respectively. The A can be decomposed as follows, where D, L and L H are the diagonal component, the strictly lower triangular component and the strictly upper triangular component of A, respectively. The solution of Ax = b which is calculated iteratively by using GS method can be expressed as follows, where x (i+1) and x (i) are the (i + 1)th and the (i)th approximations of x.
The GS method can be applied to linear equationx = Z −1 x which can be rewritten as Zx = x. The solution of linear equation Zx = x is expressed as follows, is sequentially calculated as follows, It means that the components ofx (i+1) from the first to the (m − 1)th index are used as . Therefore, the performance for the GS method is enhanced due to the use of previous results. However, since the calculation of the mth component needs results of previous components, the mth component has to wait for calculation results of the components from the first to the (m − 1)th index. It means that the required time to calculate precoding symbol is long. Therefore, GS method which can be computed in parallel is proposed in this paper to reduce the required time.

Proposed Horizontal Gauss-Seidel Precoding
The ZF precoding scheme has optimal performance with large scale MIMO systems. However, when the ZF is used, large computational complexity of exact matrix inversion is inevitable. Therefore, the methods which compute approximate inversion of channel matrix with low complexity have been studied such as GS method. In GS method, the results of the precoding symbol from the first to the (m − 1)th index are used to calculate the mth precoding symbol as feedback.
Since the previous results are used at the current calculation, the performance of GS method is improved. It means that the precoding symbols are computed sequentially. The required time to obtain precoding symbols becomes long due to sequential process of GS method. Therefore, HGS method with parallel calculation is proposed to reduce the required time. at the (i + 1)th iteration is used. In other words, when the mth component ofx (i+1) is current calculated symbol, (m − 1) results at the (i + 1)th iteration are used as feedback. The number of used feedbacks is increased as an index ofx (i+1) grows. In other words, with HGS, if the mth precoding symbol is chosen for calculating without feedback, the (m − 1) feedbacks are not used. The number of feedbacks which are not used in HGS varies depending on the index of the selected symbol. Therefore, the precoding symbol with the smallest index has to be chosen as calculated symbol without feedback. The symbol that requires the least feedbacks at calculation is chosen to reduce performance degradation by not using feedbacks. In HGS, the rows of L are eliminated horizontally from the second to the (s + 1)th index not to use results of calculation at the current iteration when the s is the number of selected symbols.
For example, Fig. 3 shows lower triangular matrix L when the number of calculated precoding symbol without feedback is 0 (s = 0). When the number of computed precoding symbol without feedback is 1 or 2 (s = 1 or s = 2), the row of L is eliminated as shown in Fig. 3.
The components which are excluded from L are added to upper triangular matrix L H as shown in Fig. 4. Therefore, Eq. (7) for selected symbol can be rewritten as follows, The required time for calculating a precoding symbol is shown in Tab. 1. If the required time for calculating a precoding symbol is t, the Kt time is needed at conventional GS method. In HGS, when the number of symbols which are parallelly calculated is s, the total required time is (K − s)t. However, HGS method cannot avoid the performance degradation because of the parallel calculation. Since the previous results are not used for computing current precoding symbol, the performance of the HGS method becomes poor.
The performance degradation of the HGS can be prevented by sorting the gram matrix of the channel. The criterion of sorting the gram matrix is the off-diagonal components of Z. The power of the off-diagonal components for the first ordering is expressed as follows, The off-diagonal components of the gram matrix mean the correlation with other channels. The large correlation with other channels means that the interference from other channels is large. Therefore, the symbol which has the channel corresponding to large off-diagonal components needs many feedbacks. It means that the row of gram matrix with maximum off-diagonal components should replace the bottom of the matrix. If the gth row of Z has maximum off-diagonal components, it becomes the Kth row of ordered Z. Since the interference between the gth channel and other channels is already considered at the first ordering, the power of the off-diagonal components is calculated except for the interference from the gth row to find the row with the second largest off-diagonal component. Therefore, the power of the off-diagonal components for second ordering is expressed as follows, The row with maximum P 2 m becomes the (K − 1)th row of ordered Z. The power of the off-diagonal components for the third ordering is also calculated except for the interference from the row with maximum P 2 m . Therefore, the power of the off-diagonal components for the second ordering is expressed as follows, where the g is the index of the row with maximum P (i−1) m . The row with minimum off-diagonal components are positioned on top of the matrix. The gram matrix at HGS is sorted into the off-diagonal components of the gram matrix. The performance of HGS is better than the GS by sorting the gram matrix. In addition, the performance of HGS is still better than the GS when the s is smaller than K 2 . It means that the HGS has better performance and shorter required time than traditional GS. The HGS needs extra operation for ordering the gram matrix of the channel compared to the traditional GS. However, while the channel is unchanged, the benefit from reducing the required time with an arrangement is considerable.
Since the approximation of HGS rapidly gets close to the exact inversion of the gram matrix, the performance of HGS is better than the conventional GS. However, the GS has better convergence rate by using only sequential calculation. The HGS overcomes poor convergence rate by sorting the gram matrix. The approximation error between approximate solution and exact solution can be expressed as follows, where thex = Z −1 x is the exact solution with inversion of the gram matrix. Eq. (14) can be rewritten as follows, where M = −(D + L) −1 L H is the iteration matrix. Since the small approximation error means fast convergence, the convergence rate accelerates when Frobenius norm of M is small [19].

Simulation Results
The error and throughput performances for the HGS are evaluated and compared with the conventional GS scheme. The Rayleigh flat fading channel is used and the perfect channel estimation is assumed. All elements of the channel matrices have independent complex Gaussian random variables with zero mean and unit variance. The system which has much larger number of transmit antennas compared to the number of receive antennas is considered. The number of transmit antennas is 80 or 100. The number of antennas for each user is 1 and total number of receive antennas is 10. The used modulation is 16-quadrature amplitude modulation (QAM). The number of GS iterations is 2 or 3. The number of precoding symbols which are computed parallelly at HGS is 0, 2 or 5. The bit error rate (BER) performances for the HGS are evaluated with various number of parallelly calculated symbols. Tab. 2 shows the simulation parameters. The BER and throughput performances for the HGS are shown in Figs. 5-8. The enhancement of the BER performance at HGS is obtained by ordering the gram matrix of the channel. The performances with 2 GS iterations are shown in Fig. 5. Since the performance enhancement is occurred by ordering gram matrix but the performance degradation of parallel calculation is not occurred, the HGS without parallel calculation (HGS-0) has the best performance. Therefore, the throughput of HGS-0 is larger than other schemes in Fig. 7a. In Fig. 7b, the gap of throughput between the schemes except HGS-5 is small. However, the HGS without parallel calculation has the same required time for obtaining precoding symbol as conventional GS. The HGS with 2 symbols which are parallelly computed (HGS-2) has better performance compared to the conventional GS. In Fig. 7a, HGS-2 achieves the maximum throughput rapidly compared to the conventional GS. In addition, the required time for calculating precoding symbol is reduced compared to the conventional GS due to the parallel calculation. However, since the number of feedback at HGS-2 is smaller than the feedback at HGS-0, the BER performance of HGS-2 is poorer than HGS-0. The throughput of HGS-2 is slightly smaller than HGS-0 in Fig. 7a. However, the throughput performances of HGS-0 and HGS-2 are almost same in Fig. 7b. When the number of symbols which are computed parallelly is larger than K 2 , the performance degradation of parallel calculation is larger than the performance enhancement of ordering gram matrix. Therefore, the performance of HGS with 5 symbols which are computed parallelly is poorer than the conventional GS. In Fig. 5b, since the number of transmit antennas grows to 100, the channel becomes more diagonal dominant. Therefore, the performances of all schemes are improved due to the diagonal dominant channel. Even though all schemes except HGS-5 has almost same throughput performance due to the diagonal dominant channel in Fig. 7b, the HGS-0 has poorer throughput performance because of the large number of parallelly computed symbols. The BER and throughput performances with 3 GS iterations are shown in Figs. 6 and 8. Since the 1 GS iteration is added, the approximate solution becomes close to the exact solution. Therefore, the performances of all schemes with 3 GS iterations are better than the performances of all schemes with 2 GS iteration. In addition, the GS, HGS-0 and HGS-2 have almost optimal BER performance. However, the HGS-5 has poorer performance compared to other schemes due to parallel calculation of many symbols in Fig. 6a. The HGS-5 has almost similar performance with other schemes due to diagonal dominant channel in Fig. 6b. The all schemes have almost optimal throughput performance due to added GS iteration in Fig. 8. The comparison of Frobenius norm is shown in Fig. 9. Since precoding symbols of the GS and HGS-0 are calculated with whole lower and upper triangular matrix for sequential calculation, the GS and HGS-0 have the smallest Frobenius norm. In addition, since the HGS-0 has performance enhancement by sorting the gram matrix, the HGS-0 has better BER performance compared to conventional GS. Since the lower triangular matrix without 2 rows is used at calculation, the HGS-2 has slightly large Frobenius norm compared to GS and HGS-0. However, the HGS-2 overcomes the reduced Frobenius norm by sorting the gram matrix. Therefore, the performance of HGS-2 is better than the GS although the Frobenius norm of the HGS-2 is larger than GS. Since the HGS-5 has too many symbols which are calculated parallelly, the difference of Frobenius norm between GS and HGS-5 becomes large. Therefore, the HGS-5 can not overcome the reduced Frobenius norm even though the HGS-5 uses the ordered gram matrix. The BER performance of HGS-5 is poorer than the GS because of the reduced Frobenius norm.

Conclusion
Massive MIMO system is a key component of future wireless communication in terms of high data rate over a limited frequency resource. In MIMO broadcast channel, IUI occurs inevitably at each device. Therefore, the BS has to utilize precoding schemes for IUI reduction.
The HGS with parallel operation is proposed to reduce the required time for obtaining the precoding symbol in massive MIMO systems. The conventional GS method has performance enhancement due to the sequential calculation which uses previous results as feedback. It means that the required time for obtaining the precoding symbols is too long. When the required time for obtaining one precoding symbol is t, the total required time at GS is Kt because of sequential calculation. Therefore, in HGS, some symbols are parallelly computed without feedback to reduce the required time. When the number of symbols which are parallelly calculated is s, the total required time is reduced to (K − s)t. However, the parallel calculation without feedback in HGS gives bad influence to BER performance. Therefore, the gram matrix which is sorted by interference of other channel is used to overcome the performance degradation of parallel calculation in HGS. When the number of symbols which are parallelly calculated is under K 2 , the BER performance of HGS is better than the conventional GS. However, the performance degradation of parallel calculation has serious impact on the performance compared to the performance enhancement of ordered gram matrix when the number of parallelly calculated symbol is larger than K 2 . Therefore, the performance of HGS with too many parallelly calculated symbols is poorer than the GS. The way to overcome performance degradation which is occurred when the number of parallelly calculated symbols is larger than K 2 has to be studied. In this paper, the proposed HGS scheme is proposed for reducing the required time by using parallel calculation.