Efficient Concurrent L1-Minimization Solvers on GPUs

Given that the concurrent L1-minimization (L1-min) problem is often required in some real applications, we investigate how to solve it in parallel on GPUs in this paper. First, we propose a novel self-adaptive warp implementation of the matrix-vector multiplication (Ax) and a novel self-adaptive thread implementation of the matrix-vector multiplication (Ax), respectively, on the GPU. The vector-operation and inner-product decision trees are adopted to choose the optimal vector-operation and inner-product kernels for vectors of any size. Second, based on the above proposed kernels, the iterative shrinkage-thresholding algorithm is utilized to present two concurrent L1-min solvers from the perspective of the streams and the thread blocks on a GPU, and optimize their performance by using the new features of GPU such as the shuffle instruction and the read-only data cache. Finally, we design a concurrent L1-min solver on multiple GPUs. The experimental results have validated the high effectiveness and good performance of our proposed methods.


Introduction
Due to the sparsity of the solution of the L1-min problem, it has been successfully applied in various fields such as signal processing [1][2][3], machine learning [4][5][6][7][8], and statistical inference [9]. Moreover, the concurrent L1-min problem where a great number of L1-min problems need be concurrently computed is often required in these real applications. This motivates us to discuss the concurrent L1-min problem in this paper. Here the following concurrent L1-min problem is considered: min jjx i jj 1 s:t: Ax i ¼ b i ; i ¼ 1; 2; Á Á Á ; k; (1) where A 2 R mÂn ðm << nÞ is a full-rank dense matrix, b i 2 R m is a pre-specified vector, and x i 2 R n is an unknown solution. Each one of these L1-min problems is independent except that they share the matrix A. This makes it suitable for parallel computing.
Given their multiple core structures, graphics processing units (GPUs) have sufficient computation power for scientific computations. Processing big data by GPUs has drown much attention over the recent years [10][11][12][13][14]. Following the introduction of the compute unified device architecture (CUDA), a programming model that supports the joint CPU/GPU execution of applications by NVIDIA [15], GPUs have become strong competitors as general-purpose parallel programming systems.
Due to high compute capacity of GPUs, accelerating algorithms that are used to solve the L1-min problem on the GPU has attracted considerable attention recently [16,17]. As we know, there exists a great number of L1-min algorithms such as the gradient projection method [18], truncation Newton interior-point method [19], homotopy methods [20], augmented Lagrange multiplier method (ALM) [21], class of iterative shrinkage-thresholding methods (FISTA) [22], and alternating direction method of multipliers [23]. And most of them are composed of dense matrix-vector multiplications such as Ax and A T x, and vector operations. In 2011, Nath et al. [24] presented an optimization symmetric dense matrixvector multiplication on GPUs. In 2016, Abdelfattah et al. [25] proposed an open-source, highperformance library for the dense matrix-vector multiplication on GPU accelerators, KBLAS, which provides optimized kernels for a subset of Level 2 BLAS functionalities on CUDA-enabled GPUs. Moreover, a subset of KBLAS high performance kernels has been integrated into the CUBLAS library [26], starting from version 6.0. In addition, there have been highly efficient implementations for the vector operation on the GPU in the CUBLAS library. Therefore, the existing GPU-accelerated L1-min algorithms are mostly based on CUBLAS.
However, for the implementations of Ax and A T x in CUBLAS, the performance value fluctuates as m (row) increases when n (column) is fixed or n increases when m is fixed, and the difference between the maximum and minimum performance values is distinct. In [17], Gao et al. observe these phenomena, and present two novel Ax and A T x implementations on GPU to alleviate the drawbacks of CUBLAS. Furthermore, they take FISTA and ALM to propose two adaptive optimization L1-min solvers on GPU.
In this paper, we further investigate the design of effective algorithms that are used to solve the L1-min problem on GPUs. Different from other publications [16,17], here we emphasize the design of concurrent L1-min solvers on GPUs. First, we enhance Gao's GEMV and GEMV-T kernels by optimizing the warp allocation strategy of the GEMV kernel and the thread allocation strategy of the GEMV-T kernel, and designing the optimization schemes for the GEMV and GEMV-T kernels. Second, the vector-operation and inner-product decision trees are established automatically to merge the same operations into a single kernel. For any-sized vector, the optimization vector-operation and inner-product implementation methods are automatically and rapidly selected from the decision trees. Furthermore, the popular L1-min algorithm, fast iterative shrinkage-thresholding algorithm (FISTA), is taken for example. Based on the proposed GEMV, GEMV-T, vector-operation and inner-product kernels, we present two optimization concurrent L1-min solvers on a GPU that are designed from the perspective of the streams and the thread blocks, respectively. Finally, we design a concurrent L1-min solver on multiple GPUs. In this solver, each GPU only solves a L1-min problem every time instead of solving multiple L1-min problems by utilizing multiple streams or thread blocks. This solver is applied to this case where the number of L1-min problems included in the concurrent L1-min problem is much less than the number of streams (or thread blocks). Experimental results show that our proposed GEMV and GEMV-T kernels are more robust than those that are suggested by Gao et al. and CUBLAS, and the proposed concurrent L1-min solvers on GPUs are effective. The main contributions are summarized as follows: Two novel adaptive optimization GPU-accelerated implementations of the matrix-vector multiplication are proposed. Two optimization concurrent L1-min solvers on a GPU are presented from the perspective of the streams and thread blocks, respectively.
Utilizing new features of GPU and the technique of merging kernels, an optimization concurrent L1min solver on multiple GPUs is proposed.
The remainder of this paper is organized as follows. In Section 2, we describe the fast iterative shrinkage-thresholding algorithm. Two adaptive optimization implementations of the matrix-vector multiplication on the GPU and the vector-operation and inner-product decision trees are described in Section 3. Sections 4 and 5 give two concurrent L1-min solvers on a GPU and a concurrent L1-min solver on multiple GPUs, respectively. Experimental results are presented in Section 6. Section 7 contains our conclusions and points to our future research directions.

Fast Iterative Shrinkage-Thresholding Algorithm
The L1-min problem is known as the basis pursuit (BP) problem [27]. In practice, a measurement data b often contains noise (such as the measurement error:e), which is called the BPDN problem. A variant of this problem is also well known as the unconstrained BPDN problem with a scalar weight or the Lasso problem [28] in the statistics perspective: The fast iterative shrinkage-thresholding algorithm (FISTA) is a kind of accelerations, and achieves an accelerated non-asymptotic convergence rate of O(k 2 ) by combining Nesterov's optimal gradient method [22]. For FISTA, it adds a new sequence y k ; k ¼ 1; 2; Á Á Á f gas follows.
where softðu; aÞ ¼ signðuÞ max juj À a; 0 f g is the soft-thresholding operator, y 1 ¼ x 0 , t 1 ¼ 1 and the associated Lipschitz constant L f of rf ðÁÞ is given by the spectral norm of A T A, denoted by jjA T Ajj 2 . For large-scale problems, L f is not always easily computable. Thus, a backtracking stepsize rule is suggested to alleviate this drawback. Algorithm 1 summarizes the generic FISTA algorithm with a backtracking stepsize rule [22].

GPU Kernels
For FISTA, its main components include Ax (GEMV) and A T x (GEMV-T), vector operations, and inner product of vector. In the following subsection, we present their implementations on the GPU, respectively. Tab. 1 lists the symbols that are used in this paper. N sm , N reg , N mem , N tb , and N td are constants for a specific GPU. A row major and 0-based indexing array a is used to store the matrix A and the float precision values are only used for all computations in this paper.

GEMV Kernel
Given that the GEMV, Ax, is composed of dot products of x and A i (the ith row of A), i ¼ 1; 2; Á Á Á ; m, and these dot products can be independently computed, we assign one warp or multiple warps to a product dot for our proposed GEMV kernel. To optimize the GEMV kernel performance, for a given nt, we propose the following self-adaptive warp allocation strategy to select the number of warps k for a product dot: k nt=32 and k ¼ 2 z ; z 2 0; 1; 2; Á Á Á f g : Eq. (4) denotes the objective of maximizing the number of warps. Eq. (5) guarantees that each warp group (k warps are grouped into a warp group) calculates at least a product dot. Eq. (6) guarantees that k must be a power of two and the warp-group size is at most the number of threads per block. N mb in Eq. (4) is calculated as follows: ; N td nt; N tb Þ; where N reg b and N mem b denote the number of registers and the amount of shared memory required by the threads per block for our proposed GEMV kernel, respectively. Here N mb is the minimum number of blocks per grid that maximize the resource utilization per multiprocessor. In this paper, we take N sm Â N mb as nb. 1.

16.
Ay b x y f y x y The GEMV kernel is mainly composed of the following three steps: 1.
x-load step: The step is used to make threads per block parallel read x into the shared memory xS.
Because the size of x is large, x is segmentally read into the shared memory. To take full advantage of the amount of shared memory per multiprocessor, the size of shared memory per block is set as follows: where N mb is calculated by Eq. (7). The time of loading x is xTimes ¼ n=SIZE MEM. By this way, the accesses to x are coalesced, and the access number is reduced by letting threads in the same thread block to share the section of elements of x.

Partial-Reduction
Step: Each time after a section of elements of x is read into the shared memory, the threads in each warp group perform in parallel a partial-style reduction. Obviously, each thread in a warp group at most performs n=SIZE WG times of reductions and the accesses to the global memory a are coalesced. Here SIZE WG is the number of threads in the warp group, and is equal to k × 32.

Warp-Reduction
Step: After the threads in each warp group have completed the partial-style reductions, the fast shuffle instructions are utilized to perform a warp-style reduction for each warp in these warp groups. The warp-style reduction values are stored in the shared memory. Then the warp-style reduction values in the shared memory for each warp group are reduced to an output value in parallel.

GEMV-T Kernel
The GEMV-T, A T x, is composed of dot products of x and (A T ) i (the ith column of A), i ¼ 1; 2; Á Á Á ; n, and these dot products can be independently computed. Given that the size of the vector x in the GEMV-T is small, we assign one thread or multiple threads to a dot product in our proposed GEMV-T kernel. To optimize the GEMV-T kernel performance, for a given nt, we propose the following self-adaptive thread allocation strategy to select the number of threads k for a product dot: n w; k 32 and k ¼ 2 z ; z 2 0; 1; 2; Á Á Á f g : Eq. (9) denotes the objective of maximizing the number of threads. Eq. (10) guarantees that each thread group (k threads are grouped into a thread group) calculates at least a product dot. Eq. (11) guarantees that k must be a power of two and the thread-group size is at most the size of a warp. N mb in Eq. (9) is calculated by the same equation as Eq. (7) except that N reg b and N mem b denote the number of registers and the amount of shared memory required by the threads per block for our proposed GEMV-T kernel. To maximize the resource utilization per multiprocessor, we take N sm Â N mb as nb.
Similar to the GEMV kernel, our proposed GEMV-T kernel is also composed of the x-load step, partialreduction step and warp-reduction step.
1. x-load step: Like the GEMV kernel, this step is used to make threads per block parallel read elements of x into the shared memory xS. Given that the size of x is small, x is once read into the shared memory in the GEMV-T kernel.
2. partial-reduction step: Since a row major and 0-based index format is used to store the matrix A, the accesses to A will not be coalesced if the thread groups are constructed in an inappropriate way. For example, we assume that A is a 4 Â 8 matrix as shown in Fig. 1, 16 threads in a thread block are launched, and 2 threads are assigned to a dot product in the GEMV-T kernel. If we use the following thread groups 0; 1 f g; 2; 3 f g; 4; 5 f g; Á Á Á ; 14; 15 f g, the accesses to A will not be coalesced. However, when the thread groups 0; 8 f g; 1; 9 f g; 2; 10 f g; Á Á Á ; 7; 15 f g are utilized, the accesses to A are coalesced, as shown in Fig. 1b. Therefore, in the partial-reduction step, the thread groups are created according to Definition 3.1 below in order to ensure that the accesses to A are coalesced.
Definition 3.1: Assume that the size of the thread block is s, h threads are assigned to a dot product in A T x, and z ¼ s=h. The thread groups are created as follows 0; z; For these elements of x that are read into the shared memory, the threads in each thread group perform in parallel a partial-style reduction similar to that in the GEMV kernel.
3. warp-reduction step: These threads in a thread group are usually not in the same warp, so we cannot use the shuffle instruction to reduce their partial-style reduction values. Therefore, in the warpreduction stage, we store the partial-style reduction values that are obtained by threads in each thread group to the shared memory, and then reduce them in the shared memory to an output value in parallel.

Optimization
Assume that s ¼ N sm Â N mb Â nt=32, where N mb is calculated by Eq. (7), and m ¼ l þ s þ Ds, where l = 0, 1, 2, Á Á Á, and Ds is a smaller positive integer than s. We observe that for a matrix with m ¼ 2 Â 960 þ 30 and n = 102,400 on the GTX 1070 GPU, nt = 1,024, the GEMV kernel with one warp per row is utilized according to Eq. (4), and thus achieves 103.18 GFLOPS. Let us divide this matrix into two blocks B m ¼ 2 Â 960; n ¼ 102; 400 ð Þ and C m ¼ 30; n ¼ 102; 400 ð Þ . If B and C are calculated by one warp per row and 32 warps per row, respectively, we will obtain 107.25 GFLOPS. In this way, the performance of the GEMV kernel increases 4.07 GFLOPS. Why? Obviously, for the GEMV kernel, if Ds ¼ 0 and l > 0, each warp will calculate the same number of rows and thus it obtains good performance. However, when Ds 6 ¼ 0 and l > 0, the performance of the GEMV kernel decreases because many warps are idle when computing the remaining Ds rows. Based on the above observations, we optimize the GEMV kernel as follows.
When Ds ¼ 0; l > 0 or Ds 6 ¼ 0; l ¼ 0, the GEMV kernel shown in Section 3.1 is applied. Otherwise, on the basis of the GEMV kernel, we construct a new kernel GEMV Kernel-I to calculate the GEMV on the GPU. In this kernel, each row of the first l Â s rows is assigned to one warp, and each row of the last Ds rows is calculated by self-adaptive multiple warps.
Similar to the GEMV kernel, for the GEMV-T kernel, we assume that s ¼ N sm Â N mb Â nt and m ¼ l þ s þ Ds, where l = 0, 1, 2, Á Á Á, and Ds is a smaller positive integer than s, and then optimize it as follows.
When Ds ¼ 0; l > 0 or Ds 6 ¼ 0; l ¼ 0, the GEMV kernel shown in Section 3.2 is applied. Otherwise, on the basis of the GEMV-T kernel, we construct a new kernel GEMV-T Kernel-I to calculate the GEMV on the GPU. In this kernel, each row of the first l Â s rows is assigned to one thread, and each row of the last Ds rows is calculated by self-adaptive multiple threads.

Concurrent L1-min Solvers on a GPU
In this section, based on FISTA, we present two concurrent L1-min solvers on a GPU, which are designed from the perspective of the streams and the thread blocks, respectively.

Streams
Utilizing the multi-steam features of GPU, on the basis of FISTA, we design a concurrent L1-min solver, called CFISTASOL-SM, to solve the concurrent L1-min problem. Given that these L1-problems that are included in the concurrent L1-min problem can be independently computed, each one of them is assigned to a stream in the proposed CFISTASOL-SM. Fig. 2 shows the parallel framework of CFISTASOL-SM, which defines the following contents: 1) illustrating the tasks of the CPU and the stream, 2) listing the execution steps of CFISTASOL-SM on each stream, and 3) designating which operations should be grouped into a single kernel.
For temp ¼ Ay k À b; rf ðy k Þ ¼ A T temp, and z k ¼ Ax kþ1 in Fig. 2, they are easy to be implemented on the basis of our proposed GEMV and GEMV-T implementation methods on the GPU. The optimal vectoroperation and inner-product kernels and their corresponding CUDA parameters are chosen by using the vector-operation and inner-product decision trees.

Thread Blocks
For a specific GPU, the maximum thread blocks can be calculated as TBs ¼ N sm Â N tb : Utilizing the features of multiple thread blocks, based on FISTA, we present a concurrent L1-min solver, called CFISTASOL-TB, to solve the concurrent L1-min problem. In CFISTASOL-TB, a L1-min problem is assigned to one thread block. For each thread block, the idea of constructing the parallel FISTA to solve the L1-min problem is similar to that of implementing FISTA on the stream in Section 4.1 except that here CFISTASOL-TB is implemented in a kernel.

Optimization
When the ith element of x is equal to zero, all elements in the ith column of A do not need to be accessed because they do not have any contribution to the output vector. With the increasing iteration in FISTA, x becomes sparser and sparser, and thus many columns in A are not accessed. By this way, we can improve the CFISTASOL-SM and CFISTASOL-TB performance by reducing accesses to the global memory a. Furthermore, for CFISTASOL-TB and CFISTASOL-SM, each thread block needs to access the global memory a, so we let a be cached in the read-only data cache in order to reduce the number of accesses to a. With the read-only data cache, a is shared by all thread blocks and can be accessed fast.

Concurrent L1-min Solver on Multiple GPUs
For the concurrent L1-min problem, if it includes a great number of L1-min problems, we can easily construct a solver on multiple GPUs to solve it by letting each GPU execute CFISTASOL-SM or CFISTASOL-TB. However, here we design a concurrent L1-min solver on multiple GPUs, called CFISTASOL-MGPU, where each GPU only solves a L1-min problem every time instead of solving multiple L1-min problems by utilizing the streams and the thread blocks. This solver is applied to this case where the number of L1-min problems that are included in the concurrent L1-min problem is much  Fig. 3 shows the parallel framework of CFISTASOL-MGPU, which defines the following contents: 1) illustrating the CPU/GPU tasks, 2) listing the execution steps of CFISTASOL-MGPU on each GPU, and 3) designating which operations should be grouped into a single kernel. The operations in Fig. 3 are easily executed on each GPU using the kernels in Section 3.

Performance Evaluation and Analysis
In this section, we first investigate the effectiveness of our proposed GEMV and GEMV-T kernels by comparing them with GEMV and GEMV-T implementations in the CUBLAS library [26] and those that are presented by Gao et al. [17]. Second, we test the performance of our proposed concurrent L1-min solvers on a GPU, CFISTASOL-SM and CFISTASOL-TB. Finally, we test the performance of our proposed concurrent L1-solver on multiple GPUs, CFISTASOL-MGPU. In the experiments, the number of threads per block is set to 1024 for all algorithms.
Tab. 2 shows NVIDIA GPUs that are used in the performance evaluations. Our source codes are compiled and executed using the CUDA toolkit 10.1. The measured GPU performance for all experiments does not include the data transfer (from the GPU to the CPU or from the CPU to the GPU). The test matrices, which come from the publication [17], are shown in Tab. 3. The elemental values of each test matrix are randomly generated according to the normal distribution. The performance is measured in terms of GFLOPS, which is obtained by 2 Â m Â n the matrix-vector multiplication kernel execution time (the time unit is second) [30].

Performance Evaluation and Analysis of Matrix-Vector Multiplications
First, we compare GEMV and GEMV-T kernels with the implementations in the CUBLAS library [26] and those that are presented by Gao et al. [17]. The test matrices are shown in Tab. 3. Figs. 4 and 5 show the performance comparison of the GEMV kernel with CUBLAS and that of Gao et al., respectively. From Fig. 4, we observe that our proposed GEMV kernel on K40c and GTX1070 is always advantageous over CUBLAS and that of Gao et al. for all test matrices. On K40c and GTX1070, the average performance ratios of the proposed GEMV kernel versus CUBLAS are 3:15Â and 1.12Â, respectively, and those of the proposed GEMV kernel versus Gao's GEMV kernel are 3.18Â and 1.19Â, respectively. For the proposed GEMV-T kernel, it outperforms CUBLAS and Gao's GEMV-T kernel for all test matrices on all two GPUs, as shown in Fig. 5. On K40c and GTX1070, the average performance improvement is respectively 1.40 Â and 1.05 Â compared to CUBLAS, and is respectively 1.09Â and 1.02Â compared to Gao's GEMV-T kernel. These observations verify the effectiveness of the proposed GEMV and GEMV-T kernels.  Second, we take GTX1070 to investigate whether can the proposed GEMV and GEMV-T kernels alleviate the CUBLAS fluctuations? The test matrix sets are as follows: 1) Set 1: n = 100,000 and m = 50, 100, 150, Á Á Á , 5,000; 2) Set 2: m = 1,000 and n = 5,000, 10,000, 15,000, Á Á Á , 500,000.
Figs. 6 and 7 show the performance curves of GEMV and GEMV-T kernels for the matrices in the two sets, respectively. From Fig. 6, we observe that whether m increases when n is fixed to 100,000 or n increases when m is fixed to 1,000, the CUBLAS performance fluctuates, and the difference between the maximum and minimum performance values is significant. However, for our proposed GEMV kernel, it is advantageous over CUBLAS, and its performance almost remains invariable, and always preserves around 107 GFLOPS for all cases. Furthermore, for the test matrices in Set 1, the performance of our proposed GEMV-T kernel has been maintained at around 107 GFLOPS as m increases, as shown in Fig. 7(a). However, the CUBLAS performance fluctuates as m increases. For the test matrices in Set 2, when n increases, our proposed GEMV-T kernel has high performance as CUBLAS, and always achieves around 107 GFLOPS. Based on the above observations, we conclude that our proposed GEMV and GEMV-T kernels enhance those that are suggested by Gao et al., and achieve high performance, and are able to alleviate the performance fluctuations of CUBLAS.

Performance of Concurrent L1-min Solvers on a GPU
In this section, we test the performance of our proposed CFISTASOL-TB and CFISTASOL-SM. Given that these L1-min problems included in the concurrent L1-min problem can be independently computed, as a comparison, we use the FISTA implementation on the CPU using the BLAS library (denoted by BLAS), the FISTA implementation using the CUBLAS library (denoted by CUBLAS), and the FISTA solver (denoted by GAO) that is proposed in [17] to calculate them. 12 test cases are applied. For each test case, 60 L1-min problems are concurrently calculated, and the matrix A comes from Tab. 3. For each L1-min problem, the initial x 0 with 1024 non-zero elements is randomly generated according to the normal distribution, and b ¼ Ax 0 . All algorithms stop after the number of iterations is more than 100 for all test cases. Tabs. 4 and 5 show the execution time of all algorithms on a K40c and a GTX1070, respectively. The time unit is second (denoted bys). In Tabs. 4 and 5, our proposed CFISTASOL-TB and CFISTASOL-SM are abbreviated as TB and SM, respectively.  . On a K40c, the execution time ratios of CUBLAS to TB range from 4.18 to 33.28, and the average execution time ratio is 12.22; The execution time ratios of GAO to TB range from 3.92 to 6.52, and the average execution time ratio is 4.80; The minimum and maximum execution time ratios of CUBLAS to SM are 4.18 and 22.18, respectively, and the average execution time ratio is 11.04; The minimum and maximum execution time ratios of GAO to SM are 3.92 and 5.79, respectively, and the average execution time ratio is 4.54. Furthermore, we observe that TB is slightly better than SM in most cases. On a GTX1070, we obtain the same conclusion as on a K40c from Tab. 5. The average execution time ratios of CUBLAS versus TB and CUBLAS versus SM are 18.00 and 15.76, respectively, and the average execution time ratios of GAO versus TB and GAO versus SM are 7.24 and 6.38, respectively. Especially, compared to the solver on the CPU, BLAS, TB respectively obtains average speedups of 62.78Â and 126.16Â, and SM respectively obtains average speedups of 59.65Â and 110.99Â, on a K40c and a GTX1070 (Fig. 8). These observations verify that our proposed TB and SM have high performance and parallelism.

Performance of the Concurrent L1-min Solver on Multiple GPUs
We take two GPUs and four GPUs for example to test the performance of our proposed CFISTASOL-MGPU. The test setting is as same as in Section 6.2. Fig. 9 shows speedups of CFISTASOL-MGPU versus BLAS on the K40c and GTX1070 GPUs.
From Fig. 9 6, respectively, and the average speedup is 108.4. All observations show that CFISTASOL-MGPU is effective for solving the concurrent L1-min problem, and has high parallelism.

Discussion
From the experimental results, we can observe that CFISTASOL-TB is slightly better than CFISTASOL-SM, and CFISTASOL-SM is advantageous over CFISTASOL-MGPU. In fact, each one of these solvers has its own advantage. Tab. 6 lists the experimental results with different number of L1-min problems that are included in the concurrent L1-min problem on GTX1070. In this experiment, Mat12 is set as the coefficient matrix in the test concurrent L1-problem, and the test setting is as same as in Section 6.2. For the GTX1070, the maximum thread blocks that it launches is 30 when nt = 1024, the number of streams is 15. When the number of L1-min problems is enough to make all thread blocks busy, CFISTASOL-TB is better than other two algorithms (see the first problem in Tab. 6). Otherwise, CFISTASOL-SM will outperform other two algorithms if the number of L1-min problems is enough to make all streams busy (see the second problem in Tab. 6). When the number of L1-min problems is much more than the number of thread blocks and the number of streams, CFISTASOL-MGPU is the best of all three algorithms (see the third problem in Tab. 6). Therefore, we can get better algorithms by utilizing their own advantages to combine them. For example, on the four GTX1070 GPUs, assume that the number of L1-min problems that are included in the concurrent L1-min problem is 128, the first 120 problems are calculated in parallel by letting each GPU execute CFISTASOL-TB, and the remaining 8 problems are computed in parallel by executing CFISTASOL-MGPU.

Conclusion
We investigate how to solve the concurrent L1-min problem in this paper, and present two concurrent L1-min solvers on a GPU and a concurrent L1-min solver on multiple GPUs. Experimental results show that our proposed concurrent L1-min solvers are effective, and have high parallelism.
Next, we will further do research in this field, and apply the proposed algorithms to more practical problems to improve them. Conflicts of Interest: We declare that there are no conflicts of interest to report regarding the present study.