An Optimized SW/HW AVMF Design Based on High-Level Synthesis Flow for Color Images

Abstract: In this paper, a software/hardware High-Level Synthesis (HLS) design is proposed to compute the Adaptive Vector Median Filter (AVMF) in real time. This filter is known for its excellent impulsive noise suppression and chromaticity preservation, but the software (SW) study of the filter demonstrates that its implementation is computationally complex. The purpose of this work is threefold: to study the impact of using an HLS tool to design an ideal floating-point hardware (HW) architecture based on the square root function (ideal HW) and an optimized fixed-point architecture based on a ROM memory (optimized HW); to select the best HLS architectures; and to design an efficient HLS software/hardware (SW/HW) embedded AVMF design that achieves a trade-off between processing time, power consumption and hardware cost. For that purpose, approximations using a ROM memory were proposed to perform the square root and develop a fixed-point AVMF algorithm. The best solution generated for each HLS design was then integrated in the SW/HW environment and evaluated on the ZC702 FPGA platform. The experimental results showed a reduction of about 65% in power consumption and 98% in processing time for the ideal SW/HW implementation relative to the ideal SW implementation, with the same image quality. Moreover, the power consumption and processing time of the optimized SW/HW implementation are 70% and 97% lower than those of the optimized SW implementation, respectively. In addition, the Look-Up Table (LUT) percentage, power consumption and processing time of the optimized SW/HW design are improved by nearly 45%, 18% and 61%, respectively, compared with the ideal SW/HW design, with a slight decrease in image quality.

determine the corruption produced by this noise and ameliorate the quality of the image before further processing.
Image filtering is the most important stage in the image processing operation [1,2]. It helps to suppress the noise and to restore and ameliorate the image quality. Image filtering is used in a vast array of applications such as satellite images where the noise can affect the image quality through the capture and transmission processes [3]. Thus, it is vital to eliminate the noise from satellite imagery because it is used in several vital fields such as security, water bodies, changing lands and planet health [4]. Besides, in the medical profession, images can contain "salt and pepper" noise, which affects image quality, especially in ultrasound imaging and Magnetic Resonance Imaging (MRI) [5]. It is, hence, critical to eliminate noise from medical images as crucial information may be affected.
The earliest filters were based on linear approaches, which cannot account for the nonlinearity of human vision [6] and are not suited to the nonlinearities of transmission channels. Nonlinear filters are therefore more appropriate for digital color images. Many nonlinear filters have been proposed in the literature, such as the Vector Median Filter (VMF) [7], the Adaptive Vector Median Filter (AVMF) [8], the Vector Median Rational Hybrid Filter (VMRHF) [9], etc.
Many researchers have noted the long processing time and high complexity of nonlinear filters [10,11]. To reduce this complexity, researchers have adopted hardware acceleration as a solution. Trivedi et al. [12] propose a hardware implementation of median filtering on a Field-Programmable Gate Array (FPGA) that consumes less power and less hardware area. Hu et al. [13] propose two hardware architectures to implement median filtering, covering standard and multi-level median filters. In [14], an optimized hardware architecture based on a systolic array is developed to implement median filtering. This architecture uses a pipeline structure that requires seven clock cycles to determine the median value. Lee et al. [15] detail a 3 × 3 window median filter based on a bit-serial sorting algorithm, which offers high operating speed and low hardware complexity. In [16], a hardware implementation of the VMRHF for color images is described. This hardware architecture uses some approximations to reduce the implementation complexity of the rational function. Boudabous et al. [10] suggest an efficient fast parallel architecture to implement the VMF. This architecture uses an approximation to implement the L2 norm of the VMF filter.
However, these hardware architectures lack design flexibility and require long development times. Indeed, they are developed and implemented by Low-Level Synthesis (LLS) using a hardware description language (HDL) on an FPGA circuit. With LLS design, the Register Transfer Level (RTL) description can be tuned to give a highly optimized netlist. However, producing such an RTL description requires considerable effort and time to describe the operation of each low-level circuit, especially for complex applications [17,18]. Moreover, designing a complex system is only possible for hardware designers with specific knowledge and skills. Therefore, there is a real need to raise the design abstraction level from LLS to High-Level Synthesis (HLS) [19,20] in order to reduce the FPGA design complexity. HLS allows designers to express algorithms in a high-level software language (SystemC, C/C++, etc.) and synthesize them with an HLS tool into a behavioral and structural RTL hardware description. In this context, several academic and commercial HLS tools have been developed, such as Xilinx Vivado HLS, Intel OpenCL [21], Catapult-C [22], and ROCCC [23]. With these tools, designers without hardware design expertise can automatically generate a complex hardware design from a high-level language. This lets them explore and simulate a large design space in a short time, evaluate design performance (power consumption, processing time and hardware cost) and eliminate the source of many design errors. Unfortunately, to obtain an optimized, high-performance RTL circuit with HLS tools, the code must be restructured in a specific style. Without such restructuring, the HLS tools can still generate an RTL circuit, but with poor performance [24].
Given this context, our goal in this work is to use the HLS flow to design various hardware architectures for the AVMF filter and integrate these architectures as intellectual property (IP) blocks with the hard-core ARM processor on a Xilinx Zynq FPGA in order to design an efficient software/hardware (SW/HW) embedded system. The SW/HW design should reduce the filter's complexity and power consumption as well as speed up the execution time. The HW part is used for performance (processing speed and power consumption), whereas the SW part is used for design flexibility [25,26].
The remainder of the paper is organized as follows. Section 2 presents an overview of the AVMF filter. The Vivado HLS tool and its directives are described in Section 3. The proposed HLS AVMF designs are described in Section 4. Section 5 discusses the experimental results in terms of hardware cost, power consumption and processing time of the SW/HW AVMF implementation on the ZC702 platform. Finally, the conclusion is given in Section 6.

Overview of the AVMF Filter
In [8], the author presents an Adaptive Vector Median Filter (AVMF) based on the VMF filter. It is enhanced with a threshold that detects whether the central pixel is likely to be noisy, as shown in Fig. 1. Let $V = \{x_i \in Z^l;\ i = 1, 2, \ldots, N\}$ denote a filtering window of size $N$ containing the noisy pixels $x_1, x_2, \ldots, x_N$. The position of the filtering window is determined by its central pixel $x_{(N+1)/2}$. Each multichannel pixel $x_i$ is associated with a distance measure $d_i$, calculated by Eq. (1):

$$d_i = \sum_{j=1}^{N} \lVert x_i - x_j \rVert_2 \qquad (1)$$

where $\lVert x_i - x_j \rVert_2$ is the Euclidean distance between the two multichannel pixels $x_i$ and $x_j$.
The output $y_{AVMF}$ of the AVMF is expressed in (2):

$$y_{AVMF} = \begin{cases} x_{(1)} & \text{if } d_{(N+1)/2} > \xi_{(N+1)/2} \\ x_{(N+1)/2} & \text{otherwise} \end{cases} \qquad (2)$$

where the vector $x_{(1)}$ represents the VMF output. It corresponds to the minimum vector distance $d_{(1)} \in \{d_1, d_2, \ldots, d_N\}$ inside the filtering window, expressed by (3):

$$d_{(1)} = \min_{i} d_i = \sum_{j=1}^{N} \lVert x_{(1)} - x_j \rVert_2 \qquad (3)$$

The distance measure $d_{(N+1)/2}$ is that of the central pixel $x_{(N+1)/2}$, and $\xi_{(N+1)/2}$ is the threshold value given in (4):

$$\xi_{(N+1)/2} = d_{(1)} + \lambda_{AVMF}\, \Psi_{AVMF} \qquad (4)$$

where $\lambda_{AVMF}$ adjusts the smoothing properties of the method and $\Psi_{AVMF}$ is the estimated variance defined in (5):

$$\Psi_{AVMF} = \frac{d_{(1)}}{N - 1} \qquad (5)$$

The approximation in (5) gives the mean distance between the vector median and the other pixels held in $V$: $d_{(1)}$ is divided by $(N - 1)$, the number of distances from $x_{(1)}$ to all other pixels in $V$. As Fig. 1 shows, if the distance $d_{(N+1)/2}$ is greater than the threshold $\xi_{(N+1)/2}$, then $x_{(N+1)/2}$ is considered noisy and is replaced by the vector $x_{(1)}$. If $d_{(N+1)/2}$ is less than or equal to $\xi_{(N+1)/2}$, then $x_{(N+1)/2}$ remains unchanged.
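The decision rule of this section can be illustrated for a single filtering window with a short software sketch. It is only a reference model under stated assumptions, not the C code used in this work: the function names, the row-major list representation of the window and the value of the smoothing parameter passed as `lam` are all hypothetical.

```python
import math

def l2(a, b):
    """Euclidean distance between two multichannel pixels (tuples)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def avmf_window(window, lam):
    """Filter one window of N pixels (N odd); returns the output pixel.

    `window` is a list of N pixel tuples and `lam` plays the role of the
    smoothing parameter lambda; its value here is an assumption.
    """
    n = len(window)
    # Aggregated distance d_i of every pixel to all pixels of the window.
    d = [sum(l2(xi, xj) for xj in window) for xi in window]
    k = min(range(n), key=d.__getitem__)   # index of the vector median x_(1)
    d1 = d[k]                              # minimum aggregated distance d_(1)
    psi = d1 / (n - 1)                     # estimated variance
    threshold = d1 + lam * psi             # threshold for the central pixel
    c = n // 2                             # 0-based index of the central pixel
    # Replace the central pixel only if it is judged noisy.
    return window[k] if d[c] > threshold else window[c]
```

For example, a 3 × 3 window of eight identical pixels around a strongly deviating central pixel returns the neighbouring value, while a perfectly uniform window returns its central pixel unchanged.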

Xilinx Vivado HLS Tool
The purpose of the HLS methodology is to simplify and accelerate hardware implementation, especially for FPGA circuits. To that end, Xilinx developed an HLS tool, Vivado HLS, to help engineers rapidly implement algorithms on the FPGA with gains in resources, power, and performance. It provides a design environment to interpret, analyze, optimize, and transform a software description in C/C++ into an RTL design, which is then synthesized and implemented for a Xilinx FPGA. With Vivado HLS, different optimizations can be applied to increase hardware design performance through directives such as PIPELINE, UNROLL, RESOURCE, etc. Some optimizations decrease the hardware area. The ALLOCATION directive can minimize the number of resources used in the design by sharing resources between several functions. The RESOURCE directive can be used to map vectors and arrays to specific memory blocks (BRAMs). To raise the data rate and achieve a higher throughput, the UNROLL or PIPELINE directives can be used. Unrolling a loop builds several hardware blocks that process the loop iterations in parallel. The PIPELINE directive, in turn, lets a loop iteration begin before its predecessor completes, provided that the data dependencies are satisfied. Furthermore, the ARRAY PARTITION directive divides a large memory into individual registers or multiple smaller memory blocks for parallel data access. However, these throughput optimizations increase FPGA resource usage, so the level of pipelining or parallelism should be tuned to the design.
With the Vivado HLS tool, several steps are followed to generate an RTL description. In Step 1, the C/C++ code is written in a specific style that allows the HLS tool to create an optimized RTL description. In Step 2, the source code is explored to extract the control path and dataflow. In Step 3, specific directives are applied to each algorithm for better hardware optimization. In the last step, the Export RTL tool exports the created RTL design as an IP module to the Xilinx Vivado tool in order to generate the bitstream file.
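The benefit of letting a loop iteration start before its predecessor completes can be estimated with a simple cycle count. The sketch below is an idealized textbook model, not output of the Vivado HLS tool: the function names, the iteration latency `depth` and the initiation interval `ii` are illustrative assumptions, and real synthesis reports also include loop entry and exit overhead.

```python
def sequential_cycles(iterations, depth):
    """Cycles when each iteration must finish before the next one starts."""
    return iterations * depth

def pipelined_cycles(iterations, depth, ii=1):
    """Cycles when a new iteration is launched every `ii` cycles.

    The first result appears after `depth` cycles; each further iteration
    adds `ii` cycles, assuming no stall from data dependencies.
    """
    if iterations == 0:
        return 0
    return depth + ii * (iterations - 1)
```

With a hypothetical 6-cycle loop body and 81 iterations, the model gives 486 cycles without pipelining and 86 cycles with an initiation interval of 1, which shows why pipelining dominates the latency gains while costing additional resources.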

HLS Architecture of the AVMF Filter
The developed AVMF C code is given as input to the Vivado HLS tool (version 2018.1) in order to generate a hardware architecture for the AVMF algorithm. The generated architecture is illustrated in Fig. 2. This architecture is optimized to reconstruct the filtered color image in a minimum number of clock cycles. To optimize the loading of pixels, three lines of the image are sent in parallel to the AVMF coprocessor. In order to form a (3 × 3) filtering window, three pixels from each line are selected and stored in the register bank. Each pixel is composed of three colors (R, G, B). For each (3 × 3) filtering window, 81 Elementary Distances (EDs) $d_{ij}(x_i, x_j)$ must be calculated, one for each pair of the nine pixels. The ED $d_{ij}(x_i, x_j)$ is given by Eq. (6):

$$d_{ij}(x_i, x_j) = \sqrt{(R_i - R_j)^2 + (G_i - G_j)^2 + (B_i - B_j)^2} \qquad (6)$$

As Fig. 2 shows, the EDs are implemented according to Eq. (6) and computed using 81 loop iterations. Loop 1 runs nine times to accumulate nine EDs, and loop 2 also runs nine times to calculate the nine Euclidean distances d i . When the nine Euclidean distances are ready, the comparator determines the minimum among them. Once this minimum is found, the filtered pixel is supplied and the filtering window for the next pixel is started. Finally, to optimize memory access, the three colors (R, G, B) of the filtered pixel, which is selected according to the minimum distance d i in the filtering window, are concatenated into a 24-bit word and stored in the image memory. All these steps are repeated for N × N loop iterations, where N × N is the image size, in order to filter all pixels in the image.
The bottleneck of this architecture is the implementation of the square root (SQRT), which is used to calculate the EDs of the AVMF filter, together with the floating-point values. Thus, the purpose of this work is to generate two HLS architectures for the AVMF filter. The first architecture is based on the SQRT function. The second architecture is based on an approximation of the SQRT function in order to use fixed-point values only and reduce the hardware complexity. Our main goal is to design floating-point and fixed-point architectures using the Vivado HLS tool and compare the power consumption, the processing time and the area cost of the designed architectures.
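The concatenation of the three color components into one 24-bit memory word can be sketched as follows. The byte order (R in the most significant byte) and the function names are assumptions made for illustration; the text only states that the three colors are concatenated into 24 bits.

```python
def pack_rgb(r, g, b):
    """Concatenate three 8-bit channels into one 24-bit word (R assumed in the top byte)."""
    for c in (r, g, b):
        if not 0 <= c <= 255:
            raise ValueError("channel value outside the 8-bit range")
    return (r << 16) | (g << 8) | b

def unpack_rgb(word):
    """Split a 24-bit word back into its (R, G, B) channels."""
    return (word >> 16) & 0xFF, (word >> 8) & 0xFF, word & 0xFF
```

Storing one packed word per pixel means a single memory access per filtered pixel instead of three, and the software side can recover the channels with the inverse operation.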

HLS Floating-Point AVMF Design
Several designs are generated from the floating-point AVMF C code. These designs are generated by incrementally adding specific directives through the Vivado HLS tool and are synthesized for the Xilinx XC7Z020 FPGA. We then compared their performance in terms of the number of clock cycles and hardware resources (LUTs, FFs, BRAMs and DSPs).
#Design 1: In this first design, the software code is implemented on the Xilinx XC7Z020 FPGA without any optimizations. The synthesis results are given in Tab. 1 for hardware resources and in Fig. 3 for the number of clock cycles. Tab. 1 shows that this design uses 14% of the LUTs, 5% of the FFs, 93% of the BRAMs and 10% of the DSPs, and requires up to 261265418 clock cycles.
#Design 2: In the second experiment, the ARRAY PARTITION directive is applied to the filtering window array in order to partition it into multiple smaller memory modules, which allows parallel data access. The experimental results record an increase in the percentage of LUTs by 16% and a decrease of about 5% in the number of clock cycles relative to #Design 1.
#Design 3: In this design, the PIPELINE directive is applied to the loop iterations with an interval equal to 1 to decrease the latency. This optimization decreases the number of clock cycles by 97% compared to #Design 2, but with an increase of about 79% in the percentage of LUTs and 80% in the number of DSP blocks.
#Design 4: For this last design, the ALLOCATION directive is added to the multiplication operations, which reduces FPGA resource usage by allowing hardware resources to be shared. This optimization reduces the percentage of LUTs and DSPs by 39% and 10%, respectively, compared to #Design 3, but with an increase of about 6% in clock cycles.
From these experimental results, #Design 4 is selected for the HLS AVMF implementation using the SQRT function (ideal HW), since it provides a good compromise between the number of clock cycles and the FPGA area cost. This implementation is done with the Xilinx Vivado HLS tool for the Xilinx XC7Z020 FPGA.

HLS Fixed-Point AVMF Design
The AVMF filter is based on the calculation of the SQRT, which should be approximated to decrease the complexity of the hardware architecture. To simplify the implementation of the AVMF filter, a ROM (Read-Only Memory) is used to store precomputed values of the SQRT [27]. Accordingly, a quantity A is defined in Eq. (7), and SQRT(A) is computed and stored in the ROM memory. To determine the size of the memory and the precision of the fixed-point values of SQRT(A), we conducted a simulation on two standard images (Sailboat and Peppers) contaminated with 3% of impulsive noise. In this simulation, we measured the quality of the filtered image using the Normalized Color Difference (NCD) for several memory sizes, storing 512, 1024 and 2048 fixed-point values of SQRT(A), with a precision from 1 bit to 12 bits. The simulation results are presented in Fig. 4, where we notice that the NCD decreases as the memory size and the number of precision bits increase. As a compromise between memory size and image quality, we chose to store 1024 values in the ROM memory with a precision of 9 bits. With these parameters, the NCD of the approximated AVMF is close to that of the ideal AVMF. To justify this, the relative error between the NCD of the ideal and approximated AVMF is calculated. The relative error is given by Eq. (8).
The proposed AVMF C code is used to generate and implement different designs for the AVMF algorithm based on the ROM memory. This implementation is realized with the Xilinx Vivado HLS tool.
#Design 1: In the first experiment, the fixed-point AVMF C code is synthesized for the XC7Z020 FPGA without any optimizations. The experimental results in terms of FPGA resources and the number of clock cycles are given in Tab. 3 and Fig. 6, respectively. As Fig. 6 shows, this design requires up to 44150282 cycles. Furthermore, the FPGA resources are distributed between 4.8% LUTs, 2% FFs, 95% BRAMs and 4% DSP blocks, as shown in Tab. 3.
#Design 4: In this last design, the ALLOCATION directive is used to reduce the hardware cost of the multiplication operations. This optimization reduces the percentage of FFs by 4% and the number of BRAM blocks by 7% relative to #Design 3, with an increase of about 17% in clock cycles.
From the synthesis results, we can notice that #Design 4 offers the best compromise between FPGA area cost and number of clock cycles. This design is therefore selected for the HLS AVMF implementation using the ROM memory (optimized HW).
From these experimental results, we can conclude that the PIPELINE and ARRAY PARTITION directives mainly decrease the processing time, at the price of an increase in hardware cost. In contrast, the ALLOCATION directive decreases the hardware cost at the price of an increase in processing time. The next section investigates the HLS approach in an SW/HW environment to design and verify a standalone IP (Intellectual Property) of the AVMF filter (ideal HW and optimized HW) on the ZC702 development board [28].
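The ROM-based square root used by the optimized design can be modeled in software as a lookup table. The sketch below is a hedged illustration only: it assumes that A is the sum of squared channel differences of two 8-bit RGB pixels, that the ROM address is obtained by dropping low-order bits of A, and that the 9-bit precision is the width of the stored word. None of these choices is confirmed by the text, and the function names are hypothetical.

```python
import math

MAX_A = 3 * 255 ** 2  # assumed largest value of A for 8-bit R, G, B channels

def build_sqrt_rom(entries=1024, out_bits=9):
    """Precompute a square-root table; `entries` is assumed to be a power of two."""
    addr_bits = entries.bit_length() - 1
    shift = max(0, MAX_A.bit_length() - addr_bits)  # low-order bits of A to drop
    half = (1 << shift) >> 1                        # use the middle of each bucket
    limit = (1 << out_bits) - 1                     # saturate to the word width
    rom = [min(limit, round(math.sqrt((k << shift) + half)))
           for k in range(entries)]
    return rom, shift

def rom_sqrt(a, rom, shift):
    """Approximate sqrt(a) with one table read and no arithmetic."""
    return rom[a >> shift]
```

Under these assumptions a 1024-entry table drops the 8 lowest bits of A, so the approximation is coarsest for very small distances and within about one unit for large ones. This is consistent with the observation that a larger memory and more precision bits lower the NCD.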

SW/HW Performance Validation of AVMF Filter Architecture
The ZC702 is a Zynq 7000 development board. The Zynq 7000 is a Xilinx programmable SoC used for quickly prototyping and evaluating the functionality of a designed system in an SW/HW environment. The Zynq architecture contains two main parts: the Programmable Logic (PL) for hardware implementation and the Processing System (PS). The PS contains a dual-core 32-bit ARM hard processor, 32 KB L1 data and instruction caches per core, a 512 KB L2 cache and 1 GB of DDR3 memory. The PS operates at 667 MHz and supports operating systems or software routines. In the Zynq architecture, the connection between the PL and the PS is realized using the Advanced eXtensible Interface (AXI4) of the Advanced Microcontroller Bus Architecture (AMBA) protocols. Fig. 7 illustrates the designed SW/HW AVMF architecture. This architecture is developed using the Xilinx Vivado 2018.1 tool and evaluated on the ZC702 development board, which is based on the Xilinx XC7Z020 FPGA. In this architecture, the AVMF coprocessor is connected to the SW part (ARM Cortex-A9 processor) through Direct Memory Access (DMA) using an AXI4-Stream interface, which is designed for maximum-bandwidth access to the DDR memory of the PS [26]. This transfer mode supports unlimited data burst sizes and offers point-to-point streaming without addresses. Our SW/HW AVMF architecture uses three DMAs: DMA1 is used in read/write mode, while DMA2 and DMA3 are configured in read mode only.
Initially, the color image (RGB format) is stored in DDR memory. Then, as shown in Fig. 8, when the Start_transfert and TREADY signals are asserted, the PS starts to send the noisy pixels to the AVMF coprocessor. The TREADY signal indicates that the AVMF coprocessor is ready to receive data. The three DMAs (DMA1, DMA2, and DMA3) send three image lines in parallel from DDR memory to the hardware coprocessor. The AVMF coprocessor receives valid data when the TVALID signal is asserted by the AXI-Stream interface and starts to perform the AVMF algorithm as soon as the nine pixels of the first 3 × 3 filtering window are available. The coprocessor then calculates the nine Euclidean distances, determines the filtered RGB pixel that minimizes the distance to all pixels in the filtering window, and stores the concatenated filtered RGB pixel in the internal image memory. Fig. 8 shows that all these steps are pipelined to decrease the processing time. To construct the next filtering window, we keep the last 6 pixels of the previous window and add 3 new pixels. Once the AVMF coprocessor has filtered all pixels in the image, the TREADY and TLAST signals are asserted and the PS starts to receive the filtered RGB pixels through DMA1, unpacks them and stores them in the DDR memory to construct the filtered image (Fig. 7). Our SW/HW design, proposed in Fig. 7, uses the AXI interface and 3 DMAs to increase the throughput. Further, our design supports various image sizes (e.g., 32 × 32, 64 × 64, 128 × 128, 256 × 256). The image size can be increased by increasing the size of the output image memory.
To evaluate the proper functioning of the SW/HW AVMF design for the HLS ideal and optimized HW IP blocks, we followed the design flow presented in Fig. 9. Vivado HLS is used to apply directives and create a stream interface in order to connect the IP blocks with the processor. When the HLS synthesis is completed, a compressed file (.ZIP) including all HDL files is generated and exported to the Xilinx Vivado tool, which is used to implement multiple accelerator blocks connected to the embedded processor across an AXI interface. Then, the Xilinx Vivado tool is used to synthesize and implement the SW/HW design and to generate the bitstream file (.bit) and load it onto the FPGA platform. The SW part runs on the ARM Cortex-A9 processor and is compiled as a standalone application using the Xilinx Software Development Kit (SDK) to generate the executable file (.elf) that is executed by the embedded processor.
Tab. 4 reports the implementation results for the ideal and optimized SW/HW AVMF designs on the XC7Z020 FPGA. This table shows that the optimized SW/HW AVMF design reduces the number of LUTs by nearly 45% and the number of DSPs by 22% compared to the ideal SW/HW design. The next step consists in evaluating the performance in terms of processing time, power consumption and image quality on the ZC702 FPGA board. The processing time is measured by means of the processor timer, while the power consumption is measured with the Texas Instruments Fusion Digital Power Designer software using a Texas Instruments device connected to the ZC702 board through a USB interface adapter. Subjective (visual) assessment is an effective way of judging the efficiency of the filter. Fig. 10 shows that the implemented filter preserves the chromaticity components as well as the fine details of a color image (Monalisa) corrupted by 3% of impulsive noise. No differences are noticeable between the images filtered by the SW/HW solutions and those produced by the SW solutions (ideal and optimized).
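The window update described above, which keeps the last 6 pixels and adds 3 new ones, can be sketched as a generator over three image lines. The column-by-column ordering of the nine pixels and the function name are assumptions for illustration; scalar values stand in for RGB pixels.

```python
def slide_windows(row0, row1, row2):
    """Yield successive 3x3 windows over three image lines.

    Each window is a list of 9 pixels stored column by column, so moving
    one pixel to the right keeps the last 6 entries and appends one new
    column of 3 pixels (one from each line).
    """
    window = []
    for column in zip(row0, row1, row2):
        window = window[-6:] + list(column)  # reuse 6 pixels, load 3 new ones
        if len(window) == 9:
            yield list(window)
```

With this ordering the central pixel of each window sits at index 4, and only three pixels have to be fetched per output pixel, one from each of the three line streams.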

To confirm this, the NCD and the Peak Signal to Noise Ratio (PSNR) are measured for different standard test images (Lena, Flower, Peppers, Sailboat, Mandrill, Monalisa). These color images are of size 256 × 256 and are contaminated by "salt and pepper" impulsive noise with an intensity of 3%. Tab. 5 and Tab. 6 present the PSNR and NCD values of the standard test images for the ideal and optimized SW/HW and SW implementations of the AVMF filter. In addition, Fig. 11 and Tab. 7 illustrate the power consumption and the processing time, respectively, for the SW/HW and SW implementations (ideal and optimized) of the AVMF filter. Fig. 11 and Tab. 7 reflect a reduction of about 65% in power consumption and 98% in processing time for the ideal SW/HW implementation relative to the ideal SW implementation of the AVMF filter, with the same PSNR (Tab. 5) and NCD (Tab. 6) values. Besides, the power consumption and the processing time of the optimized SW/HW implementation are 70% and 97% lower than those of the optimized SW implementation, respectively, for the same image quality. These results demonstrate not only the efficiency of the HLS approach, but also the reliability of the proposed SW/HW AVMF design. Furthermore, the LUT percentage, the power consumption and the processing time required by the optimized SW/HW design are improved by nearly 45%, 18% and 61%, respectively, compared with the ideal SW/HW design for the AVMF filter, with a slight decrease in image quality. The results show that when HLS is applied to software code containing floating-point elementary functions, the processing time and hardware size of the generated hardware increase compared with the fixed-point version. Yet the floating-point architecture yields the same quality of results (QoR) as the ideal SW and shortens design time and time-to-market (TTM), which is not the case for the fixed-point architecture.
With the fixed-point architecture, a long process is needed to convert the floating-point algorithm to fixed point while preserving the QoR.
Compared to other realizations, Tab. 8 shows that our design performs better than the floating-point implementation of the AVMF filter in [8]: the throughput of our design is 6.8 times higher. Moreover, our design performs better than the results in [29,30], which present floating-point implementations of the AVMF and VMF filters on an Intel Core (TM) i7-4790 at 3.2 GHz and a TMS320C6701 DSP at 150 MHz, respectively. Note that our design is implemented on the ZC702 board, which has an ARM Cortex-A9 core running at 667 MHz, and that the IP was implemented with a 100 MHz clock frequency. In addition, the throughput of our design is 4.9 times higher than that of [27] for the fixed-point implementation of the AVMF filter in an SW/HW environment. Note that the design in [27] was developed with the LLS method.
Figure 11: Power consumption measurement for the SW/HW and SW AVMF designs
In light of the above findings, the HLS approach is a good solution to raise the abstraction level from RTL to algorithms and to shorten both the design time and the time to market (TTM). However, with the HLS approach, the reference software must be rewritten in a specific format and the right directives must be selected to attain better performance in terms of FPGA area cost, power consumption and processing time. For example, with Xilinx Vivado HLS, the ALLOCATION directive reduces the FPGA area cost, while the PIPELINE and ARRAY PARTITION directives improve pipelining and parallel processing between loop iterations, which helps to reach higher throughput.
Besides, it becomes possible to design a floating-point HW architecture whose processing time and power consumption are better than those of the SW solution, with the same QoR. Further, the HLS approach can be combined with the SW/HW design methodology to guarantee, on the one hand, faster design and the flexibility to update it and, on the other hand, performance in terms of processing time and power consumption.
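As a reminder of how one of the objective quality metrics in this evaluation is typically computed, the sketch below implements the common PSNR definition for 8-bit data, with the mean squared error averaged over all channel values. This is an assumption: the exact PSNR formula used for Tab. 5 is not given in the text, and the function name and flat-list image representation are illustrative.

```python
import math

def psnr(reference, filtered, peak=255):
    """PSNR in dB between two images given as flat lists of channel values."""
    if len(reference) != len(filtered) or not reference:
        raise ValueError("images must be non-empty and of equal size")
    # Mean squared error over every channel of every pixel.
    mse = sum((a - b) ** 2 for a, b in zip(reference, filtered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(peak ** 2 / mse)
```

Applying the same function to the outputs of the SW and SW/HW implementations against the noise-free original is one way to check that both reach the same image quality.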

Conclusion
In this work, the HLS approach is used to design floating-point and fixed-point hardware architectures for the AVMF filter by applying specific directives (ALLOCATION, PIPELINE and ARRAY PARTITION) to the AVMF C codes. The first design is based on the square root function (ideal HW). The second design is based on a ROM memory (optimized HW). These designs are integrated as coprocessors with the ARM Cortex-A9 processor in the SW/HW environment. The AXI-Stream interface is used to speed up the data transfer between the PL part and the DDR memory. The experimental results on the ZC702 FPGA platform show that the SW/HW AVMF designs give better performance in terms of processing time and power consumption than the SW implementation with the same QoR, and that the optimized design further reduces the hardware cost. These results demonstrate not only the efficiency of the HLS tool, but also the reliability of the proposed SW/HW AVMF design, which can be used for several image sizes.