Design of Precise Multiplier Using Inexact Compressor for Digital Signal Processing

In recent years, approximate computing has been adopted in optimized data path units together with error recovery circuits. This paper proposes novel multipliers built on two new 4:2 approximate compressors that generate an Error-free Sum (ES) and an Error-free Carry (EC), respectively. The Proposed-ES and Proposed-EC 4:2 compressors are used for Partial Product (PP) compression. The Dadda multiplier is chosen for compressor implementation because the regularity of its PP arrangement favors easy standard cell ASIC design. The proposed approximate compressors are used in the n least significant columns, and an accurate compressor is used in the remaining most significant columns. This limits the error in the multiplier output to not more than 2^n for an n × n multiplication. The choice among the proposed compressors is based on the significance of the sum and carry signals on the multiplier result. As an enhancement to the proposed multiplier, we introduce two Area Efficient (AE) variants, viz., Proposed-AE (P-AE) and P-AE with Error Recovery (P-AEER). The designs and their performance are validated using Cadence Encounter with 90 nm ASIC technology. The Proposed-basic, P-AE, and P-AEER designs exhibit 46.7%, 52.9%, and 52.7% PDP reduction, respectively, compared to an approximate multiplier of the minimal error type. The proposed multiplier is implemented in digital image processing, where it achieved a Structural SIMilarity Index (SSIM) of at least 0.9810, and in an ECG signal processing application, where it showed less than 3% deviation.


Introduction
Multiplication and addition are the essential functions carried out by the Arithmetic Unit (AU). AUs form the basic elements of signal, image, and multimedia processing applications. The operating frequency of a processing element is determined by the critical delay of the AU, which in turn depends on the delay of the multiplier. Array multiplication is the standard algorithm for multiplying two input operands and uses a parallel approach for PP generation. The PP bits are shifted based on weight and compressed using a Carry Propagate Adder (CPA).
Scaling at deep submicron technology increases the density of integration in VLSI chips. Hence the power density increases proportionately, and heat dissipation issues arise. Low supply voltage operation solves the power density issues but, on the other hand, increases delay. A number of approaches to the design of multipliers for optimum performance are proposed in the literature. Low-error and area-efficient truncated multipliers are proposed for fixed-width applications [1][2][3][4][5][6]. In [7], a multiplexer-based array multiplier is proposed with a new adaptive PCT (Pseudo-carry Compensation Truncation) scheme. A low-complexity multiplier using a pipelined parallel counter is proposed in [8].
Approximate computing is a recent methodology for the logic design of high-speed, reduced-power, and area-efficient architectures for approximate applications. Approaches to the design of area- and power-efficient approximate adders are proposed in [9][10][11][12]. A memristor-based approximate full adder [13] and a configurable-accuracy error-tolerant adder [14] are proposed for PP compression in multipliers. A reduced-cost imprecise multiplier employing a probability-driven imprecise compressor and an inexact half-adder is proposed in [15]. Approaches to the design of inexact compressors are proposed in [16][17][18][19][20][21][22], and approximate multipliers employing inexact compressors are proposed in [23][24][25], targeting error-tolerant applications. Adaptable multipliers with flexibility in switching between exact and approximate modes are proposed using a dual-mode compressor [26], a variable approximation mode compressor with error recovery, and an ultra-low power compressor [27]. A high-speed multiplier employing an approximate compressor with reduced critical path node capacitance is proposed in [28][29][30][31]. In [32], a probabilistic approach for PP accumulation based on significant bit position is proposed to minimize logic complexity and power dissipation. A low-error high-speed Wallace multiplier with an inexact 4:2 compressor and an error-correction circuit is developed in [33]. For need-based error-tolerant applications, multipliers are being designed with an energy-error trade-off [34,35] and multi-level approximation [36].
In this work, two novel approximate 4:2 compressors are proposed that optimize area, power, and delay, and generate either sum or carry with no error. The proposed compressors are targeted for PP compression in Dadda multipliers with structure-based PP arrangement. In the targeted n × n multipliers, the proposed inexact compressors are used in the n least significant (LS) columns and the exact 4:2 compressor in the remaining most significant (MS) PP columns. The choice among the proposed compressors in the PP columns of the approximate part depends on the significance of the compressor output signals on the multiplier result. Using approximate compressors only in the least significant n columns bounds the maximum error in the output to not more than 2^n, which is considered a tolerable level. With the reduced errors, these multipliers are especially suited to error-tolerant applications such as image and signal processing. Area efficiency and speed improvement in the proposed precise multiplier are achieved by not generating carry signals for the final stage in the approximate part. Furthermore, to reduce the error caused by omitting these carry signals in the area-efficient version of the proposed multiplier, we generate an error recovery compensation bias using the pre-final stage carry signals of the n/4 most significant columns in the approximate part and add it to the least significant carry signal of the exact part in the final stage. The novelty and functionality of the proposed multipliers are verified by implementations in image smoothing and ECG signal processing applications.
In this paper, Section 2 presents the typical exact 4:2 compressor with structural modifications, together with the proposed inexact compressors. Section 3 describes the design and functionality of the proposed multipliers. The performance of the proposed designs is evaluated in Section 4. Section 5 details the deployment of the proposed multipliers in digital image and signal processing applications. Finally, Section 6 concludes the work.
Exact and Other Inexact Compressors
The 4:2 compressor is a natural choice for PP compression in multipliers whose fixed-width data path is a power of 2 (i.e., 2^N where N = 3, 4, 5, 6), which matches the data path widths of processor arithmetic units. The compressors are implemented in a Dadda-type multiplier because the regularity of its PP arrangement favors easy standard cell ASIC design. The following subsections describe the exact 4:2 compressor and the design of the approximate compressors.

Exact Compressor
An exact compressor adds n 1-bit inputs and encodes the result in i output bits, where n < 2^i. A conventional 4:2 compressor has four inputs A1, A2, A3, A4 along with the carry-in C_i from the previous stage, and generates Sum, Carry, and C_o as outputs. Its architecture and Boolean expressions are given in Fig. 1 and Eqs. (1)-(3), respectively. For PP compression in multipliers, the standard compressor is modified by tying C_i to logic low and ignoring the output C_o. The modified 4:2 compressor has the same four inputs and produces Sum (S) and Carry (C) as outputs. Ignoring C_o causes an error only when A1A2A3A4 = "1111", which occurs with probability 0.06 (1 of 16 input combinations), with a Maximal Error Deviance (Max-ED) of −2.
Alternatively, an approximation that sets Sum to logic high for A1A2A3A4 = "1111" reduces the Max-ED to −1. The block-level diagram of the modified exact compressor is shown in Fig. 2, and the related Boolean expressions are given in Eqs. (4), (5).
The computation delays of the AND, OR, XOR, NOT, XNOR, and NAND gates are denoted t_AND, t_OR, t_XOR, t_NOT, t_XNOR, and t_NAND, respectively. Logic depth is calculated by counting the gates in the critical path of the design. The logic depth of the modified exact 4:2 compressor is t_logic-depth = 3 * t_AND + 3 * t_OR.
However, in the multiplier implementation explained in Section 3, the error caused by ignoring C_o is eliminated by adding a compensation bit E = A1 & A2 & A3 & A4.
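The effect of dropping C_o, and of the compensation bit E, can be checked exhaustively over the 16 input combinations. The following is a minimal behavioral sketch in Python, not the gate-level logic of Eqs. (4), (5): the function names are hypothetical, the compressor is modeled only by the value its outputs encode (S + 2C), and E is assumed to carry the same weight as Carry.

```python
from itertools import product

def modified_exact_compressor(a1, a2, a3, a4):
    """Behavioral model (assumption): C_i tied low and C_o dropped, so only S and C remain."""
    ones = a1 + a2 + a3 + a4
    if ones == 4:
        # "1111" would need C_o; without it the outputs encode 2 instead of 4.
        return 0, 1
    return ones & 1, ones >> 1

def compensation_bit(a1, a2, a3, a4):
    """E = A1 & A2 & A3 & A4, assumed to have the same weight as Carry."""
    return a1 & a2 & a3 & a4

def error_distances(with_compensation):
    """Map each input tuple to (encoded value - true number of ones)."""
    errors = {}
    for bits in product((0, 1), repeat=4):
        s, c = modified_exact_compressor(*bits)
        value = s + 2 * c
        if with_compensation:
            value += 2 * compensation_bit(*bits)
        errors[bits] = value - sum(bits)
    return errors

uncompensated = error_distances(False)
compensated = error_distances(True)
print({b: e for b, e in uncompensated.items() if e != 0})  # {(1, 1, 1, 1): -2}
print(all(e == 0 for e in compensated.values()))           # True
```

Under these assumptions the only erroneous input is "1111" (1 of 16 combinations, error −2), and adding E restores the exact value.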

Design of Inexact Compressors
This section describes the design of efficient approximate 4:2 compressors that generate either sum or carry with no error. In proposed design 1, the logic generates no error in sum and three errors in carry; it is referred to hereafter as the "Proposed compressor with Exact Sum" (Proposed-ES). In proposed design 2, the logic generates no error in carry and three errors in sum; it is referred to hereafter as the "Proposed compressor with Exact Carry" (Proposed-EC). As the target of our approximate compressors is the design of high-speed multipliers with reduced error, the proposed compressors are placed in the PP compression logic according to the significance of the compressor sum and carry signals on the multiplier output.

Proposed-ES Design
In the Proposed-ES design shown in Fig. 3, a single-bit FA cell takes three inputs K1, K2, K3 and produces the sum signal (FA_sum). The exact sum and error-prone carry outputs of the compressor are denoted S and C', the prime distinguishing the approximate output from the exact one. The logic that implements the S and C' outputs of the Proposed-ES variant is given by Eqs. (6)-(8).
From the Boolean Eqs. (6), (7), the Proposed-ES compressor introduces no error in S and three errors in C', for A4A3A2A1 = "0000", "0111", and "1000", giving an error proportion of 0.1875 (3 of 16 input combinations). Also, it is noted that the MES of the Proposed-ES design for inputs A4A3A2A1 = "0000", "0111", "1000", and "1111" is ±2. However, this MES will not affect the performance of multipliers incorporating the Proposed-ES compressor, as it is used only in PP columns where the sum signal has a stronger influence on the multiplier output. The logic depth of the Proposed-ES design is t_Prop-ES = 2 * t_XOR + 1 * t_OR.
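The stated error pattern of the Proposed-ES compressor can be reproduced with a table-level model. The sketch below is an illustration built only from the error pattern quoted above, not from Eqs. (6)-(8): the names `proposed_es_model` and `error_table` are hypothetical, the sum is taken as the exact parity bit, and the carry is the modified exact carry with its bit flipped on the three listed patterns.

```python
from itertools import product

# Input patterns are written A4 A3 A2 A1, as in the text.
CARRY_ERROR_PATTERNS = {"0000", "0111", "1000"}

def proposed_es_model(a4, a3, a2, a1):
    """Table-level stand-in (assumption): exact S, carry flipped on three patterns."""
    ones = a4 + a3 + a2 + a1
    s = ones & 1                   # sum is the exact parity bit
    c = 1 if ones >= 2 else 0      # carry of the modified exact compressor
    if f"{a4}{a3}{a2}{a1}" in CARRY_ERROR_PATTERNS:
        c ^= 1                     # a one-bit carry error is a bit flip
    return s, c

def error_table(model):
    """Map each pattern to (S + 2C) minus the true number of ones."""
    table = {}
    for a4, a3, a2, a1 in product((0, 1), repeat=4):
        s, c = model(a4, a3, a2, a1)
        table[f"{a4}{a3}{a2}{a1}"] = (s + 2 * c) - (a4 + a3 + a2 + a1)
    return table

es_errors = error_table(proposed_es_model)
carry_error_rate = len(CARRY_ERROR_PATTERNS) / 16
print({p: e for p, e in es_errors.items() if e != 0})
print(carry_error_rate)  # 0.1875
```

In this model the nonzero deviations occur exactly for "0000", "0111", "1000", and "1111", each with magnitude 2, which agrees with the ±2 figure quoted for those four inputs.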

Proposed-EC Design
In the Proposed-EC design shown in Fig. 4, a single-bit approximate FA (AFA) takes three inputs A1, A2, A3 and generates the approximate carry (AFA_carry) and approximate sum (AFA_sum) signals as outputs. The sum output is denoted S' to distinguish it from the exact sum S, and the carry output is C. The Boolean expressions that implement the S' and C outputs of the Proposed-EC design are given by Eqs. (11), (12).
From the Boolean Eqs. (11) and (12), the Proposed-EC design produces three sum errors, for inputs A4A3A2A1 = "1000", "0111", and "0000", giving an error proportion of 0.1875, and no error in carry. For the input "1111", the error is considered negligible since the approximation limits the MES to −1. The probability of this input is 0.06, and it does not affect the performance of the proposed multipliers, since the deviation of the output from the exact result is not more than 2^n. The Proposed-EC design is targeted at PP columns where the carry signal has a stronger influence on the multiplier output.
The logic depth of the Proposed-EC compressor is t_Prop-EC = 2 * t_XOR + 1 * t_OR.
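A table-level model makes the Proposed-EC error pattern concrete. As before, this is a sketch under stated assumptions rather than the logic of Eqs. (11), (12): the name `proposed_ec_model` is hypothetical, the carry is taken as the exact carry of the modified compressor, the sum bit is flipped on the three listed patterns, and the sum is forced high for "1111" as described for the Max-ED = −1 approximation.

```python
from itertools import product

# Input patterns are written A4 A3 A2 A1, as in the text.
SUM_ERROR_PATTERNS = {"0000", "0111", "1000"}

def proposed_ec_model(a4, a3, a2, a1):
    """Table-level stand-in (assumption): exact C, sum flipped on three patterns."""
    ones = a4 + a3 + a2 + a1
    pattern = f"{a4}{a3}{a2}{a1}"
    c = 1 if ones >= 2 else 0      # carry is kept exact
    s = ones & 1                   # exact parity bit before approximation
    if pattern in SUM_ERROR_PATTERNS:
        s ^= 1                     # a one-bit sum error is a bit flip
    elif pattern == "1111":
        s = 1                      # sum forced high limits the deviation to -1
    return s, c

ec_errors = {}
for a4, a3, a2, a1 in product((0, 1), repeat=4):
    s, c = proposed_ec_model(a4, a3, a2, a1)
    ec_errors[f"{a4}{a3}{a2}{a1}"] = (s + 2 * c) - (a4 + a3 + a2 + a1)

sum_error_rate = len(SUM_ERROR_PATTERNS) / 16
print({p: e for p, e in ec_errors.items() if e != 0})
print(max(abs(e) for e in ec_errors.values()))  # 1
```

Because only the sum bit is approximated, every deviation in this model has magnitude 1, half the weight of a carry error, which is why the design is attractive in columns where the carry dominates.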

Design of Precise Approximate Multiplier
This section describes the design of the n × n precise approximate multiplier, which uses the proposed approximate compressors for the n least significant columns and the exact 4:2 compressor for the remaining most significant PP columns. PP compression is performed in several stages on a Dadda structure-based PP arrangement using compressors and carry-save addition. In the final stage, an RCA performs the final addition. The maximum error due to imprecise compression in the 2n-bit multiplier output is confined to one unit at Bit Significant Position (BSP) n (i.e., 2^n). Fig. 5 illustrates the methodology for n = 8, with multiplier input operands a[7:0] and b[7:0]. The generated PP bits are arranged in the Dadda structure to reduce complexity, and PP compression is done in four levels. Since the carry signal has a higher significance than the sum, the Proposed-EC compressor is used in the approximate part, and the modified exact 4:2 compressor is used in the accurate part of the multiplier. Note that the modified exact 4:2 compressor introduces an error for the input "1111"; this error is removed by adding the compensation bit E_ji in the MS column. The Boolean expression of E_ji is given by Eq. (13), where 'i' represents the stage number in PP compression and 'j' represents the column weight. Fig. 6 shows the gate-level architecture for parallel computation of E_1-12 and E_1-13. This ensures that the error compensation in stage-1 for the modified exact compressor does not introduce additional delay in the PP compression. In the final stage, the sum and carry signals are added using a 16-bit RCA. To trade accuracy for area and delay relative to the proposed basic version, we perform the final stage addition using two different methodologies. The structural modifications in the proposed multiplier are described in subsections 3.2.1 and 3.2.2.
In all the proposed multiplier designs, approximate Half Adder (AHA) and approximate Full Adder (AFA) cells are used for adding 2 and 3 bits in the LS part, and exact adders are used in the MS part. The expressions for the carry and sum outputs of these two adders are given by Eqs. (14) and (15), respectively.
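The column-wise PP arrangement that the compressors operate on can be sketched as follows. This is a minimal illustration of AND-based PP generation and of the split into approximate and exact columns, not of the compression stages in Fig. 5; the function names and the operand values are hypothetical.

```python
def partial_product_columns(a, b, n):
    """Group the AND-generated PP bits a_i & b_j by binary weight i + j."""
    columns = [[] for _ in range(2 * n - 1)]
    for i in range(n):
        for j in range(n):
            columns[i + j].append(((a >> i) & 1) & ((b >> j) & 1))
    return columns

def columns_value(columns):
    """Weighted sum of all PP bits; equals the exact product a * b."""
    return sum(sum(col) << weight for weight, col in enumerate(columns))

n = 8
a, b = 0xB7, 0x5D                      # arbitrary example operands
cols = partial_product_columns(a, b, n)
heights = [len(col) for col in cols]   # number of PP bits per column

approximate_columns = list(range(n))         # weights 0 .. n-1 (LS part)
exact_columns = list(range(n, 2 * n - 1))    # weights n .. 2n-2 (MS part)

print(heights)                       # [1, 2, 3, 4, 5, 6, 7, 8, 7, 6, 5, 4, 3, 2, 1]
print(columns_value(cols) == a * b)  # True
```

Errors introduced by approximate compressors in columns 0 to n−1 are weighted by at most 2^(n−1) per unit, which is the intuition behind placing the exact compressors in the columns of weight n and above.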

Proposed-AE Design
In the Area Efficient (AE) design, designated Proposed-AE (P-AE), carry bits are not generated in the LS imprecise part of the final stage, and the addition of the MS bits in the exact part is performed using an n-bit RCA. Fig. 7 shows the PP compression using 4:2 compressors and the final stage addition in the P-AE multiplier for n = 8. Note that we use the Proposed-EC compressor in the approximate part in stage-1, since the carry signal has a greater influence in stage-2 and on the multiplier output. In stage-2, we use the Proposed-ES compressor for columns with binary weight 3-6, as the influence of the sum signal on the multiplier output is higher in these columns. However, for the column with binary weight 7, we use the Proposed-EC compressor, as the carry signal in this column has higher significance on the final result. Note from Fig. 7 that the P-AE multiplier uses an n-bit RCA in the final stage, which reduces delay and area considerably at a small cost in precision. Nevertheless, the maximum error in the P-AE multiplier due to approximation is restrained to 1 unit at BSP n (i.e., 2^n).

Proposed-AEER Design
In the Proposed Area Efficient with Error Recovery multiplier (P-AEER), carry bits are generated in the most significant two PP columns of the approximate part in the pre-final stage. An error recovery (E_R) signal is generated using AND logic on these carry bits and is added to the least significant carry signal in the accurate part. The logic of E_R is given by Eq. (16).
Symbols + and & represent arithmetic OR and logical AND operations, respectively. Fig. 8 shows the PP compression in the P-AEER design using exact and proposed 4:2 compressors for n = 8. Note from Fig. 8 that, for the final stage addition, we use only an n-bit RCA in the most significant part, which reduces the delay in carry propagation significantly. Hence, the P-AEER multiplier achieves better area and delay reductions than the basic version and a better average error than the P-AE design. Tab. 1 shows E_R for various combinations of the signals used for error recovery in the P-AEER design. The maximum error in the P-AEER multiplier is likewise restrained to 1 unit at BSP n (i.e., 2^n).

Results and Discussion
The proposed multipliers, the inexact compressors, and the designs reviewed in the literature are modeled in structural Verilog HDL. The multipliers are synthesized using Cadence Encounter in 90 nm technology. To choose the supply voltage for simulations, we estimated the power and delay of the proposed compressor designs and found that their PDP is low at a supply voltage of 1 V. Hence, the performance comparison of multipliers with the proposed compressors, and of the proposed multiplier variants, is made using simulations at a supply voltage of 1 V.

Approximate Compressors Performance Comparison
Performance metrics in terms of power, area, delay, and PDP of proposed compressors and state-of-the-art approximate designs used for comparison are shown in Tab (I4-2C), respectively. The compressors in Approximate multiplier with adder and DQ4:2C3 generate the carry output directly from Cin and X4, respectively. Hence, in HDL modeling, a buffer is used for the generation of carry signals in these designs. As a result, the PDP of the compressors in Approximate multiplier with adder and DQ4:2C3 is significantly lower than that of the Exact design, the designs in XOR-XNOR module, Modified architecture of Dadda Multiplier, and Imprecise 4-2 compressor, and the proposed compressors. The approximate compressor in 4-2 compressor-based approximate multiplier demonstrates the lowest PDP of all the designs considered. This is due to its parallel logic design, which reduces delay significantly but at the expense of 5 errors.

Error Metrics
Error metrics are important parameters for evaluating the efficacy of an approximate design in error-tolerant applications. In this section, the proposed approximate multipliers and the state-of-the-art approximate designs from recent works (Modified architecture of Dadda Multiplier, 4-2 compressor-based approximate multiplier, DQ4:2C3) are evaluated against the standard output in terms of various error metrics. The accuracy metrics considered are Mean Error Distance (MED), Mean Relative Error Distance (MRED), Normalized Error Distance (NED), and Percentage Accuracy. Tab. 3 shows the MED, MRED, and NED values of the proposed and prior multiplier designs. It is noted from Tab. 3 that the MED values of our basic, AE, and AEER designs are considerably lower than those of the approximate multipliers in 4-2 compressor-based approximate multiplier, DQ4:2C1, DQ4:2C2, DQ4:2C3, and Modified architecture of Dadda Multiplier, thanks to the proposed compressor designs, which reduce the probability of signal errors that influence the multiplier output. Moreover, the modified exact 4:2 compressors used in the most significant n PP columns, together with the compensation bit added to bias the error for the input "1111", keep the error within 2^n. Consistently, the MRED and NED values of the proposed multipliers are relatively low compared to the approximate multipliers in 4-2 compressor-based approximate multiplier, DQ4:2C1, DQ4:2C2, DQ4:2C3, and Modified architecture of Dadda Multiplier.
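For reference, the three error metrics can be computed exhaustively for small operand widths. The sketch below uses the definitions commonly adopted in the approximate arithmetic literature, which are assumed here since the text does not restate them: ED = |exact − approximate|, MED is the mean ED, MRED is the mean of ED / exact over pairs with a nonzero exact product, and NED is MED divided by the maximum output (2^n − 1)^2. The multiplier `lsb_dropping_multiplier` is a hypothetical stand-in, not one of the proposed designs.

```python
def error_metrics(approx_mul, n):
    """Exhaustive MED, MRED, and NED of an n x n unsigned approximate multiplier."""
    max_output = (2**n - 1) ** 2
    total_ed = 0
    total_red = 0.0
    red_count = 0
    pairs = 0
    for a in range(2**n):
        for b in range(2**n):
            exact = a * b
            ed = abs(exact - approx_mul(a, b))
            total_ed += ed
            if exact != 0:             # relative error is undefined for exact == 0
                total_red += ed / exact
                red_count += 1
            pairs += 1
    med = total_ed / pairs
    mred = total_red / red_count
    ned = med / max_output
    return med, mred, ned

def lsb_dropping_multiplier(a, b):
    """Hypothetical stand-in: exact product with its least significant bit cleared."""
    return (a * b) & ~1

med, mred, ned = error_metrics(lsb_dropping_multiplier, 4)
print(med)   # 0.25: the product is odd for 64 of the 256 operand pairs
print(ned == med / 225)
```

Replacing `lsb_dropping_multiplier` with a bit-accurate model of a given design yields its metrics on the same footing; for n = 8 the loop visits 65,536 operand pairs.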
The area cost of the proposed designs is low compared to the Exact design and the design in Imprecise 4-2 compressor. The approximate multipliers in DQ4:2C1, DQ4:2C2, DQ4:2C3, DQ4:2C4 use approximate compressors in all PP columns, while the approximate designs in 4-2 compressor-based approximate multiplier and Modified architecture of Dadda Multiplier do not add an error compensation bias for X4X3X2X1 = "1111" in the exact part, and hence exhibit low area compared to the proposed designs. However, the average error of  Fig. 9b; the proposed multipliers demonstrate low MRED and high area, while the multiplier in DQ4:2C1 reveals low area and high MRED. The approximate designs in DQ4:2C2 and 4-2 compressor-based approximate multiplier exhibit moderate MRED and area among all the multipliers compared. Furthermore, Tab. 5 gives a brief comparison of the power, delay, area, and PDP metrics of the proposed multipliers for n = 8, 12, 16. It is noted from Tab. 5 that the power dissipation and area increase by factors of 6X and 4.8X, respectively, for a 2X increase in input operand bit-width.

Implementation in Digital Image Processing
An implementation in image enhancement (viz., smoothing and scaling) and signal processing applications is carried out on an FPGA board to validate the proposed multipliers in fault-tolerant image processing applications. The Verilog HDL models of the proposed multipliers and of the state-of-the-art approximate designs from the literature are synthesized using the Xilinx ISE 14.2 tool, and the prototype of the application system is built on a Spartan 6 FPGA (XC6SLX45-CSG324 device). Input images and signals are sent to the FPGA board using Xilinx-MATLAB co-simulation with the System Generator tool.

Image Smoothing
In digital images, image smoothing is performed to reduce the blurring effect and noise. It is a pre-processing operation performed on images prior to the main object extraction. The smoothing operation averages the pixel intensity values of the input image in a pre-defined window and replaces the processing pixel with the result. The weight of each pixel in the window depends on the type of mask used. For example, a filter with a 3 × 3 mask replaces the processing pixel with the intensity value G(x, y) defined by Eq. (17), where f(x ± s, y ± t) represents the pixel intensity values of the input image in the window, s and t can take values 0 or 1, and α_i represents the corresponding weights in the filter mask. The window is moved pixel by pixel until all the pixels in the input image are processed. Fig. 10 shows the output images produced by the image smoothing system built with the proposed multipliers and with the previously tabulated imprecise multipliers. The quality metrics Mean Absolute Error (MAE), Peak Signal to Noise Ratio (PSNR), and Structural Similarity Index (SSIM) [37] are used to evaluate the proposed approximate designs. They are estimated using output images processed by the smoothing system designed with error-tolerant and exact multipliers. MAE, PSNR, and MSE are defined by Eqs. (18)-(20).
\[ \mathrm{MSE} = \frac{1}{ab} \sum_{x=0}^{a-1} \sum_{y=0}^{b-1} \left[ G(x, y) - G'(x, y) \right]^2 \quad (20) \]
where G(x, y) and G′(x, y) denote the outputs of the exact and error-tolerant systems, respectively, and a and b denote the image size. Note from Fig. 10 that the PSNR, MAE, and SSIM values of output images processed by the image smoothing system with the Proposed-basic, P-AE, and P-AEER designs are better than those of systems based on the approximate multipliers in 4-2 compressor-based approximate multiplier, DQ4:2C1, DQ4:2C2, DQ4:2C3, DQ4:2C4, and Modified architecture of Dadda Multiplier. This is due to the low average error of the proposed designs, a result of faithful approximation. Additionally, we extracted the Average MAE (AMAE) metric from output images processed by the smoothing system for different standard input images (Lena, Boat, Cameraman, Bridge, Peppers); the results are shown in Fig. 11. From Fig. 11, the AMAE values of output images processed by systems based on the Proposed-basic, P-AE, and P-AEER designs are significantly better than those of systems based on the approximate multipliers in 4-2 compressor-based approximate multiplier, DQ4:2C1, DQ4:2C2, DQ4:2C3, DQ4:2C4, and Modified architecture of Dadda Multiplier. The AMAE value of the approximate multiplier based system fares better than the proposed technique and other approximate multiplier systems; however, it exhibits a high area. Also, it is noted that the image smoothing system with the proposed designs fares better in area than systems with the designs in Modified architecture of Dadda Multiplier and Imprecise 4-2 compressor; conversely, its area is higher than that of systems with the approximate multipliers in 4-2 compressor-based approximate multiplier, DQ4:2C1, DQ4:2C2, DQ4:2C3, and DQ4:2C4. P-AE and P-AEER based smoothing systems fare better in processing delay than systems built with Proposed-basic, Modified architecture of Dadda Multiplier, DQ4:2C3, DQ4:2C4, 4-2 compressor-based approximate multiplier, and Imprecise 4-2 compressor.
Imprecise 4-2 compressor based system demonstrates the highest area and processing delay compared to related approximate multiplier smoothing systems.
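The smoothing operation and the image quality metrics can be sketched in a few lines of Python. This is an illustrative software model, not the FPGA implementation: the function names are hypothetical, border pixels are simply copied (an assumption), the multiplier is passed in as a parameter so that an approximate model can be swapped for the exact product, and MAE and PSNR follow their standard definitions with an assumed 8-bit peak value of 255.

```python
import math

def smooth(image, mask, multiply=lambda w, p: w * p):
    """3 x 3 weighted average; border pixels are copied unchanged (assumption)."""
    rows, cols = len(image), len(image[0])
    norm = sum(sum(r) for r in mask)
    out = [row[:] for row in image]
    for x in range(1, rows - 1):
        for y in range(1, cols - 1):
            acc = 0
            for s in (-1, 0, 1):
                for t in (-1, 0, 1):
                    acc += multiply(mask[s + 1][t + 1], image[x + s][y + t])
            out[x][y] = acc // norm
    return out

def mae(g, g2):
    """Mean absolute error between two equally sized images."""
    a, b = len(g), len(g[0])
    return sum(abs(g[x][y] - g2[x][y]) for x in range(a) for y in range(b)) / (a * b)

def mse(g, g2):
    """Mean squared error between two equally sized images."""
    a, b = len(g), len(g[0])
    return sum((g[x][y] - g2[x][y]) ** 2 for x in range(a) for y in range(b)) / (a * b)

def psnr(g, g2, peak=255):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    m = mse(g, g2)
    return math.inf if m == 0 else 10 * math.log10(peak ** 2 / m)

mask = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]           # example weights, sum = 16
image = [[(17 * x + 31 * y) % 256 for y in range(6)] for x in range(6)]

exact_out = smooth(image, mask)
# Hypothetical approximate multiplier: product with its least significant bit cleared.
approx_out = smooth(image, mask, multiply=lambda w, p: (w * p) & ~1)
print(mae(exact_out, approx_out), psnr(exact_out, approx_out))
```

Comparing `exact_out` with the output obtained from a bit-accurate model of a given multiplier gives the MAE and PSNR figures in the same way they are defined above; SSIM requires windowed statistics and is omitted from this sketch.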

ECG Signal Processing
A 27-tap Finite Impulse Response (FIR) filter is implemented to check the functionality of the proposed multiplier. Fig. 12 shows the Data Flow Graph (DFG) of an n-tap FIR filter. From Fig. 12, it is noted that the multiplier is the significant element contributing to the area and critical delay of the FIR filter. Coefficients for the FIR filter are chosen in MATLAB using the Remez command. An ECG signal corrupted by white Gaussian noise (Fig. 13b) is fed as input to the FIR filter. The output signals of the FIR filters implemented with the proposed and the previously tabulated approximate multipliers are compared with the standard output to measure the accuracy of the proposed multipliers. The I/O signals processed by the FIR filter designed with various multipliers are illustrated in Figs. 13a-13c. Note from Fig. 13c that the output waves processed by the Proposed-basic and P-AE based FIR filters have small deviations when compared with the standard output. Output waves processed by the DQ4:2C1 and 4-2 compressor-based approximate multiplier FIR systems display the highest and moderate deviations, respectively, compared to the standard output.

Figure 13: (a) Standard ECG input signal (b) ECG signal corrupted by white Gaussian noise (c) Output signals processed by FIR filter systems
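The multiply-accumulate structure of the FIR data flow graph can be sketched as a direct-form filter with a pluggable multiplier. The code below is an illustrative model only: `fir_filter` is a hypothetical name, the small integer coefficients are stand-ins for the 27 coefficients designed in MATLAB, and the approximate multiplier shown is not one of the proposed designs.

```python
def fir_filter(samples, coeffs, multiply=lambda c, x: c * x):
    """Direct-form FIR: y[n] = sum over k of coeffs[k] * samples[n - k]."""
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, c in enumerate(coeffs):
            if n - k >= 0:                 # samples before the start are taken as zero
                acc += multiply(c, samples[n - k])
        out.append(acc)
    return out

coeffs = [1, 3, 5, 3, 1]                   # hypothetical integer low-pass taps
signal = [10, 12, 9, 40, 11, 10, 8, 12]    # hypothetical quantized input samples

exact_out = fir_filter(signal, coeffs)
# Hypothetical approximate multiplier: product with its least significant bit cleared.
approx_out = fir_filter(signal, coeffs, multiply=lambda c, x: (c * x) & ~1)
deviation = [abs(e - a) for e, a in zip(exact_out, approx_out)]
print(exact_out)
print(max(deviation))
```

Each output sample needs one multiplication per tap, so a 27-tap filter performs 27 multiplications per sample, which is why the multiplier dominates the area and critical delay of the filter.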

Conclusion
In this work, two area-efficient approximate 4:2 compressors are proposed, targeted at PP compression in multipliers. The compressor logic is realized such that the first variant generates the sum without error and the second variant generates the carry without error. Evaluations revealed that the proposed compressors fare better in gate count and error reduction than the previous variants discussed in the literature. Implementation of the proposed compressors in the Dadda multiplier demonstrated the superior performance of the proposed multiplier in processing speed and accuracy compared to earlier designs. Enhanced variants of the proposed multiplier, in terms of area and error recovery, demonstrated better area efficiency at a trade-off in accuracy. Finally, the proposed multipliers are implemented in signal and image processing applications to verify their functionality and output quality. Visual examination of the processed output images and signals showed that the proposed inexact multipliers perform similarly to the standard design, with minimal error deviation.