Efficient Energy Optimized Faithful Adder with Parallel Carry Generation

Approximate computing has received significant attention in the design of portable CMOS hardware for error-tolerant applications. This work proposes an approximate adder that to optimize area delay and achieve energy efficiency using Parallel Carry (PC) generation logic. For ‘n’ bits in input, the proposed algorithm use approximate addition for least n/2 significant bits and exact addition for most n/2 significant bits. A simple OR logic with no carry propagation is used to implement the approximate part. In the exact part, addition is performed using 4-bit adder blocks that implement PC at block level to reduce node capacitance in the critical path. Evaluations reveal that the maximum error of the proposed adder confines not more than 2n/2. As an enhancement of the proposed algorithm, we use the Error Recovery (ER) module to reduce the average error. Synthesis results of Proposed-PC (P-PC) and Proposed-PCER (P-PCER) adders with n-16 in 180nm Application Specific Integrated Circuit (ASIC) PDK technology revealed 44.2% & 41.7% PDP reductions and 43.4% & 40.7% ADP reductions, respectively compared to the latest best approximate design compared. The functional and driving effectiveness of proposed adders are examined through digital image processing applications.

In this brief, we propose a high-speed area efficient approximate adder that performs multibit addition in block sets, with parallel carry generation logic implemented at block level in the accurate part and simple OR logic on the least significant n/2 bits(approximate part). Evaluations revealed that the maximal error in the proposed adder, due to approximation, confines to 2 n/2 . The rest of the paper is organized as follows. Section 2 presents the area and delay evaluation of various adder algorithms. Section 3 presents the design methodology of proposed adders. Next, Section 4 evaluates the performance of the proposed adders. Furthermore, Section 5 discusses the implementations of the proposed adders in digital image processing applications. Finally, Section 6 gives a brief conclusion on the proposed work.

Area and Delay Evaluation of Adder Algorithms
The SOP implementation of combinational functions employs AOI (AND-OR-INVERT) logic for CMOS realization. AOI structure for Sum and Carry outputs of a conventional 1-bit Full Adder (FA) cell is shown in Fig. 1. The gates in same row within dotted lines perform a parallel operation, and hence we consider one gate delay for parallel gates. Area cost is evaluated by the total count of AOI gates used in the logic design. From Fig. 1, the logic delay and gate count (GC) of 1 bit FA are given by (1) & (2), respectively.
where '+' represents arithmetic OR operation. Based on the above methodology, we have evaluated various adders' critical delay and area and discussed them in the following subsections.

Ripple Carry Addition
In CMOS realization RCA [1] offers better structural regularity due to the series arrangement of FA cells. An n bit RCA is constructed using n FA cells cascaded together, with the carryout bit of one FA tied to the carry-in bit of the next higher weight FA cell. However, RCA incurs high carry propagation delay due to rippling of carry signal from least significant bit position to the most. Critical delay of n-bit RCA (t RCA ) is given by (3).
where t FA is the carry generation delay of 1 bit FA cell. From (3), it is evident that the delay of n bit RCA depends on t FA and it increases in O (n). Hence, minimizing t FA of 1bit FA cell will minimize t RCA . From Fig. 1, GC of n-bit RCA is given by

Carry Select Addition
CSLA [1] performs high-speed addition using an array of RCA pair and 2:1 Muxes. Each RCA block performs addition independently with carry-in as either logic low or logic high. For n = 16, with 4 bits in an RCA block, the CSLA design will have 4 RCA pairs. MUX units use the Carryout (C out ) signal from preceding blocks to produce final output from Sum signals generated by the RCA pair at the corresponding binary weight. Fig. 2 shows the block level architecture for n bit CSLA.
From Fig. 2, the gate count of n-bit CSLA is given by

Carry Look Ahead Addition
CLA [5,6] eliminates the carry-rippling problem encountered in RCA by generating carry signals at corresponding binary weights using C in and Propagate (P), Generate (G) signals generated at lower weights. Hence the area cost and hardware complexity of CLA are high for large input bit widths compared to RCA. Following are the functional symbols used to express logic functions viz., |−OR; * −AND; ∩−XOR. Based on the above representation, the logic expression for Carry signal (Ci) is given by where, P i = a ∧ i b i and G i = a i &b i and 'i' represents the bit position. Using (7), critical delay of n-bit CLA (t CLA ) is given by Based on AOI representation in Fig. 1, GC of n-bit CLA is given by Tab. 1 compares the logic depth and area of the various adders discussed in subsections Figs. 2a-2c, assuming delay and area of each gate to be 1 unit. Note from Tab. 1, RCA performs better in terms of area saving while CSLA performs better in speed improvement. CLA can optimize area and delay significantly. However, CLA exhibits high hardware complexity and insufficient regularity of cell arrangement compared to RCA for large n, making it unsuitable for CMOS realization. Limiting the number of bits in CLA improves the regularity of structure and reduces hardware complexity. Hence, in the proposed approach input bits are grouped into four, and PC logic-based addition is performed within the group in the accurate part. Furthermore, the approximate part of the adder is implemented with simple OR logic to reduce area and delay.

Proposed Approximate Adders
This section briefs the proposed approximate adders that implement carry-free addition in the least significant approximate part (n/2 bits) and error-free addition in the most significant part. Accurate part of the adder is implemented with a hybrid approach that combines blocklevel addition with look ahead based fast parallel carry generation. Fig. 3 shows the block-level architecture of the proposed adder.  In the proposed addition algorithm, for n bits in input operands and 4 bits in a block, the number of blocks in the exact part will be n/8. It is noted from (9), C i is a combination Primary Carry Propagate (PCP) term that propagates input carry-C in and Intermediate Carry Propagate (ICP) term that propagates in-between carry signals generated at lower significant bit positions. PCP and ICP are defined by (10) & (11), respectively.
Analysis of (10) & (11) reveals that parallelism can be exploited at two levels viz., (i) within ICP by merging consecutive odd and even terms, (ii) between ICP and PCP by generating PCP AND terms in parallel with level-1 logic reduction in ICP. Generation of carry signals using proposed delay balanced carry look ahead algorithm for 4-bit addition is shown in (12)-(15), and the corresponding gate level diagram is shown in Fig. 4.
Note from Fig. 4, ICP use certain PCP AND terms generated in level-1 logic reduction in further stages and this in turn reduces the gate count and delay significantly. From Fig. 4, the logic delay of C in to C 1 -C 4 (t c ) is given by Also, it is noted from Fig. 4 that delay balanced PC logic maintains uniform gate delay from C in to carry out signals in the accurate part. Furthermore, it is observed that C in is used as an input in the pre-final stage and hence provides a choice for further delay reduction in subsequent higher weight adder blocks. Evaluations revealed that the Proposed PC (P-PC) adder incurs n/8 * t c logic delay while conventional RCA implementation of accurate part incurs n/2(t c ) logic delay. Hence, delay balanced PC logic in the proposed adder demonstrates 83.3% logic depth reduction compared to the standard implementation. In the least significant approximate part of the P-PC adder, carry signals are not generated, and sum signals are generated using approximate full adder (AFA) cells. Boolean expression that defines the logic of AFA cell is given by (17). Note from (10) AFA in the approximate part generates an error for input a = b = logic high. However, approximate sum = 1 for a = b = 1, which reduces the mean error distance (MED) due to the elimination of carry signal to 1 at the corresponding binary weight. Hence, the proposed approximation logic reduces delay and average error significantly. Evaluations reveal that the maximum error due to approximation in P-PC adder is not more than 1unit at bit position n/2 (UBP-n/2).
The novelty of the proposed design is that it trades-off error for penalty in area overhead, and this can be realized using an error recovery unit. In proposed design 2 (P-PCER), the algorithm generates an error recovery (E R ) signal using Generate (G) and approximate sum signals and is added to the exact part of adder at least significant bit position. Boolean expression defining logic of E R is given by (18).
where 'j' represents the count of most significant bits in the approximate part considered for E R signal generation. Note that, E R signal reduces the maximum error (ME) in the proposed adder to 2 n/2 -1 with probability P(ME) = 2i/n and mean error significantly. Fig. 5 shows the block level architecture of the P-PCER design.

Error Metrics
Error metrics are the essential parameters to evaluate the efficiency of an approximate design in error-tolerant applications. In this section, performance of the proposed approximate adders and state-of-the-art approximate designs is evaluated in terms of various error metrics [19,20] using standard output as the reference. The accuracy metrics considered are: Mean Error Distance (MED), Mean Relative Error Distance (MRED), Normalized Error Distance (NED) and Percentage Accuracy defined as follows.
where, ED i -Error Distance at output combination corresponding to i th input combination, S i -exact output corresponding to i th combination, D-maximum possible ED of an approximate circuit and R e -exact result.
Tab. 2 compares MED, MRED and NED value of the proposed and prior adder designs. Note from Tab. 2, that proposed adders fares better MED and MRED values compared to ET-CSLA [16], SAET-CSLA [16], HPETA1 [17], HPETA2 [17], SARA [18], AEPA-FTFA1 [19] and AEPA-FTFA2 [19] designs. AEPA-EFA [19] design use exact full adder cell in approximate and accurate part. Hence MED, MRED and NED values of AEPA-EFA [19] design are significantly better compared to proposed and state-of the art adder designs. P-PCER design fare better in MED, MRED and NED values compared to the P-PC design, thanks to the error recovery unit used in P-PCER design.
In addition, we have extracted power dissipation of P-PC and P-PCER adders against various operating frequency and temperature using transistor level schematics designed in Cadence Virtuoso design environment and simulations using Cadence Spectre. Fig. 7 shows the Power vs. frequency plot of P-PC adder in 90 and 180 nm ASIC PDK technology.  Fig. 7, the power dissipation increases linearly for a rise in frequency. Fig. 8 shows the power vs. temperature plots of proposed P-PC and P-PCER adders in 180 nm ASIC PDK. Note from Fig. 8, power dissipation of proposed adders decreases moderately for a rise in temperature in the range 27 to 120 • C and exhibits functionality without any degradation.

Implementation in Digital Image Processing
To determine the novelty of the proposed approximate adders in fault-tolerant applications and verify their driving competence, implementations in image enhancement applications viz., filtering and blending are done in FPGA platform. The Verilog HDL models of the proposed adders and state-of-the-art approximate designs defined in literature are synthesized using Xilinx ISE 14.2, tool and hardware for the application system is prototyped on Spartan 6 FPGA (XC6XLX45-CSG324) device. Input images of size 256 × 256 are fed from MATLAB environment to FPGA hardware using Xilinx-MATLAB co-simulation with System Generator tool.

Gaussian Filter
To assess the performance of proposed P-PC and P-PCER adders in recurring addition, an implementation in Gaussian Filter (GF) is done. The Gaussian-smoothing filter is a 2-D convolution filter used to remove Gaussian noise and smoothen the noisy input image. Statistical noise having Probability Density Function (PDF) equal to that of the normal distribution is known as Gaussian noise. The PDF of Gaussian random variable is given by: where, z = grey level, μ = mean value and σ = standard deviation. Fig. 9 shows the block-level architecture of GF. 256 × 256 standard images corrupted with Gaussian noise (μ = 0.2 and σ = 0.005) are used as inputs to the GF system implemented with exact and various approximate adders. Processed outputs of various GFs are shown in Fig. 10. The quality metrics Mean Square Error (MSE) and Structural Similarity Index (SSIM) between output images processed by approximate and exact GF systems are used as a measure to evaluate the performance of the proposed approximate designs. MSE is given by (26).
where, a, b-size of the image; S (i, j) & S' (i, j) are the exact and approximate filter outputs. Note that, from Fig. 10, SSIM and MSE values of output images processed by P-PC and P-PCER GFs are better compared to ET-CSLA [16], SAET-CSLA [16], HPETA1 [17], SARA [18], AEPA-FTFA1 [19] and AEPA-FTFA2 [19] GF systems. AEPA-EFA [19] GF system fare better SSIM and MSE compared to proposed and other approximate GFs. This is due to the exact FA cells used in the approximate logic of AEPA-EFA [19], limiting the maximum error to UBP-n/4.

Image Blending
Image blending is the process of adding the corresponding pixel values of two input images of the same size. Pixel intensity values of the input images are scaled such that the pixel value of the resulting image does not exceed the maximum threshold. Fig. 12 shows the block-level architecture for image blending. Input images of size 256 × 256, with pixel intensity values scaled by fractional constant 'α' and '1 −α' respectively, are fed to the adder unit. Adder unit sums the scaled images to produce the resultant blended image.  13 shows the output images processed by the image blending system implemented with proposed and prior approximate adders. Note from Fig. 13, SSIM and MSE values of output images processed by image blending system implemented with P-PC and P-PCER adders are significantly better compared to output images processed by adders considered for comparison. In addition, we have extracted the AMSE metric of various blending systems using processed images for various combinations of standard input images-Lena, Boat, Cameraman, Bridge and Peppers with α = 0.8, 0.5 & 0.2 and shown in Fig. 14. It is noted from Fig. 14, AMSE values of P-PC and P-PCER adder systems are significantly reduced compared to conventional CLA [1], ET CSLA [16], SAET-CSLA [16], HPETA1 [17], HPETA2 [17], SARA [18], AEPA-FTFA1 [19] and AEPA-FTFA2 [19] systems. In this article, we have proposed area-delay and energy optimized faithful approximate adders using parallel carry generation logic with reduced critical node capacitance. To trade-off error for trifling increase in area, error recovery units are used in the approximate part. Synthesis results with n-16 and 4-bit parallel carry generation block revealed that the proposed P-PC and P-PCER adders outperformed standard and other approximate designs in terms of area, ADP and PDP reductions. Functionality and driving capability of the proposed adders are evaluated with implementation in image de-noising and image blending systems. The superiority of proposed adders in image processing is demonstrated through a visual examination and structural similarity of processed outputs. At an instance, the proposed adders with n = 16 demonstrated at least 94% structural similarity and 42.4% & 43.7% decline in ADP and PDP, respectively, compared to the recent approximate design.