Document Type : Research Paper
Authors
^{1} Department of Electrical Engineering, Rasht Branch, Islamic Azad University, Rasht, Iran.
^{2} Department of Electrical and Computer Engineering, University of Mohaghegh Ardabili, Ardabil 56199-11367, Iran.
Abstract
Keywords
Main Subjects
INTRODUCTION
As described in the literature, Carbon Nanotubes (CNTs) are one-dimensional tubular structures whose dimensions range from a few nanometers in diameter to micrometers in length [1]. First introduced by Iijima in 1991 [2], a CNT consists of one or more concentric rolled-up graphene sheets, and CNTs are classified according to the number of these layers [3]. Among the different types, Single-Wall CNTs (SWCNTs) have shown good promise for the implementation of next-generation electronic devices [4] due to their unique electrical, thermal and mechanical properties [5-6]. However, it took 15 years from their invention to the hardware realization of CNT Field-Effect Transistor (CNTFET) based devices, a long period for the design of the first CNT-based architecture [7-8].
Nowadays, the applications of CNTFETs cover a wide range of electronic devices, especially low-power digital electronics [9-10]. The same idea can be extended to the design of higher-level systems such as multipliers, Digital Signal Processors (DSPs) and microprocessors. Parallel multipliers are among the most important systems in modern electronics, and because multiplication is more complex than addition and subtraction, their power consumption is a critical parameter in the planning of an integrated system [11].
Regardless of the traditional methods for multiplying two binary numbers, namely the common method based on partial product addition and shifting and the algorithmic method mainly employed in parallel multipliers, a modern multiplier is composed of three main blocks [12]:
- Partial Product Generation (PPG) block
- Partial Product Reduction Tree (PPRT)
- Final accumulation stage
Over the past decades, improvements in submicron CMOS technologies have enabled the implementation of efficient multiplication algorithms, especially the radix-4 Booth scheme, and many circuits have been introduced in recent years [13-19]. On the other hand, the invention of newer processes with better performance and features, CNTFET technology foremost among them, has motivated many circuit designers to switch to these advanced processes.
In this paper, a new architecture for a 4-2 compressor is presented which reduces the delay and power consumption of the final circuit. By combining a modified truth table with efficient Pass Transistor Logic (PTL) and circuit blocks, the proposed design reduces the circuit area in addition to the delay and power consumption. The area reduction is demonstrated by the number of transistors used in the circuit, which represents the area that the final circuit will occupy.
Although an efficient multiplication algorithm can significantly improve the speed of the multiplication process, in a parallel multiplier the main part of the total delay comes from the PPRT block. This stage is built mostly from compressors [13-14]. Therefore, decreasing the delay of this stage considerably improves the speed of the multiplication process and reduces the delay of the critical path.
In a PPRT block the Partial Products (PPs) are summed up and reduced to two rows. The most popular building block of the PPRT is the 4-2 compressor, which is shown in its conventional form in Fig. 1. Many efforts have been devoted in the literature to the efficient design of this block with respect to its speed and power consumption [20-25].
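The arithmetic behavior of a 4-2 compressor can be illustrated with a short behavioral model. The sketch below assumes the textbook realization as two cascaded full adders, which may differ in detail from the schematic of Fig. 1; the function names are illustrative and not taken from the paper. It checks that, for every input combination, the five input bits add up to Sum + 2·(Carry + C_{out}).

```python
from itertools import product


def full_adder(a, b, c):
    """Return (sum, carry) of a 1-bit full adder."""
    s = a ^ b ^ c
    carry = (a & b) | (b & c) | (a & c)
    return s, carry


def compressor_4_2(x1, x2, x3, x4, cin):
    """Behavioral model of a conventional 4-2 compressor.

    Assumption: two cascaded full adders (textbook form).
    Returns (sum, carry, cout); carry and cout both have weight 2.
    """
    s1, cout = full_adder(x1, x2, x3)      # cout does not depend on cin
    total, carry = full_adder(s1, x4, cin)
    return total, carry, cout


def check_all_inputs():
    """Verify x1+x2+x3+x4+cin == sum + 2*(carry + cout) for all 32 cases."""
    for bits in product((0, 1), repeat=5):
        s, carry, cout = compressor_4_2(*bits)
        if sum(bits) != s + 2 * (carry + cout):
            return False
    return True


if __name__ == "__main__":
    print("identity holds for all inputs:", check_all_inputs())
```

In this form C_{out} is independent of C_{in}, which is why cascaded compressors in the same row do not form a carry ripple chain.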
Pass Transistor Logic (PTL) was utilized in [20] to improve the speed of the compressor, which resulted in a gate-level delay of 4 XOR gates from input to output. In [21], a power reduction mechanism was the main emphasis of the design, while the structures reported in [22-23] achieved gate-level latencies of less than 2 XOR gates by modifying the conventional truth table of the 4-2 compressor. In recent years, reducing the power dissipation has been the main emphasis of the design in some reported works [24-25].
In view of the above, the aim of this paper is to implement a novel low-power 4-2 compressor block in CNTFET technology for use in the PPRT of a parallel multiplier, where it can considerably improve the delay and power consumption of the realized multiplier. Along with the power dissipation, the latency of the critical path has been reduced in this work, which further highlights the enhanced features of the proposed structure. In addition, the proposed architecture has also been simulated in a comparable CMOS technology node so that a fair comparison can be made between the CMOS and CNTFET technologies.
The article is organized as follows. Section 2 explains the design procedure of the novel 4-2 compressor. Section 3 presents the simulation results for the proposed compressor along with a comparison with recent designs. Finally, conclusions are drawn in Section 4.
EXPERIMENTAL
In the traditional architecture of Fig. 1, a latency of at least 4 XOR gates is expected in the input-to-output path of a 4-2 compressor [23]. In [21], this gate-level delay was reduced to 3 XOR gates through optimizations in the middle stages of the logic circuitry. In [22] and [23], a simplification of the conventional truth table of the 4-2 compressor, shown in Fig. 2, reduced the gate-level delay to less than 2 XOR gates. To better indicate the improvement, the modified truth table is split into two separate tables according to the value of C_{in}: Fig. 2(a) for C_{in}=0 and Fig. 2(b) for C_{in}=1. At the transistor level, the circuit of [22] has a latency of 4 transistors from input to output. However, from the viewpoint of active area, which is represented by the total number of transistors, and of power consumption, the architecture proposed in [21] has only 60 transistors and will occupy a smaller area on chip.
The truth table in Fig. 2 shows that the segmentation of the truth table in [22] into eight distinct states originates from the don’t-care states of the C_{out} and Carry outputs. This simplification improved the speed of the architecture proposed in [22]. However, the design in [22] has considerable power consumption.
Since the main goal of this work is to improve both the speed and the power consumption of the circuit, the simplified truth table of Fig. 2 has been adopted here for the implementation of the 4-2 compressor in order to reduce the delay of the final circuit. To reduce the power dissipation, the following simplifications have been applied to generate the outputs with minimum power consumption.
In order to produce the Sum output, the following expression can be written:
(1)
In order to reduce the delay and the transistor count for this output, the first XOR gate has been designed using Transmission Gates (TGs). As a result, the latency has been decreased to 4 transistors. Fig. 3(a) shows the circuitry that generates the Sum output, while Fig. 3(b) illustrates the proposed Multiplexer (MUX), which was designed using TGs. Assuming a delay of about one transistor for the MUX circuits in the first and last stages, the total delay from input to output is equal to four transistors. Note that the XOR/XNOR gate is based on the circuit in [21], in which the latency from input to output is equal to 2 transistors.
A closer look at the architecture proposed in [22] shows that the circuitry which produces the Carry signal can be further simplified to reduce the transistor count. For instance, an NMOS transistor is sufficient to transfer the low logic level, while a PMOS transistor is sufficient to transfer the high logic level. Hence, the TGs used for this purpose can be replaced with single transistors. The optimized design is shown in Fig. 4, in which the complements of the inputs are used to produce the Carry signal, in contrast to the realization of [22], where the complement state was transferred to the output node. Fig. 4(a) shows the gate-level schematic of the control signal F, while Fig. 4(b) illustrates the circuit schematic for the Carry signal designed based on the control waveforms. Equations (2), (3) and (4) give the relationship between the inputs and the Carry output:
(2)
(3)
(4)
Finally, the third output (C_{out}), which is sent to the adjacent 4-2 compressor block, is produced with the help of TGs. If two of the inputs are at logic “1”, C_{out} rises to the high logic level. Fig. 5 shows the corresponding circuitry, in which single transistors are used instead of TGs to implement the C_{out} circuit and reduce the total transistor count.
The RC model of the proposed circuit for generating the Sum output is shown in Fig. 6. This model is used to analyze the delay and power consumption of the circuit and to compare the CMOS and CNTFET counterparts. The Sum output is considered to be the critical path of the circuit and the bottleneck in reducing the circuit delay. In this model, the circuit delay is evaluated as the sum of three time constants corresponding to three circuit elements: TG1 (the first transmission gate), the XOR gate and TG2 (the second transmission gate). Here R_{TG1} is the equivalent resistance of TG1, C_{TG1} is the diffusion capacitance of TG1, C_{inxor} is the input capacitance of the XOR circuit, R_{t1} and R_{t2} are the equivalent resistances of the stacked transistors inside the XOR gate, C2 and C3 are the diffusion capacitances of the transistors of the XOR gate, R_{TG2} is the equivalent resistance of TG2, C_{TG2} is the diffusion capacitance of TG2 and C_{L} is the load capacitance. The following equations summarize the analysis of the circuit delay based on the model in Fig. 6.
Assuming that the equivalent resistance of a transmission gate is half that of a single transistor, that the resistance of a single transistor is R, and that all circuit capacitances are equal to C, it can be written that,
The mobility and the gate-channel capacitance of a CNTFET are higher than those of a MOSFET [26]. This leads to a higher current and transconductance for the CNTFET compared to its MOSFET counterpart, and it reduces the equivalent on-resistance of the transistor when it is used as a switch operating in the linear region. In addition to the lower on-resistance, the CNTFET exhibits a smaller diffusion capacitance at the drain and source terminals because of the oxide layer buried beneath the channel (nanotube) area. This layer prevents the formation of a pn-junction between the source/drain and the substrate, which is the major contributor to the diffusion capacitance of a transistor. Therefore, with a smaller equivalent resistance and smaller diffusion capacitances, and according to (5), the delay of the proposed circuit realized in the CNTFET process will be smaller than in MOSFET technology. This is confirmed by the simulation results of both circuits, which are presented in the next section.
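The scaling argument above can be made concrete with a small numerical sketch. The grouping of capacitances per stage used below is an assumption made for illustration only and is not necessarily the grouping of the paper's delay equation; the function name and the scale factors are likewise hypothetical. It only uses the stated simplifications: a transmission gate has half the resistance of a single transistor, a single transistor has resistance R, and every capacitance equals C.

```python
def sum_path_delay(r, c, r_scale=1.0, c_scale=1.0):
    """Sum of three stage time constants along the Sum path (illustrative).

    r, c     : single-transistor resistance and unit capacitance
    r_scale  : hypothetical factor modelling a lower on-resistance
    c_scale  : hypothetical factor modelling a lower diffusion capacitance
    """
    r_on = r * r_scale
    cap = c * c_scale
    r_tg = r_on / 2.0                        # TG resistance is half of R

    # Assumed grouping: each stage resistance drives two unit capacitances.
    tau_tg1 = r_tg * (cap + cap)             # R_TG1 * (C_TG1 + C_inxor)
    tau_xor = (r_on + r_on) * (cap + cap)    # (R_t1 + R_t2) * (C2 + C3)
    tau_tg2 = r_tg * (cap + cap)             # R_TG2 * (C_TG2 + C_L)
    return tau_tg1 + tau_xor + tau_tg2


if __name__ == "__main__":
    baseline = sum_path_delay(1.0, 1.0)
    # Placeholder scale factors, not measured device data.
    reduced = sum_path_delay(1.0, 1.0, r_scale=0.8, c_scale=0.8)
    print("delay ratio (reduced / baseline):", reduced / baseline)
```

Because every term is a product of one resistance and one capacitance, scaling both by a common factor k scales the modelled delay by k squared, which is the qualitative point of the comparison between the two technologies.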
The circuit model of the Sum circuit in Fig. 6 is also used to estimate the power consumption of the circuit. The following equation summarizes the power consumption of the circuit for a single operation cycle,
(6)
Assuming that all the capacitors are charged up to V_{DD}, the maximum possible power consumption for a single cycle of the circuit can be written as,
(7)
According to what was stated above about the diffusion capacitances of a CNTFET, the power consumption of the circuit realized in the CNTFET process will be smaller than that in the CMOS process.
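The dependence of switching power on capacitance can be sketched with the standard C·V_{DD}^2 estimate. This is the generic worst-case expression for fully charging and discharging every node once per cycle, and it is offered as an assumption rather than as the exact form of equations (6) and (7); the function names and the capacitor values are placeholders.

```python
def switching_energy_per_cycle(capacitances, vdd):
    """Worst-case energy drawn from the supply in one cycle.

    Assumes every capacitor is charged to vdd and fully discharged once.
    """
    return sum(capacitances) * vdd ** 2


def average_switching_power(capacitances, vdd, frequency):
    """Average switching power when the cycle repeats at `frequency`."""
    return switching_energy_per_cycle(capacitances, vdd) * frequency


if __name__ == "__main__":
    # Placeholder node capacitances in farads (not extracted from Fig. 6).
    caps_large = [1e-15] * 6
    caps_small = [0.5e-15] * 6
    vdd, freq = 0.9, 200e6
    print(average_switching_power(caps_large, vdd, freq))
    print(average_switching_power(caps_small, vdd, freq))
```

Under this estimate the power is linear in the total capacitance, so halving the node capacitances halves the worst-case switching power at a fixed supply voltage and frequency.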
RESULTS AND DISCUSSIONS
The proposed 4-2 compressor was designed and simulated in CMOS and CNTFET 32nm processes to evaluate its power and delay performance. Fig. 7 illustrates the simulation setup, in which two compressor blocks are cascaded to construct the critical path. This path starts from the inputs, reaches the C_{out} output of the first compressor and ends at the Carry output of the second compressor. The simulations have been performed at an operating frequency of 200MHz with a capacitive load of 20fF, while the rise and fall times of the input signal are equal to 100ps.
Fig. 8 shows the results for a specific input state, which indicate the correct behavior of the compressor implemented in the standard CNTFET 32nm process and show a delay of about 116ps.
The noise margin of the proposed circuit was extracted from the voltage transfer characteristic of the circuit, at the points where the tangent to the curve has a slope of -1 [27]. The voltages V_{IL}, V_{IH}, V_{OL} and V_{OH} were extracted from the transfer characteristic; according to these voltages, V_{NML} and V_{NMH} are both equal to 0.28V.
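For reference, the low and high noise margins follow from the four extracted voltages through the standard definitions V_{NML} = V_{IL} - V_{OL} and V_{NMH} = V_{OH} - V_{IH}. The sketch below applies these definitions; the function name is illustrative and the input voltages are placeholders, not the values extracted in this work.

```python
def noise_margins(v_il, v_ih, v_ol, v_oh):
    """Return (V_NML, V_NMH) from the unity-gain points of the transfer curve."""
    nm_low = v_il - v_ol     # margin for a logic-low input
    nm_high = v_oh - v_ih    # margin for a logic-high input
    return nm_low, nm_high


if __name__ == "__main__":
    # Placeholder voltages in volts, chosen only to exercise the formula.
    nml, nmh = noise_margins(v_il=0.4, v_ih=0.6, v_ol=0.0, v_oh=1.0)
    print("V_NML =", nml, "V_NMH =", nmh)
```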
The proposed design was also simulated in the CMOS 32nm process to enable a fair comparison with the CNTFET realization. The result of this simulation is shown in Fig. 9, which illustrates a delay of about 134ps. This value is larger than the CNTFET result, which demonstrates the superiority of the CNTFET realization over its CMOS counterpart.
Moreover, to demonstrate the advantage of the proposed compressor architecture, some of the best recently reported works [20-23] were designed and simulated in the CNTFET 32nm process along with the proposed design to enable a fair comparison with this work. The comparison covers the delay and power consumption to give the reader a clear picture of the optimum design. As illustrated in Fig. 7, buffers are used at the inputs and outputs to provide balanced loading for all of the simulated designs. The results, summarized in Fig. 10 and Fig. 11 for delay and power respectively, illustrate the superiority of the 4-2 compressor circuit introduced in this paper.
In order to investigate the reliability of the proposed circuit under different conditions, the temperature of the circuit has been swept from -20°C to 120°C in CNTFET technology. Fig. 12 shows the simulation results over this temperature range. Fig. 12(a) illustrates the variation of the critical path delay and Fig. 12(b) depicts the power dissipation over this temperature range. The delay results indicate that the delay of the circuit is higher at lower temperatures, which could be attributed to a lower mobility coefficient.
Table 1 summarizes the results obtained for the delay and power consumption of the proposed circuit, which has been simulated in CMOS and CNTFET technologies. In this simulation the supply voltage has been swept from 0.6 to 0.9 V to investigate the delay and power consumption of the circuit at different supply levels. Table 2 compares different state-of-the-art designs with the proposed work. This comparison has been done in the CNTFET 32nm process and, as mentioned before, all other designs listed in Table 2 were redesigned and simulated with the same setup and input signal conditions.
CONCLUSIONS
In this article, a low-power 4-2 compressor has been introduced which outperforms previous works in terms of total transistor count. With the help of a modified truth table and by means of PTL, a novel structure has been proposed which has about 15% smaller delay compared to the best recently reported design in the literature.
In addition to the smaller delay, the proposed design has fewer transistors, which helps the circuit occupy a smaller area compared to other designs. The proposed circuit for the compressor block alleviates the carry rippling mechanism and its drawbacks between cascaded compressor blocks. The simulation results of the proposed design in CNTFET 32nm technology show that the proposed circuit has a delay of about 116ps, which is smaller than that of the best reported design, about 136ps. The proposed circuit consumes about 474nW, which is similar to the value reported in [21]; however, the work presented in [21] has a much higher delay (174ps) compared to the circuit reported here.
Besides the structural superiority of the proposed circuit over other designs, the proposed design was simulated in the CNTFET process to demonstrate the superior features of this advanced process. The proposed circuit was simulated and compared in both CMOS and CNTFET processes. The results indicate that the proposed design in the CNTFET process has about 18% smaller delay and 67% less power consumption than the design in the comparable CMOS process.
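The relative delay improvement quoted against prior designs can be reproduced from the 116ps, 136ps and 174ps delay values given above. The helper below is a minimal sketch with an illustrative function name; it simply computes the percentage reduction with respect to a reference value.

```python
def relative_reduction(reference, proposed):
    """Percentage by which `proposed` is smaller than `reference`."""
    return (reference - proposed) / reference * 100.0


if __name__ == "__main__":
    # Delay values in picoseconds as quoted in the text.
    print("vs. best reported design:", relative_reduction(136.0, 116.0))
    print("vs. design of [21]:", relative_reduction(174.0, 116.0))
```

With these inputs the first figure comes out slightly below 15%, in line with the "about 15%" stated for the best previously reported design.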
CONFLICT OF INTEREST
The authors declare that there is no conflict of interest regarding the publication of this article.