Group name: NAND

From UVA ECE & BME wiki


Group Members

Team NAND!
To protect the world from devastation ... To unite all people within our nations ... To denounce the evil of truth and love ... To extend our reach to the stars above ... Kyle Andrew Izak Jacob Kenny ... Team NAND blast off at the speed of light ... Surrender now or prepare to fight ... Meowth, that's right!
Kyle Powers
Dar-Eaum Andrew Nam
Izak Baranowski
Jacob Gravely
Min Je "Kenny" Chang


Figure 1 – The DSP system that we must design to compete for the PICo contract.


We are a design team for a startup company that specializes in digital IC design. Our team has learned of the opportunity to win a large contract from Portable Instruments Company (PICo) to develop an embedded Digital Signal Processor design in the FreePDK 45nm technology. In order to win the contract, our team must design and implement a proof of concept circuit block and demonstrate it using simulation. We will be competing with teams from other companies to win the contract. Professor Calhoun is the liaison to our group from PICo who will evaluate our design.

Design Description

Figure 1 to the right illustrates the basic structure of the DSP block that we must design.

Project Documents

File:FinalReport.pdf | File:PresentationSlides.pdf | File:Assignment.pdf

DSP Block Diagram

Figure 2 - DSP block diagram designed in Logisim


The DSP comprises two 16-bit inputs A<0:15> and B<0:15>, an Arithmetic Logic Unit (ALU), three rising-edge-triggered registers, and a 16-bit output Out<0:15>. The multiplexers are governed by a 3-bit select signal, and the registers are controlled by a 1-bit enable signal and a clock signal. The ALU provides the following functions:

ALU Function | Description
NOP          | Do nothing - no change at Out
ADD          | Out = A + B
SUB          | Out = A - B
SHIFT        | Out = A << B
AND          | Out = A & B
OR           | Out = A | B
Pass A       | Out = A
Arbitrary    | Out = <function of your choice>


The block diagram was created using the freely available application Logisim.
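Since the table above defines each function bit-precisely, the ALU can be sketched behaviorally. The following Python model is illustrative only: the function names, the prev_out argument for NOP, and the 16-bit truncation of the Arbitrary multiply are our own conventions, not part of the gate-level design.

```python
MASK = 0xFFFF  # 16-bit word

def alu(op, a, b, prev_out=0):
    """Behavioral sketch of the ALU functions from the table above."""
    a &= MASK
    b &= MASK
    if op == "NOP":
        return prev_out                        # do nothing: no change at Out
    if op == "ADD":
        return (a + b) & MASK                  # wraps on 16-bit overflow
    if op == "SUB":
        return (a - b) & MASK                  # two's-complement result
    if op == "SHIFT":
        return (a << ((b & 0b11) + 1)) & MASK  # bottom 2 bits of B map to a 1-4 bit shift
    if op == "AND":
        return a & b
    if op == "OR":
        return a | b
    if op == "PASS_A":
        return a
    if op == "ARBITRARY":
        return (a * b) & MASK                  # our chosen function: multiplication
    raise ValueError("unknown ALU function: " + op)
```

For example, alu("SUB", 3, 5) returns 0xFFFE, the two's-complement encoding of -2.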

Adder and Subtractor

Kogge-Stone Parallel Prefix Adder

Figure 3 - 8-bit Kogge-Stone PPA tree graph
Figure 4 - 16-bit Kogge-Stone tree graph
Figure 5 - 16-bit Brent-Kung tree graph
Figure 6 - 16-bit Han-Carlson tree graph

For our adder topology, we chose to implement a high-performance parallel prefix adder (PPA). In a PPA, a prefix operation permits the computation of intermediate carries. In conjunction with generate and propagate signals, the prefix operator allows PPAs to achieve an advantageous latency of O(log2 N), rather than the O(N) of a ripple-carry adder, where N is the word length. Figure 3 illustrates a tree graph of an 8-bit Kogge-Stone PPA.


The PPA computes a sum in three stages. The first is a pre-computation stage, which produces the generate and propagate signals and is governed by the equations:

g_i = G_(i:i) = a_i AND b_i,  p_i = P_(i:i) = a_i XOR b_i ---(1)

where a_i and b_i are input bits. Additionally, the carry-in occupies bit position 0:

g_0 = G_(0:0) = C_in,  p_0 = P_(0:0) = 0 ---(2)

The prefix operation serves as the middle stages in this high performance adder topology. A dot operator denotes the prefix operation:

(G_(j:m), P_(j:m)) = (G_(j:k), P_(j:k)) • (G_(l:m), P_(l:m)) ---(3)
G_(j:m) = G_(j:k) OR (P_(j:k) AND G_(l:m)),  P_(j:m) = P_(j:k) AND P_(l:m) ---(4)

where j ≥ k, l ≥ m, and the spans (j:k) and (l:m) are adjacent or overlapping. The dot operator is associative and idempotent, but not commutative.
The carry logic is also realized in the middle stages: by applying equations (3) and (4), the prefix tree determines each carry:

c_i = G_(i:0) ---(5)

The last stage in this high performance PPA is the sum computation. Since the carry-in occupies bit position 0, the carry into bit i is c_(i-1), and the sum is governed by the following equation:

s_i = p_i XOR c_(i-1) ---(6)
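The three stages can be checked end to end with a small behavioral model. This Python sketch follows equations (1)-(6), with the carry-in occupying slot 0 as in equation (2); the function name and slot-based indexing are our own conventions, not part of the transistor-level design.

```python
def kogge_stone_add(a, b, cin=0, n=16):
    """Model of the three PPA stages: pre-computation, prefix tree, sum."""
    ai = [(a >> i) & 1 for i in range(n)]
    bi = [(b >> i) & 1 for i in range(n)]
    # Stage 1 - pre-computation (eqs. 1 and 2): slot 0 holds the carry-in,
    # operand bits occupy slots 1..n.
    g = [cin] + [x & y for x, y in zip(ai, bi)]
    p = [0] + [x ^ y for x, y in zip(ai, bi)]
    # Stage 2 - prefix tree (eqs. 3 and 4): Kogge-Stone applies the dot
    # operator at distances 1, 2, 4, ... so slot i ends up holding G_(i:0).
    G, P = g[:], p[:]
    d = 1
    while d <= n:
        nG, nP = G[:], P[:]
        for i in range(d, n + 1):
            nG[i] = G[i] | (P[i] & G[i - d])
            nP[i] = P[i] & P[i - d]
        G, P = nG, nP
        d *= 2
    # Stage 3 - sum (eqs. 5 and 6): the carry into bit i is c_(i-1) = G_(i-1:0).
    s = 0
    for i in range(1, n + 1):
        s |= (p[i] ^ G[i - 1]) << (i - 1)
    return s, G[n]  # (sum, carry-out)
```

For example, kogge_stone_add(0xFFFF, 0x0001) returns (0x0000, 1): the sum wraps and the carry-out is raised.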
Design choice

The attributes associated with a Kogge-Stone adder (Figure 4) are low logic depth, high node count, and minimal fanout. While a high node count implies a larger area, the low logic depth and minimal fanout allow for faster performance. If PICo were to prefer reduced area, a Brent-Kung PPA (Figure 5) could be constructed instead; with its minimum node count and maximum logic depth, it trades speed for reduced area. If a compromise between speed and area were sought, a Han-Carlson PPA (a hybrid of Kogge-Stone and Brent-Kung) could be implemented instead (Figure 6).



For our subtractor topology, we reused the Kogge-Stone parallel prefix adder (PPA). Because the PPA is a high-performance adder, building the subtractor around it makes subtraction equally fast. The goal for the subtractor was to implement the following equation:

out<0:15> = a<0:15> - b<0:15> ---(1)

An easy approach is to note that in two's complement, subtraction is achieved by inverting the subtrahend and adding one:

out = a + b’ + 1 ---(2)

where b’ implies its negation. Thusly, with this equation, we permit ourselves to use our high performance PPA. The following diagram (Figure 7) illustrates our topology setup:

16-bit Subtractor
Figure 7 - 16-bit Subtractor using a 16-bit inverter and a 16-bit adder
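Equation (2) is easy to sanity-check in software. The sketch below models the inverter-plus-adder arrangement of Figure 7, using ordinary integer addition in place of the PPA; the function name is our own.

```python
def subtract16(a, b):
    """Compute a - b as a + b' + 1 on 16-bit words (eq. 2)."""
    mask = 0xFFFF
    b_inv = ~b & mask              # the 16-bit inverter: one's complement of b
    return (a + b_inv + 1) & mask  # the adder; the +1 can enter via its carry-in
```

Negative results wrap to their two's-complement encoding, e.g. 3 - 5 yields 0xFFFE.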

Other ALU Operations

SHIFT Function

The SHIFT function is a logical shift of A by the number of bit locations specified in the bottom 2 bits of B with the mapping (00 -> 1b shift, 01 -> 2b shift, 10 -> 3b shift, 11 -> 4b shift).

More information regarding logical shifting can be found on Wikipedia.

Figure 8 - 1-bit AND gate in CMOS transistor logic

AND Operation


The AND gate performs the bitwise logical AND of A<0:15> and B<0:15>. The AND truth table follows:

Truth Table
A B | Out
0 0 | 0
0 1 | 0
1 0 | 0
1 1 | 1
CMOS Implementation

Figure 8 to the right illustrates how an AND gate is implemented in CMOS transistor logic. The same topology is used in our DSP, where it is extended fifteen additional times to construct a 16-bit AND gate.

Figure 9 - 1-bit OR gate in CMOS transistor logic

OR Operation


The OR gate performs the bitwise logical OR of A<0:15> and B<0:15>. The OR truth table follows:

Truth Table
A B | Out
0 0 | 0
0 1 | 1
1 0 | 1
1 1 | 1
CMOS Implementation

Figure 9 to the right illustrates how an OR gate is implemented in CMOS transistor logic. The same topology is used in our DSP, where it is extended fifteen additional times to construct a 16-bit OR gate.

Multiplexers and Registers


Figure 10 - An 8x1 Multiplexer using pass gate transistor logic

The purpose of a multiplexer is to take in some number of inputs and, using a corresponding number of select lines, choose which input is sent to the output. In this case, we required an 8x1 multiplexer, which needs three select lines because 2^3 = 8. More information on MUXes can be found on Wikipedia. Our design methodology follows:

Design Decision

For our MUX, we chose to implement pass-gate logic, as opposed to the standard MUX topology that requires eight 4-input AND gates and an 8-input OR gate. The pass-gate topology offers many improvements, most notably a greatly reduced area: it uses only twenty-four transistors (including the inverters that invert the select-line signals and buffer the output) versus the ninety-eight required by the gate-based topology. Another benefit is that the reduced fanout on the select lines results in less power consumption. There are tradeoffs, however: to compensate for the weak drive of the pass-gate topology, a buffer built from two inverters prevents the signal from weakening significantly. Figure 10 illustrates our MUX.
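Behaviorally, the pass-gate MUX decodes the 3-bit select into eight one-hot enables, and only the enabled transmission gate conducts. A Python model of that behavior follows (the names are ours, and the model does not capture transistor-level electrical details):

```python
def mux8(inputs, sel):
    """8x1 MUX: a 3-bit select (0-7) routes one of eight inputs to the output."""
    assert len(inputs) == 8 and 0 <= sel <= 7
    # Decode the select lines into one-hot enables, as the inverter pairs
    # driving the pass gates do; exactly one path conducts at a time.
    enables = [1 if i == sel else 0 for i in range(8)]
    # The conducting path passes its input; all other paths are high-impedance.
    return sum(e * x for e, x in zip(enables, inputs))
```

For example, mux8 with sel = 5 returns the sixth input.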


A register is used in conjunction with a clock signal to hold/write values. We used the following register topology:

Figure 11 - A 1-bit Master-Slave Rising-Edge Triggered Register
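The master-slave behavior of Figure 11 can be modeled in a few lines: the master latch is transparent while the clock is low, and the slave copies the captured value on the rising edge. The class name and where the enable gates data capture are our assumptions, not taken from the schematic.

```python
class RisingEdgeRegister:
    """Behavioral model of a 1-bit master-slave rising-edge-triggered register."""

    def __init__(self):
        self.master = 0    # value held by the master latch
        self.q = 0         # slave output
        self.prev_clk = 0

    def tick(self, clk, d, en=1):
        if clk == 0:
            if en:                    # enable gates whether new data is captured
                self.master = d       # master is transparent while clk is low
        elif self.prev_clk == 0:      # rising edge: slave copies the master
            self.q = self.master
        self.prev_clk = clk
        return self.q
```

Driving d = 5 with the clock low and then raising the clock produces q = 5; lowering en holds the stored value through subsequent edges.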

Arbitrary Function

8x8 Multiplier with Dadda Reduction Scheme

Dadda Multiplier


The following is an excerpt from the Wikipedia page on Dadda Multipliers:
The Dadda multiplier is a hardware multiplier design invented by computer scientist Luigi Dadda in 1965. It is similar to the Wallace multiplier, but it is slightly faster (for all operand sizes) and requires fewer gates (for all but the smallest operand sizes). In fact, Dadda and Wallace multipliers have the same 3 steps:

1. Multiply (that is, AND) each bit of one of the arguments by each bit of the other, yielding N^2 results. Depending on the position of the multiplied bits, the wires carry different weights; for example, the wire carrying the result of a_2 AND b_3 has weight 32.
2. Reduce the number of partial products to two layers of full and half adders.
3. Group the wires in two numbers, and add them with a conventional adder.

However, unlike Wallace multipliers, which reduce as much as possible on each layer, Dadda multipliers do as few reductions as possible. Because of this, Dadda multipliers have a less expensive reduction phase, but the intermediate numbers may be a few bits longer, thus requiring slightly bigger adders.
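The three steps can be demonstrated with a column-based model. Note that the reduction below is a simplified greedy scheme, not Dadda's exact stage heights (d = 2, 3, 4, 6, 9, ...), so it illustrates the structure rather than our gate-level design; all names are our own.

```python
def column_multiply(a, b, n=8):
    """Multiply two n-bit operands via the 3 multiplier steps."""
    a &= (1 << n) - 1
    b &= (1 << n) - 1
    # Step 1: AND every bit pair; the product bit of a_i AND b_j lands in the
    # column of weight 2^(i+j).
    cols = [[] for _ in range(2 * n + 1)]
    for i in range(n):
        for j in range(n):
            cols[i + j].append(((a >> i) & 1) & ((b >> j) & 1))
    # Step 2: reduce each column with full adders until no column holds more
    # than two bits (each carry moves one column to the left).
    for w in range(2 * n):
        while len(cols[w]) > 2:
            x, y, z = cols[w].pop(), cols[w].pop(), cols[w].pop()
            cols[w].append(x ^ y ^ z)                        # full-adder sum
            cols[w + 1].append((x & y) | (y & z) | (x & z))  # full-adder carry
    # Step 3: the surviving bits form two rows; add them with a conventional
    # adder (plain + here).
    row0 = sum((col[0] if len(col) > 0 else 0) << w for w, col in enumerate(cols))
    row1 = sum((col[1] if len(col) > 1 else 0) << w for w, col in enumerate(cols))
    return row0 + row1
```

Each full adder preserves the column-weighted value, so the two final rows always sum to the true product.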

Design Decision

We chose to implement this multiplier topology because its reduction scheme reaches O(log N) time, whereas a regular array multiplier requires O(N) time. We learned about array multipliers and the Wallace tree multiplier in class, but thought it would be a neat and rewarding experience to learn about an additional multiplier topology.



Reduced power consumption, area, and delay could potentially be achieved by encoding the inputs before they reach the ALU. Doing so would lead to smaller logic gates and arithmetic functions, which in turn would consume less power, occupy less area, and reduce delay. After the ALU has operated on the encoded inputs, the result could then be decoded, yielding a fairly advantageous DSP.


Using a multiplexer to gate VDD to each ALU function could save power: in a traditional ALU, all of the functions are always on and continuously performing their operations. If only one function were powered at a time, power consumption could be greatly reduced.

Fault-tolerant Adder

Read this paper to see how on-going research at UVA could allow future parallel prefix adders to accomplish robust multi-fault-tolerance and reconfigurability.

  • Professor Stan, Stevo Bailey, and Kyle Powers

Advice to Future Classes

