# Group name: NAND

## Contents |

## Group Members

##### Kyle Powers

- Email: kep4md@virginia.edu

##### Dar-Eaum Andrew Nam

- Email: dan6ra@virginia.edu

##### Izak Baranowski

- Email: jb4er@virginia.edu

##### Jacob Gravely

- Email: jdg5ra@virginia.edu

##### Min Je "Kenny" Chang

- Email: mc3jx@virginia.edu

## Introduction

### Motivation

We are a design team for a startup company that specializes in digital IC design. Our team has learned of the opportunity to win a large contract from Portable Instruments Company (PICo) to develop an embedded Digital Signal Processor design in the FreePDK 45nm technology. In order to win the contract, our team must design and implement a proof of concept circuit block and demonstrate it using simulation. We will be competing with teams from other companies to win the contract. Professor Calhoun is the liaison to our group from PICo who will evaluate our design.

### Design Description

Figure 1 to the right illustrates the basic structure of the DSP block that we must design:

### Project Documents

File:FinalReport.pdf | File:PresentationSlides.pdf | File:Assignment.pdf

## DSP Block Diagram

### Composition

The DSP is composed of two 16-bit inputs A<0:15> and B<0:15>, an Arithmetic Logic Unit (ALU), three rising edge-triggered registers, and a 16-bit output Out<0:15>. The multiplexers are governed by a 3-bit select signal and the registers are controlled by a 1-bit enable signal and a clock signal. The ALU contains the following functions:

￼ALU Function | Description |

NOP | Do nothing – No change at Out |

ADD | Out = A + B |

SUB | Out = A - B |

SHIFT | Out = A << B |

AND | Out = A & B |

OR | Out = A | B |

Pass A | Out = A |

Arbitrary | Out = <function of your choice> |

### Creation

The block diagram was created using the freely available application Logisim

## Adder and Subtractor

### Kogge-Stone Parallel Prefix Adder

##### Description

For our adder topology, we chose to implement a high performance parallel prefix adder (PPA). In a PPA, a prefix operation is constructed that permits the computation of intermediate carries. In conjunction with generate and propagate signals, the prefix operator allows PPAs to obtain an advantageous latency of *O(log2N)* instead of *O(N)* (like in a Ripple Carry adder), where N is the word length. Figure 3 illustrates a tree graph of an 8-bit Kogge-Stone PPA.

##### Functionality

The PPA functions properly by successfully going through three stages. The first stage is a pre-computation stage, which generates the generate and propagate signals, and is governed by the equations:

*g*_{i}= G_{i:i}= a_{i}AND b_{i}----- p_{i}= P_{i:i}= a_{i}XOR b_{i}---**(1)**

where a_{i} and b_{i} are inputs bits. Additionally,

*g*_{0}= G_{0:0}= C_{in}----- p_{0}= P_{0:0}= 0 ---**(2)**

The prefix operation serves as the middle stages in this high performance adder topology. A dot operator denotes the prefix operation:

*(G*_{j:mv}P_{j:mv}) = (G_{j:k}P_{j:k}) • (G_{l:mv}P_{l:m}) ---**(3)***G*_{j:m}= G_{j:k}OR (P_{j:k}AND G_{l:m}) ----- P_{j:m}= P_{j:k}AND P_{l:m}---**(4)**

where j ≤ (k, l) ≤ m. The dot operator is associative and idempotent, but not commutative.

The Carry logic is also utilized in the middle stages. By utilizing equations (3) and (4), the prefix tree can determine the carry:

*c*_{i}=G_{i:0}---**(5)**

The last stage in this high performance PPA is the sum computation. The sum is governed by the following equation:

*s*￼￼_{i}= c_{i}XOR p_{i}---**(6)**

##### Design choice

The attributes associated with a Kogge-Stone (Figure 4) are low logic depth, high node count, and minimal fanout. While a high node count implies a larger area, the low logic depth and minimal fanout allow for faster performance. If PICo were to prefer metrics more along the lines of reduced area, a Brent-‐Kung PPA (Figure 5) could optionally be constructed, which with its minimum number of nodes and maximum logic depth, trades speed for reduced area. If a compromise between speed and area was sought after, a Han-Carlson (a hybrid Kogge‐Stone/Brent-Kung PPA) could be implemented instead (Figure 6).

### Subtractor

##### Description

For our subtractor topology, we reused the Kogge-Stone parallel prefix adder (PPA). What this implies is that it is a high performance adder, which by utilizing it in a subtractor, allows for very speedy subtractions. The goal for the subtractor was to implement the following equation:

*out<0:15> = a<0:15> ‐ b<0:15> ---***(1)**

An easy approach to accomplishing this is realizing that in two’s complement, subtraction is achieved via inverting the subtrahend and adding one, like so:

*out = a + b’ + 1 ---***(2)**

where b’ implies its negation. Thusly, with this equation, we permit ourselves to use our high performance PPA. The following diagram (Figure 7) illustrates our topology setup:

## Other ALU Operations

### SHIFT Function

The SHIFT function is a logical shift of A by the number of bit locations specified in the bottom 2 bits of B with the mapping (00 -> 1b shift, 01 -> 2b shift, 10 -> 3b shift, 11 -> 4b shift).

More information regarding logical shifting can be found on Wikipedia.

### AND Operation

##### Description

The functionality intended by the AND gate is to perform the logical AND operation on A<0:15> and B<0:15>. Follows is the AND truth table:

##### Truth Table

A | B | AND |

0 | 0 | 0 |

0 | 1 | 0 |

1 | 0 | 0 |

1 | 1 | 1 |

##### CMOS Implementation

Figure 8 to the right illustrates how an AND gate is implemented in CMOS transistor logic. The same topology is used in our DSP, where it is extended fifteen additional times to construct a 16-bit AND gate.

### OR Operation

##### Description

The functionality intended by the OR gate is to perform the logical OR operation on A<0:15> and B<0:15>. Follows is the OR truth table:

##### Truth Table

A | B | OR |

0 | 0 | 0 |

0 | 1 | 1 |

1 | 0 | 1 |

1 | 1 | 1 |

##### CMOS Implementation

Figure 9 to the right illustrates how an OR gate is implemented in CMOS transistor logic. The same topology is used in our DSP, where it is extended fifteen additional times to construct a 16-bit OR gate.

## Multiplexers and Registers

### Multiplexers

##### Functionality

The purpose of a multiplexer is to take in a certain amount of inputs, and by utilizing some corresponding number of "select" lines, be able to select which input into the MUX gets sent to the output. In this case, we required the use of an 8x1 multiplexor. To do this, we needed three select lines, because 2^{3} = 8. More information on MUXes can be found here. Our design methodology follows:

##### Design Decision

For our MUX, we chose to implement pass gate logic, as opposed to the standard MUX topology that requires eight 4-input AND gates and an 8-input OR gate. The pass gate topology offers many improvements, most notably a greatly reduced cost in area. The passgate topology uses only twenty-four transistors (including the inverters used to invert the select line signals and to buffer the output) versus the ninety-eight transistors required in the gate based topology. Another benefit is the reduced fanout to the select line inputs results in less power consumption. There are tradeoffs however, in order to compensate for the faults of the pass gate topology, a buffer was created with two inverters that prevents the signal from weakening significantly. Figure 10 illustrates our MUX.

### Registers

A register is used in conjunction with a clock signal to hold/write values. We used the following register topology:

## Arbitrary Function

### Dadda Multiplier

##### Description

The following is an excerpt from the Wikipedia page on Dadda Multipliers:

The Dadda multiplier is a hardware multiplier design invented by computer scientist Luigi Dadda in 1965. It is similar to the Wallace multiplier, but it is slightly faster (for all operand sizes) and requires fewer gates (for all but the smallest operand sizes). In fact, Dadda and Wallace multipliers have the same 3 steps:

- 1. Multiply (that is - AND) each bit of one of the arguments, by each bit of the other, yielding results. Depending on position of the multiplied bits, the wires carry different weights, for example wire of bit carrying result of is 32.

- 2. Reduce the number of partial products to two layers of full and half adders.

- 3. Group the wires in two numbers, and add them with a conventional adder.

However, unlike Wallace multipliers that reduce as much as possible on each layer, Dadda multipliers do as few reductions as possible. Because of this, Dadda multipliers have less expensive reduction phase, but the numbers may be a few bits longer, thus requiring slightly bigger adders.

##### Design Decision

We chose to implement this multiplier topology because its reduction scheme allows it to reach *O(logN)* time, where as with a regular array multiplier, it would require *O(log ^{2}N)* time. We learned about array multipliers and the wallace tree multiplier in class, but thought it would be a neat and rewarding experience to learn about an additional multiplier topology.

## Extensions

##### Encoding/Decoding

Reduced power consumption, area, and delay could potentially be achieved by encoding the inputs prior to the ALU. Doing this would lead to smaller logic gates and arithmetic functions, which in turn would consume less power, comprise less area, and reduce delay. After the inputs have been performed on by the ALU, they could then be decoded, thusly achieving a fairly advantageous DSP.

##### MUXing VDD

Utilizing a multiplexor to MUX VDD to each ALU function could save power, as in a traditional ALU, all the functions are still "on" and consistently performing its operation. If one were to only turn on one function at a time, power consumption could be greatly reduced.

##### Fault-tolerant Adder

Click this link and read this paper to see how on-going research at UVA could allow future parallel prefix adders to accomplish robust multi-fault-tolerance and reconfigurability.

- Professor Stan, Stevo Bailey, and Kyle Powers

## Advice to Future Classes

- Start early and ask questions, go to office hours
- Learn Ocean for testing and if you can, SKILL for making topologies
- Pick a theme from the beginning and stick with it (i.e. really fast, or low power, or least area)
- Pay attention to Professor Calhoun's design review critiques/remarks, fix them before the next design review

## Timeline

- Issued: 02/02/12

- Form Group: 02/14/12

- Design Review 1 Due: 03/13/12

- Design Review 2 Due: 04/10/12

- Final Report Due: 04/24/12

- Presentation Due: 04/24/12