# CHAPTER 29: ARITHMETIC CIRCUITS

This chapter introduces circuits that perform arithmetic functions. With serial processes, Θ(n) is the best asymptotic time bound we can hope to achieve for adding two n-digit numbers. With circuits that operate in parallel, however, we can do better. In this chapter, we shall design circuits that can quickly perform addition and multiplication. (Subtraction is essentially the same as addition, and division is deferred to Problem 29-1.) We shall assume that all inputs are n-bit natural numbers, expressed in binary.

We start in Section 29.1 by presenting combinational circuits. We shall see how the depth of a circuit corresponds to its "running time." The full adder, which is a building block of most of the circuits in this chapter, serves as our first example of a combinational circuit. Section 29.2 presents two combinational circuits for addition: the ripple-carry adder, which works in Θ(n) time, and the carry-lookahead adder, which takes only O(lg n) time. It also presents the carry-save adder, which can reduce the problem of summing three numbers to the problem of summing two numbers in Θ(1) time. Section 29.3 introduces two combinational multipliers: the array multiplier, which takes Θ(n) time, and the Wallace-tree multiplier, which requires only Θ(lg n) time. Finally, Section 29.4 presents circuits with clocked storage elements (registers) and shows how hardware can be saved by reusing combinational circuitry.

# 29.1 Combinational circuits


```
x  y  z  c  s
-------------
0  0  0  0  0
0  0  1  0  1
0  1  0  0  1
0  1  1  1  0
1  0  0  0  1
1  0  1  1  0
1  1  0  1  0
1  1  1  1  1
```

`s = parity(x, y, z) = x ⊕ y ⊕ z,`

#### (29.1)

`c = majority(x, y, z) = (x ∧ y) ∨ (y ∧ z) ∨ (x ∧ z).`

#### (29.2)

(In general, the parity and majority functions can take any number of input bits. The parity is 1 if and only if an odd number of the inputs are 1's. The majority is 1 if and only if more than half the inputs are 1's.) Note that the c and s bits, taken together, give the sum of x, y, and z. For example, if x = 1, y = 0, and z = 1, then ⟨cs⟩ = ⟨10⟩,1 which is the binary representation of 2, the sum of x, y, and z.

1For clarity, we omit the commas between sequence elements when they are bits.
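Equations (29.1) and (29.2) translate directly into code. The following Python sketch (a hypothetical model with names of our own choosing, not a gate-level description) computes both output bits of a full adder:

```python
def full_adder(x, y, z):
    """Model of a full adder: return (c, s) for input bits x, y, z.

    s is the parity of the inputs (equation 29.1), and c is their
    majority (equation 29.2); together, cs is the 2-bit sum x + y + z.
    """
    s = x ^ y ^ z                        # parity(x, y, z)
    c = (x & y) | (y & z) | (x & z)      # majority(x, y, z)
    return c, s
```

For x = 1, y = 0, and z = 1, the call returns (1, 0), matching the corresponding row of the truth table above.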

To examine how the full adder operates, assume that each gate operates in unit time. Figure 29.2(a) shows a set of inputs that becomes stable at time 0. Gates A-D, and no other gates, have all their input values stable at that time and therefore produce the values shown in Figure 29.2(b) at time 1. Note that gates A-D operate in parallel. Gates E and F, but not gate G, have stable inputs at time 1 and produce the values shown in Figure 29.2(c) at time 2. The output of gate E is bit s, and so the s output from the full adder is ready at time 2. The c output is not yet ready, however. Gate G finally has stable inputs at time 2, and it produces the c output shown in Figure 29.2(d) at time 3.

## Circuit depth

If each combinational element takes constant time to compute its output values, then the worst-case propagation delay through a combinational circuit is proportional to its depth. Figure 29.2 shows the depth of each gate in the full adder. Since the gate with the largest depth is gate G, the full adder itself has depth 3, which is proportional to the worst-case time it takes for the circuit to perform its function.

A combinational circuit can sometimes compute faster than its depth. Suppose that a large subcircuit feeds into one input of a 2-input AND gate but that the other input of the AND gate has value 0. The output of the gate will then be 0, independent of the input from the large subcircuit. In general, however, we cannot count on specific inputs being applied to the circuit, and the abstraction of depth as the "running time" of the circuit is therefore quite reasonable.

## Circuit size

This definition of circuit size is not particularly useful for small circuits. After all, since a full adder has a constant number of inputs and outputs and computes a well-defined function, it satisfies the definition of a combinational element. A full adder built from a single full-adder combinational element therefore has size 1. In fact, according to this definition, any combinational element has size 1.

The definition of circuit size is intended to apply to families of circuits that compute similar functions. For example, we shall soon see an addition circuit that takes two n-bit inputs. We are really not talking about a single circuit here, but rather a family of circuits--one for each size of input. In this context, the definition of circuit size makes good sense. It allows us to define convenient circuit elements without affecting the size of any implementation of the circuit by more than a constant factor. Of course, in practice, measurements of size are much more complicated, involving not only the choice of combinational elements, but also concerns such as the area the circuit requires when integrated on a silicon chip.

## Exercises

In Figure 29.2, change input y to a 1. Show the resulting value carried on each wire.

Show how to construct an n-input parity circuit with n - 1 XOR gates and depth ⌈lg n⌉.

Show that any boolean combinational element can be constructed from a constant number of AND, OR, and NOT gates. (Hint: Implement the truth table for the element.)

Show that any boolean function can be constructed entirely out of NAND gates.

Construct a combinational circuit that performs the exclusive-or function using only four 2-input NAND gates.

Let C be an n-input, n-output combinational circuit of depth d. If two copies of C are connected, with the outputs of one feeding directly into the inputs of the other, what is the maximum possible depth of this tandem circuit? What is the minimum possible depth?

```
8  7  6  5  4  3  2  1  0      i
1  1  0  1  1  1  0  0  0  =  c
   0  1  0  1  1  1  1  0  =  a
   1  1  0  1  0  1  0  1  =  b
-------------------------------
1  0  0  1  1  0  0  1  1  =  s
```

#### Figure 29.3 Adding two 8-bit numbers a = 01011110 and b = 11010101 to produce a 9-bit sum s = 100110011. Each bit ci is a carry bit. Each column of bits represents, from top to bottom, ci, ai, bi, and si for some i. Carry-in c0 is always 0.

Observe that each sum bit si is the parity of bits ai, bi, and ci (see equation (29.1)). Moreover, the carry-out bit ci+1 is the majority of ai, bi, and ci (see equation (29.2)). Thus, each stage of the addition can be performed by a full adder.

Because the carry bits ripple through all n full adders, the time required by an n-bit ripple-carry adder is Θ(n). More precisely, full adder FAi is at depth i + 1 in the circuit. Because FAn-1 is at the largest depth of any full adder in the circuit, the depth of the ripple-carry adder is n. The size of the circuit is Θ(n) because it contains n combinational elements.

#### Figure 29.4 An 8-bit ripple-carry adder performing the addition of Figure 29.3. Carry bit c0 is hardwired to 0, indicated by the diamond, and carry bits ripple from right to left.
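The chain of full adders can be modeled with a simple loop in which one variable plays the rippling carry wire. This is an illustrative Python sketch (names are our own), with bit lists indexed so that position 0 is the least significant bit, as in Figure 29.3:

```python
def ripple_carry_add(a, b):
    """Model of an n-bit ripple-carry adder on two equal-length bit
    lists (index 0 = least significant); returns the (n + 1)-bit sum."""
    n = len(a)
    s = []
    c = 0                                  # carry-in c0 hardwired to 0
    for i in range(n):                     # full adder FAi
        s.append(a[i] ^ b[i] ^ c)          # si = parity(ai, bi, ci)
        c = (a[i] & b[i]) | (b[i] & c) | (a[i] & c)   # ci+1 = majority
    s.append(c)                            # sn = cn
    return s
```

Running it on a = 01011110 and b = 11010101 (least significant bit first) reproduces the sum s = 100110011 of Figure 29.3. The loop makes the Θ(n) serial dependence explicit: iteration i cannot begin until ci is known.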

The key observation is that in ripple-carry addition, for i ≥ 1, full adder FAi has two of its input values, namely ai and bi, ready long before the carry-in ci is ready. The idea behind the carry-lookahead adder is to exploit this partial information.

```
ai-1  bi-1    ci    carry status
--------------------------------
  0     0      0         k
  0     1     ci-1       p
  1     0     ci-1       p
  1     1      1         g
```

#### Figure 29.5 The carry-out bit ci and carry status corresponding to inputs ai-1, bi-1, and ci-1 of full adder FAi-1 in ripple-carry addition.

```
               FAi
           k   p   g
       -------------
       k   k   k   g
FAi-1  p   k   p   g
       g   k   g   g
```

#### Figure 29.6 The carry status of the combination of full adders FAi-1 and FAi in terms of their individual carry statuses, given by the carry-status operator ⊗ over the domain {k, p, g}.

We can use the carry-status operator ⊗ to express each carry bit ci in terms of the inputs. We start by defining x0 = k and

```
     { k   if ai-1 = bi-1 = 0 ,
xi = { p   if ai-1 ≠ bi-1 ,
     { g   if ai-1 = bi-1 = 1
```

#### (29.3)

for i = 1, 2, . . . , n. Thus, for i = 1, 2, . . . , n, the value of xi is the carry status given by Figure 29.5.

We also define y0 = x0 = k and

```
yi = yi-1 ⊗ xi
   = x0 ⊗ x1 ⊗ ⋯ ⊗ xi
```

#### (29.4)

for i = 1, 2, . . . , n. We can think of yi as a "prefix" of x0 ⊗ x1 ⊗ ⋯ ⊗ xn; we call the process of computing the values y0, y1, . . . , yn a prefix computation. (Chapter 30 discusses prefix computations in a more general parallel context.) Figure 29.7 shows the values of xi and yi corresponding to the binary addition shown in Figure 29.3. The following lemma gives the significance of the yi values for carry-lookahead addition.

#### Figure 29.7 The values of xi and yi for i = 0, 1, . . . , 8 that correspond to the values of ai, bi, and ci in the binary-addition problem of Figure 29.3. Each value of xi is shaded with the values of ai-1 and bi-1 that it depends on.

Lemma 29.1

Define x0, x1, . . . , xn and y0, y1, . . . , yn by equations (29.3) and (29.4). For i = 0, 1, . . . , n, the following conditions hold:

1. yi = k implies ci = 0,

2. yi = g implies ci = 1, and

3. yi = p does not occur.

Proof The proof is by induction on i. For the basis, i = 0. We have y0 = x0 = k by definition, and also c0 = 0. For the inductive step, assume that the lemma holds for i - 1. There are three cases depending on the value of yi.

1. If yi = k, then since yi = yi-1 ⊗ xi, the definition of the carry-status operator from Figure 29.6 implies either that xi = k or that xi = p and yi-1 = k. If xi = k, then equation (29.3) implies that ai-1 = bi-1 = 0, and thus ci = majority(ai-1, bi-1, ci-1) = 0. If xi = p and yi-1 = k, then ai-1 ≠ bi-1 and, by induction, ci-1 = 0. Thus, majority(ai-1, bi-1, ci-1) = 0, and thus ci = 0.

2. If yi = g, then either we have xi = g or we have xi = p and yi-1 = g. If xi = g, then ai-1 = bi-1 = 1, which implies ci = 1. If xi = p and yi-1 = g, then ai-1 ≠ bi-1 and, by induction, ci-1 = 1, which implies ci = 1.

3. If yi = p, then Figure 29.6 implies that yi-1 = p, which contradicts the inductive hypothesis.

Lemma 29.1 implies that we can compute each carry bit ci by computing each carry status yi. Once we have all the carry bits, we can compute the entire sum in Θ(1) time by computing in parallel the sum bits si = parity(ai, bi, ci) for i = 0, 1, . . . , n (taking an = bn = 0). Thus, the problem of quickly adding two numbers reduces to the prefix computation of the carry statuses y0, y1, . . . , yn.
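Lemma 29.1 yields a complete addition procedure: compute the statuses xi, fold them into the prefixes yi, and read each carry off its status. The Python sketch below (hypothetical names; statuses represented as the characters 'k', 'p', 'g') does the prefix computation serially for clarity, whereas the circuit of the next subsection does the same work in O(lg n) depth:

```python
def op(x1, x2):
    """Carry-status operator of Figure 29.6: the right operand decides
    unless it is 'p' (propagate), in which case the left one does."""
    return x1 if x2 == 'p' else x2

def carry_lookahead_add(a, b):
    """Add bit lists a and b (index 0 = LSB) via carry statuses."""
    n = len(a)
    # x0 = k; xi is k, p, or g according to ai-1 and bi-1 (equation 29.3)
    x = ['k'] + [('k', 'p', 'g')[a[i] + b[i]] for i in range(n)]
    # Prefix computation (29.4): yi = x0 (x) x1 (x) ... (x) xi
    y = []
    acc = 'k'
    for xi in x:
        acc = op(acc, xi)
        y.append(acc)
    # Lemma 29.1: yi is never 'p'; ci = 1 if and only if yi = 'g'
    c = [1 if yi == 'g' else 0 for yi in y]
    return [a[i] ^ b[i] ^ c[i] for i in range(n)] + [c[n]]   # sn = cn
```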

### Computing carry statuses with a parallel prefix circuit

Before constructing the parallel prefix circuit, we introduce a notation that will aid our understanding of how the circuit operates. For integers i and j in the range 0 ≤ i ≤ j ≤ n, we define

`[i, j] = xi ⊗ xi+1 ⊗ ⋯ ⊗ xj.`

Thus, for i = 0, 1, . . . , n, we have [i, i] = xi, since the composition of just one carry status xi is itself. For i, j, and k satisfying 0 ≤ i < j ≤ k ≤ n, we also have the identity

`[i, k] = [i, j - 1] ⊗ [j, k],`

#### (29.5)

since the carry-status operator is associative. The goal of a prefix computation, in terms of this notation, is to compute yi = [0, i] for i= 0, 1, . . . , n.

The only combinational element used in the parallel prefix circuit is a circuit that computes the ⊗ operator. Figure 29.8 shows how pairs of ⊗ elements are organized to form the internal nodes of a complete binary tree, and Figure 29.9 illustrates the parallel prefix circuit for n = 8. Note that the wires in the circuit follow the structure of a tree, but the circuit itself is not a tree, although it is purely combinational. The inputs x1, x2, . . . , xn are supplied at the leaves, and the input x0 is provided at the root. The outputs y0, y1, . . . , yn-1 are produced at leaves, and the output yn is produced at the root. (For ease in understanding the prefix computation, variable indices increase from left to right in Figures 29.8 and 29.9, rather than from right to left as in other figures of this section.)

The two ⊗ elements in each node typically operate at different times and have different depths in the circuit. As shown in Figure 29.8, if the subtree rooted at a given node spans some range xi, xi+1, . . . , xk of inputs, its left subtree spans the range xi, xi+1, . . . , xj-1, and its right subtree spans the range xj, xj+1, . . . , xk, then the node must produce for its parent the product [i, k] of all inputs spanned by its subtree. Since we can assume inductively that the node's left and right children produce the products [i, j - 1] and [j, k], the node simply uses one of its two elements to compute [i, k] ← [i, j - 1] ⊗ [j, k].

Some time after this upward phase of computation, the node receives from its parent the product [0, i - 1] of all inputs that come before the leftmost input xi that it spans. The node now likewise computes values for its children. The leftmost input spanned by the node's left child is also xi, and so it passes the value [0, i - 1] to the left child unchanged. The leftmost input spanned by its right child is xj, and so it must produce [0, j - 1]. Since the node receives the value [0, i - 1] from its parent and the value [i, j - 1] from its left child, it simply computes [0, j - 1] ← [0, i - 1] ⊗ [i, j - 1] and sends this value to the right child.

#### Figure 29.8 The organization of a parallel prefix circuit. The node shown is the root of a subtree whose leaves input the values xi to xk. The node's left subtree spans inputs xi to xj-1, and its right subtree spans inputs xj to xk. The node consists of two ⊗ elements, which operate at different times during the operation of the circuit. One element computes [i, k] ← [i, j - 1] ⊗ [j, k], and the other element computes [0, j - 1] ← [0, i - 1] ⊗ [i, j - 1]. The values computed are shown on the wires.

Figure 29.9 shows the resulting circuit, including the boundary case that arises at the root. The value x0 = [0, 0] is provided as input at the root, and one more element is used to compute (in general) the value yn = [0, n] = [0, 0] ⊗ [1, n].

If n is an exact power of 2, then the parallel prefix circuit uses 2n - 1 elements. It takes only O(lg n) time to compute all n + 1 prefixes, since the computation proceeds up the tree and then back down. Exercise 29.2-5 studies the depth of the circuit in more detail.
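The two phases can be modeled recursively: an upward pass that returns each subtree's span product [i, k], and a downward pass that hands every subtree the prefix of all inputs to its left. The Python sketch below is a data-flow model of Figure 29.8 with names of our own; it splits ranges at the midpoint and recomputes span products rather than reusing the upward pass's wires, so it models the values computed, not the circuit's size:

```python
def op(x1, x2):
    # carry-status operator of Figure 29.6
    return x1 if x2 == 'p' else x2

def parallel_prefix(x):
    """Given x = [x0, x1, ..., xn], return [y0, ..., yn], where
    yi = x0 (x) x1 (x) ... (x) xi.  x0 is the input at the root."""
    n = len(x) - 1
    y = [None] * (n + 1)

    def span(lo, hi):
        # Upward phase: the product [lo, hi] of a subtree's inputs.
        if lo == hi:
            return x[lo]
        mid = (lo + hi) // 2
        return op(span(lo, mid), span(mid + 1, hi))

    def down(lo, hi, prefix):
        # Downward phase: prefix = [0, lo - 1] for this subtree.
        if lo == hi:
            y[lo] = op(prefix, x[lo])      # ylo = [0, lo]
            return
        mid = (lo + hi) // 2
        down(lo, mid, prefix)              # left child reuses [0, lo - 1]
        down(mid + 1, hi, op(prefix, span(lo, mid)))   # right gets [0, mid]

    y[0] = x[0]                            # y0 = [0, 0]
    if n >= 1:
        down(1, n, x[0])
    return y
```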

#### Figure 29.10 The construction of an n-bit carry-lookahead adder, shown here for n = 8. It consists of n + 1 KPG boxes KPGi for i = 0, 1, . . . , n. Each box KPGi takes external inputs ai and bi (where an and bn are hardwired to 0, as indicated by the diamond) and computes carry status xi+1. These values are fed into the parallel prefix circuit, which returns the results yi of the prefix computation. Each box KPGi now takes yi as input, interprets it as the carry-in bit ci, and then outputs the sum bit si = parity (ai, bi, ci). Sample values corresponding to those shown in Figures 29.3 and 29.9 are shown.

Given three n-bit numbers x = xn-1, xn-2, . . . , x0, y = yn-1, yn-2, . . . , y0, and z = zn-1, zn-2, . . . , z0, an n-bit carry-save adder produces an n-bit number u = un-1, un-2, . . . , u0 and an (n + 1)-bit number v = vn, vn-1, . . . , v0 such that

`u + v = x + y + z.`

As shown in Figure 29.11 (a), it does this by computing

`ui = parity(xi, yi, zi),`

`vi+1 = majority(xi, yi, zi),`

for i = 0, 1, . . . , n - 1. Bit v0 always equals 0.

The n-bit carry-save adder shown in Figure 29.11(b) consists of n full adders FA0, FA1, . . . , FAn - 1. For i = 0, 1, . . . , n - 1, full adder FAi takes inputs xi, yi, and zi. The sum-bit output of FAi is taken as ui, and the carry-out of FAi is taken as vi+1. Bit v0 is hardwired to 0.

Since the computations of all 2n + 1 output bits are independent, they can be performed in parallel. Thus, a carry-save adder operates in Θ(1) time and has Θ(n) size. To sum three n-bit numbers, therefore, we need only perform a carry-save addition, taking Θ(1) time, and then perform a carry-lookahead addition, taking O(lg n) time. Although this method is not asymptotically better than the method of using two carry-lookahead additions, it is much faster in practice. Moreover, we shall see in Section 29.3 that carry-save addition is central to fast algorithms for multiplication.
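Because position i of the outputs depends only on position i of the inputs, a carry-save adder is just n full adders side by side. A Python sketch with illustrative names:

```python
def carry_save_add(x, y, z):
    """Reduce three n-bit numbers (bit lists, index 0 = LSB) to an
    n-bit u and an (n + 1)-bit v with u + v = x + y + z."""
    n = len(x)
    u = [x[i] ^ y[i] ^ z[i] for i in range(n)]           # ui = parity
    v = [0] + [(x[i] & y[i]) | (y[i] & z[i]) | (x[i] & z[i])
               for i in range(n)]                        # vi+1 = majority
    return u, v
```

Every output bit is produced by its own full adder, so all 2n + 1 bits settle after one full-adder delay, which is the Θ(1) time bound claimed above.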

## Exercises

Let a = 01111111, b = 00000001, and n = 8. Show the sum and carry bits output by full adders when ripple-carry addition is performed on these two sequences. Show the carry statuses x0, x1, . . . , x8 corresponding to a and b, label each wire of the parallel prefix circuit of Figure 29.9 with the value it has given these xi inputs, and show the resulting outputs y0, y1, . . . , y8.

Prove that the carry-status operator ⊗ given by Figure 29.6 is associative.

#### Figure 29.12 A parallel prefix circuit for use in Exercise 29.2-6.

Show the gate-level construction of the box KPGi. Assume that each output xi is represented by 00 if xi = k, by 11 if xi = g, and by 01 or 10 if xi = p. Assume also that each input yi is represented by 0 if yi = k and by 1 if yi = g.

Give a recursive block diagram of the circuit in Figure 29.12 for any number n of inputs that is an exact power of 2. Argue on the basis of your block diagram that the circuit indeed performs a prefix computation. Show that the depth of the circuit is Θ(lg n) and that it has Θ(n lg n) size.

What is the maximum fan-out of any wire in the carry-lookahead adder? Show that addition can still be performed in O(lg n) time by a Θ(n)-size circuit even if we restrict gates to have O(1) fan-out.

Show that n-bit addition can be accomplished with a combinational circuit of depth 4 and size polynomial in n if AND and OR gates are allowed arbitrarily high fan-in. (Optional: Achieve depth 3.)

# 29.3 Multiplication circuits

In this section, we examine two circuits for multiplying two n-bit numbers. Array multipliers operate in Θ(n) time and have Θ(n^2) size. Wallace-tree multipliers also have Θ(n^2) size, but they operate in Θ(lg n) time. Both circuits are based on the grade-school algorithm.

```
            1  1  1  0  =  a
            1  1  0  1  =  b
-------------------------------
            1  1  1  0  =  m(0)
         0  0  0  0     =  m(1)
      1  1  1  0        =  m(2)
   1  1  1  0           =  m(3)
-------------------------------
1  0  1  1  0  1  1  0  =  p
```

#### Figure 29.13 The grade-school method applied in binary: multiplying a = 1110 by b = 1101 to obtain the product p = 10110110. The ith partial product m(i) is a shifted left by i positions if bi = 1, and 0 if bi = 0; the product is the sum of the partial products.

## 29.3.1 Array multipliers

Figure 29.14 shows an array multiplier for two input numbers a = an-1, an-2, . . . , a0 and b = bn-1, bn-2, . . . , b0. The aj values run vertically, and the bi values run horizontally. Each input bit fans out to n AND gates to form partial products. Full adders, which are organized as carry-save adders, sum partial products. The lower-order bits of the final product are output on the right. The higher-order bits are formed by adding the two numbers output by the last carry-save adder.

Let us examine the construction of the array multiplier more closely. Given the two input numbers a = an-1, an-2, . . . , a0 and b = bn-1, bn-2, . . . , b0, the bits of the partial products are easy to compute. Specifically, for i, j = 0, 1, . . . , n - 1, we have

`mi+j(i) = aj ∧ bi,`

and all other bits of m(i) are 0. Since the product of 1-bit values can be computed directly with an AND gate, all the bits of the partial products (except those known to be 0, which need not be explicitly computed) can be produced in one step using n^2 AND gates.

Figure 29.15 illustrates how the array multiplier performs the carry-save additions when summing the partial products in Figure 29.13. It starts by carry-save adding m(0), m(1), and 0, yielding an (n + 1)-bit number u(1) and an (n + 1)-bit number v(1). (The number v(1) has only n + 1 bits, not n + 2, because the (n + 1)st bits of both 0 and m(0) are 0.) Thus, m(0) + m(1) = u(1) + v(1). It then carry-save adds u(1), v(1), and m(2), yielding an (n + 2)-bit number u(2) and an (n + 2)-bit number v(2). (Again, v(2) has only n + 2 bits because the (n + 2)d bits of both u(1) and v(1) are 0.) We then have m(0) + m(1) + m(2) = u(2) + v(2). The multiplier continues on, carry-save adding u(i-1), v(i-1), and m(i) for i = 2, 3, . . . , n - 1. The result is a (2n - 1)-bit number u(n-1) and a (2n - 1)-bit number v(n-1), where

`u(n-1) + v(n-1) = m(0) + m(1) + ⋯ + m(n-1) = a · b.`

#### Figure 29.14 An array multiplier that computes the product p = p2n-1, p2n-2, . . . , p0 of two n-bit numbers a = an-1, an-2, . . . , a0 and b = bn-1, bn-2, . . . , b0, shown here for n = 4. Each AND gate computes partial-product bit mj(i). Each row of full adders constitutes a carry-save adder. The lower n bits of the product are p0 = m0(0) and the bits pi = ui(i) coming out from the rightmost column of full adders. The upper n product bits are formed by adding the u and v bits coming out from the bottom row of full adders. Shown are bit values for inputs a = 1110 and b = 1101 and product p = 10110110, corresponding to Figures 29.13 and 29.15.

```
            0  0  0  0  =  0
            1  1  1  0  =  m(0)
         0  0  0  0     =  m(1)
-------------------------------
         0  1  1  1  0  =  u(1)
         0  0  0        =  v(1)
      1  1  1  0        =  m(2)
-------------------------------
      1  1  0  1  1  0  =  u(2)
      0  1  0           =  v(2)
   1  1  1  0           =  m(3)
-------------------------------
   1  0  1  0  1  1  0  =  u(3)
   1  1  0              =  v(3)
-------------------------------
1  0  1  1  0  1  1  0  =  p
```

#### Figure 29.15 Evaluating the sum of the partial products by repeated carry-save addition. For this example, a = 1110 and b = 1101. Bits that are blank are 0 regardless of the values of a and b. We first evaluate m(0) + m(1) + 0 = u(1) + v(1), then u(1) + v(1) + m(2) = u(2) + v(2), then u(2) + v(2) + m(3) = u(3) + v(3), and finally p = m(0) + m(1) + m(2) + m(3) = u(3) + v(3). Note that ui(i) = pi and vi(i) = 0 for i = 1, 2, . . . , n - 1.

In fact, the carry-save additions in Figure 29.15 operate on more bits than strictly necessary. Observe that for i = 1, 2, . . . , n - 1 and j = 0, 1, . . . , i - 1, we have mj(i) = 0 because of how we shift the partial products. Observe also that vj(i) = 0 for i = 1, 2, . . . , n - 1 and j = 0, 1, . . . , i, i + n, i + n + 1, . . . , 2n - 1. (See Exercise 29.3-1.) Each carry-save addition, therefore, needs to operate on only n - 1 bits.

Let us now examine the correspondence between the array multiplier and the repeated carry-save addition scheme. Each AND gate is labeled by the bit mj(i) that it produces, for some i and j in the ranges 0 ≤ i ≤ n - 1 and 0 ≤ j ≤ 2n - 2; bit mj(i) is the jth bit of the ith partial product. For i = 0, 1, . . . , n - 1, the ith row of AND gates computes the n significant bits of the partial product m(i), that is, mi(i), mi+1(i), . . . , mi+n-1(i).

Except for the full adders in the top row (that is, for i = 2, 3, . . . , n - 1), each full adder takes three inputs--uj(i-1), vj(i-1), and mj(i)--and produces two outputs, uj(i) and vj+1(i). (Note that in the leftmost column of full adders, the input vj(i-1) is 0.) Each full adder in the top row takes inputs mj(0), mj(1), and 0 and produces bits uj(1) and vj+1(1).

Finally, let us examine the output of the array multiplier. As we observed above, vj(n-1) = 0 for j = 0, 1, . . . , n - 1. Thus, pj = uj(n-1) for j = 0, 1, . . . , n - 1. Moreover, since m0(1) = 0, we have p0 = u0(1) = m0(0), and since the lowest-order i bits of each m(i) and v(i-1) are 0, we have uj(i) = uj(i-1) for i = 2, 3, . . . , n - 1 and j = 0, 1, . . . , i - 1. Thus, ui(n-1) = ui(i) and, by induction, pi = ui(i) for i = 1, 2, . . . , n - 1. Product bits p2n-1, p2n-2, . . . , pn are produced by an n-bit adder that adds the outputs from the last row of full adders:

`p2n-1 p2n-2 . . . pn = (u2n-2(n-1) . . . un(n-1)) + (v2n-2(n-1) . . . vn(n-1)).`

### Analysis

Data ripple through an array multiplier from upper left to lower right. It takes Θ(n) time for the lower-order product bits pn-1, pn-2, . . . , p0 to be produced, and it takes Θ(n) time for the inputs to the adder to be ready. If the adder is a ripple-carry adder, it takes another Θ(n) time for the higher-order product bits p2n-1, p2n-2, . . . , pn to emerge. If the adder is a carry-lookahead adder, only Θ(lg n) time is needed, but the total time remains Θ(n).

There are n^2 AND gates and n^2 - n full adders in the array multiplier. The adder for the high-order output bits contributes only another Θ(n) gates. Thus, the array multiplier has Θ(n^2) size.
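The repeated carry-save scheme of Figure 29.15 is easy to check numerically. In the Python sketch below (names are our own), whole integers stand in for the rows of wires; the shifts and AND gates are modeled by arithmetic on the partial products:

```python
def csa(x, y, z):
    """Carry-save add three integers: u is the bitwise parity and v
    the shifted bitwise majority, so u + v = x + y + z."""
    u = x ^ y ^ z
    v = ((x & y) | (y & z) | (x & z)) << 1
    return u, v

def array_multiply(a, b, n):
    """Multiply n-bit integers a and b by accumulating the partial
    products m(i) = a * bi * 2^i with repeated carry-save addition."""
    m = [(a << i) if (b >> i) & 1 else 0 for i in range(n)]
    u, v = csa(m[0], m[1], 0) if n > 1 else (m[0], 0)
    for i in range(2, n):
        u, v = csa(u, v, m[i])
    return u + v    # final addition (a carry-lookahead adder in the circuit)
```

For a = 1110 and b = 1101 the intermediate values match Figure 29.15, ending with u(3) = 1010110, v(3) = 1100000, and p = 10110110.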

## 29.3.2 Wallace-tree multipliers

Figure 29.16 shows a Wallace tree2 that adds 8 partial products m(0), m(1), . . . , m(7). Partial product m(i) consists of n + i bits. Each line represents an entire number, not just a single bit; next to each line is the number of bits the line represents (see Exercise 29.3-3). The carry-lookahead adder at the bottom adds a (2n - 1)-bit number to a 2n-bit number to give the 2n-bit product.

2 As you can see from the figure, a Wallace tree is not truly a tree, but rather a directed acyclic graph. The name is historical.

### Analysis

The time required by an n-input Wallace tree depends on the depth of the carry-save adders. At each level of the tree, each group of 3 numbers is reduced to 2 numbers, with at most 2 numbers left over (as in the case of m(6) and m(7) at the top level). Thus, the maximum depth D(n) of a carry-save adder in an n-input Wallace tree is given by the recurrence

```
D(n) = { Θ(1)                if 1 ≤ n ≤ 2 ,
       { D(⌈2n/3⌉) + Θ(1)    if n ≥ 3 ,
```

which has the solution D(n) = Θ(lg n) by case 2 of the master theorem (Theorem 4.1). Each carry-save adder takes Θ(1) time. All n partial products can be formed in Θ(1) time in parallel. (The lowest-order i bits of m(i), for i = 1, 2, . . . , n - 1, are hardwired to 0.) The carry-lookahead adder takes O(lg n) time. Thus, the entire multiplication of two n-bit numbers takes Θ(lg n) time.
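The 3-to-2 reduction can be simulated level by level to confirm the depth bound. In this Python sketch (hypothetical names, integers standing in for wired numbers), the loop count equals the Wallace tree's carry-save depth D(n):

```python
def csa(x, y, z):
    # carry-save add: u + v = x + y + z, one full-adder delay
    return x ^ y ^ z, ((x & y) | (y & z) | (x & z)) << 1

def wallace_sum(nums):
    """Reduce a list of numbers to two with levels of carry-save
    adders, then add; returns (sum, number of levels used)."""
    levels = 0
    while len(nums) > 2:
        groups = len(nums) // 3
        nxt = []
        for i in range(0, 3 * groups, 3):
            u, v = csa(nums[i], nums[i + 1], nums[i + 2])
            nxt += [u, v]
        nums = nxt + nums[3 * groups:]   # up to 2 leftovers pass through
        levels += 1
    return sum(nums), levels             # final add = carry-lookahead adder
```

With 8 input numbers the level sizes go 8 → 6 → 4 → 3 → 2, so D(8) = 4, in agreement with the recurrence.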

A Wallace-tree multiplier for two n-bit numbers has Θ(n^2) size, which we can see as follows. We first bound the circuit size of the carry-save adders. A lower bound of Ω(n^2) is easy to obtain, since there are ⌊n/3⌋ carry-save adders at depth 1, and each one consists of at least n full adders. To get the upper bound of O(n^2), observe that since the final product has 2n bits, each carry-save adder in the Wallace tree contains at most 2n full adders. We need to show that there are O(n) carry-save adders altogether. Let C(n) be the total number of carry-save adders in a Wallace tree with n input numbers. We have the recurrence

```
C(n) = { Θ(1)                 if 1 ≤ n ≤ 2 ,
       { C(⌈2n/3⌉) + ⌊n/3⌋    if n ≥ 3 ,
```

which has the solution C(n) = Θ(n) by case 3 of the master theorem. We thus obtain an asymptotically tight bound of Θ(n^2) size for the carry-save adders of a Wallace-tree multiplier. The circuitry to set up the n partial products has Θ(n^2) size, and the carry-lookahead adder at the end has Θ(n) size. Thus, the size of the entire multiplier is Θ(n^2).

Although the Wallace-tree-based multiplier is asymptotically faster than the array multiplier and has the same asymptotic size, its layout when it is implemented is not as regular as the array multiplier's, nor is it as "dense" (in the sense of having little wasted space between circuit elements). In practice, a compromise between the two designs is often used. The idea is to use two arrays in parallel, one adding up half of the partial products and one adding up the other half. The propagation delay is only half of that incurred by a single array adding up all n partial products. Two more carry-save additions reduce the 4 numbers output by the arrays to 2 numbers, and a carry-lookahead adder then adds the 2 numbers to yield the product. The total propagation delay is a little more than half that of a full array multiplier, plus an additional O(lg n) term.

## Exercises

Prove that in an array multiplier, vj(i) = 0 for i = 1, 2, . . . , n - 1 and j = 0, 1, . . . , i, i + n, i + n + 1, . . . , 2n - 1.

Show that in the array multiplier of Figure 29.14, all but one of the full adders in the top row are unnecessary. You will need to do some rewiring.

Show that multiplication can still be performed in O(lg n) time with O(n^2) size even if we restrict gates to have O(1) fan-out.

Describe an efficient circuit to compute the quotient when a binary number x is divided by 3. (Hint: Note that in binary, .010101 . . . = .01 × 1.01 × 1.0001 × 1.00000001 × ⋯ .)

# 29.4 Clocked circuits

The elements of a combinational circuit are used only once during a computation. By introducing clocked memory elements into the circuit, we can reuse combinational elements. Because they can use hardware more than once, clocked circuits can often be much smaller than combinational circuits for the same function.

#### Figure 29.17 The operation of a bit-serial adder. During the ith clock period, for i = 0, 1, . . . , n, the full adder FA takes input bits ai and bi from the outside world and a carry bit ci from the register. The full adder then outputs sum bit si, which is provided externally, and carry bit ci+1, which is stored back in the register to be used during the next clock period. The register is initialized with c0 = 0. (a)-(e) The state of the circuit in each of the five clock periods during the addition of a = 1011 and b = 1001 to produce s = 10100.

Let us examine the operation of a register in a little more detail. We treat each clock tick as a momentary pulse. At a given tick, a register reads the input value presented to it at that moment and stores it. This stored value then appears at the register's output, where it can be used to compute values that are moved into other registers at the next clock tick. In other words, the value at a register's input during one clock period appears on the register's output during the next clock period.

Now let us examine the circuit in Figure 29.17, which we call a bit-serial adder. In order for the full adder's outputs to be correct, we require that the clock period be at least as long as the propagation delay of the full adder, so that the combinational circuitry has an opportunity to settle between clock ticks. During clock period 0, shown in Figure 29.17(a), the external world applies input bits a0 and b0 to two of the full adder's inputs. We assume that the register is initialized to store a 0; the initial carry-in bit, which is the register output, is thus c0 = 0. Later in this clock period, sum bit s0 and carry-out c1 emerge from the full adder. The sum bit goes to the external world, where presumably it will be saved as part of the entire sum s. The wire from the carry-out of the full adder feeds into the register, so that c1 is read into the register upon the next clock tick. At the beginning of clock period 1, therefore, the register contains c1. During clock period 1, shown in Figure 29.17(b), the outside world applies a1 and b1 to the full adder, which, reading c1 from the register, produces outputs s1 and c2. The sum bit s1 goes out to the outside world, and c2 goes to the register. This cycle continues until clock period n, shown in Figure 29.17(e), in which the register contains cn. The external world then applies an = bn = 0, so that we get sn = cn.
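The clocked behavior is captured by a loop in which one variable plays the register: each iteration is a clock period, and the value written to the carry at the end of an iteration is the value read at the start of the next. A Python sketch with our own names:

```python
def bit_serial_add(a, b):
    """Add equal-length bit lists a and b (index 0 = LSB) with a
    single full adder reused over n + 1 clock periods."""
    n = len(a)
    carry = 0                       # the register, initialized to c0 = 0
    s = []
    for i in range(n + 1):          # clock periods 0, 1, ..., n
        ai = a[i] if i < n else 0   # the world applies an = bn = 0
        bi = b[i] if i < n else 0
        s.append(ai ^ bi ^ carry)                        # si output now
        carry = (ai & bi) | (bi & carry) | (ai & carry)  # stored at the tick
    return s
```

For a = 1011 and b = 1001 the outputs, period by period, are 0, 0, 1, 0, 1, that is, s = 10100, as in Figure 29.17.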

### Analysis

Let us see how long it takes to add two n-bit numbers bit-serially. Each clock period takes Θ(1) time because the depth of the full adder is Θ(1). Since n + 1 clock ticks are required to produce all the outputs, the total time to perform bit-serial addition is (n + 1) · Θ(1) = Θ(n).

The size of the bit-serial adder (number of combinational elements plus number of registers) is Θ(1).

In general, we can replace any clocked circuit by an equivalent combinational circuit having the same asymptotic time delay if we know in advance how many clock periods the clocked circuit runs for. There is, of course, a trade-off involved. The clocked circuit uses fewer circuit elements (a factor of Θ(n) less for the bit-serial adder versus the ripple-carry adder), but the combinational circuit has the advantage of less control circuitry--we need no clock or synchronized external circuit to present input bits and store sum bits. Moreover, although the circuits have the same asymptotic time delay, the combinational circuit typically runs slightly faster in practice. The extra speed is possible because the combinational circuit doesn't have to wait for values to stabilize during each clock period. If all the inputs stabilize at once, values just ripple through the circuit at the maximum possible speed, without waiting for the clock.

## 29.4.2 Linear-array multipliers

Although the Russian peasant's algorithm may seem remarkable at first, Figure 29.18(b) shows that it is really just a binary-number-system implementation of the grade-school multiplication method: the same computation as in Figure 29.18(a), but with the numbers expressed in binary rather than decimal. Rows in which the a-column entry is odd contribute to the product a term of b multiplied by the appropriate power of 2.

### A slow linear-array implementation

Figure 29.19(a) shows one way to implement the Russian peasant's algorithm for two n-bit numbers. We use a clocked circuit consisting of a linear array of 2n cells. Each cell contains three registers. One register holds a bit from an a entry, one holds a bit from a b entry, and one holds a bit of the product p. We use superscripts to denote cell values before each step of the algorithm. For example, the value of bit ai before the jth step is ai(j), and we let a(j) denote the number whose binary representation is a2n-1(j) a2n-2(j) · · · a0(j); the values b(j) and p(j) are defined similarly.

The algorithm executes a sequence of n steps, numbered 0,1, . . . , n - 1, where each step takes one clock period. The algorithm maintains the invariant that before the jth step,

`a(j) · b(j) + p(j) = a · b`

#### (29.6)

(see Exercise 29.4-2). Initially, a(0) = a, b(0) = b, and p(0) = 0. The jth step consists of the following computations.

1. If a(j) is odd (that is, a0(j) = 1), then add b into p: p(j+1) ← b(j) + p(j). (The addition is performed by a ripple-carry adder that runs the length of the array; carry bits ripple from right to left.) If a(j) is even, then carry p through to the next step: p(j+1) ← p(j).

2. Shift a right by one bit position: ai(j+1) ← ai+1(j) for i = 0, 1, . . . , 2n - 2, and a2n-1(j+1) ← 0.

3. Shift b left by one bit position: bi(j+1) ← bi-1(j) for i = 1, 2, . . . , 2n - 1, and b0(j+1) ← 0.

After running n steps, we have shifted out all the bits of a; thus, a(n) = 0. Invariant (29.6) then implies that p(n) = a · b.
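The three-part step and invariant (29.6) can be sketched in Python, with integers standing in for the array's registers; the function name is illustrative.

```python
# A sketch of the slow implementation's per-step state, with invariant
# (29.6) checked before each step.

def slow_linear_array_multiply(a, b, n):
    p = 0                                      # a(0) = a, b(0) = b, p(0) = 0
    orig_a, orig_b = a, b
    for j in range(n):                         # steps 0, 1, ..., n - 1
        assert a * b + p == orig_a * orig_b    # invariant (29.6)
        if a & 1:                              # step 1: a(j) is odd
            p = b + p                          # ripple-carry add b into p
        a >>= 1                                # step 2: shift a right
        b <<= 1                                # step 3: shift b left
    assert a == 0                              # a(n) = 0, so p(n) = a * b
    return p

print(slow_linear_array_multiply(19, 29, 5))   # 551
```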

We now analyze the algorithm. There are n steps, assuming that the control information is broadcast to each cell simultaneously. Each step takes Θ(n) time in the worst case, because the depth of the ripple-carry adder is Θ(n), and thus the duration of the clock period must be at least Θ(n). Each shift takes only Θ(1) time. Overall, therefore, the algorithm takes Θ(n^2) time. Because each cell has constant size, the entire linear array has Θ(n) size.

### A fast linear-array implementation

By using carry-save addition instead of ripple-carry addition, we can decrease the time for each step to Θ(1), thus improving the overall time to Θ(n). As Figure 29.19(b) shows, once again each cell contains a bit of an a entry and a bit of a b entry. Each cell also contains two more bits, from u and v, which are the outputs from carry-save addition. Using a carry-save representation to accumulate the product, we maintain the invariant that before the jth step,

`a(j) · b(j) + u(j) + v(j) = a · b`

#### (29.7)

(again, see Exercise 29.4-2). Each step shifts a and b in the same way as the slow implementation, so that we can combine equations (29.6) and (29.7) to yield u(j) + v(j) = p(j). Thus, the u and v bits contain the same information as the p bits in the slow implementation.

#### Figure 29.19 Two linear-array implementations of the Russian peasant's algorithm, showing the multiplication of a = 19 = 10011 by b = 29 = 11101, with n = 5. The situation at the beginning of each step j is shown, with the remaining significant bits of a(j) and b(j) shaded. (a) A slow implementation that runs in Θ(n^2) time. Because a(5) = 0, we have p(5) = a · b. There are n steps, and each step uses a ripple-carry addition. The clock period is therefore proportional to the length of the array, or Θ(n), leading to Θ(n^2) time overall. (b) A fast implementation that runs in Θ(n) time because each step uses carry-save addition rather than ripple-carry addition, thus taking only Θ(1) time. There are a total of 2n - 1 = 9 steps; after the last step shown, repeated carry-save addition of u and v yields u(9) = a · b.

The jth step of the fast implementation performs carry-save addition on u and v, where the third operand depends on whether a is odd or even. If a0(j) = 1, we carry-save add b(j) into u and v, computing for each bit position i

`ui(j+1) = parity(ui(j), vi(j), bi(j))`

and

`vi+1(j+1) = majority(ui(j), vi(j), bi(j)).`

Otherwise, a0(j) = 0, and we carry-save add u(j) and v(j) alone, computing

`ui(j+1) = parity(ui(j), vi(j), 0)`

and

`vi+1(j+1) = majority(ui(j), vi(j), 0).`

(In both cases, the low-order carry bit is v0(j+1) ← 0.)
After updating u and v, the jth step shifts a to the right and b to the left in the same manner as the slow implementation.

The fast implementation performs a total of 2n - 1 steps. For j ≥ n, we have a(j) = 0, and invariant (29.7) therefore implies that u(j) + v(j) = a · b. Once a(j) = 0, all further steps serve only to carry-save add u and v. Exercise 29.4-3 asks you to show that v(2n-1) = 0, so that u(2n-1) = a · b.
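A Python sketch of these steps, with integers standing in for the cells' u and v registers: bitwise XOR computes the parity (sum) bits, and the shifted majority expression computes the carry bits. The function name is illustrative.

```python
# A sketch of the fast implementation, checking invariant (29.7) before
# each step and the final claim of Exercise 29.4-3 at the end.

def fast_linear_array_multiply(a, b, n):
    u, v = 0, 0                                   # u(0) = v(0) = 0
    orig_a, orig_b = a, b
    for j in range(2 * n - 1):                    # steps 0, 1, ..., 2n - 2
        assert a * b + u + v == orig_a * orig_b   # invariant (29.7)
        t = b if a & 1 else 0                     # third carry-save operand
        # parity bits stay in place; majority (carry) bits shift left
        u, v = u ^ v ^ t, ((u & v) | (v & t) | (u & t)) << 1
        a >>= 1                                   # shift a right
        b <<= 1                                   # shift b left
    assert v == 0                                 # v(2n-1) = 0 (Exercise 29.4-3)
    return u                                      # u(2n-1) = a * b

print(fast_linear_array_multiply(19, 29, 5))      # 551
```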

The total time in the worst case is Θ(n), since each of the 2n - 1 steps takes Θ(1) time. Because each cell still has constant size, the total size remains Θ(n).

## Exercises

29.4-1
Let a = 101101, b = 011110, and n = 6. Show how the Russian peasant's algorithm operates, in both decimal and binary, for inputs a and b.

29.4-2
Prove the invariants (29.6) and (29.7) for the linear-array multipliers.

29.4-3
Prove that in the fast linear-array multiplier, v(2n-1) = 0.

29.4-4
Describe how the array multiplier from Section 29.3.1 represents an "unrolling" of the computation of the fast linear-array multiplier.

29.4-5
Consider a data stream x1, x2, . . . that arrives at a clocked circuit at the rate of 1 value per clock tick. For a fixed value n, the circuit must compute the value

`yt = max{xt-n+1, xt-n+2, . . . , xt}`

for t = n, n + 1, . . . . That is, yt is the maximum of the most recent n values received by the circuit. Give an O(n)-size circuit that on each clock tick inputs the value xt and computes the output value yt in O(1) time. The circuit can use registers and combinational elements that compute the maximum of two inputs.

29.4-6
Redo Exercise 29.4-5 using only O(lg n) "maximum" elements.

# Problems

29-1 Division circuits

This problem uses Newton iteration to build a circuit for computing reciprocals (division then follows, since a/x = a · (1/x)). The idea is to compute a sequence y0, y1, y2, . . . of approximations to the reciprocal of a number x by using the formula

`yi+1 = yi (2 - x yi).`

Assume that x is given as an n-bit binary fraction in the range 1/2 ≤ x ≤ 1. Since the reciprocal can be an infinite repeating fraction, we shall concentrate on computing an n-bit approximation accurate up to its least significant bit.
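The iteration yi+1 = 2yi - x(yi)^2 converges quadratically to 1/x. This Python sketch observes the convergence numerically, using floating point rather than n-bit binary fractions; the starting value y0 = 1 is just an illustrative choice for the range 1/2 ≤ x ≤ 1.

```python
# A numerical sketch of Newton iteration for the reciprocal.

def reciprocal(x, iterations):
    """Approximate 1/x by Newton iteration."""
    y = 1.0                      # crude initial approximation y0
    for _ in range(iterations):
        y = 2 * y - x * y * y    # the error roughly squares each iteration
    return y

print(abs(reciprocal(0.75, 5) - 1 / 0.75))   # error after 5 iterations
```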

a. Suppose that |yi - 1/x| ≤ ε for some constant ε > 0. Prove that |yi+1 - 1/x| ≤ ε^2.

b. Give an initial approximation y0 such that yk satisfies |yk - 1/x| ≤ 2^(-2^k) for all k ≥ 0. How large must k be for the approximation yk to be accurate up to its least significant bit?

c. Describe a combinational circuit that, given an n-bit input x, computes an n-bit approximation to 1/x in O(lg^2 n) time. What is the size of your circuit? (Hint: With a little cleverness, you can beat the size bound of Θ(n^2 lg n).)

29-2 Boolean formulas for symmetric functions

A boolean function f(x1, x2, . . . , xn) is symmetric if

`f(x1, x2, . . . , xn) = f(xσ(1), xσ(2), . . . , xσ(n))`

for any permutation σ of {1, 2, . . . , n}. In this problem, we shall show that there is a boolean formula representing f whose size is polynomial in n. (For our purposes, a boolean formula is a string comprised of the variables x1, x2, . . . , xn, parentheses, and the boolean operators ∨, ∧, and ¬.) Our approach will be to convert a logarithmic-depth boolean circuit to an equivalent polynomial-size boolean formula. We shall assume that all circuits are constructed from 2-input AND, 2-input OR, and NOT gates.

a. Describe an O(lg n)-depth combinational circuit for majorityn. (Hint: Build a tree of adders.)

b. Suppose that f is an arbitrary boolean function of the n boolean variables x1, x2, . . . , xn. Suppose further that there is a circuit C of depth d that computes f. Show how to construct from C a boolean formula for f of length O(2^d). Conclude that there is a polynomial-size formula for majorityn.

c. Argue that any symmetric boolean function f(x1, x2, . . . , xn) can be expressed as a function of the sum x1 + x2 + · · · + xn of its inputs.

d. Argue that any symmetric function on n boolean inputs can be computed by an O(lg n)-depth combinational circuit.

e. Argue that any symmetric boolean function on n boolean variables can be represented by a boolean formula whose length is polynomial in n.

# Chapter notes

Most books on computer arithmetic focus more on practical implementations of circuitry than on algorithmic theory. Savage [173] is one of the few that investigates algorithmic aspects of the subject. The more hardware-oriented books on computer arithmetic by Cavanagh [39] and Hwang [108] are especially readable. Good books on combinational and sequential logic design include Hill and Peterson [96] and, with a twist toward formal language theory, Kohavi [126].

Aiken and Hopper [7] trace the early history of arithmetic algorithms. Ripple-carry addition is at least as old as the abacus, which has been around for over 5000 years. The first mechanical calculator employing ripple-carry addition was devised by B. Pascal in 1642. A calculating machine adapted to repeated addition for multiplication was conceived by S. Morland in 1666 and independently by G. W. Leibnitz in 1671. The Russian peasant's algorithm for multiplication is apparently much older than its use in Russia in the nineteenth century. According to Knuth [122], it was used by Egyptian mathematicians as long ago as 1800 B.C.

The kill, generate, and propagate statuses of a carry chain were exploited in a relay calculator built at Harvard during the mid-1940's [180]. One of the first implementations of carry-lookahead addition was described by Weinberger and Smith [199], but their lookahead method requires large gates. Ofman [152] proved that n-bit numbers could be added in O(lg n) time using carry-lookahead addition with constant-size gates.

The idea of using carry-save addition to speed up multiplication is due to Estrin, Gilchrist, and Pomerene [64]. Atrubin [13] describes a linear-array multiplier of infinite length that can be used to multiply binary numbers of arbitrary length. The multiplier produces the nth bit of the product immediately upon receiving the nth bits of the inputs. The Wallace-tree multiplier is attributed to Wallace [197], but the idea was also independently discovered by Ofman [152].

Division algorithms date back to I. Newton, who around 1665 invented what has become known as Newton iteration. Problem 29-1 uses Newton iteration to construct a division circuit with Θ(lg^2 n) depth. This method was improved by Beame, Cook, and Hoover [19], who showed that n-bit division can in fact be performed in Θ(lg n) depth.