As parallel-processing computers have proliferated, interest has increased in **parallel algorithms***:* algorithms that perform more than one operation at a time. The study of parallel algorithms has now developed into a research area in its own right. Indeed, parallel algorithms have been developed for many of the problems we have solved in this text using ordinary serial algorithms. In this chapter, we shall describe a few simple parallel algorithms that illustrate fundamental issues and techniques.

Figure 30.1 shows the basic architecture of the *parallel random-access machine (PRAM)*. There are

Concurrent versus exclusive memory accesses

A *concurrent-read* algorithm is a PRAM algorithm during whose execution multiple processors can read from the same location of shared memory at the same time. An

*EREW*: exclusive read and exclusive write,

*CREW*: concurrent read and exclusive write,

*ERCW*: exclusive read and concurrent write, and

*CRCW*: concurrent read and concurrent write.

(These abbreviations are usually pronounced not as words but rather as strings of letters.)

When multiple processors write to the same location in a CRCW algorithm, the effect of the parallel write is not well defined without additional elaboration. In this chapter, we shall use the *common-CRCW* model: when several processors write into the same memory location, they must all write a common (the same) value. Several alternative types of PRAMs in the literature handle this problem with different assumptions. Other choices include

**arbitrary**: an arbitrary value from among those written is actually stored,

**priority**: the value written by the lowest-indexed processor is stored, and

**combining**: the value stored is some specified combination of the values written.
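To make the four write-resolution policies concrete, here is a small sketch (an illustration, not from the text; the function name and interface are my own) that resolves one shared-memory cell's worth of simultaneous writes under each policy:

```python
# Sketch: simulating the four CRCW write-resolution policies discussed
# above. Each "write" is a (processor_index, value) pair aimed at one
# shared memory cell.

def resolve_write(writes, policy, combine=None):
    """Return the value stored when several processors write one cell.

    writes  -- list of (processor_index, value) pairs, in arbitrary order
    policy  -- 'common', 'arbitrary', 'priority', or 'combining'
    combine -- binary function used by the 'combining' policy
    """
    values = [v for _, v in writes]
    if policy == 'common':
        # Common-CRCW: all processors must write the same value.
        assert len(set(values)) == 1, "common-CRCW requires identical values"
        return values[0]
    if policy == 'arbitrary':
        return values[0]          # any one of the written values
    if policy == 'priority':
        return min(writes)[1]     # value from the lowest-indexed processor
    if policy == 'combining':
        result = values[0]
        for v in values[1:]:
            result = combine(result, v)
        return result
    raise ValueError(policy)

writes = [(2, 7), (0, 3), (5, 9)]
print(resolve_write(writes, 'priority'))        # 3 (processor 0 wins)
print(resolve_write(writes, 'combining', max))  # 9
```
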

Among the more interesting PRAM algorithms are those that involve pointers. In this section, we investigate a powerful technique called pointer jumping, which yields fast algorithms for operating on lists. Specifically, we introduce an *O*(lg *n*)-time algorithm that computes the distance to the end of the list for each object in an *n*-object list. We then modify this algorithm to perform a "parallel prefix" computation on an *n*-object list in *O*(lg *n*) time. Finally, we investigate a technique that allows many problems on trees to be converted to list problems, which can then be solved by pointer jumping. All of the algorithms in this section are EREW algorithms: no concurrent accesses to global memory are required.

Our first parallel algorithm operates on lists. We can store a list in a PRAM much as we store lists in an ordinary RAM. To operate on list objects in parallel, however, it is convenient to assign a "responsible" processor to each object. We shall assume that there are as many processors as list objects, and that the *i*th processor is responsible for the *i*th object. Figure 30.2(a), for example, shows a linked list consisting of the sequence of objects 3,4,6,1,0,5. Since there is one processor per list object, every object in the list can be operated on by its responsible processor in *O*(1) time.

We call the problem of computing the *d* values the **list-ranking problem.**

    LIST-RANK(L)
    1  for each processor i, in parallel
    2      do if next[i] = NIL
    3            then d[i] ← 0
    4            else d[i] ← 1
    5  while there exists an object i such that next[i] ≠ NIL
    6      do for each processor i, in parallel
    7             do if next[i] ≠ NIL
    8                   then d[i] ← d[i] + d[next[i]]
    9                        next[i] ← next[next[i]]
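The pointer-jumping loop above can be sketched serially as follows (an illustration, not from the text): each "parallel" step is simulated by computing all processors' reads first and then applying all writes, which mirrors the synchronous PRAM semantics. NIL is represented by `None`.

```python
def list_rank(next_ptr):
    """next_ptr[i] is the index of the object after i, or None at the end.
    Returns d, where d[i] is the distance from object i to the end."""
    n = len(next_ptr)
    nxt = list(next_ptr)
    d = [0 if nxt[i] is None else 1 for i in range(n)]
    while any(nxt[i] is not None for i in range(n)):
        # Read phase: every processor reads before any processor writes.
        new_d = [d[i] + d[nxt[i]] if nxt[i] is not None else d[i]
                 for i in range(n)]
        new_nxt = [nxt[nxt[i]] if nxt[i] is not None else None
                   for i in range(n)]
        d, nxt = new_d, new_nxt   # pointer jumping halves the distances
    return d

# The list 0 -> 1 -> 2 -> 3:
print(list_rank([1, 2, 3, None]))   # [3, 2, 1, 0]
```

The loop executes O(lg n) times, since each round of pointer jumping doubles how far each object's pointer reaches.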

Besides parallel running time, there is another interesting performance measure for parallel algorithms. We define the *work* performed by a parallel algorithm as the product of its running time and the number of processors it requires. Intuitively, the work is the amount of computing that a serial RAM performs when it simulates the parallel algorithm.

We define a PRAM algorithm *A* to be *work-efficient* with respect to another (serial or parallel) algorithm

The technique of pointer jumping extends well beyond the application of list ranking. Section 29.2.2 shows how, in the context of arithmetic circuits, a "prefix" computation can be used to perform binary addition quickly. We now investigate how pointer jumping can be used to perform prefix computations. Our EREW algorithm for the prefix problem runs in *O*(lg *n*) time on *n*-object lists.

A *prefix computation* is defined in terms of a binary, associative operator ⊗. The computation takes as input a sequence

*y*_{k} = *y*_{k-1} ⊗ *x*_{k}

&nbsp;&nbsp;&nbsp;&nbsp;= *x*_{1} ⊗ *x*_{2} ⊗ . . . ⊗ *x*_{k}

[*i*, *j*] = *x*_{i} ⊗ *x*_{i+1} ⊗ . . . ⊗ *x*_{j}

for integers *i* and *j* in the range 1 ≤ *i* ≤ *j* ≤ *n*. Then, [*k*, *k*] = *x*_{k} for

*k* = 1, 2, . . . , *n*, and

[*i*, *k*] = [*i*, *j*] ⊗ [*j* + 1, *k*]
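For concreteness, here is a small instance (an illustration, not from the text). Take ⊗ to be ordinary addition and the input sequence to be ⟨3, 1, 4, 1, 5⟩. The prefix results are then the running sums

*y* = ⟨3, 4, 8, 9, 14⟩,

and the composition rule checks out on, say, [2, 4] = *x*_{2} ⊗ *x*_{3} ⊗ *x*_{4} = 6, which agrees with [2, 3] ⊗ [4, 4] = 5 + 1 = 6.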

    LIST-PREFIX(L)
    1  for each processor i, in parallel
    2      do y[i] ← x[i]
    3  while there exists an object i such that next[i] ≠ NIL
    4      do for each processor i, in parallel
    5             do if next[i] ≠ NIL
    6                   then y[next[i]] ← y[i] ⊗ y[next[i]]
    7                        next[i] ← next[next[i]]
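A serial sketch of LIST-PREFIX (an illustration, not from the text) again simulates each synchronous parallel step with a read phase followed by a write phase; `None` plays the role of NIL, and `op` is the associative operator ⊗:

```python
def list_prefix(next_ptr, x, op):
    """Compute y[i] = x[head] op ... op x[i] along the linked list."""
    n = len(next_ptr)
    nxt = list(next_ptr)
    y = list(x)
    while any(nxt[i] is not None for i in range(n)):
        # Read everything first, then write: PRAM-style synchrony.
        updates = [(nxt[i], op(y[i], y[nxt[i]]))
                   for i in range(n) if nxt[i] is not None]
        jumps = [(i, nxt[nxt[i]]) for i in range(n) if nxt[i] is not None]
        for j, v in updates:      # line 6: y[next[i]] <- y[i] op y[next[i]]
            y[j] = v
        for i, j in jumps:        # line 7: next[i] <- next[next[i]]
            nxt[i] = j
    return y

# Prefix sums over the list 0 -> 1 -> 2 -> 3 with values 3, 1, 4, 1:
print(list_prefix([1, 2, 3, None], [3, 1, 4, 1], lambda a, b: a + b))
# [3, 4, 8, 9]
```
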

In this section, we shall introduce the Euler-tour technique and show how it can be applied to the problem of computing the depth of each node in an *n*-node binary tree. A key step in this *O*(lg *n*)-time EREW algorithm is a parallel prefix computation.

Computing the depth of each node in an *n*-node tree takes *O*(*n*) time on a serial RAM. A simple parallel algorithm to compute depths propagates a "wave" downward from the root of the tree. The wave reaches all nodes at the same depth simultaneously, and thus by incrementing a counter carried along with the wave, we can compute the depth of each node. This parallel algorithm works well on a complete binary tree, since it runs in time proportional to the tree's height. The height of the tree could be as large as *n* - 1, however, in which case the algorithm would run in Θ(*n*) time, no better than the serial algorithm. Using the Euler-tour technique, however, we can compute node depths in *O*(lg *n*) time on an EREW PRAM, whatever the height of the tree.

An *Euler tour* of a graph is a cycle that traverses each edge exactly once, although it may visit a vertex more than once. By Problem 23-3, a connected, directed graph has an Euler tour if and only if for all vertices

The debate about whether or not concurrent memory accesses should be provided by the hardware of a parallel computer is a messy one. Some argue that hardware mechanisms to support CRCW algorithms are too expensive and used too infrequently to be justified. Others complain that EREW PRAMs provide too restrictive a programming model. The answer to this debate probably lies somewhere in the middle, and various compromise models have been proposed. Nevertheless, it is instructive to examine what algorithmic advantage is provided by concurrent accesses to memory.

In this section, we shall show that there are problems on which a CRCW algorithm outperforms the best possible EREW algorithm. For the problem of finding the identities of the roots of trees in a forest, concurrent reads allow for a faster algorithm. For the problem of finding the maximum element in an array, concurrent writes permit a faster algorithm.

    FIND-ROOTS(F)
    1  for each processor i, in parallel
    2      do if parent[i] = NIL
    3            then root[i] ← i
    4  while there exists a node i such that parent[i] ≠ NIL
    5      do for each processor i, in parallel
    6             do if parent[i] ≠ NIL
    7                   then root[i] ← root[parent[i]]
    8                        parent[i] ← parent[parent[i]]
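The same read-then-write simulation idea gives a serial sketch of FIND-ROOTS (an illustration, not from the text). `parent[i]` is `None` at a root. Note that line 7 of the pseudocode has many children read `root` of a shared ancestor at once, which is why the algorithm needs concurrent reads (CREW):

```python
def find_roots(parent):
    """Return root[i] = the root of the tree containing node i."""
    n = len(parent)
    par = list(parent)
    root = [i if par[i] is None else None for i in range(n)]
    while any(par[i] is not None for i in range(n)):
        # Gather all reads before applying writes (synchronous step).
        new_root = [root[par[i]] if par[i] is not None else root[i]
                    for i in range(n)]
        new_par = [par[par[i]] if par[i] is not None else None
                   for i in range(n)]
        root, par = new_root, new_par
    return root

# A forest of two trees: 0 is the root of {0,1,2,3}, 4 is the root of {4,5}.
print(find_roots([None, 0, 1, 1, None, 4]))   # [0, 0, 0, 0, 4, 4]
```
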

To demonstrate that concurrent writes offer a performance advantage over exclusive writes, we examine the problem of finding the maximum element in an array of real numbers. We shall see that any EREW algorithm for this problem takes Ω(lg *n*) time and that no CREW algorithm does any better. The problem can be solved in *O*(1) time using a common-CRCW algorithm, in which when several processors write to the same location, they all write the same value.

    FAST-MAX(A)
     1  n ← length[A]
     2  for i ← 0 to n - 1, in parallel
     3      do m[i] ← TRUE
     4  for i ← 0 to n - 1 and j ← 0 to n - 1, in parallel
     5      do if A[i] < A[j]
     6            then m[i] ← FALSE
     7  for i ← 0 to n - 1, in parallel
     8      do if m[i] = TRUE
     9            then max ← A[i]
    10  return max
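A serial sketch of FAST-MAX (an illustration, not from the text): the n² comparisons of lines 4-6 all happen in one step on a CRCW PRAM, but nested loops reproduce their effect. Every write into `m[i]` writes the same value, FALSE, so the parallel version is a legal common-CRCW algorithm.

```python
def fast_max(A):
    n = len(A)
    m = [True] * n                 # m[i]: "A[i] has lost no comparison"
    for i in range(n):             # on the PRAM: all pairs at once, O(1)
        for j in range(n):
            if A[i] < A[j]:
                m[i] = False       # concurrent writers all agree on FALSE
    for i in range(n):
        if m[i]:
            return A[i]            # on the PRAM: max <- A[i]

print(fast_max([2, 8, 3, 8, 1]))   # 8
```
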

In a sense, the key to FAST-MAX is that a CRCW PRAM is capable of performing a boolean AND of *n* variables in *O*(1) time with *n* processors. (Since this capability holds in the common-CRCW model, it holds in the more powerful CRCW PRAM models as well.) The code actually performs several AND's at once, computing for *i* = 0, 1, . . . , *n* - 1,

We now know that CRCW algorithms can solve some problems more quickly than can EREW algorithms. Moreover, any EREW algorithm can be executed on a CRCW PRAM. Thus, the CRCW model is strictly more powerful than the EREW model. But how much more powerful is it? In Section 30.3, we shall show that a *p*-processor EREW PRAM can sort *p* numbers in *O*(lg *p*) time. We now use this result to provide a theoretical upper bound on the power of a CRCW PRAM over an EREW PRAM.

Other models for concurrent writing can be simulated as well. (See Exercise 30.2-9.)

Suppose we know that a forest of binary trees consists of only a single tree with *n* nodes. Show that with this assumption, a CREW implementation of FIND-ROOTS can be made to run in *O*(1) time, independent of the depth of the tree. Argue that any EREW algorithm takes Ω(lg *n*) time.

Give an EREW algorithm for FIND-ROOTS that runs in *O*(lg *n*) time on a forest of *n* nodes.

Give an *n*-processor CRCW algorithm that can compute the OR of *n* boolean values in *O*(1) time.

Describe an efficient CRCW algorithm to multiply two *n* × *n* boolean matrices using *n*^{3} processors.

Describe an *O*(lg *n*)-time EREW algorithm to multiply two *n* × *n* matrices of real numbers using *n*^{3} processors. Is there a faster common-CRCW algorithm? Is there a faster algorithm in one of the stronger CRCW models?

Prove that for any constant ε > 0, there is an *O*(1)-time CRCW algorithm using *O*(*n*^{1+ε}) processors to find the maximum element of an *n*-element array.

Show how to merge two sorted arrays, each with *n* numbers, in *O*(1) time using a priority-CRCW algorithm. Describe how to use this algorithm to sort in *O*(lg *n*) time. Is your sorting algorithm work-efficient?

A *combinational circuit* is an acyclic network of

Any depth-*d*, size-*n* combinational circuit with bounded fan-in can be simulated by a *p*-processor CREW algorithm in *O*(*n*/*p* + *d*) time.

Consider the *n*_{i} combinational elements at depth

Any depth-*d*, size-*n* combinational circuit with bounded fan-in and fan-out can be simulated on a *p*-processor EREW PRAM in *O*(*n*/*p* + *d*) time.

Corollary 30.3 provides us with a fast EREW sorting algorithm. As explained in the chapter notes of Chapter 28, the AKS sorting network can sort *n* numbers in *O*(lg *n*) depth using *O*(*n* lg *n*) comparators. Since comparators have bounded fan-in, there is an EREW algorithm to sort *n* numbers in *O*(lg *n*) time using *n* processors. (We used this result in Theorem 30.1 to show that an EREW PRAM can simulate a CRCW PRAM with at most logarithmic slowdown.) Unfortunately, the constants hidden by the *O*-notation are so large that this sorting algorithm is of solely theoretical interest. More practical EREW sorting algorithms have been discovered, however, notably the parallel merge-sorting algorithm due to Cole [46].

Now suppose that we have a PRAM algorithm that uses at most *p* processors, but we have a PRAM with only *p*′ < *p* processors. We would like to be able to run the *p*-processor algorithm on the smaller *p*′-processor PRAM in a work-efficient fashion. By using the idea in the proof of Brent's theorem, we can give a condition for when this is possible.

Prove a result analogous to Brent's theorem for a CRCW simulation of boolean combinational circuits having AND and OR gates with unbounded fan-in. (*Hint:* Let the "size" be the total number of inputs to gates in the circuit.)

Show that a parallel prefix computation on *n* values stored in an array of memory can be implemented in *O*(lg *n*) time on an EREW PRAM using *O*(*n*/lg *n*) processors. Why does this result not extend immediately to a list of *n* values?

Show how to multiply an *n* × *n* matrix *A* by an *n*-vector *b* in *O*(lg *n*) time with a work-efficient EREW algorithm. (*Hint:* Construct a combinational circuit for the problem.)

Give a CRCW algorithm using *n*^{2} processors to multiply two *n* × *n* matrices. The algorithm should be work-efficient with respect to the normal Θ(*n*^{3})-time serial algorithm for multiplying matrices. Can you make the algorithm EREW?

Some parallel models allow processors to become inactive, so that the number of processors executing at any step varies. Define the work in this model as the total number of steps executed during an algorithm by active processors. Show that any CRCW algorithm that performs *w* work and runs in *t* time can be run on a *p*-processor EREW PRAM in *O*((*w*/*p* + *t*) lg *p*) time. (*Hint:* The hard part is scheduling the active processors *while* the computation is proceeding.)

In Section 30.1.2, we examined an *O*(lg *n*)-time EREW algorithm LIST-RANK that can perform a prefix computation on an *n*-object linked list. The algorithm uses *n* processors and performs Θ(*n* lg *n*) work. Since we can easily perform a prefix computation in Θ(*n*) time on a serial machine, LIST-RANK is not work-efficient.

This section presents a randomized EREW parallel prefix algorithm that is work-efficient. The algorithm uses Θ(*n*/lg *n*) processors, and it runs in *O*(lg *n*) time with high probability. Thus, it is work-efficient with high probability. Moreover, by Theorem 30.4, this algorithm immediately yields work-efficient algorithms for any number *p* = *O*(*n*/lg *n*) of processors.

The randomized parallel prefix algorithm RANDOMIZED-LIST-PREFIX operates on a linked list of *n* objects using *p* = Θ(*n*/lg *n*) processors. During the algorithm, each processor is responsible for *n*/*p* = Θ(lg *n*) of the objects in the original list. The objects are assigned to processors arbitrarily (not necessarily contiguously) before the recursion begins, and "ownership" of objects never changes. For convenience, we assume that the list is doubly linked, since doubly linking a singly linked list takes *O*(1) time.

1. At most one object of those belonging to a given processor is selected for elimination.

2. No two adjacent objects are selected for elimination.

1. The processor picks an object *i* that has not previously been selected from among those it owns.

2. It then "flips a coin," choosing the values HEAD and TAIL with equal probability.

Since each recursive step of RANDOMIZED-LIST-PREFIX runs in *O*(1) time, to analyze the algorithm we need only determine how many steps it takes to eliminate all the objects in the original list. At each step, a processor has at least probability 1/4 of eliminating the object *i* it picks. Why? It flips HEAD with probability 1/2, and the probability that it either does not pick *next*[*i*] or picks it and flips TAIL is at least 1/2. Since the two coin flips are independent events, we can multiply their probabilities, yielding a probability of at least 1/4 that a processor eliminates the object it picks. Since each processor owns Θ(lg *n*) objects, the expected time for a processor to eliminate all its objects is *O*(lg *n*).

Our high-probability argument is based on observing that the experiment of a given processor eliminating the objects it picks can be viewed as a sequence of Bernoulli trials (see Chapter 6). The experiment is a success if the object is selected for elimination, and it is a failure otherwise. Since we are interested in showing that the probability is small that very few successes are obtained, we can assume that successes occur with probability exactly 1/4, rather than with probability at least 1/4. (See Exercises 6.4-8 and 6.4-9 for a formal justification of similar assumptions.)

Draw figures to illustrate what can go wrong in RANDOMIZED-LIST-PREFIX if two adjacent list objects are selected for elimination.

Consider a situation in which two processors wish to acquire mutually exclusive access to an object. How can the processors determine which should acquire access first? We wish to avoid the scenario in which both are granted access, as well as the scenario in which neither is granted access. The problem of choosing one of the processors is an example of *symmetry breaking*. We have all seen the momentary confusion and diplomatic impasses that arise when two people attempt to go through a door simultaneously. Similar symmetry-breaking problems are pervasive in the design of parallel algorithms, and efficient solutions are extremely useful.

We shall use the same idea, but in a much more clever fashion, in an algorithm to break symmetry in an *n*-object linked list. The goal is to choose a constant fraction of the objects in the list but to avoid picking two adjacent objects. This algorithm can be performed with *n* processors in *O*(lg^{*} *n*) time by a deterministic EREW algorithm. Since lg^{*} *n* ≤ 5 for all *n* ≤ 2^{65536}, the value lg^{*} *n* can be viewed as a small constant for all practical purposes (see page 36).

A *coloring* of an undirected graph

An **independent set** of a graph

The algorithm SIX-COLOR computes a 6-coloring of a list. We won't give pseudocode for the algorithm, but we shall describe it in some detail. We assume that initially each object *x* in the linked list is associated with a distinct processor *P*(*x*) ∈ {0, 1, . . . , *n* - 1}.

*C*_{k+1}[*x*] = ⟨*i*, *a*_{i}⟩

&nbsp;&nbsp;&nbsp;&nbsp;= ⟨*i*_{lg r-1}, *i*_{lg r-2}, . . . , *i*_{0}, *a*_{i}⟩.

lg^{*} *n* = min {*i* ≥ 0 : lg^{(i)} *n* ≤ 1} .
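The iterated logarithm can be computed directly from this definition. A small sketch (an illustration, not from the text; it uses floating-point logarithms, which is harmless here since we only count iterations):

```python
import math

def lg_star(n):
    """Smallest i >= 0 such that applying lg i times brings n down to <= 1."""
    i = 0
    while n > 1:
        n = math.log2(n)
        i += 1
    return i

# lg* grows extraordinarily slowly: even 2^65536 gives only 5.
print(lg_star(2), lg_star(16), lg_star(65536), lg_star(2 ** 65536))
# 1 3 4 5
```
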

*r*_{i} = ⌈lg *r*_{i-1}⌉ + 1

&nbsp;&nbsp;&nbsp;&nbsp;≤ lg(lg^{(i-1)} *n* + 2) + 1

&nbsp;&nbsp;&nbsp;&nbsp;≤ lg(lg^{(i-1)} *n* + 3) + 1

&nbsp;&nbsp;&nbsp;&nbsp;≤ lg(2 lg^{(i-1)} *n*) + 1

&nbsp;&nbsp;&nbsp;&nbsp;= lg(lg^{(i-1)} *n*) + 1 + 1

&nbsp;&nbsp;&nbsp;&nbsp;= lg^{(i)} *n* + 2.

Coloring is the hard part of symmetry breaking. The EREW algorithm LIST-MIS uses *n* processors to find a maximal independent set in *O*(*c*) time given a *c*-coloring of an *n*-object list. Thus, once we have computed a 6-coloring of a list, we can find a maximal independent set of the linked list in *O*(1) time.

Give an efficient PRAM algorithm to *O*(1)-color a degree-3 graph. Analyze your algorithm.

A *k-ruling set* of a linked list is a set of objects (rulers) in the list such that no rulers are adjacent and at most

30-1 Segmented parallel prefix

Like an ordinary prefix computation, a *segmented prefix computation* is defined in terms of a binary, associative operator ⊗. It takes an input sequence

*a.* Define the operator on ordered pairs (

*c.* Describe an

30-2 Processor-efficient maximum algorithm

We wish to find the maximum of *n* numbers on a CRCW PRAM with *p* = *n* processors.

In this problem, we investigate an arbitrary-CRCW algorithm for computing the connected components of an undirected graph *G* = (*V*, *E*) that uses *O*(*V* + *E*) processors. The data structure used is a disjoint-set forest (see Section 22.3). Each vertex *v* ∈ *V* maintains a pointer *p*[*v*] to a parent. Initially, *p*[*v*] = *v*: the vertex points to itself. At the end of the algorithm, for any two vertices *u*, *v* ∈ *V*, we have *p*[*u*] = *p*[*v*] if and only if *u* and *v* are in the same connected component of *G*. During the algorithm, the *p* pointers form a forest of rooted *pointer* trees. A

    HOOK(G)
    1  STAR(G)
    2  for each edge (u, v) ∈ E[G], in parallel
    3      do if star[u] and p[u] > p[v]
    4            then p[p[u]] ← p[v]
    5  STAR(G)
    6  for each edge (u, v) ∈ E[G], in parallel
    7      do if star[u] and p[u] ≠ p[v]
    8            then p[p[u]] ← p[v]

    JUMP(G)
    1  for each v ∈ V[G], in parallel
    2      do p[v] ← p[p[v]]

*a.* Give pseudocode for STAR(

*d.* Prove that after the initial HOOK, there are no pointer trees of height 0 and (

*e.* Argue that after the initial HOOK, subsequent HOOK operations never increase (

*h.* Conclude that the algorithm determines all the connected components of

30-4 Transposing a raster image

A raster-graphics frame buffer can be viewed as a *p* × *p* matrix *M* of bits. The raster-graphics display hardware makes the *n* × *n* upper left submatrix of *M* visible on the user's screen. A BITBLT operation (BLock Transfer of BITs) is used to move a rectangle of bits from one position to another. Specifically, BITBLT(*r*_{1}, *c*_{1}, *r*_{2}, *c*_{2}, *nr*, *nc*, \*) sets

*M*[*r*_{2} + *i*, *c*_{2} + *j*] ← *M*[*r*_{2} + *i*, *c*_{2} + *j*] \* *M*[*r*_{1} + *i*, *c*_{1} + *j*]
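The assignment above can be sketched as follows (an illustration, not from the text; the function name and the snapshot-read behavior for overlapping rectangles are my own choices, and `op` stands for the boolean operator \*):

```python
def bitblt(M, r1, c1, r2, c2, nr, nc, op):
    """Combine the nr-by-nc source rectangle at (r1, c1) into the
    destination rectangle at (r2, c2), elementwise, using op."""
    # Read the source first so overlapping rectangles see a snapshot.
    src = [[M[r1 + i][c1 + j] for j in range(nc)] for i in range(nr)]
    for i in range(nr):
        for j in range(nc):
            M[r2 + i][c2 + j] = op(M[r2 + i][c2 + j], src[i][j])

# Copy a 2x2 block from the top-left corner to the bottom-right, with an
# op that simply ignores the old destination bit:
M = [[1, 1, 0, 0],
     [1, 0, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
bitblt(M, 0, 0, 2, 2, 2, 2, lambda old, new: new)
print(M[2], M[3])   # [0, 0, 1, 1] [0, 0, 1, 0]
```
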

The theory of parallel computing began in the late 1940s when J. von Neumann [38] introduced a restricted model of parallel computing called a cellular automaton, which is essentially a two-dimensional array of finite-state processors interconnected in meshlike fashion. The PRAM model was formalized in 1978 by Fortune and Wyllie [73], although many other authors had previously discussed essentially similar models.