Finding all occurrences of a pattern in a text is a problem that arises frequently in text-editing programs. Typically, the text is a document being edited, and the pattern searched for is a particular word supplied by the user. Efficient algorithms for this problem can greatly aid the responsiveness of the text-editing program. String-matching algorithms are also used, for example, to search for particular patterns in DNA sequences.

We formalize the *string-matching problem* as follows. We assume that the text is an array *T*[1 . . *n*] of length *n* and that the pattern is an array *P*[1 . . *m*] of length *m* ≤ *n*. We further assume that the elements of *P* and *T* are characters drawn from a finite alphabet Σ. For example, we may have Σ = {0, 1} or Σ = {a, b, . . . , z}. The character arrays *P* and *T* are often called *strings* of characters.

We say that pattern *P* **occurs with shift** *s* in text *T* (or, equivalently, that pattern *P* occurs beginning at position *s* + 1 in text *T*) if 0 ≤ *s* ≤ *n* - *m* and *T*[*s* + 1 . . *s* + *m*] = *P*[1 . . *m*]. If *P* occurs with shift *s* in *T*, then we call *s* a **valid shift**; otherwise, we call *s* an **invalid shift**. The string-matching problem is the problem of finding all valid shifts with which a given pattern *P* occurs in a given text *T*.

We shall let Σ* (read "sigma-star") denote the set of all finite-length strings formed using characters from the alphabet Σ. In this chapter, we consider only finite-length strings. The zero-length *empty string*, denoted ε, also belongs to Σ*. The length of a string *x* is denoted |*x*|. The *concatenation* of two strings *x* and *y*, denoted *xy*, has length |*x*| + |*y*| and consists of the characters from *x* followed by the characters from *y*.

We say that a string *w* is a *prefix* of a string *x*, denoted *w* ⊏ *x*, if *x* = *wy* for some string *y* ∈ Σ*. Similarly, we say that a string *w* is a *suffix* of a string *x*, denoted *w* ⊐ *x*, if *x* = *yw* for some *y* ∈ Σ*.

Suppose that *x*, *y*, and *z* are strings such that *x* ⊐ *z* and *y* ⊐ *z*. If |*x*| ≤ |*y*|, then *x* ⊐ *y*. If |*x*| ≥ |*y*|, then *y* ⊐ *x*. If |*x*| = |*y*|, then *x* = *y*.

*Proof* See Figure 34.2 for a graphical proof.

The naive algorithm finds all valid shifts using a loop that checks the condition *P*[1 . . *m*] = *T*[*s + 1 . . s + m*] for each of the *n* - *m* + 1 possible values of *s*.

NAIVE-STRING-MATCHER(T, P)
1  n ← length[T]
2  m ← length[P]
3  for s ← 0 to n - m
4       do if P[1 . . m] = T[s + 1 . . s + m]
5             then print "Pattern occurs with shift" s
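The pseudocode translates almost line for line into Python. The sketch below (the function name is ours, and shifts are 0-indexed rather than the book's 1-indexed convention) is only a cross-check, not an optimized implementation:

```python
def naive_string_matcher(text: str, pattern: str) -> list[int]:
    """Return every valid shift s (0-indexed) at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):
        # Compare the pattern against the length-m window of text at shift s.
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts
```

Each window comparison costs O(m), which is where the O((n - m + 1)m) total bound comes from.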

Suppose we allow the pattern *P* to contain occurrences of a *gap character* that can match an *arbitrary* string of characters (even one of zero length).

Rabin and Karp have proposed a string-matching algorithm that performs well in practice and that also generalizes to other algorithms for related problems, such as two-dimensional pattern matching. The worst-case running time of the Rabin-Karp algorithm is *O*((*n* - *m* + 1)*m*), but it has a good average-case running time.

We can compute *p* in time *O*(*m*) using Horner's rule (see Section 32.1):

p = P[m] + 10(P[m - 1] + 10(P[m - 2] + ··· + 10(P[2] + 10P[1]) ··· )) .

The value *t*_{0} can be similarly computed from *T*[1 . . *m*] in time O(*m*).

t_{s+1} = 10(t_{s} - 10^{m-1} T[s + 1]) + T[s + m + 1] .

t_{s+1} = 10(31415 - 10000 · 3) + 2

= 14152 .
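The digit-rolling arithmetic in this example is easy to check mechanically. A tiny Python sketch of the decimal update (the helper name and the m = 5 default are ours):

```python
def roll(t_s: int, old_digit: int, new_digit: int, m: int = 5) -> int:
    # Remove the high-order digit, shift the window left one decimal place,
    # and bring in the new low-order digit.
    return 10 * (t_s - 10 ** (m - 1) * old_digit) + new_digit
```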

t_{s+1} = (d(t_{s} - T[s + 1]h) + T[s + m + 1]) mod q ,

The ointment of working modulo *q* now contains a fly, however, since *t*_{s} ≡ *p* (mod *q*) does not imply that *t*_{s} = *p*. On the other hand, if *t*_{s} ≢ *p* (mod *q*), then we definitely have *t*_{s} ≠ *p*, so that shift *s* is invalid.

RABIN-KARP-MATCHER(T, P, d, q)
1  n ← length[T]
2  m ← length[P]
3  h ← d^{m-1} mod q
4  p ← 0
5  t_{0} ← 0
6  for i ← 1 to m
7       do p ← (dp + P[i]) mod q
8          t_{0} ← (dt_{0} + T[i]) mod q
9  for s ← 0 to n - m
10      do if p = t_{s}
11            then if P[1 . . m] = T[s + 1 . . s + m]
12                    then print "Pattern occurs with shift" s
13         if s < n - m
14            then t_{s+1} ← (d(t_{s} - T[s + 1]h) + T[s + m + 1]) mod q
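A compact Python sketch of the same procedure. The radix d = 256 (treating characters as byte values) and the modulus q = 101 are illustrative choices of ours, not values fixed by the text:

```python
def rabin_karp_matcher(text: str, pattern: str, d: int = 256, q: int = 101) -> list[int]:
    """Return the 0-indexed valid shifts, explicitly verifying every hash hit."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)              # d^(m-1) mod q: weight of the high-order character
    p = t = 0
    for i in range(m):                # hash the pattern and the first text window
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q
    shifts = []
    for s in range(n - m + 1):
        if p == t and text[s:s + m] == pattern:   # explicit check rules out spurious hits
            shifts.append(s)
        if s < n - m:                 # roll the hash to the window at shift s + 1
            t = (d * (t - ord(text[s]) * h) + ord(text[s + m])) % q
    return shifts
```

Python's `%` operator always returns a nonnegative result, so no extra correction is needed after the subtraction inside the rolling update.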

The running time of RABIN-KARP-MATCHER is Θ((*n* - *m* + 1)*m*) in the worst case, since (like the naive string-matching algorithm) the Rabin-Karp algorithm explicitly verifies every valid shift. If *P* = a^{m} and *T* = a^{n}, then the verifications take time Θ((*n* - *m* + 1)*m*), since each of the *n* - *m* + 1 possible shifts is valid.

O(n) + O(m(v + n/q)) ,

Alice has a copy of a long *n*-bit file *A* = ⟨*a*_{n-1}, *a*_{n-2}, . . . , *a*_{0}⟩, and Bob similarly has an *n*-bit file *B* = ⟨*b*_{n-1}, *b*_{n-2}, . . . , *b*_{0}⟩. Alice and Bob wish to know if their files are identical. To avoid transmitting all of *A* or *B*, they use the following fast probabilistic check. Together, they select a prime *q* > 1000*n* and randomly select an integer *x* from {0, 1, . . . , *q* - 1}. Then, Alice evaluates *A*(*x*) = (Σ_{i=0}^{n-1} *a*_{i}*x*^{i}) mod *q* and Bob similarly evaluates *B*(*x*). Prove that if *A* ≠ *B*, there is at most one chance in 1000 that *A*(*x*) = *B*(*x*), whereas if the two files are the same, *A*(*x*) is necessarily the same as *B*(*x*).

Many string-matching algorithms build a finite automaton that scans the text string *T* for all occurrences of the pattern *P*. This section presents a method for building such an automaton. These string-matching automata are very efficient: they examine each text character *exactly once*, taking constant time per text character. The time used--after the automaton is built--is therefore Θ(*n*). The time to build the automaton, however, can be large if Σ is large. Section 34.4 describes a clever way around this problem.

A **finite automaton** *M* is a 5-tuple (*Q*, *q*_{0}, *A*, Σ, δ), where

*Q* is a finite set of **states**,

*q*_{0} ∈ *Q* is the **start state**,

*A* ⊆ *Q* is a distinguished set of **accepting states**,

Σ is a finite **input alphabet**,

δ is a function from *Q* × Σ into *Q*, called the **transition function** of *M*.

The finite automaton begins in state *q*_{0} and reads the characters of its input string one at a time. If the automaton is in state *q* and reads input character *a*, it moves ("makes a transition") from state *q* to state δ(*q*, *a*). Whenever its current state *q* is a member of *A*, the machine *M* is said to have *accepted* the string read so far. An input that is not accepted is said to be *rejected*.

A finite automaton *M* induces a function φ, called the *final-state function*, from Σ* to *Q* such that φ(*w*) is the state *M* ends up in after scanning the string *w*. Thus, *M* accepts a string *w* if and only if φ(*w*) ∈ *A*. The function φ is defined by the recursive relation

φ(ε) = q_{0} ,

φ(wa) = δ(φ(w), a)   for w ∈ Σ*, a ∈ Σ.
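The recursion for φ unrolls into a single left-to-right scan. In the sketch below, the transition function is represented as a plain dict keyed by (state, character) pairs; that representation, and the sample two-state automaton used to exercise it, are our own illustration:

```python
def final_state(delta: dict, q0: int, w: str) -> int:
    """Compute phi(w): iterate the transition function once per input character."""
    q = q0
    for a in w:
        q = delta[(q, a)]   # phi(wa) = delta(phi(w), a)
    return q
```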

There is a string-matching automaton for every pattern *P*; this automaton must be constructed from the pattern in a preprocessing step before it can be used to search the text string. Figure 34.6 illustrates this construction for the pattern *P* = ababaca. From now on, we shall assume that *P* is a given fixed pattern string; for brevity, we shall not indicate the dependence upon *P* in our notation.

In order to specify the string-matching automaton corresponding to a given pattern *P*[1 . . *m*], we first define an auxiliary function σ, called the *suffix function* corresponding to *P*. The function σ is a mapping from Σ* to {0, 1, . . . , *m*} such that σ(*x*) is the length of the longest prefix of *P* that is a suffix of *x*: σ(*x*) = max {*k* : *P*_{k} ⊐ *x*}.

We define the string-matching automaton corresponding to a given pattern *P*[1 . . *m*] as follows. The state set *Q* is {0, 1, . . . , *m*}. The start state *q*_{0} is state 0, and state *m* is the only accepting state.

The transition function δ is defined by the following equation, for any state *q* and character *a*:

δ(q, a) = σ(P_{q}a) .

φ(T_{i}) = σ(T_{i}) ;

FINITE-AUTOMATON-MATCHER(T, δ, m)
1  n ← length[T]
2  q ← 0
3  for i ← 1 to n
4       do q ← δ(q, T[i])
5          if q = m
6             then s ← i - m
7                  print "Pattern occurs with shift" s
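A Python sketch of this scanning loop, with δ supplied as a dict keyed by (state, character). Characters with no entry fall back to state 0, a simplifying assumption of this sketch, and shifts are 0-indexed:

```python
def finite_automaton_matcher(text: str, delta: dict, m: int) -> list[int]:
    """Scan text once; report shift i + 1 - m whenever state m is reached."""
    q = 0
    shifts = []
    for i, a in enumerate(text):
        q = delta.get((q, a), 0)   # assumed fallback for characters outside the alphabet
        if q == m:                 # state m means the pattern has just been matched
            shifts.append(i + 1 - m)
    return shifts
```

Note that each text character triggers exactly one transition, matching the constant-time-per-character claim above.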

For any string *x* and character *a*, we have σ(*xa*) ≤ σ(*x*) + 1.

For any string *x* and character *a*, if *q* = σ(*x*), then σ(*xa*) = σ(*P*_{q}*a*).

Now, we assume that φ(*T*_{i}) = σ(*T*_{i}) and prove that φ(*T*_{i+1}) = σ(*T*_{i+1}). Let *q* denote φ(*T*_{i}), and let *a* denote *T*[*i* + 1]. Then,

By induction, the theorem is proved.

The following procedure computes the transition function δ from a given pattern *P*[1 . . *m*].

COMPUTE-TRANSITION-FUNCTION(P, Σ)
1  m ← length[P]
2  for q ← 0 to m
3       do for each character a ∈ Σ
4            do k ← min(m + 1, q + 2)
5               repeat k ← k - 1
6                  until P_{k} ⊐ P_{q}a
7               δ(q, a) ← k
8  return δ
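The procedure maps directly onto Python's string slicing: P_k is `pattern[:k]`, and the suffix test P_k ⊐ P_q a is `endswith`. A sketch (the function name is ours; like the pseudocode, it makes no attempt to be faster than O(m³|Σ|)):

```python
def compute_transition_function(pattern: str, alphabet: str) -> dict:
    """Build delta[(q, a)] = sigma(P_q a) for states 0..m and each character a."""
    m = len(pattern)
    delta = {}
    for q in range(m + 1):
        for a in alphabet:
            k = min(m + 1, q + 2)
            # Decrease k until the length-k prefix P_k is a suffix of P_q a.
            # k = 0 always succeeds, since the empty prefix is a suffix of everything.
            while True:
                k -= 1
                if (pattern[:q] + a).endswith(pattern[:k]):
                    break
            delta[(q, a)] = k
    return delta
```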

We call a pattern *P* *nonoverlappable* if *P*_{k} ⊐ *P*_{q} implies *k* = 0 or *k* = *q*.

Given two patterns *P* and *P*', describe how to construct a finite automaton that determines all occurrences of *either* pattern. Try to minimize the number of states in your automaton.

Given a pattern *P* containing gap characters (see Exercise 34.1-5), show how to build a finite automaton that can find an occurrence of *P* in a text *T* in *O*(*n*) time, where *n* = |*T*|.

We now present a linear-time string-matching algorithm due to Knuth, Morris, and Pratt. Their algorithm achieves a Θ(*n* + *m*) running time by avoiding the computation of the transition function δ altogether; it does the pattern matching using just an auxiliary function π[1 . . *m*] precomputed from the pattern in time *O*(*m*). The array π allows the transition function δ to be computed efficiently (in an amortized sense) "on the fly" as needed. Roughly speaking, for any state *q* = 0, 1, . . . , *m* and any character *a* ∈ Σ, the value π[*q*] contains the information that is independent of *a* and is needed to compute δ(*q*, *a*). (This remark will be clarified shortly.) Since the array π has only *m* entries, whereas δ has *O*(*m*|Σ|) entries, we save a factor of |Σ| in the preprocessing by computing π rather than δ.

The prefix function π for a pattern encapsulates knowledge about how the pattern matches against shifts of itself. This information can be used to avoid testing useless shifts in the naive pattern-matching algorithm or to avoid the precomputation of δ for a string-matching automaton.

P[1 . . k] = T[s' + 1 . . s' + k] ,

KMP-MATCHER(T, P)
1  n ← length[T]
2  m ← length[P]
3  π ← COMPUTE-PREFIX-FUNCTION(P)
4  q ← 0
5  for i ← 1 to n
6       do while q > 0 and P[q + 1] ≠ T[i]
7               do q ← π[q]
8          if P[q + 1] = T[i]
9             then q ← q + 1
10         if q = m
11            then print "Pattern occurs with shift" i - m
12                 q ← π[q]
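Both procedures carry over to Python with 0-indexed strings, so the book's π[q] becomes `pi[q - 1]` in the list below. A self-contained sketch (function names ours):

```python
def compute_prefix_function(pattern: str) -> list[int]:
    """pi[q] = length of the longest proper prefix of P that is a suffix of P[:q + 1]."""
    m = len(pattern)
    pi = [0] * m
    k = 0
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]          # fall back to the next-shorter border
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi

def kmp_matcher(text: str, pattern: str) -> list[int]:
    """Knuth-Morris-Pratt scan in O(n + m); returns 0-indexed valid shifts."""
    pi = compute_prefix_function(pattern)
    shifts = []
    q = 0                          # number of pattern characters matched so far
    for i, a in enumerate(text):
        while q > 0 and pattern[q] != a:
            q = pi[q - 1]
        if pattern[q] == a:
            q += 1
        if q == len(pattern):
            shifts.append(i + 1 - len(pattern))
            q = pi[q - 1]          # continue, allowing overlapping occurrences
    return shifts
```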

COMPUTE-PREFIX-FUNCTION(P)
1  m ← length[P]
2  π[1] ← 0
3  k ← 0
4  for q ← 2 to m
5       do while k > 0 and P[k + 1] ≠ P[q]
6               do k ← π[k]
7          if P[k + 1] = P[q]
8             then k ← k + 1
9          π[q] ← k
10 return π
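As a sanity check on the linear-time procedure, π can also be computed directly from its definition by brute force (quadratic or worse, but obviously correct). The sketch below uses the book's 1-indexed convention with a dict; the function name is ours:

```python
def prefix_function_bruteforce(pattern: str) -> dict:
    """pi[q] = largest k < q such that P[1 . . k] is a suffix of P[1 . . q] (1-indexed)."""
    pi = {}
    for q in range(1, len(pattern) + 1):
        # k = 0 always qualifies, since the empty prefix is a suffix of every string.
        pi[q] = max(k for k in range(q) if pattern[:q].endswith(pattern[:k]))
    return pi
```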

The running time of COMPUTE-PREFIX-FUNCTION is *O*(*m*), using an amortized analysis (see Chapter 18). We associate a potential of *k* with the current state *k* of the algorithm. This potential has an initial value of 0, by line 3. Line 6 decreases *k* whenever it is executed, since π[*k*] < *k*. Since π[*k*] ≥ 0 for all *k*, however, *k* can never become negative. The only other line that affects *k* is line 8, which increases *k* by at most one during each execution of the **for** loop body. Since *k* < *q* upon entering the **for** loop, and since *q* is incremented in each iteration of the **for** loop body, *k* < *q* always holds. (This justifies the claim that π[*q*] < *q* as well, by line 9.) We can pay for each execution of the **while** loop body on line 6 with the corresponding decrease in the potential function, since π[*k*] < *k*. Line 8 increases the potential function by at most one, so that the amortized cost of the loop body on lines 5-9 is *O*(1). Since the number of outer-loop iterations is *O*(*m*), and since the final potential function is at least as great as the initial potential function, the total actual worst-case running time of COMPUTE-PREFIX-FUNCTION is *O*(*m*).

π*[q] = {π[q], π^{(2)}[q], π^{(3)}[q], . . . , π^{(t)}[q]} ,

Let *P* be a pattern of length *m* with prefix function π. Then, for *q* = 1, 2, . . . , *m*, we have π*[*q*] = {*k* : *k* < *q* and *P*_{k} ⊐ *P*_{q}}.

i ∈ π*[q] implies P_{i} ⊐ P_{q} .

Figure 34.10 illustrates this lemma.

For *q* = 2, 3, . . . , *m*, define the subset *E*_{q-1} ⊆ π*[*q* - 1] by

E_{q-1} = {k : k ∈ π*[q - 1] and P[k + 1] = P[q]} .

Let *P* be a pattern of length *m*, and let π be the prefix function for *P*. For *q* = 2, 3, . . . , *m*, we have π[*q*] = 0 if *E*_{q-1} = ∅, and π[*q*] = 1 + max {*k* : *k* ∈ *E*_{q-1}} if *E*_{q-1} ≠ ∅.

*Proof* If

r = 1 + max {k ∈ π*[q - 1] : P[k + 1] = P[q]} .

Compute the prefix function π for the pattern ababbabbababbababbabb when the alphabet is Σ = {a, b}.

Give a linear-time algorithm to determine if a text *T* is a cyclic rotation of another string *T*'. For example, arc and car are cyclic rotations of each other.

Give an efficient algorithm for computing the transition function δ for the string-matching automaton corresponding to a given pattern *P*. Your algorithm should run in time *O*(*m*|Σ|). (*Hint:* Prove that δ(*q*, *a*) = δ(π[*q*], *a*) if *q* = *m* or *P*[*q* + 1] ≠ *a*.)

If the pattern *P* is relatively long and the alphabet is reasonably large, then an algorithm due to Robert S. Boyer and J. Strother Moore is likely to be the most efficient string-matching algorithm.

BOYER-MOORE-MATCHER(T, P, Σ)
1  n ← length[T]
2  m ← length[P]
3  λ ← COMPUTE-LAST-OCCURRENCE-FUNCTION(P, m, Σ)
4  γ ← COMPUTE-GOOD-SUFFIX-FUNCTION(P, m)
5  s ← 0
6  while s ≤ n - m
7       do j ← m
8          while j > 0 and P[j] = T[s + j]
9               do j ← j - 1
10         if j = 0
11            then print "Pattern occurs at shift" s
12                 s ← s + γ[0]
13            else s ← s + max(γ[j], j - λ[T[s + j]])

When a mismatch occurs, the bad-character heuristic uses information about where the bad text character *T*[*s* + *j*] occurs in the pattern (if it occurs at all) to propose a new shift. In the best case, the mismatch occurs on the first comparison (*P*[*m*] ≠ *T*[*s* + *m*]) and the bad character *T*[*s* + *m*] does not occur in the pattern at all. (Imagine searching for a^{m} in the text string b^{n}.) In this case, the pattern can be shifted entirely past the bad character, so that *s* increases by *m*.

The following simple program defines λ[*a*] to be the index of the rightmost position in the pattern at which character *a* occurs, for each *a* ∈ Σ. If *a* does not occur in the pattern, then λ[*a*] is set to 0. We call λ the *last-occurrence function* for the pattern. With this definition, the expression *j* - λ[*T*[*s* + *j*]] in line 13 of BOYER-MOORE-MATCHER is exactly the shift advance proposed by the bad-character heuristic.

COMPUTE-LAST-OCCURRENCE-FUNCTION(P, m, Σ)
1  for each character a ∈ Σ
2       do λ[a] ← 0
3  for j ← 1 to m
4       do λ[P[j]] ← j
5  return λ

The running time of procedure COMPUTE-LAST-OCCURRENCE-FUNCTION is *O*(|Σ| + *m*).
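In Python the two loops collapse into a dict comprehension plus one pass. A sketch using the book's 1-indexed positions (the function name is ours):

```python
def compute_last_occurrence(pattern: str, alphabet: str) -> dict:
    """lam[a] = rightmost 1-indexed position of a in pattern, or 0 if a never occurs."""
    lam = {a: 0 for a in alphabet}           # default: character absent from pattern
    for j, ch in enumerate(pattern, start=1):
        lam[ch] = j                          # later occurrences overwrite earlier ones
    return lam
```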

Let us define the relation *Q* ~ *R* (read "*Q* is similar to *R*") for strings *Q* and *R* to mean that *Q* ⊐ *R* or *R* ⊐ *Q*. If two strings are similar, then we can align them with their rightmost characters matched, and no pair of aligned characters will disagree. The relation "~" is symmetric: *Q* ~ *R* if and only if *R* ~ *Q*. We also have, as a consequence of Lemma 34.1, that any two suffixes of a common string are similar to each other.

γ[j] = m - max {k : 0 ≤ k < m and P[j + 1 . . m] ~ P_{k}} .

That is, γ[*j*] is the least amount we can advance *s* and not cause any characters in the "good suffix" *T*[*s* + *j* + 1 . . *s* + *m*] to be mismatched against the new alignment of the pattern. The function γ is well defined for all *j*, since *P*[*j* + 1 . . *m*] ~ *P*_{0} for all *j*: the empty string is similar to all strings. We call γ the **good-suffix function** for the pattern *P*.

We can now rewrite our definition of γ as

γ[j] = m - max {k : π[m] ≤ k < m and P[j + 1 . . m] ~ P_{k}} .

If *k* is the largest possible value such that *P*_{k} ~ *P*[*j* + 1 . . *m*] with *k* < *m*, then we claim that

π'[l] = m - j ,

where π' is the prefix function of the reversed pattern *P*' and *l* = (*m* - *k*) + (*m* - *j*) is the position in *P*' at which the corresponding occurrence of the reversed good suffix ends.

γ[j] = m - max({π[m]} ∪ {m - l + π'[l] : 1 ≤ l ≤ m and j = m - π'[l]})

= min({m - π[m]} ∪ {l - π'[l] : 1 ≤ l ≤ m and j = m - π'[l]}) .

Again, the second set may be empty.

We are now ready to examine the procedure for computing γ.

COMPUTE-GOOD-SUFFIX-FUNCTION(P, m)
1  π ← COMPUTE-PREFIX-FUNCTION(P)
2  P' ← reverse(P)
3  π' ← COMPUTE-PREFIX-FUNCTION(P')
4  for j ← 0 to m
5       do γ[j] ← m - π[m]
6  for l ← 1 to m
7       do j ← m - π'[l]
8          if γ[j] > l - π'[l]
9             then γ[j] ← l - π'[l]
10 return γ
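Putting the three procedures together gives a runnable sketch of the whole matcher (0-indexed shifts; helper names ours; the alphabet is passed as a string of its characters). `compute_prefix` below is the standard linear-time prefix-function computation:

```python
def compute_prefix(pattern: str) -> list[int]:
    """0-indexed prefix function: pi[q] is the book's pi[q + 1]."""
    m = len(pattern)
    pi = [0] * m
    k = 0
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi

def compute_good_suffix(pattern: str) -> list[int]:
    """gamma[j] for j = 0..m, following COMPUTE-GOOD-SUFFIX-FUNCTION."""
    m = len(pattern)
    pi = compute_prefix(pattern)
    pi_rev = compute_prefix(pattern[::-1])   # prefix function of the reversed pattern
    gamma = [m - pi[m - 1]] * (m + 1)
    for l in range(1, m + 1):
        j = m - pi_rev[l - 1]
        gamma[j] = min(gamma[j], l - pi_rev[l - 1])
    return gamma

def boyer_moore_matcher(text: str, pattern: str, alphabet: str) -> list[int]:
    """Boyer-Moore scan with the bad-character and good-suffix heuristics."""
    n, m = len(text), len(pattern)
    lam = {a: 0 for a in alphabet}           # last-occurrence function
    for idx, ch in enumerate(pattern, start=1):
        lam[ch] = idx
    gamma = compute_good_suffix(pattern)
    shifts = []
    s = 0
    while s <= n - m:
        j = m                                # compare right to left
        while j > 0 and pattern[j - 1] == text[s + j - 1]:
            j -= 1
        if j == 0:
            shifts.append(s)
            s += gamma[0]
        else:
            # lam.get default 0 covers text characters outside the given alphabet
            s += max(gamma[j], j - lam.get(text[s + j - 1], 0))
    return shifts
```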

Compute the λ and γ functions for the pattern *P* = 0101101201 and the alphabet Σ = {0, 1, 2}.

γ'[j] = m - max {k : 0 ≤ k < m and P[j + 1 . . m] ~ P_{k} and

(k - m + j > 0 implies P[j] ≠ P[k - m + j])} .

34-1 String matching based on repetition factors

Let *y*^{i} denote the concatenation of string *y* with itself *i* times, so that, for example, (ab)^{3} = ababab. We say that a string *x* ∈ Σ* has *repetition factor* *r* if *x* = *y*^{r} for some string *y* ∈ Σ* and some *r* > 0. Let ρ(*x*) denote the largest *r* such that *x* has repetition factor *r*, and let ρ*(*P*) = max {ρ(*P*_{i}) : 1 ≤ *i* ≤ *m*} for a pattern *P*[1 . . *m*].

REPETITION-MATCHER(P, T)
1  m ← length[P]
2  n ← length[T]
3  k ← 1 + ρ*(P)
4  q ← 0
5  s ← 0
6  while s ≤ n - m
7       do if T[s + q + 1] = P[q + 1]
8             then q ← q + 1
9                  if q = m
10                    then print "Pattern occurs with shift" s
11         if q = m or T[s + q + 1] ≠ P[q + 1]
12            then s ← s + max(1, ⌈q/k⌉)
13                 q ← 0

*a*. Prove that denotes functional composition:

*b*. Argue that is an associative operation.

*c*. Argue that

*d*. Prove that (