# Applications of String Matching Algorithms

**Introduction**

The string matching problem, also known as "the needle in a haystack", is one of the classics. This seemingly simple problem has a broad spectrum of applications, including document similarity, information security, document matching, bioinformatics (DNA sequence matching), and pattern recognition, among others. In this blog, we explore two interesting string matching algorithms and their direct applications: the **Rabin-Karp algorithm** (hash-based comparison) and the **Knuth-Morris-Pratt algorithm** (character-by-character comparison). Both algorithms run in linear expected time. So let's dive deep into these!

# Detecting Plagiarism - Rabin Karp Algorithm

**Plagiarism**, as defined by the Oxford Dictionary, is *the act of copying someone else’s words, work or ideas and passing them off as your own* *without appropriate acknowledgement.*

The advent of the Internet has made everything available at the click of a button, making it all the easier to plagiarize content. However, the same technology can come to the rescue of authentic content by identifying copies with the help of plagiarism detectors. Computer software can scan large chunks of data to find similarity and thereby determine the plagiarism percentage of two documents using the **Rabin-Karp algorithm**, in combination with preprocessing steps such as stemming and lemmatization.

# Rabin Karp Algorithm

The **Rabin-Karp algorithm** is a string matching algorithm that uses a **hash function** and the **rolling hash technique** for comparing strings. It was proposed by Rabin and Karp in 1987 in an attempt to improve the complexity of the naïve approach to linear time. The **brute force** method compares the **pattern** and **text** character by character in *O(mn)* time, whereas the **Rabin-Karp algorithm** compares their hash values, thereby having an expected complexity of *O(m+n)*.

A **hash function** typically maps input of a large arbitrary size to an output of a smaller fixed size. **Rolling hash** is a technique that calculates the next hash value without having to rehash the entire string: by reusing the previously computed hash value, each new one is computed in constant time.

Hashing does have its drawback in terms of collisions, i.e. more than one value from the input space can map to the same hash value. This is like placing *n* or more pigeons in *n−1* holes, implying that at least one hole will be shared by more than one pigeon.

## How It Works

1. Precompute the hash value for the pattern to be searched.

2. Iterate through the length of the text:

- Consider the substring of the text to be a window of length equal to that of the pattern
- Compute the hash value for the current window and compare with that of the pattern
- If a *match* is *not found*, move to the *next window* and repeat from step 2.

The previous hash values are used to compute subsequent ones. For example, in the text **"thisisacat"**, the hash value of the window *"his"* is computed from the hash value of *"thi"*, and so on.

- If a *match is found*, there is a possibility that it is a *spurious hit* owing to the non-injective nature of hash functions; e.g. anagrams like *"listen"* and *"silent"* can have the same hash values. Thus, a character-by-character comparison is needed to confirm the match. To reduce spurious hits, we take hash values modulo a large enough prime number.

In the **"thisisacat"** example, searching for the pattern **"cat"** finds a match in the 8th window.
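The steps above can be sketched in Python as follows. The base b = 256 and prime p = 101 are illustrative choices, not prescriptive ones:

```python
# A minimal sketch of the Rabin-Karp search described above.
# Assumptions: base b = 256 (one byte per character) and prime p = 101.

def rabin_karp(text, pattern, b=256, p=101):
    n, m = len(text), len(pattern)
    if m > n:
        return []
    h = pow(b, m - 1, p)          # b^(m-1) mod p, used to drop the leading char
    p_hash = t_hash = 0
    for i in range(m):            # step 1: precompute both initial hashes
        p_hash = (p_hash * b + ord(pattern[i])) % p
        t_hash = (t_hash * b + ord(text[i])) % p
    matches = []
    for i in range(n - m + 1):    # step 2: slide the window over the text
        if t_hash == p_hash and text[i:i + m] == pattern:
            matches.append(i)     # char-by-char check rules out spurious hits
        if i < n - m:             # rolling hash: drop text[i], add text[i+m]
            t_hash = ((t_hash - ord(text[i]) * h) * b + ord(text[i + m])) % p
    return matches

print(rabin_karp("thisisacat", "cat"))  # -> [7]
```

Note that the 8th window starts at index 7 (0-based), which is what the function reports.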

## Hashing

We compute the hash value of a string a₁a₂…aₘ using the following hash function:

H = (a₁·bᵐ⁻¹ + a₂·bᵐ⁻² + … + aₘ·b⁰) mod p

where:

- m = no. of characters in the string to be compared
- a = ASCII value of characters in the string to be compared
- b = a constant, generally equal to no. of symbols in the alphabet Σ
- p = any suitable prime number used as modulo

**Rolling Hash**

We compute each subsequent hash value from the previous one as shown below:

Hₙ = ((Hₙ₋₁ − a₁·bᵐ⁻¹) × b + aₘ₊₁) mod p

where:

- Hₙ₋₁ = hash of previous string a₁a₂…aₘ
- Hₙ = hash of new string a₂…aₘaₘ₊₁

If you didn't understand any of this jargon, don't worry. Here's an example:

Given the text **"thisisacat"**, you want to compute the hash values of **"aca"** and **"cat"** using the hash function mentioned earlier in combination with the rolling hash function.

**aca: Hash function**

Here, m = 3, b = 26, aca = a₁a₂a₃, p = 31. For simplicity, assume the alphabet values a ϵ [1–26], where value(a) = 1, value(b) = 2, …, value(z) = 26.

a₁, a₂, a₃ = 1, 3, 1

H(aca) = (1×26² + 3×26¹ + 1×26⁰) mod 31 = **11**

**cat: Rolling hash function**

Here, H(cat) is computed from H(aca) by removing a and adding t, with p = 31, value(a) = 1, value(t) = 20:

H(cat) = ((11 − 1×26²) × 26 + 20×26⁰) mod 31 = **28**

**Key Points of Rabin Karp Algorithm**

- uses a hashing function and a rolling hash;
- **preprocessing** phase in **O(m) time** complexity and constant space;
- **searching** phase in **O(mn)** worst-case **time** complexity;
- **O(m+n)** expected running time.

Pretty simple, wasn’t it? Let’s look at another interesting application of string matching.

# DNA Sequence Matching - KMP Algorithm

DNA sequencing is the process of determining the precise order of nucleotides within a DNA molecule. Nucleotide order determines amino acid order and, by extension, protein structure and function (proteomics).

DNA consists of four bases: **Adenine (A), Guanine (G), Cytosine (C), Thymine (T)**. Processing of DNA sequences has always been a major problem, and a number of solutions have been proposed, but they need a significant amount of computational resources and memory. Therefore, we need an efficient pattern matching algorithm for finding DNA sequences in given data.

It was challenging to find a linear time algorithm, but Donald Knuth and Vaughan Pratt discovered one in 1970. James Morris, later a professor of Computer Science at Carnegie Mellon, independently discovered the same algorithm that year. They jointly published the KMP (Knuth-Morris-Pratt) algorithm in 1977, and it has since been applied to bioinformatics problems such as DNA sequence matching.

# Knuth-Morris-Pratt Algorithm

The Knuth-Morris-Pratt (KMP) algorithm is a string matching algorithm that is used to search for a pattern (**P**) in a given text string (**T**) in linear time.

While using any naïve string matching approach, if you encounter a mismatch between the pattern and the text, you just move the pattern one position ahead and again start matching the pattern from the start. The KMP algorithm does something different to make this better.

## Key Idea

In the KMP algorithm, the key idea used is to **minimize the amount of character comparisons when a mismatch occurs.** This can be minimized if we know the following things:

- Whether a proper prefix of P occurs among the characters of T already matched, so that it need not be re-matched after a mismatch.
- The length of that proper prefix.

But how does the algorithm know that? It uses some help from a small data structure, an array of the **Longest Proper Prefix which is also a Suffix (LPS)**, constructed for the pattern as a preprocessing step before the actual string matching.

The LPS array is built on the following fact:

Upon a mismatch at a particular position, all the characters before this position have matched correctly. This means that a prefix of P occurred amongst these matched characters which is also a suffix of P. In other words, LPS[i] represents the length of the longest proper prefix of P which is also a proper suffix of P (considering P up to the iᵗʰ index only).

## How It Works

Once the preprocessing is performed and we have the LPS array, the KMP algorithm is not much different from the usual naïve searching algorithm. The one major difference lies in the case where a mismatch occurs: *the **naïve algorithm** moves only one position ahead and **starts checking from the beginning of the pattern***, whereas *KMP uses the preprocessed LPS array to **continue matching from the place where the mismatch occurred**, after shifting the pattern by the appropriate number of characters*.
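Putting the two phases together, here is a self-contained sketch of the full KMP search; the DNA strings are made-up examples:

```python
# A sketch of the KMP search described above: on a mismatch, the pattern
# index falls back via the LPS table instead of restarting from zero.

def kmp_search(text, pattern):
    # preprocessing: build the LPS table for the pattern
    m = len(pattern)
    lps = [0] * m
    length, i = 0, 1
    while i < m:
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        elif length > 0:
            length = lps[length - 1]
        else:
            i += 1
    # searching: i walks the text, j walks the pattern
    matches, j = [], 0
    for i, c in enumerate(text):
        while j > 0 and c != pattern[j]:
            j = lps[j - 1]          # reuse the characters already matched
        if c == pattern[j]:
            j += 1
        if j == m:                  # full match ending at position i
            matches.append(i - m + 1)
            j = lps[j - 1]          # keep going to find overlapping matches
    return matches

print(kmp_search("ACGTACGACGT", "ACG"))  # -> [0, 4, 7]
```

Note that the text index i never moves backwards, which is what makes the searching phase linear in the text length.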


**KMP Algorithm Summary**

- performs the comparisons from left to right;
- preprocessing of the pattern takes **O(m)** space and time complexity, where m is the pattern length;
- searching phase in **O(m+n)** time complexity (independent of the alphabet size).

# Conclusion

A variety of other algorithms, such as Boyer-Moore-Horspool, Automaton Matching, and the Aho-Corasick algorithm among exact matching methods, and fuzzy string matching algorithms among approximate matching methods, have interesting approaches and applications in digital forensics, spelling checkers, intrusion detection systems, and spam filtering.