Applications of String Matching Algorithms

Rashi Bhansali
7 min readMay 26, 2021

--

https://launchberg.com/content/images/2020/07/plagiarism-checker.jpg

Introduction

The string matching problem also known as “the needle in a haystack” is one of the classics. A seemingly simple problem has a broad spectrum of applications including the areas of Document Similarity, Information Security, Document Matching, Bioinformatics (DNA sequence matching), Pattern Recognition among others. In this blog, we attempt to explore a couple of interesting string matching algorithms and their direct applications, namely the Rabin Karp Algorithm (hash based comparison) and the Knuth Morris Pratt Algorithm (character by character based comparison). Both these algorithms perform in linear time. So let’s dive deep into these!

Detecting Plagiarism - Rabin Karp Algorithm

https://www.cmu.edu/teaching/solveproblem/strat-cheating/plagiarism.html

Plagiarism, as defined by the Oxford Dictionary, is the act of copying someone else’s words, work or ideas and passing them off as your own without appropriate acknowledgement.

The advent of the Internet has made everything available at the go of one click making it all the more easier to plagiarize content. However, the same technology can come to the rescue of authentic content by identifying the copied one with the help of plagiarism detectors. Computer software can scan large chunks of data to find similarity and thereby determine the plagiarism percentage of two documents using the Rabin Karp algorithm in addition to preprocessing, stemming, and lemmatization.

Rabin Karp Algorithm

The Rabin-Karp algorithm is a string matching algorithm that uses hash function and rolling hash technique for comparing strings. It was proposed by Rabin and Karp in 1987 in an attempt to improve the complexity of the naïve approach to linear time. The brute force method compares the pattern and text character by character in O(mn) time whereas the Rabin-Karp algorithm compares their hash values thereby having a complexity of O(m+n).

A hash function typically maps input of a certain large arbitrary size to an output of a smaller fixed size. Rolling hash is an algorithm that calculates the hash value without having to rehash the entire string. By reusing previously computed hash values, new ones are computed in constant time.

Hashing does have its drawback in terms of collisions i.e. more than one value from the input space can map to the same hash value. This is like placing n or more pigeons in n-1 holes implying that at least one hole will be shared by more than one pigeon.

https://www.pinterest.de/amp/pin/264164334363094984/

How It Works

1. Precompute the hash value for the pattern to be searched.

2. Iterate through the length of the text :

  • Consider the substring of the text to be a window of length equal to that of the pattern
  • Compute the hash value for the current window and compare with that of the pattern
  • If a match is not found, move to the next window and repeat from step 2.

The previous hash values are used to compute subsequent ones. In the above animation, the hash value of “isi” will be computed from the hash value of “thi” and so on.

  • If a match is found, there is a possibility that it is a spurious hit owing to the non-injective nature of hash functions, e.g.: anagrams like “listen” and “silent” can have the same hash values. Thus, a character by character comparison is needed to confirm the match. In order to reduce the spurious hits, we use modulus with a large enough prime number.

In the above example, we found a match in the 8th window.

Hashing

We compute the hash value of a string using the following hash function:

where:

  • m = no. of characters in the string to be compared
  • a = ASCII value of characters in the string to be compared
  • b = a constant, generally equal to no. of symbols in the alphabet Σ
  • p = any suitable prime number used as modulo

Rolling Hash

We compute our subsequent hash value using the previous one as shown below:

where:

  • Hₙ₋₁ = hash of previous string a₁a₂…aₘ​
  • Hₙ = hash of new string a₂…aₘ​aₘ₊₁

If you didn’t understand any of these jargon formulae, don’t worry. Here’s an example:

Given the text “thisisacat”, you want to compute the hash values of “aca” and “cat” using the hash function mentioned earlier in combination with the rolling hash function.

aca: Hash function

So here, m = 3, b = 26, aca = Σ(a1a2a3), p = 31. For simplicity, assume ASCII value of alphabets a ϵ [1–26] where ASCII (a) = 1, ASCII (b) = 2, …, ASCII (z) = 26.

a1 , a2 , a3 = 1, 3, 1

H(aca) = (1x262 + 3x261 + 1x260) mod 809 = 11

cat: Rolling Hash function

So here, H(cat) = H(aca) — a + t, p = 31. ASCII (a) = 1, ASCII (t) = 20

H(cat) = ((10–1x262 ) x 26 + 20x260 ) mod 31 = 2

Key Points of Rabin Karp Algorithm

  • uses a hashing function and rolling hash
  • preprocessing phase in O(m) time complexity and constant space;
  • searching phase in O(mn) time complexity;
  • O(m+n) expected running time.

Pretty simple, wasn’t it? Let’s look at another interesting application of string matching.

DNA Sequence Matching - KMP Algorithm

DNA sequencing is the process of determining the precise order of nucleotides within a DNA molecule. Nucleotide order determines Amino acid order and by extension , protein structure and functions(proteomics).

https://www.sigmaaldrich.com/technical-documents/articles/biology/sanger-sequencing.html

DNA consists of four bases : Adenine (A), Guanine (G), Cytosine (C ), Thymine (T). Processing of DNA sequences has always been a major problem and a number of solutions were proposed. But they need a significant amount of computational resources and memory. Therefore, we need an efficient pattern matching algorithm for finding DNA sequences in given data.

It was challenging to find a linear time algorithm, but David Knuth and Vaughan Pratt discovered it in the year 1970. James Morris a professor of Computer Science at Carnegie Mellon also discovered this algorithm in the same year. A joint paper was published by them in the year 1977 about the KMP(Knuth-Morris-Pratt) algorithm which focused on bioinformatics and problems like DNA sequence matching and detecting the intensity of a disease.

Knuth-Morris-Pratt Algorithm

The Knuth-Morris-Pratt (KMP) algorithm is a string matching algorithm that is used to search for a Pattern (P), in a given text string (T), in linear time.

While using any naïve string matching approach, if you encounter a mismatch between the pattern and the text, you just move the pattern one position ahead and again start matching the pattern from the start. But the KMP algorithm does something different to make it better.

Key Idea

In the KMP algorithm, the key idea used is to minimize the amount of character comparisons when a mismatch occurs. This can be minimized if we know the following things:

  • Existence of a proper prefix of P in the text T, which can be avoided from re-matching whenever there is a mismatch.
  • The length of that proper prefix.

But how does this algorithm know that? Well, it’s using some help from a small data structure, an array of Longest Proper Prefix which is Suffix (LPS), constructed for the pattern as a preprocessing step for the actual string matching algorithm.

The Image above shows the construction of the LPS array for every index denoted by the red arrow and the vertical bar denoting the substring used for the LPS of the current index. https://towardsdatascience.com/pattern-search-with-the-knuth-morris-pratt-kmp-algorithm-8562407dba5b

The LPS is constructed on the fact that :

Upon a mismatch at a particular position, all the characters previous to this position are matching correctly. This clearly means that a prefix of P occurred amongst these matching characters which is also a suffix of P. In other words, LPS[i] will represent the length of the longest proper prefix of P, which is also a proper suffix of P (considering P till the ith index only).

How It Works

Once the preprocessing is performed and we have the LPS array, the KMP algorithm is not much different from the usual naïve searching algorithm. But one major difference lies in the case where a mismatch occurs. The naïve algorithm will only move one position ahead and start checking from the beginning of the pattern, whereas KMP uses the preprocessed LPS array to continue matching from the same place where the mismatch occurred after shifting the string by certain characters.

Here is one more example of how the KMP algorithm performs DNA sequence matching!

KMP Algorithm Summary

  • performs the comparisons from left to right;
  • preprocessing of pattern takes O(m) space and time complexity where m is the pattern length
  • searching phase in O(m+n) time complexity (independent from the alphabet size)

Conclusion

A variety of other algorithms like Boyer Moore Horspool, Automaton Matching and Aho-Corasick Algorithm among the exact matching ones while fuzzy string matching algorithm in approximate matching have interesting approach and applications in Digital Forensics, Spelling Checkers, Intrusion Detection Systems and Spam Filtering.

--

--