Longest Common Substring

In this article I will describe an algorithm for solving the longest common substring. Suppose we try to decipher the encrypted binary data, first try to find the common patterns by searching the largest substring.
The input string for example: adasDATAHEADER??jpjjwerthhkjbcvkDATAHEADER??kkasdf
We are looking for a line repeated twice: DATAHEADER??

Prefixes

To begin write method for comparing prefixes of two rows, let returns the resulting string in which a left prefix symbols are the symbols of the right prefix.
For example, for strings:


        val lhs = "asdfWUKI"
        val rhs = "asdfIKUW"

The resulting string – asdf
Example of Kotlin:


fun longestPrefix(lhs: String, rhs: String): String {
        val maximalLength = min(lhs.length-1, rhs.length -1)
        for (i in 0..maximalLength) {
            val xChar = lhs.take(i)
            val yChar = rhs.take(i)
                if (xChar != yChar) {
                    return lhs.substring(0, i-1)
                }
        }
        return lhs.substring(0,maximalLength)
}

Brute Force

When it is impossible for a good, should resort to brute force. Using the method longestPrefix pass on row two cycles, the first line takes from i to the end of the second i + 1 to the end, transmits them to the search for the largest prefix. The time complexity of the algorithm is approximately equal to O (n ^ 2) ~ O (n * ^ 3).
Example of Kotlin:


fun searchLongestRepeatedSubstring(searchString: String): String {
        var longestRepeatedSubstring = ""
        for (x in 0..searchString.length-1) {
            val lhs = searchString.substring(x)
            for (y in x+1..searchString.length-1) {
                val rhs = searchString.substring(y)
                val longestPrefix = longestPrefix(lhs, rhs)
                if (longestRepeatedSubstring.length < longestPrefix.length) {
                    longestRepeatedSubstring = longestPrefix
                }
            }
        }
        return longestRepeatedSubstring
}

Suffix array

For a more elegant solution, we need a tool - a data structure called "suffix array." This data structure is an array of substrings filled in a loop, where every substring starts the next character string to the end.
For example for a row:


adasDATAHEADER??

Suffix array looks like this:


adasDATAHEADER??
dasDATAHEADER??
asDATAHEADER??
sDATAHEADER??
DATAHEADER??
ATAHEADER??
TAHEADER??
AHEADER??
HEADER??
EADER??
ADER??
DER??
ER??
R??
??
?

Solve sorting

Sort the suffix array, and then go through all the elements of the cycle where the left hand (lhs) the current item on the right (rhs) and calculate the next longest prefix using longestPrefix method.
Example of Kotlin:


fun searchLongestRepeatedSubstring(searchString: String): String {
    val suffixTree = suffixArray(searchString)
    val sortedSuffixTree = suffixTree.sorted()

    var longestRepeatedSubstring = ""
    for (i in 0..sortedSuffixTree.count() - 2) {
        val lhs = sortedSuffixTree[i]
        val rhs = sortedSuffixTree[i+1]
        val longestPrefix = longestPrefix(lhs, rhs)
        if (longestRepeatedSubstring.length < longestPrefix.length) {
            longestRepeatedSubstring = longestPrefix
        }
    }
    return longestRepeatedSubstring
}

Time complexity O (N log N), which is much better than brute force algorithm.

References

https://en.wikipedia.org/wiki/Longest_common_substring_problem

Source Code

https://gitlab.com/demensdeum/algorithms