In this note I will describe an algorithm for solving the longest common substring problem. Let’s say we are trying to decrypt encrypted binary data, let’s first try to find common patterns using the longest substring search.
Input string for example:
adasDATAHEADER??jpjjwerthhkjbcvkDATAHEADER??kkasdf
We are looking for a string that is repeated twice:
DATAHEADER??
Prefixes
First, let’s write a method for comparing the prefixes of two strings, let it return the resulting string in which the characters of the left prefix are equal to the characters of the right prefix.
For example, for the lines:
val lhs = "asdfWUKI"
val rhs = "asdfIKUW"
The resulting string is – asdf
Example in Kotlin:
fun longestPrefix(lhs: String, rhs: String): String {
val maximalLength = min(lhs.length-1, rhs.length -1)
for (i in 0..maximalLength) {
val xChar = lhs.take(i)
val yChar = rhs.take(i)
if (xChar != yChar) {
return lhs.substring(0, i-1)
}
}
return lhs.substring(0,maximalLength)
}
Brute Force
When it doesn’t work out the good way, it’s worth resorting to brute force. Using the longestPrefix method, we’ll go through the string in two loops, the first one takes the string from i to the end, the second one from i + 1 to the end, and passes them to the search for the largest prefix. The time complexity of this algorithm is approximately O(n^2) ~ O(n*^3).
Example in Kotlin:
fun searchLongestRepeatedSubstring(searchString: String): String {
var longestRepeatedSubstring = ""
for (x in 0..searchString.length-1) {
val lhs = searchString.substring(x)
for (y in x+1..searchString.length-1) {
val rhs = searchString.substring(y)
val longestPrefix = longestPrefix(lhs, rhs)
if (longestRepeatedSubstring.length < longestPrefix.length) {
longestRepeatedSubstring = longestPrefix
}
}
}
return longestRepeatedSubstring
}
Suffix array
For a more elegant solution, we need a tool - a data structure called "Suffix Array". This data structure is an array of substrings filled in a loop, where each substring starts from the next character of the string to the end.
For example for the line:
adasDATAHEADER??
The suffix array looks like this:
adasDATAHEADER??
dasDATAHEADER??
asDATAHEADER??
sDATAHEADER??
DATAHEADER??
ATAHEADER??
TAHEADER??
AHEADER??
HEADER??
EADER??
ADER??
DER??
ER??
R??
??
?
Solving by sorting
We sort the suffix array, then we go through all the elements in a loop where the left hand (lhs) contains the current element, the right hand (rhs) contains the next one, and we calculate the longest prefix using the longestPrefix method.
Example in Kotlin:
fun searchLongestRepeatedSubstring(searchString: String): String {
val suffixTree = suffixArray(searchString)
val sortedSuffixTree = suffixTree.sorted()
var longestRepeatedSubstring = ""
for (i in 0..sortedSuffixTree.count() - 2) {
val lhs = sortedSuffixTree[i]
val rhs = sortedSuffixTree[i+1]
val longestPrefix = longestPrefix(lhs, rhs)
if (longestRepeatedSubstring.length < longestPrefix.length) {
longestRepeatedSubstring = longestPrefix
}
}
return longestRepeatedSubstring
}
The time complexity of the algorithm is O(N log N), which is much better than a brute-force solution.
Sources
https://en.wikipedia.org/wiki/Longest_common_substring_problem
Source code
https://gitlab.com/demensdeum/algorithms