Longest Common Substring

In this article I will describe an algorithm for solving the longest common substring. Suppose we try to decipher the encrypted binary data, first try to find the common patterns by searching the largest substring.
The input string for example: adasDATAHEADER??jpjjwerthhkjbcvkDATAHEADER??kkasdf
We are looking for a line repeated twice: DATAHEADER??

Prefixes

To begin write method for comparing prefixes of two rows, let returns the resulting string in which a left prefix symbols are the symbols of the right prefix.
For example, for strings:


        val lhs = "asdfWUKI"
        val rhs = "asdfIKUW"

The resulting string – asdf
Example of Kotlin:


fun longestPrefix(lhs: String, rhs: String): String {
        val maximalLength = min(lhs.length-1, rhs.length -1)
        for (i in 0..maximalLength) {
            val xChar = lhs.take(i)
            val yChar = rhs.take(i)
                if (xChar != yChar) {
                    return lhs.substring(0, i-1)
                }
        }
        return lhs.substring(0,maximalLength)
}

Brute Force

When it is impossible for a good, should resort to brute force. Using the method longestPrefix pass on row two cycles, the first line takes from i to the end of the second i + 1 to the end, transmits them to the search for the largest prefix. The time complexity of the algorithm is approximately equal to O (n ^ 2) ~ O (n * ^ 3).
Example of Kotlin:


fun searchLongestRepeatedSubstring(searchString: String): String {
        var longestRepeatedSubstring = ""
        for (x in 0..searchString.length-1) {
            val lhs = searchString.substring(x)
            for (y in x+1..searchString.length-1) {
                val rhs = searchString.substring(y)
                val longestPrefix = longestPrefix(lhs, rhs)
                if (longestRepeatedSubstring.length < longestPrefix.length) {
                    longestRepeatedSubstring = longestPrefix
                }
            }
        }
        return longestRepeatedSubstring
}

Suffix array

For a more elegant solution, we need a tool - a data structure called "suffix array." This data structure is an array of substrings filled in a loop, where every substring starts the next character string to the end.
For example for a row:


adasDATAHEADER??

Suffix array looks like this:


adasDATAHEADER??
dasDATAHEADER??
asDATAHEADER??
sDATAHEADER??
DATAHEADER??
ATAHEADER??
TAHEADER??
AHEADER??
HEADER??
EADER??
ADER??
DER??
ER??
R??
??
?

Solve sorting

Sort the suffix array, and then go through all the elements of the cycle where the left hand (lhs) the current item on the right (rhs) and calculate the next longest prefix using longestPrefix method.
Example of Kotlin:


fun searchLongestRepeatedSubstring(searchString: String): String {
    val suffixTree = suffixArray(searchString)
    val sortedSuffixTree = suffixTree.sorted()

    var longestRepeatedSubstring = ""
    for (i in 0..sortedSuffixTree.count() - 2) {
        val lhs = sortedSuffixTree[i]
        val rhs = sortedSuffixTree[i+1]
        val longestPrefix = longestPrefix(lhs, rhs)
        if (longestRepeatedSubstring.length < longestPrefix.length) {
            longestRepeatedSubstring = longestPrefix
        }
    }
    return longestRepeatedSubstring
}

Time complexity O (N log N), which is much better than brute force algorithm.

References

https://en.wikipedia.org/wiki/Longest_common_substring_problem

Source Code

https://gitlab.com/demensdeum/algorithms

0

Insertion Sort, Merge Sort

Insertion Sort

Insertion sort – each element is compared to the previous item in the list and inserted in place of more. Since the items are sorted from first to last, then each successive element is compared with pre-sorted list that *might* reduce the total time. Time complexity O(n^2), that is identical to the bubble sort.

Merge Sort

Merge sort – the list is divided into groups of one element, then the group “merge” in pairs with simultaneous comparison. In my implementation at the merge of pairs elements to the left compared with the right elements, and then moved to the result list, if the items in left are over, then right list elements added to resulting list (their extra comparison is unnecessary, since all the elements in the groups are sorted by iterations)
The operation of this algorithm is very easy to parallelize, pairs merging step can be performed in threads, with thread dispatcher wait.
Algorithm output for single-threaded performance:


["John", "Alice", "Mike", "#1", "Артем", "20", "60", "60", "DoubleTrouble"]
[["John"], ["Alice"], ["Mike"], ["#1"], ["Артем"], ["20"], ["60"], ["60"], ["DoubleTrouble"]]
[["Alice", "John"], ["#1", "Mike"], ["20", "Артем"], ["60", "60"], ["DoubleTrouble"]]
[["#1", "Alice", "John", "Mike"], ["20", "60", "60", "Артем"], ["DoubleTrouble"]]
[["#1", "20", "60", "60", "Alice", "John", "Mike", "Артем"], ["DoubleTrouble"]]
["#1", "20", "60", "60", "Alice", "DoubleTrouble", "John", "Mike", "Артем"]

Algorithm output for multithreaded execution:


["John", "Alice", "Mike", "#1", "Артем", "20", "60", "60", "DoubleTrouble"]
[["John"], ["Alice"], ["Mike"], ["#1"], ["Артем"], ["20"], ["60"], ["60"], ["DoubleTrouble"]]
[["20", "Артем"], ["Alice", "John"], ["60", "60"], ["#1", "Mike"], ["DoubleTrouble"]]
[["#1", "60", "60", "Mike"], ["20", "Alice", "John", "Артем"], ["DoubleTrouble"]]
[["DoubleTrouble"], ["#1", "20", "60", "60", "Alice", "John", "Mike", "Артем"]]
["#1", "20", "60", "60", "Alice", "DoubleTrouble", "John", "Mike", "Артем"]

Time complexity O (n * log (n)), which is slightly better than O (n ^ 2)

References

https://en.wikipedia.org/wiki/Insertion_sort
https://en.wikipedia.org/wiki/Merge_sort

Source Code

https://gitlab.com/demensdeum/algorithms

0

Bubble Sort in Erlang

Bubble sort is quite boring, but it becomes more interesting if you try to implement it in a functional language for telecom – Erlang.

We have a list of numbers, we need to sort it. bubble sort algorithm runs through the whole list, iterating, and comparing the number of pairs. Upon verification of the following occurs: a smaller number is added to the output list, or change the number of places in this list if the right is smaller bust continues with the next iteration number. This bypass is repeated as long as the list will no longer be replaced.

In practice it is not necessary to use due to the large time complexity – O (n ^ 2); I implemented it in Erlang language in an imperative style, but if you’re interested you can look for the best options:


-module(bubbleSort).
-export([main/1]).

startBubbleSort([CurrentHead|Tail]) ->
    compareHeads(CurrentHead, Tail, [], [CurrentHead|Tail]).

compareHeads(CurrentHead, [NextHead|Tail], [], OriginalList) ->   
    if
        CurrentHead < NextHead ->
            compareHeads(NextHead, Tail, [CurrentHead], OriginalList);
        true ->
            compareHeads(CurrentHead, Tail, [NextHead], OriginalList)
    end;
    
compareHeads(CurrentHead, [NextHead|Tail], OriginalOutputList, OriginalList) ->
    if
        CurrentHead < NextHead ->
            OutputList = OriginalOutputList ++ [CurrentHead],
            compareHeads(NextHead, Tail, OutputList, OriginalList);
        true ->
            OutputList = OriginalOutputList ++ [NextHead],
            compareHeads(CurrentHead, Tail, OutputList, OriginalList)
    end;
  
compareHeads(CurrentHead, [], OriginalOutputList, OriginalList) ->
    OutputList = OriginalOutputList ++ [CurrentHead],
    if
        OriginalList == OutputList ->
            io:format("OutputList: ~w~n", [OutputList]);
        true ->
            startBubbleSort(OutputList)
    end.
  
main(_) ->
    UnsortedList = [69,7,4,44,2,9,10,6,26,1],
    startBubbleSort(UnsortedList).

Install and run

In Erlang Ubuntu install is easy, just enter the following command in a terminal sudo apt install erlang. In this language, each file must be of a module (module), a list of functions that can be used outside – export. The interesting features of the language is the lack of variables, only constants, there is no standard syntax for the PLO (which does not prevent the use of OOP techniques), and of course the parallel computing without locks based on actor model.

Start module can be either through an interactive console erl, running one command after another, either through easier escript bubbleSort.erl; For different cases the file will look different, for example escript necessary to make the main function, from which it will start.

References

https://www.erlang.org/

https://habr.com/ru/post/197364/

Source Code

https://gitlab.com/demensdeum/algorithms/blob/master/bubbleSort/bubbleSort.erl

0

Lexicographical comparison

lexicographic comparisons rows algorithm works very simple loop compares the character codes and the result is returned if the symbols are not equal.

An example for the C language can be found here:
https://github.com/gcc-mirror/gcc/blob/master/libiberty/memcmp.c

Keep in mind that you need to compare the characters in a single static encoding, such as Swift, I used the comparison character by character to UTF-32. Array sorting option using memcmp work exactly for single-byte character, otherwise (variable length coding) may order is incorrect. I do not rule out the possibility of implementation on the basis of a variable-length encoding, but is likely to be much more complicated.

Time complexity at most O(1), the average and worst-O(n)

References

https://ru.wikipedia.org/wiki/Лексикографический_порядок

Source Code

https://gitlab.com/demensdeum/algorithms/blob/master/lexiCompare/lexiCompare.swift

0

Binary Search

Suppose we need to find out whether the email address “ demensdeum@gmail.com ” belongs to the list of allowed email addresses for receiving letters.

We sort through the entire list from the first to the last element, checking whether the element is equal to the specified address – we implement the linear search algorithm. But will it be long or not?

To answer this question, use the “Temporary complexity of the algorithms”, “O” notation. In the worst case, the linear search operation time is equal to the nth number of array elements, we write this in the “O” notation – O (n). Next, you need to clarify that for any known algorithm, there are three performance indicators – execution time in the best, worst, and average cases. For example, the mail address “ demensdeum@gmail.com ” is in the first index of the array, then it will be found in the first step of the algorithm, it follows that the execution time in at best, O (1); and if at the end of the list, then this is the worst case – O (n)

But what about the details of the software implementation, the performance of the hardware, should they affect big O? Now exhale and imagine that the calculation of time complexity is calculated for some abstract ideal machine, in which there is only this algorithm and nothing more.

Algorithm

Ok, it turns out that the linear search is rather slow, let’s try using the binary search. To begin with, it should be clarified that we will not work with binary data, this name was given to this method because of the features of its work. Initially, we sort the array into lexicographically , then the algorithm takes the range of the entire array, gets the middle element of the range, compares it lexicographically , and depending on the result comparison decides what range to take to search for more – the top half of the current or lower. That is, at each step of the search, a decision is made of two possible ones – binary logic. This step is repeated until either the word is found or not found (the intersection of the lower and upper indices of the range will occur).

The performance of this algorithm is the best case when an element is immediately found in the middle of the O (1) array, the worst case is O (log n)

Gotchas

When implementing binary search, I came across not only the interesting problem of the lack of standardization of lexicographic comparison in programming language libraries, but I even found that there is no single standard for implementing localeCompare inside JavaScript . The ECMAScript standard allows different implementations of this function, because of this, when sorting using localeCompare, absolutely different results can be observed on different JavaScript engines.

Therefore, for the algorithm to work correctly, you must sort and use only the same lexicographic comparison algorithm in your work, otherwise nothing will work. So, for example, if you try to sort an array in Scala, and search using nodejs without realizing your own sorting / sorting of one implementation, then you will not expect anything other than disappointment in humanity.

References

https://en.wikipedia.org/wiki/Lexicographical_order
https://en.wikipedia.org/wiki/Binary_search_algorithm
https://stackoverflow.com/questions/52941016/sorting-in-localecompare-in-javascript

Source Code

https://gitlab.com/demensdeum/algorithms

0