Find the number of duplicate subarrays in an array

Question

Find the number of duplicate subarrays in an array

There is an array indexed from 0-n (i.e. size = n) containing elements from 0-m , where m < n (suppose that m is 100 or 1000 times smaller than n, i.e. m is much less than n), so many elements or an auxiliary array must be repeated, and we need to find the number of such repeated auxiliary arrays of size 1 or more than 1. For example, consider the array 1 2 4 1 2 4 1 5 6 3 5 , here
1 repeated 2 times
2 repeated 1 time
4 repeated 1 time
5 repeated 1 time
1 2 repeated 1 time
2 4 is repeated 1 time
4 1 repeated 1 time
1 2 4 repeated 1 time
2 4 1 repeated 1 time
1 2 4 1 repeated 1 time
Thus, the final answer is the sum of these, i.e. eleven
Can anyone suggest some data structure or efficient algorithm to solve this problem. I thought about hashing m and noted the value of the index where it occurs, but did not formalize it.
Assume that n,m < 10^5

+6

arrays algorithm

justice league Jul 14 '13 at 10:39

source share

3 answers

I used the suffix tree approach from Schenna's book on algorithm development . Suffix trees - a special kind of trie tree , where you insert each suffix of this line in the trie tree. Three trees is a way of storing several words in one tree: the root is an empty string, and each edge adds a character. I used a very simple implementation as a proof of concept below.

 #include <iostream> #include <string> using std::string; using std::cout; class Node { Node* m_descendants[10]; int m_count; public: Node() { for(int i=0; i<10; ++i) { m_descendants[i] = nullptr; } m_count = 0; } void Add(const char* str) { // Increment number of times we've reached this node m_count++; const int firstDigit = str[0] - '0'; if (!(0 <= firstDigit && firstDigit <= 9)) return; // recursion base case // Access descendant node correponding to current digit Node*& descendant = m_descendants[firstDigit]; if (descendant == 0) { // create descendant if necessary descendant = new Node; } // Recurse on rest of string descendant->Add(str +1); } void Print(const string& str, int countAdj=0) const // debug function { for(int nextDigit=0; nextDigit<10; ++nextDigit) { const Node* node = m_descendants[nextDigit]; if (node) { const int adjustedCount = node->Count() - countAdj; if (adjustedCount > 0) { char c = '0' + nextDigit; string strWithC = str; strWithC += c; cout << strWithC << ": " << adjustedCount << "\n"; node->Print(strWithC, countAdj); } } } } int SumRepeated() const { int sumRepeated = 0; for(int nextDigit=0; nextDigit<10; ++nextDigit) { const Node* node = m_descendants[nextDigit]; if (node) { const int repeatCount = node->Count() -1; if (repeatCount > 0) { sumRepeated += repeatCount; sumRepeated += node->SumRepeated(); } } } return sumRepeated; } int Count() const { return m_count; } }; int main(int argc, const char* argv[]) { // Construct Node* const root = new Node; for(const char* str = "12412415635"; *str; ++str) { root->Add(str); } // Print root->Print(string(), 1); cout << "Sum repeated is " << root->SumRepeated() << "\n"; return 0; // (nodes are leaked) }

Output

 1: 2 12: 1 124: 1 1241: 1 2: 1 24: 1 241: 1 4: 1 41: 1 5: 1 Sum repeated is 11

Note that there is one additional repeating substring, i.e. 1241.

As I said, this is a proof of concept, therefore, for example, you would like to use a dictionary instead of an array to store descendants with large m . In addition, this implementation is not optimal even with respect to the general algorithm: it is O (n ^ 2) in time and space. You can optimize by folding paths without branching to get O (n) space, and use smart building algorithms to get O (n) time. Also, as indicated in one of the other answers, you don't need to process substrings that are up to half the length of the original.

+1

Tooone Jul 14 '13 at 14:02

source share

There is something in JavaScript: (the output is more or less immediate, and JavaScript is one of the slowest, I think):

 <script> function f (s, m) { var i = 0, n = s.length, c = 0, H = {}, Hi = {} while (i < n-1) { for (var j=i+1; j<=Math.min (i + 1 + m,n - 1); j++) { if (s[i] == s[j]) { var i_= i, j_= j, tmp = '' while (s[i_] == s[j_] && i_ < j) { tmp += String (s[i_]) i_++ j_++ } var k = tmp.length - 1 while (k >= 0) { if (!H[tmp]) { H[tmp] = 1 Hi[tmp] = i c++ } else if (i == Hi[tmp]) { H[tmp]++ c++ } tmp = tmp.substr(0,k--) } } } i++ } var t1 = '' for (var i in H) t1 += i + ", " return [t1,c] } var a1 = [1,2,4,1,2,4,1,5,6,3,5], a2 = [] for (var i=0;i<100000;i++) a2.push(Math.floor(Math.random()*10)) console.log(a1 +"\n"+ f(a1,10)) console.log(f(a2,10)) </script>

Output:

 1,2,4,1,2,4,1,5,6,3,5 1, 2, 4, 5, 12, 24, 41, 124, 241, ,10 ["0...6358, 3582, 038, 4304, 2068, 095, 9252, 37866, 3786, 7866, 7893, 8701, 0669, ", 771]

0

גלעד ברקן Jul 15 '13 at 2:33

source share

Dave · Accepted Answer · 2013-07-14T14:14:21+0000

Create a hash that has integers in the form of keys and extensible lists (linked lists, for example) of integers as values.
Read the input list. Consider each input element that is scanned as a key; add the index of the next item to the appropriate list of values.
Now your repeated 1-element counter - n - | keys |
Then go through your keys. For each key, treat the indexed items in the source list as a new input list. Create a new hash and continue, as in step 1. Note that when we save indexes, we continue to use them in the original list.

Example: 1 2 4 1 2 4 1 5 6 3 5 (suppose we are zero-indexed). Here n = 11.

1 → [1, 4, 7]
2 → [2, 5]
3 → [10]
4 → [3, 6]
5 → [8, nil]
6 → [9]

The number of repeated 1-elts is 11-6 = 5.

Now we go through the keys. Let's start with 1. The list of indices is [1, 4, 7], which corresponds to the elements [2, 2, 5].

2 → [2, 5]
5 → [8]

This displays a 3 - 2 = 1 duplicate 2-element list.

etc...

Runtime: linearly at input + output

Find the number of duplicate subarrays in an array

More articles: