Counting pairs of words that differ in one letter

Consider n words, each of length k. The words consist of letters over an alphabet (whose cardinality is n) with a specific ordering. The task is to design an O(nk) algorithm that counts the number of word pairs differing in exactly one position (any position, but only one).

For example, in the following set of words (n = 5, k = 4):

abcd, abdd, adcb, adcd, aecd

There are 5 such pairs: (abcd, abdd), (abcd, adcd), (abcd, aecd), (adcb, adcd), (adcd, aecd).

So far I have managed to find an algorithm that solves a slightly easier task: counting the number of word pairs that differ in one GIVEN position (the i-th). To do this, I swap the letter at the i-th position with the last letter of each word, radix-sort the words (ignoring the last position of each word, which was previously the i-th), linearly detect groups of words whose letters in the first k-1 positions are identical, and finally, within each group of duplicates, count the occurrences of each letter at the last (originally i-th) position and compute the required number of pairs (this last part is simple).

However, the above algorithm does not seem to extend to the main problem (within the O(nk) constraint), at least not without changes. Any idea how to solve it?
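For reference, the fixed-position subtask described above can be sketched as follows in Python. A dictionary is used here to group the words in place of the radix sort from the description, and the function name is my own illustration, not original code:

```python
from collections import Counter

def count_pairs_fixed_position(words, i):
    """Count pairs of words that differ exactly at position i
    (and agree at every other position). A dictionary stands in
    for the radix-sort grouping described in the question."""
    groups = Counter()   # key: the word with position i removed -> group size
    letters = {}         # key -> Counter of letters seen at position i
    for w in words:
        key = w[:i] + w[i+1:]
        groups[key] += 1
        letters.setdefault(key, Counter())[w[i]] += 1
    total = 0
    for key, m in groups.items():
        # All pairs in the group, minus pairs whose i-th letters coincide
        same = sum(c * (c - 1) // 2 for c in letters[key].values())
        total += m * (m - 1) // 2 - same
    return total

words = ["abcd", "abdd", "adcb", "adcd", "aecd"]
# Summing over all positions counts each one-letter-difference pair once
print(sum(count_pairs_fixed_position(words, i) for i in range(4)))  # 5
```

Summing this over every position i gives the answer to the main problem, but naively that costs O(nk) per position, i.e. O(nk^2) overall, which illustrates why the question asks for something better.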

+4
4 answers

Assuming n and k are small enough for everything to fit in memory:

Keep k maps: one keyed on words with the first letter deleted, one with the second letter deleted, one with the third, and so on. Technically, each should be a map from strings to counts.

Iterate over the word list, adding the current word to each of the maps (after deleting the corresponding letter first, obviously); if the key already exists, add its count to totalPairs and then increment the count.

Then totalPairs is the desired value.
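A minimal sketch of this approach in Python, assuming all words are distinct and of equal length (the names `maps` and `count_one_letter_pairs` are illustrative):

```python
from collections import defaultdict

def count_one_letter_pairs(words):
    """One pass over the words, maintaining k maps; the i-th map is
    keyed on the word with its i-th letter deleted."""
    k = len(words[0])
    maps = [defaultdict(int) for _ in range(k)]
    total_pairs = 0
    for w in words:
        for i in range(k):
            key = w[:i] + w[i+1:]
            # Every earlier word with the same key differs from w
            # only at position i (assuming all words are distinct)
            total_pairs += maps[i][key]
            maps[i][key] += 1
    return total_pairs

print(count_one_letter_pairs(["abcd", "abdd", "adcb", "adcd", "aecd"]))  # 5
```

With a hash map this is O(nk) map operations, each hashing a string of length k-1, so the true cost is O(nk^2) unless the keys are hashed incrementally.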

EDIT:

Complexity:

It should be O(nk log n).

You can use a hash-based map (e.g. HashMap in Java) instead of a sorted map for a theoretical O(nk) complexity (although in my experience a hash map is often slower than a tree-based sorted map).

Improvement:

A slight variation on this is to have a map keyed on words with the first two letters deleted, mapping to 2 maps: one with only the first letter deleted and one with only the second letter deleted; and the same for the 3rd and 4th letters, etc.

Then nest these inside maps with four letters deleted, those inside maps with 8 letters deleted, and so on, up to half of the letters deleted.

The complexity of this:

  • You perform 2 lookups in 2 sorted maps containing at most k elements (one for each half).

  • For each of these, you again perform 2 lookups in 2 sorted maps (one for each quarter).

  • Thus the number of lookups is 2 + 4 + 8 + ... + k/2 + k, which, I believe, is O(k).

  • I may be mistaken, but in the worst case the number of elements in any given map is n; that would force all the other maps to hold only one element each, so each lookup is O(log n), but that cost is paid per word (n times), not per letter (nk times).

So I believe this gives O(n(log n + k)).


EDIT 2:

Example of my maps (without the improvement):

(x-1) means that x is mapped to 1.

Say we have abcd, abdd, adcb, adcd, aecd.

The first map will be (bcd-1), (bdd-1), (dcb-1), (dcd-1), (ecd-1).

The second map will be (acd-3), (add-1), (acb-1) (for the 4th and 5th words the key already exists, so its count is incremented).

The third map: (abd-2), (adb-1), (add-1), (aed-1) (the second already existed).

The fourth map: (abc-1), (abd-1), (adc-2), (aec-1) (the fourth already existed).

totalPairs = 0

For the second map, key acd: the 4th word adds 1, the 5th adds 2.

totalPairs = 3

For the third map, key abd: the 2nd word adds 1.

totalPairs = 4

For the fourth map, key adc: the 4th word adds 1.

totalPairs = 5 .

A partial example of the improved maps:

The same input as above.

The map keyed on words with the first two letters deleted, mapping to maps with the first letter deleted and with the second letter deleted:

 (cd-{ {(bcd-1)}, {(acd-1)} }), (dd-{ {(bdd-1)}, {(add-1)} }), (cb-{ {(dcb-1)}, {(acb-1)} }), (cd-{ {(dcd-1)}, {(acd-1)} }), (cd-{ {(ecd-1)}, {(acd-1)} }) 

The above map consists of the element cd mapped to 2 maps, one containing the single element (bcd-1) and the other containing (acd-1).

But for the 4th and 5th words cd already exists, so instead of generating the above, they are added to the existing entry as follows:

 (cd-{ {(bcd-1, dcd-1, ecd-1)}, {(acd-3)} }), (dd-{ {(bdd-1)}, {(add-1)} }), (cb-{ {(dcb-1)}, {(acb-1)} }) 
+1

You can put each word into an array. Pop the elements off each such array one at a time, compare the resulting arrays, and then push the popped element back to restore the original array. The popped elements from the two arrays must not be the same. Count the number of times this happens, and finally divide it by 2 to get the answer.

0

Think about how you would enumerate the language: most likely with a recursive algorithm. Recursive algorithms map onto tree structures. If you build such a tree, each divergence represents a difference in one letter, and each leaf represents a word in the language.

0

Two months have passed since I posted the problem here. I have discussed it with my peers in the meantime and would like to share the results.

The basic idea is similar to Dukeling's. For each word A and each position i within it, we consider the tuple (prefix, suffix, letter at the i-th position), i.e. (A[1..i-1], A[i+1..k], A[i]). If i is 1 or k, the corresponding substring is considered empty (these are simple boundary cases).

Having these tuples at hand, we can apply the reasoning I presented in my original post to count the number of differing word pairs. All we need to do is sort the tuples by the prefix and suffix values (separately for each i): then words whose letters are equal at all but the i-th position will be adjacent to one another.

This is the technical part I am missing. For the sorting procedure (radix sort seems to be the way to go) to fit the O(nk) constraint, we may want to assign integer labels to the prefixes and suffixes (we only need n labels for each i). I am not quite sure how to do this. (Of course, we could use hashing instead, but I believe the deterministic approach is viable.)
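One standard way to assign such labels (not from the original post; this is my own sketch of a possible filling of the gap) is to refine them level by level: two words get the same label at level i exactly when their level-(i-1) labels and their next letters coincide, so each level can be processed with a counting sort in O(n), for O(nk) overall. The sketch below uses Python's built-in sort for brevity, so as written it is O(nk log n), not O(nk):

```python
def assign_labels(words, k, from_left=True):
    """labels[i][j] is an integer label of the length-i prefix
    (or suffix, if from_left is False) of words[j]; two words get
    the same level-i label iff those substrings are equal."""
    n = len(words)
    labels = [[0] * n for _ in range(k + 1)]  # level 0: empty substrings, all equal
    for i in range(1, k + 1):
        pos = i - 1 if from_left else k - i
        # Level-i identity is determined by (level-(i-1) label, next letter);
        # a counting/radix sort of these pairs would make this step O(n)
        keys = [(labels[i - 1][j], words[j][pos]) for j in range(n)]
        order = sorted(range(n), key=lambda j: keys[j])
        next_label = 0
        for t, j in enumerate(order):
            if t > 0 and keys[j] != keys[order[t - 1]]:
                next_label += 1
            labels[i][j] = next_label
    return labels

words = ["abcd", "abdd", "adcb", "adcd", "aecd"]
pre = assign_labels(words, 4, from_left=True)
suf = assign_labels(words, 4, from_left=False)
# Words 1, 4 and 5 share the suffix "cd", so their level-2 suffix labels agree
print(pre[1], suf[2])
```

With these labels, each tuple (prefix, suffix, letter) from the post becomes a triple of small integers, which radix sort handles in linear time per position.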

Although this is not a complete solution, I believe it sheds light on a possible way to attack the problem, which is why I posted it here. If someone figures out how to do the labeling part, I will incorporate it into this post.

0

Source: https://habr.com/ru/post/1444496/

