The most efficient way to find the index of matching values in two sorted arrays using C++

I currently have a solution, but I suspect it is not as efficient as it could be for this problem, so I want to see whether there is a faster approach.

I have two arrays (e.g. std::vectors). Both arrays contain only unique integer values that are sorted but sparse in value, i.e. 1, 4, 12, 13, ... What I am asking for is a fast way to find the INDEX into either array where the values are the same. For example, array1 has the values 1,4,12,13 and array2 has the values 2,12,14,16. The first matching value is 12, which is at index 2 in array1 and index 1 in array2. The index into the array is what matters, because I have other arrays of data that will be accessed by the index of the match.

I'm not limited to using arrays; maps are a possibility as well. The two arrays are only compared against each other once; they are not reused after the first comparison. Either array may hold a small or large number of values (300,000+), but they will NOT necessarily hold the same number of values (which would have simplified things greatly).

At worst this is a linear search, O(N^2). Using a map would give me a better O(log N) lookup, but I would still have to convert one array into a map of value/index pairs first.
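For illustration, here is a minimal sketch of that map-based idea (this is not code from the question; the value-to-index std::map layout is an assumption made only for the example): one array is converted once into a map of value/index pairs, and each value of the other array is then looked up in O(log N).

#include <iostream>
#include <map>
#include <vector>

int main() {
    std::vector<int> array1 = {4, 6, 12, 34};
    std::vector<int> array2 = {1, 3, 6, 34, 40};

    // build the value -> index map for array1 once; O(N log N)
    std::map<int, std::size_t> value_to_index;
    for (std::size_t i = 0; i < array1.size(); ++i)
        value_to_index[array1[i]] = i;

    // each lookup of an array2 value is then O(log N)
    for (std::size_t z = 0; z < array2.size(); ++z) {
        auto it = value_to_index.find(array2[z]);
        if (it != value_to_index.end())
            std::cout << "match: array1 index " << it->second
                      << ", array2 index " << z << '\n';
    }
}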

Currently I avoid any container type conversions. I iterate over the smaller of the two arrays. I compare the current element of the small array (array1) with the current element of the large array (array2). If the value from array1 is greater than the value from array2, I increment the index into array2 until array2's value is no longer smaller than array1's value (a while loop). Then, if the value from array1 is less than the value from array2, I continue to the next iteration of the outer loop and start again. Otherwise the values must be equal, and I have the indices of the matching value in both arrays.

So with this loop I am at O(N) at best, when every value has a match, and around O(2N) at worst, when none do. So I wonder whether there is anything faster. It is hard to know exactly how often the two arrays will have matches, but I would rather optimize for the more common case, which is that they mostly do have matches.

I hope I have explained the problem well enough, and I would appreciate any feedback or tips for improving this.

Code example:

#include <vector>

int main() {
    std::vector<int> array1 = {4,6,12,34};
    std::vector<int> array2 = {1,3,6,34,40};

    for (unsigned int i = 0, z = 0; i < array1.size(); i++) {
        int value1 = array1[i];
        // advance z while array2's value is still smaller than value1
        // (bounds check first to avoid reading past the end of array2)
        while (z < array2.size() && value1 > array2[z]) z++;
        if (z >= array2.size()) break;      // reached end of array2
        if (value1 < array2[z]) continue;   // no match for value1, try the next one
        // we have a match: indices i and z hold the same value
    }
}

The result would be the matching indices [1, 3] for array1 and [2, 3] for array2 (the shared values are 6 and 34).

3 answers

I wrote an implementation of this function using an algorithm that performs better on sparse distributions than the trivial linear merge.

For similar distributions†, it has O(n) complexity, but where the distributions differ greatly it can perform below linear, approaching O(log n) in optimal cases. However, I was not able to prove that the worst case is any better than O(n log n). On the other hand, I have not been able to find that worst case either.

I templated it so that any range type can be used, for example sub-ranges or raw arrays. Technically it also works with non-random-access iterators, but the complexity is much worse in that case, so it is not recommended. I think the algorithm should then be changed to fall back to the linear search, but I did not bother.

† By similar distributions I mean that the pair of arrays has many crossings. By a crossing I mean a point where you would switch from one array to the other if you merged the two arrays in sorted order.

#include <algorithm>
#include <iterator>
#include <utility>

// helper structure for the search
template<class Range, class Out>
struct search_data {
    // is there any clearer way to get an iterator that might be either
    // a Range::const_iterator or const T*?
    using iterator = decltype(std::cbegin(std::declval<Range&>()));
    iterator curr;
    const iterator begin, end;
    Out out;
};

template<class Range, class Out>
auto init_search_data(const Range& range, Out out) {
    return search_data<Range, Out>{
        std::begin(range),
        std::begin(range),
        std::end(range),
        out,
    };
}

template<class Range, class Out1, class Out2>
void match_indices(const Range& in1, const Range& in2, Out1 out1, Out2 out2) {
    auto search_data1 = init_search_data(in1, out1);
    auto search_data2 = init_search_data(in2, out2);

    // initial order is arbitrary
    auto lesser = &search_data1;
    auto greater = &search_data2;

    // if either range is exhausted, we are finished
    while(lesser->curr != lesser->end
          && greater->curr != greater->end) {
        // difference of first values in each range
        auto delta = *greater->curr - *lesser->curr;

        if(!delta) { // matching value was found
            // store both results and increment the iterators
            *lesser->out++ = std::distance(lesser->begin, lesser->curr++);
            *greater->out++ = std::distance(greater->begin, greater->curr++);
            continue; // then start a new iteration
        }

        if(delta < 0) {
            // set the order of ranges by their first value
            std::swap(lesser, greater);
            delta = -delta; // delta is always positive after this
        }

        // next crossing cannot be farther than the delta
        // this assumption has the following pre-requisites:
        // range is sorted, values are integers, values in the range are unique
        auto range_left = std::distance(lesser->curr, lesser->end);
        auto upper_limit =
            std::min(range_left, static_cast<decltype(range_left)>(delta));

        // exponential search for a sub range where the value at upper bound
        // is greater than target, and value at lower bound is lesser
        auto target = *greater->curr;
        auto lower = lesser->curr;
        auto upper = std::next(lower, upper_limit);
        for(int i = 1; i < upper_limit; i *= 2) {
            auto guess = std::next(lower, i);
            if(*guess >= target) {
                upper = guess;
                break;
            }
            lower = guess;
        }

        // skip all values in lesser
        // that are less than the least value in greater
        lesser->curr = std::lower_bound(lower, upper, target);
    }
}

#include <iostream>
#include <vector>

int main() {
    std::vector<int> array1 = {4,6,12,34};
    std::vector<int> array2 = {1,3,6,34};
    std::vector<std::size_t> indices1;
    std::vector<std::size_t> indices2;

    match_indices(array1, array2,
                  std::back_inserter(indices1),
                  std::back_inserter(indices2));

    std::cout << "indices in array1: ";
    for(std::vector<int>::size_type i : indices1)
        std::cout << i << ' ';

    std::cout << "\nindices in array2: ";
    for(std::vector<int>::size_type i : indices2)
        std::cout << i << ' ';
    std::cout << std::endl;
}

Since the arrays are already sorted, you can just use something very similar to the merge step of mergesort. It simply looks at the head element of each array and discards the lesser of the two (the next element becomes the new head). Stop when you find a match (or when either array is exhausted, indicating there is no match).

This is O(n) and the fastest you can do for arbitrary distributions. With some clustered distributions, you can use a skip-ahead approach instead of always looking at the next element, which can give better-than-O(n) running times for certain distributions. For example, given the arrays 1,2,3,4,5 and 10,11,12,13,14, the algorithm could determine that there are no matches with just one comparison (5 < 10).
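A minimal sketch of such a skip-ahead variant (my own illustration, not code from this answer; it uses std::lower_bound on the remaining tail as the "skip" step, and the helper print_matches is hypothetical):

#include <algorithm>
#include <iostream>
#include <vector>

// merge-style scan, but jump forward with a binary search instead of
// advancing one element at a time
void print_matches(const std::vector<int>& a, const std::vector<int>& b) {
    std::size_t i = 0, z = 0;
    while (i < a.size() && z < b.size()) {
        if (a[i] == b[z]) {
            std::cout << "match at a[" << i << "], b[" << z << "]\n";
            ++i; ++z;
        } else if (a[i] < b[z]) {
            // skip ahead in a to the first element >= b[z]
            i = std::lower_bound(a.begin() + i, a.end(), b[z]) - a.begin();
        } else {
            // skip ahead in b to the first element >= a[i]
            z = std::lower_bound(b.begin() + z, b.end(), a[i]) - b.begin();
        }
    }
}

int main() {
    print_matches({1, 2, 3, 4, 5}, {10, 11, 12, 13, 14}); // ends after one skip, no matches
    print_matches({4, 6, 12, 34}, {1, 3, 6, 34, 40});     // matches at (1,2) and (3,3)
}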


What is the range of the stored numbers?

I mean, you say the numbers are integers, sorted, and sparse (i.e. not sequential), and that there can be more than 300,000 of them, but what is their actual range?

The reason I ask is that if there is a sufficiently small upper bound u (say, u = 500,000), the fastest and most appropriate solution might simply be to use the values themselves as indices. Yes, you would waste some memory, but is it really that much? It depends on your application and your target platform (i.e. for an embedded system with a limited amount of memory it is much less likely to be a good idea than on a laptop with 32 GB of RAM).
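A minimal sketch of that use-the-values-as-indices idea, assuming a known upper bound u on the values and using -1 as a hypothetical "no entry" sentinel (both assumptions are mine, for illustration only):

#include <iostream>
#include <vector>

int main() {
    const int u = 500000; // assumed upper bound on the stored values

    std::vector<int> array1 = {4, 6, 12, 34};
    std::vector<int> array2 = {1, 3, 6, 34, 40};

    // lookup[v] holds the index of value v in array1, or -1 if v is absent
    std::vector<int> lookup(u + 1, -1);
    for (std::size_t i = 0; i < array1.size(); ++i)
        lookup[array1[i]] = static_cast<int>(i);

    // each probe is a single O(1) array access
    for (std::size_t z = 0; z < array2.size(); ++z) {
        int i = lookup[array2[z]];
        if (i >= 0)
            std::cout << "match: array1 index " << i
                      << ", array2 index " << z << "\n";
    }
}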

Of course, if the values are more or less evenly distributed over 0 to 2^31-1, this crude idea is not attractive, but perhaps there are properties of the input values that you can exploit over some other range. You might be able to hand-write a fairly simple hash function.

Another thing to consider is whether you really need to get the index quickly, or whether it would already help just to determine quickly whether a value exists in the other array. Recording whether a value exists at all takes only one bit, so you could have a bitmap of the entire input value range using 32x less memory (i.e. mask off the 5 least significant bits and use them as the bit position, then shift the remaining 27 bits 5 places to the right and use that as the array index).
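A rough sketch of such a bitmap membership test (my own illustration: it keeps the bit layout described above, 5 low bits for the bit position and the remaining bits for the word index, but sizes the table for an assumed upper bound u rather than the full 32-bit range):

#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    const int u = 500000;                              // assumed upper bound on the values
    std::vector<std::uint32_t> bits((u >> 5) + 1, 0);  // one bit per possible value

    std::vector<int> array1 = {4, 6, 12, 34};
    std::vector<int> array2 = {1, 3, 6, 34, 40};

    // mark every value present in array1:
    // the low 5 bits select the bit, the remaining bits select the 32-bit word
    for (int v : array1)
        bits[v >> 5] |= 1u << (v & 31);

    // membership test for array2's values: one shift, one mask, one load each
    for (std::size_t z = 0; z < array2.size(); ++z) {
        int v = array2[z];
        if (bits[v >> 5] & (1u << (v & 31)))
            std::cout << "array2 index " << z << " (value " << v
                      << ") also exists in array1\n";
    }
}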

Finally, you might consider a hybrid approach, in which you decide how much memory you are prepared to use (say you settle on 256 KiB, which corresponds to 64Ki 4-byte integers) and then use that as a lookup table into much smaller sub-problems. Say you have 300,000 values whose low-order bits are fairly evenly distributed. Then you could use the 16 low-order bits as an index into a lookup table of lists, each of which would (on average) contain only 4 or 5 elements and could then be searched by other means. A couple of years ago I was working on some simulation software that had ~200,000,000 cells, each with a cell identifier; some utility functions used binary search to identify cells by identifier. We were able to speed it up significantly and unobtrusively with this strategy. Not a perfect solution, but a big improvement. (If the low-order bits are not evenly distributed, perhaps that is itself a property you can exploit, or perhaps you can pick a different range of bits or do a little hashing.)
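A minimal sketch of that hybrid bucketing idea (an illustration only; it uses a vector of small vectors keyed by the 16 low-order bits rather than the flat 64Ki table described above, so the exact memory layout is my own assumption):

#include <iostream>
#include <utility>
#include <vector>

int main() {
    std::vector<int> array1 = {4, 6, 12, 34};
    std::vector<int> array2 = {1, 3, 6, 34, 40};

    // 2^16 buckets keyed by the 16 low-order bits of the value;
    // each bucket holds (value, index-in-array1) pairs and stays tiny on average
    std::vector<std::vector<std::pair<int, std::size_t>>> buckets(1u << 16);
    for (std::size_t i = 0; i < array1.size(); ++i)
        buckets[array1[i] & 0xFFFF].push_back({array1[i], i});

    // probe: hash to a bucket, then linearly scan the handful of entries in it
    for (std::size_t z = 0; z < array2.size(); ++z) {
        for (const auto& entry : buckets[array2[z] & 0xFFFF])
            if (entry.first == array2[z])
                std::cout << "match: array1 index " << entry.second
                          << ", array2 index " << z << "\n";
    }
}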

I suppose the takeaway is "consider some kind of hashing", even if it is just an identity hash or a simple mask/modulo, with a little "your solution does not have to be completely general" on the side, and some "your solution does not have to be perfectly space-efficient" sauce on top.


Source: https://habr.com/ru/post/1247279/

