The easiest way to compare two sorted vectors is to iterate both of them at the same time and only increase, no matter which iterator matters. To do this, a minimum number of comparisons is guaranteed if both vectors are unique. In fact, you can use the same code for all kinds of linked link lists.
@Nikos Athanasiou (answer above) gives many useful tips for speeding up your code, for example using skip lists, simd comparisons. However, your data set is so small that even simple, naive code here is dazzlingly fast ...
template<typename CONT1, typename CONT2, typename OP_MARKIDENTICAL> inline void set_mark_overlap( const CONT1& cont1, const CONT2& cont2, OP_MARKIDENTICAL op_markidentical) { auto ii = cont1.cbegin(); auto end1 = cont1.cend(); auto jj = cont2.cbegin(); auto end2 = cont2.cend(); if (cont1.empty() || cont2.empty()) return; for (;;) { // increment iterator to container 1 if it is less if (*ii < *jj) { if (++ii == end1) break; } // increment iterator to container 2 if it is less else if (*jj < *ii) { if (++jj == end2) break; } // same values // increment both iterators else { op_markidentical(*ii); ++ii; if (ii == end1) break; // // Comment if container1 can contain duplicates // ++jj; if (jj == end2) break; } } }
Here is how you can use this code:
template<typename TT> struct op_store { vector<TT>& store; op_store(vector<TT>& store): store(store){} void operator()(TT val){store.push_back(val);} }; vector<unsigned> first{1,2,3,4,5,6}; vector<unsigned> second{1,2,5,6, 7,9}; vector<unsigned> overlap; set_mark_overlap( first, second, op_store<unsigned>(overlap)); for (const auto& ii : overlap) std::cout << ii << ","; std::cout << ii << "\n";
This code assumes that no vector contains duplicates. If any of your vecOfVec contains duplicates, and you want each duplicate to be printed, you need to comment on the code above. If your vecSearched vector contains duplicates, it is not clear what the corresponding answer will be ...
In your case, the code for storing matching values โโwill be only these three lines:
// results vector<vector<unsigned> > results(120); for (unsigned ii = 0; ii < vecOfVec.size(); ++ii) set_mark_overlap(vecSearched, vecOfVec[ii], op_store<unsigned>(results[ii]));
In terms of optimization, your problem has two characteristics: 1) One list is always much shorter than the other 2) A shorter list is reused, while a longer list is new for each comparison.
Front costs (preprocessing, for example, for skip lists, as suggested by @Nikos Athanasiou) are not related to the red list of 9000 (which is used again and again), but not to the longer list.
I assume that most of the skips are for longer lists, so skip lists may not be a panacea. How about making some kind of dynamic skip list so that you jump on N (j + = 4,000,000 / 9000) one (++ j) when the container catches up with the second (in the code above). If you jumped by two, you can use the mini-binary search to find the right amount to increase j.
Because of this asymmetry in the lengths of the lists, I donโt see the conversion from SIMD to help: we need to minimize the number of comparisons less (N + M), and not increase the speed of each comparison. However, it depends on your data. Code up and time for things!
Here is the test code for creating some random number vectors and check that they are present
#include <iostream> #include <vector> #include <unordered_set> #include <exception> #include <algorithm> #include <limits> #include <random> using namespace std; template<typename TT> struct op_store { std::vector<TT>& store; op_store(vector<TT>& store): store(store){} void operator()(TT val, TT val2){if (val != val2) std::cerr << val << " !- " << val2 << "\n"; store.push_back(val);} }; void fill_vec_with_unique_random_values(vector<unsigned>& cont, unordered_set<unsigned>& curr_values) { static random_device rd; static mt19937 e1(rd()); static uniform_int_distribution<unsigned> uniform_dist(0, std::numeric_limits<unsigned>::max()); for (auto& jj : cont) { for (;;) { unsigned new_value = uniform_dist(e1); // make sure all values are unique if (curr_values.count(new_value) == 0) { curr_values.insert(new_value); jj = new_value; break; } } } } int main (int argc, char *argv[]) { static random_device rd; static mt19937 e1(rd()); // vector searched in contains 9000+ elements. Vectors in vecSearched are sorted vector<unsigned> vecSearched(9000); unordered_set<unsigned> unique_values_9000; fill_vec_with_unique_random_values(vecSearched, unique_values_9000); // // Create 120 vectors of size ranging from 1 to 2000000 elements. All vectors in vecOfVec are sorted // vector<vector<unsigned> > vecOfVec(5); normal_distribution<> vec_size_normal_dist(1000000U, 500000U); for (unsigned ii = 0; ii < vecOfVec.size(); ++ii) { std::cerr << " Create Random data set" << ii << " ...\n"; auto vec_size = min(2000000U, static_cast<unsigned>(vec_size_normal_dist(e1))); vecOfVec[ii].resize(vec_size); // Do NOT share values with the 9000. We will manually add these later unordered_set<unsigned> unique_values(unique_values_9000); fill_vec_with_unique_random_values(vecOfVec[ii], unique_values); } // insert half of vecSearched in our 120 vectors so that we know what we are going to find vector<unsigned> correct_results(begin(vecSearched), begin(vecSearched) + 4500); for (unsigned ii = 0; ii < vecOfVec.size(); ++ii) vecOfVec[ii].insert(vecOfVec[ii].end(), begin(correct_results), end(correct_results)); // Make sure everything is sorted std::cerr << " Sort data ...\n"; for (unsigned ii = 0; ii < vecOfVec.size(); ++ii) sort(begin(vecOfVec[ii]), end(vecOfVec[ii])); sort(begin(vecSearched), end(vecSearched)); sort(begin(correct_results), end(correct_results)); std::cerr << " Match ...\n"; // results vector<vector<unsigned> > results(120); for (unsigned ii = 0; ii < vecOfVec.size(); ++ii) { std::cerr << ii << " done\n"; set_mark_overlap(vecSearched, vecOfVec[ii], op_store<unsigned>(results[ii])); // check all is well if (results[ii] != correct_results) throw runtime_error("Oops"); } return(0); }