Looking for an efficient algorithm (non-trivial)

Specification problem:

It's Christmas! You must buy gifts!

You have a set of existing toy bundles and the corresponding price of each bundle:

    1 0 0 1 0 1 1 1 0 => 58
    0 1 0 0 1 1 1 0 0 => 27
    1 1 1 0 0 0 1 0 0 => 46
    0 0 0 0 1 1 1 1 0 => 73
    ...

Where a 1 indicates that the toy is in a bundle, and a 0 that it is not.

Now the Santa Claus promotion is coming up, and a new bundle X is offered to you at a special price. We say that X is a bad deal if there is another bundle Y such that:

Edit: to make this simpler, I removed condition 3, but changed condition 1 from “subset” to “strict subset”.

  • X is a strict subset of Y
  • X is more expensive than Y

The goal is to implement a bool isBadSubset(X) function that efficiently detects whether X is a good deal or not.

Given that there are millions of bundles, comparing X against each of them is obviously not feasible. Moreover, you can assume that within the existing collection, a bundle that is a subset of another is always cheaper than its superset.
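For reference, the brute-force check that these constraints rule out would look roughly like this (a sketch assuming bundles are stored as integer bitmasks with their prices; the names are illustrative, not part of the question):

    # Naive baseline: scan every stored bundle. Too slow for millions of bundles,
    # but it pins down the semantics of the bad-deal test.
    def is_bad_deal_naive(x_mask, x_price, catalog):
        for y_mask, y_price in catalog:          # catalog: list of (toy_bitmask, price)
            strict_subset = (x_mask & y_mask) == x_mask and x_mask != y_mask
            if strict_subset and y_price < x_price:
                return True                      # found a cheaper strict superset
        return False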

Tips:

  • comparing whether one set is a subset of another set is easy
  • you can limit the comparison to bundles that contain at least N more toys and are cheaper; however, that list may still be large
  • something along the lines of a sieve would be fine
  • you don’t need to know which bundle is better ... just that a better one exists

Objective: is it possible to achieve this in constant time (regardless of the number of bundles currently in the collection) ... or at least in O(log n)?

+6
6 answers

I managed to find the relevant literature on fast subset queries, and it seems that in the general case your problem is not easy.

Charikar, Indyk and Panigrahy (2002) study the subset query problem: given a collection P of N subsets of some universe U of m elements, and a query set Q, does P contain a set that is a subset of Q? They present two algorithms that trade storage for query speed. To achieve query time O(N/k), they need to increase space usage by a factor exponential in the square root of k.

Bevc and Sannik (2010) describe a simple trie-based data structure for subset queries without a query-time analysis, but it seems clear that it is linear in the N stored sets in the worst case.

+3

An O(1) solution is unrealistic, I think. The only one I could think of is to precompute the complete list of all possible bundles and store, for each of them, whether it is a good deal or not ... I doubt that is what you are asking for.

A simple binary search could be interesting: without even looking at which toys are involved, we can simply index by price and number of items. A bundle can only be a bad deal if there is another one with a lower price and a higher item count.

Therefore, we can define a key (price, number of items) and index the bundles on it efficiently. Building that index is O(n log n), and the subset inclusion tests on the remaining candidates are still linear.
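A minimal sketch of that (price, item count) pre-filter, assuming bundles are kept as bitmasks in a list sorted by price; the layout and names are my own, not from the answer:

    import bisect

    # by_price: list of (price, item_count, toy_mask) tuples, sorted by price.
    def is_bad_deal_indexed(x_mask, x_price, by_price):
        x_count = bin(x_mask).count("1")
        # Binary search for the first entry that is not strictly cheaper than X.
        cutoff = bisect.bisect_left(by_price, (x_price, -1, -1))
        for price, count, y_mask in by_price[:cutoff]:       # strictly cheaper bundles only
            if count > x_count and (x_mask & y_mask) == x_mask:
                return True                                   # cheaper strict superset found
        return False

The binary search narrows the candidate set, but as noted above, the subset tests on whatever remains are still linear.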

+1

Well, suppose the number of toys n is constant and small, i.e. you have the set {toy_0, ..., toy_n-1}.

Then you can have an array Set[n] bundleContainingToy, and if bundle x contains toy_i, you add x to the set bundleContainingToy[i].

If you get a new bundle 1 0 0 1 0 1, you only need to compute the intersection of bundleContainingToy[0], bundleContainingToy[3] and bundleContainingToy[5]. If computing the intersection is O(1) (which you can probably assume, since you said that checking the subset property is easy), you can do this check in O(1), since n is constant (and small).

Is this the sieve you were looking for? The rest of your calculation depends only on the number of bundles that contain toy_0, toy_3 and toy_5.
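A small sketch of that per-toy index, assuming bundles arrive as bitmasks; the identifiers mirror the answer but are otherwise illustrative:

    n_toys = 9  # assumed small and fixed, as in the example
    bundleContainingToy = [set() for _ in range(n_toys)]   # toy index -> ids of bundles containing it

    def add_bundle(bundle_id, toy_mask):
        for i in range(n_toys):
            if (toy_mask >> i) & 1:
                bundleContainingToy[i].add(bundle_id)

    def superset_candidates(toy_mask):
        # Only bundles containing every toy of the query can be its supersets.
        chosen = [bundleContainingToy[i] for i in range(n_toys) if (toy_mask >> i) & 1]
        return set.intersection(*chosen) if chosen else set()

A final pass over superset_candidates(x), comparing prices, then decides whether X is a bad deal.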

0

If the requirement is that the lookup part must be O(1), you can first build a map:

    map = {}
    for (bundle, price) in bundles:
        for subset in subsets(bundle):  # including bundle itself
            if map.contains(subset):
                map[subset] = min(map[subset], price)
            else:
                map[subset] = price

Now, to check whether a given deal is bad:

    def isBadDeal(bundle, price):
        return map[bundle] < price
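A runnable Python version of the same idea, with bundles as bitmasks (an assumption on my part); note that enumerating every subset of a bundle is exponential in its number of toys, so this only pays off when bundles are small:

    # Precompute, for every possible subset key, the cheapest bundle containing it.
    def build_min_price_map(bundles_with_prices):
        best = {}
        for mask, price in bundles_with_prices:      # iterable of (toy_bitmask, price)
            sub = mask
            while True:                               # standard submask enumeration trick
                if price < best.get(sub, float("inf")):
                    best[sub] = price
                if sub == 0:
                    break
                sub = (sub - 1) & mask
        return best

    def is_bad_deal(best, mask, price):
        # O(1) lookup: is there a bundle containing `mask` that is cheaper?
        return best.get(mask, float("inf")) < price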
0

A bitwise AND between the offered bundle and every stored bundle determines which stored bundles contain the offered bundle as a subset (i.e. are its supersets). A price comparison then tells whether any of those supersets is cheaper than the offered cost; if one is, the offer is a bad deal.

In Python / numpy:

    import numpy

    def bad_deal(bundle, cost):
        # any stored bundle that is a superset of `bundle` and strictly cheaper than `cost`
        return (((bundles & bundle) == bundle) * (prices < cost)).any()

    # Generate some test data
    numpy.random.seed(11)
    bundles = numpy.random.randint(0, 512, (1000000,))  # bitmasks 000000000 to 111111111
    prices = numpy.random.randint(40, 70, (1000000,))   # prices from $40 to $69

A better approach is to first keep only the bundles that are strictly cheaper than the given cost, and then check whether the offered bundle is a subset of any of them, which means we have a bad deal:

    def bad_deal_2(bundle, cost):
        # keep only the bundles that are strictly cheaper than `cost`
        less_exp_bundles = numpy.delete(bundles, numpy.where(prices >= cost))
        return ((less_exp_bundles & bundle) == bundle).any()

In this case, the worst case is that the offered bundle is the most expensive one and all other bundles are its supersets, which requires checking every bundle. In all other cases, we check fewer than the total number of bundles. It is still necessary to scan all prices (though that comparison is cheaper than the bitwise subset test), so filtering by price first makes sense to reduce the number of bundles we need to examine.
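For instance, a query against the test data generated above might look like this (the bundle and cost values here are made up):

    query_bundle = 0b100101101   # the offered bundle, as a 9-bit mask
    query_cost = 65
    print(bad_deal(query_bundle, query_cost))    # True if some strictly cheaper superset exists
    print(bad_deal_2(query_bundle, query_cost))  # should agree with bad_deal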

0

The properties that make a vector a bad deal define a relation. The power set of all vectors together with this relation forms a lattice. To decide whether a set X is bad, you only need to remember the minimal elements of this lattice. The basic approach:

You start with an empty database and read the vectors one by one. For each vector, you check it against your database to see whether it is a bad deal or not. If it is, you throw it away; otherwise you add it to your database. The performance of this approach obviously depends on your data, but if, as you said, the number of toys is a small fixed number, then you will probably be fine.
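A sketch of that incremental filtering, again assuming bitmask bundles; the helper names are illustrative:

    database = []   # (toy_mask, price) pairs that have survived so far

    def makes_bad(y_mask, y_price, x_mask, x_price):
        # Y makes X a bad deal: X is a strict subset of Y and X costs more.
        return (x_mask & y_mask) == x_mask and x_mask != y_mask and y_price < x_price

    def process(x_mask, x_price):
        if any(makes_bad(y_mask, y_price, x_mask, x_price) for y_mask, y_price in database):
            return False                 # bad deal: throw it away
        database.append((x_mask, x_price))
        return True                      # keep it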

As mentioned above, constant time is not possible, simply because you can have a super-constant number of items to track, even if you are smart about how to do it.

-1

Source: https://habr.com/ru/post/897079/

