An efficient algorithm for random sampling from a distribution that allows weight updates?

This is a question I was asked at an interview some time ago, and I could not figure out the answer.

Given samples S1, S2, ..., Sn and their probability distribution (or weights, whatever they are called) P1, P2, ..., Pn, design an algorithm that randomly selects a sample based on its probability. The solution I came up with is as follows:

  • Build an array of cumulative weights C[i], such that

    C[0] = 0; C[i] = C[i-1] + P[i].

and at the same time compute the total T = P1 + P2 + ... + Pn. This takes O(n) time.

  1. Generate a uniformly random number R = T * random[0..1].
  2. Using binary search, return the smallest i such that C[i] >= R; the result is Si. This takes O(log N) time. (A sketch of this baseline follows.)
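
For concreteness, here is a minimal sketch of this baseline in Python (the function names are my own, not part of the original question): build the cumulative array once in O(n), then sample in O(log n) with bisect:

    import bisect
    import random

    def build_cumulative(weights):
        """O(n): cumulative sums C[i] and the total T."""
        c, total = [], 0.0
        for w in weights:
            total += w
            c.append(total)
        return c, total

    def sample(c, total):
        """O(log n): smallest i with C[i] >= R, where R = T * random[0..1)."""
        r = random.random() * total
        return bisect.bisect_left(c, r)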

Now the actual question: suppose I want to change one of the initial weights Pj. How can this be done in better than O(n) time? Other data structures are acceptable, but the random sampling algorithm should not degrade beyond O(log N).

+6
3 answers

One way to solve this problem is to rethink how the binary search tree containing the cumulative totals is constructed. Instead of building a binary search tree over the cumulative sums, build a tree in which each node is interpreted as follows:

  • Each node stores a range of values reserved for the node itself.
  • Nodes in the left subtree correspond to sampling from the probability distribution just to the left of that range.
  • Nodes in the right subtree correspond to sampling from the probability distribution just to the right of that range.

For example, suppose our weights are 3, 2, 2, 2, 2, 1, and 1 for events A, B, C, D, E, F, and G. We build this binary tree containing A, B, C, D, E, F, and G:

             D
           /   \
          B     F
         / \   / \
        A   C E   G

Now we annotate the tree with probabilities. Since A, C, E, and G are all leaves, each of them receives the entire probability mass at its node:

             D
           /   \
          B     F
         / \   / \
        A   C E   G
        1   1 1   1

Now look at the subtree rooted at B. B has weight 2 of being chosen, A has weight 3, and C has weight 2. If we normalize these to the range [0, 1), then A accounts for 3/7 of the probability mass and B and C account for 2/7 each. Thus the node for B says that anything in the range [0, 3/7) goes to the left subtree, anything in the range [3/7, 5/7) maps to B itself, and anything in the range [5/7, 1) maps to the right subtree:

                     D
                   /   \
               B            F
    [0, 3/7)  / \ [5/7, 1) / \
             A   C        E   G
             1   1        1   1

Similarly, let's process F. E has weight 2 of being chosen, while F and G each have weight 1. Thus the subtree for E accounts for 1/2 of the probability mass here, node F itself accounts for 1/4, and the subtree for G accounts for 1/4. This means that we can assign the probabilities as:

                          D
                       /     \
               B                      F
    [0, 3/7)  / \ [5/7, 1)  [0, 1/2) / \ [3/4, 1)
             A   C                  E   G
             1   1                  1   1

Finally, look at the root. The combined weight of the left subtree is 3 + 2 + 2 = 7. The combined weight of the right subtree is 2 + 1 + 1 = 4. The weight of D itself is 2. Thus the left subtree has probability 7/13 of being chosen, D has probability 2/13 of being chosen, and the right subtree has probability 4/13 of being chosen. Thus we can finalize the probabilities as:

                          D
             [0, 7/13) /     \ [9/13, 1)
               B                      F
    [0, 3/7)  / \ [5/7, 1)  [0, 1/2) / \ [3/4, 1)
             A   C                  E   G
             1   1                  1   1

To generate a random value, you repeat the following (a code sketch follows the list):

  • Starting at the root:
    • Choose a uniformly random value in the range [0, 1).
    • If it is in the range for the left subtree, descend into it.
    • If it is in the range for the right subtree, descend into it.
    • Otherwise, return the value corresponding to the current node.
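
As a rough sketch (assuming each node carries the annotations described above as half-open intervals left_range and right_range; the field names are my own), the walk looks like this:

    import random

    def sample(node):
        # Descend until the random draw lands in the node's own range.
        while True:
            x = random.random()  # uniform in [0, 1)
            if node.left is not None and node.left_range[0] <= x < node.left_range[1]:
                node = node.left
            elif node.right is not None and node.right_range[0] <= x < node.right_range[1]:
                node = node.right
            else:
                return node.value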

The probabilities themselves can be determined recursively when building the tree (sketched in code after the list):

  • The left and right probabilities are 0 for any leaf node.
  • If an interior node itself has weight W, its left subtree has total weight W_L, and its right subtree has total weight W_R, then the left probability is W_L / (W + W_L + W_R) and the right probability is W_R / (W + W_L + W_R).
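
A sketch of that recursive pass (node fields are illustrative; each node also caches its subtree total so the update described next can reuse it):

    def annotate(node):
        # Returns the total weight of the subtree rooted at this node.
        if node is None:
            return 0.0
        w_left = annotate(node.left)
        w_right = annotate(node.right)
        node.total = node.weight + w_left + w_right
        # The left edge gets W_L / (W + W_L + W_R) of the mass and the right
        # edge gets W_R / (W + W_L + W_R); leaves end up with zero-width
        # edge ranges, so sampling always stops there.
        node.left_range = (0.0, w_left / node.total)
        node.right_range = ((node.weight + w_left) / node.total, 1.0)
        return node.total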

The reason this reformulation is useful is that it lets us update probabilities in O(log n) time per updated weight. In particular, think about which invariants change when we update the weight of some node. For simplicity, suppose for now that the node is a leaf. When we update a leaf's weight, the probabilities for the leaf itself are still correct, but they are incorrect for the node just above it, because the weight of one of that node's subtrees has changed. Thus we can (in O(1) time) recompute the probabilities at the parent node by using the same formula as above. But then the parent of that node no longer has the correct values, because one of its subtree weights has changed, so we recompute the probability there as well. This process repeats all the way up to the root of the tree, doing O(1) computation per level to correct the weights assigned to each edge. Assuming the tree is balanced, we therefore do O(log n) total work to update one probability. The logic is identical if the node is not a leaf; we simply start somewhere in the middle of the tree. (A sketch of this update loop follows.)
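
Under the same assumptions (the cached totals from the annotation pass plus a parent pointer, both illustrative rather than from the answer verbatim), the repair loop might look like:

    def set_weight(node, new_weight):
        node.weight = new_weight
        # Walk up to the root, redoing O(1) work per level: O(log n) total
        # in a balanced tree.
        while node is not None:
            w_left = node.left.total if node.left is not None else 0.0
            w_right = node.right.total if node.right is not None else 0.0
            node.total = node.weight + w_left + w_right
            node.left_range = (0.0, w_left / node.total)
            node.right_range = ((node.weight + w_left) / node.total, 1.0)
            node = node.parent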

In short, this gives:

  • O(n) time to build the tree (using a bottom-up approach),
  • O(log n) time to generate a random value, and
  • O(log n) time to update any single value.

Hope this helps!

+5

Instead of an array, store the search structure as a balanced binary tree. Each node of the tree stores the total weight of the elements it contains. Depending on the value of R, the search procedure either returns the current node or recurses into the left or right subtree.

When the weight of an element changes, updating the search structure amounts to adjusting the weights on the path from the element to the root of the tree.

Since the tree is balanced, both the search and the weight-update operations take O(log N) time. (A sketch of the search appears below.)
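
A minimal sketch of the search this answer describes, assuming an illustrative node type that stores its own weight and a cached total for its subtree:

    def find(node, r):
        # r is drawn uniformly from [0, root.total) before the first call.
        left_total = node.left.total if node.left is not None else 0.0
        if r < left_total:
            return find(node.left, r)            # R falls in the left subtree
        r -= left_total
        if r < node.weight:
            return node                          # R falls on this node itself
        return find(node.right, r - node.weight)  # otherwise go right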

+4

For those of you who want code, here is a Python implementation:

    import numpy

    class DynamicProbDistribution(object):
        """ Given a set of weighted items, randomly samples an item with
        probability proportional to its weight. This class also supports fast
        modification of the distribution, so that changing an item's weight
        requires O(log N) time. Sampling requires O(log N) time. """

        def __init__(self, weights):
            self.num_weights = len(weights)
            self.weights = numpy.empty((1 + len(weights),), 'float32')
            self.weights[0] = 0  # Not necessary, but easier to read when printed
            self.weights[1:] = weights
            self.weight_tree = numpy.zeros((1 + len(weights),), 'float32')
            self.populate_weight_tree()

        def populate_weight_tree(self):
            """ The value of every node in the weight tree is equal to the sum
            of all weights in the subtree rooted at that node. """
            i = self.num_weights
            while i > 0:
                weight_sum = self.weights[i]
                twoi = 2 * i
                if twoi < self.num_weights:
                    weight_sum += self.weight_tree[twoi] + self.weight_tree[twoi + 1]
                elif twoi == self.num_weights:
                    weight_sum += self.weights[twoi]
                self.weight_tree[i] = weight_sum
                i -= 1

        def set_weight(self, item_idx, weight):
            """ Changes the weight of the given item. """
            i = item_idx + 1
            self.weights[i] = weight
            while i > 0:
                weight_sum = self.weights[i]
                twoi = 2 * i
                if twoi < self.num_weights:
                    weight_sum += self.weight_tree[twoi] + self.weight_tree[twoi + 1]
                elif twoi == self.num_weights:
                    weight_sum += self.weights[twoi]
                self.weight_tree[i] = weight_sum
                i //= 2  # Only the ancestors of this node need updating

        def sample(self):
            """ Returns an item index sampled from the distribution. """
            i = 1
            while True:
                twoi = 2 * i
                if twoi < self.num_weights:
                    # Two children
                    val = numpy.random.random() * self.weight_tree[i]
                    if val < self.weights[i]:
                        # All indices are offset by 1 for fast traversal of
                        # the internal binary tree
                        return i - 1
                    elif val < self.weights[i] + self.weight_tree[twoi]:
                        i = twoi  # Descend into the left subtree
                    else:
                        i = twoi + 1  # Descend into the right subtree
                elif twoi == self.num_weights:
                    # One child
                    val = numpy.random.random() * self.weight_tree[i]
                    if val < self.weights[i]:
                        return i - 1
                    else:
                        i = twoi
                else:
                    # No children
                    return i - 1

    def validate_distribution_results(dpd, weights, samples_per_item=1000):
        import time
        bins = numpy.zeros((len(weights),), 'float32')
        num_samples = samples_per_item * int(numpy.sum(weights))
        start = time.time()
        for i in range(num_samples):
            bins[dpd.sample()] += 1
        duration = time.time() - start
        # Rescale counts back to the weight scale for comparison
        bins *= numpy.sum(weights)
        bins /= num_samples
        print("Time to make %s samples: %s" % (num_samples, duration))
        # These should be very close to each other
        print("\nWeights:\n", weights)
        print("\nBins:\n", bins)
        sdev_tolerance = 10  # Very unlikely to be exceeded
        tolerance = float(sdev_tolerance) / numpy.sqrt(samples_per_item)
        print("\nTolerance:\n", tolerance)
        error = numpy.abs(weights - bins)
        print("\nError:\n", error)
        assert (error < tolerance).all()

    def test_DynamicProbDistribution():
        # First test that the initial distribution generates valid samples.
        weights = [2, 5, 4, 8, 3, 6, 6, 1, 3, 4, 7, 9]
        dpd = DynamicProbDistribution(weights)
        validate_distribution_results(dpd, weights)

        # Now test that we can change the weights and still sample from the
        # distribution.
        print("\nChanging weights...")
        dpd.set_weight(4, 10)
        weights[4] = 10
        dpd.set_weight(9, 2)
        weights[9] = 2
        dpd.set_weight(5, 4)
        weights[5] = 4
        dpd.set_weight(11, 3)
        weights[11] = 3
        validate_distribution_results(dpd, weights)
        print("\nTest passed")

    if __name__ == '__main__':
        test_DynamicProbDistribution()
+1

Source: https://habr.com/ru/post/906505/

