Random sampling of a character vector without elements prefixing each other

Consider a character vector, pool , whose elements (with zero padding) are binary numbers with digits max_len .

 max_len <- 4 pool <- unlist(lapply(seq_len(max_len), function(x) do.call(paste0, expand.grid(rep(list(c('0', '1')), x))))) pool ## [1] "0" "1" "00" "10" "01" "11" "000" "100" "010" "110" ## [11] "001" "101" "011" "111" "0000" "1000" "0100" "1100" "0010" "1010" ## [21] "0110" "1110" "0001" "1001" "0101" "1101" "0011" "1011" "0111" "1111" 

I would like to try n these elements with the restriction that none of the selected elements are prefixes of any of the other sampling elements (i.e. if we project 1101 , we forbid 1 , 11 and 110 , whereas if we try 1 , we disable items starting with 1 , such as 10 , 11 , 100 , etc.).

Below is my attempt to use while , but of course it is slow when n large (or when it approaches 2^max_len ).

 set.seed(1) n <- 10 chosen <- sample(pool, n) while(any(rowSums(outer(paste0('^', chosen), chosen, Vectorize(grepl))) > 1)) { prefixes <- rowSums(outer(paste0('^', chosen), chosen, Vectorize(grepl))) > 1 pool <- pool[rowSums(Vectorize(grepl, 'pattern')( paste0('^', chosen[!prefixes]), pool)) == 0] chosen <- c(chosen[!prefixes], sample(pool, sum(prefixes))) } chosen ## [1] "0100" "0101" "0001" "0011" "1000" "111" "0000" "0110" "1100" "0111" 

This can be improved slightly by initially removing from the pool those elements whose inclusion means that there are fewer elements left in the pool to take a common sample of size n . For example, when max_len = 4 and n > 9 we can immediately remove 0 and 1 from the pool , since by including either the maximum pattern will be 9 (or 0 , and eight 4-character elements starting with 1 or 1 and eight 4- character elements starting with 0 ).

Based on this logic, we could then omit the elements from the pool before taking the original sample, with something like:

 pool <- pool[ nchar(pool) > tail(which(n > (2^max_len - rev(2^(0:max_len))[-1] + 1)), 1)] 

Can anyone think of a better approach? I feel like I'm missing something much simpler.




EDIT

To clarify my intentions, I will describe the pool as a set of branches, where the nodes and ends are nodes ( pool elements). Suppose the yellow node in the following figure (i.e. 010) has been drawn. Now the entire red "branch", consisting of nodes 0, 01 and 010, is removed from the pool. This is what I had in mind by prohibiting the selection of nodes that are “prefixed” nodes are already in our example (as well as those that are already preceded by nodes in our example).

enter image description here

If the node selection was halfway along the branch, for example, 01 in the following figure, then all red nodes (0, 01, 010 and 011) are forbidden, since the prefixes 0 and 01 of the prefixes are both 010 and 011.

enter image description here

I do not want to try either 1 or 0 on each connection (i.e. walking along the branches flipping coins on forks) - it is good to have both in the sample and: (1) parents (or grandiose, parents, etc.) etc.), or children (grandchildren, etc.) are no longer selected from node; and (2) when selecting a node, sufficient nodes will remain to achieve the desired pattern of size n .

In the second figure above, if 010 was the first choice, all nodes in the black nodes are still (currently) valid, assuming n <= 4 . For example, if n==4 and we selected node 1 next (and therefore our samples now included 01 and 1), we would subsequently disable node 00 (due to rule 2 above), but still we could choose 000 and 001, giving us our 4-element sample. If n==5 , on the other hand, node 1 will be canceled at this point.

+43
performance r combinatorics
Jun 10 '15 at 23:42
source share
8 answers

Introduction

This is a numerical change to the string algorithm implemented in another answer. It is faster and does not require creating or sorting a pool.

Algorithm design

We can use integers to represent your binary strings, which greatly simplifies the task of creating a pool and sequentially eliminating values. For example, with max_len==3 we can take the number 1-- (where - represents the indentation) to mean 4 in decimal form. In addition, we can determine that the numbers that need to be eliminated, if we choose this number, are numbers between 4 and 4 + 2 ^ x - 1 . Here x is the number of filling elements (in this case 2), therefore the numbers to be excluded are between 4 and 4 + 2 ^ 2 - 1 (or between 4 and 7 , expressed as 100 , 110 and 111 ).

To fit your problem exactly, we need a little wrinkle, as you process numbers that are potentially the same in binary format as separate for some parts of your algorithm. For example, 100 , 10- and 1-- are all the same number, but in your scheme you need to handle it differently. In the world max_len==3 , we have 8 possible numbers, but 14 representations are possible:

 0 - 000: 0--, 00- 1 - 001: 2 - 010: 01- 3 - 011: 4 - 100: 1--, 10- 5 - 101: 6 - 110: 11- 7 - 111: 

So, 0 and 4 have three possible encodings, 2 and 6 have two, and the rest are only one. We need to create an integer pool that represents a higher probability of choice for numbers with multiple representations, as well as a mechanism to track the number of spaces that a number includes. We can do this by adding a few bits at the end of the number to indicate the scales we want. So, our numbers become (we use two bits here):

 jbaum | int | bin | bin.enc | int.enc 0-- | 0 | 000 | 00000 | 0 00- | 0 | 000 | 00001 | 1 000 | 0 | 000 | 00010 | 2 001 | 1 | 001 | 00100 | 3 01- | 2 | 010 | 01000 | 4 010 | 2 | 010 | 01001 | 5 011 | 3 | 011 | 01101 | 6 1-- | 4 | 100 | 10000 | 7 10- | 4 | 100 | 10001 | 8 100 | 4 | 100 | 10010 | 9 101 | 5 | 101 | 10100 | 10 11- | 6 | 110 | 11000 | 11 110 | 6 | 110 | 11001 | 12 111 | 7 | 111 | 11100 | 13 

Some useful features:

  • enc.bits represents the number of bits we need to encode (two in this case)
  • int.enc %% enc.bits indicates how many cells are explicitly specified
  • int.enc %/% enc.bits returns int
  • int * 2 ^ enc.bits + explicitly.specified returns int.enc

Please note that explicitly.specified here is between 0 and max_len - 1 in our implementation, since at least one digit is always specified. Now we have an encoding that fully reflects your data structure using only integers. We can choose from integers and reproduce the desired results with the correct weights, etc. One limitation of this approach is that we use 32-bit integers in R, and we need to reserve some bits for encoding, so we are limited to max_len==25 or so pools. You could go more if you used double precision integer numbers, but we didn’t.

Avoid duplicate messages

There are two rough ways to ensure that we do not select the same value twice.

  • Keep track of which values ​​remain available for selection, and randomly choose from them
  • Sample randomly from all possible values, and then check if this value was previously selected, and if it has, again the sample

While the first option seems the cleanest, it is actually a very expensive computing one. This requires either a vector scan of all possible values ​​for each choice in order to pre-disqualify the selected values, or create a reduction vector containing non-disqualified values. The shrink option is more efficient than vector scanning if the vector is compressed by reference through C code, but even then it requires repeated translations of potentially large parts of the vector, and this requires C.

Here we use method # 2. This allows us to randomly shuffle all possible values ​​once, and then select each value sequentially, check that it has not been disqualified, and if it is, select another, etc. This works because it is trivial to check whether the value was chosen as a result of our encoding of values; we can determine the location of a value in a sorted table based on only the value . Thus, we record the status of each value in a sorted table and can either update or search for this state through direct access to the index (no verification is required).

Examples

The implementation of this algorithm in the R database is available as an entity . This particular implementation only pulls complete draws. Here is an example of 10 drawings of 8 elements from the pool max_len==4 :

 # each column represents a draw from a `max_len==4` pool set.seed(6); replicate(10, sample0110b(4, 8)) [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [1,] "1000" "1" "0011" "0010" "100" "0011" "0" "011" "0100" "1011" [2,] "111" "0000" "1101" "0000" "0110" "0100" "1000" "00" "0101" "1001" [3,] "0011" "0110" "1001" "0100" "0000" "0101" "1101" "1111" "10" "1100" [4,] "0100" "0010" "0000" "0101" "1101" "101" "1011" "1101" "0110" "1101" [5,] "101" "0100" "1100" "1100" "0101" "1001" "1001" "1000" "1111" "1111" [6,] "110" "0111" "1011" "111" "1011" "110" "1111" "0100" "0011" "000" [7,] "0101" "0101" "111" "011" "1010" "1000" "1100" "101" "0001" "0101" [8,] "011" "0001" "01" "1010" "0011" "1110" "1110" "1001" "110" "1000" 

We also initially had two implementations that relied on method # 1 to avoid duplication, one in the R base and one in C, but even version C is not as fast as the new base version of R when n big. This function implements the ability to draw incomplete drawings, so we give them here for reference:

Comparative tests

Here is a set of tests comparing several functions that are displayed in this Q / A. Time in milliseconds. The brodie.b version is the version described in this answer. brodie is the original implementation, brodie.C is the original with some C. All this ensures compliance with the requirements for complete samples. brodie.str is the string version in another answer.

  size n jbaum josilber frank tensibai brodie.b brodie brodie.C brodie.str 1 4 10 11 1 3 1 1 1 1 0 2 4 50 - - - 1 - - - 1 3 4 100 - - - 1 - - - 0 4 4 256 - - - 1 - - - 1 5 4 1000 - - - 1 - - - 1 6 8 10 1 290 6 3 2 2 1 1 7 8 50 388 - 8 8 3 4 3 4 8 8 100 2,506 - 13 18 6 7 5 5 9 8 256 - - 22 27 13 14 12 6 10 8 1000 - - - 27 - - - 7 11 16 10 - - 615 688 31 61 19 424 12 16 50 - - 2,123 2,497 28 276 19 1,764 13 16 100 - - 4,202 4,807 30 451 23 3,166 14 16 256 - - 11,822 11,942 40 1,077 43 8,717 15 16 1000 - - 38,132 44,591 83 3,345 130 27,768 

It scales pretty well for large pools

 system.time(sample0110b(18, 100000)) user system elapsed 8.441 0.079 8.527 

Notes:

  • frank and brodie (minus brodie.str) do not require pooling, which will affect the comparison (see below).
  • Josilber is the LP version
  • jbaum - OP example
  • tensibai is slightly modified to exit instead of failing if the pool is empty.
  • not configured to run python so cannot compare / account to buffer
  • - are either impracticable options, or too slow in time reasonably.

Timing does not include drawing pools ( 0.8 , 2.5 , 401 milliseconds respectively for sizes 4 , 8 and 16 )), which is necessary for jbaum , josilber and brodie.str starts or sorts them ( 0.1 , 2.7 , 3700 milliseconds for sizes 4 , 8 and 16 ), which is necessary for brodie.str in addition to the draw. Whether you want to enable these or not depends on how many times you run the function for a particular pool. In addition, there are almost certainly better ways to generate / sort pools.

This is the average time of three runs with a microbenchmark . The code is available as an entity , but note that you will need to download sample0110b , sample0110 , sample01101 and sample01 .

+19
Jun 11 '15 at 12:40
source share

I found the problem interesting, so I tried and got this with very low skills in R (so this can be improved):

Newly edited version thanks to @Franck :

 library(microbenchmark) library(lineprof) max_len <- 16 pool <- unlist(lapply(seq_len(max_len), function(x) do.call(paste0, expand.grid(rep(list(c('0', '1')), x))))) n<-100 library(stringr) tree_sample <- function(samples,pool) { results <- vector("integer",samples) # Will be used on a regular basis, compute it in advance PoolLen <- str_length(pool) # Make a mask vector based on the length of each entry of the pool masks <- strtoi(str_pad(str_pad("1",PoolLen,"right","1"),max_len,"right","0"),base=2) # Make an integer vector from "0" right padded orignal: for max_len=4 and pool entry "1" we get "1000" => 8 # This will allow to find this entry as parent of 10 and 11 which become "1000" and "1100", as integer 8 and 12 respectively # once bitwise "anded" with the repective mask "1000" the first bit is striclty the same, so it a parent. integerPool <- strtoi(str_pad(pool,max_len,"right","0"),base=2) # Create a vector to filter the available value to sample ok <- rep(TRUE,length(pool)) #Precompute the result of the bitwise and betwwen our integer pool and the masks MaskedPool <- bitwAnd(integerPool,masks) while(samples) { samp <- sample(pool[ok],1) # Get a sample results[samples] <- samp # Store it as result ok[pool == samp] <- FALSE # Remove it from available entries vsamp <- strtoi(str_pad(samp,max_len,"right","0"),base=2) # Get the integer value of the "0" right padded sample mlen <- str_length(samp) # Get sample len #Creation of unitary mask to remove childs of sample mask <- strtoi(paste0(rep(1:0,c(mlen,max_len-mlen)),collapse=""),base=2) # Get the result of bitwise And between the integerPool and the sample mask FilterVec <- bitwAnd(integerPool,mask) # Get the bitwise and result of the sample and it mask Childm <- bitwAnd(vsamp,mask) ok[FilterVec == Childm] <- FALSE # Remove from available entries the childs of the sample ok[MaskedPool == bitwAnd(vsamp,masks)] <- FALSE # compare the sample with all the masks to remove parents matching samples <- samples -1 } print(results) } microbenchmark(tree_sample(n,pool),times=10L) 

The main idea is to use a bitmask comparison to find out if one pattern is the parent of another (the common part of the bits), if so, suppress this element from the pool.

Now it takes 1.4 s to draw 100 samples from a pool of length 16 on my machine.

+15
Jun 17 '15 at 8:44
source share

You can sort the pool to decide which items to disqualify. For example, looking at a sorted pool of three elements:

  [1] "0" "00" "000" "001" "01" "010" "011" "1" "10" "100" "101" "11" [13] "110" "111" 

I can say that I can disqualify everything that follows my selected item, which has more characters than my item, to the first item that has the same number or fewer characters. For example, if I select "01", I immediately see that the next two elements ("010", "011") need to be deleted, but not after that, because "1" has less characters. Removing "0" after that is easy. Here is the implementation:

 library(fastmatch) # could use `match`, but we repeatedly search against same hash # `pool` must be sorted! sample01 <- function(pool, n) { picked <- logical(length(pool)) chrs <- nchar(pool) pick.list <- character(n) pool.seq <- seq_along(pool) for(i in seq(n)) { # Make sure pool not exhausted left <- which(!picked) left.len <- length(left) if(!length(left)) break # Sample from pool seq.left <- seq.int(left) pool.left <- pool[left] chrs.left <- chrs[left] pick <- sample(length(pool.left), 1L) # Find all the elements with more characters that are disqualified # and store their indices in `valid` (bad name...) valid.tmp <- chrs.left > chrs.left[[pick]] & seq.left > pick first.invalid <- which(!valid.tmp & seq.left > pick) valid <- if(length(first.invalid)) { pick:(first.invalid[[1L]] - 1L) } else pick:left.len # Translate back to original pool indices since we're working on a # subset in `pool.left` pool.seq.left <- pool.seq[left] pool.idx <- pool.seq.left[valid] val <- pool[[pool.idx[[1L]]]] # Record the picked value, and all the disqualifications pick.list[[i]] <- val picked[pool.idx] <- TRUE # Disqualify shorter matches to.rem <- vapply( seq.int(nchar(val) - 1), substr, character(1L), x=val, start=1L ) to.rem.idx <- fmatch(to.rem, pool, nomatch=0) picked[to.rem.idx] <- TRUE } pick.list } 

And the function to create sorted pools (exactly the same as your code, but returns sorting):

 make_pool <- function(size) sort( unlist( lapply( seq_len(size), function(x) do.call(paste0, expand.grid(rep(list(c('0', '1')), x))) ) ) ) 

Then, using the max_len 3 pool (useful for visual inspection of things, behaves as expected):

 pool3 <- make_pool(3) set.seed(1) sample01(pool3, 8) # [1] "001" "1" "010" "011" "000" "" "" "" sample01(pool3, 8) # [1] "110" "111" "011" "10" "00" "" "" "" sample01(pool3, 8) # [1] "000" "01" "11" "10" "001" "" "" "" sample01(pool3, 8) # [1] "011" "101" "111" "001" "110" "100" "000" "010" 

Please note that in the latter case we get all three-digit binary combinations (2 ^ 3), because by chance we saved a selection of three-digit ones. In addition, with just a 3-dimensional pool, there are many samples that prevent a total of 8 draws; you can refer to this with your suggestion to eliminate combinations that prevent a complete draw from the pool.

This is pretty fast. Considering the example max_len==9 , which took 2 seconds with an alternative solution:

 pool9 <- make_pool(9) microbenchmark(sample01(pool9, 4)) # Unit: microseconds # expr min lq median uq max neval # sample01(pool9, 4) 493.107 565.015 571.624 593.791 983.663 100 

About half a millisecond. You can reasonably try quite large pools:

 pool16 <- make_pool(16) # 131K entries system.time(sample01(pool16, 100)) # user system elapsed # 3.407 0.146 3.552 

This is not very fast, but we are talking about a pool with 130K elements. There is also the possibility of additional optimization.

Note that the sorting step becomes relatively slow for large pools, but I do not consider this because you need to ever do this once, and you could probably develop a reasonable algorithm for creating a pre-sorted pool.

There is also the possibility of a faster integer approach to the binary base, which I studied in the now deleted answer, but it requires more fair work to tie it to what you are looking for.

+13
Jun 11 '15 at 2:39 on
source share

Display identifiers in rows. . You can match numbers with your 0/1 vectors, as @BrodieG mentioned:

 # some key objects n_pool = sum(2^(1:max_len)) # total number of indices cuts = cumsum(2^(1:max_len-1)) # new group starts inds_by_g = mapply(seq,cuts,cuts*2) # indices grouped by length # the mapping to strings (one among many possibilities) library(data.table) get_01str <- function(id,max_len){ cuts = cumsum(2^(1:max_len-1)) g = findInterval(id,cuts) gid = id-cuts[g]+1 data.table(g,gid)[,s:= do.call(paste,c(list(sep=""),lapply( seq(g[1]), function(x) (gid-1) %/% 2^(x-1) %% 2 ))) ,by=g]$s } 

Search for identifiers to delete. We sequentially drop id from the fetch pool:

  # the mapping from one index to indices of nixed strings get_nixstrs <- function(g,gid,max_len){ cuts = cumsum(2^(1:max_len-1)) gids_child = { x = gid%%2^sequence(g-1) ifelse(x,x,2^sequence(g-1)) } ids_child = gids_child+cuts[sequence(g-1)]-1 ids_parent = if (g==max_len) gid+cuts[g]-1 else { gids_par = vector(mode="list",max_len) gids_par[[g]] = gid for (gg in seq(g,max_len-1)) gids_par[[gg+1]] = c(gids_par[[gg]],gids_par[[gg]]+2^gg) unlist(mapply(`+`,gids_par,cuts-1)) } c(ids_child,ids_parent) } 

Indexes are grouped by g , number of characters, nchar(get_01str(id)) . Since indexes are sorted by g , g=findInterval(id,cuts) is a faster route.

The index in the group g , 1 < g < max_len has one "child" index of size g-1 and two parent indexes of size g+1 . For each child node, we take its child node until we press g==1 ; and for each node parent we take their pair of parent nodes until we press g==max_len .

The tree structure is the simplest in terms of identifier within the group, gid . gid displays two parents, gid and gid+2^g ; and reversing this mapping, you will find it.

Sampling

 drawem <- function(n,max_len){ cuts = cumsum(2^(1:max_len-1)) inds_by_g = mapply(seq,cuts,cuts*2) oklens = (1:max_len)[ n <= 2^max_len*(1-2^(-(1:max_len)))+1 ] okinds = unlist(inds_by_g[oklens]) mysamp = rep(0,n) for (i in 1:n){ id = if (length(okinds)==1) okinds else sample(okinds,1) g = findInterval(id,cuts) gid = id-cuts[g]+1 nixed = get_nixstrs(g,gid,max_len) # print(id); print(okinds); print(nixed) mysamp[i] = id okinds = setdiff(okinds,nixed) if (!length(okinds)) break } res <- rep("",n) res[seq.int(i)] <- get_01str(mysamp[seq.int(i)],max_len) res } 

The oklens part combines the idea of ​​OP to exclude rows that ensure that fetching is not possible. However, even doing this, we can follow the sampling path, which leaves us without additional parameters. Taking the example of OP max_len=4 and n=10 , we know that we should refuse to consider 0 and 1 , but what happens if our first four draws are 00 , 01 , 11 and 10 ? Well, I think we're out of luck. That is why you must determine the probabilities of the sample. (The OP has another idea for determining at each step which nodes will lead to an impossible state, but this seems to be a high order.)

Illustration

 # how the indices line up n_pool = sum(2^(1:max_len)) pdt <- data.table(id=1:n_pool) pdt[,g:=findInterval(id,cuts)] pdt[,gid:=1:.N,by=g] pdt[,s:=get_01str(id,max_len)] # example run set.seed(4); drawem(5,5) # [1] "01100" "1" "0001" "0101" "00101" set.seed(4); drawem(8,4) # [1] "1100" "0" "111" "101" "1101" "100" "" "" 

Tests (older than @BrodieG's answer)

 require(rbenchmark) max_len = 8 n = 8 benchmark( jos_lp = { pool <- unlist(lapply(seq_len(max_len), function(x) do.call(paste0, expand.grid(rep(list(c('0', '1')), x))))) sample.lp(pool, n)}, bro_string = {pool <- make_pool(max_len);sample01(pool,n)}, fra_num = drawem(n,max_len), replications=5)[1:5] # test replications elapsed relative user.self # 2 bro_string 5 0.05 2.5 0.05 # 3 fra_num 5 0.02 1.0 0.02 # 1 jos_lp 5 1.56 78.0 1.55 n = 12 max_len = 12 benchmark( bro_string={pool <- make_pool(max_len);sample01(pool,n)}, fra_num=drawem(n,max_len), replications=5)[1:5] # test replications elapsed relative user.self # 1 bro_string 5 0.54 6.75 0.51 # 2 fra_num 5 0.08 1.00 0.08 

. :

 jos_enum = {pool <- unlist(lapply(seq_len(max_len), function(x) do.call(paste0, expand.grid(rep(list(c('0', '1')), x))))) get.template(pool, n)} bro_num = sample011(max_len,n) 

@josilber, ; @BrodieG / , , . . @BrodieG .

. @josilber ( , -, ), n . @BrodieG n . max_len , , .

, bro_string , pool .

+13
11 . '15 16:32
source share

, r, jbaums , .

, , . .
c t S , combs . , - ? .

100 16 8 . , , , combBuffer .

 import random class Tree(object): """ :param level: The distance of this node from the root. :type level: int :param parent: This trees parent node :type parent: Tree :param isleft: Determines if this is a left or a right child node. Can be omitted if this is the root node. :type isleft: bool A binary tree representing possible strings which match r'[01]{1,n}'. Its purpose is to be able to sample n of its nodes where none of the sampled nodes' ids is a prefix for another one. It is possible to change Tree.maxdepth and then reuse the root. All children are created ON DEMAND, which means everything is lazily evaluated. If the Tree gets too big anyway, you can call 'prune' on any node to delete its children. >>> t = Tree() >>> t.sample(8, toString=True, depth=3) ['111', '110', '101', '100', '011', '010', '001', '000'] >>> Tree.maxdepth = 2 >>> t.sample(4, toString=True) ['11', '10', '01', '00'] """ maxdepth = 10 _combBuffer = {} def __init__(self, level=0, parent=None, isleft=None): self.parent = parent self.level = level self.isleft = isleft self._left = None self._right = None @classmethod def setMaxdepth(cls, depth): """ :param depth: The new depth :type depth: int Sets the maxdepth of the Trees. This basically is the depth of the root node. """ if cls.maxdepth == depth: return cls.maxdepth = depth @property def left(self): """This tree left child, 'None' if this is a leave node""" if self.depth == 0: return None if self._left is None: self._left = Tree(self.level+1, self, True) return self._left @property def right(self): """This tree right child, 'None' if this is a leave node""" if self.depth == 0: return None if self._right is None: self._right = Tree(self.level+1, self, False) return self._right @property def depth(self): """ This tree depth. (maxdepth-level) """ return self.maxdepth-self.level @property def id(self): """ This tree id, string of '0 and '1 equal to the path from the root to this subtree. Where '1' means going left and '0' means going right. """ # level 0 is the root node, it has no id if self.level == 0: return '' # This takes at most Tree.maxdepth recursions. Therefore # it is save to do it this way. We could also save each nodes # id once it is created to avoid recreating it every time, however # this won't save much time but use quite some space. return self.parent.id + ('1' if self.isleft else '0') @property def leaves(self): """ The amount of leave nodes, this tree has. (2**depth) """ return 2**self.depth def __str__(self): return self.id def __len__(self): return 2*self.leaves-1 def prune(self): """ Recursively prune this tree children. """ if self._left is not None: self._left.prune() self._left.parent = None self._left = None if self._right is not None: self._right.prune() self._right.parent = None self._right = None def combs(self, n): """ :param n: The amount of samples to be taken from this tree :type n: int :returns: The amount of possible combinations to choose n samples from this tree Determines recursively the amount of combinations of n nodes to be sampled from this tree. Subsequent calls with same n on trees with same depth will return the result from the previous computation rather than computing it again. >>> t = Tree() >>> Tree.maxdepth = 4 >>> t.combs(16) 1 >>> Tree.maxdepth = 3 >>> t.combs(6) 58 """ # important for the amount of combinations is only n and the depth of # this tree key = (self.depth, n) # We use the dict to save computation time. Calling the function with # equal values on equal nodes just returns the alrady computed value if # possible. if key not in Tree._combBuffer: leaves = self.leaves if n < 0: N = 0 elif n == 0 or self.depth == 0 or n == leaves: N = 1 elif n == 1: return (2*leaves-1) else: if n > leaves/2: # if n > leaves/2, at least n-leaves/2 have to stay on # either side, otherweise the other one would have to # sample more nodes than possible. nMin = n-leaves/2 else: nMin = 0 # The rest n-2*nMin is the amount of samples that are free to # fall on either side free = n-2*nMin N = 0 # sum up the combinations of all possible splits for addLeft in range(0, free+1): nLeft = nMin + addLeft nRight = n - nLeft N += self.left.combs(nLeft)*self.right.combs(nRight) Tree._combBuffer[key] = N return N return Tree._combBuffer[key] def sample(self, n, toString=False, depth=None): """ :param n: How may samples to take from this tree :type n: int :param toString: If 'True' result will direclty be turned into a list of strings :type toString: bool :param depth: If not None, will overwrite Tree.maxdepth :type depth: int or None :returns: List of n nodes sampled from this tree :throws ValueError: when n is invalid Takes n random samples from this tree where none of the sample ids is a prefix for another one's. For an example see Tree docstring. """ if depth is not None: Tree.setMaxdepth(depth) if toString: return [str(e) for e in self.sample(n)] if n < 0: raise ValueError('Negative sample size is not possible!') if n == 0: return [] leaves = self.leaves if n > leaves: raise ValueError(('Cannot sample {} nodes, with only {} ' + 'leaves!').format(n, leaves)) # Only one sample to choose, that is nice! We are free to take any node # from this tree, including this very node itself. if n == 1 and self.level > 0: # This tree has 2*leaves-1 nodes, therefore # the probability that we keep the root node has to be # 1/(2*leaves-1) = P_root. Lets create a random number from the # interval [0, 2*leaves-1). # It will be 0 with probability 1/(2*leaves-1) P_root = random.randint(0, len(self)-1) if P_root == 0: return [self] else: # The probability to land here is 1-P_root # A child tree size is (leaves-1) and since it obeys the same # rule as above, the probability for each of its nodes to # 'survive' is 1/(leaves-1) = P_child. # However all nodes must have equal probability, therefore to # make sure that their probability is also P_root we multiply # them by 1/2*(1-P_root). The latter is already done, the # former will be achieved by the next condition. # If we do everything right, this should hold: # 1/2 * (1-P_root) * P_child = P_root # Lets see... # 1/2 * (1-1/(2*leaves-1)) * (1/leaves-1) # (1-1/(2*leaves-1)) * (1/(2*(leaves-1))) # (1-1/(2*leaves-1)) * (1/(2*leaves-2)) # (1/(2*leaves-2)) - 1/((2*leaves-2) * (2*leaves-1)) # (2*leaves-1)/((2*leaves-2) * (2*leaves-1)) - 1/((2*leaves-2) * (2*leaves-1)) # (2*leaves-2)/((2*leaves-2) * (2*leaves-1)) # 1/(2*leaves-1) # There we go! if random.random() < 0.5: return self.right.sample(1) else: return self.left.sample(1) # Now comes the tricky part... n > 1 therefore we are NOT going to # sample this node. Its probability to be chosen is 0! # It HAS to be 0 since we are definitely sampling from one of its # children which means that this node will be blocked by those samples. # The difficult part now is to prove that the sampling the way we do it # is really random. if n > leaves/2: # if n > leaves/2, at least n-leaves/2 have to stay on either # side, otherweise the other one would have to sample more # nodes than possible. nMin = n-leaves/2 else: nMin = 0 # The rest n-2*nMin is the amount of samples that are free to fall # on either side free = n-2*nMin # Let have a look at an example, suppose we were to distribute 5 # samples among two children which have 4 leaves each. # Each child has to get at least 1 sample, so the free samples are 3. # There are 4 different ways to split the samples among the # children (left, right): # (1, 4), (2, 3), (3, 2), (4, 1) # The amount of unique sample combinations per child are # (7, 1), (11, 6), (6, 11), (1, 7) # The amount of total unique samples per possible split are # 7 , 66 , 66 , 7 # In case of the first and last split, all samples have a probability # of 1/7, this was already proven above. # Lets suppose we are good to go and the per sample probabilities for # the other two cases are (1/11, 1/6) and (1/6, 1/11), this way the # overall per sample probabilities for the splits would be: # 1/7 , 1/66 , 1/66 , 1/7 # If we used uniform random to determine the split, all splits would be # equally probable and therefore be multiplied with the same value (1/4) # But this would mean that NOT every sample is equally probable! # We need to know in advance how many sample combinations there will be # for a given split in order to find out the probability to choose it. # In fact, due to the restrictions, this becomes very nasty to # determine. So instead of solving it analytically, I do it numerically # with the method 'combs'. It gives me the amount of possible sample # combinations for a certain amount of samples and a given tree depth. # It will return 146 for this node and 7 for the outer and 66 for the # inner splits. # What we now do is, we take a number from [0, 146). # if it is smaller than 7, we sample from the first split, # if it is smaller than 7+66, we sample from the second split, # ... # This way we get the probabilities we need. r = random.randint(0, self.combs(n)-1) p = 0 for addLeft in xrange(0, free+1): nLeft = nMin + addLeft nRight = n - nLeft p += (self.left.combs(nLeft) * self.right.combs(nRight)) if r < p: return self.left.sample(nLeft) + self.right.sample(nRight) assert False, ('Something really strange happend, p did not sum up ' + 'to combs or r was too big') def main(): """ Do a microbenchmark. """ import timeit i = 1 main.t = Tree() template = ' {:>2} {:>5} {:>4} {:<5}' print(template.format('i', 'depth', 'n', 'time (ms)')) N = 100 for depth in [4, 8, 15, 16, 17, 18]: for n in [10, 50, 100, 150]: if n > 2**depth: time = '--' else: time = timeit.timeit( 'main.t.sample({}, depth={})'.format(n, depth), setup= 'from __main__ import main', number=N)*1000./N print(template.format(i, depth, n, time)) i += 1 if __name__ == "__main__": main() 

:

  i depth n time (ms) 1 4 10 0.182511806488 2 4 50 -- 3 4 100 -- 4 4 150 -- 5 8 10 0.397620201111 6 8 50 1.66054964066 7 8 100 2.90236949921 8 8 150 3.48146915436 9 15 10 0.804011821747 10 15 50 3.7428188324 11 15 100 7.34910964966 12 15 150 10.8230614662 13 16 10 0.804491043091 14 16 50 3.66818904877 15 16 100 7.09567070007 16 16 150 10.404779911 17 17 10 0.865840911865 18 17 50 3.9999294281 19 17 100 7.70257949829 20 17 150 11.3758206367 21 18 10 0.915451049805 22 18 50 4.22935962677 23 18 100 8.22361946106 24 18 150 12.2081303596 

10 10 10:

 ['1111010111', '1110111010', '1010111010', '011110010', '0111100001', '011101110', '01110010', '01001111', '0001000100', '000001010'] ['110', '0110101110', '0110001100', '0011110', '0001111011', '0001100010', '0001100001', '0001100000', '0000011010', '0000001111'] ['11010000', '1011111101', '1010001101', '1001110001', '1001100110', '10001110', '011111110', '011001100', '0101110000', '001110101'] ['11111101', '110111', '110110111', '1101010101', '1101001011', '1001001100', '100100010', '0100001010', '0100000111', '0010010110'] ['111101000', '1110111101', '1101101', '1101000000', '1011110001', '0111111101', '01101011', '011010011', '01100010', '0101100110'] ['1111110001', '11000110', '1100010100', '101010000', '1010010001', '100011001', '100000110', '0100001111', '001101100', '0001101101'] ['111110010', '1110100', '1101000011', '101101', '101000101', '1000001010', '0111100', '0101010011', '0101000110', '000100111'] ['111100111', '1110001110', '1100111111', '1100110010', '11000110', '1011111111', '0111111', '0110000100', '0100011', '0010110111'] ['1101011010', '1011111', '1011100100', '1010000010', '10010', '1000010100', '0111011111', '01010101', '001101', '000101100'] ['111111110', '111101001', '1110111011', '111011011', '1001011101', '1000010100', '0111010101', '010100110', '0100001101', '0010000000'] 
+13
14 . '15 15:06
source share

, (, , ), , . , pool , . , , . , .

lpSolve :

 library(lpSolve) sample.lp <- function(pool, max_len) { pool <- sort(pool) pml <- max(nchar(pool)) runs <- c(rev(cumsum(2^(seq(pml-1)))), 0) banned.from <- rep(seq(pool), runs[nchar(pool)]) banned.to <- banned.from + unlist(lapply(runs[nchar(pool)], seq_len)) banned.constr <- matrix(0, nrow=length(banned.from), ncol=length(pool)) banned.constr[cbind(seq(banned.from), banned.from)] <- 1 banned.constr[cbind(seq(banned.to), banned.to)] <- 1 mod <- lp(direction="max", objective.in=runif(length(pool)), const.mat=rbind(banned.constr, rep(1, length(pool))), const.dir=c(rep("<=", length(banned.from)), "=="), const.rhs=c(rep(1, length(banned.from)), max_len), all.bin=TRUE) pool[which(mod$solution == 1)] } set.seed(144) pool <- unlist(lapply(seq_len(4), function(x) do.call(paste0, expand.grid(rep(list(c('0', '1')), x))))) sample.lp(pool, 4) # [1] "0011" "010" "1000" "1100" sample.lp(pool, 8) # [1] "0000" "0100" "0110" "1001" "1010" "1100" "1101" "1110" 

. , 20 510 2 :

 pool <- unlist(lapply(seq_len(8), function(x) do.call(paste0, expand.grid(rep(list(c('0', '1')), x))))) length(pool) # [1] 510 system.time(sample.lp(pool, 20)) # user system elapsed # 0.232 0.008 0.239 

, - , lpSolve , gurobi cplex ( , ).

+11
11 . '15 2:00
source share

, :

  • 1 ( pool )
  • pool
  • pool
  • ,
  • ,

( pool 30, max_len 4):

 get.template <- function(pool, max_len) { banned <- which(outer(paste0('^', pool), pool, Vectorize(grepl)), arr.ind=T) banned <- banned[banned[,1] != banned[,2],] banned <- paste(banned[,1], banned[,2]) vals <- matrix(seq(length(pool))) for (k in 2:max_len) { vals <- cbind(vals[rep(1:nrow(vals), each=length(pool)),], rep(1:length(pool), nrow(vals))) # Can't sample same value more than once vals <- vals[apply(vals, 1, function(x) length(unique(x)) == length(x)),] # Sort rows to ensure unique only vals <- t(apply(vals, 1, sort)) vals <- unique(vals) # Can't have banned pair combos <- combn(ncol(vals), 2) for (k in seq(ncol(combos))) { c1 <- combos[1,k] c2 <- combos[2,k] vals <- vals[!paste(vals[,c1], vals[,c2]) %in% banned,] } } return(matrix(pool[vals], nrow=nrow(vals))) } max_len <- 4 pool <- unlist(lapply(seq_len(max_len), function(x) do.call(paste0, expand.grid(rep(list(c('0', '1')), x))))) system.time(template <- get.template(pool, 4)) # user system elapsed # 4.549 0.050 4.614 

, , template ( ), , .

+9
11 . '15 1:40
source share

Introduction

, . , , , , , ++ ( R).

, , , , : ( 1) .

, . , , .

1

. node , , . , .

, , max_len-1 .

2

. , :

 Let x be the array index. x = 0 is the root of the entire tree left_child(x) = 2x + 1 right_child(x) = 2x + 2 parent(x) = floor((n-1)/2) 

node : breadth-first

( ), ( ) , ( , ). , , . .

1, , , boolean: true , false . node , 0 false. , - :

, , , , , . , , , true ( ). - . , n , , .

, , , , : O (2 ^ n). n , , - .

3

: .

. , , , . , , , .

, , , , . , , . , . , .

depth-first

, , , : .

4

, / , , . node . , .

- , OP. , - : , >= , - . , , , "0" ; "1" .

 #include <stdint.h> #include <algorithm> #include <cmath> #include <list> #include <deque> #include <ctime> #include <cstdlib> #include <iostream> /* * A range of values of the form (a, b), where a <= b, and is inclusive. * Ex (1,1) is the range from 1 to 1 (ie: just 1) */ class Range { private: friend bool operator< (const Range& lhs, const Range& rhs); friend std::ostream& operator<<(std::ostream& os, const Range& obj); int64_t m_start; int64_t m_end; public: Range(int64_t start, int64_t end) : m_start(start), m_end(end) {} int64_t getStart() const { return m_start; } int64_t getEnd() const { return m_end; } int64_t size() const { return m_end - m_start + 1; } bool canMerge(const Range& other) const { return !((other.m_start > m_end + 1) || (m_start > other.m_end + 1)); } int64_t merge(const Range& other) { int64_t change = 0; if (m_start > other.m_start) { change += m_start - other.m_start; m_start = other.m_start; } if (other.m_end > m_end) { change += other.m_end - m_end; m_end = other.m_end; } return change; } }; inline bool operator< (const Range& lhs, const Range& rhs){return lhs.m_start < rhs.m_start;} std::ostream& operator<<(std::ostream& os, const Range& obj) { os << '(' << obj.m_start << ',' << obj.m_end << ')'; return os; } /* * Stuct to allow returning of multiple values */ struct NodeInfo { int64_t subTreeSize; int64_t depth; std::list<int64_t> ancestors; std::string representation; }; /* * Collection of functions representing a complete binary tree * as an array created using pre-order depth-first search, * with 0 as the root. * Depth of the root is defined as 0. */ class Tree { private: int64_t m_depth; public: Tree(int64_t depth) : m_depth(depth) {} int64_t size() const { return (int64_t(1) << (m_depth+1))-1; } int64_t getDepthOf(int64_t node) const{ if (node == 0) { return 0; } int64_t searchDepth = m_depth; int64_t currentDepth = 1; while (true) { int64_t rightChild = int64_t(1) << searchDepth; if (node == 1 || node == rightChild) { break; } else if (node > rightChild) { node -= rightChild; } else { node -= 1; } currentDepth += 1; searchDepth -= 1; } return currentDepth; } int64_t getSubtreeSizeOf(int64_t node, int64_t nodeDepth = -1) const { if (node == 0) { return size(); } if (nodeDepth == -1) { nodeDepth = getDepthOf(node); } return (int64_t(1) << (m_depth + 1 - nodeDepth)) - 1; } int64_t getLeftChildOf(int64_t node, int64_t nodeDepth = -1) const { if (nodeDepth == -1) { nodeDepth = getDepthOf(node); } if (nodeDepth == m_depth) { return -1; } return node + 1; } int64_t getRightChildOf(int64_t node, int64_t nodeDepth = -1) const { if (nodeDepth == -1) { nodeDepth = getDepthOf(node); } if (nodeDepth == m_depth) { return -1; } return node + 1 + ((getSubtreeSizeOf(node, nodeDepth) - 1) / 2); } NodeInfo getNodeInfo(int64_t node) const { NodeInfo info; int64_t depth = 0; int64_t currentNode = 0; while (currentNode != node) { if (currentNode != 0) { info.ancestors.push_back(currentNode); } int64_t rightChild = getRightChildOf(currentNode, depth); if (rightChild == -1) { break; } else if (node >= rightChild) { info.representation += '1'; currentNode = rightChild; } else { info.representation += '0'; currentNode = getLeftChildOf(currentNode, depth); } depth++; } info.depth = depth; info.subTreeSize = getSubtreeSizeOf(node, depth); return info; } }; // random selection amongst remaining allowed nodes int64_t selectNode(const std::deque<Range>& eliminationList, int64_t poolSize, std::mt19937_64& randomGenerator) { std::uniform_int_distribution<> randomDistribution(1, poolSize); int64_t selection = randomDistribution(randomGenerator); for (auto const& range : eliminationList) { if (selection >= range.getStart()) { selection += range.size(); } else { break; } } return selection; } // determin how many nodes have been elimintated int64_t countEliminated(const std::deque<Range>& eliminationList) { int64_t count = 0; for (auto const& range : eliminationList) { count += range.size(); } return count; } // merge all the elimination ranges to listA, and return the number of new elimintations int64_t mergeEliminations(std::deque<Range>& listA, std::deque<Range>& listB) { if(listB.empty()) { return 0; } if(listA.empty()) { listA.swap(listB); return countEliminated(listA); } int64_t newEliminations = 0; int64_t x = 0; auto listA_iter = listA.begin(); auto listB_iter = listB.begin(); while (listB_iter != listB.end()) { if (listA_iter == listA.end()) { listA_iter = listA.insert(listA_iter, *listB_iter); x = listB_iter->size(); assert(x >= 0); newEliminations += x; ++listB_iter; } else if (listA_iter->canMerge(*listB_iter)) { x = listA_iter->merge(*listB_iter); assert(x >= 0); newEliminations += x; ++listB_iter; } else if (*listB_iter < *listA_iter) { listA_iter = listA.insert(listA_iter, *listB_iter) + 1; x = listB_iter->size(); assert(x >= 0); newEliminations += x; ++listB_iter; } else if ((listA_iter+1) != listA.end() && listA_iter->canMerge(*(listA_iter+1))) { listA_iter->merge(*(listA_iter+1)); listA_iter = listA.erase(listA_iter+1); } else { ++listA_iter; } } while (listA_iter != listA.end()) { if ((listA_iter+1) != listA.end() && listA_iter->canMerge(*(listA_iter+1))) { listA_iter->merge(*(listA_iter+1)); listA_iter = listA.erase(listA_iter+1); } else { ++listA_iter; } } return newEliminations; } int main (int argc, char** argv) { std::random_device rd; std::mt19937_64 randomGenerator(rd()); int64_t max_len = std::stoll(argv[1]); int64_t num_samples = std::stoll(argv[2]); int64_t samplesRemaining = num_samples; Tree tree(max_len); int64_t poolSize = tree.size() - 1; std::deque<Range> eliminationList; std::deque<Range> eliminated; std::list<std::string> foundList; while (samplesRemaining > 0 && poolSize > 0) { // find a valid node int64_t selectedNode = selectNode(eliminationList, poolSize, randomGenerator); NodeInfo info = tree.getNodeInfo(selectedNode); foundList.push_back(info.representation); samplesRemaining--; // determine which nodes this choice eliminates eliminated.clear(); for( auto const& ancestor : info.ancestors) { Range r(ancestor, ancestor); if(eliminated.empty() || !eliminated.back().canMerge(r)) { eliminated.push_back(r); } else { eliminated.back().merge(r); } } Range r(selectedNode, selectedNode + info.subTreeSize - 1); if(eliminated.empty() || !eliminated.back().canMerge(r)) { eliminated.push_back(r); } else { eliminated.back().merge(r); } // add the eliminated nodes to the existing list poolSize -= mergeEliminations(eliminationList, eliminated); } // Print some stats // std::cout << "tree: " << tree.size() << " samplesRemaining: " // << samplesRemaining << " poolSize: " // << poolSize << " samples: " << foundList.size() // << " eliminated: " // << countEliminated(eliminationList) << std::endl; // Print list of binary strings // std::cout << "list:"; // for (auto const& s : foundList) { // std::cout << " " << s; // } // std::cout << std::endl; } 

max_len. n , , , , , .

, , , "0" "1". - , .

+9
22 . '15 16:01
source share



All Articles