Introduction
This is a numerical change to the string algorithm implemented in another answer. It is faster and does not require creating or sorting a pool.
Algorithm design
We can use integers to represent your binary strings, which greatly simplifies the task of creating a pool and sequentially eliminating values. For example, with max_len==3 we can take the number 1-- (where - represents the indentation) to mean 4 in decimal form. In addition, we can determine that the numbers that need to be eliminated, if we choose this number, are numbers between 4 and 4 + 2 ^ x - 1 . Here x is the number of filling elements (in this case 2), therefore the numbers to be excluded are between 4 and 4 + 2 ^ 2 - 1 (or between 4 and 7 , expressed as 100 , 110 and 111 ).
To fit your problem exactly, we need a little wrinkle, as you process numbers that are potentially the same in binary format as separate for some parts of your algorithm. For example, 100 , 10- and 1-- are all the same number, but in your scheme you need to handle it differently. In the world max_len==3 , we have 8 possible numbers, but 14 representations are possible:
0 - 000: 0--, 00- 1 - 001: 2 - 010: 01- 3 - 011: 4 - 100: 1--, 10- 5 - 101: 6 - 110: 11- 7 - 111:
So, 0 and 4 have three possible encodings, 2 and 6 have two, and the rest are only one. We need to create an integer pool that represents a higher probability of choice for numbers with multiple representations, as well as a mechanism to track the number of spaces that a number includes. We can do this by adding a few bits at the end of the number to indicate the scales we want. So, our numbers become (we use two bits here):
jbaum | int | bin | bin.enc | int.enc 0-- | 0 | 000 | 00000 | 0 00- | 0 | 000 | 00001 | 1 000 | 0 | 000 | 00010 | 2 001 | 1 | 001 | 00100 | 3 01- | 2 | 010 | 01000 | 4 010 | 2 | 010 | 01001 | 5 011 | 3 | 011 | 01101 | 6 1-- | 4 | 100 | 10000 | 7 10- | 4 | 100 | 10001 | 8 100 | 4 | 100 | 10010 | 9 101 | 5 | 101 | 10100 | 10 11- | 6 | 110 | 11000 | 11 110 | 6 | 110 | 11001 | 12 111 | 7 | 111 | 11100 | 13
Some useful features:
enc.bits represents the number of bits we need to encode (two in this case)int.enc %% enc.bits indicates how many cells are explicitly specifiedint.enc %/% enc.bits returns intint * 2 ^ enc.bits + explicitly.specified returns int.enc
Please note that explicitly.specified here is between 0 and max_len - 1 in our implementation, since at least one digit is always specified. Now we have an encoding that fully reflects your data structure using only integers. We can choose from integers and reproduce the desired results with the correct weights, etc. One limitation of this approach is that we use 32-bit integers in R, and we need to reserve some bits for encoding, so we are limited to max_len==25 or so pools. You could go more if you used double precision integer numbers, but we didn’t.
Avoid duplicate messages
There are two rough ways to ensure that we do not select the same value twice.
- Keep track of which values remain available for selection, and randomly choose from them
- Sample randomly from all possible values, and then check if this value was previously selected, and if it has, again the sample
While the first option seems the cleanest, it is actually a very expensive computing one. This requires either a vector scan of all possible values for each choice in order to pre-disqualify the selected values, or create a reduction vector containing non-disqualified values. The shrink option is more efficient than vector scanning if the vector is compressed by reference through C code, but even then it requires repeated translations of potentially large parts of the vector, and this requires C.
Here we use method # 2. This allows us to randomly shuffle all possible values once, and then select each value sequentially, check that it has not been disqualified, and if it is, select another, etc. This works because it is trivial to check whether the value was chosen as a result of our encoding of values; we can determine the location of a value in a sorted table based on only the value . Thus, we record the status of each value in a sorted table and can either update or search for this state through direct access to the index (no verification is required).
Examples
The implementation of this algorithm in the R database is available as an entity . This particular implementation only pulls complete draws. Here is an example of 10 drawings of 8 elements from the pool max_len==4 :
# each column represents a draw from a `max_len==4` pool set.seed(6); replicate(10, sample0110b(4, 8)) [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [1,] "1000" "1" "0011" "0010" "100" "0011" "0" "011" "0100" "1011" [2,] "111" "0000" "1101" "0000" "0110" "0100" "1000" "00" "0101" "1001" [3,] "0011" "0110" "1001" "0100" "0000" "0101" "1101" "1111" "10" "1100" [4,] "0100" "0010" "0000" "0101" "1101" "101" "1011" "1101" "0110" "1101" [5,] "101" "0100" "1100" "1100" "0101" "1001" "1001" "1000" "1111" "1111" [6,] "110" "0111" "1011" "111" "1011" "110" "1111" "0100" "0011" "000" [7,] "0101" "0101" "111" "011" "1010" "1000" "1100" "101" "0001" "0101" [8,] "011" "0001" "01" "1010" "0011" "1110" "1110" "1001" "110" "1000"
We also initially had two implementations that relied on method # 1 to avoid duplication, one in the R base and one in C, but even version C is not as fast as the new base version of R when n big. This function implements the ability to draw incomplete drawings, so we give them here for reference:
Comparative tests
Here is a set of tests comparing several functions that are displayed in this Q / A. Time in milliseconds. The brodie.b version is the version described in this answer. brodie is the original implementation, brodie.C is the original with some C. All this ensures compliance with the requirements for complete samples. brodie.str is the string version in another answer.
size n jbaum josilber frank tensibai brodie.b brodie brodie.C brodie.str 1 4 10 11 1 3 1 1 1 1 0 2 4 50 - - - 1 - - - 1 3 4 100 - - - 1 - - - 0 4 4 256 - - - 1 - - - 1 5 4 1000 - - - 1 - - - 1 6 8 10 1 290 6 3 2 2 1 1 7 8 50 388 - 8 8 3 4 3 4 8 8 100 2,506 - 13 18 6 7 5 5 9 8 256 - - 22 27 13 14 12 6 10 8 1000 - - - 27 - - - 7 11 16 10 - - 615 688 31 61 19 424 12 16 50 - - 2,123 2,497 28 276 19 1,764 13 16 100 - - 4,202 4,807 30 451 23 3,166 14 16 256 - - 11,822 11,942 40 1,077 43 8,717 15 16 1000 - - 38,132 44,591 83 3,345 130 27,768
It scales pretty well for large pools
system.time(sample0110b(18, 100000)) user system elapsed 8.441 0.079 8.527
Notes:
- frank and brodie (minus brodie.str) do not require pooling, which will affect the comparison (see below).
- Josilber is the LP version
- jbaum - OP example
- tensibai is slightly modified to exit instead of failing if the pool is empty.
- not configured to run python so cannot compare / account to buffer
- are either impracticable options, or too slow in time reasonably.
Timing does not include drawing pools ( 0.8 , 2.5 , 401 milliseconds respectively for sizes 4 , 8 and 16 )), which is necessary for jbaum , josilber and brodie.str starts or sorts them ( 0.1 , 2.7 , 3700 milliseconds for sizes 4 , 8 and 16 ), which is necessary for brodie.str in addition to the draw. Whether you want to enable these or not depends on how many times you run the function for a particular pool. In addition, there are almost certainly better ways to generate / sort pools.
This is the average time of three runs with a microbenchmark . The code is available as an entity , but note that you will need to download sample0110b , sample0110 , sample01101 and sample01 .