An efficient algorithm for combining small overlapping blocks into larger adjacent blocks?

I ran into a rather interesting problem. I have a (fairly large) number of blocks. A block is just something that starts with an offset and has length and color. Displacement and length are limited - the space in which these blocks are located is <0-N>, where N is from hundreds of thousands to several millions. An invalid block is any block with an offset greater than N or the sum of an offset and a length greater than N. There are about 16 different colors that a block can have (only one of them).

There may be thousands of blocks, and there are always such situations:

block_X: off: 100, len: 50, color: blue
block_Y: off: 148, len: 50, color: blue
block_Z: off: 200, len: 30, color: red

As you can see, blocks X and Y can be connected to one larger block, resulting in:

block_XY: off: 100, len 98, color: blue
block_Z: off 200, len 30, color: red

I would like to create an algorithm that will go through all the blocks and bind those that overlap and have the same color. In fact, if the gap between the blocks is small enough (a constant that I could choose, say about 16 or so, but could be any number ...), then I would like to connect these blocks anyway. Note that in the above example, there are only two blocks that are connected. In fact, the sequence of blocks that can be connected is much longer in most cases.

There is also an interesting twist, consider the following:

block_A: off: 100, len: 200, color: blue
block_B: off: 200, len: 200, color: blue
block_C: off: 300, len: 150, color: red
block_D: off: 400, len: 200, color: blue
block_E: off: 500, len: 200, color: blue

As you can see, we have a beautiful sequence of blue blocks that can be combined into one big blue block. However, there is a small red block in the middle. This should not deceive the algorithm, the correct result is:

block_ABDE: off: 100, len: 600, color: blue
block_C: off: 300, len: 150, color: red

The data is stored in std :: multimap, where the key is the offset, and the value is a structure containing the length and color of the block, something like: multimap<uint32_t,pair<uint32_t, uint8_t> > . Note that there may be blocks starting at the same offset. However, it is guaranteed that if this happens, blocks starting at the same offset have different colors and different lengths.

Can anyone come up with how to solve this problem effectively? The purpose of the algorithm is to reduce the number of blocks that we have, but it is NOT required to reduce it to the smallest possible number.

+4
source share
4 answers

I would suggest the following:

  • Create a separate list for each of the colors.
  • For each specific color, calculate the beginning (offset) and end (offset + length) of all blocks
  • Sort each color list by start value
  • Now go through each of the lists, combining the elements: if the beginning of the next element is less than or equal to the previous end of the object (plus a valid space), delete the next element and continue the previous one.
+7
source

Separate the list of blocks in the list for each color and sort all of these lists using the initial block offset. (In fact, you probably want to do an insertion sort, as you filter based on color or sort by offset, and then perform stable color sorting and work with sections of your array.)

After you have lists for each color, you will iterate over each of them: starting from the first block, check if the offset of the next block to the end of your current block is enough, and if so, merge them. Then go to the bottom of the list.

Now you have combined all the blocks for each color so that you can simply combine your lists to get the final list of all blocks of all colors. The most expensive step (asymptotically) will be sorting, so this is probably quite efficient. You may be able to come up with something that is faster on average using more complex data structures than arrays and linked lists, but you should not evaluate the effectiveness of this simpler approach.

Please note that since the rules for combining two blocks depend only on the endpoint of one block and the starting point of the other and do not depend on the size of the blocks, any solution that can identify any potential merges probably identifies all the merges, and it does not matter in which merges are performed (i.e., a union is an associative operation).

+6
source

You do not need to break the blocks into one list for each color. You can sort the entire list by the starting address of the block, and then process it in one pass, supporting a separate pair {start, length} for each color.

+4
source

This is quite simple, since for each color it is exactly the same as the numbered range, for example

blue: 100-299, 200-400, 500-670, 671-702

with which you merge:

blue: 100-400, 500-702

Just sort by the first value in the range and look at the final value (actually one after the last element). If this is greater than or equal to the beginning of the next, then they merge.

In the above 300> 200, so that they merge, 401 <500, so that they do not do this, 671 == 671, so they merge.

Do the same for each color.

0
source

Source: https://habr.com/ru/post/1337284/


All Articles