Data structure for large ranges of consecutive integers?

Suppose you have a large number of consecutive integers in memory, each of which belongs to the same category. Two operations must be O (log n): moving a range from one category to another and finding the number of categories for a given range.

I am sure that the second operation is solved trivially, given the correct implementation for the first operation.

Each integer starts in a category, so I started with a set of balanced BSTs. Moving a subtree from one BST to another (for example, moving a range to another category) has a runtime equivalent to combining two BSTs, that is, O (n1 * n2) [ 1 ].

This is too slow (in python, and C is not an option), and I could not figure out how to take advantage of my own data structure to create an efficient BST merge operation.

Now I watch AVL, red-black and interval trees, heaps and somersaults. The comparison of their properties is staggering. What structure should I use?

Change (clarification of the problem) . I am flexible about how I store these values ​​and create my data structures. I am inflexible about how I get my input, which comes from another application, and looks like this: CATEGORY(cat3, I, J) . My current solution is creating a tree with node for every integer in the range. It's too slow for the size of my dataset, so I'm happy to reverse engineer if you are offered a better method.

Any given query can move any possible range of integers into any category. In other words, the ranges overlap in the sense of CATEGORY(cat1, 1, 10) and then CATEGORY(cat3, 5, 15) , but do not overlap in the sense that each integer will be in the same category at any given time.

+6
source share
5 answers

As far as I understood the question, you have a range of [A, B] and form requests -

  • Assign a specific category range
  Eg
 R1 R2 C1
 R3 R4 C2
  • Request a range for the total number of items in specific categories. For instance. find the number of categories in R1 R4

A simple implementation using dictionaries as above will not work, as I will describe in this example -

suppose we have a range of [1000, 5000]

and we make the assignment as follows:

  1 2 C1
 2 3 C2
 3 4 C3
 ......
 4999 5000 C4999

Now we do the following assignment

  1 5000 C5555

This will update / change / delete previously assigned range ranges O (N), where N is the maximum range size (B - A)

D ['category'] = set (of_all_the_ranges_you_have_in_category)

In this case, for the last assignment (1 5000 C5555), you will need to delete the previous ranges from the corresponding sets in categories C1, C2 ... C4999

OR

1: {"stop": 5, "category": "C1"}, 6: {"stop": 19, "category": "C23"},

Here, for the last assignment (1,500 C5555), a category update will be required for each initial value (1,2,3,4 ..., 4999)

The best option that will make updating ranges in O (lg n) would be segment trees ( http://en.wikipedia.org/wiki/Segment_tree )

In the above example, the segment tree will look something like this:

  1000:5000 | --------------------- | | 1000:3000 3001:5000 | | ---------------- -------------------- | | | | 1000:2000 2001:3000 3001:4000 4001:5000 

.................................................. ............... ................................... ............................ etc.

Leaf nodes will have ranges (1: 2, 2: 3, ...)

You can assign a category value to each node and set the interval by going through the tree, dividing the interval accordingly (for example, from 2500 to 4500 divided by 2500: 3000 and 3001: 4500 and move in both directions until the nodes coincide ranges).

Now the interesting thing is to update child nodes when you need them. For example, do not start updating children immediately while completing activities such as type 1 5000 C5555. This thing is called lazy propaganda, and you can learn more about it ( http://www.spoj.pl/forum/viewtopic.php?f=27&t=8296 ).

Now for the request part. If the number of categories is very small, you can maintain a frequency table in each node and update the ranges when necessary, and if necessary use lazy propagation, you will need to cross the entire tree from the sheet to node, counting and complexity will become O (n).

I think a better query solution may exist. It does not occur to me.

UPDATE Let's take a small example.

Range [1.8]

Allowed Categories {C1, C2}

  1:8 1:4 5:8 1:2 3:4 5:6 7:8 1:1 2:2 3:3 4:4 5:5 6:6 7:7 8:8 

Each node will have 3 fields [category, category_counts [], children_update_required = false]

1 5 C1

The request will be split, and category 1: 4 nodes will be set to C1, and child_update_required will be set to true, its children will no longer be updated (remember updating only when necessary or lazy distribution). In addition, the node 5: 5 category will be set to C1

3 4 C2

The request will propagate along the tree in the 3: 4 direction (and in the process of reaching 3: 4, 1: 2 and 3: 4 the categories will be updated to C1, 1: 4 children_update_required will be set to false, 1: 2 and 3: 4 children_update_required will be set to true) and will now update the category from 3: 4 to C2 according to the current requirement. It will then be established that child_update is required from 3: 4 to be true for the future update of its children (which is already set in this case .. no harm).

+2
source

You may have a simple python dictionary of the following form

 1 : { "stop" : 5, "category" : "C1"}, 6 : { "stop" : 19, "category" : "C23"}, etc 

The keys here are the beginning of your range, and the values ​​contain the end of the range and the category this range belongs to.

Since dictionaries have a constant time for accessing elements, you can write code that easily and efficiently moves a range to another category: in the worst case, you will have to somehow change your dictionary if your range breaks the previous ranges into two or more. For example, if you want to assign the range (4, 8) to another category, you will get:

 1 : { "stop" : 3, "category" : "C1"}, 4 : { "stop" : 8, "category" : "new category"}, 9 : { "stop" : 19, "category" : "C23"}, etc 

Finding the number of categories is trivial, just collect all the ranges you need at a constant time and count the categories.

EDIT: To successfully find the lowest (highest) key to start performing calculations / changes, you also need a simple python list with all sorted keys and a bisect module. This will help to find the index in the list in order to “put” the beginning of the range in O (logn), then everything can be done in constant time, with the exception of O (n) time needed to insert a new key into the list with bisect.insort_left.

0
source

You can easily build a tree structure on arrays of consecutive integers, which should help with your constant factor. Renumber the sequence first to start from scratch, and work out which is the least power of two, which is greater than the range of the sequence.

Now consider the following tree, formed from integers 0-7, which I can hold in the form of four arrays, each array will go horizontally.

  (all) 0-3 4-7 0-1 2-3 4-5 6-7 0 1 2 3 4 5 6 7 

Given the number and level, I can find the offset in the array for that level by simply moving the number to the right of the sum depending on the level.

In each element, I can put a marker that either says "mixed" or gives a category for each element in or under this node of the tree. I can determine which category the node is in, following the path down from the root of the tree to the leaf, stopping as soon as I see a node that does not say “mixed”. I can change the category for the interval of numbers in time lg n, because I have no more than two nodes at each level to represent the category: if I had three, two of them would have the same parent, and I could would combine them. You may have to play around a bit to get adjacent ranges, but I think it works in lg n time.

0
source

Assumptions:

  • Any integer can belong to exactly one category, so the ranges cannot overlap.
  • All integers in the input range are in the same category.
  • Sometimes you need to split a range in order to move a subrange to another category.

Represents the ranges (start, end, category) tuples. Ranges do not intersect, so you can build their tree. This is much more economical than a single integer tree. To order ranges (i.e. nodes), you can simply use the starting value, since it is unique and cannot belong to another range.

If you need to move the range [a, b] to another category, you must:

Scan the tree and update each node / range that is fully included in the range [a, b] . This is a simple passage in depth. During crawl

  • If current_node.start < a or current_node.start > b , stop the search.
  • If current_node.start >= a and current_node.end > b , you need to split current_node into two; [current_node.start, b] will belong to the new category, the rest will have their original category.
  • If current_node.start < a and current_node.end <= b , you split it in the opposite direction.

The tree search is logarithmic, and node is O (1).

If your tree is ever too fragmented, you might consider joining nodes with adjacent ranges. It will always be a parent and child, the result of which is either insertion or separation; checks and connections always seem O (1).

0
source

We can represent the current states as something like:

 0:cat1 200:cat2 500: cat0 700:cat6 800:cat1 

This means that 0-200 is cat1, 200-500 is cat2, etc. We store the values ​​in the binary search tree given by the starting number. Each node will also contain dictionary display categories for counting for all children of this node.

These dictionaries should make it easy to get counts in O (log) time. We just need to add the correct numbers on the way to start and stop the sequence in question.

What should we do when we need to set the XY sequence to category Z?

  • Define your current category YO (logn)
  • Remove all values ​​between X -YO (k), but since the cost of inserting these nodes is more expensive, we can call it O (1) amortized.
  • Insert new node in XO (log n)
  • Update dictionaries with a list of categories. We only need to update the parents of the O (log n) of the affected sections, so this is O (log n)

In general, this is O (log n) time.

0
source

Source: https://habr.com/ru/post/898570/


All Articles