Reordering data segments in a file

I am trying to find an algorithm, ideally with an example in C, C++, C#, Java, or indeed any language, to help solve the data-reordering problem I am facing.

The goal is to take a number of ranges in a file and rearrange them into a new layout, essentially moving parts of the data around without destroying its integrity. I would prefer an approach that works in place, using a single buffer to swap data or move it directly from one location to another. The reorganization process is allowed to split ranges into parts, as long as the parts add up to the same total length and data integrity is preserved when the process completes.

As an example, given a set of values:

Length    SrcStart  SrcEnd    DstStart  DstEnd
9178      274054    283231    0         9177
274051    0         274050    9178      283228
582929    283229    866157    283229    866157
399208    874397    1273604   866158    1265365
8239      14675709  14683947  1265366   1273604
986980    1273605   2260584   1273605   2260584
602862    2811144   3414005   2260585   2863446
138712    4092072   4230783   2863447   3002158
116210    3414007   3530216   3002159   3118368
550559    2260585   2811143   3118369   3668927
561856    3530217   4092072   3668928   4230783
24319165  4230784   28549948  4230784   28549948
578539    30246149  30824687  28549949  29128487
491856    28549949  29041804  29128488  29620343
593580    29639113  30232692  29620344  30213923
597308    29041805  29639112  30213924  30811231
13456     30232693  30246148  30811232  30824687
633513    31407949  32041461  30824688  31458200
583261    30824688  31407948  31458201  32041461
40117358  32041462  72158819  32041462  72158819

All content in each range SrcStart → SrcEnd must be transferred to the corresponding range DstStart → DstEnd. Note that in many cases, moving data from a source to a destination overwrites content at the destination; after that, you can no longer copy from that location, because the original data there has been destroyed.

The goal is to move each data segment from SrcStart to DstStart, with the length given in the first column. Each "End" value is simply the corresponding start plus the length minus one (i.e., the offset of the last byte).
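A minimal sketch of the data model described above, with an overlap test showing why ordering matters. The names (`Move`, `src_overlaps_dst`) are illustrative, not from the question; the two sample rows are the first two entries of the table.

```python
from collections import namedtuple

# Each table row as a (length, src, dst) move; the End columns are derived.
Move = namedtuple("Move", "length src dst")

moves = [
    Move(9178, 274054, 0),     # first row of the table
    Move(274051, 0, 9178),     # second row of the table
]

def src_overlaps_dst(a: Move, b: Move) -> bool:
    """True if a's source range intersects b's destination range,
    i.e. executing b first would destroy data that a still needs."""
    return a.src < b.dst + b.length and b.dst < a.src + a.length

# Executing the second move first would overwrite bytes 9178..283228,
# which includes the first move's source range 274054..283231.
print(src_overlaps_dst(moves[0], moves[1]))   # True
```

This is exactly the hazard the question describes: the second row's destination range clobbers the first row's source range.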

I have done a lot of research and have experimented with swapping values and splitting ranges that intersect other ranges, as well as buffering overlapping regions through a swap area, but none of these approaches seem to cope. So, coming back to my opening statement, I was hoping there might be an existing algorithm, or some source code I could learn from, and that the collective knowledge of the community would be the way to go.

Thanks!

3 answers

You can use the approach that disk defragmenters use.

  • Copy the data you want to transfer to a free area.
  • Update any indexes referencing this data to point at the new location, so that the copy is used from then on.
  • You may also need to track when blocks become "unused", if the system has such a concept.
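The defragmenter-style approach above might look like this in miniature, assuming (hypothetically) that the file format keeps an index of `(offset, length)` entries that can be rewritten. All names here are illustrative.

```python
def relocate(data: bytearray, index: dict, name: str, free_offset: int) -> None:
    """Copy one block into free space, then point the index at the copy."""
    offset, length = index[name]
    data[free_offset:free_offset + length] = data[offset:offset + length]
    index[name] = (free_offset, length)
    # The old region at `offset` is now unused and can be reclaimed later.

data = bytearray(b"AAABBB....")       # two 3-byte blocks, then free space
index = {"a": (0, 3), "b": (3, 3)}

relocate(data, index, "a", 6)          # copy "AAA" into the free tail
print(bytes(data))                     # b'AAABBBAAA.'
print(index["a"])                      # (6, 3)
```

The key property is that the original bytes are never destroyed before the index points at the copy, so a crash mid-operation leaves the file readable.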

However, if the indexes are byte offsets, this means the entire file is only about 80 MB. A file that small can be copied in its entirety very quickly (in under two seconds). Perhaps your real example is much larger. How big is the file as a whole?


You ask the question as if this were an opaque binary file in which, for some reason, you want to shuffle blocks around. That seems very unlikely to be the whole story. Does the file actually have its own structure? Could you use that structure to help with the bookkeeping?

  • Does the file have the concept of β€œused” and β€œunused” areas?
  • Does the file have an internal block header structure?
  • Does the file have an associated index, or anything else that needs to be kept in sync? (If not, where did the list of blocks to move come from?)
  • Can the blocks being moved overlap one another? Note that if they can, the order of operations becomes significant.
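The last point is easy to demonstrate: when ranges overlap, applying the same two moves in different orders produces different results. A tiny illustrative example (the helper name is mine):

```python
def apply_move(buf: bytearray, length: int, src: int, dst: int) -> None:
    """Naively copy `length` bytes from src to dst, overwriting dst."""
    buf[dst:dst + length] = buf[src:src + length]

a = bytearray(b"abcdef")
apply_move(a, 3, 0, 2)   # -> b'ababcf'
apply_move(a, 2, 4, 0)   # reads the region the first move overwrote
print(bytes(a))          # b'cfabcf'

b = bytearray(b"abcdef")
apply_move(b, 2, 4, 0)   # -> b'efcdef'
apply_move(b, 3, 0, 2)
print(bytes(b))          # b'efefcf' -- a different result
```

Same moves, different order, different file contents: any correct reordering scheme has to account for this.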

That said, the approach @Peter Lawrey recommends is good: before you overwrite a block, first copy it to another location in the file, and update any indexes that refer to the overlapped region.

Overall, it looks to me as if you have tried to solve a complex problem by breaking it into two steps, one simple and the other... even harder. What was the original problem?

(Obligatory aside: on Windows you could use the transactional I/O APIs.)


I think the following algorithm can handle this, using no more than twice the largest block size of memory for caching data. In addition to the data cache, you need some bookkeeping: a FIFO and your move table. It goes something like this:

  1. If both the FIFO and the move table are empty, you are done.
  2. If the FIFO is empty, move the top entry of the move table into the FIFO, and read that entry's data into the data cache.
  3. Check whether the move table contains any blocks whose source overlaps the destination area of the first entry in the FIFO.
  4. If there is such a block, read its data into the cache, move its entry into the FIFO, and go to 3.
  5. If there is no such block, write the first FIFO entry's data from the cache to its destination, remove that entry from the FIFO, and go to 1.
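The steps above can be sketched on an in-memory `bytearray` (a real implementation would use `seek`/`read`/`write` on the file instead). This is my reading of the algorithm, with illustrative names; moves are `(length, src, dst)` tuples.

```python
from collections import deque

def overlaps(dst: int, length: int, src2: int, length2: int) -> bool:
    """Does the source range [src2, src2+length2) intersect the
    destination range [dst, dst+length)?"""
    return dst < src2 + length2 and src2 < dst + length

def reorder(data: bytearray, table: list) -> None:
    """Apply a move table of (length, src, dst) entries in place."""
    table = list(table)       # remaining moves
    fifo = deque()            # entries waiting to be written out
    cache = {}                # entry -> bytes saved from its source
    while table or fifo:      # step 1: done when both are empty
        if not fifo:          # step 2: seed the FIFO and the cache
            entry = table.pop(0)
            length, src, dst = entry
            cache[entry] = bytes(data[src:src + length])
            fifo.append(entry)
        length, src, dst = fifo[0]      # step 3: look for a blocker
        blocker = next((e for e in table
                        if overlaps(dst, length, e[1], e[0])), None)
        if blocker is not None:         # step 4: rescue the blocker first
            table.remove(blocker)
            l2, s2, d2 = blocker
            cache[blocker] = bytes(data[s2:s2 + l2])
            fifo.append(blocker)
        else:                           # step 5: safe to flush this entry
            entry = fifo.popleft()
            length, src, dst = entry
            data[dst:dst + length] = cache.pop(entry)

data = bytearray(b"abcd")
reorder(data, [(2, 0, 2), (2, 2, 0)])   # swap the two halves
print(bytes(data))                       # b'cdab'
```

The swap example is the worst case for a naive copier: each move's destination is the other move's source, so both blocks end up cached before either is written.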

The idea is to find the blocks that are in the way, cache them to free up their space, and get data out of the cache as quickly as possible. You may want to add sanity checks for entries whose source and destination addresses are identical. You could also add checks that the move table itself makes sense (no two blocks moving to the same place, and so on).


EDIT: I may have been optimistic in stating a bound of twice the largest block size; I think the cache can grow beyond that. I believe three times the largest block is a simple (but loose) upper bound.

Since you have some fairly large blocks in the move table, you can split them to reduce cache usage. Say you want to use at most 1 GB of cache: split every block larger than 1/3 GB into several records of at most 1/3 GB each before running the algorithm. Alternatively, you could make the algorithm work in fixed chunk sizes (instead of reading full blocks into the cache, you read only the relevant parts and keep the modified records in the original table), but I think that would be harder to manage and implement.
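A hypothetical helper for the splitting idea: every `(length, src, dst)` move larger than `max_len` is broken into consecutive pieces before the algorithm runs. `split_moves` and `max_len` are my names, not from the answer.

```python
def split_moves(table: list, max_len: int) -> list:
    """Split every (length, src, dst) move into pieces of at most max_len
    bytes, keeping source and destination offsets aligned piece by piece."""
    out = []
    for length, src, dst in table:
        off = 0
        while off < length:
            piece = min(max_len, length - off)
            out.append((piece, src + off, dst + off))
            off += piece
    return out

# The first table row (9178 bytes) split into 4 KiB pieces:
print(split_moves([(9178, 274054, 0)], 4096))
# [(4096, 274054, 0), (4096, 278150, 4096), (986, 282246, 8192)]
```

Because the FIFO algorithm already caches any block whose source is about to be overwritten, the smaller pieces remain safe to process; the split only caps how much data any single cache entry can hold.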


Source: https://habr.com/ru/post/1385583/
