Data Deduplication System?

I want to integrate data deduplication into the software that I write for VMware image backup. I could not find anything suitable for what seems necessary to me. There are many complete solutions that include one form of deduplication or another: storage or backup solutions that use public or private clouds, specialized file systems, storage networks or appliances, and so on. However, I need to develop my own solution and integrate dedup into it. My software will be written in C#, and I would like to be able to call an API to tell the dedup engine what to do.

The type of deduplication I am talking about is not deduplicating one image against another image — that is the usual approach for creating incremental or differential backups between two “versions” of something, what the Wikipedia entry on data deduplication calls “client backup deduplication”. I already have a solution that does this, and I want to take it a step further.

What I have in mind is an approach that would let me deduplicate chunks of data at the global level (i.e., some form of global deduplication). Being global, I assume there would be some kind of central lookup table (for example, a hash index) that tells the deduper that a copy of the data being checked is already held and does not need to be saved again. The granularity could be file-level (Single Instance Storage, or SIS) or block-level deduplication. The latter should be more efficient (which matters more for our purposes than, say, processing overhead), and would be my preferred option, but I could make SIS work too if I had to.
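To make the idea concrete, here is a minimal sketch of the central-hash-index approach described above, in C#. It assumes fixed-size blocks and SHA-256 content hashes; the `DedupStore` class and its `Put`/`Get` methods are hypothetical names, not any real library's API:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Security.Cryptography;
using System.Text;

// Minimal sketch (hypothetical API): a global block-level dedup index.
// Each block is keyed by its SHA-256 content hash; the block bytes are
// stored only the first time that hash is seen.
class DedupStore
{
    private readonly Dictionary<string, byte[]> _blocks =
        new Dictionary<string, byte[]>();

    public int StoredBlocks => _blocks.Count;

    // Returns the hash key for the block; stores the bytes only if new.
    public string Put(byte[] block)
    {
        string key;
        using (var sha = SHA256.Create())
            key = Convert.ToBase64String(sha.ComputeHash(block));

        if (!_blocks.ContainsKey(key))
            _blocks[key] = (byte[])block.Clone();

        return key;
    }

    public byte[] Get(string key) => _blocks[key];
}
```

A backup would then be represented as an ordered list of hash keys rather than raw data, so two images containing the same block pay for it only once. (A production version would persist the index, handle concurrency, and consider hash-collision policy, none of which is shown here.)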

Now, I have read a lot about other people's software that deduplicates, as I said above. I will not give examples here because I am not trying to imitate anyone else specifically. Rather, I could not find a solution aimed at programmers and want to know if anything like that exists. The alternative would be to roll my own, but that would be a pretty big task, to say the least.

Thanks.

1 answer

Global deduplication, as you describe it, is usually handled outside of most typical VM backup programs, because CBT (Changed Block Tracking) already tells you which blocks changed in the VM, so you do not need to take a full backup every time. Global dedup also tends to be resource-intensive, so most people just buy a Data Domain appliance and rely on hardware (SSD) and software (custom file systems, variable-length deduplication) that are dedicated, configured, and optimized for deduplication. Of course, the backup program you create can use both CBT and what Data Domain offers, in the same way commercially available backup software already does — Veeam, for example. There is further reading on Data Domain's deduplication strategy (variable-length segments).
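To illustrate why variable-length segments dedupe better than fixed blocks: with content-defined chunking, cut points are chosen where a hash of the recent bytes hits a target value, so inserting data early in a stream does not shift every later chunk boundary. Below is an illustrative C# sketch under stated simplifications — a real implementation would use a true rolling hash (e.g., Rabin fingerprints) over a sliding window, whereas this demo uses a plain running hash, and the mask and min/max bounds are arbitrary demo values:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Illustrative sketch of content-defined (variable-length) chunking.
// Boundaries fall where the low bits of a running hash are zero,
// bounded by MinChunk/MaxChunk. Demo simplification: a real system
// would use a proper rolling hash over a fixed window.
static class Chunker
{
    const uint Mask = 0x3F;            // ~1-in-64 cut probability
    const int MinChunk = 32;           // demo lower bound (bytes)
    const int MaxChunk = 1024;         // demo upper bound (bytes)

    public static List<byte[]> Split(byte[] data)
    {
        var chunks = new List<byte[]>();
        int start = 0;
        uint hash = 0;
        for (int i = 0; i < data.Length; i++)
        {
            hash = hash * 31 + data[i]; // simple running hash (demo only)
            int len = i - start + 1;
            bool cut = len >= MinChunk && (hash & Mask) == 0;
            if (cut || len >= MaxChunk)
            {
                chunks.Add(SubArray(data, start, len));
                start = i + 1;
                hash = 0;
            }
        }
        if (start < data.Length)
            chunks.Add(SubArray(data, start, data.Length - start));
        return chunks;
    }

    static byte[] SubArray(byte[] d, int off, int len)
    {
        var r = new byte[len];
        Array.Copy(d, off, r, 0, len);
        return r;
    }
}
```

Each resulting chunk would then be hashed and looked up in the global index exactly as with fixed blocks; the only difference is where the chunk boundaries fall.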

Well, I had to remove my two URLs in order to post this answer, because apparently I am short on rep... w/e


Source: https://habr.com/ru/post/1381512/
