I want to integrate data deduplication into the software I am writing for VMware image backup. I could not find anything suitable for what seems necessary to me. There are many complete solutions that include one form of deduplication or another: storage or backup products that use public or private clouds, specialized file systems, storage networks or appliances, and so on. However, I need to develop my own solution and integrate dedup into it. My software will be written in C#, and I would like to be able to call an API to tell the deduplicator what to deduplicate.
The type of deduplication I am talking about does not deduplicate one image against another image — that is the usual approach for creating incremental or differential backups of two “versions” of something, and is what the Wikipedia entry on data deduplication calls “client backup deduplication.” I already have a solution for that and want to take it a step further.
I envision an approach that would let me deduplicate chunks of data at a global level (i.e., some form of global deduplication). Being global, I assume there will be some kind of central lookup table (for example, a hash index) that tells the deduplicator that a copy of the data being checked is already held and does not need to be stored again. Dedup could be file-level (Single Instance Storage, or SIS) or file/block-level. The latter should be more space-efficient (which matters more for our purposes than, say, processing overhead) and would be my preferred option, but I could make SIS work too if I had to.
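To make the idea above concrete, here is a minimal sketch of what such a global, block-level dedup store might look like. All names (`DedupStore`, `store`, `restore`) are hypothetical, and it uses fixed-size chunking for simplicity, where real systems often use content-defined chunking; it is in Java rather than C#, but the C# version would be nearly identical. The central hash index maps a SHA-256 digest of each chunk to the chunk itself, so a chunk that has been seen before is never stored twice:

```java
import java.io.ByteArrayOutputStream;
import java.security.MessageDigest;
import java.util.*;

// Hypothetical sketch of a global, block-level dedup store.
// Fixed-size chunking is used here only for illustration.
class DedupStore {
    private final Map<String, byte[]> index = new HashMap<>(); // central hash index
    private final int chunkSize;

    DedupStore(int chunkSize) { this.chunkSize = chunkSize; }

    private static String sha256Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new RuntimeException(e); // SHA-256 is always available
        }
    }

    // Split the image into fixed-size chunks and store only chunks the
    // index has not seen before. Returns the "recipe" of chunk hashes
    // needed to reconstruct this image later.
    List<String> store(byte[] image) {
        List<String> recipe = new ArrayList<>();
        for (int off = 0; off < image.length; off += chunkSize) {
            byte[] chunk = Arrays.copyOfRange(
                image, off, Math.min(off + chunkSize, image.length));
            String h = sha256Hex(chunk);
            index.putIfAbsent(h, chunk); // duplicate chunks stored once, globally
            recipe.add(h);
        }
        return recipe;
    }

    // Reassemble an image from its recipe of chunk hashes.
    byte[] restore(List<String> recipe) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String h : recipe) out.writeBytes(index.get(h));
        return out.toByteArray();
    }

    int uniqueChunks() { return index.size(); }
}
```

Note that only the recipe grows per backed-up image; identical chunks across all images share one copy in the index, which is what makes the deduplication global rather than per-image-pair.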
I have read a lot about other people's deduplicating software, as I said above. I will not give examples here because I am not trying to imitate anyone else specifically. Rather, I could not find a solution aimed at programmers and want to know whether anything like that exists. The alternative would be to roll my own, but that would be a pretty big task, to say the least.
Thanks.