Sorting a giant binary file using C#

I have a large file, about 400 GB in size, produced daily by an external closed system. It is a binary file with the following format:

byte[8]byte[4]byte[n]

where n equals the value of the int32 stored in byte[4].

There are no delimiters in this file; to read the whole file you just repeat until EOF, with each "element" represented as byte[8]byte[4]byte[n].

The file looks like this:

 byte[8]byte[4]byte[n]byte[8]byte[4]byte[n]...EOF 

byte[8] is a 64-bit integer representing a timestamp in .NET Ticks. I need to sort this file, but my current approach does not seem to be the fastest way to do it.

Currently I read through to the end of the file once, loading each record's Ticks value plus the start position and length of its byte[n] payload into a struct in a list. Then I sort the list in memory by the Ticks property, open a BinaryReader, seek to each position in Ticks order, read the byte[n] value, and write it to an output file.
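For reference, a minimal sketch of the index-then-seek approach described above (the type and method names are mine, not from the original post; error handling is omitted):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Record layout assumed: long ticks, int length, byte[length] payload.
struct RecordIndex
{
    public long Ticks;   // sort key
    public long Offset;  // file position of the payload
    public int Length;   // payload length
}

static class IndexSort
{
    public static void SortFile(string inputPath, string outputPath)
    {
        var index = new List<RecordIndex>();

        // Pass 1: scan the file and build the in-memory index.
        using (var reader = new BinaryReader(File.OpenRead(inputPath)))
        {
            long fileLength = reader.BaseStream.Length;
            while (reader.BaseStream.Position < fileLength)
            {
                long ticks = reader.ReadInt64();
                int length = reader.ReadInt32();
                index.Add(new RecordIndex
                {
                    Ticks = ticks,
                    Offset = reader.BaseStream.Position,
                    Length = length
                });
                reader.BaseStream.Seek(length, SeekOrigin.Current); // skip payload
            }
        }

        index.Sort((a, b) => a.Ticks.CompareTo(b.Ticks));

        // Pass 2: seek to each payload in sorted order. This is the
        // random-I/O step that makes the approach slow on a 400 GB file.
        using (var reader = new BinaryReader(File.OpenRead(inputPath)))
        using (var writer = new BinaryWriter(File.Create(outputPath)))
        {
            foreach (var rec in index)
            {
                reader.BaseStream.Seek(rec.Offset, SeekOrigin.Begin);
                byte[] payload = reader.ReadBytes(rec.Length);
                writer.Write(rec.Ticks);
                writer.Write(rec.Length);
                writer.Write(payload);
            }
        }
    }
}
```

The second pass is essentially 400 GB of random reads, which is why disk I/O dominates the runtime.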

At the end of the process I do get a sorted binary file, but it takes FOREVER. I am using C#/.NET on a fairly beefy server, but disk I/O seems to be the bottleneck.

Server specs:

  • 2x 2.6 GHz Intel Xeon (Hex-Core with HT) (24 threads)
  • RAM 32 GB
  • 500 GB RAID 1 + 0
  • 2TB RAID 5

I have searched all over the Internet, but I can only find examples where the "huge" file is 1 GB (which makes me smile).

Does anyone have any tips?

+6
4 answers

A great way to speed up access to this file is to memory-map it: map the entire file into your address space and let the OS take care of paging in whatever parts of the file it needs. Do the same as you do now, except read from the mapped memory instead of using BinaryReader/seek/read.

You have plenty of main memory, so this should give pretty good performance (as long as you are running a 64-bit OS).
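A minimal sketch of what this might look like with `System.IO.MemoryMappedFiles` (the class and method names here are the real .NET API; the surrounding structure is my assumption):

```csharp
using System.IO;
using System.IO.MemoryMappedFiles;

// Map the file and read the 12-byte record headers directly from
// memory instead of BinaryReader + Seek.
static class MappedScan
{
    public static void BuildIndex(string path)
    {
        long fileLength = new FileInfo(path).Length;
        using var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open);
        using var view = mmf.CreateViewAccessor(0, fileLength,
                                                MemoryMappedFileAccess.Read);
        long pos = 0;
        while (pos < fileLength)
        {
            long ticks = view.ReadInt64(pos);      // byte[8] key
            int length = view.ReadInt32(pos + 8);  // byte[4] payload size
            // ... record (ticks, pos + 12, length) in your index here ...
            pos += 12 + length;                    // advance to next record
        }
    }
}
```

The OS page cache then decides what stays resident, which tends to handle sequential scans and repeated touches far better than explicit seeks.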

+7

Use merge sort. It is an online algorithm and parallelizes well.

http://en.wikipedia.org/wiki/Merge_sort
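For a file that will not fit in RAM, the relevant variant is an external merge sort: sort chunks that do fit, write them as sorted "run" files, then merge the runs. A hedged sketch of the merge phase, assuming the run files use the same byte[8]byte[4]byte[n] layout (`PriorityQueue<,>` requires .NET 6+; names are mine):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// K-way merge of already-sorted run files: a priority queue keyed on
// ticks always yields the run whose next record is smallest.
static class KWayMerge
{
    public static void Merge(string[] runPaths, string outputPath)
    {
        var readers = Array.ConvertAll(runPaths,
                                       p => new BinaryReader(File.OpenRead(p)));
        var queue = new PriorityQueue<int, long>(); // element: run index, priority: ticks
        var pendingLen = new int[readers.Length];   // payload length of each run's head

        // Reads the next record header from run i; returns false at EOF.
        bool TryReadHeader(int i, out long ticks)
        {
            ticks = 0;
            var s = readers[i].BaseStream;
            if (s.Position >= s.Length) return false;
            ticks = readers[i].ReadInt64();
            pendingLen[i] = readers[i].ReadInt32();
            return true;
        }

        for (int i = 0; i < readers.Length; i++)
            if (TryReadHeader(i, out long t))
                queue.Enqueue(i, t);

        using var writer = new BinaryWriter(File.Create(outputPath));
        while (queue.TryDequeue(out int i, out long ticks))
        {
            byte[] payload = readers[i].ReadBytes(pendingLen[i]);
            writer.Write(ticks);
            writer.Write(pendingLen[i]);
            writer.Write(payload);
            if (TryReadHeader(i, out long next))
                queue.Enqueue(i, next);
        }

        foreach (var r in readers) r.Dispose();
    }
}
```

Both phases are purely sequential I/O, which is what a RAID array is good at.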

+5

If you are willing to learn Erlang or Go, they can be very powerful and scale very well, given that you have 24 hardware threads. Use async I/O. Use merge sort. And since you have 32 GB of RAM, load as much as you can into RAM, sort it there, and then write it back to disk.
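The "load as much as fits, sort, write back" step can be sketched in C# as run generation: batch records up to a memory budget, sort the batch, flush it as a sorted run file, repeat. Everything here (budget size, file names, helpers) is my assumption, not the answerer's code:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class RunGenerator
{
    // Assumed per-batch budget; tune to what the 32 GB box can spare.
    private const long MemoryBudget = 4L * 1024 * 1024 * 1024;

    public static List<string> WriteSortedRuns(string inputPath, string runDir)
    {
        var runPaths = new List<string>();
        var batch = new List<(long Ticks, byte[] Payload)>();
        long batchBytes = 0;

        using var reader = new BinaryReader(File.OpenRead(inputPath));
        long fileLength = reader.BaseStream.Length;

        void FlushBatch()
        {
            if (batch.Count == 0) return;
            batch.Sort((a, b) => a.Ticks.CompareTo(b.Ticks));
            string path = Path.Combine(runDir, $"run{runPaths.Count:D4}.bin");
            using var writer = new BinaryWriter(File.Create(path));
            foreach (var (ticks, payload) in batch)
            {
                writer.Write(ticks);
                writer.Write(payload.Length);
                writer.Write(payload);
            }
            runPaths.Add(path);
            batch.Clear();
            batchBytes = 0;
        }

        while (reader.BaseStream.Position < fileLength)
        {
            long ticks = reader.ReadInt64();
            int length = reader.ReadInt32();
            batch.Add((ticks, reader.ReadBytes(length)));
            batchBytes += 12 + length;
            if (batchBytes >= MemoryBudget) FlushBatch();
        }
        FlushBatch();
        return runPaths;
    }
}
```

With a ~4 GB budget this produces on the order of a hundred sorted runs from 400 GB, which a single merge pass can then combine.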

+3

I would do this in several passes. In the first pass, I would build a list of the tick values and then work out how to split them evenly into many (hundreds of?) buckets. If you know in advance that the ticks are evenly distributed, you can skip this initial pass. In the second pass, I would distribute the records into those several hundred separate files of roughly equal size (the much smaller files represent groups of ticks in the order you want). Then I would sort each file individually in memory, and finally concatenate the sorted bucket files in order.

This is somewhat similar to a bucket sort (I think).
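A hedged sketch of the partitioning pass this answer describes, assuming the minimum and maximum tick values are known from the initial scan (all names and the bucket count are illustrative):

```csharp
using System;
using System.IO;

// Route each record to one of N bucket files by its tick value. Each
// bucket is then small enough to sort in RAM, and because buckets are
// range-partitioned, the sorted buckets just concatenate in order.
static class TickPartitioner
{
    public static void Partition(string inputPath, string bucketDir,
                                 long minTicks, long maxTicks, int bucketCount)
    {
        var writers = new BinaryWriter[bucketCount];
        for (int i = 0; i < bucketCount; i++)
            writers[i] = new BinaryWriter(
                File.Create(Path.Combine(bucketDir, $"bucket{i:D3}.bin")));

        double span = (double)(maxTicks - minTicks) + 1;
        using var reader = new BinaryReader(File.OpenRead(inputPath));
        long fileLength = reader.BaseStream.Length;

        while (reader.BaseStream.Position < fileLength)
        {
            long ticks = reader.ReadInt64();
            int length = reader.ReadInt32();
            byte[] payload = reader.ReadBytes(length);

            int bucket = (int)((ticks - minTicks) / span * bucketCount);
            if (bucket >= bucketCount) bucket = bucketCount - 1; // guard upper edge
            writers[bucket].Write(ticks);
            writers[bucket].Write(length);
            writers[bucket].Write(payload);
        }

        foreach (var w in writers) w.Dispose();
    }
}
```

If the ticks are skewed rather than uniform, the first-pass histogram would be used to choose uneven bucket boundaries instead of the linear formula above.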

+1

Source: https://habr.com/ru/post/898332/

