To say more about randomization and IO:
The corpus/data is always split into chunks. Chunking makes IO efficient, because all sequences of a chunk are read in one go (a chunk is usually 32/64 MB).
Randomization happens in two steps:
- the chunks themselves are shuffled
- given a randomization window of N samples, the randomizer maintains a rolling window of M chunks that together contain approximately N samples. All sequences inside this rolling window are shuffled. Once every sequence of a chunk has been processed, the randomizer frees that chunk and starts loading the next one asynchronously.
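The two steps above can be sketched as follows. This is a minimal illustrative model, not the actual reader implementation: the function name, the half-window drain policy, and the synchronous "loading" are all assumptions, and the real reader performs the chunk IO asynchronously.

```python
import random

def randomize(chunks, window_samples, seed=0):
    """Two-level shuffle sketch.

    chunks: list of chunks, each chunk a list of sequences.
    window_samples: approximate size N of the randomization window.
    Yields every sequence exactly once, in randomized order.
    """
    rng = random.Random(seed)

    # Step 1: shuffle the order in which chunks are loaded.
    order = list(range(len(chunks)))
    rng.shuffle(order)

    window = []  # sequences currently inside the rolling window
    i = 0
    while i < len(order) or window:
        # Step 2: top up the rolling window with whole chunks until it
        # holds approximately window_samples sequences (the real reader
        # would load these chunks asynchronously).
        while i < len(order) and len(window) < window_samples:
            window.extend(chunks[order[i]])
            i += 1

        # Shuffle all sequences inside the rolling window, then emit
        # (and conceptually free) part of it before refilling.
        rng.shuffle(window)
        emit_count = max(1, len(window) // 2)
        yield from window[:emit_count]
        window = window[emit_count:]
```

Because the shuffling is seeded, the same seed reproduces the same randomized order, which is what makes such readers deterministic across restarts.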