What is the CNTK randomizationWindow behavior?

I have a quick question about the reader's randomizationWindow parameter. The documentation says it controls how much data is held in memory, but I'm a little unclear on what effect it has on the randomness of the data. If the training data file starts with one data distribution and ends with a completely different one, will setting the randomization window smaller than the data size mean that the data fed to the trainer is not drawn uniformly from the whole dataset? I just wanted to double-check.

+5
2 answers

When the randomizationWindow parameter is set to a window smaller than the entire data size, the data is divided into chunks of randomizationWindow size and the order of those chunks is randomized. Then, within each chunk, the samples are randomized.
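A minimal Python sketch of the scheme described above (an illustration, not CNTK's actual implementation): split the corpus into window-sized chunks, shuffle the chunk order, then shuffle within each chunk. The function name and parameters are hypothetical.

```python
import random

def chunked_shuffle(samples, window, seed=0):
    """Illustrative sketch (not CNTK's implementation): split the corpus
    into window-sized chunks, shuffle the chunk order, then shuffle the
    samples inside each chunk."""
    rng = random.Random(seed)
    chunks = [samples[i:i + window] for i in range(0, len(samples), window)]
    rng.shuffle(chunks)        # step 1: randomize the chunk order
    for chunk in chunks:
        rng.shuffle(chunk)     # step 2: randomize within each chunk
    return [s for chunk in chunks for s in chunk]

# With a window much smaller than the corpus, a sample can only move
# within its own chunk, so samples from the two halves of the file
# never end up interleaved in the same chunk:
print(chunked_shuffle(list(range(10)), window=5))
```

This also answers the original concern: with a small window, minibatches are drawn from one region of the file at a time, so a file whose distribution drifts from start to end will feed the trainer non-uniform data.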

+3

To say a bit more about randomization and IO:

The whole corpus is always split into chunks. Chunks make IO efficient because all sequences of a chunk are read at once (a chunk is usually 32/64 MB).

When it comes to randomization, there are two steps:

  • all chunks are randomized
  • given a randomization window of N samples, the randomizer creates a rolling window of M chunks that together contain approximately N samples. All sequences inside this rolling window are randomized. When all sequences of a chunk have been processed, the randomizer can release that chunk and start loading the next one asynchronously.
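The two steps above can be sketched as a generator (a simplified, synchronous illustration, not CNTK's source; the function and parameter names are hypothetical, and the asynchronous chunk loading is done inline here):

```python
import random

def rolling_window_randomizer(chunks, n_samples, seed=0):
    """Sketch of the two-step scheme: shuffle the chunk order, then keep
    a rolling window of whole chunks holding roughly n_samples samples,
    shuffling the window's contents and draining it one sample at a time."""
    rng = random.Random(seed)
    order = list(chunks)
    rng.shuffle(order)                    # step 1: randomize chunk order
    window, i = [], 0
    while i < len(order) or window:
        refilled = False
        # load chunks until the window holds roughly n_samples samples
        # (CNTK does this loading asynchronously; here it is synchronous)
        while i < len(order) and len(window) < n_samples:
            window.extend(order[i])
            i += 1
            refilled = True
        if refilled:
            rng.shuffle(window)           # step 2: shuffle within the window
        yield window.pop()                # emit a sample, freeing space

# Samples are randomized only within the rolling window, so with a small
# n_samples, nearby chunks mix but distant parts of the corpus do not:
out = list(rolling_window_randomizer([[0, 1], [2, 3], [4, 5], [6, 7]],
                                     n_samples=4))
```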
+4

Source: https://habr.com/ru/post/1262323/

