You basically want an out-of-core algorithm for computing (approximate) medians. Given the big file, compute its median along one dimension and split it into two smaller files. A kd-tree is the result of applying this process recursively, cycling through the dimensions (and once the smaller files start to fit in memory, you no longer need the out-of-core machinery). A minimal sketch of the recursion follows.
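As a rough illustration, here is an in-memory sketch of that recursion, assuming points are tuples of numbers; the out-of-core version would operate on files and use the median-finding procedure described next instead of a full sort. The name `build_kd_tree` is hypothetical, not from the original answer.

```python
def build_kd_tree(points, depth=0):
    """Recursively split points on the median along cycling dimensions.

    In the out-of-core setting each `points` list would be a file on disk,
    split around an (approximate) median found by sampling; this in-memory
    version just shows the shape of the recursion.
    """
    if not points:
        return None
    k = len(points[0])                    # dimensionality of the points
    axis = depth % k                      # cycle through the dimensions
    points.sort(key=lambda p: p[axis])    # stand-in for an external median split
    mid = len(points) // 2
    return {
        "point": points[mid],             # the median point becomes the node
        "left": build_kd_tree(points[:mid], depth + 1),
        "right": build_kd_tree(points[mid + 1:], depth + 1),
    }

# Example: build_kd_tree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
```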
To approximate the median of a big file, use reservoir sampling to grab a large but in-memory sample, then run an in-memory median-finding algorithm on it. Alternatively, for the exact median: use the sample to estimate, say, the 45th and 55th percentiles, then make another pass to extract the data points lying between them and compute the median exactly (unless the sample was unusually non-random and the true median falls outside those brackets, in which case retry). Details are in the Motwani and Raghavan book on randomized algorithms.
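A sketch of both approaches, assuming one-dimensional numeric keys (for the kd-tree you would apply this to the coordinate of the current splitting dimension). The function names and the `read_stream` callable are assumptions for illustration, not part of the original answer.

```python
import random
import statistics

def reservoir_sample(stream, k):
    """Algorithm R: uniform sample of up to k items from a stream of unknown length."""
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)
        else:
            j = random.randrange(i + 1)   # replace an existing slot with prob k/(i+1)
            if j < k:
                sample[j] = x
    return sample

def approximate_median(stream, k=100_000):
    """Median of a uniform sample approximates the median of the whole stream."""
    return statistics.median(reservoir_sample(stream, k))

def exact_median(read_stream, k=100_000):
    """Two-pass exact median: bracket it between sample percentiles, then
    pull only the in-between points into memory and select exactly.

    `read_stream` is a zero-argument callable returning a fresh iterator,
    because the data must be scanned twice.
    """
    sample = sorted(reservoir_sample(read_stream(), k))
    lo = sample[int(0.45 * len(sample))]  # likely lower bracket of the median
    hi = sample[int(0.55 * len(sample))]  # likely upper bracket

    below, above, middle = 0, 0, []
    for x in read_stream():
        if x < lo:
            below += 1
        elif x > hi:
            above += 1
        else:
            middle.append(x)

    n = below + len(middle) + above
    idx = (n - 1) // 2                    # lower median for even n
    if below <= idx < below + len(middle):
        middle.sort()
        return middle[idx - below]
    # Brackets missed the true median: the sample was unusually non-random.
    raise RuntimeError("brackets missed the median; retry with a fresh sample")
```

With a 100,000-point sample, the 45th/55th-percentile brackets are wide enough that missing the true median is very unlikely, so the expensive second pass almost never has to be repeated.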