Partitioning a large set of 3D data

I need to partition a large set of three-dimensional points (using C++). The points are stored on disk as a binary array of floats, and the files are usually larger than 10 GB. I need to split the set into smaller subsets of less than 1 GB each. Points within a subset must remain spatially close to one another, because I need to run certain algorithms on the data (for example, object detection).

I thought I could use a kd-tree. But how can I efficiently build a kd-tree if I cannot load all the points into RAM? Perhaps I could memory-map the file. Then I could store, in each kd-tree node, pointers to the three-dimensional points belonging to that segment. Would that work? Any other ideas?
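Roughly what I have in mind for the memory-mapping part, as a minimal sketch only. It assumes a 64-bit POSIX system, a hypothetical file name points.bin, and points stored as three consecutive 32-bit floats with no header:

```cpp
// Minimal sketch: memory-map a binary file of 3D float points (POSIX).
// Assumes a 64-bit address space and no file header.
#include <cstddef>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct Point3 { float x, y, z; };

int main() {
    const char* path = "points.bin";           // hypothetical file name
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    // Map the whole file read-only; the OS pages data in on demand,
    // so a 10 GB file does not need to fit in physical RAM.
    void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    const Point3* points = static_cast<const Point3*>(base);
    std::size_t count = static_cast<std::size_t>(st.st_size) / sizeof(Point3);
    if (count > 0)
        std::printf("mapped %zu points, first = (%f, %f, %f)\n",
                    count, points[0].x, points[0].y, points[0].z);

    munmap(base, st.st_size);
    close(fd);
    return 0;
}
```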

Thanks for your help. I hope you can help me solve this problem :D

+6
1 answer

You basically need an out-of-core algorithm for computing (approximate) medians. Given the large file, find its median and split the file into two smaller files. A kd-tree is the result of applying this process recursively along alternating dimensions (and once the smaller files start to fit into memory, you no longer have to bother with out-of-core algorithms).
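A rough sketch of what that recursion might look like, not a drop-in implementation. It assumes a Point3 struct of three floats and a helper approximate_median() that is only declared here (one possible implementation follows the next paragraph); the file-name suffixes and the maxBytes parameter are illustrative:

```cpp
// Sketch of a recursive out-of-core kd-split into files below a size limit.
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

struct Point3 { float x, y, z; };

float coord(const Point3& p, int dim) {
    return dim == 0 ? p.x : (dim == 1 ? p.y : p.z);
}

// Assumed helper: estimates the median of coordinate `dim` in one streaming
// pass over the file (e.g. via reservoir sampling, sketched further below).
float approximate_median(const std::string& file, int dim);

// Splits `file` along `dim` at the (approximate) median, writing the two
// halves to new files, and recurses until each file is below maxBytes.
void kd_split(const std::string& file, int dim, std::uint64_t maxBytes,
              std::vector<std::string>& leaves) {
    std::FILE* in = std::fopen(file.c_str(), "rb");
    std::fseek(in, 0, SEEK_END);
    std::uint64_t size = std::ftell(in);   // use 64-bit offsets (e.g. ftello)
                                           // for >2 GB files on some platforms
    std::rewind(in);

    if (size <= maxBytes) {                // small enough: this is a leaf subset
        std::fclose(in);
        leaves.push_back(file);
        return;
    }

    float m = approximate_median(file, dim);
    std::string left = file + ".l", right = file + ".r";
    std::FILE* lo = std::fopen(left.c_str(), "wb");
    std::FILE* hi = std::fopen(right.c_str(), "wb");

    // Stream the points once, routing each to the left or right half.
    Point3 p;
    while (std::fread(&p, sizeof p, 1, in) == 1)
        std::fwrite(&p, sizeof p, 1, coord(p, dim) < m ? lo : hi);

    std::fclose(in); std::fclose(lo); std::fclose(hi);

    // Recurse on the halves, cycling through x, y, z like a kd-tree.
    kd_split(left,  (dim + 1) % 3, maxBytes, leaves);
    kd_split(right, (dim + 1) % 3, maxBytes, leaves);
}
```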

To approximate the median of a large file, use reservoir sampling to grab a large but in-memory sample, then run an in-memory median-finding algorithm on it. Alternatively, for the exact median: compute the (approximate) 45th and 55th percentiles from the sample, then make another pass over the file to extract the data points between them and compute the exact median of that window (if the sample was unusually unrepresentative and the true median falls outside the window, try again). For more information, see Motwani and Raghavan's book on randomized algorithms.
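For illustration, one possible way to implement the approximate_median() helper assumed in the sketch above: reservoir sampling (Algorithm R) over a single coordinate, followed by std::nth_element on the in-memory sample. The sample size k is an arbitrary choice to be tuned to available RAM:

```cpp
// Approximate median of one coordinate via reservoir sampling.
#include <algorithm>
#include <cstdio>
#include <random>
#include <string>
#include <vector>

struct Point3 { float x, y, z; };  // same layout as in the sketches above

float approximate_median(const std::string& file, int dim) {
    const std::size_t k = 1'000'000;        // sample size, tune to available RAM
    std::vector<float> sample;
    sample.reserve(k);

    std::mt19937_64 rng{std::random_device{}()};
    std::FILE* in = std::fopen(file.c_str(), "rb");

    Point3 p;
    std::size_t seen = 0;
    while (std::fread(&p, sizeof p, 1, in) == 1) {
        float c = (dim == 0 ? p.x : dim == 1 ? p.y : p.z);
        if (sample.size() < k) {
            sample.push_back(c);            // fill the reservoir first
        } else {
            // Replace a random slot with probability k / (seen + 1), so every
            // point seen so far ends up in the sample with equal probability.
            std::uniform_int_distribution<std::size_t> d(0, seen);
            std::size_t j = d(rng);
            if (j < k) sample[j] = c;
        }
        ++seen;
    }
    std::fclose(in);

    if (sample.empty()) return 0.0f;        // empty file: nothing to estimate

    // In-memory selection of the sample's median (std::nth_element is O(n)).
    std::nth_element(sample.begin(), sample.begin() + sample.size() / 2,
                     sample.end());
    return sample[sample.size() / 2];
}
```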

+1
