Is there a way in Java to randomize a file too large to fit in memory?

What I would like to do is shuffle the lines (read from a CSV), then write the first 10,000 shuffled lines to one CSV and the rest to a separate CSV. With a smaller file, I can do something like

    java.util.Collections.shuffle(...);
    for (int i=0; i < 10000; i++) printcsv(...)
    for (int i=10000; i < data.length; i++) printcsv(...)

However, with very large files, I get an OutOfMemoryError.

6 answers

Here is one of the possible algorithms:

  1. Let MAX_LINES be the maximum number of lines that can be handled in memory at once.
  2. Read MAX_LINES lines from the input file, shuffle them using the original algorithm, and write them to a temporary file.
  3. Repeat step 2 until no lines remain in the input file.
  4. Let N be a random number between 0 and the number of temporary files you wrote; read the next line from the Nth temporary file.
  5. Repeat step 4 until you have read all the lines from all the files; write the first 10,000 lines read to the first output file and all the other lines to the second.
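
The steps above can be sketched roughly as follows. This is only an illustration: the class name ChunkedShuffle, the method split and its parameters are made up for the example. One deliberate deviation from step 4, which is my assumption and not part of the answer: the temporary file is picked with probability proportional to its unread lines instead of uniformly, so that a short last chunk is not drained early.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class ChunkedShuffle {

    /** Shuffles one in-memory chunk, writes it to a temp file, and records its size. */
    private static void flush(List<String> buf, List<Path> chunks, List<Long> remaining,
                              Random rnd) throws IOException {
        Collections.shuffle(buf, rnd);
        Path tmp = Files.createTempFile("chunk", ".csv");
        Files.write(tmp, buf);
        chunks.add(tmp);
        remaining.add((long) buf.size());
        buf.clear();
    }

    /** Writes the first firstCount shuffled lines to first and all other lines to rest. */
    static void split(Path input, Path first, Path rest, long firstCount,
                      int maxLines, Random rnd) throws IOException {
        List<Path> chunks = new ArrayList<>();
        List<Long> remaining = new ArrayList<>(); // unread lines per chunk
        try (BufferedReader in = Files.newBufferedReader(input)) {
            List<String> buf = new ArrayList<>();
            String line;
            while ((line = in.readLine()) != null) {
                buf.add(line);
                if (buf.size() == maxLines) flush(buf, chunks, remaining, rnd);
            }
            if (!buf.isEmpty()) flush(buf, chunks, remaining, rnd);
        }
        long total = 0;
        for (long r : remaining) total += r;

        List<BufferedReader> readers = new ArrayList<>();
        try (BufferedWriter outFirst = Files.newBufferedWriter(first);
             BufferedWriter outRest = Files.newBufferedWriter(rest)) {
            for (Path p : chunks) readers.add(Files.newBufferedReader(p));
            for (long written = 0; total > 0; written++, total--) {
                // Assumption: weight the pick by unread lines (modulo bias is negligible here).
                long pick = Math.floorMod(rnd.nextLong(), total);
                int i = 0;
                while (pick >= remaining.get(i)) {
                    pick -= remaining.get(i);
                    i++;
                }
                remaining.set(i, remaining.get(i) - 1);
                BufferedWriter out = written < firstCount ? outFirst : outRest;
                out.write(readers.get(i).readLine());
                out.newLine();
            }
        } finally {
            for (BufferedReader r : readers) r.close();
            for (Path p : chunks) Files.deleteIfExists(p);
        }
    }
}
```

Something like ChunkedShuffle.split(input, first, rest, 10000, 1000000, new Random()) would then keep at most a million lines in memory at a time. Note that one reader stays open per temporary file, so MAX_LINES should not be so small that the number of chunks exceeds the open-file limit.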

You can:

  • Use more memory or

  • Shuffle not the actual CSV lines but only the set of line numbers, then read the input file sequentially (buffered, of course) and write each line to whichever output file its line number was assigned to.


You can memory-map the file and find all the newlines, saving their positions in an int or long array. Create an array of int indices and shuffle them. This should use about 8-32 bytes per line. If this does not fit into memory, you can also use memory-mapped files for these arrays.
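
A rough sketch of that idea, with some assumptions of mine: it scans the file with a plain buffered stream instead of memory mapping, it assumes UTF-8 and fewer than 2^31 lines, and the names OffsetIndexShuffle, lineStarts and shuffle are invented for the example. It keeps one long offset and one int index per line (12 bytes), then reads the lines back in shuffled order with RandomAccessFile.

```java
import java.io.BufferedInputStream;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Random;

public class OffsetIndexShuffle {

    /** Returns the byte offset of every line start, plus the file length as a final sentinel. */
    static long[] lineStarts(Path file) throws IOException {
        long[] starts = new long[1024];
        int n = 0;
        long pos = 0;
        boolean atLineStart = true;
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            int b;
            while ((b = in.read()) != -1) {
                if (atLineStart) {
                    if (n == starts.length) starts = Arrays.copyOf(starts, n * 2);
                    starts[n++] = pos;
                    atLineStart = false;
                }
                pos++;
                if (b == '\n') atLineStart = true;
            }
        }
        starts = Arrays.copyOf(starts, n + 1);
        starts[n] = pos; // sentinel: end of the last line
        return starts;
    }

    /** Writes the lines of input to output in random order without holding them in memory. */
    static void shuffle(Path input, Path output, Random rnd) throws IOException {
        long[] starts = lineStarts(input);
        int n = starts.length - 1;
        int[] order = new int[n];
        for (int i = 0; i < n; i++) order[i] = i;
        for (int i = n - 1; i > 0; i--) { // Fisher-Yates shuffle of the indices
            int j = rnd.nextInt(i + 1);
            int tmp = order[i];
            order[i] = order[j];
            order[j] = tmp;
        }
        try (RandomAccessFile raf = new RandomAccessFile(input.toFile(), "r");
             BufferedWriter out = Files.newBufferedWriter(output)) {
            for (int idx : order) {
                byte[] buf = new byte[(int) (starts[idx + 1] - starts[idx])];
                raf.seek(starts[idx]);
                raf.readFully(buf);
                String line = new String(buf, StandardCharsets.UTF_8);
                int end = line.length(); // drop the trailing \n or \r\n
                while (end > 0 && (line.charAt(end - 1) == '\n' || line.charAt(end - 1) == '\r')) end--;
                out.write(line, 0, end);
                out.newLine();
            }
        }
    }
}
```

The seeks are random, so this is likely to be much slower than a sequential pass on a spinning disk; the trade-off is that only the index lives in memory.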


Use some kind of indexing scheme. Parse your CSV file once to count the lines (do not store anything in memory, just parse it) and select 10,000 numbers from that range at random (make sure you avoid duplicates, for example with a Set<Integer> or something more sophisticated). Then parse your CSV a second time, again keeping a line counter. If the line number matches one of your randomly selected numbers, write the line to one CSV file. Write the lines with non-matching numbers to another file.
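
As a minimal sketch of this two-pass scheme (the class TwoPassSplit and its method names are hypothetical, and I use Set<Long> rather than Set<Integer> so that files with more than 2^31 lines would still work):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class TwoPassSplit {

    /** First pass: counts lines without keeping any of them. */
    static long countLines(Path file) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(file)) {
            long n = 0;
            while (in.readLine() != null) n++;
            return n;
        }
    }

    /** Picks k distinct 0-based line numbers below n; the Set rejects duplicates. */
    static Set<Long> pickDistinct(long n, int k, Random rnd) {
        if (k > n) throw new IllegalArgumentException("k must not exceed the line count");
        Set<Long> chosen = new HashSet<>();
        while (chosen.size() < k) {
            chosen.add(Math.floorMod(rnd.nextLong(), n)); // tiny modulo bias, fine for a sketch
        }
        return chosen;
    }

    /** Second pass: selected line numbers go to one file, everything else to the other. */
    static void split(Path input, Path selected, Path others, int k, Random rnd)
            throws IOException {
        Set<Long> chosen = pickDistinct(countLines(input), k, rnd);
        try (BufferedReader in = Files.newBufferedReader(input);
             BufferedWriter outSel = Files.newBufferedWriter(selected);
             BufferedWriter outOth = Files.newBufferedWriter(others)) {
            String line;
            for (long lineNo = 0; (line = in.readLine()) != null; lineNo++) {
                BufferedWriter out = chosen.contains(lineNo) ? outSel : outOth;
                out.write(line);
                out.newLine();
            }
        }
    }
}
```

Both output files keep the original file order, so this gives a random selection of 10,000 lines, not a shuffled one. The retry loop in pickDistinct is only cheap while k is small compared with the line count.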

  1. First of all, count the number of lines in the input file by reading its contents (but not keeping them in memory). Call the number of lines N.
  2. Take a random sample of size 10,000 from the numbers 1 .. N.
  3. Read the source file from the beginning. For each line, if the line number is in the sample taken in step 2, write the line to file1; otherwise write it to file2.

Step 2 can be performed while performing step 1 by using reservoir sampling.
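
One way to fold step 2 into step 1 is reservoir sampling (Algorithm R). Here is a small sketch under my own assumptions: the class ReservoirLineSample and the method sample are names invented for the example, and the bounded random draw uses Math.floorMod, which has a negligible bias for realistic line counts.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.Arrays;
import java.util.Random;

public class ReservoirLineSample {

    /**
     * Reads the input once and returns k line numbers (1-based) chosen uniformly
     * at random, or every line number if the input has fewer than k lines.
     */
    static long[] sample(BufferedReader in, int k, Random rnd) throws IOException {
        long[] reservoir = new long[k];
        long seen = 0;
        while (in.readLine() != null) {
            seen++;
            if (seen <= k) {
                // Fill the reservoir with the first k line numbers.
                reservoir[(int) (seen - 1)] = seen;
            } else {
                // Keep the new line number with probability k / seen.
                long j = Math.floorMod(rnd.nextLong(), seen);
                if (j < k) reservoir[(int) j] = seen;
            }
        }
        return seen < k ? Arrays.copyOf(reservoir, (int) seen) : reservoir;
    }
}
```

After this single pass the returned array plays the role of the sample from step 2, and the total line count N is never needed in advance. Step 3 is still a second pass over the file.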


If you know the number of lines in your file, and if you are randomizing complete lines, you can simply randomize by line number and then read the selected lines. Just pick random line numbers using the Random class and keep the set of numbers already picked, so you don't select the same line twice.

    import java.io.*;
    import java.util.*;

    BufferedReader reader = new BufferedReader(new FileReader(new File("file.cvs")));
    BufferedWriter chosen = new BufferedWriter(new FileWriter(new File("chosen.cvs")));
    BufferedWriter notChosen = new BufferedWriter(new FileWriter(new File("notChosen.cvs")));

    int numChosenRows = 10000;
    long numLines = 1000000000;

    Set<Long> chosenRows = new HashSet<Long>(numChosenRows + 1, 1);
    for (int i = 0; i < numChosenRows; i++) {
        while (!chosenRows.add(nextLong(numLines))) {
            // add returns false if the value already exists in the Set
        }
    }

    String line;
    for (long lineNo = 0; (line = reader.readLine()) != null; lineNo++) {
        if (chosenRows.contains(lineNo)) {
            // Do nothing for the moment
        } else {
            notChosen.write(line);
            notChosen.newLine(); // readLine() strips the line terminator
        }
    }

    // Randomise the set of chosen rows
    // Use RandomAccessFile to write the rows in that order

See this answer for the nextLong method, which produces a random long below a given bound.
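
The linked answer is not reproduced here, so the following is only one common way to write such a helper, not necessarily the one meant above. The class name BoundedRandom is made up, and unlike the nextLong(numLines) call in the snippet this version takes the Random instance as an explicit first argument. It uses rejection sampling, the same idea as Random.nextInt(int), to avoid modulo bias.

```java
import java.util.Random;

public class BoundedRandom {

    /** Returns a uniformly distributed long in [0, bound). */
    static long nextLong(Random rng, long bound) {
        if (bound <= 0) throw new IllegalArgumentException("bound must be positive");
        long bits, val;
        do {
            bits = (rng.nextLong() << 1) >>> 1; // clear the sign bit: 63 random bits
            val = bits % bound;
        } while (bits - val + (bound - 1) < 0L); // overflow means bits fell in the biased tail
        return val;
    }
}
```

If a shared, non-seedable generator is acceptable, ThreadLocalRandom.current().nextLong(bound) from java.util.concurrent does the same job without a helper.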

Edit: Like most people, I overlooked the requirement to write the randomly selected lines in random order. I assume that RandomAccessFile will help with this. Just shuffle the list of chosen rows and access them in that order. As for the unchosen ones, I edited the code above to simply skip the chosen ones.


Source: https://habr.com/ru/post/899956/

