Randomly select items from a large text file

I have a large file containing about 30 million user IDs. The file looks something like this, with one user ID per line:

    149905320
    1165665384
    66969324
    886633368
    1145241312
    286585320
    1008665352
    1135545396
    186217320
    132577356

Now I want to pick a random user ID from this large text file. I know the total number of user IDs in the file, but I'm not sure what the best way to select random elements from it is. I thought of loading all 30 million user IDs into a HashSet and then picking random elements from the set, but that approach would run out of memory.

That is why I am trying to select random elements directly from the large text file.

    final String id = generateRandomUserId(random);

    /**
     * Select a random element from a big text file.
     *
     * @param r source of randomness
     * @return a randomly chosen user ID
     */
    private String generateRandomUserId(Random r) {
        File bigFile = new File("C:\\bigfile.txt");

        // randomly select elements from a big text file
    }

What is the best way to do this?

3 answers

You can do this:

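  • Get the file size (in bytes).
  • Pick a random byte offset in [0, file.length()).
  • Open the file with RandomAccessFile and seek to that offset ( file.seek(number) ).
  • Skip forward to just past the next \n (for example, by reading and discarding the rest of the current line).
  • Read the next line ( file.readLine() ).

This way you do not need to store anything. Note that the pick is not perfectly uniform when lines differ in length: a line that follows a longer line is slightly more likely to be chosen, and the first line is only reachable if you wrap around at the end of the file.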

An example snippet might look like this (untested, so treat it as a sketch):

    File f = new File("D:/abc.txt");
    try (RandomAccessFile file = new RandomAccessFile(f, "r")) {
        long fileSize = file.length();

        // Pick a random byte position in [0, fileSize) and jump there
        long chosenByte = (long) (Math.random() * fileSize);
        file.seek(chosenByte);

        // Discard the rest of the line we landed in.
        // readLine() stops after \n, \r or \r\n, so all line endings are handled.
        file.readLine();

        // The next complete line is the chosen ID.
        // readLine() returns null if we landed in the last line of the file.
        String s = file.readLine();
        if (s != null && !s.isEmpty()) {
            long chosen = Long.parseLong(s.trim());
            System.out.println("Chosen id : \"" + chosen + "\"");
        }
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }


EDIT: A complete class that should work (in theory):

    import java.io.File;
    import java.io.FileNotFoundException;
    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class Main {

        /**
         * Picks a random byte in the file and prints the ID on the line containing it.
         * Assumes one ASCII ID per line. A byte that falls on a line terminator
         * selects the following line.
         *
         * @param args unused
         * @throws Exception if no ID could be read at the chosen position
         */
        public static void main(String[] args) throws Exception {
            File f = new File("D:/abc.txt");
            try (RandomAccessFile file = new RandomAccessFile(f, "r")) {
                long fileSize = file.length();

                // Let's start: pick a random byte position in [0, fileSize)
                long chosenByte = (long) (Math.random() * fileSize);
                StringBuilder id = new StringBuilder();
                int b;

                // Get left-hand chars: walk backwards to the previous line break
                for (long curByte = chosenByte; curByte >= 0; --curByte) {
                    file.seek(curByte);
                    b = file.read(); // returns -1 at end of file instead of throwing
                    if (b == -1 || b == '\n' || b == '\r')
                        break;
                    id.insert(0, (char) b);
                }

                // Get right-hand chars: read forward to the end of the line
                file.seek(chosenByte + 1);
                while ((b = file.read()) != -1) {
                    if (b == '\n' || b == '\r') {
                        if (id.length() > 0)
                            break;    // end of the chosen line
                        continue;     // landed on a terminator: skip to the next line
                    }
                    id.append((char) b);
                }

                // Parse ID
                if (id.length() > 0) {
                    long chosenId = Long.parseLong(id.toString());
                    System.out.println("Chosen id : " + chosenId);
                } else {
                    throw new Exception("No ID found at byte " + chosenByte
                            + " (empty file or trailing line break).");
                }
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }


I hope the implementation is not too far off (I mostly write C++ these days).

+7
source

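Instead of storing the user IDs in a hash set, you can scan the file once and keep only the starting offset of each line in an int[] array. For 30M lines that takes about 120 MB of RAM (4 bytes per offset), and int offsets are enough as long as the file is smaller than 2 GB.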
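
A minimal sketch of that idea, assuming one ASCII ID per line and a file under 2 GB. The class name LineOffsetIndex and its methods are made up for illustration; they are not from the question.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.util.Arrays;
import java.util.Random;

/** Hypothetical helper: records where each non-empty line starts, then serves random lines. */
final class LineOffsetIndex {
    private final String path;
    private final int[] offsets; // start offset of every non-empty line

    LineOffsetIndex(String path) throws IOException {
        this.path = path;
        this.offsets = scan(path);
    }

    /** Single sequential pass over the file, collecting line start offsets. */
    private static int[] scan(String path) throws IOException {
        int[] buf = new int[1024]; // grows by doubling; pre-size it if the line count is known
        int count = 0;
        try (InputStream in = new BufferedInputStream(new FileInputStream(path))) {
            long pos = 0;
            boolean atLineStart = true;
            int b;
            while ((b = in.read()) != -1) {
                if (b == '\n' || b == '\r') {
                    atLineStart = true;
                } else if (atLineStart) {
                    if (pos > Integer.MAX_VALUE) {
                        throw new IOException("file too large for int offsets");
                    }
                    if (count == buf.length) {
                        buf = Arrays.copyOf(buf, count * 2);
                    }
                    buf[count++] = (int) pos;
                    atLineStart = false;
                }
                pos++;
            }
        }
        return Arrays.copyOf(buf, count);
    }

    int size() {
        return offsets.length;
    }

    /** Picks an indexed line uniformly at random and reads it with a single seek. */
    String randomLine(Random r) throws IOException {
        if (offsets.length == 0) {
            throw new IllegalStateException("no lines indexed");
        }
        // Reopening the file on every call keeps the sketch short; a real version
        // would probably keep one RandomAccessFile open.
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            file.seek(offsets[r.nextInt(offsets.length)]);
            return file.readLine();
        }
    }
}
```

Unlike seeking to a random byte, every line is equally likely here, at the cost of one full pass over the file up front.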

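Alternatively, if you can modify or pre-process the file in any way, you can convert it to a fixed-width format, either by padding the user IDs to a common length or by using a binary format. The k-th record then starts at byte k × (record width), so no index is needed.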
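
As a rough illustration of the fixed-width variant, the sketch below zero-pads every ID to 10 digits plus a newline, then seeks straight to a random record. The width of 10 is an assumption based on the sample IDs in the question, and FixedWidthIds, convert and randomId are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Random;

/** Hypothetical helper for a fixed-width copy of the ID file. */
final class FixedWidthIds {
    static final int ID_WIDTH = 10;               // assumed: no ID has more than 10 digits
    static final int RECORD_WIDTH = ID_WIDTH + 1; // digits plus a single '\n'

    /** Rewrites a one-ID-per-line file so that every record has the same length. */
    static void convert(String in, String out) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(in), StandardCharsets.US_ASCII);
             BufferedWriter writer = Files.newBufferedWriter(Paths.get(out), StandardCharsets.US_ASCII)) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                String padded = String.format("%0" + ID_WIDTH + "d", Long.parseLong(line));
                if (padded.length() != ID_WIDTH) {
                    throw new IOException("ID does not fit in " + ID_WIDTH + " digits: " + line);
                }
                writer.write(padded);
                writer.write('\n'); // always '\n' so the record width is the same on every platform
            }
        }
    }

    /** Picks a record index uniformly at random and reads that record with one seek. */
    static long randomId(String path, Random r) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long records = file.length() / RECORD_WIDTH;
            if (records == 0) {
                throw new IOException("no complete records in " + path);
            }
            int k = r.nextInt(Math.toIntExact(records)); // throws if there are more than 2^31 - 1 records
            file.seek((long) k * RECORD_WIDTH);
            byte[] buf = new byte[ID_WIDTH];
            file.readFully(buf);
            return Long.parseLong(new String(buf, StandardCharsets.US_ASCII));
        }
    }
}
```

The conversion is a one-time cost; after that each pick is a single seek and an 10-byte read, and every ID is equally likely.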

0
source

The OP indicates: "I know the total number of user IDs in this large text file." Call it N.

  • Generate a random number k between 1 and N inclusive.
  • Read lines (with a BufferedReader) until you reach the k-th line.
  • Done.
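
The steps above could be sketched roughly as follows. SequentialPick and randomLine are hypothetical names, and the caller is assumed to pass the known line count N as totalLines.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Random;

/** Hypothetical helper: picks the k-th line by reading the file from the start. */
final class SequentialPick {

    /** Returns the k-th line (1-based), where k is uniform in [1, totalLines]. */
    static String randomLine(String path, int totalLines, Random r) throws IOException {
        int k = r.nextInt(totalLines) + 1; // nextInt(n) is in [0, n), so k is in [1, n]
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(path), StandardCharsets.US_ASCII)) {
            String line = null;
            for (int i = 0; i < k; i++) {
                line = reader.readLine();
                if (line == null) {
                    throw new IOException("file has fewer than " + totalLines + " lines");
                }
            }
            return line;
        }
    }
}
```

This needs no extra memory and gives a uniform pick, but each call reads half the file on average, so it only makes sense if picks are rare.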
0
source

Source: https://habr.com/ru/post/1487174/

