How to read the second column in a large file

I have a huge file with millions of space-separated columns but only a limited number of lines:

examples.txt:

    1 2 3 4 5 ........
    3 1 2 3 5 .........
    l 6 3 2 2 ........

Now I just want to read in the second column:

    2
    1
    6

How can I do this in Java with high performance?

Thanks.

Update: The file is usually about 1.4 GB and contains hundreds of lines.

+6
3 answers

If your file is not statically structured (that is, the rows and columns do not have fixed sizes), your only option is the naive one: read through the file sequentially, looking for newlines, and take the second column after each of them. Use FileReader.
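A minimal sketch of that naive scan, for illustration only. The class and method names are made up, and it assumes columns are separated by exactly one space with no leading space on a line. FileReader decodes with the default charset, which is fine for plain digits.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

public class NaiveSecondColumn {

    /** Collects the second space-separated field of every line read from in. */
    static List<String> secondColumn(Reader in) throws IOException {
        List<String> result = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        int col = 0; // index of the column we are currently inside
        int c;
        while ((c = in.read()) != -1) {
            if (c == '\n') {
                if (col >= 1) {
                    result.add(field.toString());
                }
                field.setLength(0);
                col = 0;
            } else if (c == ' ') {
                col++;
            } else if (col == 1 && c != '\r') {
                field.append((char) c); // only characters of column 1 are kept
            }
        }
        if (col >= 1) {
            result.add(field.toString()); // last line had no trailing newline
        }
        return result;
    }

    /** Convenience wrapper: buffers a FileReader so read() is not a syscall per char. */
    static List<String> secondColumnOfFile(String path) throws IOException {
        try (Reader in = new BufferedReader(new FileReader(path))) {
            return secondColumn(in);
        }
    }
}
```

Only the characters of the second column are ever stored, so memory use stays small even when a single line is many megabytes long.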

If your file is statically structured, you can calculate where in the file the second column is for a given row and seek() to it directly.
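As a rough sketch of the seek() idea: this only works under the assumption that every row occupies exactly the same number of bytes (newline included) and that the second column starts at a fixed offset with a fixed width. The names recordLen, colOffset and colWidth are hypothetical parameters, not something the question's file is known to satisfy.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class FixedWidthSeek {

    /**
     * Reads one fixed-width field without touching the rest of the row.
     *
     * @param row       zero-based row number
     * @param recordLen bytes per row, including the line terminator (assumed constant)
     * @param colOffset byte offset of the field within a row (assumed constant)
     * @param colWidth  width of the field in bytes (assumed constant)
     */
    static String readField(RandomAccessFile file, long row, long recordLen,
                            long colOffset, int colWidth) throws IOException {
        byte[] field = new byte[colWidth];
        file.seek(row * recordLen + colOffset); // jump straight to the field
        file.readFully(field);                  // throws EOFException if the row does not exist
        return new String(field, StandardCharsets.US_ASCII).trim();
    }
}
```

For a file whose rows look like `01 02 03` followed by a newline, that would be readField(file, row, 9, 3, 2). With hundreds of rows this costs hundreds of small reads instead of a pass over the whole file.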

+2

Here is a small state machine that uses a FileInputStream as its input and does its own buffering. There is no character-set conversion.

On my 7-year-old 1.4 GHz laptop with 1/2 GB of memory, it takes 48 seconds to go through 1.28 billion bytes of data. Buffers larger than 4 KB ran slower.

On a 1-year-old MacBook with 4 GB, it runs in 14 seconds. Once the file is in the cache, it runs in 2.7 seconds. Again, buffers larger than 4 KB do not help. This is the same data file of about 1.2 billion bytes.

I would expect memory-mapped I/O to do better, but this approach is probably more portable.

It will extract whichever column you tell it to.

    import java.io.*;
    import java.util.Random;

    public class Test {

        public static class ColumnReader {

            private final InputStream is;
            private final int colIndex;
            private final byte[] buf;
            private int nBytes = 0;
            private int colVal = -1;
            private int bufPos = 0;

            public ColumnReader(InputStream is, int colIndex, int bufSize) {
                this.is = is;
                this.colIndex = colIndex;
                this.buf = new byte[bufSize];
            }

            /**
             * States for a tiny DFA to recognize columns.
             */
            private static final int START = 0;
            private static final int IN_ANY_COL = 1;
            private static final int IN_THE_COL = 2;
            private static final int WASTE_REST = 3;

            /**
             * Return value of colIndex'th column or -1 if none is found.
             *
             * @return value of column or -1 if none found.
             */
            public int getNext() {
                colVal = -1;
                bufPos = parseLine(bufPos);
                return colVal;
            }

            /**
             * If getNext() returns -1, this can be used to check if
             * we're at the end of file.
             *
             * Otherwise the column did not exist.
             *
             * @return end of file indication
             */
            public boolean atEoF() {
                return nBytes == -1;
            }

            /**
             * Parse a line.
             * The buffer is automatically refilled if p reaches the end.
             * This uses a standard DFA pattern.
             *
             * @param p position of line start in buffer
             * @return position of next unread character in buffer
             */
            private int parseLine(int p) {
                colVal = -1;
                int iCol = -1;
                int state = START;
                for (;;) {
                    if (p == nBytes) {
                        try {
                            nBytes = is.read(buf);
                        } catch (IOException ex) {
                            nBytes = -1;
                        }
                        if (nBytes == -1) {
                            return -1;
                        }
                        p = 0;
                    }
                    byte ch = buf[p++];
                    if (ch == '\n') {
                        return p;
                    }
                    switch (state) {
                    case START:
                        if ('0' <= ch && ch <= '9') {
                            if (++iCol == colIndex) {
                                state = IN_THE_COL;
                                colVal = ch - '0';
                            } else {
                                state = IN_ANY_COL;
                            }
                        }
                        break;
                    case IN_THE_COL:
                        if ('0' <= ch && ch <= '9') {
                            colVal = 10 * colVal + (ch - '0');
                        } else {
                            state = WASTE_REST;
                        }
                        break;
                    case IN_ANY_COL:
                        if (ch < '0' || ch > '9') {
                            state = START;
                        }
                        break;
                    case WASTE_REST:
                        break;
                    }
                }
            }
        }

        public static void main(String[] args) {
            final String fn = "data.txt";
            if (args.length > 0 && args[0].equals("--create-data")) {
                PrintWriter pw;
                try {
                    pw = new PrintWriter(fn);
                } catch (FileNotFoundException ex) {
                    System.err.println(ex.getMessage());
                    return;
                }
                Random gen = new Random();
                for (int row = 0; row < 100; row++) {
                    int rowLen = 4 * 1024 * 1024 + gen.nextInt(10000);
                    for (int col = 0; col < rowLen; col++) {
                        pw.print(gen.nextInt(32));
                        pw.print((col < rowLen - 1) ? ' ' : '\n');
                    }
                }
                pw.close();
            }

            FileInputStream fis;
            try {
                fis = new FileInputStream(fn);
            } catch (FileNotFoundException ex) {
                System.err.println(ex.getMessage());
                return;
            }
            ColumnReader cr = new ColumnReader(fis, 1, 4 * 1024);
            int val;
            long start = System.currentTimeMillis();
            while ((val = cr.getNext()) != -1) {
                System.out.print('.');
            }
            long stop = System.currentTimeMillis();
            System.out.println("\nelapsed = " + (stop - start) / 1000.0);
        }
    }

0

I have to agree with @gene: first try BufferedReader and readLine(), which is simple and easy to code. Just be careful not to alias the backing array between the result of readLine() and any substring you keep. String.substring() is a particularly common culprit, and I have had multi-megabyte backing arrays pinned in memory because a 3-character substring was still referencing them.
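A rough sketch of that first attempt, with made-up class and method names, assuming single-space separators. The explicit new String(...) is there as a precaution for JVMs older than Java 7u6, where substring() shared the parent's backing array; on later versions substring() already copies.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ReadLineSecondColumn {

    /** Returns the second space-separated field of each line, skipping one-column lines. */
    static List<String> secondColumn(BufferedReader in) throws IOException {
        List<String> result = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {
            int first = line.indexOf(' ');
            if (first < 0) {
                continue; // no second column on this line
            }
            int second = line.indexOf(' ', first + 1);
            if (second < 0) {
                second = line.length(); // second column is also the last one
            }
            // Copy the few characters we keep so that, on old JVMs, the result
            // does not hold a reference to the whole multi-megabyte line.
            result.add(new String(line.substring(first + 1, second)));
        }
        return result;
    }
}
```

Note that readLine() still materializes each full line as a String, which for this file means several megabytes per line, so this is the simple baseline rather than the fast path.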

Assuming ASCII, my preference would be to drop down to the byte level. Use mmap to view the file as a ByteBuffer, then do a linear scan for 0x20 and 0x0A (assuming Unix-style line separators). Then convert the relevant bytes to a String. With an 8-bit encoding, it is extremely hard to be faster than this.
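Something along these lines, as a sketch rather than a tuned implementation. The class and method names are invented, and it assumes ASCII data, single-space separators, '\n' line ends, and a file small enough for one mapping (a single FileChannel.map call is limited to Integer.MAX_VALUE bytes, so a 1.4 GB file fits but a larger one would need several windows).

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

public class MappedSecondColumn {

    /** Linear scan for 0x20 and 0x0A; returns the second field of each line. */
    static List<String> secondColumn(ByteBuffer buf) {
        List<String> result = new ArrayList<>();
        int col = 0;    // column index at the current position
        int start = -1; // offset where column 1 begins, -1 if not reached
        int end = -1;   // offset just past column 1, -1 if still open
        int limit = buf.limit();
        for (int i = 0; i <= limit; i++) {
            // Treat the end of the buffer as a final newline.
            byte b = (i < limit) ? buf.get(i) : 0x0A;
            if (b == 0x20) {
                col++;
                if (col == 1) {
                    start = i + 1;
                } else if (col == 2) {
                    end = i;
                }
            } else if (b == 0x0A) {
                if (start >= 0) {
                    if (end < 0) {
                        end = i; // column 1 ran to the end of the line
                    }
                    byte[] field = new byte[end - start];
                    for (int k = 0; k < field.length; k++) {
                        field[k] = buf.get(start + k);
                    }
                    result.add(new String(field, StandardCharsets.US_ASCII));
                }
                col = 0;
                start = -1;
                end = -1;
            }
        }
        return result;
    }

    /** Maps the whole file read-only and scans it. */
    static List<String> secondColumn(Path path) throws IOException {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            return secondColumn(ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()));
        }
    }
}
```

The scan uses absolute get(int) calls, so it never changes the buffer's position, and only the bytes of the second column are copied out of the mapping.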

If you are using Unicode, the problem is considerably more complicated, so I strongly recommend sticking with BufferedReader unless its performance is really unacceptable. If readLine() does not work for you, consider simply looping on calls to read().

Regardless, you should always specify a Charset when initializing a String from an external byte stream. This makes your encoding assumption explicit. So I recommend a slight modification of gene's suggestion, namely one of:

    int i = Integer.parseInt(new String(buffer, start, length, "US-ASCII"));
    int i = Integer.parseInt(new String(buffer, start, length, "ISO-8859-1"));
    int i = Integer.parseInt(new String(buffer, start, length, "UTF-8"));

as needed.

0

Source: https://habr.com/ru/post/918423/

