Parsing a large text file that is being modified on the fly

I need to parse a large CSV file in near real time while it is being modified (appended to) by another process. By "large" I mean ~20 GB at the moment, and it is growing slowly. The application only needs to detect and report certain anomalies in the data stream, for which it has to keep just a little state (O(1) space).

I thought about polling the file attributes (size) every couple of seconds, opening a read-only stream, seeking to the previous position, and then continuing to parse from where I last stopped. But since this is a text file (CSV), I obviously also have to keep track of newlines whenever I continue, so that I always parse whole lines only.
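Roughly what I have in mind (untested sketch; the file name, the single-byte encoding, and the ProcessLine placeholder are just for illustration, not part of any existing library):

    using System.IO;
    using System.Text;
    using System.Threading;

    class TailPoller
    {
        static void Main()
        {
            const string path = "data.csv";   // hypothetical input file
            long lastOffset = 0;              // byte offset of the first unprocessed byte

            while (true)
            {
                if (new FileInfo(path).Length > lastOffset)
                {
                    // Open read-only and let the writer keep appending (shared access).
                    using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
                    {
                        fs.Seek(lastOffset, SeekOrigin.Begin);
                        using (var reader = new StreamReader(fs, Encoding.ASCII))
                        {
                            string chunk = reader.ReadToEnd();

                            // Only consume up to the last newline, so we never parse half a line.
                            int lastNewline = chunk.LastIndexOf('\n');
                            if (lastNewline >= 0)
                            {
                                foreach (var line in chunk.Substring(0, lastNewline).Split('\n'))
                                    ProcessLine(line.TrimEnd('\r'));

                                // Single-byte encoding assumed, so char count == byte count.
                                lastOffset += lastNewline + 1;
                            }
                        }
                    }
                }
                Thread.Sleep(2000);   // poll "every couple of seconds"
            }
        }

        static void ProcessLine(string line)
        {
            // Placeholder for the O(1)-state anomaly detection on one CSV record.
        }
    }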

If I am not mistaken, this should not be that hard to implement, but I wanted to know whether there is a common approach / library that already solves some of these problems?

Note: I do not need a CSV parser. I need advice on a library that makes it easy to read lines from a file that is being changed on the fly.

+6
3 answers

There is a small problem here:

  • Reading and parsing CSV requires a TextReader
  • Positioning does not work (well) with TextReaders.

First thought: keep it open. If the producer and the parser both operate in non-exclusive (shared) mode, it should be possible to ReadLine-until-null, pause, ReadLine-until-null again, and so on.


"it should be 7-bit ASCII, only some GUIDs and numbers"

That allows you to track the file position yourself (pos += line.Length + 2). Make sure you open it with Encoding.ASCII. You can then reopen the file as a plain binary stream, seek to the last position, and only then attach a StreamReader to that stream.
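A minimal sketch of that idea, assuming the file really is plain ASCII with CRLF line endings (so each line costs exactly line.Length + 2 bytes); the ProcessLine placeholder is just for illustration:

    using System.IO;
    using System.Text;

    static class TailParser
    {
        // Reopen the file as a plain binary stream, seek to the saved byte position,
        // and only then attach the StreamReader. Returns the updated position.
        public static long ParseFrom(string path, long lastPos)
        {
            using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
            {
                fs.Seek(lastPos, SeekOrigin.Begin);
                using (var reader = new StreamReader(fs, Encoding.ASCII))
                {
                    string line;
                    while ((line = reader.ReadLine()) != null)
                    {
                        ProcessLine(line);
                        lastPos += line.Length + 2; // ASCII + CRLF: 1 byte per char, 2 for the line break
                    }
                }
            }
            return lastPos;   // remember this for the next round
        }

        static void ProcessLine(string line)
        {
            // Placeholder for the anomaly check on one CSV record.
        }
    }

A real implementation would also have to guard against a partially written last line: ReadLine will return it without a terminating newline, so you should only advance the position for lines that were actually terminated.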

+1

I have not tested it, but I think you can use a FileSystemWatcher to detect when the other process has changed your file. In the Changed event handler you can then seek to the previously saved position and read the newly appended content.
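A rough sketch of that combination (also untested); the path, the filter, and the ReadNewLines placeholder are assumptions for illustration only:

    using System;
    using System.IO;

    class WatcherExample
    {
        static long lastPos = 0;

        static void Main()
        {
            // Watch the directory that contains the growing file (path and filter are made up).
            var watcher = new FileSystemWatcher(@"C:\data", "data.csv")
            {
                NotifyFilter = NotifyFilters.LastWrite | NotifyFilters.Size
            };

            watcher.Changed += (sender, e) =>
            {
                // Read everything appended since the last event, starting at lastPos,
                // and remember the new position (ReadNewLines is a placeholder).
                lastPos = ReadNewLines(e.FullPath, lastPos);
            };

            watcher.EnableRaisingEvents = true;
            Console.ReadLine();   // keep the process alive while watching
        }

        static long ReadNewLines(string path, long from)
        {
            // Placeholder: seek to 'from', read the complete lines, return the new offset.
            return from;
        }
    }

Keep in mind that FileSystemWatcher may raise several Changed events for a single write, so the handler should be safe to run repeatedly; here that is the case because it only ever reads forward from lastPos.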

+2

Why don't you just spin up a separate process / thread each time you start parsing? That way you move the concurrent (on-the-fly) part away from the data source and into your data processing, and all that remains is figuring out how to collect the results from all your threads...

This would mean re-reading the entire file for every thread you spin up, though...

You could also run a diff program on two versions and take it from there, depending on how well-formed the CSV data source is: does it modify records that were already written, or does it only append new entries? If it only appends, you can simply split the new material (last position to current EOF) off into a new file and process it at your leisure in a background thread (see the sketch after this list):

  • the polling thread remembers the last file size
  • when the file grows: seek from the last position to the end and save that new range to a temp file
  • a background thread processes the pending temp files in creation / modification order
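
A minimal sketch of that pipeline, assuming the writer only ever appends; the file name and the ProcessChunk placeholder are illustrative only:

    using System.Collections.Concurrent;
    using System.IO;
    using System.Threading;

    class SnapshotPoller
    {
        // Chunks are queued in creation order and consumed by a single background thread.
        static readonly BlockingCollection<string> chunks = new BlockingCollection<string>();

        static void Main()
        {
            const string source = "data.csv";   // hypothetical source file
            long lastSize = 0;

            // Background thread: process temp files strictly in the order they were queued.
            new Thread(() =>
            {
                foreach (string temp in chunks.GetConsumingEnumerable())
                {
                    ProcessChunk(temp);
                    File.Delete(temp);
                }
            }) { IsBackground = true }.Start();

            while (true)
            {
                long size = new FileInfo(source).Length;
                if (size > lastSize)
                {
                    // Copy only the newly appended range into a temp file.
                    string temp = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
                    using (var src = new FileStream(source, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
                    using (var dst = File.Create(temp))
                    {
                        src.Seek(lastSize, SeekOrigin.Begin);
                        src.CopyTo(dst);
                        lastSize = src.Position;   // everything up to here is now captured
                    }
                    chunks.Add(temp);
                }
                Thread.Sleep(2000);   // poll interval
            }
        }

        static void ProcessChunk(string tempFile)
        {
            // Placeholder: parse the complete lines in tempFile and run the anomaly checks.
        }
    }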
0

Source: https://habr.com/ru/post/914324/
