I need to parse a large CSV file in near real time while it is being appended to by another process. By large, I mean ~20 GB at the moment, and it is growing slowly. The application should detect and report certain anomalies in the data stream, and for that it only needs to keep a small amount of state (O(1) space).
My idea was to poll the file's attributes (size) every couple of seconds, open a read-only stream, seek to the position where I last stopped, and continue parsing from there. But since this is a text file (CSV), I need to take care never to continue past a partial line: I should only ever hand complete lines to the parser, since the writer may be mid-line when I read.
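To make the idea concrete, here is a minimal sketch of that polling/seek approach in Python (the function name `read_new_lines` is just an illustration, not from any library). It reads only the bytes appended since the last offset and deliberately stops before a trailing partial line, so the incomplete line is re-read on the next poll once the writer has finished it:

```python
import os

def read_new_lines(path, offset):
    """Return (lines, new_offset) for complete lines appended since `offset`.

    A trailing partial line (no newline yet) is not returned;
    new_offset stops before it, so it is picked up on the next call.
    """
    size = os.path.getsize(path)
    if size <= offset:
        return [], offset  # nothing new appended yet
    with open(path, "rb") as f:
        f.seek(offset)
        chunk = f.read(size - offset)
    # keep only the data up to and including the last newline
    last_nl = chunk.rfind(b"\n")
    if last_nl == -1:
        return [], offset  # only a partial line so far
    complete = chunk[: last_nl + 1]
    return complete.decode("utf-8").splitlines(), offset + last_nl + 1
```

A caller would invoke this in a loop (e.g. every few seconds), carrying the returned offset forward between calls. Reading in binary mode keeps the byte offsets exact regardless of text encoding quirks.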
If I am not mistaken, this should not be hard to implement, but I wanted to know whether there is an established approach or library that already solves some of these problems?
Note: I do not need a CSV parser. I am looking for a library that makes it easy to read complete lines from a file that is being appended to on the fly.