Let's say you have a huge (> 1 GB) CSV file of record ids:
655453 4930285 493029 4930301 493031 ...
For each id, you want to make a REST API call to fetch the record data, transform it locally, and insert it into a local database.
How do you do this with a Node.js Readable Stream?
My question is basically this: how do you read a very large file line by line, run an async function for each line, and (optionally) start reading the file from a specific line?
From the following Quora question, I'm starting to learn how to use fs.createReadStream:
http://www.quora.com/What-is-the-best-way-to-read-a-file-line-by-line-in-node-js
var fs = require('fs');
var lazy = require('lazy');

var stream = fs.createReadStream(path, { flags: 'r', encoding: 'utf-8' });

new lazy(stream).lines.forEach(function(line) {
  var id = line.toString();
  // ... make the API call for this id ...
});
But this pseudocode does not work, because the lazy module forces you to read the entire file (as a stream, but with no way to pause it). So this approach does not seem to work.
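One possible direction, as a minimal sketch rather than a definitive answer: Node's built-in readline module can wrap the read stream, and iterating it with for await lets each line's async work finish before the next line is handled. The sketch assumes one id per line; processLines and handler are hypothetical names, and handler stands in for the API call plus database insert.

```javascript
const fs = require('fs');
const readline = require('readline');

// Reads filePath line by line and awaits handler(id) before moving on,
// so only one id is being processed at any time.
// Assumption: the file holds one id per line.
async function processLines(filePath, handler) {
  const rl = readline.createInterface({
    input: fs.createReadStream(filePath, { encoding: 'utf-8' }),
    crlfDelay: Infinity // treat \r\n as a single line break
  });

  let count = 0;
  for await (const line of rl) {
    const id = line.trim();
    if (id === '') continue; // skip blank lines
    await handler(id);       // hypothetical: API call, transform, insert
    count++;
  }
  return count; // number of ids handled
}
```

Usage would look like processLines('ids.csv', async function (id) { ... }), where the file name is only an example.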
Another thing: I would like to be able to start processing this file from a specific line. The reason is that processing each id (making an API call, cleaning the data, etc.) can take up to half a second per record, so I don't want to start from the beginning of the file every time. The naive approach I'm thinking of is to record the line number of the last id processed and save it. Then, when you parse the file again, you step through the ids line by line until you reach the line number you stopped at, and then you do the makeAPICall business. Another naive approach is to write small files (say, 100 ids each) and process the files one at a time (a data set small enough to do everything in memory without an I/O stream). Is there a better way to do this?
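To make the first naive approach concrete, here is a rough sketch of the "save the last line number" idea. It assumes the checkpoint is kept in a small text file; loadCheckpoint, resumeProcessing, checkpointPath and handler are all hypothetical names, and handler stands in for the makeAPICall step.

```javascript
const fs = require('fs');
const readline = require('readline');

// Returns the last fully processed line number, or 0 if the checkpoint
// file is missing or does not contain a number.
function loadCheckpoint(checkpointPath) {
  try {
    return parseInt(fs.readFileSync(checkpointPath, 'utf-8'), 10) || 0;
  } catch (err) {
    return 0; // no usable checkpoint: start from the first line
  }
}

// Skips lines handled in an earlier run, then processes the rest one at a
// time, recording progress after each successful line.
async function resumeProcessing(filePath, checkpointPath, handler) {
  const startAfter = loadCheckpoint(checkpointPath);
  const rl = readline.createInterface({
    input: fs.createReadStream(filePath, { encoding: 'utf-8' }),
    crlfDelay: Infinity
  });

  let lineNumber = 0;
  for await (const line of rl) {
    lineNumber++;
    if (lineNumber <= startAfter) continue; // already done, no API call
    await handler(line, lineNumber);
    fs.writeFileSync(checkpointPath, String(lineNumber));
  }
  return lineNumber; // total lines seen in the file
}
```

This still reads the skipped lines from disk, it just avoids repeating the slow per-id work. If rescanning becomes a problem, one alternative would be to store a byte offset instead and pass it as the start option of fs.createReadStream.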
I see how this gets complicated (and where node-lazy comes in) because the chunk in stream.on('data', function(chunk) {}); can contain only part of a line (if the buffer size is small, each chunk might hold 10 lines, but since the id has variable length, it might hold only 9.5 lines or so). That is why I am wondering what the best approach to the above question is.
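For illustration, the partial-line problem can be handled by carrying the unfinished tail of one chunk over to the next. The following is only a sketch of that idea, not what node-lazy does internally; createLineSplitter, push and flush are hypothetical names. It assumes the chunks are strings (the stream was opened with an encoding).

```javascript
// Turns arbitrary string chunks into complete lines by keeping the
// unfinished tail of each chunk until the rest of the line arrives.
function createLineSplitter() {
  let remainder = '';
  return {
    // Returns the complete lines contained in remainder + chunk.
    push(chunk) {
      const parts = (remainder + chunk).split(/\r?\n/);
      remainder = parts.pop(); // last piece may be an incomplete line
      return parts;
    },
    // Call once at end of input to get a final line with no newline.
    flush() {
      const last = remainder;
      remainder = '';
      return last === '' ? [] : [last];
    }
  };
}

// Example with a chunk boundary in the middle of an id:
const splitter = createLineSplitter();
console.log(splitter.push('655453\n49302'));        // [ '655453' ]
console.log(splitter.push('85\n493029\n4930'));     // [ '4930285', '493029' ]
console.log(splitter.flush());                      // [ '4930' ]
```

In a stream.on('data', ...) handler you would call push(chunk) and process the returned lines, then call flush() in the 'end' handler.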