Using fread() to select rows and columns, like read.csv.sql()

I know fread is relatively new, but it really gives big performance improvements. I would like to know whether you can select rows and columns from the file you are reading, a bit like read.csv.sql does. I know that with fread's select option you can choose which columns to read, but what about reading only rows that meet certain criteria?

For example, could something like below be implemented with fread?

read.csv.sql(file, sql = "select V2, V4, V7, V8, V9, V10 from file where V5 == 'CE' and V10 >= 500", header = FALSE, sep = "|", eol = "\n")

If this is not yet possible, is it advisable to read all of the data and then use subset, etc., to get the final result? Or would that defeat the purpose of using fread?

For reference, I have to read about 800 files, each of which contains about 100,000 rows and 10 columns. Any input is welcome.

Thanks.

1 answer

So far it is not possible to select rows with fread() the way you can with read.csv.sql(). But it is better to read all the data (memory permitting) and then subset it according to your criteria. For a 200 MB file, fread() + subset() was about 4 times faster than read.csv.sql().

So, using @Arun's suggestion,

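library(data.table)
# files: character vector of paths to the input files; fn records each row's source file
ans = rbindlist(lapply(files, function(x) fread(x)[, fn := x]))
subset(ans, V5 == "CE" & V10 >= 500)  # e.g. the criteria from the question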

is better than the approach in the original question.
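
If memory is tight across the ~800 files, one variation is to filter each file right after reading it, so only the matching rows are kept before binding. The sketch below is an illustration, not something benchmarked in this thread: read_filtered is a hypothetical helper name, the column positions and criteria are taken from the query in the question, and the two temporary files exist only to make the example self-contained. It assumes pipe-delimited files without a header, as in the question.

```r
library(data.table)

# Hypothetical helper: read one pipe-delimited, headerless file,
# load only the columns the query needs, then filter rows in memory.
read_filtered <- function(path) {
  # select keeps the default names (V2, V4, ...) of the chosen columns
  dt <- fread(path, sep = "|", header = FALSE,
              select = c(2, 4, 5, 7, 8, 9, 10))
  # Same criteria as the SQL in the question; V5 is only needed for filtering
  out <- dt[V5 == "CE" & V10 >= 500, .(V2, V4, V7, V8, V9, V10)]
  out[, fn := path]  # remember which file each row came from
  out
}

# Demo data: two small temporary files standing in for the real ones
f1 <- tempfile(fileext = ".txt")
f2 <- tempfile(fileext = ".txt")
writeLines(c("1|a|x|b|CE|y|c|d|e|600",
             "2|a|x|b|XX|y|c|d|e|700",
             "3|a|x|b|CE|y|c|d|e|100"), f1)
writeLines(c("4|p|x|q|CE|y|r|s|t|500",
             "5|p|x|q|CE|y|r|s|t|499"), f2)

files <- c(f1, f2)
result <- rbindlist(lapply(files, read_filtered))
```

With the real data, files would instead be the vector of actual paths, for example from list.files(). Whether this beats reading everything and subsetting once depends on how selective the criteria are, so it is worth timing both on a few files.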


Source: https://habr.com/ru/post/1539481/

