Working with large CSV (20 GB) files in Ruby

I am working on a small problem and would like some tips on how to solve it: given a CSV file with an unknown number of columns and rows, display a list of columns with their values and the number of repetitions of each value, without using any library.

If the file is small, this is not a problem, but when it is a few gigabytes, I get NoMemoryError: memory allocation failed. Is there a way to build a hash while reading from disk, rather than loading the whole file into memory? You can do this in Perl with tied hashes.

EDIT: Will IO#foreach load the file into memory? What about File.open(filename).each?

2 answers

Read the file one line at a time, discarding each line after you have processed it:

open("big.csv") do |csv| csv.each_line do |line| values = line.split(",") # process the values end end 

Using this method, you should never run out of memory.
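
For the full task in the question (listing, per column, each value and its number of repetitions), here is a minimal sketch along the same streaming lines. It assumes plain comma-separated fields with no quoting or escaping, since the question forbids libraries, and big.csv is a placeholder name. To the question's edit: IO.foreach likewise yields one line at a time rather than loading the whole file.

    # Tally value occurrences per column while streaming the file.
    # Assumes simple comma-separated fields (no quoted commas).
    counts = []  # one Hash per column: value => repetition count

    IO.foreach("big.csv") do |line|
      line.chomp.split(",", -1).each_with_index do |value, i|
        (counts[i] ||= Hash.new(0))[value] += 1
      end
    end

    counts.each_with_index do |tally, i|
      puts "Column #{i}:"
      tally.each { |value, n| puts "  #{value} => #{n}" }
    end

Only the file is streamed; the per-column tallies still live in memory. If the number of distinct values is itself too large for RAM, the closest Ruby analogue to Perl's tied hashes would be a disk-backed store such as the standard library's DBM/GDBM bindings.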


Do you read the entire file at once? Reading it line by line, e.g. using ruby -pe, ruby -ne, or $stdin.each, should reduce memory usage, since the garbage collector can reclaim lines that have already been processed.

    data = {}
    $stdin.each do |line|
      # Process line, store results in the data hash.
    end

Save it as script.rb and pipe the huge CSV file into the script's standard input:

 ruby script.rb < data.csv 
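
As an aside not in the original answer: Ruby's built-in ARGF reads from any files named on the command line and falls back to standard input, so one script can support both invocation styles:

    ARGF.each do |line|  # reads data.csv if passed as an argument, else $stdin
      # Process line, store results in the data hash.
    end

With that, both ruby script.rb data.csv and ruby script.rb < data.csv work.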

If you do not like reading from standard input, only a small change is needed:

    data = {}
    File.open("data.csv").each do |line|
      # Process line, store results in the data hash.
    end
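
One caveat worth adding: File.open without a block leaves the file handle open. File.foreach iterates the lines lazily and closes the file for you, so an equivalent, slightly tidier form is:

    data = {}
    File.foreach("data.csv") do |line|
      # Process line, store results in the data hash.
    end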

Source: https://habr.com/ru/post/978923/

