What is the best way to parse a tab delimited file in Ruby?

What is the best (most efficient) way to parse a tab-delimited file in Ruby?

+63
ruby tsv
Dec 10 '10 at 1:14
4 answers

The Ruby CSV library lets you specify the field separator. In Ruby 1.9 and later, the standard CSV library is FasterCSV. Something like this will work:

    require "csv"

    # Options are passed as keyword arguments; col_sep sets the delimiter to a tab.
    parsed_file = CSV.read("path-to-file.csv", col_sep: "\t")
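Since the question asks about efficiency, it may help to note that the same library can also stream a file row by row instead of loading it all at once. The following is a minimal sketch, assuming the file has a header row; the sample data and the `rows` variable are made up for illustration.

```ruby
require "csv"
require "tempfile"

# Hypothetical sample data, written to a temp file so the sketch is self-contained.
sample = Tempfile.new(["sample", ".tsv"])
sample.write("name\tcity\nAlice\tParis\nBob\tOslo\n")
sample.close

rows = []
# CSV.foreach yields one row at a time instead of reading the whole file.
# headers: true turns each row into a CSV::Row addressable by column name.
CSV.foreach(sample.path, col_sep: "\t", headers: true) do |row|
  rows << row.to_h
end

rows.each { |r| puts "#{r['name']} lives in #{r['city']}" }
sample.unlink
```

`CSV.read` is simpler when the file comfortably fits in memory; `CSV.foreach` keeps memory use roughly constant for large files.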
+106
Dec 10 '10 at 1:22

The rules for TSV are actually a bit different from CSV. The main difference is that CSV has provisions for putting a comma inside a field by wrapping the field in quote characters, and for escaping quotes inside a quoted field. I wrote a quick example to show how the simple answer fails:

    require 'csv'

    # Double-quoted so that \t is a real tab; the last field contains literal quotes.
    line = "boogie\ttime\tis \"now\""

    begin
      CSV.parse_line(line, col_sep: "\t")
      puts "parsed correctly"
    rescue CSV::MalformedCSVError
      puts "failed to parse line"
    end

    begin
      CSV.parse_line(line, col_sep: "\t", quote_char: "Ƃ")
      puts "parsed correctly with random quote char"
    rescue CSV::MalformedCSVError
      puts "failed to parse line with random quote char"
    end

    # Output:
    # failed to parse line
    # parsed correctly with random quote char

If you want to use the CSV library, you can pick a random quote character that you don't expect to see in your file (the example shows this), but you could also use a simpler approach like the StrictTsv class shown below to get the same effect without having to worry about field quoting.

    # The main parse method is mostly borrowed from a tweet by @JEG2
    class StrictTsv
      attr_reader :filepath

      def initialize(filepath)
        @filepath = filepath
      end

      # Yields each data row as a Hash keyed by the header names.
      def parse
        File.open(filepath) do |f|
          headers = f.gets.strip.split("\t")
          f.each do |line|
            fields = Hash[headers.zip(line.split("\t"))]
            yield fields
          end
        end
      end
    end

    # Example Usage
    tsv = StrictTsv.new("your_file.tsv")
    tsv.parse do |row|
      puts row['named field']
    end

Whether to use the CSV library or something stricter depends on who is sending you the file and whether they intend to adhere to the strict TSV standard.

Details on the TSV standard can be found at http://en.wikipedia.org/wiki/Tab-separated_values

+30
Apr 25 '13 at 15:57

I like mmmris's answer. However, I dislike that Ruby's split drops all empty values from the end of the result. That version also does not strip the newline at the end of each line.

In addition, I had a file with potential newlines inside a field. So I rewrote the parse method as follows:

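    def parse
      File.open(filepath) do |f|
        headers = f.gets.strip.split("\t")
        f.each do |line|
          myline = line
          # Keep appending physical lines until the record has enough tabs.
          while myline.count("\t") < headers.count - 1
            more = f.gets
            break if more.nil? # end of file: stop instead of raising on nil
            myline += more
          end
          # chomp drops the newline; the limit keeps trailing empty fields.
          fields = Hash[headers.zip(myline.chomp.split("\t", headers.count))]
          yield fields
        end
      end
    end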

This concatenates as many lines as needed to get a complete row of data, and always returns the full set of fields (without potential nil entries at the end).
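The two `String#split` quirks this answer works around can be seen in a small sketch. The sample line is made up for illustration: two values followed by two empty trailing fields.

```ruby
# Hypothetical row: "a", "b", then two empty fields, then a newline.
raw = "a\tb\t\t\n"

with_newline = raw.split("\t")           # the newline stays in the last field
dropped      = raw.chomp.split("\t")     # trailing empty fields are discarded
kept_all     = raw.chomp.split("\t", -1) # negative limit keeps trailing empties
kept_limit   = raw.chomp.split("\t", 4)  # a positive limit also keeps them

p with_newline # => ["a", "b", "", "\n"]
p dropped      # => ["a", "b"]
p kept_all     # => ["a", "b", "", ""]
p kept_limit   # => ["a", "b", "", ""]
```

Passing the header count as the limit, as the rewritten method does, has the side effect that any extra tabs end up inside the last field rather than producing extra fields.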

0
Jan 31 '18 at 17:15

There are actually two different types of TSV files.

  1. TSV files that are actually CSV files with the delimiter set to Tab. This is what you get when you, for example, save an Excel spreadsheet as "UTF-16 Unicode Text". Such files use CSV quoting rules, which means that fields may contain tabs and newlines as long as they are quoted, and literal double quotes are written twice. The easiest way to parse everything correctly is to use the csv gem:

     require 'csv'
     parsed = CSV.read("file.tsv", col_sep: "\t")
  2. IANA-compliant TSV files. Tabs and newlines are not allowed as field values, and there is no quoting at all. This is what you get when you, for example, select a whole Excel spreadsheet and paste it into a text file (be careful: this goes wrong if some cells contain tabs or newlines). Such TSV files can easily be parsed line by line with a simple line.split("\t", -1) (note the -1, which prevents split from removing empty trailing fields). If you want to use the csv gem, simply set quote_char to nil:

     require 'csv'
     parsed = CSV.read("file.tsv", col_sep: "\t", quote_char: nil)
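For the second kind of file, the line-by-line approach with `split("\t", -1)` could look roughly like the sketch below. It assumes the first line is a header row; `parse_strict_tsv` and the sample data are hypothetical names introduced here for illustration.

```ruby
# Hypothetical helper: parses IANA-style TSV text (no quoting, no tabs or
# newlines inside fields) into an array of hashes keyed by the header row.
def parse_strict_tsv(text)
  lines = text.each_line.map(&:chomp)   # chomp removes "\n" and "\r\n"
  headers = lines.shift.split("\t", -1)
  # -1 keeps empty trailing fields so every row has all columns.
  lines.map { |line| headers.zip(line.split("\t", -1)).to_h }
end

# Made-up sample: quotes are ordinary characters, and some fields are empty.
data = "id\tquote\tnote\n1\tsay \"hi\"\t\n2\t\t\n"

records = parse_strict_tsv(data)
records.each { |r| p r }
```

Because no quoting rules apply, the double quotes in the sample pass through unchanged and empty cells come back as empty strings.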
0
Jul 18 '19 at 8:23