My program uses two large text files (millions of lines each). These files are parsed and loaded into hashes so that the data can be looked up quickly. The problem I am facing is that parsing and loading are currently the slowest part of the program. Below is the code where this is done.
# Swap the "fasta" suffix for "yml" to get the index file names.
database = extractDatabase(@type).chomp("fasta") + "yml"
revDatabase = extractDatabase(@type + "-r").chomp("fasta.reverse") + "yml"
@proteins = {}
@decoyProteins = {}
# File.foreach closes the file when done; chomp drops the trailing newline.
File.foreach(database) do |line|
  key, ids = line.chomp.split(": ", 2)
  @proteins[key] = ids
end
File.foreach(revDatabase) do |line|
  key, ids = line.chomp.split(": ", 2)
  @decoyProteins[key] = ids
end
The files look like the example below. The format started out as YAML, but I changed it to speed up parsing.
MTMDK: P31946 Q14624 Q14624-2 B5BU24 B7ZKJ8 B7Z545 Q4VY19 B2RMS9 B7Z544 Q4VY20
MTMDKSELVQK: P31946 B5BU24 Q4VY19 Q4VY20
....
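To make the format concrete, here is a minimal sketch of how the two sample lines above end up in a hash. The helper name `load_index` and the use of `StringIO` in place of the real files are assumptions for illustration only, not part of my program.

```ruby
require "stringio"

# Hypothetical helper (not in the original program): reads
# "PEPTIDE: ID ID ..." lines from any IO-like object into a Hash.
def load_index(io)
  index = {}
  io.each_line do |line|
    # Split once on the first ": " so the ID list stays a single string.
    key, ids = line.chomp.split(": ", 2)
    index[key] = ids
  end
  index
end

# Stand-in for the real file, using the two sample lines shown above.
sample = StringIO.new(<<~DATA)
  MTMDK: P31946 Q14624 Q14624-2 B5BU24 B7ZKJ8 B7Z545 Q4VY19 B2RMS9 B7Z544 Q4VY20
  MTMDKSELVQK: P31946 B5BU24 Q4VY19 Q4VY20
DATA

proteins = load_index(sample)
puts proteins["MTMDKSELVQK"]              # the raw, space-separated ID string
puts proteins["MTMDK"].split(" ").length  # number of IDs for this peptide
```

Each value is kept as one string, and the individual IDs are only split out when a key is actually looked up.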
I have tried several ways of laying out the files and parsing them. This is the fastest so far, but it is still terribly slow.
Is there a way to improve the speed of this, or is there a whole other approach I can take?
List of things that do not work: