Ruby: Comparing two arrays of hashes

I am new to Ruby (using 1.9.1), so any help is appreciated. Everything I have learned about Ruby has come from searching Google. I am trying to compare two arrays of hashes, and because of their size it takes a long time and comes close to running out of memory.

I have a class (ParseCSV) with several methods (initialize, open, compare, strip, output). The way I have it working right now is shown below (it passes the tests I wrote, which use a much smaller data set):


file1 = ParseCSV.new("some_file")
file2 = ParseCSV.new("some_other_file")

file1.open  # reads the file contents into an array of hashes through the CSV library
file1.strip # removes extra key/value pairs from each hash; there are normally fifty per row, and dropping them reduces memory consumption

file2.open
file2.compare(file1.storage) # @storage is the array of hashes built by the open method

file2.output

What I am struggling with is the compare method. On smaller data sets it is no big deal, because it works fast enough. In this case, however, I am comparing about 400,000 records (all read into an array of hashes) against a file that contains about 450,000 records. I am trying to speed this up. Also, I cannot run the strip method on file2. Here is how I do it now:


def compare(x)
    #obviously just a verbose message
    puts "Comparing and leaving behind non matching entries"

    x.each do |row|
        #@storage is the array of hashes
        @storage.each_index do |y|
            if row[@opts[:field]] == @storage[y][@opts[:field]]
                @storage.delete_at(y)
            end
        end
    end
end

I hope this makes sense. I know this is going to be slow simply because it has to iterate over 400,000 rows 440,000 times each. Do you have any other ideas on how to speed it up and possibly reduce memory consumption?

+3
2 answers

Yikes, that will be O(n²) runtime. Nasty.

A better bet would be to use the built-in Set class.

The code would look something like this:

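require 'set'

# load_file_content_into_array_here is a placeholder for your own loading code
file1_content = load_file_content_into_array_here("some_file")
file2_content = load_file_content_into_array_here("some_other_file")

# Set.new builds a set from the array's elements; Set[file1_content] would
# create a set whose only element is the whole array
file1_set = Set.new(file1_content)

# Elements of file1 that do not appear in file2
unique_elements = file1_set - file2_content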

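A Set works like a mathematical set: it holds each element at most once and finds elements by hash lookup rather than by scanning, so the difference above does not need a nested loop. Two elements count as the same when they are eql? and have the same hash, which for row hashes means every key and value must match, not just one field.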
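
Since the question compares rows on a single field rather than on whole rows, one way to apply the same idea is to put only that field's values into a Set and filter against it. This is a minimal sketch, not the asker's ParseCSV code: the method name remove_matching and the sample rows are made up for illustration, and it assumes both data sets are plain arrays of hashes.

```ruby
require 'set'

# Hypothetical helper: returns the rows of `storage` whose `field` value
# does not appear in any row of `other`.
def remove_matching(storage, other, field)
  # One pass over `other` to collect the values to match against.
  other_keys = Set.new(other.map { |row| row[field] })
  # One pass over `storage`; Set#include? is a hash lookup, not a scan.
  storage.reject { |row| other_keys.include?(row[field]) }
end

# Made-up sample data standing in for the two parsed files.
file1_rows = [{ :id => 1, :name => "a" }, { :id => 2, :name => "b" }]
file2_rows = [{ :id => 2, :name => "x" }, { :id => 3, :name => "y" }]

leftover = remove_matching(file2_rows, file1_rows, :id)
p leftover
```

With this sample data only the row with id 3 is left. If the first file is needed only for this comparison, keeping just its key values instead of its full rows should also reduce memory use.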

+7

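Here is a script that compares two ways of doing this: your original compare() and a new_compare(). new_compare uses more of the built-in Enumerable methods. Since those are implemented in C, they will be faster.

The constant Test::SIZE sets the size of the data set. Results for several sizes are at the bottom.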

require 'benchmark'

class Test
  SIZE = 20000
  attr_accessor :storage
  def initialize
    file1 = []
    SIZE.times { |x| file1 << { :field => x, :foo => x } }
    @storage = file1
    @opts = {}
    @opts[:field] = :field
  end

  def compare(x)
    x.each do |row|
      @storage.each_index do |y|
        if row[@opts[:field]] == @storage[y][@opts[:field]]
          @storage.delete_at(y)
        end
      end
    end
  end

  def new_compare(other)
    other_keys = other.map { |x| x[@opts[:field]] }
    @storage.reject! { |s| other_keys.include? s[@opts[:field]] }
  end

end

storage2 = []
# We'll make 10 of them match
10.times { |x| storage2 << { :field => x, :foo => x } }
# And the rest won't
(Test::SIZE-10).times { |x| storage2 << { :field => x+100000000, :foo => x} }

Benchmark.bm do |b|
  b.report("original compare") do
    t1 = Test.new
    t1.compare(storage2)
  end
end

Benchmark.bm do |b|
  b.report("new compare") do
    t1 = Test.new
    t1.new_compare(storage2)
  end
end

Results:

Test::SIZE = 500
      user     system      total        real
original compare  0.280000   0.000000   0.280000 (  0.285366)
      user     system      total        real
new compare  0.020000   0.000000   0.020000 (  0.020458)

Test::SIZE = 1000
     user     system      total        real
original compare 28.140000   0.110000  28.250000 ( 28.618907)
      user     system      total        real
new compare  1.930000   0.010000   1.940000 (  1.956868)

Test::SIZE = 5000
ruby test.rb
      user     system      total        real
original compare113.100000   0.440000 113.540000 (115.041267)
      user     system      total        real
new compare  7.680000   0.020000   7.700000 (  7.739120)

Test::SIZE = 10000
      user     system      total        real
original compare453.320000   1.760000 455.080000 (460.549246)
      user     system      total        real
new compare 30.840000   0.110000  30.950000 ( 31.226218)
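
One thing to keep in mind when comparing the two methods: they are not guaranteed to return the same result. compare calls delete_at while each_index is still walking the array, so the element that slides into the freed slot is skipped. With unique field values, as in the benchmark data, that does not matter, but with duplicate values a matching row can survive. A small sketch with made-up data, using standalone methods (hypothetical names) instead of the Test class for brevity:

```ruby
# Same loop shape as compare: delete while iterating by index.
def delete_while_iterating(storage, other, field)
  other.each do |row|
    storage.each_index do |i|
      storage.delete_at(i) if row[field] == storage[i][field]
    end
  end
  storage
end

# Same shape as new_compare: filter in a single reject! pass.
def reject_matching(storage, other, field)
  other_keys = other.map { |row| row[field] }
  storage.reject! { |row| other_keys.include?(row[field]) }
  storage # reject! returns nil when nothing was removed, so return the array
end

other = [{ :field => 1 }]
rows  = [{ :field => 1 }, { :field => 1 }, { :field => 2 }]

a = delete_while_iterating(rows.dup, other, :field)
b = reject_matching(rows.dup, other, :field)
puts a.length  # 2: the second matching row was skipped
puts b.length  # 1: both matching rows were removed
```
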
+1

Source: https://habr.com/ru/post/1720224/

