Is there an efficient way to do hundreds of text replacements in Ruby?

I am trying to use a list of hundreds of common spelling errors to clear some data before finding duplicates.

This is a critical process, so I hope there is a faster way than having hundreds of regular expressions (or one hundred branches).

Is there an efficient way to do hundreds of text replacements in Ruby?

+4
source share
3 answers

An alternative approach, if your input is separated by words, will simply build a hash table {error => correction} .

Finding a Hash table is quick, so if you can bend your input in this format, it will almost certainly be fast enough.

+3
source

I'm glad to say that I just found " RegexpTrie ", which is a useful replacement for the code and is needed for Perl Regexp :: Collect.

Install it and try:

 require 'regexp_trie' foo = %w(miss misses missouri mississippi) RegexpTrie.union(foo) # => /miss(?:(?:es|ouri|issippi))?/ RegexpTrie.union(foo, option: Regexp::IGNORECASE) # => /miss(?:(?:es|ouri|issippi))?/i 

Here is a comparison of the outputs. The first, commented out outputs in the array are from Regexp :: Assemble, and the final output is from RegexpTrie:

 require 'regexp_trie' [ 'how now brown cow', # /(?:[chn]ow|brown)/ 'the rain in spain stays mainly on the plain', # /(?:(?:(?:(?:pl|r)a)?i|o)n|s(?:pain|tays)|mainly|the)/ 'jackdaws love my giant sphinx of quartz', # /(?:jackdaws|quartz|sphinx|giant|love|my|of)/ 'fu foo bar foobar', # /(?:f(?:oo(?:bar)?|u)|bar)/ 'ms miss misses missouri mississippi' # /m(?:iss(?:(?:issipp|our)i|es)?|s)/ ].each do |s| puts "%-43s # /%s/" % [s, RegexpTrie.union(s.split).source] end # >> how now brown cow # /(?:how|now|brown|cow)/ # >> the rain in spain stays mainly on the plain # /(?:the|rain|in|s(?:pain|tays)|mainly|on|plain)/ # >> jackdaws love my giant sphinx of quartz # /(?:jackdaws|love|my|giant|sphinx|of|quartz)/ # >> fu foo bar foobar # /(?:f(?:oo(?:bar)?|u)|bar)/ # >> ms miss misses missouri mississippi # /m(?:iss(?:(?:es|ouri|issippi))?|s)/ 

Regarding how to use the Wikipedia link and misspelled words:

 require 'nokogiri' require 'open-uri' require 'regexp_trie' URL = 'https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines' doc = Nokogiri::HTML(open(URL)) corrections = doc.at('div#mw-content-text pre').text.lines[1..-1].map { |s| a, b = s.chomp.split('->', 2) [a, b.split(/,\s+/) ] }.to_h # {"abandonned"=>["abandoned"], # "aberation"=>["aberration"], # "abilityes"=>["abilities"], # "abilties"=>["abilities"], # "abilty"=>["ability"], # "abondon"=>["abandon"], # "abbout"=>["about"], # "abotu"=>["about"], # "abouta"=>["about a"], # ... # } misspelled_words_regex = /\b(?:#{RegexpTrie.union(corrections.keys, option: Regexp::IGNORECASE).source})\b/i # => /\b(?:(?:a(?:b(?:andonned|eration|il(?:ityes|t(?:ies|y))|o(?:ndon(?:(?:ed|ing|s))?|tu|ut(?:it|the|a)... 

At this point, you can use gsub(misspelled_words_regex, corrections) , however the values ​​in the corrections contain some arrays because several words or phrases could be used to replace a misspelled word. You will need to do something to determine which option to use.


Ruby lacks a very useful module found in Perl called Regexp :: Assemble . Python has a hachoir-regex , which seems to do the same.

Regexp :: Assemble creates a very effective regular expression based on word lists and simple expressions. Is it really wonderful ... or ... diabolically?

See an example for a module; It is extremely easy to use in basic form:

 use Regexp::Assemble; my $ra = Regexp::Assemble->new; $ra->add( 'ab+c' ); $ra->add( 'ab+-' ); $ra->add( 'a\w\d+' ); $ra->add( 'a\d+' ); print $ra->re; # prints a(?:\w?\d+|b+[-c]) 

Notice how it combines patterns. It will do the same with ordinary words, only it will be even more efficient, because common lines will be merged:

 use Regexp::Assemble; my $lorem = 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.'; my $ra = Regexp::Assemble->new('flags' => 'i'); $lorem =~ s/[^a-zA-Z ]+//g; $ra->add(split(' ', lc($lorem))); print $ra->anchor_word(1)->as_string, "\n"; 

What outputs:

 \b(?:a(?:dipisicing|liqua|met)|(?:consectetu|tempo)r|do(?:lor(?:emagna)?)?|e(?:(?:li)?t|iusmod)|i(?:ncididunt|psum)|l(?:abore|orem)|s(?:ed|it)|ut)\b 

This code ignores word case and respects word boundaries.

I would recommend writing a small Perl application that can take a list of words and use this module to output a string version of a regular expression pattern. You should be able to import this template into Ruby. This will allow you to quickly find misspelled words. You could even output the template to a YAML file and then load that file into your Ruby code. Periodically parse error word pages, run output through Perl code, and your Ruby code will have an update pattern.

You can use this template for a piece of text to see if there are any misspelled words. If so, then you break the text into sentences or words and check the regular expression again. Do not check words immediately, because most words will be spelled correctly. It is almost like a binary search against your text - check everything if there is a hit, and then go to smaller blocks to narrow the search until you find individual spelling errors. How you break the pieces depends on the amount of incoming text. A regex pattern can test the entire text block and return nil or index in addition to individual words in the same way, so you get more speed by making large chunks of text.

Then, if you know that you have a misspelled word, you can do a hash search for the correct spelling. It would be a big hash, but the task of eliminating good and bad spelling is what takes the longest. Search will be very fast.


Here is a sample code:

get_words.rb

 #!/usr/bin/env ruby require 'open-uri' require 'nokogiri' require 'yaml' words = {} ['0-9', *('A'..'Z').to_a].each do |l| begin print "Reading #{l}... " html = open("http://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/#{l}").read puts 'ok' rescue Exception => e puts "got \"#{e}\"" next end doc = Nokogiri::HTML(html) doc.search('div#bodyContent > ul > li').each do |n| n.content =~ /^(\w+) \s+ \(([^)]+)/x words[$1] = $2 end end File.open('wordlist.yaml', 'w') do |wordfile| wordfile.puts words.to_yaml end 

regex_assemble.pl

 #!/usr/bin/env perl use Regexp::Assemble; use YAML; use warnings; use strict; my $ra = Regexp::Assemble->new('flags' => 'i'); my %words = %{YAML::LoadFile('wordlist.yaml')}; $ra->add(map{ lc($_) } keys(%words)); print $ra->chomp(1)->anchor_word(1)->as_string, "\n"; 

Run the first one, then run the second pipe with its output to a file to capture the emitted regular expression.


And more examples of words and generated output:

 'how now brown cow' => /\b(?:[chn]ow|brown)\b/ 'the rain in spain stays mainly on the plain' => /\b(?:(?:(?:(?:pl|r)a)?i|o)n|s(?:pain|tays)|mainly|the)\b/ 'jackdaws love my giant sphinx of quartz' => /\b(?:jackdaws|quartz|sphinx|giant|love|my|of)\b/ 'fu foo bar foobar' => /\b(?:f(?:oo(?:bar)?|u)|bar)\b/ 'ms miss misses missouri mississippi' => /\bm(?:iss(?:(?:issipp|our)i|es)?|s)\b/ 

Ruby Regexp.union nowhere close to the complexity of Regexp::Assemble . After capturing a list of words with errors, 4225 words, consisting of 41,817 characters, were recorded. After running Perl Regexp :: Assemble, 30,954 regular expression characters were created with this list. I would say that effective.

+3
source

Try the opposite. Instead of correcting spelling errors and checking for duplicate results, drop everything into an audio format (for example, Metafon or Sound) and check for duplicates in this format.

Now I don’t know which path is likely to be faster - on the one hand, you have hundreds of regular expressions, each of which will not be able to match and return almost instantly. On the other hand, you have more than 30 potential substitutes for regular expressions, one or two of which will definitely correspond to each word.

Now, the metaphone is pretty fast - in fact it is not so much for the algorithm, so I can only suggest that you try and measure whether it is enough for your use.

+1
source

Source: https://habr.com/ru/post/1344809/