Ruby: remove any substring that matches an element of an array

I have a string, say str = "this is the string ", and an array of strings:

    array = ["this is", "second element", "third element"]

I want to process the string so that any substring matching an element of the array is deleted and the rest of the string is returned. So I want the following output:

    output: "the string "

How can I do this in Ruby?

+3
3 answers

You don't say whether you want true substring matching or substring matching at word boundaries. There is a difference. Here's how to do it while honoring word boundaries:

    str = "this is the string "
    array = ["this is", "second element", "third element"]

    pattern = /\b(?:#{ Regexp.union(array).source })\b/ # => /\b(?:this\ is|second\ element|third\ element)\b/

    str[pattern] # => "this is"

    str.gsub(pattern, '').squeeze(' ').strip # => "the string"

Here's what is happening with union and union.source:

    Regexp.union(array) # => /this\ is|second\ element|third\ element/
    Regexp.union(array).source # => "this\\ is|second\\ element|third\\ element"

source returns the joined array in a form that Regexp can consume more easily when building a pattern, without punching holes in that pattern. Consider these differences and what they could do in a pattern match:

    /#{ Regexp.union(%w[a . b]) }/ # => /(?-mix:a|\.|b)/
    /#{ Regexp.union(%w[a . b]).source }/ # => /a|\.|b/

The first creates a separate pattern, with its own flags for case, multi-line and whitespace handling, which gets embedded inside the outer pattern. That can be a bug that is very hard to track down and fix, so only do it when you actually intend to have a sub-pattern.
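To make the flag problem concrete, here is a small sketch. The two-letter array and the test string are made up for illustration and are not from the question. Interpolating the Regexp object itself embeds `(?-mix:...)`, which switches case-insensitivity off again inside the group, so the outer `i` flag has no effect on it:

```ruby
letters = %w[a b] # hypothetical sample data

embedded  = /#{ Regexp.union(letters) }/i        # /(?-mix:a|b)/i
flattened = /#{ Regexp.union(letters).source }/i # /a|b/i

embedded_match  = embedded.match?("A")  # false: the inner -i overrides the outer i
flattened_match = flattened.match?("A") # true: the outer i applies to the whole pattern

puts embedded_match
puts flattened_match
```

Nothing raises here and both patterns look plausible when printed, which is why this kind of mistake tends to go unnoticed.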

Also note what happens if you try to use:

    /#{ %w[a . b].join('|') }/ # => /a|.|b/

The resulting pattern has a wildcard . embedded in it, which blows your pattern apart and makes it match anything. Don't go there.
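A quick way to see the damage, as a sketch with a made-up input string: with the unescaped join every character is matched by the wildcard, while Regexp.union treats the dot as a literal.

```ruby
parts = %w[a . b]

joined  = /#{ parts.join('|') }/ # /a|.|b/   the dot is a wildcard
escaped = Regexp.union(parts)    # /a|\.|b/  the dot is a literal

joined_result  = "xyz".gsub(joined, '')  # => ""     every character matched the wildcard
escaped_result = "xyz".gsub(escaped, '') # => "xyz"  nothing in the string matches

puts joined_result.inspect
puts escaped_result.inspect
```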

If we don't tell the regular expression engine to honor word boundaries, unexpected, unwanted and horrible things can happen:

    str = "this isn't the string "
    array = ["this is", "second element", "third element"]

    pattern = /(?:#{ Regexp.union(array).source })/ # => /(?:this\ is|second\ element|third\ element)/

    str[pattern] # => "this is"

    str.gsub(pattern, '').squeeze(' ').strip # => "n't the string"

It is important to think in terms of words when working with substrings that contain complete words. The engine doesn't know the difference, so you have to tell it what to do. This is a situation that people who haven't had to do text processing miss far too often.
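Putting the pieces together, the word-boundary approach could be wrapped up roughly like this. The method name `remove_phrases` is hypothetical, and the sketch assumes every phrase starts and ends with a word character, since `\b` only behaves as expected in that case:

```ruby
# Hypothetical helper, not part of the original answer.
# Assumes each phrase begins and ends with a word character.
def remove_phrases(str, phrases)
  pattern = /\b(?:#{ Regexp.union(phrases).source })\b/
  # Drop the matches, collapse the doubled spaces left behind, trim the ends.
  str.gsub(pattern, '').squeeze(' ').strip
end

array = ["this is", "second element", "third element"]

puts remove_phrases("this is the string ", array)    # => "the string"
puts remove_phrases("this isn't the string ", array) # => "this isn't the string"
```

In the second call nothing is removed, because "this is" is followed by "n" and there is no word boundary between the two.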

+5

Here is one way:

    array = ["this is", "second element", "third element"]
    str = "this is the string "
    str.gsub(Regexp.union(array), '') # => " the string "

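To make the match case-insensitive: str.gsub(/#{array.join('|')}/i,''). Note that join interpolates the elements unescaped, so this is only safe when none of them contains regex metacharacters.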
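If the array might contain characters such as `.` or `*`, one possible variant keeps the escaping that Regexp.union provides and adds the flag afterwards. This is a sketch, with an upper-case input string made up for the demonstration:

```ruby
array = ["this is", "second element", "third element"]
str = "THIS IS the string " # hypothetical input with different casing

# Regexp.union escapes each element; .source yields the escaped pattern text,
# and Regexp.new rebuilds it with the case-insensitive flag.
pattern = Regexp.new(Regexp.union(array).source, Regexp::IGNORECASE)

result = str.gsub(pattern, '') # => " the string "
puts result.inspect
```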

+5

I saw two kinds of solutions, and at first I preferred Brad's. But the two approaches are different enough that I expected a difference in performance, so I created the file below and ran it.

    require 'benchmark/ips'

    str = 'this is the string '
    array = ['this is', 'second element', 'third element']

    def by_loop(str, array)
      array.inject(str) { |result, substring| result.gsub substring, '' }
    end

    def by_regex(str, array)
      str.gsub(Regexp.union(array), '')
    end

    def by_loop_large(str, array)
      array = array * 100
      by_loop(str, array)
    end

    def by_regex_large(str, array)
      array = array * 100
      by_regex(str, array)
    end

    Benchmark.ips do |x|
      x.report("loop") { by_loop(str, array) }
      x.report("regex") { by_regex(str, array) }
      x.report("loop large") { by_loop_large(str, array) }
      x.report("regex large") { by_regex_large(str, array) }
    end

Result:

    -------------------------------------------------
                    loop    16719.0 (±10.4%) i/s -      83888 in   5.073791s
                   regex    18701.5 (±4.2%) i/s -      94554 in   5.063600s
              loop large      182.6 (±0.5%) i/s -        918 in   5.027865s
             regex large      330.9 (±0.6%) i/s -       1680 in   5.076771s

Conclusion:

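Arup's approach is considerably more efficient once the array gets large: about 1.8 times as many iterations per second (330.9 vs. 182.6 i/s) in the large case, compared with roughly 1.1 times for the three-element array.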

As for the single-quote issue the Tin Man raised, I think it is an important point, but handling it is the OP's responsibility, not something the existing algorithms have to solve. And both approaches produce the same result on that string.
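One caveat worth checking before swapping one approach for the other: they agree on the question's input, but they are not equivalent in general. The loop runs one pass per element, so an earlier removal can create a new match for a later element, while the regex makes a single pass. A small sketch, where the second input is contrived purely to show the difference:

```ruby
def by_loop(str, array)
  array.inject(str) { |result, substring| result.gsub(substring, '') }
end

def by_regex(str, array)
  str.gsub(Regexp.union(array), '')
end

array = ['this is', 'second element', 'third element']
same = by_loop('this is the string ', array) == by_regex('this is the string ', array)
puts same # => true

# Contrived input: removing "cd" first leaves "ab", which the loop then removes too.
loop_out  = by_loop('acdb', ['cd', 'ab'])  # => ""
regex_out = by_regex('acdb', ['cd', 'ab']) # => "ab"
puts loop_out.inspect
puts regex_out.inspect
```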

0

Source: https://habr.com/ru/post/1265987/

