Splitting a string in Ruby into a word list using regex

I am trying to break a string in Ruby into smaller substrings or phrases based on a list of stop words. The split method works when I write the regular expression literal directly; however, it does not work when I build the pattern from a string and interpolate it into the regular expression passed to split.

In practice, I want to read the stop words from an external file and use them to split my sentences, so I need to build the pattern from that file rather than hard-code it. I also notice that "pp" and "puts" show completely different output, and I'm not sure why. I am using Ruby 2.0 and Notepad++ on Windows.

    require 'pp'

    str = "The force be with you."
    pp str.split(/(?:\bthe\b|\bwith\b)/i)
    # => ["", " force be ", " you."]
    pp str.split(/(?:\bthe\b|\bwith\b)/i).collect(&:strip).reject(&:empty?)
    # => ["force be", "you."]

The last array is my desired result. However, the following version does not work:

    require 'pp'

    stop_array = ["the", "with"]
    str = "The force be with you."
    pattern = "(?:" + stop_array.map{|i| "\b#{i}\b" }.join("|") + ")"

    puts pattern
    # => (?thwit)
    puts str.split(/#{pattern}/i)
    # => The force be with you.
    pp pattern
    # => "(?:\bthe\b|\bwith\b)"
    pp str.split(/#{pattern}/i)
    # => ["The force be with you."]

Update: Using the comments below, I modified my original script. I also added a method to String that splits on the stop words.

    require 'pp'

    class String
      def splitstop(stopwords=[])
        stopwords_regex = /\b(?:#{ Regexp.union(*stopwords).source })\b/i
        return split(stopwords_regex).collect(&:strip).reject(&:empty?)
      end
    end

    stop_array = ["the", "with", "over"]

    pp "The force be with you.".splitstop stop_array
    # => ["force be", "you."]
    pp "The quick brown fox jumps over the lazy dog.".splitstop stop_array
    # => ["quick brown fox jumps", "lazy dog."]
+4
4 answers

I would do it like this:

    str = "The force be with you."
    stop_array = %w[the with]
    stopwords_regex = /(?:#{ Regexp.union(stop_array).source })/i
    str.split(stopwords_regex).map(&:strip) # => ["", "force be", "you."]

When using Regexp.union, it is important to keep an eye on the pattern it generates:

    /(?:#{ Regexp.union(stop_array) })/i
    # => /(?:(?-mix:the|with))/i

The embedded (?-mix:...) group turns off the case-insensitive flag for everything inside it, so the outer /i has no effect there and "The" no longer matches. Instead, use Regexp#source to get only the pattern text without the flags:

    /(?:#{ Regexp.union(stop_array).source })/i
    # => /(?:the|with)/i
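
To make the difference concrete, here is a small sketch that splits the same sentence with both variants. The variable names (with_flags, source_only, and so on) are only illustrative and not part of the original question.

```ruby
stop_array = %w[the with]
str = "The force be with you."

# Interpolating the Regexp object itself embeds its flags: (?-mix:the|with)
with_flags = /(?:#{Regexp.union(stop_array)})/i

# Interpolating only the source text lets the outer /i apply to the words
source_only = /(?:#{Regexp.union(stop_array).source})/i

flags_result  = str.split(with_flags).map(&:strip)
source_result = str.split(source_only).map(&:strip)

p with_flags.source  # "(?:(?-mix:the|with))"
p flags_result       # ["The force be", "you."]  ("The" is not matched)
p source_result      # ["", "force be", "you."]
```

With the embedded flags, only the lowercase "with" is treated as a stop word, which is easy to miss if the test sentence happens to be all lowercase.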

Here's why pattern = "(?:\bthe\b|\bwith\b)" doesn't work:

 /#{pattern}/i # => /(?:\x08the\x08|\x08with\x08)/i 

Inside a double-quoted string, Ruby treats "\b" as a backspace character. Escape the backslash instead:

    pattern = "(?:\\bthe\\b|\\bwith\\b)"
    /#{pattern}/i # => /(?:\bthe\b|\bwith\b)/i
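
The following sketch shows what each spelling actually stores, which also explains the puts versus pp difference: puts writes the raw backspace characters to the terminal, while pp prints the inspect form with the escape visible. The variable names are illustrative, and using single quotes is an alternative I am assuming is acceptable here, since single-quoted strings do not interpret \b.

```ruby
double_quoted = "\b"    # one character: backspace (ASCII 8)
escaped       = "\\b"   # two characters: a backslash followed by b
single_quoted = '\b'    # the same two characters; single quotes leave \b alone

p double_quoted.bytes        # [8]
p escaped.length             # 2
p escaped == single_quoted   # true

# Building the pattern with single-quoted pieces avoids the double escaping
stop_array = ["the", "with"]
pattern = '\b(?:' + stop_array.join("|") + ')\b'
pieces = "The force be with you.".split(/#{pattern}/i)
p pieces                     # ["", " force be ", " you."]
```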
+3

You need to escape the backslash:

 "\\b#{i}\\b" 

i.e.

 pattern = "(?:" + stop_array.map{|i| "\\b#{i}\\b" }.join("|") + ")" 

And a slight improvement / simplification:

 pattern = "\\b(?:" + stop_array.join("|") + ")\\b" 

Then:

 str.split(/#{pattern}/i) # => ["", " force be ", " you."] 

If your stop list is short, I think this is the right approach.
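
Since the question mentions loading the stop words from an external file, here is a minimal sketch of that step. It assumes a file with one stop word per line; the helper name stop_pattern_from_file and the temporary file are hypothetical stand-ins, not anything from the original post. It uses Regexp.union rather than a plain join so that any regex metacharacters in the words are escaped.

```ruby
require 'tempfile'

# Hypothetical helper: builds one case-insensitive pattern from a file
# that lists one stop word per line (an assumed file format).
def stop_pattern_from_file(path)
  words = File.readlines(path).map(&:strip).reject(&:empty?)
  # Regexp.union escapes metacharacters in plain strings; .source drops its flags
  /\b(?:#{Regexp.union(words).source})\b/i
end

# A temporary file stands in for the real stop-word list
file = Tempfile.new('stopwords')
file.write("the\nwith\nover\n")
file.close

pattern = stop_pattern_from_file(file.path)
sentence = "The quick brown fox jumps over the lazy dog."
phrases = sentence.split(pattern).map(&:strip).reject(&:empty?)
p phrases  # ["quick brown fox jumps", "lazy dog."]

file.unlink
```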

0
    stop_array = ["the", "with"]
    re = Regexp.union(stop_array.map{|w| /\s*\b#{Regexp.escape(w)}\b\s*/i})
    "The force be with you.".split(re) # => [ "", "force be", "you." ]
0
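    require 'pp'

    s = "the force be with you."
    stop_words = %w|the with is|

    # dynamically create a case-insensitive regexp;
    # \b keeps "is" from matching inside a word such as "this"
    regexp = Regexp.new("\\b(?:#{Regexp.union(stop_words).source})\\b", Regexp::IGNORECASE)

    result = []
    while (match = regexp.match(s))
      # keep the text before the stop word, skipping empty pieces
      result << match.pre_match unless match.pre_match.empty?
      s = match.post_match
    end
    # the last unmatched content, if any
    result << s unless s.empty?

    # strip returns new strings; compact! and strip! return nil when nothing changes
    result = result.map(&:strip).reject(&:empty?)
    pp result
    # => ["force be", "you."]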
0

Source: https://habr.com/ru/post/1485745/

