I am trying to split a string in Ruby into smaller substrings or phrases based on a list of stop words. The split method works when I write the regular expression as a literal; however, it does not work when I build the pattern as a string and interpolate it into the regex passed to split.
In practice, I want to read the stop words from an external file and use them to split my sentences, so I need to build the pattern from that file's contents rather than hard-code it. I also notice that "pp" and "puts" show completely different output for the same pattern string, and I'm not sure why. I am using Ruby 2.0 and Notepad++ on Windows.
    require 'pp'

    str = "The force be with you."

    pp str.split(/(?:\bthe\b|\bwith\b)/i)
    # => ["", " force be ", " you."]

    pp str.split(/(?:\bthe\b|\bwith\b)/i).collect(&:strip).reject(&:empty?)
    # => ["force be", "you."]
The last array is my desired result. However, the version below, which builds the pattern from an array, does not work:
    require 'pp'

    stop_array = ["the", "with"]
    str = "The force be with you."
    pattern = "(?:" + stop_array.map{|i| "\b#{i}\b" }.join("|") + ")"

    puts pattern
    # => (?thwit)
    puts str.split(/#{pattern}/i)
    # => The force be with you.

    pp pattern
    # => "(?:\bthe\b|\bwith\b)"
    pp str.split(/#{pattern}/i)
    # => ["The force be with you."]
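The puts/pp difference hints at the likely cause: inside a double-quoted Ruby string, "\b" is the backspace character (byte 0x08), not the regex word-boundary escape. puts sends those backspaces to the terminal, where they overwrite neighbouring characters, while pp prints the string's escaped form. Here is a minimal sketch, assuming the same stop_array and str as above, that doubles the backslash so the regex engine receives a real \b. The variable names backspace, boundary, and result are only illustrative.

```ruby
stop_array = ["the", "with"]
str = "The force be with you."

# In a double-quoted string, "\b" is a single backspace character.
backspace = "\b"
p backspace.bytes # => [8]

# "\\b" (or '\b' in single quotes) is a literal backslash followed by "b",
# which the regex engine then reads as a word boundary.
boundary = "\\b"

# Regexp.escape guards against stop words that contain regex metacharacters.
pattern = "(?:" + stop_array.map { |w| "#{boundary}#{Regexp.escape(w)}#{boundary}" }.join("|") + ")"

result = str.split(/#{pattern}/i).map(&:strip).reject(&:empty?)
p result # => ["force be", "you."]
```

With the escaped backslash, the interpolated pattern behaves the same as the literal regex in the first example.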
Update: Based on the comments below, I modified my original script. I also added a method that splits a line on the stop words.
    require 'pp'

    class String
      def splitstop(stopwords=[])
        stopwords_regex = /\b(?:#{ Regexp.union(*stopwords).source })\b/i
        return split(stopwords_regex).collect(&:strip).reject(&:empty?)
      end
    end

    stop_array = ["the", "with", "over"]

    pp "The force be with you.".splitstop stop_array
    # => ["force be", "you."]

    pp "The quick brown fox jumps over the lazy dog.".splitstop stop_array
    # => ["quick brown fox jumps", "lazy dog."]
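For the external stop word file mentioned at the start, one possible approach is sketched below. It assumes the file holds one stop word per line. The helper names load_stopwords and split_on_stopwords are hypothetical, and a temporary file stands in for the real stop word file so the example is self-contained.

```ruby
require 'tempfile'

# Hypothetical helper: read one stop word per line, ignoring blank lines.
def load_stopwords(path)
  File.readlines(path).map(&:strip).reject(&:empty?)
end

# Hypothetical helper: same regex construction as splitstop, as a plain method.
def split_on_stopwords(text, stopwords)
  regex = /\b(?:#{Regexp.union(*stopwords).source})\b/i
  text.split(regex).map(&:strip).reject(&:empty?)
end

# A temporary file stands in for the real stop word file (assumed format).
file = Tempfile.new('stopwords')
file.write("the\nwith\n\nover\n")
file.close

stop_array = load_stopwords(file.path)
file.unlink

phrases = split_on_stopwords("The quick brown fox jumps over the lazy dog.", stop_array)
p phrases # => ["quick brown fox jumps", "lazy dog."]
```

Because Regexp.union escapes each string it receives, stop words read from the file are matched literally even if they contain characters such as "." or "+".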