Best regular expression for splitting sentences in Ruby?

I'm working on something that counts how often each word appears in a body of text, reports which sentences it appears in, and sorts the results by word frequency. For instance: sample input and output

This is what I have so far:

    File.open('sample_text.txt', 'r') do |f| # open a file named "sample_text.txt"
      content = f.read # turn the content into a long string
      # split the string by sentences
      sentences = content.split(/\.|\?|\!/).each do |es|
        es.split(/\W|\s/).each do |w| # split into individual words
          # and for each word, find matched words in the content
        end
      end
    end

Questions:

1. Is there a better regex to separate sentences? Right now split(/\.|\?|\!/) treats "web 2.0" as two sentences, "web 2" and "0".

2. Can someone give me a hint on how to write the part that returns an array of the sentences containing a given word?

4 answers
  • How about requiring a space after the period (or other punctuation, such as ? or !), optionally refusing to match when the period is preceded by a known abbreviation (such as vs., Mr., Mrs., i.e. or e.g.), and perhaps also requiring that a capital letter follows?

  • Given an array of sentence strings and a method that breaks each sentence into an array of words (I will leave this to you), you can do this:

     sentences_for_word = Hash.new{ |h,k| h[k] = [] }
     sentences.each do |sentence|
       words_for_sentence(sentence).each do |word|
         sentences_for_word[word] << sentence
       end
     end
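A minimal sketch of how the pieces above could fit together. `words_for_sentence` is the helper the answer leaves to the reader, so the version here (lowercase, then scan for runs of letters, digits and apostrophes) is only one assumption about what counts as a word, and the sample sentences are made up for illustration.

```ruby
# Hypothetical helper: one possible definition of "word".
# Lowercases the sentence and keeps runs of letters, digits and apostrophes.
def words_for_sentence(sentence)
  sentence.downcase.scan(/[a-z0-9']+/)
end

# Made-up sample data
sentences = ["Jon is tall", "Jon is a programmer", "Ruby is fun"]

# Map each word to the sentences it occurs in
# (a sentence is appended once per occurrence of the word).
sentences_for_word = Hash.new { |h, k| h[k] = [] }
sentences.each do |sentence|
  words_for_sentence(sentence).each do |word|
    sentences_for_word[word] << sentence
  end
end

# Sort by number of occurrences, most frequent first.
by_frequency = sentences_for_word.sort_by { |_word, list| -list.size }

by_frequency.each do |word, list|
  puts "#{word} (#{list.size}): #{list.inspect}"
end
```

Looking up `sentences_for_word["jon"]` then answers the second question directly: it returns the array of sentences that contain the word.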

Here is a complete working sample:

    require 'pp'

    content = "Meet Mr. Jon. Jon is a computer programmer and lives in Connecticut. Jon is tall. Shouldn't take web 2.0 as two sentences. And this is a new sentence. "
    words = {}

    # Protect the dot in "Mr." and "Mrs." with a placeholder so it does not end a sentence.
    # A single group (Mrs?) is used so that \1 is also filled in for "Mrs."
    content.gsub(/(Mrs?)\./, "\\1{dot}").split(/\. |\? |\! /).each_with_index do |sentence, index|
      puts "\n#{index}: #{sentence}"
      sentence.split(/ +/).each do |word|
        word = word.gsub("{dot}", ".").downcase # restore the protected dot
        puts word
        words[word] ||= [0, []] # [occurrence count, indexes of the sentences]
        words[word][0] += 1
        words[word][1] << index
      end
    end

    pp words

The final pp prints:

    {"meet"=>[1, [0]],
     "mr."=>[1, [0]],
     "jon"=>[3, [0, 1, 2]],
     "is"=>[3, [1, 2, 4]],
     "a"=>[2, [1, 4]],
     "computer"=>[1, [1]],
     "programmer"=>[1, [1]],
     "and"=>[2, [1, 4]],
     "lives"=>[1, [1]],
     "in"=>[1, [1]],
     "connecticut"=>[1, [1]],
     "tall"=>[1, [2]],
     "shouldn't"=>[1, [3]],
     "take"=>[1, [3]],
     "web"=>[1, [3]],
     "2.0"=>[1, [3]],
     "as"=>[1, [3]],
     "two"=>[1, [3]],
     "sentences"=>[1, [3]],
     "this"=>[1, [4]],
     "new"=>[1, [4]],
     "sentence"=>[1, [4]]}

You can filter out words like "a" with a minimum length or a blacklist. I'm curious what you are working on; I am building an indexer for a wiki, since I cannot get Xapian to work on my Windows/Ruby setup. Grtz
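A small sketch of that filtering step, assuming an index in the same `word => [count, sentence indexes]` shape as the output above. The threshold and the blacklist entries are made-up values for illustration.

```ruby
# Sample index in the shape produced above: word => [count, sentence indexes]
words = {
  "jon"  => [3, [0, 1, 2]],
  "is"   => [3, [1, 2, 4]],
  "a"    => [2, [1, 4]],
  "and"  => [2, [1, 4]],
  "tall" => [1, [2]]
}

min_length = 2                  # assumed threshold: drops one-letter words
blacklist  = %w[is and the of]  # assumed stop words

# Drop words that are too short or blacklisted
filtered = words.reject do |word, _entry|
  word.length < min_length || blacklist.include?(word)
end

# Print the remaining words, most frequent first
filtered.sort_by { |_word, (count, _indexes)| -count }.each do |word, (count, indexes)|
  puts "#{word}: #{count} #{indexes.inspect}"
end
```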


You can improve your regular expression by adding a positive lookahead:

    (?:\.|\?|\!)(?= [^a-z]|$)

See here at Regexr

(?= [^a-z]|$) is a positive lookahead that checks whether the punctuation is followed by a space and a character that is not a lowercase letter, or by the end of the line. This greatly improves the matching.
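To illustrate, here is a short sketch of splitting with this lookahead, assuming the character class is meant to be `[^a-z]`. The sample text is made up.

```ruby
text = "Shouldn't take web 2.0 as two sentences. Is this new? Yes! it is not. The end."

# Punctuation only counts as a boundary when it is followed by a space plus
# a non-lowercase character, or by the end of a line.
boundary = /(?:\.|\?|\!)(?= [^a-z]|$)/

# The lookahead consumes nothing, so each piece keeps its leading space;
# strip removes it.
sentences = text.split(boundary).map(&:strip)

sentences.each_with_index { |s, i| puts "#{i}: #{s}" }
```

The dot in "2.0" is not followed by a space, and the "!" in "Yes! it" is followed by a lowercase letter, so neither is treated as a sentence boundary.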

The other suggestion from Phrogz (not matching after common abbreviations) is not possible in a single regular expression here, because Ruby 1.8 does not support lookbehind assertions (they were added in Ruby 1.9).

An alternative that takes extra steps is to search for these abbreviations first and replace their dot with a placeholder (for example, Mr. becomes Mr#DOT#), then split on the punctuation, and finally turn the placeholders back into dots.

Just for fun, a lookbehind version that does NOT work in Ruby 1.8:
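The placeholder approach could be sketched roughly as follows. The abbreviation list, the `#DOT#` marker and the sample text are assumptions for illustration; a real list would need to be longer.

```ruby
text = "Meet Mr. Jon and Mrs. Doe. They like web 2.0, e.g. wikis. Was it fun? Yes."

placeholder   = '#DOT#'                          # assumed marker, must not occur in the text
abbreviations = /\b(Mr|Mrs|Dr|vs|e\.g|i\.e)\./   # assumed, incomplete list

# Step 1: hide the trailing dot of each known abbreviation
protected_text = text.gsub(abbreviations) { "#{$1}#{placeholder}" }

# Step 2: split on punctuation that is followed by a space or the end of a line
pieces = protected_text.split(/[.?!](?= |$)/)

# Step 3: restore the hidden dots and trim the leading spaces
sentences = pieces.map { |s| s.gsub(placeholder, ".").strip }

sentences.each_with_index { |s, i| puts "#{i}: #{s}" }
```

Because the abbreviations are handled in a separate step, this needs no lookbehind at all.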

 (?<!\be\.g|\bi\.e|\bvs|\bMr|\bMrs|\bDr)(?:\.|\?|\!)(?= |$) 

See here at Regexr
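Ruby 1.9 and later do support lookbehind, so on a current Ruby a variant of this pattern can be tried. The sketch below is an assumption-laden simplification rather than the exact pattern above: the `\b` anchors are dropped to keep the lookbehind simple, which means a word that merely ends in one of the listed strings (such as "revs.") is also treated as an abbreviation.

```ruby
text = "Meet Mr. Jon and Mrs. Doe. Was it fun? Yes."

# Negative lookbehind: the punctuation must not be preceded by a listed
# abbreviation. Lookahead: it must be followed by a space or end of line.
boundary = /(?<!e\.g|i\.e|vs|Mr|Mrs|Dr)(?:\.|\?|\!)(?= |$)/

sentences = text.split(boundary).map(&:strip)

sentences.each_with_index { |s, i| puts "#{i}: #{s}" }
```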


Split on non-word characters: str.split(/\W+/). It will work for most texts (although I think it will also split on the ' character).
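A quick sketch of what this does with the example from the question. The `scan` variant is not part of the answer; it is one assumed way to keep apostrophes and inner dots inside a word.

```ruby
str = "Shouldn't take web 2.0 as two sentences"

# \W matches the apostrophe and the dot, so both break words apart
split_words = str.split(/\W+/)
puts split_words.inspect

# Assumed alternative: match words instead of separators, allowing an
# apostrophe or dot between word characters
scanned_words = str.scan(/\w+(?:['.]\w+)*/)
puts scanned_words.inspect
```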


Source: https://habr.com/ru/post/1384248/

