Find out which words in a large list occur in a short string

I have a static "big" list of words, about 300-500 words, called "list1".

Given a relatively short string str of about 40 words, what is the fastest way in Ruby to get:

  1. the number of times a word in list1 occurs in str (counting multiple occurrences)
  2. the list of words in list1 that occur one or more times in str
  3. the number of words in (2)

"Occurs" in str means either as a whole word in str or as a substring of a word in str. So if 'fred' is in list1 and str contains 'fred' and 'freddie', that is two matches.

Everything is lowercase, so matching does not need to handle case.

For instance:

 list1 = "fred sam sandy jack sue bill"
 str = "and so sammy went with jack to see fred and freddie"

So str contains sam, jack, and fred (twice).

For part (1), the expression would return 4 (sam + jack + fred + fred).
For part (2), the expression would return "sam jack fred".
For part (3), it would return 3.

The "Ruby way" of doing this has eluded me for 4 hours... With iteration it is fairly easy (but slow). Any help would be appreciated!

+4
3 answers

Here is my take:

 def match_freq(exprs, strings)
   rs, ss, f = exprs.split.map{|x|Regexp.new(x)}, strings.split, {}
   rs.each{|r| ss.each{|s| f[r] = f[r] ? f[r]+1 : 1 if s=~r}}
   [f.values.inject(0){|a,x|a+x}, f, f.size]
 end

 list1 = "fred sam sandy jack sue bill"
 str = "and so sammy went with jack to see fred and freddie"
 x = match_freq(list1, str)
 x # => [4, {/sam/=>1, /fred/=>2, /jack/=>1}, 3]

The output of match_freq is an array of the three values you asked for (1, 2, 3). The algorithm itself is O(n*m), where n is the number of elements in list1 and m is the size of the input string. I don't think you can do better than this (in big-O terms). But there are smaller optimizations that might pay off, for example keeping a separate counter for the total number of matches rather than computing it afterwards. This was just my quick hack.
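
As a rough illustration of the running-counter idea, here is a minimal sketch. The name match_freq_running_total is made up for this example, and it swaps the regexes for plain String#include? checks and uses the words themselves as hash keys; it is one possible variant, not a measured improvement.

```ruby
# Hypothetical variant of match_freq: same nested loop, but the total is
# accumulated while scanning instead of being summed from the hash afterwards.
def match_freq_running_total(list, str)
  words  = list.split
  tokens = str.split
  total  = 0
  freq   = Hash.new(0) # missing keys read as 0, so += 1 works directly

  words.each do |word|
    tokens.each do |token|
      next unless token.include?(word) # substring match, as in the question
      freq[word] += 1
      total += 1
    end
  end

  [total, freq, freq.size]
end

list1 = "fred sam sandy jack sue bill"
str = "and so sammy went with jack to see fred and freddie"
result = match_freq_running_total(list1, str)
```

Like the original, this counts each word of str at most once per list word, so a token such as "fredfred" would count as one match for "fred".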

You can extract just the matching words from the result like this:

 matches = x[1].keys.map{|r|r.source}.join(" ") # => "sam fred jack"

Note that the order is not necessarily preserved. If it matters, you will need to keep a separate list of the order in which the words were found.
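
If the order in which words first appear in str is what matters, one way to keep such a list is sketched below. The helper name matches_in_order is hypothetical, and the sketch walks the string's words in the outer loop so that the result follows the string rather than list1.

```ruby
# Hypothetical helper: returns the words from the list in the order in which
# they are first found in str.
def matches_in_order(list, str)
  words = list.split
  found = []

  str.split.each do |token|
    words.each do |word|
      # Record a word only the first time it is seen.
      found << word if token.include?(word) && !found.include?(word)
    end
  end

  found
end

list1 = "fred sam sandy jack sue bill"
str = "and so sammy went with jack to see fred and freddie"
ordered = matches_in_order(list1, str)
puts ordered.join(" ")
```

For the example data this gives "sam jack fred", the order used in the question.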

+2

Here's an alternative implementation for your edification:

 def match_freq( words, str )
   words = words.split(/\s+/)
   counts = Hash[ words.map{ |w| [w,str.scan(w).length] } ]
   counts.delete_if{ |word,ct| ct==0 }
   occurring_words = counts.keys
   [
     counts.values.inject(0){ |sum,ct| sum+ct }, # Sum of counts
     occurring_words,
     occurring_words.length
   ]
 end

 list1 = "fred sam sandy jack sue bill"
 str = "and so sammy went with jack to see fred and freddie"
 x = match_freq(list1, str)
 p x #=> [4, ["fred", "sam", "jack"], 3]

Note that if I needed this data, I would most likely just return the "counts" hash from the method and then do whatever analysis I needed on it. If I were going to return several "values" from an analysis method, I might return a Hash of named values. Returning an array does, however, let you unpack the results directly:

 hits, words, word_count = match_freq(list1, str)
 p hits, words, word_count
 #=> 4
 #=> ["fred", "sam", "jack"]
 #=> 3
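
The alternative mentioned above, returning only the counts hash and deriving the rest from it, might look roughly like this. The name word_counts and the keys of the summary hash are invented for the illustration.

```ruby
# Hypothetical helper: returns only the per-word counts for words that occur
# at least once; the caller derives whatever summary it needs.
def word_counts(words, str)
  pairs = words.split.map { |w| [w, str.scan(w).length] }
  Hash[pairs.reject { |_word, ct| ct.zero? }]
end

list1 = "fred sam sandy jack sue bill"
str = "and so sammy went with jack to see fred and freddie"

counts = word_counts(list1, str)

# A Hash of named values built from the counts.
summary = {
  hits: counts.values.inject(0) { |sum, ct| sum + ct },
  words: counts.keys,
  word_count: counts.size
}
p summary
```

With named keys the caller does not have to remember the position of each value in an array.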
+2

For fast regular expressions, use https://github.com/mudge/re2 . It is a Ruby wrapper for Google's RE2: https://code.google.com/p/re2/

0

Source: https://habr.com/ru/post/1337910/

