Find out which words in a large list appear in a small line

Question

I have a static "big" list of about 300-500 words, called list1.

Given a relatively short string str of about 40 words, what is the fastest way in Ruby to get:

(1) the number of times a word in list1 occurs in str (counting multiple occurrences)
(2) the list of words in list1 that occur one or more times in str
(3) the number of words in (2)

An "occurrence" in str means either a whole word in str or part of a word in str. So if 'fred' is in list1 and str contains 'fred' and 'freddie', that counts as two matches.

Everything is lowercase, so matching does not need to handle case.

For instance:

    list1 = "fred sam sandy jack sue bill"
    str = "and so sammy went with jack to see fred and freddie"

So str contains sam, jack, and fred (twice).

For part (1), the expression would return 4 (sam + jack + fred + fred).
For part (2), the expression would return "sam jack fred".
For part (3), it would return 3.

The "Ruby way" to do this has eluded me after 4 hours. With iteration it is fairly easy (but slow). Any help would be appreciated!

+4

ruby regex

jpwynn Feb 01 '11 at 6:45

3 answers

Here's an alternative implementation for your edification:

    def match_freq( words, str )
      words = words.split(/\s+/)
      # Count occurrences of each word, including partial matches inside longer words
      counts = Hash[ words.map{ |w| [w,str.scan(w).length] } ]
      counts.delete_if{ |word,ct| ct==0 }
      occurring_words = counts.keys
      [
        counts.values.inject(0){ |sum,ct| sum+ct }, # Sum of counts
        occurring_words,
        occurring_words.length
      ]
    end

    list1 = "fred sam sandy jack sue bill"
    str = "and so sammy went with jack to see fred and freddie"
    x = match_freq(list1, str)
    p x #=> [4, ["fred", "sam", "jack"], 3]

Note that if I needed this data, I would most likely just return the "counts" hash from the method and then do whatever analysis I wanted on it. If I were going to return several "values" from an analysis method, I might return a Hash of named values. That said, returning an array does let you destructure the results:

    hits, words, word_count = match_freq(list1, str)
    p hits, words, word_count
    #=> 4
    #=> ["fred", "sam", "jack"]
    #=> 3
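The "Hash of named values" idea mentioned above could look roughly like the following. This is only a sketch: the method name match_stats and the keys :total, :words, :word_count, and :counts are assumptions made for illustration, not part of the answer.

```ruby
# Hypothetical variant of match_freq that returns named values instead of a
# positional array. The name match_stats and the hash keys are illustrative
# assumptions, not taken from the answer.
def match_stats(words, str)
  counts = {}
  words.split.each do |word|
    n = str.scan(word).length   # counts whole-word and partial matches
    counts[word] = n if n > 0   # keep only words that actually occur
  end
  {
    total:      counts.values.inject(0) { |sum, ct| sum + ct },
    words:      counts.keys,
    word_count: counts.size,
    counts:     counts
  }
end

list1 = "fred sam sandy jack sue bill"
str   = "and so sammy went with jack to see fred and freddie"

stats = match_stats(list1, str)
p stats[:total]      #=> 4
p stats[:words]      #=> ["fred", "sam", "jack"]
p stats[:word_count] #=> 3
```

Callers then pick values by name rather than by position, which keeps call sites readable if more statistics are added later.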

+2

Phrogz Feb 01 '11 at 17:49

For fast regular expressions, use https://github.com/mudge/re2 . It is a Ruby wrapper for Google's re2 library: https://code.google.com/p/re2/

0

mattes Sep 13 '13 at 1:50

maerics · Accepted Answer · 2011-02-01T07:07:08+0000

Here is my shot at it:

    def match_freq(exprs, strings)
      rs, ss, f = exprs.split.map{|x| Regexp.new(x)}, strings.split, {}
      rs.each{|r| ss.each{|s| f[r] = f[r] ? f[r]+1 : 1 if s =~ r}}
      [f.values.inject(0){|a,x| a+x}, f, f.size]
    end

    list1 = "fred sam sandy jack sue bill"
    str = "and so sammy went with jack to see fred and freddie"
    x = match_freq(list1, str)
    x # => [4, {/sam/=>1, /fred/=>2, /jack/=>1}, 3]

The output of match_freq is an array of your requested output items (1, 2, 3). The algorithm itself is O(n*m), where n is the number of items in list1 and m is the size of the input string; I don't think you can do better than that (in big-O terms). But there are smaller optimizations that might pay off, such as keeping a separate counter for the total number of matches instead of computing it afterwards. This was just my quick hack at it.
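As a rough illustration of the "separate counter" optimization, the total can be tallied during the scan instead of summed afterwards. The method name match_freq_running is hypothetical, and the use of Regexp.escape (so that list entries are always matched literally) is an added assumption, not part of the answer.

```ruby
# Sketch of the running-total idea. match_freq_running is a hypothetical
# name; Regexp.escape is an added assumption so that list entries are
# always treated as literal text.
def match_freq_running(exprs, strings)
  patterns = exprs.split.map { |x| Regexp.new(Regexp.escape(x)) }
  tokens   = strings.split
  freq     = Hash.new(0)   # missing keys read as 0
  total    = 0
  patterns.each do |r|
    tokens.each do |s|
      next unless s =~ r   # skip words of str that do not contain the pattern
      freq[r] += 1
      total   += 1         # tallied here, so no second pass over freq.values
    end
  end
  [total, freq, freq.size]
end

list1 = "fred sam sandy jack sue bill"
str   = "and so sammy went with jack to see fred and freddie"

total, freq, size = match_freq_running(list1, str)
p total  #=> 4
p size   #=> 3
```

Like the original, this counts each word of str at most once per pattern, so a pattern appearing twice inside a single word would still count as one match.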

You can extract just the matching words from the output like this:

 matches = x[1].keys.map{|x|x.source}.join(" ") # => "sam fred jack"

Note that the order is not necessarily preserved (hashes are unordered in Ruby 1.8, while Ruby 1.9 and later keep insertion order). If the order matters, you will need to keep a separate list of the order in which the words were found.
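One way to keep such a list is to walk the words of str and record each list word the first time it is seen. This is a minimal sketch; matches_in_order is a hypothetical helper name, and it uses plain substring checks rather than the Regexp objects above.

```ruby
# Hypothetical helper: returns the words from list that occur in str,
# in the order they are first encountered in str.
def matches_in_order(list, str)
  words = list.split
  found = []
  str.split.each do |token|
    words.each do |w|
      # Record a list word the first time some word of str contains it
      found << w if token.include?(w) && !found.include?(w)
    end
  end
  found
end

list1 = "fred sam sandy jack sue bill"
str   = "and so sammy went with jack to see fred and freddie"

ordered = matches_in_order(list1, str)
p ordered            #=> ["sam", "jack", "fred"]
p ordered.join(" ")  #=> "sam jack fred"
```

For these inputs the result matches the order asked for in part (2) of the question. If a single word of str contains several list words, they are recorded in list order.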
