Here is a complete working sample:
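require 'pp'

content = "Meet Mr. Jon. Jon is a computer programmer and lives in Connecticut. Jon is tall. Shouldn't take web 2.0 as two sentences. And this is a new sentence. "
words = {}

# Protect the dot in "Mr." / "Mrs." so it is not treated as a sentence end.
# A single group (Mrs?) is used so that \1 is always filled in; non-bang gsub
# is used because gsub! returns nil when nothing was replaced.
protected_text = content.gsub(/\b(Mrs?)\./, "\\1{dot}")

# Split into sentences on ". ", "? " or "! ".
protected_text.split(/\. |\? |! /).each_with_index do |sentence, index|
  puts "\n#{index}: #{sentence}"
  sentence.split(/ +/).each do |word|
    # Restore the protected dot and normalise case.
    word = word.gsub("{dot}", ".").downcase
    puts word
    # words[word] = [occurrence count, [sentence indexes]]
    words[word] ||= [0, []]
    words[word][0] += 1
    words[word][1] << index
  end
end

pp words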
The final pp call prints:
{"meet"=>[1, [0]],
 "mr."=>[1, [0]],
 "jon"=>[3, [0, 1, 2]],
 "is"=>[3, [1, 2, 4]],
 "a"=>[2, [1, 4]],
 "computer"=>[1, [1]],
 "programmer"=>[1, [1]],
 "and"=>[2, [1, 4]],
 "lives"=>[1, [1]],
 "in"=>[1, [1]],
 "connecticut"=>[1, [1]],
 "tall"=>[1, [2]],
 "shouldn't"=>[1, [3]],
 "take"=>[1, [3]],
 "web"=>[1, [3]],
 "2.0"=>[1, [3]],
 "as"=>[1, [3]],
 "two"=>[1, [3]],
 "sentences"=>[1, [3]],
 "this"=>[1, [4]],
 "new"=>[1, [4]],
 "sentence"=>[1, [4]]}
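Each entry maps a word to a pair: the total count and the list of sentence indexes where it occurs. A minimal sketch of reading that structure back; the lookup helper and the trimmed words hash are made up here for illustration and are not part of the sample above:

```ruby
# Trimmed copy of the index: word => [count, [sentence indexes]]
words = {
  "jon"  => [3, [0, 1, 2]],
  "is"   => [3, [1, 2, 4]],
  "tall" => [1, [2]]
}

# Hypothetical helper: returns [count, sentence_indexes] for a word,
# or [0, []] when the word is not in the index.
def lookup(words, word)
  words.fetch(word.downcase, [0, []])
end

count, sentences = lookup(words, "Jon")
puts "jon: #{count} times, in sentences #{sentences.uniq.inspect}"
```

The uniq call is there because a word that occurs twice in one sentence gets that sentence index recorded twice.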
You can filter out words like "a" based on a minimum length or a blacklist. Curious what you are doing; I am creating an indexer for a wiki, since I cannot get Xapian to work on my Windows Ruby installs. Greetings
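On the filtering point, a minimal sketch of how it could look. The cutoff of 3 characters, the blacklist contents, and the indexable? helper are all assumptions chosen for illustration:

```ruby
require 'set'

MIN_LENGTH = 3                          # assumed cutoff, tune to taste
STOP_WORDS = Set.new(%w[the and for])   # assumed blacklist

# Hypothetical helper: true when a word is long enough and not blacklisted.
def indexable?(word)
  word.length >= MIN_LENGTH && !STOP_WORDS.include?(word)
end

sentence = "jon is a computer programmer and lives in connecticut"
kept = sentence.split(/ +/).select { |w| indexable?(w) }
puts kept.inspect
```

In the sample, the same check could sit at the top of the inner loop (next unless indexable?(word)) so that short or blacklisted words never reach the index.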