Here is a possible solution. I use a regex because it makes it easy to get rid of punctuation characters. In addition, I use collections.Counter, which can increase efficiency if your line contains many duplicate words, since each distinct word is checked against the tags only once.
    from collections import Counter
    import re

    tag_list = ["art", "paint"]
    s = "This is such an nice artwork, very nice artwork. This is the best painting I've ever seen"

    # Split the text into words, dropping punctuation
    words = re.findall(r'(\w+)', s)
    # Count each distinct word once
    dicto = Counter(words)

    def found(s, tag):
        return s.startswith(tag)

    words_found = []
    for tag in tag_list:
        # items() works on both Python 2 and 3 (iteritems() is Python 2 only)
        for k, v in dicto.items():
            if found(k, tag):
                words_found.append((k, v))
The last part can also be written as a single list comprehension:

    words_found = [(k, v) for tag in tag_list for k, v in dicto.items() if found(k, tag)]
Result:
    >>> words_found
    [('artwork', 2), ('painting', 1)]
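If it helps, the same idea can be wrapped in a small function. This is only a sketch, assuming Python 3.7+ (where Counter keeps first-seen order); the name count_tagged_words is made up for illustration. It relies on str.startswith accepting a tuple of prefixes, so each distinct word is tested once against all tags and a word matching several tags is not listed twice.

```python
import re
from collections import Counter


def count_tagged_words(text, tags):
    """Return (word, count) pairs for words starting with any of the tags.

    Hypothetical helper, not part of the original answer.
    """
    # Split into words, dropping punctuation, and count duplicates once
    counts = Counter(re.findall(r'\w+', text))
    # startswith() accepts a tuple, so one call checks every tag
    prefixes = tuple(tags)
    return [(word, n) for word, n in counts.items() if word.startswith(prefixes)]


tag_list = ["art", "paint"]
s = "This is such an nice artwork, very nice artwork. This is the best painting I've ever seen"
print(count_tagged_words(s, tag_list))
```

Note that the result here follows the order in which words first appear in the text, not the order of the tags, and that matching is case-sensitive.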