This falls into the "naive" camp, but here is a method that uses sets, as food for thought:
docs = [
    "Here a sentence with dog and apple in it",
    "Here a sentence with dog and poodle in it",
    "Here a sentence with poodle and apple in it",
    "Here a dog with and apple and a poodle in it",
    "Here an apple with a dog to show that order is irrelevant",
]
query = ['dog', 'apple']
def get_similar(query, docs):
    res = []
    query_set = set(query)
    for doc in docs:
        # keep the document if all query terms appear among its tokens
        if query_set & set(doc.split()) == query_set:
            res.append(doc)
    return res
This returns:
["Here a sentence with dog and apple in it",
"Here a dog with and apple and a poodle in it",
"Here an apple with a dog to show that order is irrelevant"]
Of course, the time complexity is nothing special, but it is much faster than scanning plain lists, because set membership tests are hash-based and take O(1) on average.
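If the same documents are queried repeatedly, a natural extension of this idea is to tokenize each document into a set once and reuse those sets across queries. A minimal sketch below; the precomputation step and the helper names are my additions, not part of the answer above:

def build_token_sets(docs):
    # precompute one token set per document so repeated queries
    # avoid re-splitting the text every time
    return [(doc, set(doc.split())) for doc in docs]

def get_similar_fast(query, token_sets):
    query_set = set(query)
    # issubset is the same hash-based containment test as the & comparison above
    return [doc for doc, tokens in token_sets if query_set.issubset(tokens)]

token_sets = build_token_sets(docs)
print(get_similar_fast(['dog', 'apple'], token_sets))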
As for part 2 of the question: Elasticsearch is a great candidate for this if you are willing to put in some effort and you are dealing with a lot of data.
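For illustration, here is a hedged sketch of the same "all query terms must appear" search with Elasticsearch, using the official elasticsearch-py client (8.x API). The index name "docs" and the localhost URL are placeholders, not from the original answer:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder cluster address

# index each document; "docs" is an illustrative index name
for i, doc in enumerate(docs):
    es.index(index="docs", id=i, document={"text": doc})
es.indices.refresh(index="docs")  # make the documents searchable immediately

# a match query with operator "and" requires every term to be present,
# mirroring the set-intersection check above
resp = es.search(index="docs", query={
    "match": {"text": {"query": "dog apple", "operator": "and"}}
})
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["text"])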