How can I parse free text (Twitter tweets) against a large database of values?

Suppose I have a database containing 500,000 records, each of which represents, say, an animal. What would be the best approach for parsing 140-character tweets to pick out matching records by animal name? For instance, in this tweet ...

"I went down into the forest during the day and could not believe my eyes: I saw a giant polar bear who had a picnic with a red squirrel."

... I would like to pick out the phrases "giant polar bear" and "red squirrel", since they appear in my database.

This strikes me as a problem that has probably been solved many times before, but from where I'm sitting it looks dauntingly intensive: iterating over every DB entry and checking whether it occurs in the string is surely a crazy way to do it.

Can anyone with a computer science degree put me out of my misery? I work in C#, if that matters. Thanks!

+4
4 answers

Assuming the database is fairly static, use a Bloom filter. This is a degenerate form of a hash table that stores only bits indicating the presence of a value, without storing the value itself. It is probabilistic, because hashes can collide, so every hit requires a full lookup against the real database. But a 1 MB Bloom filter with 500,000 entries can have a false-positive rate as low as 0.03%.

Some math: getting a rate that low requires up to 23 hash codes, each with 23 bits of entropy, for a total of 529 bits. Bob Jenkins' 64-bit hash function generates 192 bits of entropy in a single pass (if you use the internal variables of hash(), which Bob himself rates as possibly "mediocre" hashes), so no more than three passes are needed. Because of how Bloom filters work, you don't need all of that entropy on every lookup anyway, since most lookups will report a miss long before reaching the 23rd hash code.
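
Here is a minimal C# sketch of the idea. All names are mine, and the FNV-1a double-hashing scheme below is just a convenient stand-in for the Bob Jenkins hash mentioned above, so the exact false-positive math will differ:

    using System;
    using System.Collections;
    using System.Collections.Generic;

    class BloomFilter
    {
        private readonly BitArray _bits;
        private readonly int _numHashes;

        public BloomFilter(int sizeInBits, int numHashes)
        {
            _bits = new BitArray(sizeInBits);
            _numHashes = numHashes;
        }

        public void Add(string value)
        {
            foreach (int index in Indexes(value)) _bits[index] = true;
        }

        // False means the value is definitely absent; true means it is *probably*
        // present and must be verified against the real database.
        public bool MightContain(string value)
        {
            foreach (int index in Indexes(value))
                if (!_bits[index]) return false;   // most misses bail out early
            return true;
        }

        // Derive k bit indexes from two base hashes via double hashing:
        // index_i = h1 + i * h2 (mod m).
        private IEnumerable<int> Indexes(string value)
        {
            uint h1 = Fnv1a(value, 2166136261u);
            uint h2 = Fnv1a(value, 0x9E3779B9u) | 1u;   // force odd so strides differ
            for (int i = 0; i < _numHashes; i++)
                yield return (int)((h1 + (uint)i * h2) % (uint)_bits.Length);
        }

        private static uint Fnv1a(string s, uint seed)
        {
            uint h = seed;
            foreach (char c in s) { h ^= c; h *= 16777619u; }
            return h;
        }
    }

Usage, with the sizing from the answer (1 MB of bits, 23 hash codes):

    var animals = new BloomFilter(8 * 1024 * 1024, 23);
    animals.Add("giant polar bear");
    animals.Add("red squirrel");
    Console.WriteLine(animals.MightContain("red squirrel"));   // True
    Console.WriteLine(animals.MightContain("blue whale"));     // almost certainly False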

EDIT: You will obviously have to parse words out of the text. Searching for each match of /\b\w+\b/ will probably do for a first version.

To match phrases, you will need to test every n-word subsequence (a.k.a. n-gram), where n runs from 2 up to the longest phrase in your dictionary. You can make this much cheaper by adding every word that appears in any phrase to a separate Bloom filter, and only testing n-grams whose words all pass this second filter, as sketched below.
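
A hedged sketch of that two-filter scheme, reusing the BloomFilter class from the previous snippet. The names wordFilter and phraseFilter are mine: the first holds every word occurring in any dictionary phrase, the second holds the whole phrases, lowercased and space-joined:

    using System;
    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    static class PhraseScanner
    {
        public static IEnumerable<string> CandidatePhrases(
            string tweet, BloomFilter wordFilter, BloomFilter phraseFilter, int maxWords)
        {
            // Tokenize with /\b\w+\b/ as suggested in the EDIT above.
            var words = new List<string>();
            foreach (Match m in Regex.Matches(tweet.ToLowerInvariant(), @"\b\w+\b"))
                words.Add(m.Value);

            for (int start = 0; start < words.Count; start++)
            {
                for (int len = 1; len <= maxWords && start + len <= words.Count; len++)
                {
                    // Stop extending this n-gram as soon as a word fails the word
                    // filter: no dictionary phrase can contain that word.
                    if (!wordFilter.MightContain(words[start + len - 1])) break;
                    string ngram = string.Join(" ", words.GetRange(start, len));
                    if (phraseFilter.MightContain(ngram))
                        yield return ngram;   // probable hit: verify against the database
                }
            }
        }
    }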

+3

Have you tried building a trie from your dictionary? If you split the tweet into words and walk each position through the trie, you get linear complexity.
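
For example, a minimal word-level trie sketch (all names are mine, and phrases are assumed to be lowercased and space-separated):

    using System;
    using System.Collections.Generic;

    class TrieNode
    {
        public Dictionary<string, TrieNode> Children = new Dictionary<string, TrieNode>();
        public bool IsPhraseEnd;
    }

    class PhraseTrie
    {
        private readonly TrieNode _root = new TrieNode();

        public void Add(string phrase)
        {
            TrieNode node = _root;
            foreach (string word in phrase.ToLowerInvariant().Split(' '))
            {
                if (!node.Children.TryGetValue(word, out TrieNode child))
                    node.Children[word] = child = new TrieNode();
                node = child;
            }
            node.IsPhraseEnd = true;
        }

        // From each starting word, walk down the trie as far as it matches,
        // yielding every complete dictionary phrase found along the way.
        public IEnumerable<string> Matches(string[] words)
        {
            for (int start = 0; start < words.Length; start++)
            {
                TrieNode node = _root;
                for (int i = start; i < words.Length; i++)
                {
                    if (!node.Children.TryGetValue(words[i], out node)) break;
                    if (node.IsPhraseEnd)
                        yield return string.Join(" ", words, start, i - start + 1);
                }
            }
        }
    }

    // Usage:
    // var trie = new PhraseTrie();
    // trie.Add("giant polar bear");
    // trie.Add("red squirrel");
    // string[] tweetWords = "i saw a giant polar bear with a red squirrel".Split(' ');
    // foreach (string hit in trie.Matches(tweetWords))
    //     Console.WriteLine(hit);   // giant polar bear, red squirrel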

+2

Why reinvent the wheel? Use a text-indexing tool to handle the heavy lifting. Lucene.Net comes to mind.
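
For illustration, a hedged sketch against the Lucene.Net 3.0.3-era API (the field name, analyzer, and overall setup are my assumptions, not this answer's): index the animal names once, then check candidate phrases from each tweet with phrase queries instead of scanning the database:

    using System;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Documents;
    using Lucene.Net.Index;
    using Lucene.Net.QueryParsers;
    using Lucene.Net.Search;
    using Lucene.Net.Store;
    using Version = Lucene.Net.Util.Version;

    class LuceneSketch
    {
        static void Main()
        {
            var dir = new RAMDirectory();
            var analyzer = new StandardAnalyzer(Version.LUCENE_30);

            // Index the dictionary of animal names once.
            using (var writer = new IndexWriter(dir, analyzer, true,
                                                IndexWriter.MaxFieldLength.UNLIMITED))
            {
                foreach (var name in new[] { "giant polar bear", "red squirrel" })
                {
                    var doc = new Document();
                    doc.Add(new Field("name", name, Field.Store.YES, Field.Index.ANALYZED));
                    writer.AddDocument(doc);
                }
            }

            using (var searcher = new IndexSearcher(dir, true))
            {
                var parser = new QueryParser(Version.LUCENE_30, "name", analyzer);
                // A quoted string becomes a PhraseQuery: only the exact word
                // sequence matches.
                var hits = searcher.Search(parser.Parse("\"giant polar bear\""), 10);
                foreach (var sd in hits.ScoreDocs)
                    Console.WriteLine(searcher.Doc(sd.Doc).Get("name"));
            }
        }
    }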

0

What's wrong with Regex? =) It will do the job for small texts.

    using System.Text.RegularExpressions;

    string input = @"I went down to the woods today and couldn't believe my eyes: I saw a bear having a picnic with a squirrel. I am a human though!";
    Regex animalFilter = new Regex(@"\b(bear|squirrel|tiger|human)\b");
    foreach (Match s in animalFilter.Matches(input))
    {
        textBox1.Text += s.Value + Environment.NewLine;
    }

It outputs:

bear
squirrel
human

A few more lines:

    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    string input = @"I went down to the woods today and couldn't believe my eyes: I saw a bear having a picnic with a squirrel. I am a human though!";
    Regex animalFilter = new Regex(@"\b(bear|squirrel|tiger|human)\b");
    Dictionary<string, int> animals = new Dictionary<string, int>();
    foreach (Match s in animalFilter.Matches(input))
    {
        int ctr = 1;
        if (animals.ContainsKey(s.Value))
        {
            ctr = animals[s.Value] + 1;
        }
        animals[s.Value] = ctr;
    }
    foreach (KeyValuePair<string, int> k in animals)
    {
        textBox1.Text += k.Key + " occurred " + k.Value + " times" + Environment.NewLine;
    }

Results:

bear occurred 1 times
squirrel occurred 1 times
human occurred 1 times
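
If the animal names live in the database rather than in a hand-written pattern, the same approach can build the regex dynamically. A small sketch (namesFromDb is a stand-in for whatever your data access returns), with the caveat that a 500,000-way alternation will be far too slow for the original question's scale:

    using System;
    using System.Linq;
    using System.Text.RegularExpressions;

    string[] namesFromDb = { "giant polar bear", "red squirrel", "tiger" };

    // Regex.Escape guards against names containing regex metacharacters.
    var animalFilter = new Regex(
        @"\b(" + string.Join("|", namesFromDb.Select(Regex.Escape)) + @")\b",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);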

0
