Searching a document for many different queries

I am writing a script that takes a news article as input and returns a list of all publicly traded companies mentioned in the article, along with their corresponding ticker symbols. There are ~6,500 unique company names that could be mentioned.

My first thought was to use regular expressions to pull everything that might be a company name out of the article. Company names vary a lot, but almost always every word in the name begins with a capital letter, so I think this could work with only a few false positives (for example, when a person shares a name with a company).
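As a rough illustration of that idea, here is a minimal Python sketch. The pattern and the `candidate_names` helper are assumptions made up for this example, not a tested production rule. It collects runs of consecutive capitalized words, and it also shows how easily ordinary capitalized words slip through.

```python
import re

# Hypothetical pattern: a run of consecutive capitalized words, allowing
# '&', '.' and "'" inside a word (e.g. "Co.", "AT&T").
CANDIDATE = re.compile(r"\b[A-Z][\w&.']*(?:\s+[A-Z][\w&.']*)*")

def candidate_names(text):
    """Return capitalized word runs that might be company names."""
    # Trailing periods and apostrophes are stripped so that a name at the
    # end of a sentence matches the same key as one in the middle.
    return [m.group(0).rstrip(".'") for m in CANDIDATE.finditer(text)]

article = "Shares of American Company Co. rose on Monday."
print(candidate_names(article))
```

Here the sentence-initial "Shares" and the weekday "Monday" are returned alongside the company name, so every candidate still has to be checked against the real list.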

The next problem is comparing the possible company names against the list of all companies and symbols. How should I store the list? As a table in which each record has a company field and a symbol field? This seems like the perfect place for a hash map that maps each company to its symbol. Would a MySQL solution be faster than serializing an array with that mapping and just unserializing it at the start of the script that finds the names in the articles?
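To make the serialized hash map option concrete, here is a small Python sketch. The file name, the sample entries and the helper names are all hypothetical. It stores the mapping as JSON, loads it once at startup, and filters the candidates with dictionary lookups.

```python
import json
import os
import tempfile

# Hypothetical sample data; the real list would hold ~6,500 entries.
COMPANY_TO_SYMBOL = {
    "American Company Co": "ACCO",
    "Example Holdings": "EXH",
}

def save_mapping(mapping, path):
    """Serialize the company -> symbol mapping to a JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(mapping, fh)

def load_mapping(path):
    """Deserialize the mapping once, at the start of the script."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)

def symbols_for(candidates, mapping):
    """Keep only the candidates that are known company names."""
    return {name: mapping[name] for name in candidates if name in mapping}

path = os.path.join(tempfile.gettempdir(), "companies.json")
save_mapping(COMPANY_TO_SYMBOL, path)
mapping = load_mapping(path)
print(symbols_for(["Shares", "American Company Co", "Monday"], mapping))
```

Each lookup is a single hash probe, so the cost per article depends on the number of candidates rather than on the size of the company list. Whether this beats a database query in practice would need to be measured.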

+4
2 answers

My first thought was to use regular expressions to pull everything that might be a company name out of the article. Company names vary a lot, but almost always every word in the name begins with a capital letter, so I think this could work with only a few false positives (for example, when a person shares a name with a company).

There is a reason why we use a prefix like # or @ for tags or mentions: it gives the pattern matching something to anchor on. I think you will shoot yourself in the foot if you allow "false positives" on this scale.

I would go with the standard format used in articles that mention tickers, where the company name is followed by its stock ticker in parentheses, such as American Company Co. (ACCO). This lets you simply search for the parenthesized references (*).
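If articles do follow that convention, the search reduces to a single pattern. The sketch below is in Python and assumes tickers are one to five uppercase letters in parentheses directly after a capitalized name. Both the pattern and the `ticker_references` helper are illustrative assumptions.

```python
import re

# Assumed convention: capitalized company name, then an uppercase ticker
# of 1-5 letters in parentheses, e.g. "American Company Co. (ACCO)".
TICKER_REF = re.compile(r"\b((?:[A-Z][\w&.']*\s+)+)\(([A-Z]{1,5})\)")

def ticker_references(text):
    """Return (company name, ticker) pairs found in the text."""
    return [(m.group(1).strip(), m.group(2)) for m in TICKER_REF.finditer(text)]

article = "Shares of American Company Co. (ACCO) rose 3% on Monday."
print(ticker_references(article))
```

The extracted ticker can then be validated against the known symbol list, which is a much cheaper check than matching free-form names.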

Unless you adhere to a format like this, you will find it hard to get fast, relevant and accurate results.

A comprehensive solution would be server-side processing to deal with false positives: load the complete list of names, crunch through it for matches, and add some kind of alert system for reviewing doubtful matches. But that is simply too much overhead when a simple format convention is possible.

+3

and returns a list of all publicly traded companies mentioned in the article and their corresponding symbols

Assuming there is no structure in the text, it will be very difficult.

The most effective solution would be to split the article into a list of words and to keep a list of the words that appear in company names, with each entry in that second list carrying an additional list of regular expressions that match the full company names. This lets you cut the 6,500 company names down to a much smaller list of potential matches. Then apply those regular expressions to the source text.
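A minimal Python sketch of that two-stage approach follows. The sample companies and the function names are hypothetical, and company names are assumed to be stored without trailing punctuation.

```python
import re
from collections import defaultdict

# Hypothetical sample data; the real list would hold ~6,500 entries.
COMPANIES = {
    "American Company Co": "ACCO",
    "Example Holdings": "EXH",
}

def build_index(companies):
    """Map each lowercase word of a company name to the
    (compiled regex, full name) pairs of the companies containing it."""
    index = defaultdict(list)
    for name in companies:
        parts = [re.escape(part) for part in name.split()]
        pattern = re.compile(r"\b" + r"\s+".join(parts) + r"\b")
        for word in set(name.lower().split()):
            index[word].append((pattern, name))
    return index

def find_companies(text, companies, index):
    """Run only the regexes whose index words occur in the text."""
    words = set(re.findall(r"[\w&']+", text.lower()))
    found = {}
    for word in words:
        for pattern, name in index.get(word, ()):
            if name not in found and pattern.search(text):
                found[name] = companies[name]
    return found

index = build_index(COMPANIES)
text = "Shares of American Company Co. rose while Example Holdings fell."
print(sorted(find_companies(text, COMPANIES, index).items()))
```

The word index acts as a cheap prefilter: a company's full-name regex only runs when at least one of its words occurs in the article, so most of the list is never tested.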

Yes, performing this operation in the database will be much faster - but this is far from a trivial task.

+2

Source: https://habr.com/ru/post/1390278/

