Finding an Experiment to Evaluate How Good a Keyword Extraction Algorithm Is

I have several algorithms that extract and rank keywords (both single terms and bigrams) from a paragraph; most of them are based on the tf-idf model. I am looking for an experiment to evaluate these algorithms, i.e. one that indicates how good each algorithm is (based on its ranking, of course).

I am looking for an automatic / semi-automatic method for evaluating the results of each algorithm and an automatic / semi-automatic method for creating an evaluation set.

Note: These experiments will be performed autonomously, so efficiency is not a problem.

1 answer

The classic way to do this is to identify, for each paragraph, the set of keywords that you want the algorithms to find, and then check how well the algorithms perform with respect to that set, e.g. with (generated_correct - generated_incorrect) / total_generated (but see the update below; this measure is flawed). This is automatic once you have that ground truth. I guess that is what you want to automate when you talk about creating an evaluation set? That is a little trickier.
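
For concreteness, here is a minimal sketch (in Python) of that naive score; the keyword sets are made-up example data, not from a real corpus:

    # Minimal sketch of the naive score described above; the data is hypothetical.
    def naive_score(generated, correct):
        """(generated_correct - generated_incorrect) / total_generated"""
        generated, correct = set(generated), set(correct)
        if not generated:
            return 0.0
        n_correct = len(generated & correct)
        n_incorrect = len(generated - correct)
        return (n_correct - n_incorrect) / len(generated)

    # Hypothetical ground truth for one paragraph and one algorithm's output.
    ground_truth = {"keyword extraction", "tf-idf", "ranking"}
    extracted = ["tf-idf", "ranking", "paragraph"]

    print(naive_score(extracted, ground_truth))  # 0.33: 2 correct, 1 incorrect, 3 generated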

As a rule of thumb: if there were a way to automatically generate keywords that was good enough to use as ground truth, you should just use it as your algorithm ;). It sounds cheeky, but it is a common problem: when you evaluate one algorithm using the output of another algorithm, something is probably going wrong (unless you specifically want to compare against that other algorithm).

So you could start by collecting keywords from existing sources. For instance:

  • Download scientific articles that contain a keyword section. Check whether those keywords actually appear in the text; if they do, take a section of the text that includes the keywords and use the keyword section as ground truth.

  • Get blog posts, check whether the terms in the title appear in the text, and then use the words in the title (minus stop words, of course) as ground truth.

  • ...

You get the idea. If you do not want people to generate keywords manually, I think you will need to do something like the above; a small sketch of the blog-title variant follows below.
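
Here is a rough Python sketch of the blog-title variant; the stop-word list, the 50% filtering rule and the example post are all placeholders, not a recommendation:

    # Rough sketch of building ground truth from blog posts (title terms minus stop words).
    # `posts`, the stop-word list and the 0.5 threshold are hypothetical placeholders.
    STOP_WORDS = {"a", "an", "the", "of", "to", "for", "and", "in", "on", "with", "how"}

    def title_keywords(title, body):
        """Return the title terms (minus stop words) that occur in the body,
        or None if too few of them do, so the post is skipped."""
        terms = [w for w in title.lower().split() if w not in STOP_WORDS]
        present = [w for w in terms if w in body.lower()]
        if not terms or len(present) / len(terms) < 0.5:
            return None
        return set(present)

    posts = [  # hypothetical (title, body) pairs; in practice, crawl or dump real posts
        ("How to evaluate keyword extraction with tf-idf",
         "This post discusses keyword extraction and shows how tf-idf can help evaluate ..."),
    ]

    evaluation_set = []
    for title, body in posts:
        truth = title_keywords(title, body)
        if truth is not None:
            evaluation_set.append((body, truth))  # (paragraph text, ground-truth keywords)

    print(evaluation_set[0][1])  # e.g. {'keyword', 'extraction', 'tf-idf', 'evaluate'}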

Update: The scoring function above is nonsense, because it does not take into account how many of the available keywords were found. Instead, the way to evaluate a ranked list of relevant and irrelevant results is to use precision and recall. Precision rewards the absence of irrelevant results; recall rewards the presence of relevant results. This again gives you two measures.

To combine the two into one measure, either use the F-measure, which merges them into a single score with an optional weighting, or use Precision@X, where X is the number of results you want to consider. Interestingly, Precision@X is equivalent to Recall@X here. However, you need to choose a reasonable X: if in some cases there are fewer than X ground-truth keywords, those results will be penalized for never containing the Xth keyword. In the tag-recommendation literature, for example, which is very similar to your case, F-measures and P@5 are often used.

http://en.wikipedia.org/wiki/F1_score

http://en.wikipedia.org/wiki/Precision_and_recall
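
As an illustration, a small Python sketch (hypothetical data again) of precision, recall, F-measure and Precision@X for a single ranked keyword list:

    # Sketch of precision, recall, F-measure and Precision@X for one ranked keyword list.
    # `ranked` and `relevant` are hypothetical example data.
    def precision_recall(ranked, relevant):
        retrieved = set(ranked)
        hits = len(retrieved & relevant)
        precision = hits / len(retrieved) if retrieved else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        return precision, recall

    def f_measure(precision, recall, beta=1.0):
        """Weighted harmonic mean of precision and recall (beta=1 gives F1)."""
        if precision + recall == 0:
            return 0.0
        b2 = beta * beta
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    def precision_at(ranked, relevant, x):
        """Fraction of the top X ranked keywords that are relevant (P@X)."""
        return sum(1 for kw in ranked[:x] if kw in relevant) / x

    relevant = {"tf-idf", "keyword extraction", "ranking", "evaluation", "bigrams"}
    ranked = ["tf-idf", "paragraph", "ranking", "stop words", "evaluation"]

    p, r = precision_recall(ranked, relevant)
    print(p, r)                               # 0.6 0.6 (3 of 5 retrieved are relevant; 3 of 5 relevant retrieved)
    print(f_measure(p, r))                    # 0.6 (F1)
    print(precision_at(ranked, relevant, 5))  # 0.6 (P@5)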


Source: https://habr.com/ru/post/1383693/

