The classic way to do this is to identify, for each paragraph, the set of keywords you want the algorithms to find, and then measure how well the algorithms perform against that set, for example with (generated_correct - generated_not_correct) / total_generated (see the update: this score is flawed). This is automatic once you have that ground truth. I guess this is what you want to automate when you talk about creating a gold set? That part is a little trickier.
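The flawed score mentioned above could be sketched as follows, as a minimal illustration only (the update explains why it is a poor metric). The function name and the toy keyword lists are my own invention:

```python
def naive_score(generated, gold):
    """The (flawed) score described above:
    (generated_correct - generated_not_correct) / total_generated.
    Note it never looks at gold keywords that were missed entirely,
    which is exactly the problem the update points out."""
    generated = set(generated)
    correct = len(generated & set(gold))
    incorrect = len(generated) - correct
    return (correct - incorrect) / len(generated)

# 2 correct hits, 1 spurious keyword out of 3 generated -> (2 - 1) / 3
score = naive_score(["neural", "network", "banana"],
                    ["neural", "network", "training"])
print(score)
```

Note that an algorithm returning a single correct keyword scores a perfect 1.0 here, even if ten gold keywords were missed, which is why recall is needed.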
As a rule of thumb: if there is a way to automatically generate keywords that is good enough to serve as the ground truth, you should just use that as your algorithm ;). It sounds cheeky, but it is a common pitfall. When you evaluate one algorithm against the output of another algorithm, something is probably going wrong (unless you specifically want to compare against that algorithm).
So, you could start collecting ground-truth keywords from existing sources. For instance:
Download scientific articles that contain a keyword section. Check whether those keywords actually appear in the text; if they do, take the section of text containing them and use the keyword section as the ground truth.
Get blog posts, check whether the terms from the title appear in the body, and then use the title words (minus stop words, of course) as the ground truth.
...
You get the idea. If you do not want people to generate keywords manually, I think you will need to do something like the above.
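The filtering step described in the list above (keep only author-supplied keywords that actually occur in the text) could be sketched like this. The function name and the `(text, keywords)` pair format are assumptions for illustration:

```python
def build_ground_truth(articles):
    """Given (text, author_keywords) pairs, keep only the keywords
    that actually occur in the text, per the recipe above.
    Articles where no keyword appears are dropped entirely."""
    dataset = []
    for text, keywords in articles:
        lower = text.lower()
        present = [kw for kw in keywords if kw.lower() in lower]
        if present:
            dataset.append((text, present))
    return dataset

articles = [
    ("Deep learning with neural networks is popular.",
     ["neural networks", "quantum computing"]),
]
print(build_ground_truth(articles))
```

A naive substring check like this is only a sketch; in practice you would likely want tokenization and stemming so that morphological variants still count as matches.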
Update: The scoring function above is flawed. It does not take into account how many of the available keywords were found. Instead, the standard way to evaluate a ranked list of relevant and irrelevant results is with precision and recall. Precision rewards the absence of irrelevant results; recall rewards the presence of relevant ones. That again gives you two measures. To combine them into one, use the F-measure, which merges the two into a single score with optional weighting. Alternatively, use Precision@X, where X is the number of results you want to consider. Interestingly, Precision@X is equivalent to Recall@X here. However, you need a sensible X: if in some cases there are fewer than X gold keywords, those cases will be penalized for never containing an Xth keyword. In the tag recommendation literature, for example, which is very similar to your case, F-measure and P@5 are commonly used.
http://en.wikipedia.org/wiki/F1_score
http://en.wikipedia.org/wiki/Precision_and_recall