Quick Java String Matching (to associate text with a category)

Suppose I have a post that looks like this:

  • TITLE: "WEB: SEO in 2011"
  • DESCRIPTION: "2011 Web SEO Conference"

In addition, I have a list of categories with keywords:

  • "IT" (category) → "Web Design", "SEO", "Development", "Web Development" (keywords)

I have several categories (art, medicine, literature, technology, etc.).

I need to use Java to automatically tag my posts with these categories and keywords (a sort of tags) to improve searching later.

In the example above, "seo" and "web" should match, so the main_category field should be set to "IT" and the subcategory field to "seo" or "web" (or maybe both, which isn't too bad).

My problem is that the only solution I can come up with is pure brute force (check every word; when one matches, you have a category and its list of related keywords), and that will slow things down...

Is there a better way to do the matching? I can also change the categories -> keywords structure if that would help (I just don't know how yet...)

Thanks in advance!

EDIT: To answer the question asked in the comments: accuracy is not that important. I don't need 100% tagging accuracy, since I know I can get a fair amount of correctness from raw string matching.

In addition, the logic I had in mind is: look at the post's title/description, find matching keywords, tag the post with the category, find additional keywords from that category, and keep 3 to 5 matching keywords.

1 answer

You might want to try a different approach using machine learning.

Algorithm description:
First, create training samples [documents that you know exactly how to tag; you can tag a sample manually and give it to the algorithm as input]. Then build a bag of words of size k from these samples [you will need to decide which k is optimal by comparing quality, as explained below].

Each word is a "feature". For each new document, you then find which document from the training samples is its nearest neighbor [i.e. shares the most words from your bag of words with it], and the new document gets the same tag as that nearest neighbor.
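A minimal Java sketch of the nearest-neighbor idea above. It rests on assumptions that are not part of the answer: the bag of words is taken to be the k most frequent words across the training samples, "closest" is taken to mean sharing the most bag words, and all names (NearestNeighborTagger, Sample, tag) are made up for illustration.

```java
import java.util.*;

// Illustrative sketch only: class, method, and field names are hypothetical.
final class NearestNeighborTagger {

    /** A manually tagged training sample. */
    static final class Sample {
        final String text;
        final String category;
        Sample(String text, String category) { this.text = text; this.category = category; }
    }

    private final Set<String> bag;                                  // the k feature words
    private final List<Set<String>> sampleFeatures = new ArrayList<>();
    private final List<String> sampleCategories = new ArrayList<>();

    NearestNeighborTagger(List<Sample> samples, int k) {
        this.bag = mostFrequentWords(samples, k);
        for (Sample s : samples) {
            sampleFeatures.add(features(s.text));
            sampleCategories.add(s.category);
        }
    }

    /** Lowercases the text and splits it on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /** Assumption: the bag keeps the k most frequent words (ties broken alphabetically). */
    static Set<String> mostFrequentWords(List<Sample> samples, int k) {
        Map<String, Integer> counts = new HashMap<>();
        for (Sample s : samples) {
            for (String t : tokenize(s.text)) counts.merge(t, 1, Integer::sum);
        }
        List<String> words = new ArrayList<>(counts.keySet());
        words.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new HashSet<>(words.subList(0, Math.min(k, words.size())));
    }

    /** The features of a document: the bag words that occur in it. */
    Set<String> features(String text) {
        Set<String> found = new HashSet<>(tokenize(text));
        found.retainAll(bag);
        return found;
    }

    /** Category of the sample sharing the most bag words, or null if nothing overlaps. */
    String tag(String text) {
        Set<String> query = features(text);
        String best = null;
        int bestShared = 0;
        for (int i = 0; i < sampleFeatures.size(); i++) {
            Set<String> shared = new HashSet<>(sampleFeatures.get(i));
            shared.retainAll(query);
            if (shared.size() > bestShared) {
                bestShared = shared.size();
                best = sampleCategories.get(i);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Sample> samples = Arrays.asList(
            new Sample("Web design and SEO basics", "IT"),
            new Sample("SEO for web development", "IT"),
            new Sample("Renaissance painting and sculpture", "Art"),
            new Sample("Modern painting exhibition", "Art"));
        NearestNeighborTagger tagger = new NearestNeighborTagger(samples, 10);
        System.out.println(tagger.tag("WEB: SEO in 2011. 2011 Web SEO Conference"));
    }
}
```

Comparing every new document against every sample is linear in the number of samples, which is usually fine for a few thousand manually tagged posts. Common words such as "and" or "for" end up in the bag in this toy version; filtering them out would be a natural refinement.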

How do you check the quality? Set aside 10% of the documents from the training samples and train only on the remaining 90%. After training, you can estimate how accurate your algorithm is by checking its accuracy on the held-out 10%. Note that you will probably need to do this several times to find the optimal bag-of-words size, as described above.
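The 90/10 check could be sketched in Java roughly as follows. This is only an outline under assumptions: the Trainer interface is a hypothetical hook standing in for whatever tagging algorithm you train, the split is a random shuffle with a fixed seed per round, and the names HoldOutEvaluation, Labeled, holdOutAccuracy and bestK are invented for the example.

```java
import java.util.*;
import java.util.function.Function;

// Illustrative sketch only: all names here are hypothetical.
final class HoldOutEvaluation {

    /** A manually tagged document. */
    static final class Labeled {
        final String text;
        final String category;
        Labeled(String text, String category) { this.text = text; this.category = category; }
    }

    /** Hypothetical hook: builds a classifier (text -> category) from training data and a bag size k. */
    interface Trainer {
        Function<String, String> train(List<Labeled> trainingSet, int k);
    }

    /** Trains on about 90% of the shuffled samples and returns the accuracy on the held-out rest. */
    static double holdOutAccuracy(List<Labeled> samples, int k, Trainer trainer, long seed) {
        if (samples.size() < 2) {
            throw new IllegalArgumentException("need at least 2 samples to split");
        }
        List<Labeled> shuffled = new ArrayList<>(samples);
        Collections.shuffle(shuffled, new Random(seed));

        int testSize = Math.max(1, shuffled.size() / 10);   // roughly 10%, at least one document
        List<Labeled> testSet = shuffled.subList(0, testSize);
        List<Labeled> trainingSet = shuffled.subList(testSize, shuffled.size());

        Function<String, String> classifier = trainer.train(trainingSet, k);
        int correct = 0;
        for (Labeled doc : testSet) {
            if (doc.category.equals(classifier.apply(doc.text))) correct++;
        }
        return (double) correct / testSet.size();
    }

    /** Repeats the split several times per candidate k and returns the k with the best mean accuracy. */
    static int bestK(List<Labeled> samples, int[] candidates, Trainer trainer, int rounds) {
        int best = candidates[0];
        double bestMean = -1.0;
        for (int k : candidates) {
            double sum = 0.0;
            for (int round = 0; round < rounds; round++) {
                sum += holdOutAccuracy(samples, k, trainer, round);   // the round number doubles as the seed
            }
            double mean = sum / rounds;
            if (mean > bestMean) {       // strict comparison: the earliest k wins ties
                bestMean = mean;
                best = k;
            }
        }
        return best;
    }
}
```

Averaging over several random splits smooths out the luck of any single 10% sample, which matters when the manually tagged set is small.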


Source: https://habr.com/ru/post/1369454/