How to detect duplicate items in an RSS feed

When extracting and caching items from an RSS feed (saving them to a database), how can I determine that:

  • an item is the same one I have already stored (for example, when a typo is fixed in the feed, or the title or date changes)
  • items from different channels cover the same topic (for example, the same story from different sources)

Are there any recommendations for these things?

Thanks a lot.

3 answers

RSS- . , , . RSS- URL-, , URL-. , URL- , Guid , , . , URL- . , , .
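A minimal sketch of one way to handle the first case, assuming each item carries a `guid` or at least a `link`: key the cache on that identifier instead of on the title or date, so an edited item overwrites its earlier version. The helper names (`normalize_url`, `item_key`, `upsert_items`), the in-memory `dict` cache, and the title-hash fallback are illustrative assumptions, not part of any library.

```python
import hashlib
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase scheme and host, drop the fragment and a trailing slash."""
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path.rstrip("/"), parts.query, ""))


def item_key(item):
    """Pick a stable identity for an <item>: guid, else link, else title hash."""
    guid = (item.findtext("guid") or "").strip()
    if guid:
        return "guid:" + guid
    link = (item.findtext("link") or "").strip()
    if link:
        return "link:" + normalize_url(link)
    # Last resort (assumption): hash the title; this breaks if the title is edited.
    title = (item.findtext("title") or "").strip()
    return "hash:" + hashlib.sha1(title.encode("utf-8")).hexdigest()


def upsert_items(feed_xml, cache):
    """Insert new items into cache, or overwrite the stored copy of known ones."""
    root = ET.fromstring(feed_xml)
    for item in root.iter("item"):
        cache[item_key(item)] = {
            "title": item.findtext("title"),
            "pubDate": item.findtext("pubDate"),
        }
```

With this keying, fetching the feed again after a typo fix updates the stored entry instead of adding a second one, as long as the publisher keeps the `guid` (or the link) unchanged.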


Take a look at the clustering algorithms used by Google News. Your requirements are more modest, but the problem is loosely related: Google News groups stories about the same event from different sources into one cluster, using sophisticated algorithms combined with NLP. You can start more simply by extracting and comparing the keywords in the title and URL.
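As a rough illustration of that starting point (not how Google News works), the sketch below extracts keywords from each title and URL path and groups stories whose keyword sets overlap enough, measured by Jaccard similarity. The stopword list, the `0.25` threshold, and the function names are assumptions chosen for the example and would need tuning on real feeds.

```python
import re
from urllib.parse import urlsplit

# Small illustrative stopword list (assumption); a real one would be larger.
STOPWORDS = {"the", "and", "for", "with", "from", "that", "this", "has", "are"}


def keywords(title, url=""):
    """Lowercased alphanumeric tokens from the title and the URL path."""
    text = title + " " + urlsplit(url).path
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in tokens if len(t) > 2 and t not in STOPWORDS}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def group_stories(stories, threshold=0.25):
    """Greedy grouping: join the first group whose first story is similar enough."""
    groups = []  # list of (keyword set of the group's first story, members)
    for title, url in stories:
        kw = keywords(title, url)
        for rep_kw, members in groups:
            if jaccard(kw, rep_kw) >= threshold:
                members.append((title, url))
                break
        else:
            groups.append((kw, [(title, url)]))
    return [members for _, members in groups]
```

Comparing only against the first story of each group keeps the sketch short; a more careful version might compare against every member or use TF-IDF weights so that rare words count for more than common ones.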


Source: https://habr.com/ru/post/1763557/

