I am looking for ideas on a recommended approach.
I am trying to scrape headings and body text from articles on several specific sites, similar to what Google does with Google News.
The problem is that different sites may have articles on the same subject, worded somewhat differently.
Can someone tell me what I need to know in order to write a comparison algorithm that automatically detects similar articles? Is there an existing library that can compare two texts and return some kind of similarity score?
Thank you very much in advance.
I am using Python.
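To make the question concrete, here is a minimal sketch of the kind of score I mean, using only the standard library. The function names (`tokens`, `shingles`, `jaccard_similarity`, `sequence_similarity`) and the sample headlines are my own illustration, not from any existing library, and this is just one simple approach that may well not be the best one.

```python
import re
from difflib import SequenceMatcher


def tokens(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def shingles(text, k=2):
    """Return the set of k-word sequences (shingles) found in the text."""
    words = tokens(text)
    if len(words) < k:
        # Text shorter than k words: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard_similarity(a, b, k=2):
    """Jaccard index of the two shingle sets, a value in [0, 1]."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty texts are considered identical
    return len(sa & sb) / len(sa | sb)


def sequence_similarity(a, b):
    """Word-level similarity ratio from difflib, a value in [0, 1]."""
    return SequenceMatcher(None, tokens(a), tokens(b)).ratio()


if __name__ == "__main__":
    # Hypothetical headlines, purely for illustration.
    first = "Central bank raises interest rates to fight inflation"
    second = "Interest rates raised by central bank to curb inflation"
    print(jaccard_similarity(first, second))
    print(sequence_similarity(first, second))
```

Both functions return a number between 0 and 1, where higher means more similar. I would presumably need to pick a threshold above which two articles count as "the same story", and I am not sure how well this handles articles that cover the same subject with very different wording.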