Content extraction is a complex topic. There is no widely accepted standard for marking up the “main article” of a page. Several approaches aim to make HTML easier for crawlers to understand, such as schema.org, but none of them is used widely.
So if you want good results, it is probably best to define your own XPath selector for each (news) site you want to scrape. There are some APIs for extracting content from HTML but, as I said, it is very difficult to develop an algorithm that works for every site.
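As a rough illustration of the per-site selector idea, here is a minimal sketch using only the Python standard library. The hostnames, the selectors, and the `extract_main_text` helper are all made up for the example, and it assumes the page is well-formed XHTML, since `xml.etree.ElementTree` supports only a subset of XPath and cannot parse messy real-world HTML. For actual sites you would probably want a lenient HTML parser with full XPath support instead.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Hypothetical per-site selectors: hostnames and paths are invented for illustration.
SITE_SELECTORS = {
    "news.example.com": ".//div[@class='article-body']",
    "blog.example.org": ".//article",
}


def extract_main_text(url, markup):
    """Return the text of the main article, using the selector configured for the site."""
    host = urlparse(url).hostname
    selector = SITE_SELECTORS.get(host)
    if selector is None:
        raise KeyError(f"no selector configured for {host}")

    # Assumption: the markup is well-formed XML (XHTML); real pages often are not.
    root = ET.fromstring(markup)
    node = root.find(selector)
    if node is None:
        return None

    # Collect all text inside the matched element and normalize whitespace.
    return " ".join(" ".join(node.itertext()).split())


SAMPLE = """<html><body>
<div class="nav">Home | Sports</div>
<div class="article-body"><h1>Title</h1><p>First paragraph.</p><p>Second one.</p></div>
<div class="footer">Copyright</div>
</body></html>"""

if __name__ == "__main__":
    print(extract_main_text("https://news.example.com/a/1", SAMPLE))
```

The selector table is the part you maintain by hand: one entry per site, updated whenever that site changes its layout. That is the price of getting reliable results without a general-purpose algorithm.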
Some APIs you could use:
alchemyapi.com
diffbot.com
boilerpipe-web.appspot.com
aylien.com
textracto.com