Extracting the body text of web pages: get only the article's title and text, not everything else on the page

I am looking for algorithms to extract text from websites. I do not mean "strip html" or any of the hundreds of libraries that allow this.

So, for example, for a news article I would like to extract the headline and the full article text, but not the comments section, etc.

Are there any algorithms for this? Thanks!

5 answers

In the computer-science literature this problem is usually called page segmentation or boilerplate detection. See the paper "Boilerplate Detection Using Shallow Text Features" and the related blog post. I have also bookmarked several papers and pieces of open-source software that tackle this problem. Also see this Stack Overflow question.
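The core idea behind that paper's approach can be sketched with a simple link-density heuristic: blocks dominated by anchor text (navigation bars, comment footers) are likely boilerplate, while long blocks with little link text are likely article content. A minimal stdlib-only sketch of that idea; the thresholds here are illustrative guesses, not the paper's tuned values:

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2", "h3"}

class BlockExtractor(HTMLParser):
    """Split a page into text blocks and track how much of each
    block's text sits inside <a> tags (its "link density")."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # list of (text, link_density)
        self._text = []
        self._link_text = []
        self._in_link = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link -= 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        self._text.append(data)
        if self._in_link:
            self._link_text.append(data)

    def flush(self):
        """Close the current block and record its link density."""
        text = " ".join("".join(self._text).split())
        link_text = " ".join("".join(self._link_text).split())
        if text:
            self.blocks.append((text, len(link_text) / len(text)))
        self._text, self._link_text = [], []

def extract_content(html, min_len=40, max_link_density=0.3):
    """Keep long, low-link-density blocks; drop the rest as boilerplate."""
    parser = BlockExtractor()
    parser.feed(html)
    parser.flush()
    return [t for t, d in parser.blocks
            if len(t) >= min_len and d <= max_link_density]
```

A navigation block like `<div><a>Home</a> <a>News</a></div>` has link density near 1.0 and gets dropped, while a long paragraph of prose has density near 0 and survives.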


What you are trying to do is called "content extraction." It turns out to be a surprisingly hard problem, and many naive solutions perform quite badly.

Instapaper and Readability both have to solve this, and you can learn something by studying their solutions. They also offer services you can use; perhaps you can hand them your problem and let their API take care of it. :)

Otherwise, searching for "html content extraction" returns many useful results, including a series of articles on the subject.
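Readability-style tools lean on a different heuristic than link density: they score candidate paragraphs by length and punctuation, which correlate with real prose, and keep the high scorers. A rough sketch; the weights and threshold below are illustrative assumptions, not Readability's actual constants:

```python
import re

def score_paragraph(text):
    """Readability-style score: reward length and sentence punctuation.
    Weights are illustrative, not taken from any real implementation."""
    score = min(len(text) // 100, 3)         # up to 3 points for length
    score += text.count(",")                 # commas suggest real sentences
    score += len(re.findall(r"[.!?]", text)) # sentence terminators
    return score

def pick_main_paragraphs(paragraphs, threshold=2):
    """Keep paragraphs scoring at or above the threshold; short
    navigation or comment fragments usually score 0 or 1."""
    return [p for p in paragraphs if score_paragraph(p) >= threshold]
```

Real implementations also propagate scores up the DOM tree to find the single best container element, rather than scoring paragraphs in isolation.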


There are several open-source tools that perform this kind of article extraction, e.g. https://github.com/jiminoc/goose , which was open-sourced by Gravity.com.

It has documentation in its wiki, as well as source code you can review. There are dozens of unit tests showing the text extracted from various articles.


Content extraction is a very complex topic. There is no universal standard for identifying the "main article" content. There are several approaches that make HTML more machine-readable for crawlers, such as schema.org, but none of them is widely used.

So if you want good results, it is probably best to define your own XPath selector for each (news) site you want to scrape. There are some APIs for HTML content extraction, but, as I said, it is very hard to develop an algorithm that works on every site.

Some APIs you could use:

alchemyapi.com
diffbot.com
boilerpipe-web.appspot.com
aylien.com
textracto.com
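The per-site selector approach from this answer can be as simple as a mapping from domain to selector. A minimal sketch using the limited XPath subset that Python's stdlib ElementTree supports; the domains and class names below are made-up examples, and the input must be well-formed XHTML (a real scraper would use lxml, which tolerates messy HTML and supports full XPath):

```python
import xml.etree.ElementTree as ET

# Hypothetical per-site rules: each domain maps to an ElementTree
# path (a small XPath subset) pointing at the article container.
SITE_SELECTORS = {
    "example-news.com": ".//div[@class='article-body']",
    "other-site.org":   ".//article",
}

def extract_article(domain, xhtml):
    """Apply the site's selector to well-formed XHTML and return the
    whitespace-normalized text of the matched element, or None."""
    selector = SITE_SELECTORS.get(domain)
    if selector is None:
        return None
    root = ET.fromstring(xhtml)
    node = root.find(selector)
    if node is None:
        return None
    return " ".join("".join(node.itertext()).split())
```

The maintenance cost is obvious: every site redesign can break its selector, which is exactly why the hosted APIs above exist.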


I think your best bet is to study what information you can get from metadata and write a good HTML parser; oEmbed could be a good standard to build on =)

https://oembed.com/#section7
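Metadata is indeed often the easiest win: many news sites expose the headline and article type via Open Graph `<meta>` tags in the page head (and advertise their oEmbed endpoints via `<link>` tags there too). A small stdlib sketch that collects the `og:` properties:

```python
from html.parser import HTMLParser

class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> pairs."""

    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        d = dict(attrs)
        prop = d.get("property", "")
        if prop.startswith("og:") and "content" in d:
            self.og[prop] = d["content"]

def og_metadata(html):
    """Return a dict of Open Graph properties found in the page."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.og
```

For a news article this typically yields `og:title`, `og:type` (often `"article"`), and `og:description`, which is sometimes all you need without extracting the body at all.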


Source: https://habr.com/ru/post/886403/
