Extracting the body text of web pages: get only the article's title and text, not everything else on the page

I am looking for algorithms to extract text from websites. I do not mean "strip html" or any of the hundreds of libraries that allow this.

So, for example, for a news article I would like to extract the headline and the full article text, but not the comments section, etc.

Are there any algorithms for this? Thanks!

5 answers

In the computer-science literature this problem is usually called page segmentation or boilerplate detection. See the paper "Boilerplate Detection Using Shallow Text Features" and the related blog post. I have also bookmarked several papers and pieces of open-source software that tackle this problem. Also see this Stack Overflow question.
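The core idea behind that paper's approach can be sketched with a simple link-density heuristic: blocks dominated by anchor text (navigation bars, comment footers) are likely boilerplate, while long blocks with little link text are likely article content. A minimal stdlib-only sketch of that idea; the thresholds here are illustrative guesses, not the paper's tuned values:

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2", "h3"}

class BlockExtractor(HTMLParser):
    """Split a page into text blocks and track how much of each
    block's text sits inside <a> tags (its "link density")."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # list of (text, link_density)
        self._text = []
        self._link_text = []
        self._in_link = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link -= 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        self._text.append(data)
        if self._in_link:
            self._link_text.append(data)

    def flush(self):
        """Close the current block and record its link density."""
        text = " ".join("".join(self._text).split())
        link_text = " ".join("".join(self._link_text).split())
        if text:
            self.blocks.append((text, len(link_text) / len(text)))
        self._text, self._link_text = [], []

def extract_content(html, min_len=40, max_link_density=0.3):
    """Keep long, low-link-density blocks; drop the rest as boilerplate."""
    parser = BlockExtractor()
    parser.feed(html)
    parser.flush()
    return [t for t, d in parser.blocks
            if len(t) >= min_len and d <= max_link_density]
```

A navigation block like `<div><a>Home</a> <a>News</a></div>` has link density near 1.0 and gets dropped, while a long paragraph of prose has density near 0 and survives.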


What you are trying to do is called "content extraction." It turns out to be a surprisingly hard problem, and many naive solutions perform quite badly.

Instapaper and Readability both have to solve this, and you can learn something by studying their solutions. They also offer services you can use; perhaps you can hand them your problem and let their API take care of it. :)

Otherwise, searching for "html content extraction" returns many useful results, including a series of articles on the subject.
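Readability-style tools lean on a different heuristic than link density: they score candidate paragraphs by length and punctuation, which correlate with real prose, and keep the high scorers. A rough sketch; the weights and threshold below are illustrative assumptions, not Readability's actual constants:

```python
import re

def score_paragraph(text):
    """Readability-style score: reward length and sentence punctuation.
    Weights are illustrative, not taken from any real implementation."""
    score = min(len(text) // 100, 3)         # up to 3 points for length
    score += text.count(",")                 # commas suggest real sentences
    score += len(re.findall(r"[.!?]", text)) # sentence terminators
    return score

def pick_main_paragraphs(paragraphs, threshold=2):
    """Keep paragraphs scoring at or above the threshold; short
    navigation or comment fragments usually score 0 or 1."""
    return [p for p in paragraphs if score_paragraph(p) >= threshold]
```

Real implementations also propagate scores up the DOM tree to find the single best container element, rather than scoring paragraphs in isolation.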


There are several open-source tools that perform this kind of article extraction, e.g. https://github.com/jiminoc/goose , which was open-sourced by Gravity.com.

It has documentation in its wiki, as well as source code you can review. There are dozens of unit tests showing the text extracted from various articles.


Content extraction is a very complex topic. There is no universal standard for identifying the "main article" content. There are several approaches that make HTML more machine-readable for crawlers, such as schema.org, but none of them is widely used.

So if you want good results, it is probably best to define your own XPath selector for each (news) site you want to scrape. There are some APIs for HTML content extraction, but, as I said, it is very hard to develop an algorithm that works on every site.

Some APIs you could use:

alchemyapi.com
diffbot.com
boilerpipe-web.appspot.com
aylien.com
textracto.com
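The per-site selector approach from this answer can be as simple as a mapping from domain to selector. A minimal sketch using the limited XPath subset that Python's stdlib ElementTree supports; the domains and class names below are made-up examples, and the input must be well-formed XHTML (a real scraper would use lxml, which tolerates messy HTML and supports full XPath):

```python
import xml.etree.ElementTree as ET

# Hypothetical per-site rules: each domain maps to an ElementTree
# path (a small XPath subset) pointing at the article container.
SITE_SELECTORS = {
    "example-news.com": ".//div[@class='article-body']",
    "other-site.org":   ".//article",
}

def extract_article(domain, xhtml):
    """Apply the site's selector to well-formed XHTML and return the
    whitespace-normalized text of the matched element, or None."""
    selector = SITE_SELECTORS.get(domain)
    if selector is None:
        return None
    root = ET.fromstring(xhtml)
    node = root.find(selector)
    if node is None:
        return None
    return " ".join("".join(node.itertext()).split())
```

The maintenance cost is obvious: every site redesign can break its selector, which is exactly why the hosted APIs above exist.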


I think your best bet is to study what information you can get from metadata and write a good HTML parser; oEmbed could be a good standard to build on =)

https://oembed.com/#section7
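Metadata is indeed often the easiest win: many news sites expose the headline and article type via Open Graph `<meta>` tags in the page head (and advertise their oEmbed endpoints via `<link>` tags there too). A small stdlib sketch that collects the `og:` properties:

```python
from html.parser import HTMLParser

class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> pairs."""

    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        d = dict(attrs)
        prop = d.get("property", "")
        if prop.startswith("og:") and "content" in d:
            self.og[prop] = d["content"]

def og_metadata(html):
    """Return a dict of Open Graph properties found in the page."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.og
```

For a news article this typically yields `og:title`, `og:type` (often `"article"`), and `og:description`, which is sometimes all you need without extracting the body at all.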


Source: https://habr.com/ru/post/886403/
