Extract the main content (maximum text density) from a news article web page

I want to write code that extracts the main article from a news site. News pages contain the main story along with advertisements, reviews, copyright notices and so on, and I want to get only the main story, for example the way boilerpipe does it, but I don't know how to go about it.

So I would like some information on how this kind of extraction works.

Sudhanshu

+6
4 answers

The boilerpipe site contains the source code, quick start instructions, links to the original scientific paper and the corresponding video of the conference presentation:

http://code.google.com/p/boilerpipe/

This should give you a fairly complete set of information about how this works and how you can apply this in your scenario.
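As a rough illustration of the text-density idea in the question title, here is a minimal Python sketch that uses only the standard library. It is not boilerpipe's actual algorithm, just a simplified heuristic under stated assumptions: the page is split into blocks at a few common tags, and a block is kept only if it has enough words and a low share of link text. The names `BlockCollector` and `extract_main_text` and the thresholds `min_words` and `max_link_density` are hypothetical and would need tuning on real pages.

```python
from html.parser import HTMLParser

# Assumption: these tags mark block boundaries; real pages may need more.
BLOCK_TAGS = {"p", "div", "td", "li", "article", "section", "h1", "h2", "h3"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Split a page into text blocks and count the words inside links."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # (block_text, words_inside_links) per block
        self._words = []
        self._link_words = 0
        self._in_link = 0
        self._skip = 0

    def flush(self):
        # Close the current block, if it collected any text.
        if self._words:
            self.blocks.append((" ".join(self._words), self._link_words))
        self._words = []
        self._link_words = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return  # ignore script and style contents
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)


def extract_main_text(html, min_words=10, max_link_density=0.3):
    """Keep blocks that are long enough and contain little link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()
    kept = []
    for text, link_words in parser.blocks:
        n = len(text.split())
        if n >= min_words and link_words / n <= max_link_density:
            kept.append(text)
    return "\n\n".join(kept)
```

Navigation bars tend to fail the link-density check and short footers tend to fail the word-count check, which is why even this crude filter removes a good part of the boilerplate on simple pages.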

Best

Christian

+8

We tried many open source tools such as Readability and Beautiful Soup, but after testing the Diffbot API we decided to use it for AppMarkt. It extracts news articles quickly and efficiently, in multiple languages.

+2

jsoup provides an API for parsing HTML.

0

I would give HtmlCleaner a try.

HtmlCleaner is a Java library used to safely parse and convert any HTML found on the Internet into well-formed XML. It is designed to be small, fast, flexible and independent. HtmlCleaner can be used in Java code, as a command line tool or as an Ant task. The result of the parsing is a lightweight document object model, which can easily be converted to standard ones such as DOM or JDom, or serialized to XML in various ways (compact, pretty-printed, etc.).

You can use XPath with HtmlCleaner to get the content inside XML/HTML tags.
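To show the general idea of selecting content with XPath once the HTML has been cleaned into well-formed XML, here is a small Python sketch using the standard library's `xml.etree.ElementTree`. It does not use HtmlCleaner itself; the sample markup in `CLEANED`, the class name `article` and the helper `paragraphs_by_class` are made up for illustration, and ElementTree only supports a limited subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical example of already-cleaned, well-formed markup.
CLEANED = """<html><body>
<div class="nav"><a href="/">Home</a></div>
<div class="article"><h1>Budget approved</h1><p>First paragraph.</p><p>Second paragraph.</p></div>
<div class="footer">Copyright</div>
</body></html>"""


def paragraphs_by_class(xml_text, css_class):
    """Return the text of every <p> directly inside <div class="...">."""
    root = ET.fromstring(xml_text)
    # XPath subset: any <div> with the given class attribute, then its <p> children.
    nodes = root.findall(f".//div[@class='{css_class}']/p")
    return ["".join(p.itertext()).strip() for p in nodes]


for paragraph in paragraphs_by_class(CLEANED, "article"):
    print(paragraph)
```

This only works when you already know which element holds the article on a given site, so the expression has to be written per site.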

0

Source: https://habr.com/ru/post/909813/

