Extract main content (maximum text density) from a news article web page

Question

I want to write code that extracts the main news content from a news site. News pages contain the main article along with advertisements, reviews, and copyright notices, and I want to get only the main article, as is done in boilerpipe, for example. I want to know how to do this myself.

So I am looking for information on how this process works.

+6

java text html-parsing webpage

Sudhanshu gupta Mar 2 '12 at 12:01

4 answers

We tried many open source tools such as Readability and Beautiful Soup, but after testing the Diffbot API we decided to use it for AppMarkt. It extracts news articles quickly and efficiently, and it works across multiple languages.

+2

Andrei Bourdine Mar 09 '14 at 12:08

jsoup provides an API for parsing HTML.

0

Allan Mar 2 '12 at 12:14

I would give htmlcleaner a try.

HtmlCleaner is a Java library used to safely parse and convert any HTML found on the Internet into well-formed XML. It is designed to be small, fast, flexible and independent. HtmlCleaner can be used in Java code, as a command line tool or as an Ant task. The result of the parsing is a lightweight document object model, which can be easily converted to standards such as DOM or JDom, or serialized to XML output in various ways (compact, pretty-printed, etc.).

You can use XPath with HtmlCleaner to get the content inside XML/HTML tags.
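To illustrate the XPath step, here is a minimal sketch that runs an XPath expression over markup that is already well-formed, which is the state HtmlCleaner is meant to produce. It uses only the JDK's built-in DOM and XPath classes rather than HtmlCleaner's own API, and the class name `XPathArticleDemo`, the `article` id and the sample markup are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: evaluates an XPath expression on well-formed (X)HTML
// and returns the trimmed text of every matching node.
final class XPathArticleDemo {

    static List<String> select(String wellFormedHtml, String expression) {
        try {
            // Parse the cleaned markup into a standard W3C DOM document.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedHtml)));

            // Evaluate the expression and collect the text of each match.
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);

            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            // Malformed input or a bad expression ends up here.
            throw new IllegalStateException("Could not evaluate XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<div id=\"nav\"><a href=\"/\">Home</a></div>"
                + "<div id=\"article\"><p>First paragraph.</p><p>Second paragraph.</p></div>"
                + "</body></html>";
        // Select only the paragraphs inside the article container.
        System.out.println(select(page, "//div[@id='article']/p"));
    }
}
```

This only works when you already know where the article lives on a given site (here, a `div` with a known id), so the expression has to be written per site.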

0

Rangag Mar 2 '12 at 12:30

Christian Kohlschütter · Accepted Answer · 2012-04-25T20:21:30+0000

The boilerpipe project pages contain the source code, a quick-start guide, links to the original scientific paper, and the corresponding video of the conference presentation:

http://code.google.com/p/boilerpipe/

This should give you a fairly complete set of information about how this works and how you can apply this in your scenario.

Best

Christian
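For readers who want a feel for the "maximum text density" idea before reading the paper, here is a rough sketch of one simple heuristic: split the page into blocks, then keep blocks that have enough words and few of them inside links. This is not boilerpipe's actual implementation. The class name `TextDensityExtractor`, the regular expressions and the thresholds are assumptions chosen for illustration, and a real extractor would use a proper HTML parser instead of regexes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, simplified content extractor based on word count and link density.
final class TextDensityExtractor {

    // Tags treated as block boundaries (an assumed, incomplete list).
    private static final String BLOCK_TAGS =
            "(?i)</?(p|div|li|ul|ol|td|tr|table|h[1-6]|br|section|article|header|footer|nav)\\b[^>]*>";
    private static final Pattern ANCHOR = Pattern.compile("(?is)<a\\b[^>]*>(.*?)</a>");

    static String stripTags(String html) {
        return html.replaceAll("(?s)<[^>]+>", " ").replaceAll("\\s+", " ").trim();
    }

    static int wordCount(String text) {
        String t = text.trim();
        return t.isEmpty() ? 0 : t.split("\\s+").length;
    }

    // Returns the text of every block with at least minWords words whose
    // share of words inside <a> elements is at most maxLinkDensity.
    static List<String> extract(String html, int minWords, double maxLinkDensity) {
        // Drop script and style elements, whose content is never article text.
        String cleaned = html.replaceAll("(?is)<(script|style)\\b[^>]*>.*?</\\1>", " ");

        List<String> kept = new ArrayList<>();
        for (String block : cleaned.split(BLOCK_TAGS)) {
            String text = stripTags(block);
            int words = wordCount(text);
            if (words == 0 || words < minWords) {
                continue; // too short: likely navigation, a caption or a footer
            }
            // Count the words that appear inside links in this block.
            int linkWords = 0;
            Matcher m = ANCHOR.matcher(block);
            while (m.find()) {
                linkWords += wordCount(stripTags(m.group(1)));
            }
            double linkDensity = linkWords / (double) words;
            if (linkDensity <= maxLinkDensity) {
                kept.add(text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String page = "<div id=\"nav\"><a href=\"/\">Home</a> <a href=\"/sport\">Sport</a></div>"
                + "<p>The city council approved the new plan on Tuesday after a long public "
                + "debate about cost and safety.</p>"
                + "<p>Copyright 2012 Example News</p>";
        // Thresholds are illustrative guesses, not tuned values.
        extract(page, 10, 0.33).forEach(System.out::println);
    }
}
```

The sketch does not decode HTML entities and treats every block independently. The paper linked above describes a more careful classification.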
