How do you extract text from a web page (Java)?

I plan to write a simple J2SE application to aggregate information from several web sources.

The hardest part, I think, is extracting meaningful information from web pages when it is not available as an RSS or Atom feed. For example, I could extract a list of questions from Stack Overflow, but I absolutely don't need the tag cloud or navbar.

What technique / library would you recommend?

Updates / Notes

  • Speed doesn't matter, as long as it can parse about 5 MB of HTML in less than 10 minutes.
  • It can be very simple.
+3
10 answers

You could use HTMLParser (http://htmlparser.sourceforge.net/) together with the stream from URL#openStream() (or URLConnection#getInputStream()) to parse the content of HTML pages.
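
To illustrate the idea of feeding a page stream into a parser and keeping only the text, here is a minimal sketch. It does not use HTMLParser itself: as a stand-in it uses the HTML parser that ships with the JDK (javax.swing.text.html.parser.ParserDelegator), which is old and only understands roughly HTML 3.2. The class name VisibleText and the sample markup are made up for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text chunks the JDK's HTML parser reports.
final class VisibleText {

    /** Parses HTML from the reader and returns its text chunks joined by single spaces. */
    static String extract(Reader html) throws IOException {
        final List<String> chunks = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                String text = new String(data).trim();
                if (!text.isEmpty()) {
                    chunks.add(text);
                }
            }
        };
        // 'true' tells the parser to ignore charset declarations inside the page.
        new ParserDelegator().parse(html, callback, true);
        return String.join(" ", chunks);
    }

    public static void main(String[] args) throws IOException {
        // Sample markup (an assumption); a real page would come from a URL stream.
        String page = "<html><body><h1>Questions</h1>"
                + "<p>How do I parse <b>HTML</b>?</p></body></html>";
        System.out.println(extract(new StringReader(page)));
    }
}
```

For a live page, the Reader could wrap the stream returned by URL#openStream() in an InputStreamReader with the page's charset.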

+3

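You could look at how HttpUnit does it. For parsing the HTML, one option is NekoHTML. For fetching the data, you can use what is built into the JDK (HttpURLConnection) or Apache HttpClient: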

http://hc.apache.org/httpclient-3.x/
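
For the JDK-only route (HttpURLConnection), a minimal sketch might look like the following. The class name PageFetcher, the timeout values, and the choice to pass the charset explicitly are assumptions for illustration; InputStream#readAllBytes requires Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;

// Hypothetical helper: downloads the content behind a URL as a String.
final class PageFetcher {

    static String fetch(URL url, Charset charset) throws IOException {
        URLConnection connection = url.openConnection();
        connection.setConnectTimeout(10_000); // milliseconds, arbitrary choice
        connection.setReadTimeout(30_000);    // milliseconds, arbitrary choice

        // For http/https URLs the connection is an HttpURLConnection,
        // so the status code can be checked before reading the body.
        if (connection instanceof HttpURLConnection) {
            int status = ((HttpURLConnection) connection).getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("Unexpected HTTP status " + status + " for " + url);
            }
        }
        try (InputStream in = connection.getInputStream()) {
            return new String(in.readAllBytes(), charset);
        }
    }
}
```

The returned string can then be handed to whichever HTML parser you choose. A real aggregator would probably read the charset from the Content-Type header instead of assuming one.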

+2

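You could convert the HTML to XML and use XQuery to extract the information. An IBM developerWorks article has typical code, excerpted below (the example outputs HTML):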

<table>
{
  for $d in //td[contains(a/small/text(), "New York, NY")]
  for $row in $d/parent::tr/parent::table/tr
  where contains($d/a/small/text()[1], "New York")
  return <tr><td>{data($row/td[1])}</td>
           <td>{data($row/td[2])}</td>
           <td>{$row/td[3]//img}</td> </tr>
}
</table>
+2

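In short, you can either parse the whole page and pick out what you need (for speed, look at SAXParser), or strip the HTML markup with a regular expression... You can also convert everything into a DOM, but that is expensive.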
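
As a rough illustration of the SAXParser route, the sketch below streams through a page and keeps only the text of its links, ignoring everything else. It assumes the input is already well-formed XML (XHTML); real-world HTML usually is not and would first need a tag-balancing parser. The class name LinkTextCollector and the sample markup in the usage note are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical SAX handler: collects the text inside every <a> element.
final class LinkTextCollector extends DefaultHandler {
    private final List<String> linkTexts = new ArrayList<>();
    private StringBuilder current; // non-null only while inside an <a> element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("a".equals(qName)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("a".equals(qName) && current != null) {
            linkTexts.add(current.toString().trim());
            current = null;
        }
    }

    /** Parses well-formed XHTML and returns the text of each link, in document order. */
    static List<String> collect(String xhtml) throws Exception {
        LinkTextCollector handler = new LinkTextCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.linkTexts;
    }
}
```

Calling LinkTextCollector.collect on a page fragment with a navbar div and a list of question links would return only the link texts. Because SAX never builds a tree, memory use stays flat even for large pages.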

0

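It sounds like you want to screen scrape. You would probably write a framework with an adapter/plugin per source site that parses the HTML source and extracts the text. You would use the Java I/O API to connect to the URL and stream the data via InputStreams.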

0

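If you want to do it the old-fashioned way, open a socket to the web server's port and send the following data: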

GET /file.html HTTP/1.0
Host: site.com
<ENTER>
<ENTER>

Then use Socket#getInputStream, read the data with a BufferedReader, and parse it with whatever you like.
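
The socket approach above can be sketched as follows. The class name RawHttpGet, port 80, and the ISO-8859-1 decoding are assumptions for illustration; the sketch also ignores redirects, chunked encoding, and HTTPS, which is why a real application would normally use HttpURLConnection or an HTTP client library instead.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: performs a bare HTTP/1.0 GET over a plain socket.
final class RawHttpGet {

    /** Builds the request text; the final empty line ends the header section. */
    static String buildRequest(String host, String path) {
        return "GET " + path + " HTTP/1.0\r\n"
                + "Host: " + host + "\r\n"
                + "\r\n";
    }

    /** Skips the status line and headers, then returns the remaining lines as the body. */
    static String readBody(BufferedReader in) throws IOException {
        String line;
        while ((line = in.readLine()) != null && !line.isEmpty()) {
            // header line, ignored in this sketch
        }
        StringBuilder body = new StringBuilder();
        while ((line = in.readLine()) != null) {
            body.append(line).append('\n');
        }
        return body.toString();
    }

    static String fetch(String host, String path) throws IOException {
        try (Socket socket = new Socket(host, 80)) {
            OutputStream out = socket.getOutputStream();
            out.write(buildRequest(host, path).getBytes(StandardCharsets.US_ASCII));
            out.flush();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), StandardCharsets.ISO_8859_1));
            return readBody(in);
        }
    }
}
```

With HTTP/1.0 the server closes the connection after the response, so reading until end of stream is enough to get the whole body.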

0

You can use NekoHTML to parse the HTML document. You get a DOM, and you can use XPath to retrieve the data you need.
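
The DOM-plus-XPath step can be sketched with JDK classes alone. This example does not call NekoHTML: it assumes the markup is already well-formed and uses the JDK's DocumentBuilder as a stand-in for the DOM that NekoHTML would produce from messy HTML. The class name XPathExtract and the XPath expression in the usage note are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: evaluates an XPath expression against well-formed XHTML.
final class XPathExtract {

    /** Returns the trimmed text content of every node matched by the expression. */
    static List<String> select(String xhtml, String expression) throws Exception {
        Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, dom, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent().trim());
        }
        return result;
    }
}
```

An expression such as //div[@class='question']/a would select the question links while skipping navigation markup, which is the kind of filtering the question asks for.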

0

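If your "web sources" are regular websites using HTML (as opposed to a structured XML format such as RSS), I suggest taking a look at HtmlUnit.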

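Although it is meant for testing, it is really a general-purpose "Java browser". It is built on Apache HttpClient, the NekoHTML parser, and Rhino for JavaScript support. It exposes the web page through an API.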

0

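Have you considered using RSS/Atom feeds? Why scrape content that is already available in a consumable format? RSS libraries exist for almost any language, and a feed depends far less on the page markup than scraping does.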

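If you absolutely must scrape content, look for microformats in the markup; many blogs (especially WordPress-based ones) include them. There are also parsers for extracting microformats from web pages.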

Finally, aggregation services/applications such as Yahoo Pipes may be able to do this work for you.

0

Source: https://habr.com/ru/post/1696950/

