How do you extract text from a web page (Java)?

Question

How do you extract text from a web page (Java)?

I plan to write a simple J2SE application to aggregate information from several web sources.

The hardest part, I think, is to extract meaningful information from web pages if it is not available as RSS or Atom feeds. For example, I could extract a list of questions from stackoverflow, but I absolutely don't need a tag cloud or navbar.

What technique / library would you recommend?

Updates / Notes

Speed doesn't matter - as long as it can parse about 5 MB of HTML in less than 10 minutes.
It can be very simple.

+3

java html html-content-extraction

ansgri Sep 16 '08 at 11:48


10 answers

jatanp · Answer 1 · 2008-09-16T11:57:03+0000

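You could use HTMLParser (http://htmlparser.sourceforge.net/) in combination with URL#openStream() (or URLConnection#getInputStream()) to parse the content of HTML pages hosted on the Internet.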
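A minimal sketch of the download step only, using nothing but the JDK. It reads a page into a String through `URL#openStream()`; the resulting text could then be handed to whichever parser you pick. The class name `PageFetcher` and the UTF-8 assumption are illustrative and not part of the answer, and the HTMLParser library's own API is deliberately not shown here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: downloads a page and returns its markup as text.
final class PageFetcher {

    // Reads the whole stream as text.
    // Assumption: the page is UTF-8 encoded; real pages may declare another charset.
    static String readAll(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Opens the given address and reads the response body.
    static String fetch(String address) {
        try {
            return readAll(new URL(address).openStream());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling `PageFetcher.fetch(...)` with a page address returns the raw HTML; at the 5 MB scale mentioned in the question, holding a whole page in memory like this should not be a problem.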

James Law · Answer 2 · 2008-09-16T11:54:49+0000

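You could look at how httpunit does it. It uses a couple of decent HTML parsers, one of which is nekohtml. As for getting the data, you can use what is built into the JDK (HttpURLConnection) or use Apache's HttpClient: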

http://hc.apache.org/httpclient-3.x/

Joe Liversedge · Answer 3 · 2008-09-16T12:25:35+0000

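If you want to take advantage of any structural or semantic markup, you might want to explore converting the HTML to XML and using XQuery to extract the information in a standard form. Take a look at the IBM developerWorks article on this for some typical code, excerpted below (it outputs HTML, which is, of course, not required):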

<table>
{
  for $d in //td[contains(a/small/text(), "New York, NY")]
  for $row in $d/parent::tr/parent::table/tr
  where contains($d/a/small/text()[1], "New York")
  return <tr><td>{data($row/td[1])}</td> 
           <td>{data($row/td[2])}</td>              
           <td>{$row/td[3]//img}</td> </tr>
}
</table>

IcePhoenix · Answer 4 · 2008-09-16T11:51:56+0000

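In short, you can either parse the whole page and pick out the things you need (for speed, I recommend looking at SAXParser), or run the HTML through a regexp that trims off all of the HTML tags... You could also convert it all into a DOM, but that would be expensive, especially if you are aiming for decent throughput.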
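A rough sketch of the regexp approach, assuming reasonably tidy markup. The class name `TagStripper` and the three patterns are illustrative. It drops `script` and `style` blocks first (their contents are not readable text), then removes the remaining tags and collapses whitespace. It does not decode entities such as `&amp;`, and it can be fooled by a `>` inside a comment or attribute value.

```java
import java.util.regex.Pattern;

// Hypothetical helper: crude tag stripping with regular expressions.
final class TagStripper {

    // Whole <script>...</script> and <style>...</style> blocks, case-insensitive, across lines.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");

    // Any remaining tag, from '<' up to the next '>'.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]*>");

    // Runs of whitespace left behind after the tags are gone.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String strip(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }
}
```

This keeps navigation bars and tag clouds in the output, since a regexp has no notion of page structure; a parser-based approach is needed to select only part of a page.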

graham r · Answer 5 · 2008-09-16T11:52:13+0000

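You seem to want to screen-scrape. You would probably want to write a framework that, via an adapter/plugin per source site (as each site's format will differ), parses the HTML source and extracts the text. You would probably use Java's IO API to connect to the URL and stream the data via InputStreams.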
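One way the adapter-per-site idea could look, as a sketch only. All names here (`SiteAdapter`, `HeadingAdapter`, `Aggregator`) are hypothetical, and the assumption that a site wraps each item of interest in an `<h3>` element is purely for illustration; a real adapter would encode whatever structure its site actually uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical contract: one implementation per source site.
interface SiteAdapter {
    boolean supports(String url);
    List<String> extract(String html);
}

// Example adapter for a site that (by assumption) puts each item in an <h3>.
final class HeadingAdapter implements SiteAdapter {
    private static final Pattern H3 = Pattern.compile("(?is)<h3[^>]*>(.*?)</h3>");
    private final String hostFragment;

    HeadingAdapter(String hostFragment) {
        this.hostFragment = hostFragment;
    }

    @Override
    public boolean supports(String url) {
        return url.contains(hostFragment);
    }

    @Override
    public List<String> extract(String html) {
        List<String> items = new ArrayList<>();
        Matcher m = H3.matcher(html);
        while (m.find()) {
            // Remove nested tags (links, spans) and keep the visible text.
            items.add(m.group(1).replaceAll("<[^>]*>", "").trim());
        }
        return items;
    }
}

// Picks the first registered adapter that claims the URL.
final class Aggregator {
    private final List<SiteAdapter> adapters = new ArrayList<>();

    void register(SiteAdapter adapter) {
        adapters.add(adapter);
    }

    List<String> extract(String url, String html) {
        for (SiteAdapter adapter : adapters) {
            if (adapter.supports(url)) {
                return adapter.extract(html);
            }
        }
        return new ArrayList<>(); // no adapter knows this site
    }
}
```

Keeping the site-specific logic behind one interface means that when a site changes its layout, only that site's adapter needs to be touched.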

Vhaerun · Answer 6 · 2008-09-16T12:06:24+0000

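If you want to do it the old-fashioned way, you need to open a socket to the web server's port and then send the following data: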

GET /file.html HTTP/1.0
Host: site.com
<ENTER>
<ENTER>

Then use Socket#getInputStream, read the data using a BufferedReader, and parse it using whatever you like.
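The same steps as a minimal Java sketch. The class name `RawHttpGet`, port 80, and the UTF-8 decoding are assumptions for illustration. The request ends with an empty line (the two `<ENTER>` presses above), and because this is HTTP/1.0 the server normally closes the connection after the response, so reading until end of stream is enough. The returned text includes the status line and headers, not just the HTML.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: a hand-written HTTP/1.0 GET over a plain socket.
final class RawHttpGet {

    // Builds the request text; lines end in CRLF and a blank line terminates the headers.
    static String buildRequest(String host, String path) {
        return "GET " + path + " HTTP/1.0\r\n"
                + "Host: " + host + "\r\n"
                + "\r\n";
    }

    // Sends the request and returns the raw response (status line, headers, body).
    // Assumption: plain HTTP on port 80 and a UTF-8 body.
    static String fetch(String host, String path) throws IOException {
        try (Socket socket = new Socket(host, 80)) {
            OutputStream out = socket.getOutputStream();
            out.write(buildRequest(host, path).getBytes(StandardCharsets.US_ASCII));
            out.flush();

            StringBuilder response = new StringBuilder();
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8));
            String line;
            while ((line = reader.readLine()) != null) {
                response.append(line).append('\n');
            }
            return response.toString();
        }
    }
}
```

This does not handle redirects, HTTPS, or chunked responses, which is why the other answers point to HttpURLConnection or Apache HttpClient for real use.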

Alexandre Victoor · Answer 7 · 2008-09-16T12:31:41+0000

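You can use nekohtml to parse your HTML document. You will get a DOM document. You can then use XPath to retrieve the data you need.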
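A sketch of the DOM-plus-XPath step using only the JDK's `javax.xml` classes. It assumes the markup is already well-formed XML; real-world HTML usually is not, which is where a tolerant parser such as nekohtml would come in to build the DOM instead. The class name `XPathExtractor` and the sample expression are illustrative.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: evaluates an XPath expression and returns the text of each match.
final class XPathExtractor {

    // Assumption: the input is well-formed XML/XHTML; malformed HTML will fail to parse here.
    static List<String> select(String xhtml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);

            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse or query the document", e);
        }
    }
}
```

For example, an expression like `//h3[@class='question']/a` would pick out only the question links and skip a navbar or tag cloud elsewhere on the page.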

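Maxim · Answer 8 · 2008-09-16T13:05:42+0000

If your "web sources" are regular websites using HTML (as opposed to a structured XML format like RSS), I would suggest taking a look at HTMLUnit.

This tool is not just for testing; it is really a "Java browser". It is built on Apache httpclient, the Nekohtml parser, and Rhino for Javascript. It provides a really nice API to the web page and lets you traverse a website easily.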

Eric DeLabar · Answer 9 · 2008-09-16T13:12:29+0000

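Have you considered taking advantage of RSS/Atom feeds? Why scrape the content when it is usually available to you in a consumable format? There are libraries for consuming RSS in just about any language you can think of, and this will be a lot less dependent on the markup of the page than attempting to scrape the content.

If you absolutely must scrape the content, look for microformats in the markup; most blogs (especially WordPress-based blogs) have these by default. There are also libraries and parsers for locating and extracting microformats from web pages.

Finally, aggregation services/applications such as Yahoo Pipes may be able to do this work for you without reinventing the wheel.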
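To illustrate why a feed is easier to consume than scraped markup, here is a small sketch that pulls item titles out of an RSS 2.0 document with the JDK's DOM parser. The class name `RssTitles` is hypothetical, and a dedicated feed library would handle Atom, namespaces, and date parsing that this sketch ignores.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: lists the <title> of every <item> in an RSS 2.0 feed.
final class RssTitles {

    static List<String> itemTitles(String rssXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(rssXml)));

            List<String> titles = new ArrayList<>();
            NodeList items = doc.getElementsByTagName("item");
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                // Only look inside the item, so the channel's own title is skipped.
                NodeList title = item.getElementsByTagName("title");
                if (title.getLength() > 0) {
                    titles.add(title.item(0).getTextContent().trim());
                }
            }
            return titles;
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a parseable feed", e);
        }
    }
}
```

Because the feed format is fixed, the same code works for every site that publishes RSS 2.0, with no per-site adjustments.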

VNVN · Answer 10 · 2011-01-30T07:41:28+0000

http://www.alchemyapi.com/api/demo.html

They have an SDK for most platforms. It does not only text extraction but also keyword analysis, etc.
