I plan to write a simple J2SE application to aggregate information from several web sources.
The hardest part, I think, is to extract meaningful information from web pages if it is not available as RSS or Atom feeds. For example, I could extract a list of questions from stackoverflow, but I absolutely don't need a tag cloud or navbar.
What technique / library would you recommend?
Updates / Notes
You could use HTMLParser (http://htmlparser.sourceforge.net/) combined with URL#openStream() (or URLConnection#getInputStream()) to parse the content of HTML pages.
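You could look at how httpunit does it. To parse html, it uses parsers such as nekohtml. To fetch the data, you can use what is built into the jdk (httpurlconnection) or apache HttpClient:
http://hc.apache.org/httpclient-3.x/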
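As a rough illustration of the JDK-only route, the sketch below downloads a page with HttpURLConnection and reads it into a String. The class name `PageFetcher`, the 10-second timeouts, and the UTF-8 charset are assumptions made for the example, not something the answers specify.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: fetches a page using only JDK classes.
final class PageFetcher {

    /** Reads the whole stream as text. Assumption: the page is UTF-8 encoded. */
    static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    /** Opens an HTTP connection to the given address and returns the response body. */
    static String fetch(String address) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setConnectTimeout(10_000); // assumed timeout values
        conn.setReadTimeout(10_000);
        try {
            return readAll(conn.getInputStream());
        } finally {
            conn.disconnect();
        }
    }
}
```

The returned String can then be handed to whichever parser you choose.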
Another option is to convert the HTML to XML and query it with XQuery. An IBM developerWorks article gives an example of this approach:
<table> {
  for $d in //td[contains(a/small/text(), "New York, NY")]
  for $row in $d/parent::tr/parent::table/tr
  where contains($d/a/small/text()[1], "New York")
  return <tr><td>{data($row/td[1])}</td>
             <td>{data($row/td[2])}</td>
             <td>{$row/td[3]//img}</td> </tr>
} </table>
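In short, you can either parse the whole page and pick out the parts you need (for speed, look at SAXParser), or strip the markup from the HTML and work with what remains... You can also convert the whole page to a DOM, but that is more expensive, especially if throughput matters.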
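A minimal sketch of the streaming approach with the JDK's SAXParser follows. It collects the text of every `<a>` element and ignores everything else. It assumes the input is well-formed XHTML; real-world HTML usually needs a tolerant parser first. The class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: streams through the page without building a tree.
final class SaxLinkExtractor {

    /** Returns the text content of each <a> element. Assumption: input is well-formed XHTML. */
    static List<String> linkTexts(String xhtml) throws Exception {
        final List<String> texts = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder current; // non-null only while inside an <a> element

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if ("a".equalsIgnoreCase(qName)) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (current != null) {
                    current.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("a".equalsIgnoreCase(qName) && current != null) {
                    texts.add(current.toString().trim());
                    current = null;
                }
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xhtml)), handler);
        return texts;
    }
}
```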
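It sounds like you want to screen scrape. You would probably write a small framework with an adapter/plugin per source site (since each site's format differs) that parses the html and extracts the text. You can use the java io API to connect to a URL and read the data via InputStreams.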
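The adapter-per-site idea could look roughly like the sketch below. Every name here (`SiteAdapter`, `AdapterRegistry`, `HeadingAdapter`) is hypothetical, and the regular expression is a deliberately naive stand-in for a real HTML parser; it also assumes the site of interest puts each item in an `<h3>` element.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** One implementation per source site, since each site's markup differs. */
interface SiteAdapter {
    List<String> extract(String html);
}

/** Maps a host name to the adapter that understands that site's pages. */
final class AdapterRegistry {
    private final Map<String, SiteAdapter> adapters = new HashMap<>();

    void register(String host, SiteAdapter adapter) {
        adapters.put(host, adapter);
    }

    List<String> extract(String host, String html) {
        SiteAdapter adapter = adapters.get(host);
        if (adapter == null) {
            throw new IllegalArgumentException("No adapter registered for " + host);
        }
        return adapter.extract(html);
    }
}

/** Assumed site layout: each item of interest is the text of an <h3> element. */
final class HeadingAdapter implements SiteAdapter {
    // Naive pattern for illustration only; a real adapter would use an HTML parser.
    private static final Pattern H3 =
            Pattern.compile("<h3[^>]*>(.*?)</h3>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    @Override
    public List<String> extract(String html) {
        List<String> items = new ArrayList<>();
        Matcher m = H3.matcher(html);
        while (m.find()) {
            items.add(m.group(1).trim());
        }
        return items;
    }
}
```

Keeping the site-specific logic behind one interface means a markup change on one site only touches that site's adapter.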
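If you want to do it the old-fashioned way, connect to the web server's port with a socket and send the following data: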
GET /file.html HTTP/1.0
Host: site.com
<ENTER>
<ENTER>
Then use Socket#getInputStream, read the data with a BufferedReader, and parse it with whatever you like.
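The raw-socket approach can be sketched as follows. The class name `RawHttpGet` is hypothetical, and the sketch assumes plain HTTP (no TLS) and a server that closes the connection after an HTTP/1.0 response. The returned text includes the status line and headers, which the caller still has to skip.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical example class: speaks HTTP/1.0 by hand over a plain socket.
final class RawHttpGet {

    /** Builds the request text; the blank line (final CRLF) ends the header section. */
    static String buildRequest(String host, String path) {
        return "GET " + path + " HTTP/1.0\r\n"
                + "Host: " + host + "\r\n"
                + "\r\n";
    }

    /** Sends the request and returns the raw response (status line, headers and body). */
    static String fetch(String host, int port, String path) throws IOException {
        try (Socket socket = new Socket(host, port)) {
            OutputStream out = socket.getOutputStream();
            out.write(buildRequest(host, path).getBytes(StandardCharsets.US_ASCII));
            out.flush();

            StringBuilder response = new StringBuilder();
            // Assumption: ISO-8859-1 is good enough for reading headers and simple bodies.
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), StandardCharsets.ISO_8859_1));
            String line;
            while ((line = reader.readLine()) != null) { // ends when the server closes the socket
                response.append(line).append('\n');
            }
            return response.toString();
        }
    }
}
```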
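You can use nekohtml to parse your html document. You will get a DOM document. You can then use XPATH to retrieve the data you need.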
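To illustrate the DOM plus XPath step, the sketch below uses only the JDK's built-in parser and `javax.xml.xpath`. It assumes the markup is already well-formed XHTML; with real pages, the DOM would come from a tolerant parser such as nekohtml instead. The class name and the sample expression are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class: selects nodes from a DOM with an XPath expression.
final class XPathExtractor {

    /** Returns the trimmed text of every node matched by the expression. */
    static List<String> select(String xhtml, String expression) throws Exception {
        // Assumption: input is well-formed, so the JDK parser can build the DOM.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent().trim());
        }
        return result;
    }
}
```

An expression such as `//div[@class='question']/h3` picks out the question titles while skipping navigation and other page furniture.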
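If your "web sources" are regular websites using HTML (as opposed to a structured XML format such as RSS), I would suggest taking a look at HTMLUnit.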
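HTMLUnit is a "GUI-less browser for Java programs". It builds on Apache httpclient, Nekohtml and Rhino for Javascript support. It provides an API for working with a page and navigating a website.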
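Have you considered using RSS/Atom feeds? Why scrape the content when it is often already available in a consumable format? There are libraries for consuming RSS in almost any language, and a feed is far less dependent on the page's markup than scraping is.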
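For comparison with scraping, here is a minimal sketch of reading item titles from an RSS 2.0 feed with the JDK's DOM parser. The class name `RssTitles` is hypothetical, and a dedicated feed library would handle Atom, namespaces, and malformed feeds more robustly.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class: pulls the <title> of each <item> out of an RSS 2.0 document.
final class RssTitles {

    static List<String> itemTitles(String rssXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(rssXml)));
        NodeList items = doc.getElementsByTagName("item");
        List<String> titles = new ArrayList<>();
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            // Only look inside the item, so the channel's own <title> is skipped.
            NodeList titleNodes = item.getElementsByTagName("title");
            if (titleNodes.getLength() > 0) {
                titles.add(titleNodes.item(0).getTextContent().trim());
            }
        }
        return titles;
    }
}
```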
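If you really must scrape, look for microformats in the markup; most blogs (especially WordPress-based ones) include them by default. There are also libraries and parsers for locating and extracting microformats from web pages.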
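Finally, aggregation services/applications such as Yahoo Pipes may be able to do this work for you, without reinventing the wheel.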
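Take a look at http://www.alchemyapi.com/api/demo.html
They provide an SDK for a number of platforms. Besides text extraction, they also do keyword analysis, etc.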
Source: https://habr.com/ru/post/1696950/