Getting all the links from a page with Jsoup

I am implementing a web robot that should get all the links from a page and then select the ones I need. It mostly works, except that links nested inside a "table" or "span" tag are not picked up. Here is my code snippet:

Document doc = Jsoup.connect(url)
        .timeout(TIMEOUT * 1000)
        .get();
Elements elts = doc.getElementsByTag("a");

And here is an example HTML:

<table>
  <tr><td><a href="www.example.com"></a></td></tr>
</table>

My code does not pick up such links, and using doc.select does not help either. My question is: how do I get all the links from the page?

EDIT: I think I know where the problem is. The page I am having trouble with is very poorly written; the HTML validator reports a huge number of errors. Could that be causing the problem?
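For reference, parsing the sample HTML from the question in isolation (a minimal sketch; the markup is copied from the question and the class name is invented) shows that Jsoup normally does find links nested inside tables, which points to the specific page's markup rather than the nesting itself:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class NestedLinkCheck {
    public static void main(String[] args) {
        // The exact HTML fragment from the question, parsed from a string
        String html = "<table><tr><td><a href=\"www.example.com\"></a></td></tr></table>";
        Document doc = Jsoup.parse(html);
        Elements elts = doc.getElementsByTag("a");
        System.out.println(elts.size());                 // the nested link is found
        System.out.println(elts.first().attr("href"));
    }
}
```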

2 answers

In general, Jsoup can handle bad HTML. Dump the HTML as Jsoup sees it after parsing (you can simply print doc.toString()).
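A minimal sketch of that debugging step (the broken markup below is invented for illustration): Jsoup repairs the tag soup while parsing, so dumping the document shows the corrected tree your selectors actually run against.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class DumpParsedHtml {
    public static void main(String[] args) {
        // Deliberately broken HTML: unclosed <a> and <td>, stray </p>
        String badHtml = "<table><tr><td><a href='x.html'>link</td></table></p>";
        Document doc = Jsoup.parse(badHtml);
        // Print the document as Jsoup sees it after error correction;
        // the <a> element survives and is selectable
        System.out.println(doc.toString());
        System.out.println(doc.select("a").size());
    }
}
```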

Tip: use select() instead of getElementsByX(); it is faster and more flexible.

Elements elts = doc.select("a");

Here is an overview of the Selector API: http://jsoup.org/cookbook/extracting-data/selector-syntax
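A few selector examples in the same vein (a sketch; the markup and id are illustrative):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class SelectorExamples {
    public static void main(String[] args) {
        String html = "<div id='nav'><a href='a.html'>A</a></div>"
                    + "<span><a href='b.html'>B</a></span>";
        Document doc = Jsoup.parse(html);
        Elements all = doc.select("a");            // every <a>, however deeply nested
        Elements withHref = doc.select("a[href]"); // only <a> elements that have an href
        Elements inNav = doc.select("#nav a");     // links inside the element with id=nav
        System.out.println(all.size() + " " + withHref.size() + " " + inNav.size());
    }
}
```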


Try this code:

String url = "http://test.com";
try {
    Document doc = Jsoup.connect(url).get();
    Elements links = doc.select("a[href]");
    for (Element link : links) {
        System.out.println("a = " + link.attr("abs:href"));
    }
} catch (IOException e) {
    e.printStackTrace();
}
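One detail worth noting: attr("abs:href") only yields an absolute URL when the document has a base URI, which Jsoup.connect(url).get() sets automatically. When parsing from a string, pass the base URI explicitly (a sketch; the URLs are placeholders):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AbsHrefDemo {
    public static void main(String[] args) {
        String html = "<a href='/page.html'>link</a>";
        // The second argument is the base URI used to resolve relative links
        Document doc = Jsoup.parse(html, "http://test.com/");
        Element link = doc.select("a[href]").first();
        System.out.println(link.attr("href"));      // the raw attribute, as written
        System.out.println(link.attr("abs:href"));  // resolved against the base URI
    }
}
```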

Source: https://habr.com/ru/post/1435457/
