Getting all the links from a page with Jsoup

I am implementing a web robot that should get all the links from a page and then select the ones I need. It mostly works, except that links nested inside a "table" or "span" tag are not picked up. Here is my code snippet:

Document doc = Jsoup.connect(url)
        .timeout(TIMEOUT * 1000)
        .get();
Elements elts = doc.getElementsByTag("a");

And here is an example HTML:

<table>
  <tr><td><a href="www.example.com"></a></td></tr>
</table>

My code does not pick up such links, and using doc.select does not help either. My question is: how do I get all the links from the page?

EDIT: I think I know where the problem is. The page I am having trouble with is very poorly written; the HTML validator reports a huge number of errors. Could that be causing the problem?
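For reference, parsing the sample HTML from the question in isolation (a minimal sketch; the markup is copied from the question and the class name is invented) shows that Jsoup normally does find links nested inside tables, which points to the specific page's markup rather than the nesting itself:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class NestedLinkCheck {
    public static void main(String[] args) {
        // The exact HTML fragment from the question, parsed from a string
        String html = "<table><tr><td><a href=\"www.example.com\"></a></td></tr></table>";
        Document doc = Jsoup.parse(html);
        Elements elts = doc.getElementsByTag("a");
        System.out.println(elts.size());                 // the nested link is found
        System.out.println(elts.first().attr("href"));
    }
}
```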

2 answers

In general, Jsoup can handle bad HTML. Dump the HTML as Jsoup sees it after parsing (you can simply print doc.toString()).
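A minimal sketch of that debugging step (the broken markup below is invented for illustration): Jsoup repairs the tag soup while parsing, so dumping the document shows the corrected tree your selectors actually run against.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class DumpParsedHtml {
    public static void main(String[] args) {
        // Deliberately broken HTML: unclosed <a> and <td>, stray </p>
        String badHtml = "<table><tr><td><a href='x.html'>link</td></table></p>";
        Document doc = Jsoup.parse(badHtml);
        // Print the document as Jsoup sees it after error correction;
        // the <a> element survives and is selectable
        System.out.println(doc.toString());
        System.out.println(doc.select("a").size());
    }
}
```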

Tip: use select() instead of getElementsByX(); it is faster and more flexible.

Elements elts = doc.select("a");

Here is an overview of the Selector API: http://jsoup.org/cookbook/extracting-data/selector-syntax
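A few selector examples in the same vein (a sketch; the markup and id are illustrative):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class SelectorExamples {
    public static void main(String[] args) {
        String html = "<div id='nav'><a href='a.html'>A</a></div>"
                    + "<span><a href='b.html'>B</a></span>";
        Document doc = Jsoup.parse(html);
        Elements all = doc.select("a");            // every <a>, however deeply nested
        Elements withHref = doc.select("a[href]"); // only <a> elements that have an href
        Elements inNav = doc.select("#nav a");     // links inside the element with id=nav
        System.out.println(all.size() + " " + withHref.size() + " " + inNav.size());
    }
}
```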


Try this code:

String url = "http://test.com";
try {
    Document doc = Jsoup.connect(url).get();
    Elements links = doc.select("a[href]");
    for (Element link : links) {
        System.out.println("a = " + link.attr("abs:href"));
    }
} catch (IOException e) {
    e.printStackTrace();
}
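One detail worth noting: attr("abs:href") only yields an absolute URL when the document has a base URI, which Jsoup.connect(url).get() sets automatically. When parsing from a string, pass the base URI explicitly (a sketch; the URLs are placeholders):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AbsHrefDemo {
    public static void main(String[] args) {
        String html = "<a href='/page.html'>link</a>";
        // The second argument is the base URI used to resolve relative links
        Document doc = Jsoup.parse(html, "http://test.com/");
        Element link = doc.select("a[href]").first();
        System.out.println(link.attr("href"));      // the raw attribute, as written
        System.out.println(link.attr("abs:href"));  // resolved against the base URI
    }
}
```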

Source: https://habr.com/ru/post/1435457/
