Jsoup.parse() vs Jsoup.parse() - or how does URL detection work in Jsoup?

Jsoup has two methods for parsing an HTML string:

  • parse(String html) - "As no base URI is specified, absolute URL detection relies on the HTML including a <base href> tag."
  • parse(String html, String baseUri) - "The URL where the HTML was retrieved from. Used to resolve relative URLs to absolute URLs, that occur before the HTML declares a <base href> tag."

I'm having a hard time understanding the difference between the two:

  • In the second version of parse(), what does "resolve relative URLs to absolute URLs, that occur before the HTML declares a <base href> tag" mean? What if a <base href> tag never appears on the page?
  • What is the purpose of absolute URL detection? Why does Jsoup need to find the absolute URL?
  • Finally, but most importantly: is baseUri the full URL of the HTML page (as stated in the original documentation), or is it the base URL of the HTML page?
1 answer

It is used by Element#absUrl(), among others, so that you can retrieve the (intended) absolute URL of an <a href>, <img src>, <link href>, <script src>, etc. E.g.

 for (Element link : document.select("a")) {
     System.out.println(link.absUrl("href"));
 }

This is very useful if you want to download and/or parse the linked resources as well.


In the second version of parse(), what does "resolve relative URLs to absolute URLs, that occur before the HTML declares a <base href> tag" mean? What if a <base href> tag never appears on the page?

Some (poorly written) websites declare a <link> or <script> with a relative URL before the <base> tag. And if there is no <base> tag at all, then only the given baseUri is used to resolve the relative URLs of the entire document.
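A minimal sketch of the "no <base> tag" case, assuming jsoup is on the classpath. The class name BaseUriDemo, the HTML snippet and the example.com URL are made up for illustration. It parses the same markup with both overloads and asks for the absolute URL of the only link:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BaseUriDemo {
    // Hypothetical page content: one relative link and no <base href> tag.
    static final String HTML =
            "<html><body><a href=\"other.html\">Other</a></body></html>";

    // parse(String html): no base URI is known, so the relative href cannot be resolved.
    static String withoutBaseUri() {
        Document doc = Jsoup.parse(HTML);
        return doc.select("a").first().absUrl("href");
    }

    // parse(String html, String baseUri): the href is resolved against the given page URL.
    static String withBaseUri() {
        Document doc = Jsoup.parse(HTML, "http://example.com/dir/index.html");
        return doc.select("a").first().absUrl("href");
    }

    public static void main(String[] args) {
        System.out.println("[" + withoutBaseUri() + "]"); // prints "[]"
        System.out.println(withBaseUri()); // prints "http://example.com/dir/other.html"
    }
}
```

In the first case absUrl() returns an empty string, because it has nothing to resolve against; attr("href") would still return the raw "other.html" in both cases.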


What is the purpose of absolute URL discovery? Why does Jsoup need to find an absolute URL?

To return the correct URL from Element#absUrl(). This is purely for the convenience of the end user. Jsoup does not need it in order to parse the HTML successfully.


Finally, but most importantly: is baseUri the full URL of the HTML page (as stated in the original documentation), or is it the base URL of the HTML page?

The former. If it were the latter, the documentation would be lying. The baseUri is not to be confused with <base href>.
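To see why passing the full page URL works, here is a small sketch that uses only the JDK's java.net.URI. This is not Jsoup's internal code; the class name ResolveDemo and the example.com URLs are made up. Standard relative-URL resolution drops the last path segment of the base (here index.html) before appending the relative part, so the full URL of the page is a perfectly good base:

```java
import java.net.URI;

public class ResolveDemo {
    // Resolves a relative reference against a base URL using standard URI resolution.
    static String resolve(String base, String relative) {
        return URI.create(base).resolve(relative).toString();
    }

    public static void main(String[] args) {
        String pageUrl = "http://example.com/dir/index.html"; // full URL of the page

        // A relative path replaces the last segment of the page URL.
        System.out.println(resolve(pageUrl, "other.html"));
        // prints "http://example.com/dir/other.html"

        // A root-relative path replaces the whole path of the page URL.
        System.out.println(resolve(pageUrl, "/img/logo.png"));
        // prints "http://example.com/img/logo.png"
    }
}
```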


Source: https://habr.com/ru/post/892394/

