Convert HtmlPage to HTML string?

I use HtmlUnit to generate the HTML for various pages, but right now the closest I can get to the raw HTML returned by the server is to convert the HtmlPage to an XML string.

This is somewhat annoying because the XML output is rendered by web browsers differently than raw HTML. Is there a way to convert HtmlPage to raw HTML instead of XML?

Thanks!

+6
5 answers

page.asXml() will return HTML. page.asText() returns it in plain text.

+8

I'm not 100% sure that I understood the question correctly, but maybe this will solve your problem:

page.getWebResponse().getContentAsString()

+5

I think there is no direct way to get the final page as HTML. asXml() returns the result as XML, and asText() returns the extracted text content.

The best you can do is call asXml() and "convert" the result to HTML:

 htmlPage.asXml().replaceFirst("<\\?xml version=\"1.0\" encoding=\"(.+)\"\\?>", "<!DOCTYPE html>") 

(Of course, you can apply more conversions, such as converting <br/> to <br>; it depends on your requirements.)
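The replacements described above can be sketched as a small helper that works on any string produced by asXml(). This is only an illustration using java.util.regex: the class name XmlToHtml, the method toHtml, and the choice of br and hr as the tags to rewrite are assumptions, not part of HtmlUnit. It also only handles self-closed tags without attributes.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of HtmlUnit: post-processes the string returned by asXml().
final class XmlToHtml {
    // Matches an XML declaration at the very start of the input, plus trailing whitespace.
    private static final Pattern XML_DECLARATION =
            Pattern.compile("^\\s*<\\?xml[^>]*\\?>\\s*");

    // Matches attribute-free self-closed tags such as <br/> or <hr />.
    private static final Pattern SELF_CLOSED_VOID =
            Pattern.compile("<(br|hr)\\s*/>");

    static String toHtml(String xml) {
        // Swap the XML declaration for an HTML5 doctype; input without a
        // declaration is left unchanged by this step.
        String withDoctype = XML_DECLARATION.matcher(xml).replaceFirst("<!DOCTYPE html>\n");
        // Rewrite <br/> as <br> and <hr/> as <hr>.
        return SELF_CLOSED_VOID.matcher(withDoctype).replaceAll("<$1>");
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<html><body>first<br/>second</body></html>";
        System.out.println(toHtml(xml));
    }
}
```

With HtmlUnit on the classpath, the call would presumably look like XmlToHtml.toHtml(htmlPage.asXml()). Which other tags need rewriting depends on the pages involved.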

Even related Google documentation recommends this approach (although they do not apply any transformations):

 // return the snapshot
 out.println(page.asXml());

+1

I don't know of an answer short of switching on the page type; for XmlPage and SgmlPage you have to take the innerHTML of the HTML element and write out its attributes manually. It is neither elegant nor exact (the doctype is missing), but it works.

page.getWebResponse().getContentAsString()

This is not right, because it returns the text of the original response bytes, unrendered and without any JavaScript applied. If JavaScript runs and modifies the page, this method will not see the changes.

page.asXml() will return HTML. page.asText() returns it in plain text.

I just want to confirm that asText() returns only the text inside text nodes and does not include tags or their attributes. If you want the complete HTML, it is not good enough.

0

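Perhaps you want to go with something like this instead of using the HtmlUnit framework methods:

 import java.io.BufferedReader;
 import java.io.IOException;
 import java.io.InputStreamReader;
 import java.net.URL;

 // Reads the raw HTML of a URL line by line; the method name is illustrative.
 static String fetchHtml(URL url) {
     StringBuilder htmlSource = new StringBuilder();
     try (InputStreamReader isr = new InputStreamReader(url.openConnection().getInputStream());
          BufferedReader br = new BufferedReader(isr)) {
         String line;
         while ((line = br.readLine()) != null) {
             htmlSource.append(line).append("\n");
         }
         return htmlSource.toString();
     } catch (IOException e) {
         e.printStackTrace();
         return null; // the original snippet returned nothing on this path
     }
 }

0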

Source: https://habr.com/ru/post/891480/