jsoup adds extra HTML-encoded content

jsoup is adding some extra encoded values to my HTML when I parse it. I am trying to deal with it like this:

    String url = "http://iqtestsites.adtech.de/pictelatest/custombkgd/StylelistDevil.html";
    // printf substitutes the URL into the %s placeholder
    System.out.printf("Fetching %s...%n", url);
    Document doc = Jsoup.connect(url).get();
    //System.out.println(doc.html());
    Document.OutputSettings settings = doc.outputSettings();
    settings.prettyPrint(false);
    settings.escapeMode(Entities.EscapeMode.base);
    settings.charset("ASCII");
    String html = doc.html();
    System.out.println(html);

But for some reason the Entities class cannot be found, and I get an error. My imports:

    import org.jsoup.Jsoup;
    import org.jsoup.helper.Validate;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

Original HTML:

    <!DOCTYPE html>
    <html xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" class="SAF" id="global-header-light">
    <head>
    </head>
    <body>
    <div style="background-image: url(http://aka-cdn-ns.adtech.de/rm/ads/23274/HPWomenLOFT_1381687318.jpg);background-repeat: no-repeat;-webkit-background-size: 1001px 2059px; height: 2059px; width: 1001px; text-align: center; margin: 0 auto;">
    <div style="height:2058px; padding-left:0px; padding-top:36px;">
    <iframe style="height:90px; width:728px;" />
    </div>
    </div>
    </body>
    </html>

doc.html() from jsoup gives the following:

    <!DOCTYPE html>
    <html xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" class="SAF" id="global-header-light">
    <head>
    <style> </style>
    </head>
    <body>
    <div style="background-image: url(aol.jpeg); background-repeat: no-repeat;-webkit-background-size:90720;height:720; width:90; text-align: center; margin: 0 auto;">
    <div style="height:450; width:100; padding-left:681px; padding-top:200px;">
    <iframe style="height:1050px; width:300px;"></iframe> &lt;/div&gt; &lt;/div&gt; &lt;/body&gt; &lt;/html&gt;
    </div>
    </div>
    </body>
    </html>

Some escaped markup has been added right after the iframe element.

Please, help.

Thanks Swaraj

0
java html-parsing jsoup
Jan 03 '14 at 17:04
1 answer

jsoup does not actually add encoded content. It only adds closing tags that appear to be missing. Let me explain.

First of all, jsoup tries to normalize your HTML. In your case, that means it adds any closing tags that are missing. For example:

    Document doc = Jsoup.parse("<div>test<span>test");
    System.out.println(doc.html());

Output:

    <html>
     <head></head>
     <body>
      <div>
       test
       <span>test</span>
      </div>
     </body>
    </html>

If you look at the escaped text, you will see that it consists of closing tags:

    &lt;/div&gt; = </div>
    &lt;/div&gt; = </div>
    &lt;/body&gt; = </body>

If you open the page and press Ctrl + U (in Chrome), you will see the source that jsoup has to parse. Chrome colors the HTML tags that it actually recognizes. For some odd reason, it does not recognize the closing tags at the bottom (the same ones that show up as escaped characters). jsoup has trouble with those closing tags for the same reason: it treats them as text rather than as closing tags, so it escapes them, and then it normalizes the HTML by adding the missing closing tags, as I explained earlier.
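
To make the point about the escaped fragments concrete, here is a minimal sketch in plain Java (no jsoup needed). `EntityDemo` and `unescapeBasic` are hypothetical names made up for this illustration, and the helper decodes only three entities, so it is not a general HTML unescaper:

```java
/** Hypothetical demo class, not part of jsoup. */
final class EntityDemo {

    // Decodes only &lt;, &gt; and &amp;. The &amp; replacement runs last
    // so that a literal "&amp;lt;" is not decoded twice.
    static String unescapeBasic(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        // The escaped fragments that show up after the iframe in the output.
        String[] fragments = {"&lt;/div&gt;", "&lt;/body&gt;", "&lt;/html&gt;"};
        for (String fragment : fragments) {
            System.out.println(fragment + " = " + unescapeBasic(fragment));
        }
    }
}
```

Each fragment decodes to an ordinary closing tag, which is consistent with the tags having been read as text and then escaped on output.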

EDIT I was able to reproduce the behavior.

    Document doc = Jsoup.parse("<iframe /><span>test</span>");
    System.out.println(doc.html());

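You can see the same behavior. The problem is the self-closing iframe: iframe is not a void element in HTML, so the trailing slash is ignored and everything after it is read as the iframe's text content. Writing an explicit closing tag fixes the problem: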

    Document doc = Jsoup.parse("<iframe></iframe><span>test</span>");
    System.out.println(doc.html());

EDIT 2: If you just want the raw HTML without building a Document object, you can do this:

    Connection.Response html = Jsoup.connect("http://iqtestsites.adtech.de/pictelatest/custombkgd/StylelistDevil.html").execute();
    System.out.println(html.body());

With the above, you can find the self-closing iframe and replace it with a valid form (or remove it completely). Then you can parse the resulting string with Jsoup.parse(). This fixes the problem of the closing tags after the iframe not being recognized, because the markup will then be valid.
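
The replacement step could be sketched roughly as follows, using only `java.util.regex`. `IframeFixer` and `expandSelfClosingIframes` are hypothetical names invented for this example, and the pattern assumes that attribute values contain no `>` character and are quoted; treat it as a starting point rather than a robust HTML rewriter:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of jsoup: expands self-closing iframe tags. */
final class IframeFixer {

    // Matches "<iframe ... />" and captures the attribute text before the
    // trailing slash. Assumption: no '>' appears inside attribute values.
    private static final Pattern SELF_CLOSING_IFRAME =
            Pattern.compile("<iframe\\b([^>]*?)\\s*/>", Pattern.CASE_INSENSITIVE);

    static String expandSelfClosingIframes(String html) {
        Matcher matcher = SELF_CLOSING_IFRAME.matcher(html);
        // $1 puts the captured attributes back unchanged.
        return matcher.replaceAll("<iframe$1></iframe>");
    }

    public static void main(String[] args) {
        String raw = "<div><iframe style=\"height:90px; width:728px;\" /></div>";
        System.out.println(expandSelfClosingIframes(raw));
    }
}
```

In this scenario you would pass the string returned by `html.body()` through `expandSelfClosingIframes` and hand the result to `Jsoup.parse()`. Iframes that already have an explicit closing tag are left untouched, because the pattern only matches tags that end in `/>`.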

+3
Jan 04