How do I get JTidy to produce well-formed HTML documents?

I am using JTidy r938. I use this code to try to clean up the page:

    final Tidy tidy = new Tidy();
    tidy.setQuiet(false);
    tidy.setShowWarnings(true);
    tidy.setShowErrors(0);
    tidy.setMakeClean(true);
    Document document = tidy.parseDOM(conn.getInputStream(), null);

But when I parse this URL, http://www.chicagoreader.com/chicago/EventSearch?narrowByDate=This+Week&eventCategory=93922&keywords=&page=1 , not everything is cleaned up. For example, META tags on the page such as

 <META http-equiv="Content-Type" content="text/html; charset=UTF-8"> 

remain

 <META http-equiv="Content-Type" content="text/html; charset=UTF-8"> 

instead of getting a closing "</META>" tag or being written as "<META http-equiv="Content-Type" content="text/html; charset=UTF-8"/>". I confirmed this by printing the resulting JTidy org.w3c.dom.Document as a String.
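For reference, here is a minimal sketch of one way to dump an org.w3c.dom.Document to a String using only the JDK's javax.xml.transform API. It is an assumption about how such a check might be done, not the code actually used in this question. The class name DomToString and the helper sampleDocument() are hypothetical, and the sample document is built with the JDK's own DocumentBuilder as a stand-in for the result of tidy.parseDOM(...). The point it illustrates is that the serializer's output method ("xml" or "html") can affect whether empty elements such as META are written in self-closed form, independently of what the parser did.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class DomToString {

    // Serializes a DOM tree using the given output method ("xml" or "html").
    public static String serialize(Document doc, String method) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.METHOD, method);
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    // Hypothetical stand-in for the Document returned by tidy.parseDOM(...):
    // an <html><head><meta charset="UTF-8"></head></html> tree built by hand.
    public static Document sampleDocument() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
        Element html = doc.createElement("html");
        Element head = doc.createElement("head");
        Element meta = doc.createElement("meta");
        meta.setAttribute("charset", "UTF-8");
        head.appendChild(meta);
        html.appendChild(head);
        doc.appendChild(html);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        Document doc = sampleDocument();
        // With "xml" the empty meta element is written in self-closed form;
        // with "html" it is written as an unclosed HTML void element.
        System.out.println(serialize(doc, "xml"));
        System.out.println(serialize(doc, "html"));
    }
}
```

If the String is produced this way, comparing the two output methods on the same Document is a quick way to tell whether an unclosed tag comes from the parsed tree or from the serialization step.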

What can I do to make JTidy actually clean up the page, i.e. make it well formed? I understand that there are other tools, but this question is specifically about using JTidy.

+6
4 answers

You need to set several Tidy flags if you want XML output:

    private String cleanData(String data) throws UnsupportedEncodingException {
        Tidy tidy = new Tidy();
        tidy.setInputEncoding("UTF-8");
        tidy.setOutputEncoding("UTF-8");
        tidy.setWraplen(Integer.MAX_VALUE);
        tidy.setPrintBodyOnly(true);
        tidy.setXmlOut(true);
        tidy.setSmartIndent(true);
        ByteArrayInputStream inputStream = new ByteArrayInputStream(data.getBytes("UTF-8"));
        ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
        tidy.parseDOM(inputStream, outputStream);
        return outputStream.toString("UTF-8");
    }

Or, if you just want XHTML output:

    Tidy tidy = new Tidy();
    tidy.setXHTML(true);
+4

Use tidy.setXmlTags(true); to parse the input as XML instead of HTML.

+3

Use Tidy.setForceOutput(true) (at your own risk) to generate output, even if errors are detected.

+2

I parse the HTML twice to get well-formed XML:

    // First pass: clean up the HTML.
    BufferedReader br = new BufferedReader(new StringReader(str));
    StringWriter sw = new StringWriter();
    Tidy t = new Tidy();
    t.setDropEmptyParas(true);
    t.setShowWarnings(false); // hide warnings
    t.setQuiet(true);         // suppress the summary and other non-essential output
    t.setUpperCaseAttrs(false);
    t.setUpperCaseTags(false);
    t.parse(br, sw);
    StringBuffer sb = sw.getBuffer();
    String strClean = sb.toString();
    br.close();
    sw.close();

    // Second pass: treat the cleaned output as XML.
    br = new BufferedReader(new StringReader(strClean));
    sw = new StringWriter();
    t = new Tidy();
    t.setXmlTags(true);
    t.parse(br, sw);
    sb = sw.getBuffer();
    String strClean2 = sb.toString();
    br.close();
    sw.close();
+1

Source: https://habr.com/ru/post/914581/

