Parsing an HTML file using Java

How can I remove comments and their content from an HTML file using Java, where comments are written as follows:

<!--

Any ideas or help would be appreciated.

3 answers

Take a look at JTidy, the Java port of HTML Tidy. You can override the printing methods of the PPrint object so that comment tags are skipped.


Unless you already have valid XHTML, you should first run JTidy to tidy up the HTML and make it valid XHTML.

See this for example JTidy code.

Then parse the result into an HTML DOM, like this:

// "string" holds the tidied XHTML markup
final DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
final DocumentBuilder documentBuilder = factory.newDocumentBuilder();
final Document document = documentBuilder.parse(new InputSource(new StringReader(string)));

You can then walk the document tree and remove the comment nodes.
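As a rough sketch of the DOM route, the snippet below parses a well-formed XHTML string, detaches every comment node, and serializes the tree back to text. It uses only the JDK's built-in XML APIs. The class and method names (DomCommentRemover, removeComments, stripComments) are made up for illustration, and it assumes the input has already been tidied into well-formed XHTML.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of JTidy or the JDK.
class DomCommentRemover {

    // Recursively detaches every comment node found below 'parent'.
    static void removeComments(Node parent) {
        Node child = parent.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember the sibling before removing
            if (child.getNodeType() == Node.COMMENT_NODE) {
                parent.removeChild(child);
            } else {
                removeComments(child);
            }
            child = next;
        }
    }

    // Assumes 'xhtml' is well-formed; anything else is reported as an IllegalArgumentException.
    static String stripComments(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document document = builder.parse(new InputSource(new StringReader(xhtml)));

            removeComments(document);

            // An identity transform writes the modified tree back out as text.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.METHOD, "xml");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(document), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | SAXException | IOException | TransformerException e) {
            throw new IllegalArgumentException("Could not parse or serialize the markup", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><!-- hidden --><p>visible<!--\nmore\n--></p></body></html>";
        System.out.println(stripComments(xhtml));
    }
}
```

Because the comments are removed as nodes rather than as text, comment-like sequences inside attribute values or CDATA sections are left alone.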


Try a simple regex like this (the backslashes are doubled because the pattern is a Java string literal):

String commentless = pageString.replaceAll("<!--[\\w\\W]*?-->", "");

Edit: to explain the regex:

  • <!-- matches the literal start of a comment
  • [\w\W] matches any single character (even a newline) inside the comment
  • *? repeats that "any character" match zero or more times, but as few times as possible (non-greedy), so the match stops at the first closing marker
  • --> matches the literal end of the comment
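To make the non-greedy behaviour concrete, here is a small self-contained sketch that applies the pattern to a string containing two comments, one of them spanning several lines. The class name RegexCommentStripper and its methods are invented for this illustration. The greedy variant is included only to show what the ? prevents.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for the regex approach.
class RegexCommentStripper {

    // <!--    literal start of a comment
    // [\w\W]  any character, including line breaks
    // *?      as few repetitions as possible (non-greedy)
    // -->     literal end of the comment
    private static final Pattern COMMENT = Pattern.compile("<!--[\\w\\W]*?-->");

    // Same pattern without the '?': runs from the first "<!--" to the LAST "-->".
    private static final Pattern GREEDY_COMMENT = Pattern.compile("<!--[\\w\\W]*-->");

    static String stripComments(String html) {
        return COMMENT.matcher(html).replaceAll("");
    }

    static String stripGreedy(String html) {
        return GREEDY_COMMENT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String page = "<p>one</p><!-- first --><p>two</p><!--\nsecond\n--><p>three</p>";

        System.out.println(stripComments(page));
        // prints: <p>one</p><p>two</p><p>three</p>

        System.out.println(stripGreedy(page));
        // prints: <p>one</p><p>three</p>   (the paragraph between the comments is lost)
    }
}
```

A plain regex does not know about HTML structure, so text that merely looks like a comment, for example inside a script block, would be removed as well.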

Source: https://habr.com/ru/post/1704883/

