Is there a proven HTML parser implemented in Java?

I need to parse HTML 4 in Java. Ideally, I need an implementation compatible with SAX.

I know that there are many HTML parsers for Java, but they all seem to do tidying: they silently repair poorly formed HTML. I do not want that.

My requirements:

  • No tidying up.
  • If the input document is invalid, parsing should fail.
  • The document must be valid against the HTML DTD.
  • The parser must be able to emit SAX2 events.
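To make the desired behaviour concrete, here is a minimal sketch of "fail on invalid input, validate against a DTD, emit SAX2 events" using only the JDK's own SAX parser. It rests on assumptions that are not part of the question: the input must be well-formed XML (for example XHTML), not SGML-style HTML 4, and the class name `StrictSaxSketch` and the tiny inline DTD are hypothetical stand-ins for a real HTML DTD.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

final class StrictSaxSketch {

    // Toy DTD for illustration only; a real setup would reference an (X)HTML DTD.
    static final String DTD =
        "<!DOCTYPE html [" +
        "<!ELEMENT html (body)>" +
        "<!ELEMENT body (p*)>" +
        "<!ELEMENT p (#PCDATA)>" +
        "]>";

    /** Returns true only if the document is well-formed and valid against its DTD. */
    static boolean isValid(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true); // turn on DTD validation
            XMLReader reader = factory.newSAXParser().getXMLReader();

            DefaultHandler handler = new DefaultHandler() {
                // Validity errors are only reported, not thrown, by default,
                // so rethrow them to make the parse fail.
                @Override
                public void error(SAXParseException e) throws SAXException {
                    throw e;
                }
                // fatalError() (well-formedness problems) already throws in DefaultHandler.
            };
            reader.setContentHandler(handler); // SAX2 events would be consumed here
            reader.setErrorHandler(handler);

            reader.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // invalid or not well-formed: no repair is attempted
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid(DTD + "<html><body><p>ok</p></body></html>"));
        System.out.println(isValid(DTD + "<html><body><div>x</div></body></html>"));
        System.out.println(isValid(DTD + "<html><body><p>unclosed</body></html>"));
    }
}
```

The key design choice is overriding `error()`: the parser reports DTD violations through that callback without stopping, so the handler has to throw if a validity error should abort the parse.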

Is there a library that meets these requirements?

+3
4 answers

There is a collection of Java HTML parsers listed under "HTML Parsers". One of them is TagSoup ...

+2

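I think the Jericho HTML Parser can meet at least one of your core requirements ("If the input document is invalid, the HTML parsing should fail.") in the sense that it reports the flaws it finds in the HTML, and you can choose to fail based on that information.

You can try your own HTML against Jericho with its online "Format Source" sample: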

http://jerichohtmlparser.appspot.com/samples/FormatSource.jsp

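The formatter does perform tidying, but you can discard its output and supply your own net.htmlparser.jericho.Logger (for example, a WriterLogger or something similar) that listens for the reported problems and lets you decide whether to fail. For example: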

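    // myListeningLogger: your own net.htmlparser.jericho.Logger implementation
    Source source = new Source("<a>I forgot to close my link!");
    source.setLogger(myListeningLogger);

    // Format into a Writer that discards its output; only the log matters here
    source.getSourceFormatter().writeTo(new NullWriter());
    // myListeningLogger has now had all the HTML flaws written to it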

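Your logger's info() method is then called with a message such as 'StartTag at (r1,c1,p0) missing required end tag'. Based on these messages you can decide which HTML flaws should make parsing fail and which you are willing to ignore.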

Jericho is available from Maven Central:

http://mvnrepository.com/artifact/net.htmlparser.jericho/jericho-html


+2

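You could also look at http://lobobrowser.org/cobra.jsp. It comes from a pure-Java web browser (Lobo), whose HTML parser (Cobra) can be used on its own. I am not sure whether it fails on invalid HTML, but it is worth a try, and it is at least a pure-Java solution.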

+1

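You could try extending javax.swing.text.html.parser.Parser and implementing its handleXXX() methods, much as you would handle events from an XML parser. See the API documentation for details.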
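As a rough illustration of the handleXXX() idea, the sketch below collects text and reported errors from the JDK's Swing HTML parser. Note the assumptions: it goes through `ParserDelegator` and `HTMLEditorKit.ParserCallback` (the JDK's ready-made wrapper around the parser) rather than subclassing `Parser` directly, and the class name `SwingParserSketch` is made up for this example. As far as I know the bundled DTD is HTML 3.2, so what gets reported through `handleError` may not match HTML 4 validity.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

final class SwingParserSketch {

    final List<String> errors = new ArrayList<>();   // messages passed to handleError
    final StringBuilder text = new StringBuilder();  // character data passed to handleText

    /** Parses the given HTML and returns the collected text and error messages. */
    static SwingParserSketch parse(String html) {
        SwingParserSketch result = new SwingParserSketch();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                result.text.append(data);
            }

            @Override
            public void handleError(String errorMsg, int pos) {
                // The caller decides afterwards whether any error is fatal.
                result.errors.add(pos + ": " + errorMsg);
            }
        };

        try {
            // Third argument: ignore any charset declared inside the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        SwingParserSketch r = parse("<html><body><p>hello</p></body></html>");
        System.out.println("text: " + r.text);
        System.out.println("errors: " + r.errors);
    }
}
```

A caller who wants strict behaviour could treat a non-empty `errors` list as a parse failure; whether that is strict enough for HTML 4 validation would need checking against real documents.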

0

Source: https://habr.com/ru/post/1708991/

