Is there a proven HTML parser implemented in Java?

Question


I need to parse HTML 4 in Java. Ideally, I need an implementation compatible with SAX.

I know that there are many HTML parsers for Java, but they all seem to do tidying. In other words, they fix poorly formed HTML. I do not want that.

My requirements:

No tidying up.
If the input document is invalid, parsing should fail.
The document must be valid against the HTML DTD.
The parser must be able to produce SAX2 events.

Is there a library that meets these requirements?


java html xhtml

johnstok May 24, '09 at 17:45


4 answers

adrian.tarau · Answer 1 · 2009-05-24T18:16:54+0000

Have a look at the available HTML parsers for Java, for example TagSoup ...

Roberto Tyley · Answer 2 · 2011-02-18T12:44:53+0000

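Another option is the Jericho HTML Parser, a Java library for analysing and manipulating HTML documents.

You can try your own HTML with Jericho's online "Source Formatter" sample: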

http://jerichohtmlparser.appspot.com/samples/FormatSource.jsp

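Jericho itself does not fail on invalid input, but you can find out about every problem it encounters by giving the source a net.htmlparser.jericho.Logger (for example a WriterLogger, or an implementation of your own) and then failing yourself if anything is reported. For example: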

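    // myListeningLogger: your own net.htmlparser.jericho.Logger implementation
    Source source = new Source("<a>I forgot to close my link!");
    source.setLogger(myListeningLogger);

    // Formatting forces a full parse; NullWriter is any Writer that discards
    // its output (Apache Commons IO ships one, which is assumed here).
    source.getSourceFormatter().writeTo(new NullWriter());
    // myListeningLogger has now had all the HTML flaws written to it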

The logger's info() method is called with messages such as 'StartTag at (r1,c1,p0) missing required end tag'.

It is available from Maven Central:

http://mvnrepository.com/artifact/net.htmlparser.jericho/jericho-html


monceaux · Answer 3 · 2009-05-25T08:34:36+0000

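You might take a look at http://lobobrowser.org/cobra.jsp. Cobra is the HTML parser that belongs to a pure-Java browser (Lobo). I am not sure whether it meets your strict requirements, but it may be worth investigating.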

David Rabinowitz · Answer 4 · 2009-05-25T10:12:10+0000

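You can extend javax.swing.text.html.parser.Parser and implement its handleXXX() methods. It works much like an event-based XML parser. See the API documentation for details.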
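As a rough sketch of how the handleXXX() callbacks could be used to reject bad input: the example below does not subclass Parser directly. It goes through ParserDelegator with an HTMLEditorKit.ParserCallback, which drives the same parser and loads the JDK's bundled DTD for you. The class name HtmlErrorCollector and the method names collectErrors and parseOrFail are made up for illustration. Note that the bundled DTD is HTML 3.2, so this is not HTML 4 validation, and the parser may also report errors for markup you consider acceptable.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of any library.
final class HtmlErrorCollector {

    /** Parses the markup and returns every error message the JDK parser reported. */
    static List<String> collectErrors(String html) {
        final List<String> errors = new ArrayList<>();

        // Only handleError is overridden; the other handleXXX callbacks
        // (handleStartTag, handleText, ...) keep their empty defaults.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleError(String errorMsg, int pos) {
                errors.add(errorMsg.trim() + " (offset " + pos + ")");
            }
        };

        try {
            // Third argument: ignore any charset declared inside the document,
            // since the input is already a character stream.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return errors;
    }

    /** Strict wrapper: throws instead of silently accepting repaired markup. */
    static void parseOrFail(String html) {
        List<String> errors = collectErrors(html);
        if (!errors.isEmpty()) {
            throw new IllegalStateException("Invalid HTML: " + errors);
        }
    }

    public static void main(String[] args) {
        // Prints whatever the parser reported for the unclosed link.
        System.out.println(collectErrors("<a>I forgot to close my link!"));
    }
}
```

The parser still repairs the document internally; the "fail on invalid input" behaviour comes only from the wrapper throwing once any error has been recorded.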
