Is there a standard Java SE HTML parser? If so, why use non-standard ones?

Question

I need to parse a simple HTML page containing a simple form. Answers to such questions on Stack Overflow usually recommend one of a large number of non-standard Java libraries, such as TagSoup, JSoup, HTMLParser, and many others.

However, a web search shows that Java SE includes some standard functionality: http://docs.oracle.com/javase/7/docs/api/javax/swing/text/html/parser/ParserDelegator.html

My questions:

Can the standard ParserDelegator class really handle a use case like mine?
What limitations of the standard library create the need for so many non-standard libraries?
Does the fact that ParserDelegator lives in Swing preclude its use on a regular EC2 cloud server for a web application? Do I have to jump through a lot of hoops to get around the headless aspect, or is it just a small configuration setting?
If the standard parser is not recommended, which non-standard one should I use, given: (a) my desire not to deviate from the standard; (b) my simple use case; (c) my wish for a mature, reliable implementation; and (d) no size or weight restrictions, as it is a server application and not an embedded client? The API is a much lower priority, so although I appreciate JSoup's CSS-selector-like API, concerns (a)-(d) override it.

Thanks.

java html html-parsing html-parser

necromancer Jan 31 '12 at 7:14

1 answer

Alexr · Accepted Answer · 2012-01-31T07:24:22+0000

The JDK has a built-in HTML parser, but it only supports an old version of HTML (the Swing HTML package is documented as targeting HTML 3.2). It should handle basic tags, text formatting, and simple forms.
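As a rough illustration of what the built-in parser can do for a simple form, here is a minimal sketch that lists the `<input>` fields of a page using `ParserDelegator` and an `HTMLEditorKit.ParserCallback`. The class name `FormFieldLister`, the helper `listInputs`, and the sample HTML are made up for this example; they are not from the question.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, for illustration only.
public class FormFieldLister {

    /** Returns one "name=value" entry per <input> tag found in the HTML. */
    static List<String> listInputs(String html) throws IOException {
        final List<String> fields = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // <input> has no closing tag, so it is reported as a "simple" tag.
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.INPUT) {
                    Object name = attrs.getAttribute(HTML.Attribute.NAME);
                    Object value = attrs.getAttribute(HTML.Attribute.VALUE);
                    fields.add(name + "=" + value);
                }
            }
        };
        // The third argument tells the parser to ignore charset declarations.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return fields;
    }

    public static void main(String[] args) throws IOException {
        // Sample page invented for this sketch.
        String html = "<html><body><form action=\"/login\" method=\"post\">"
                + "<input type=\"text\" name=\"user\" value=\"alice\">"
                + "<input type=\"hidden\" name=\"token\" value=\"abc123\">"
                + "</form></body></html>";
        for (String field : listInputs(html)) {
            System.out.println(field);
        }
    }
}
```

The parser itself does not open any windows, so on a server it is expected to be enough to start the JVM with `-Djava.awt.headless=true`, though that is worth verifying on the target environment.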

The reason to use third-party parsers is the need to support "real" HTML pages: DHTML, JavaScript, etc.

JSoup is one of the popular parsers that can do this work. For more information about other implementations, please take a look at the following discussion:

Pure Java HTML viewer / renderer for use in scrolling
