If the document is XHTML, you can parse it using any XML parser. E4X is likely to do the job well, as are the built-in Java XML parsing interfaces.
The env.js library is designed to emulate a browser environment under Rhino, but I believe it also requires your document to be XHTML-compliant:
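As a rough illustration of the XML route, here is a minimal sketch that uses the JDK's built-in `javax.xml.parsers` API. The class name `XhtmlParseSketch`, the helper `firstLinkHref`, and the sample markup are made up for this example, and it assumes the input is well-formed XHTML without a DOCTYPE (with one, the parser may try to fetch the external DTD).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XhtmlParseSketch {

    // Hypothetical helper: returns the href of the first <a> element, or null if there is none.
    static String firstLinkHref(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            NodeList links = doc.getElementsByTagName("a");
            if (links.getLength() == 0) {
                return null;
            }
            return ((Element) links.item(0)).getAttribute("href");
        } catch (Exception e) {
            // Anything that is not well-formed XML ends up here.
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<a href=\"http://www.envjs.com/\">env.js</a>"
                + "</body></html>";
        System.out.println(firstLinkHref(xhtml));
    }
}
```

Since Rhino can call Java classes directly, the same `DocumentBuilder` calls should be usable from a script as well.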
http://ejohn.org/blog/bringing-the-browser-to-the-server/
http://www.envjs.com/
However, if it is HTML, things are more complicated, because browsers are designed to be extremely lenient in the way they parse markup. See here for a list of HTML parsers in Java:
http://java-source.net/open-source/html-parsers
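To see why a plain XML parser is not enough for ordinary HTML, here is a small sketch that feeds typical "tag soup" to the JDK's XML parser. The class name `TagSoupCheck` and the method `isWellFormedXml` are invented for this example; it only shows the strict parser rejecting markup that a browser would accept.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class TagSoupCheck {

    // Hypothetical helper: true if the markup parses as well-formed XML, false otherwise.
    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Thrown for unclosed or mismatched tags and similar problems.
            return false;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // An unclosed <br> is fine for a browser but fatal for an XML parser.
        System.out.println(isWellFormedXml("<p>one<br>two</p>"));
        // The XHTML spelling of the same fragment parses cleanly.
        System.out.println(isWellFormedXml("<p>one<br/>two</p>"));
    }
}
```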
This is not an easy task. People have gone so far as to embed the Mozilla Gecko engine in Java through JNI to take advantage of its parsing capabilities.
I would recommend you study the following pure-Java project:
http://lobobrowser.org/cobra.jsp
The goal of the Lobo project is to develop a web browser in pure Java. It is a pretty interesting project, and there is a lot to it, but I believe that you can easily use its parser on its own in your application, as described at the following link:
http://lobobrowser.org/cobra/java-html-parser.jsp