Load an HTML string into a DOM tree using JavaScript

Question


I am currently working with an automation framework that fetches a web page for analysis and presents it as a string for processing. The Rhino JavaScript engine is available for parsing the source of the returned page.

It seems that if the string (which is a full web page) could be loaded into a DOM representation, it would provide a very good interface for parsing and analyzing the content.

Using only JavaScript, is this possible and/or feasible?

Edit:

To expand on the question for clarification: let's say I have a string in JavaScript that contains HTML, for example:

var $mywebpage = '<!DOCTYPE HTML PUB ...//snipped//... </body></html>';

Is it possible/realistic to somehow load it into a DOM object?

+4

javascript dom web-crawler web-scraping rhino

xelco52 Feb 04 '11 at 10:08


3 answers

I am accepting JonDavidJohn's answer, since it was useful in solving my problem, but I thought this additional answer might help others who see this in the future.

It seems that although JavaScript lets you load HTML strings into a DOM element, the DOM is not part of core ECMAScript and, as such, is not available to scripts running under Rhino.

As a side note, a good alternative, implemented as of Rhino 1.6, is E4X. Although it is not a DOM implementation, it provides conceptually similar capabilities.

+1

xelco52 Feb 10 '11 at 18:43


If the document is XHTML, you can parse it using any XML parser. E4X is likely to do the job well, as are Java's built-in XML parsing interfaces.

The env.js library is designed to emulate a browser environment under Rhino, but I believe your document would also need to be valid XHTML:
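
For the XML-parser route, one option under Rhino is the XML parser that ships with Java (JAXP), reached through Rhino's `Packages` bridge. The following is only a sketch under that assumption: the function name `parseXhtml` is made up for illustration, and the input must be well-formed XML or the parser will throw.

```javascript
// Sketch for Rhino only: relies on Rhino's `Packages` bridge to Java classes.
// `parseXhtml` is a hypothetical helper name, not part of any library.
function parseXhtml(xhtml) {
    if (typeof Packages === "undefined") {
        throw new Error("Java classes are not reachable; this sketch needs Rhino");
    }
    var factory = Packages.javax.xml.parsers.DocumentBuilderFactory.newInstance();
    var builder = factory.newDocumentBuilder();
    var reader = new Packages.java.io.StringReader(xhtml);
    var source = new Packages.org.xml.sax.InputSource(reader);
    // Returns an org.w3c.dom.Document; throws if the markup is not well-formed XML.
    return builder.parse(source);
}
```

With the string from the question, `parseXhtml($mywebpage).getElementsByTagName("a")` would then return a Java `NodeList` of the links. Note that a DOCTYPE pointing at an external DTD may cause the parser to try to fetch that DTD, which can be slow or fail without network access.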

http://ejohn.org/blog/bringing-the-browser-to-the-server/

http://www.envjs.com/

However, if it is HTML, things are more complicated, because browsers are designed to be extremely lenient in how they parse it. See here for a list of HTML parsers in Java:

http://java-source.net/open-source/html-parsers

This is not an easy task. People have gone so far as to embed the Mozilla Gecko engine in Java through JNI to take advantage of its parsing capabilities.

I would recommend looking at the following pure-Java project:

http://lobobrowser.org/cobra.jsp

The goal of the Lobo project is to develop a web browser in pure Java. It is a pretty interesting project with a lot to it, but I believe you can easily use its parser on its own in your application, as described at the following link:

http://lobobrowser.org/cobra/java-html-parser.jsp

+1

jbeard4 Feb 14 '11 at 5:08


jondavidjohn · Accepted Answer · 2011-02-04T22:23:31+0000

If you have a variable containing the HTML, you can load it into a DOM element, for example one looked up by id:

    var mywebpage = '<!DOCTYPE HTML PUB ...//snipped//... </body></html>';
    var element = document.getElementById('dom-id'); // <-- element you are loading it into
    element.innerHTML = mywebpage;
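
As a minimal sketch of the same idea, the assignment can be wrapped in a small helper that receives the document object as a parameter. The names `loadHtmlInto`, `doc`, and `elementId` are illustrative and not from the original answer, and the code assumes a host that provides a DOM (a browser, or an emulation layer), which plain Rhino does not.

```javascript
// Hypothetical helper: places an HTML string inside an existing element.
// `doc` is expected to behave like a browser `document` object.
function loadHtmlInto(doc, elementId, html) {
    var element = doc.getElementById(elementId);
    if (!element) {
        throw new Error("No element with id '" + elementId + "'");
    }
    // The host's HTML parser builds the child nodes from the string.
    element.innerHTML = html;
    return element;
}
```

In a browser this would be called as `loadHtmlInto(document, 'dom-id', mywebpage)`. When a full page is assigned to `innerHTML` of an element such as a `div`, browsers generally discard the doctype and the `<html>`, `<head>`, and `<body>` wrappers and keep only their contents, so the result is a fragment rather than a complete document.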
