If you want the text content of the entire page, you should be able to use:
var allText = document.body.textContent;
Prior to IE9, Internet Explorer had an innerText property that is similar, but not identical. The MDN page about textContent has more detailed information.
One problem is that textContent also returns the contents of any <style> or <script> elements, which may or may not be what you want. If you do not want that, you can use something like this:
function getText(startingPoint) {
  var text = "";
  function gt(start) {
    if (start.nodeType === 3) {          // text node
      text += start.nodeValue;
    } else if (start.nodeType === 1) {   // element node
      if (start.tagName != "SCRIPT" && start.tagName != "STYLE") {
        for (var i = 0; i < start.childNodes.length; ++i) {
          gt(start.childNodes[i]);
        }
      }
    }
  }
  gt(startingPoint);
  return text;
}
Then:
var allText = getText(document.body);
Note: this (or document.body.innerText) will give you all the text, but in depth-first order. Getting all the text from the page in the order in which a person actually sees it when the page is rendered is a much harder problem, because it requires the code to understand the visual effects (and visual semantics!) of the layout as dictated by CSS (and so on).
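To make the depth-first ordering concrete, here is a small sketch that runs the same traversal over a hand-built tree. The el and txt helpers are hypothetical stand-ins that build plain objects shaped like DOM nodes (only nodeType, tagName, nodeValue and childNodes), so the snippet also runs outside a browser; getText is repeated so the snippet is self-contained.

```javascript
// Same traversal as getText above, repeated so this snippet runs on its own.
function getText(startingPoint) {
  var text = "";
  function gt(start) {
    if (start.nodeType === 3) {
      text += start.nodeValue;
    } else if (start.nodeType === 1 &&
               start.tagName != "SCRIPT" && start.tagName != "STYLE") {
      for (var i = 0; i < start.childNodes.length; ++i) {
        gt(start.childNodes[i]);
      }
    }
  }
  gt(startingPoint);
  return text;
}

// Hypothetical helpers: plain objects that mimic element and text nodes.
function el(tagName, children) {
  return { nodeType: 1, tagName: tagName, childNodes: children || [] };
}
function txt(value) {
  return { nodeType: 3, nodeValue: value, childNodes: [] };
}

// Mirrors: <div>A<span>B<em>C</em></span><script>x()</script>D</div>
var tree = el("DIV", [
  txt("A"),
  el("SPAN", [txt("B"), el("EM", [txt("C")])]),
  el("SCRIPT", [txt("x()")]),
  txt("D")
]);

// Text nodes are visited in source order; the <script> subtree is skipped.
var demoText = getText(tree); // "ABCD"
```

The result follows the source order of the nodes, so it stays "ABCD" even if CSS were to display the pieces in a different visual order.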
Edit: if you want the text to be "stored in an array", which I assume means node by node (?), simply replace the string concatenation in the example above with an array push:
function getTextArray(startingPoint) {
  var text = [];
  function gt(start) {
    if (start.nodeType === 3) {          // text node
      text.push(start.nodeValue);
    } else if (start.nodeType === 1) {   // element node
      if (start.tagName != "SCRIPT" && start.tagName != "STYLE") {
        for (var i = 0; i < start.childNodes.length; ++i) {
          gt(start.childNodes[i]);
        }
      }
    }
  }
  gt(startingPoint);
  return text;
}
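On a typical page the array returned by getTextArray will also contain entries made up only of whitespace, because the line breaks and indentation between tags are text nodes too. If that is not wanted, something like the following sketch could be applied to the result. The name cleanChunks is hypothetical, and collapsing runs of whitespace into a single space is an assumption about what "clean" should mean here.

```javascript
// Hypothetical helper: collapses whitespace runs, trims each chunk,
// and drops chunks that end up empty.
function cleanChunks(chunks) {
  var result = [];
  for (var i = 0; i < chunks.length; ++i) {
    var cleaned = chunks[i].replace(/\s+/g, " ").trim();
    if (cleaned !== "") {
      result.push(cleaned);
    }
  }
  return result;
}

// Sample input shaped like what getTextArray might return.
var sample = ["\n  ", "Hello,   world", "\n", "  Second\nparagraph "];
var cleanedSample = cleanChunks(sample);
// cleanedSample is ["Hello, world", "Second paragraph"]
```

In a browser this would be used as cleanChunks(getTextArray(document.body)).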