How to get all text from all tags in one array?

I need to create an array containing all the text from a page, without using jQuery. This is my HTML:

    <html>
    <head>
        <title>Hello world!</title>
    </head>
    <body>
        <h1>Hello!</h1>
        <p>
            <div>What are you doing?</div>
            <div>Fine, and you?</div>
        </p>
        <a href="http://google.com">Thank you!</a>
    </body>
    </html>

Here is what I want to get:

    text[1] = "Hello world!";
    text[2] = "Hello!";
    text[3] = "What are you doing?";
    text[4] = "Fine, and you?";
    text[5] = "Thank you!";

Here is what I tried, but it doesn't seem to work correctly in my browser:

    var elements = document.getElementsByTagName('*');
    console.log(elements);

PS. I need to use document.getElementsByTagName('*') and exclude "script" and "style" elements.

+4
5 answers
    var array = [];
    var elements = document.body.getElementsByTagName("*");
    for (var i = 0; i < elements.length; i++) {
        var current = elements[i];
        // Check that the element has no child elements and that its text is not only whitespace
        if (current.children.length === 0 && current.textContent.replace(/\s/g, '') !== '') {
            array.push(current.textContent);
        }
    }

You can do something like this

Demo

result = ["What are you doing?", "Fine, and you?"]

Or you can use document.documentElement.getElementsByTagName('*'); to include the elements in the head as well.

Also make sure your code runs inside this:

    document.addEventListener('DOMContentLoaded', function () {
        // Code...
    });

If it is just the title you need, you can also do this:

    array.push(document.title);

This saves looping through the scripts and styles.
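
Since the question asks for "script" and "style" to be excluded, the leaf-element loop from this answer can skip those tags explicitly. The following is only a sketch under assumptions: `collectLeafText` is a hypothetical helper name, and it takes any array-like of elements, so that it can also be tried outside a browser with plain objects.

```javascript
// Hypothetical helper (not part of the original answer).
// `elements` is any array-like of elements, e.g. the result of
// document.getElementsByTagName('*').
function collectLeafText(elements) {
  var skip = { SCRIPT: true, STYLE: true };
  var texts = [];
  for (var i = 0; i < elements.length; i++) {
    var el = elements[i];
    // tagName is upper case in HTML documents; normalise it to be safe
    if (skip[String(el.tagName).toUpperCase()]) continue;
    var value = el.textContent.trim();
    // Keep only elements without child elements and with visible text
    if (el.children.length === 0 && value !== "") texts.push(value);
  }
  return texts;
}
```

In a browser this would be called as collectLeafText(document.getElementsByTagName('*')). Like the loop it is based on, it drops text that sits directly inside an element which also has child elements.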

+3

If you want the contents of the entire page, you should be able to use

 var allText = document.body.textContent; 

Prior to IE9, Internet Explorer had only the innerText property, which is similar but not identical. The MDN page about textContent has more detailed information.

One problem is that textContent also gives you the contents of any <style> or <script> elements, which may or may not be what you want. If you do not want that, you can use something like this:

    function getText(startingPoint) {
        var text = "";
        function gt(start) {
            if (start.nodeType === 3) {
                text += start.nodeValue;
            } else if (start.nodeType === 1) {
                if (start.tagName != "SCRIPT" && start.tagName != "STYLE") {
                    for (var i = 0; i < start.childNodes.length; ++i) {
                        gt(start.childNodes[i]);
                    }
                }
            }
        }
        gt(startingPoint);
        return text;
    }

Then:

    var allText = getText(document.body);

Note: this (or document.body.innerText) will give you all the text, but in depth-first document order. Retrieving all the text from the page in the order in which a person actually sees it when the page is displayed is a much harder problem, because it requires the code to understand the visual effects (and visual semantics!) of the layout as dictated by CSS (and so on).

Edit: if you want the text "stored in an array", which I assume means node by node (?), just replace the string concatenation in the example above with pushing onto an array:

    function getTextArray(startingPoint) {
        var text = [];
        function gt(start) {
            if (start.nodeType === 3) {
                text.push(start.nodeValue);
            } else if (start.nodeType === 1) {
                if (start.tagName != "SCRIPT" && start.tagName != "STYLE") {
                    for (var i = 0; i < start.childNodes.length; ++i) {
                        gt(start.childNodes[i]);
                    }
                }
            }
        }
        gt(startingPoint);
        return text;
    }
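
The array version above also collects the whitespace-only text nodes that sit between tags (line breaks and indentation). To get closer to the array the question asks for, those can be trimmed and skipped. A minimal sketch, assuming a hypothetical helper name `getTrimmedTextArray`; it only relies on nodeType, nodeValue, tagName and childNodes, so it can also be tried on plain objects outside a browser.

```javascript
// Hypothetical variant of getTextArray (not part of the original answer):
// collects trimmed, non-empty text and skips <script>/<style> subtrees.
function getTrimmedTextArray(startingPoint) {
  var TEXT_NODE = 3;
  var ELEMENT_NODE = 1;
  var text = [];
  function visit(node) {
    if (node.nodeType === TEXT_NODE) {
      var value = node.nodeValue.trim();
      if (value !== "") text.push(value);
    } else if (node.nodeType === ELEMENT_NODE &&
               node.tagName !== "SCRIPT" && node.tagName !== "STYLE") {
      for (var i = 0; i < node.childNodes.length; i++) {
        visit(node.childNodes[i]);
      }
    }
  }
  visit(startingPoint);
  return text;
}
```

In a browser, [document.title].concat(getTrimmedTextArray(document.body)) would then put the page title first, followed by the body text.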
+2
    <html>
    <head>
        <title>Hello world!</title>
    </head>
    <body>
        <h1>Hello!</h1>
        <p>
            <div>What are you doing?</div>
            <div>Fine, <span> and you? </span> </div>
        </p>
        <a href="http://google.com">Thank you!</a>
        <script type="text/javascript">
            function getLeafNodesOfHTMLTree(root) {
                if (root.nodeType == 3) {
                    return [root];
                } else {
                    var all = [];
                    for (var i = 0; i < root.childNodes.length; i++) {
                        var ret2 = getLeafNodesOfHTMLTree(root.childNodes[i]);
                        all = all.concat(ret2);
                    }
                    return all;
                }
            }
            var allnodes = getLeafNodesOfHTMLTree(document.getElementsByTagName("html")[0]);
            console.log(allnodes);
            // In modern browsers that support Array filter and map
            allnodes = allnodes.filter(function (node) {
                return node && node.nodeValue && node.nodeValue.replace(/\s/g, '').length;
            });
            allnodes = allnodes.map(function (node) {
                return node.nodeValue;
            });
            console.log(allnodes);
        </script>
    </body>
    </html>
0

Walk the DOM tree, collect all the text nodes, and take the nodeValue of each one.

    var result = [];
    var itr = document.createTreeWalker(
        document.getElementsByTagName("html")[0],
        NodeFilter.SHOW_TEXT,
        null, // no filter
        false);
    while (itr.nextNode()) {
        if (itr.currentNode.nodeValue != "") result.push(itr.currentNode.nodeValue);
    }
    alert(result);

Alternative method: split the textContent of the <html> element into lines.

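    var result = document.getElementsByTagName("html")[0].textContent.split("\n");
    // Iterate backwards so that splice() does not shift entries that have not been checked yet
    for (var i = result.length - 1; i >= 0; i--) {
        if (result[i] == "") result.splice(i, 1);
    }
    alert(result);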
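
A filter-based variant of the split approach avoids index bookkeeping and also drops lines that contain only spaces or tabs. This is just a sketch: `splitIntoLines` is a hypothetical helper name, and it works on any string, so the DOM lookup stays at the call site.

```javascript
// Hypothetical helper (not part of the original answer):
// split text into trimmed lines and drop the empty ones.
function splitIntoLines(text) {
  return text
    .split("\n")
    .map(function (line) { return line.trim(); })
    .filter(function (line) { return line !== ""; });
}
```

In a browser it would be called as splitIntoLines(document.documentElement.textContent). Note that this still includes the contents of any script and style elements, and that several pieces of text on the same source line end up in one entry.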
0

This seems to be a one-line solution (fiddle):

 document.body.innerHTML.replace(/^\s*<[^>]*>\s*|\s*<[^>]*>\s*$|>\s*</g,'').split(/<[^>]*>/g) 

This can fail if the body contains complex scripts, and I know that parsing HTML with regular expressions is not a very smart idea, but for simple cases or for demo purposes it can still be suitable, right? :)

0

Source: https://habr.com/ru/post/1492122/

