RegExp. Get only the text content of the tag (without internal tags)

Question

I have a line of HTML code.

<h2 class="some-class"> <a href="#link" class="link" id="first-link" <span class="bold">link</span> </a> NEED TO GET THIS </h2>

I need to get only the h2 text content. I created this regex:

 (?<=>)(.*)(?=<\/h2>)

But it only works if the h2 has no internal tags. Otherwise, I get the following:

  <a href="#link" class="link" id="first-link" <span class="bold">link</span> </a> NEED TO GET THIS
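The behaviour described above can be reproduced outside the browser. The snippet below is a minimal sketch, assuming a JavaScript engine with lookbehind support (for example a recent Node.js); the variable names `html`, `greedy`, and `captured` are illustrative only. The lookbehind is satisfied right after the first `>` in the line, and the greedy `.*` then runs up to the closing tag, so the inner tags are captured as well.

```javascript
// The sample line from the question, copied verbatim (including the malformed <a> tag).
const html = '<h2 class="some-class"> <a href="#link" class="link" id="first-link" <span class="bold">link</span> </a> NEED TO GET THIS </h2>';

// The pattern from the question: anything preceded by ">" and followed by "</h2>".
const greedy = /(?<=>)(.*)(?=<\/h2>)/;

const match = html.match(greedy);
const captured = match ? match[1] : null;

// Logs everything between the opening <h2 ...> and </h2>, inner tags included.
console.log(captured);
```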

+5

javascript html regex

andreyb1990 Mar 04 '17 at 16:03

3 answers

Never use a regular expression to parse HTML; see these well-known answers:

Using Regular Expressions for HTML Parsing: Why Not?

RegEx match open tags except XHTML self-contained tags

Instead, create a temporary element, set the string as its HTML, and build the result from only the direct text nodes of the h2.

 var str = `<h2 class="some-class"> <a href="#link" class="link" id="first-link" <span class="bold">link</span> </a> NEED TO GET THIS </h2>`;

 // generate a temporary DOM element
 var temp = document.createElement('div');

 // set content
 temp.innerHTML = str;

 // get the h2 element
 var h2 = temp.querySelector('h2');

 console.log(
   // get all child nodes and convert into array
   // for older browser use [].slice.call(h2...)
   Array.from(h2.childNodes)
     // iterate over elements
     .map(function(e) {
       // if text node then return the content, else return
       // empty string
       return e.nodeType === 3 ? e.textContent.trim() : '';
     })
     // join the string array
     .join('')
   // you can use reduce method instead of map
   // .reduce(function(s, e) { return s + (e.nodeType === 3 ? e.textContent.trim() : ''); }, '')
 )

Link:

Fastest way to convert JavaScript NodeList to an array?
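The same idea can be wrapped in a small reusable function. This is only a sketch: `ownText` and `TEXT_NODE` are hypothetical names, not part of the answer above, and the plain objects at the end merely stand in for DOM nodes so the logic can be checked without a browser.

```javascript
// Numeric value of a text node's nodeType (the DOM exposes it as Node.TEXT_NODE).
const TEXT_NODE = 3;

// Hypothetical helper: joins the trimmed text of the node's direct text-node children.
function ownText(node) {
  return Array.from(node.childNodes)
    .filter((child) => child.nodeType === TEXT_NODE)
    .map((child) => child.textContent.trim())
    .join('');
}

// Browser usage (assumes a DOM is available):
//   const temp = document.createElement('div');
//   temp.innerHTML = str;
//   ownText(temp.querySelector('h2'));

// DOM-free check: plain objects shaped like the h2's child nodes.
const fakeH2 = {
  childNodes: [
    { nodeType: 3, textContent: ' ' },
    { nodeType: 1, textContent: 'link' },
    { nodeType: 3, textContent: ' NEED TO GET THIS ' },
  ],
};

console.log(ownText(fakeH2)); // "NEED TO GET THIS"
```

Because each piece is trimmed before joining, text nodes separated by an element are concatenated without a space between them, which matches the snippet in the answer.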

+2

Pranav c balan Mar 04 '17 at 16:07

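 var h2 = document.querySelector('h2');
 var h2_clone = h2.cloneNode(true);

 // children is a live collection, so iterate over a static copy;
 // otherwise removing an element makes the loop skip the next one
 for (let el of Array.from(h2_clone.children)) {
   el.remove();
 }

 alert(h2_clone.innerText);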

+1

Ahmet Şimşek Mar 04 '17 at 16:09

Mohamad · Accepted Answer · 2017-03-04T16:12:26+0000

Regex is not suitable for parsing HTML, but if your HTML is invalid, or you want to use a regex anyway:

 (?!>)([^><]+)(?=<\/h2>)

It gets the last run of text before the closing </h2> tag (if there is one).
To avoid empty matches, * was changed to + .
This regex is very limited and only suits narrow cases like the one in the question.
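A quick way to see both the result and the limitation is to run the pattern in plain JavaScript. The sketch below uses illustrative names (`lastText`, `text`, `partial`), and the second input string is a made-up example, not taken from the question.

```javascript
// The sample line from the question, copied verbatim (including the malformed <a> tag).
const html = '<h2 class="some-class"> <a href="#link" class="link" id="first-link" <span class="bold">link</span> </a> NEED TO GET THIS </h2>';

// The pattern from this answer: a run of non-bracket characters right before </h2>.
const lastText = /(?!>)([^><]+)(?=<\/h2>)/;

const m = html.match(lastText);
const text = m ? m[1].trim() : '';
console.log(text); // "NEED TO GET THIS"

// Limitation: text that comes before an inner tag is not captured.
const m2 = '<h2>before <b>x</b> after</h2>'.match(lastText);
const partial = m2 ? m2[1].trim() : '';
console.log(partial); // "after"
```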
