Improve JavaScript regular expression to match content inside tags with or without a closing tag, excluding self

Foreword: I know about the general consensus against using regular expressions to parse HTML. I am asking in advance: please refrain from advice along those lines.

I have the following regex:

/<div class="panel-body">([^]*?)(<\/div>|$)/gi

It matches all content, including the tags themselves, inside a div with the class panel-body.

Full match:

<div class="panel-body">
   <a href="#">Link</a>
   Line 1
   Line 2
   Line 3
</div>

... it also matches content without a closing div tag.

Full match:

<div class="panel-body">
   <a href="#">Link</a>
   Line 1
   Line 2
   Line 3
   Don't match after closing `div`...but match this and below in case closing `div` is removed.
   Line below 1
   Line below 2
   Line below 3

Question.

How can I improve my regex to do the following:

  • Exclude the opening <div class="panel-body"> and the closing </div> (when the closing div tag is present) from the full match

  • Do this directly in the full match (if possible), without using groups

regex101.com


Update 1:

The input may also end right after the opening <div class="panel-body">, for example:

<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Webmin 1.851 on centos.centos (CentOS Linux 7.3.1611)</title>
</head>
<body>
<div>
<div>
<div class="panel-body">


+4

An alternative is to use the DOM:

function divContent(str) {
  // create a new div container
  var div = document.createElement('div');

  // assign your HTML to div innerHTML
  div.innerHTML = '<html>' + str + '</html>';

  // find an element by given className
  var el = div.getElementsByClassName("panel-body");
  
  // return the innerHTML of the last matching element, or "" if there is none
  return (el.length > 0 ? el[el.length-1].innerHTML : "");
}

// extract text from a complete tag:
var html = `<div class="panel-body">
   <a href="#">Link</a>
   Line 1
   Line 2
   Line 3
</div>`;
console.log(divContent(html));

// extract text from an incomplete tag:
html = `<div class="panel-body">
   <a href="#">Link</a>
   Line 1
   Line 2
   Line 3
   Don't match after closing 'div'...but match this and below
   in case closing 'div' is removed.
   Line below 1
   Line below 2
   Line below 3`;   
console.log(divContent(html));

// OP's edited HTML text
html = `<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Webmin 1.851 on centos.centos (CentOS Linux 7.3.1611)</title>
</head>
<body>
<div>
<div>
<div class="panel-body">`;
console.log(divContent(html));

+3

With non-capturing groups the content ends up in group 1; the full match (group 0) still includes the tags.

(?:<div class="panel-body">)([^]*?)(?:<\/div>|$)

https://regex101.com/r/OJf1Rt/3
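
For the "full match without groups" part of the question, one option is lookaround assertions. The sketch below assumes an engine with lookbehind support (ES2018 or later); the helper names fullMatchContent and groupContent are made up for illustration and do not come from the thread.

```javascript
// Lookaround variant: the tags are asserted but not consumed,
// so the full match (index 0) is only the inner content.
const lookaroundRe = /(?<=<div class="panel-body">)[^]*?(?=<\/div>|$)/i;

// Non-capturing variant from above: the content is in group 1,
// while index 0 still contains the tags.
const groupRe = /(?:<div class="panel-body">)([^]*?)(?:<\/div>|$)/i;

// Hypothetical helper: returns the full match, or null if the tag is absent.
function fullMatchContent(input) {
  const m = input.match(lookaroundRe);
  return m ? m[0] : null;
}

// Hypothetical helper: returns group 1, or null if the tag is absent.
function groupContent(input) {
  const m = input.match(groupRe);
  return m ? m[1] : null;
}

const closed = '<div class="panel-body">\n   Line 1\n</div>\nafter';
const unclosed = '<div class="panel-body">\n   Line 1\n   Line below 1';

console.log(JSON.stringify(fullMatchContent(closed)));   // "\n   Line 1\n"
console.log(JSON.stringify(fullMatchContent(unclosed))); // "\n   Line 1\n   Line below 1"
console.log(groupContent(closed) === fullMatchContent(closed)); // true
```

Both patterns stop at the first </div>, so a nested div inside the panel would cut the result short.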

+2

Do you need a regex at all? Plain string methods can do this:

function parseContent(input) {
  var openingTag = '<div class="panel-body">';

  var i = input.indexOf(openingTag);
  if (i == -1) {
    return ""; // Or something else
  }

  var closingTag = '</div>';
  var closingTagLength = closingTag.length;
  var end = input.length - (input.slice(-closingTagLength) === closingTag ? closingTagLength : 0);

  return input.slice(i + openingTag.length, end);
}

EDIT:

A version that looks for the closing tag with indexOf:

function parseContent(input) {
  var openingTag = '<div class="panel-body">';

  var i = input.indexOf(openingTag);
  if (i == -1) {
    return ""; // Or something else
  }

  var closingTag = '</div>';

  var endIndex = input.indexOf(closingTag, i);
  var end = (endIndex === -1 ? input.length : endIndex);

  return input.slice(i + openingTag.length, end);
}
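
As a quick check, the indexOf version can be run against inputs shaped like the ones in the question. The function is repeated so the snippet runs on its own; the variable names and the nested input are assumptions added for illustration.

```javascript
// Copy of the indexOf-based version, so the snippet is self-contained.
function parseContent(input) {
  var openingTag = '<div class="panel-body">';
  var i = input.indexOf(openingTag);
  if (i === -1) {
    return "";
  }
  var endIndex = input.indexOf('</div>', i);
  var end = (endIndex === -1 ? input.length : endIndex);
  return input.slice(i + openingTag.length, end);
}

var closed = '<div class="panel-body">\n   Line 1\n</div>';
var unclosed = '<div class="panel-body">\n   Line 1\n   Line below 1';
var openingOnly = '<body>\n<div>\n<div class="panel-body">';
// Hypothetical input with a nested div, not taken from the question.
var nested = '<div class="panel-body"><div>inner</div> tail</div>';

console.log(JSON.stringify(parseContent(closed)));      // "\n   Line 1\n"
console.log(JSON.stringify(parseContent(unclosed)));    // "\n   Line 1\n   Line below 1"
console.log(JSON.stringify(parseContent(openingOnly))); // ""
console.log(JSON.stringify(parseContent(nested)));      // "<div>inner"
```

The last call shows the limit of this approach: the search stops at the first </div> after the opening tag, so nested divs truncate the result.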
+2

To match runs of text that do not contain <:

(^|\r|\n|\r\n)[^<]+

Or, alternatively:

\<[^div] ([^\r\n]*\n)+

If other lines follow the content you want, append its last characters to the end of the pattern:

\<[^div] ([^\r\n]*\n)+Line 3
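
One thing to watch in the patterns above: [^div] is a character class, so it matches any single character other than d, i, or v, not "anything except the word div". If the intent is "a < that does not start a div tag" (an assumption about what the pattern is meant to do), a negative lookahead expresses that instead. A small sketch of the difference:

```javascript
const charClass = /<[^div]/;   // "<" followed by one char that is not d, i, or v
const lookahead = /<(?!div)/;  // "<" not followed by the literal text "div"

console.log(charClass.test('<a href="#">'));  // true
console.log(charClass.test('<img src="x">')); // false: "i" is in the excluded set
console.log(lookahead.test('<img src="x">')); // true
console.log(lookahead.test('<div>'));         // false
```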
+1

Source: https://habr.com/ru/post/1682638/

