How to get an identifier using a specific word in a regular expression?

My line:

<div class="sect1" id="s9781473910270.i101"> <div class="sect2" id="s9781473910270.i102"> <h1 class="title">1.2 Summations and Products[label*summation]</h1> <p>text</p> </div> </div> <div class="sect1" id="s9781473910270.i103"> <p>sometext [ref*summation]</p> </div> <div class="figure" id="s9781473910270.i220"> <div class="metadata" id="s9781473910270.i221"> </div> <p>fig1.2 [label*somefigure]</p> <p>sometext [ref*somefigure]</p> </div> 

Purpose: 1. In the line above, [label*string] and [ref*string] mark cross-references. I need to replace each [ref*string] with an a element that has class and href attributes: href is the id of the div in which the matching label* is located, and the class of the a element is derived from the class of that div.

  2. As mentioned above, the class and href of the a element come from the class and id of the enclosing div. But if that div has class="metadata", it must be ignored: its class and id should not be used.

Expected Result:

 <div class="sect1" id="s9781473910270.i101"> <div class="sect2" id="s9781473910270.i102"> <h1 class="title">1.2 Summations and Products[label*summation]</h1> <p>text</p> </div> </div> <div class="sect1" id="s9781473910270.i103"> <p>sometext <a class="section-ref" href="s9781473910270.i102">1.2</a></p> </div> <div class="figure" id="s9781473910270.i220"> <div class="metadata" id="s9781473910270.i221"> <p>fig1.2 [label*somefigure]</p> </div> <p>sometext <a class="fig-ref" href="s9781473910270.i220">fig 1.2</a></p> </div> 

How can this be done simply, without using a DOM parser?

My idea is to store each label* string together with the id and class of its div in an array, then look up every ref* string in that array and replace it with a link built from the id and class of the matching label's div. So I tried this regex to get the label* string and the associated id and class name.
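The second half of this idea, replacing each [ref*...] once a label map exists, can be sketched as below. This is only an illustration: the $labels array is filled by hand from the sample line, the function name replaceRefs is made up, and building the map automatically (the hard part) is not shown.

```php
<?php
// Hypothetical label map, written by hand from the sample line above.
$labels = [
    'summation'  => ['id' => 's9781473910270.i102', 'class' => 'section-ref', 'text' => '1.2'],
    'somefigure' => ['id' => 's9781473910270.i220', 'class' => 'fig-ref',     'text' => 'fig 1.2'],
];

// Replace every [ref*name] with a link; unknown names are left untouched.
function replaceRefs(string $html, array $labels): string
{
    return preg_replace_callback(
        '~\[ref\*([^\]]+)\]~',
        function (array $m) use ($labels): string {
            if (!isset($labels[$m[1]])) {
                return $m[0]; // no matching label: keep the marker as-is
            }
            $l = $labels[$m[1]];
            return sprintf(
                '<a class="%s" href="%s">%s</a>',
                htmlspecialchars($l['class']),
                htmlspecialchars($l['id']),
                htmlspecialchars($l['text'])
            );
        },
        $html
    );
}

echo replaceRefs('<p>sometext [ref*summation]</p>', $labels), "\n";
// prints: <p>sometext <a class="section-ref" href="s9781473910270.i102">1.2</a></p>
```

The replacement pass is the easy half because each [ref*...] marker can be handled on its own; finding the right enclosing div for each label is where nesting comes in.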

2 answers

This approach uses the HTML structure to extract the required elements with DOMXPath. Regex is then used only in a second step, to extract information from text nodes or attributes:

    $classRel = ['sect2' => 'section-ref', 'figure' => 'fig-ref'];

    libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTML($html); // or $dom->loadHTMLFile($url);
    $xp = new DOMXPath($dom);

    // make a custom php function available for the XPath query
    // (it isn't really necessary, but it is more rigorous than writing
    // "contains(@class, 'myClass')" )
    $xp->registerNamespace("php", "http://php.net/xpath");

    function hasClass($classNode, $className) {
        if (!empty($classNode))
            return in_array($className, preg_split('~\s+~', $classNode[0]->value, -1, PREG_SPLIT_NO_EMPTY));
        return false;
    }

    $xp->registerPHPFunctions('hasClass');

    // The XPath query will find the first ancestor of a text node with '[label*'
    // that is a div tag with an id and a class attribute,
    // if the class attribute doesn't contain the "metadata" class.
    $labelQuery = <<<'EOD'
    //text()[contains(., 'label*')]
    /ancestor::div
    [@id and @class and not(php:function('hasClass', @class, 'metadata'))][1]
    EOD;

    $idNodeList = $xp->query($labelQuery);

    $links = [];

    // For each div node, a new link node is created in the associative array $links.
    // The keys are labels.
    foreach ($idNodeList as $divNode) {
        // The pattern extracts the first text part in group 1 and the label in group 2
        if (preg_match('~(\S+) .*? \[label\* ([^]]+) ]~x', $divNode->textContent, $m)) {
            $links[$m[2]] = $dom->createElement('a');
            $links[$m[2]]->setAttribute('href', $divNode->getAttribute('id'));
            $links[$m[2]]->setAttribute('class', $classRel[$divNode->getAttribute('class')]);
            $links[$m[2]]->nodeValue = $m[1];
        }
    }

    if ($links) { // if $links is empty no need to do anything
        $refNodeList = $xp->query("//text()[contains(., '[ref*')]");

        foreach ($refNodeList as $refNode) {
            // split the text with square brackets parts, the reference name is preserved in a capture
            $parts = preg_split('~\[ref\*([^]]+)]~', $refNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);

            // create a fragment to receive text parts and links
            $frag = $dom->createDocumentFragment();

            foreach ($parts as $k => $part) {
                if ($k % 2 && isset($links[$part])) { // delimiters are always odd items
                    $clone = $links[$part]->cloneNode(true);
                    $frag->appendChild($clone);
                } elseif ($part !== '') {
                    $frag->appendChild($dom->createTextNode($part));
                }
            }

            $refNode->parentNode->replaceChild($frag, $refNode);
        }
    }

    $result = '';

    $childNodes = $dom->getElementsByTagName('body')->item(0)->childNodes;

    foreach ($childNodes as $childNode) {
        $result .= $dom->saveXML($childNode);
    }

    echo $result;
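The comment about contains(@class, ...) in the code above can be made concrete with a small standalone check. The sketch below is not part of the answer's code: classListContains is a hypothetical helper that takes a plain string instead of an attribute node, and "metadata-extra" is a made-up class name used only for contrast.

```php
<?php
// Token-based class test: true only if $className is one of the
// whitespace-separated names in the class attribute value.
function classListContains(string $classAttr, string $className): bool
{
    $tokens = preg_split('~\s+~', $classAttr, -1, PREG_SPLIT_NO_EMPTY);
    return in_array($className, $tokens, true);
}

var_dump(classListContains('figure metadata', 'metadata'));  // bool(true)
var_dump(classListContains('metadata-extra', 'metadata'));   // bool(false)

// A plain substring test, which is what contains() does in XPath,
// also matches the unrelated class name:
var_dump(strpos('metadata-extra', 'metadata') !== false);    // bool(true)
```

For the sample markup both tests give the same answer; the token-based version only matters once class names share a prefix or a div carries several classes.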

This is not a problem for regular expressions. Regular expressions are (usually) for regular languages, and what you want to do is work on a context-sensitive language (linking to an identifier that was declared earlier).

So you should definitely go with a DOM parser. The algorithm would be very simple, because you can work with one node and its children.

So the theoretical answer to your question is: you cannot. In practice it might be made to work with a lot of regular expressions, but only in some ugly way.
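To illustrate why one regex on its own is not enough: the nesting of the divs has to be tracked somewhere. The sketch below uses a regex only as a tokenizer and keeps a stack of open divs by hand, which is a small part of what a DOM parser does. It assumes well-formed input where every div has double-quoted class and id attributes and a single class name; the function name collectLabels is made up, and this is not a general HTML parser.

```php
<?php
// Map each label name to the class and id of the nearest enclosing div
// that is not class="metadata".
function collectLabels(string $html): array
{
    $labels = [];
    $stack  = []; // currently open divs, innermost last

    // Tokens: opening div (group 1 = attributes), closing div, label (group 2 = name)
    preg_match_all('~<div\b([^>]*)>|</div>|\[label\*([^\]]+)\]~', $html, $tokens, PREG_SET_ORDER);

    foreach ($tokens as $t) {
        if ($t[0] === '</div>') {
            array_pop($stack);
        } elseif (isset($t[2]) && $t[2] !== '') {
            // label: walk outwards to the nearest div that is not "metadata"
            for ($i = count($stack) - 1; $i >= 0; $i--) {
                if ($stack[$i]['class'] !== 'metadata') {
                    $labels[$t[2]] = $stack[$i];
                    break;
                }
            }
        } else {
            // opening div: remember its class and id
            preg_match('~\sclass="([^"]*)"~', $t[1], $c);
            preg_match('~\sid="([^"]*)"~', $t[1], $d);
            $stack[] = ['class' => $c[1] ?? '', 'id' => $d[1] ?? ''];
        }
    }

    return $labels;
}

$html = '<div class="sect1" id="s9781473910270.i101"> <div class="sect2" id="s9781473910270.i102">'
      . ' <h1 class="title">1.2 Summations and Products[label*summation]</h1> </div> </div>'
      . ' <div class="figure" id="s9781473910270.i220"> <div class="metadata" id="s9781473910270.i221">'
      . ' <p>fig1.2 [label*somefigure]</p> </div> </div>';

$labels = collectLabels($html);
echo $labels['summation']['id'], "\n";   // s9781473910270.i102
echo $labels['somefigure']['id'], "\n";  // s9781473910270.i220 (the metadata div is skipped)
```

Anything the assumptions do not cover (comments, unquoted attributes, a stray unclosed div) breaks the stack, which is the practical reason to prefer a real parser.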


Source: https://habr.com/ru/post/988563/

