Regex to match strings with and without special / accented characters?

Is there a regular expression that matches a particular string both with and without special characters? The match should be accent-insensitive, so to speak.

For example, céra should match cera and vice versa.

Any ideas?

Edit: I want to match specific strings with and without special / accented characters, not just a single character.

Testing example:

    $clientName = 'céra';
    $this->search = 'cera';

    $compareClientName = strtolower(iconv('utf-8', 'ascii//TRANSLIT', $clientName));
    $this->search = strtolower($this->search);

    if (strpos($compareClientName, $this->search) !== false) {
        $clientName = preg_replace('/(.*?)('.$this->search.')(.*?)/iu', '$1<span class="highlight">$2</span>$3', $clientName);
    }

Output: <span class="highlight">céra</span>

As you can see, I want to highlight a specific search string. However, I still want to display the original (accented) characters of the matched string.

It seems I need to combine this with Michael Sivolobov's answer.

I think I have to work with separate preg_match() and preg_replace() calls, right?

+6
4 answers

You can use the \p{L} pattern to match any letter.


You must add the u modifier after the regular expression to enable Unicode mode.

Example: /\p{L}+/u
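A minimal sketch of that pattern in use, assuming the input is valid UTF-8 (the sample string is made up for illustration):

```php
<?php
// Sample text, assumed to be valid UTF-8.
$text = 'Céra, cera & Çédille 42';

// \p{L}+ matches runs of letters, accented or not. The u modifier makes
// PCRE treat both the pattern and the subject as UTF-8.
preg_match_all('/\p{L}+/u', $text, $matches);

print_r($matches[0]); // Céra, cera, Çédille
```

Without the u modifier the subject is processed byte by byte, so multibyte letters such as é are not treated as single letters.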

Edit:

Try something like this. It replaces every accented letter in the search string with a pattern that matches the accented letter (both as a single character and as a base letter followed by a Unicode combining mark) as well as the unaccented letter. You can then use the adjusted search pattern to highlight the text.

    function mbStringToArray($string) {
        $array = [];
        $strlen = mb_strlen($string, "UTF-8");
        while ($strlen) {
            $array[] = mb_substr($string, 0, 1, "UTF-8");
            $string = mb_substr($string, 1, $strlen, "UTF-8");
            $strlen = mb_strlen($string, "UTF-8");
        }
        return $array;
    }

    // I had to use this ugly function to remove accents as iconv didn't work properly on my test server.
    // Note: utf8_encode() and utf8_decode() are deprecated as of PHP 8.2.
    function stripAccents($stripAccents) {
        return utf8_encode(strtr(utf8_decode($stripAccents), utf8_decode('àáâãäçèéêëìíîïñòóôõöùúûüýÿÀÁÂÃÄÇÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝ'), 'aaaaaceeeeiiiinooooouuuuyyAAAAACEEEEIIIINOOOOOUUUUY'));
    }

    $clientName = 'céra';
    $clientNameNoAccent = stripAccents($clientName);
    $clientNameArray = mbStringToArray($clientName);

    // Replace each accented character with a group of alternatives.
    foreach ($clientNameArray as $pos => &$char) {
        $charNA = $clientNameNoAccent[$pos];
        if ($char != $charNA) {
            $char = "(?:$char|$charNA|$charNA\p{M})";
        }
    }
    unset($char); // break the reference left over from the loop

    $clientSearchPattern = implode($clientNameArray); // c(?:é|e|e\p{M})ra

    $text = 'the client name is Céra but it could be Cera or céra too.';
    $search = preg_replace('/(.*?)(' . $clientSearchPattern . ')(.*?)/iu', '$1<span class="highlight">$2</span>$3', $text);
    echo $search; // the client name is <span class="highlight">Céra</span> but it could be <span class="highlight">Cera</span> or <span class="highlight">céra</span> too.
+8

If you want to know whether a letter carries an accent or some other mark, you can check by matching the pattern \p{M}.
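As a small sketch of what \p{M} does and does not catch (the two variables are illustrative names, not from the question), compare the two ways of encoding é:

```php
<?php
// "é" as one precomposed code point (U+00E9).
$precomposed = "\u{00E9}";
// "é" as "e" followed by COMBINING ACUTE ACCENT (U+0301).
$decomposed = "e\u{0301}";

// \p{M} only matches separate mark code points, so it finds the
// combining accent but not the precomposed letter.
var_dump(preg_match('/\p{M}/u', $precomposed)); // int(0)
var_dump(preg_match('/\p{M}/u', $decomposed));  // int(1)
```

So \p{M} on its own detects accents only in decomposed text; a precomposed é has to be listed explicitly in the pattern.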

UPDATE

You need to convert every accented letter in your pattern into a group of alternatives:

e.g. céra -> c(?:é|e|e\p{M})ra

Why did I add e\p{M}? Because your letter é can be a single character in Unicode, or a combination of two characters (e followed by a combining accent). e\p{M} matches the second form: an e followed by a separate accent character.

Once you have transformed your pattern to match all of these variants, you can use it in preg_match().
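The transformation can also be driven from the unaccented side, which is closer to the question (search for cera, highlight céra). The following is only a sketch: accentInsensitivePattern() is a hypothetical helper, its variant table is illustrative rather than complete, and it assumes the search term has already been stripped of accents.

```php
<?php
/**
 * Hypothetical helper: builds an accent-insensitive PCRE pattern from an
 * unaccented search term. The variant table is illustrative, not complete.
 */
function accentInsensitivePattern(string $search): string
{
    // Base letter => precomposed accented variants (lowercase only; the
    // i modifier takes care of case).
    $variants = [
        'a' => 'àáâãä', 'c' => 'ç', 'e' => 'èéêë', 'i' => 'ìíîï',
        'n' => 'ñ', 'o' => 'òóôõö', 'u' => 'ùúûü', 'y' => 'ýÿ',
    ];

    $pattern = '';
    // Split the search term into UTF-8 characters.
    foreach (preg_split('//u', $search, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        $base = strtolower($char);
        if (isset($variants[$base])) {
            // The base letter or a precomposed variant, optionally
            // followed by combining marks.
            $pattern .= '[' . $base . $variants[$base] . ']\p{M}*';
        } else {
            // Anything else is matched literally.
            $pattern .= preg_quote($char, '/');
        }
    }
    return '/' . $pattern . '/iu';
}

$text = 'the client name is Céra but it could be Cera or céra too.';
echo preg_replace(
    accentInsensitivePattern('cera'),
    '<span class="highlight">$0</span>',
    $text
);
// the client name is <span class="highlight">Céra</span> but it could be <span class="highlight">Cera</span> or <span class="highlight">céra</span> too.
```

Because the replacement uses $0, the highlighted text is whatever was actually matched, so the original accented characters are preserved.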

+7

A POSIX equivalence class is designed to match characters with the same sort order. It is written with the following expression:

 [=a=] 

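This will match á and ä as well as a, depending on your locale. Note that PCRE, which backs PHP's preg_* functions, recognizes this syntax but does not support it, so this only helps in engines with full POSIX bracket-expression support.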

+2

As you noted in one of the comments, you do not need a regular expression for this, since the goal is to find specific strings. Why don't you use explode()? For instance:

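    $clientName = 'céra';
    $this->search = 'cera';

    $compareClientName = strtolower(iconv('utf-8', 'ascii//TRANSLIT', $clientName));
    $this->search = strtolower($this->search);

    // explode() splits $this->search on the transliterated client name.
    $pieces = explode($compareClientName, $this->search);
    if (count($pieces) > 1) {
        // More than one piece means the name occurred; glue the pieces back
        // together with the original (accented) name wrapped in a highlight span.
        $clientName = implode('<span class="highlight">'.$clientName.'</span>', $pieces);
    }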

Edit:

If your $search variable may contain special characters too, why not transliterate it as well and use mb_strpos() with an $offset? E.g.:

    $offset = 0;
    $highlighted = '';
    $len = mb_strlen($compareClientName, 'UTF-8');

    // mb_strpos() returns false (not -1) when there is no further match.
    while (($pos = mb_strpos($this->search, $compareClientName, $offset, 'UTF-8')) !== false) {
        $highlighted .= mb_substr($this->search, $offset, $pos - $offset, 'UTF-8')
            . '<span class="highlight">'
            . mb_substr($this->search, $pos, $len, 'UTF-8')
            . '</span>';
        $offset = $pos + $len;
    }

    // Append the remainder; the third argument is the length (null = to the end),
    // the encoding is the fourth.
    $highlighted .= mb_substr($this->search, $offset, null, 'UTF-8');
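One way to keep the original accented characters in the output is to search in an accent-folded copy and cut the pieces out of the original string at the same character offsets. This is a sketch under assumptions, not the answerer's code: foldAccents() and highlight() are hypothetical helpers, the folding table is illustrative, the input is assumed to be precomposed UTF-8, and lowercasing is assumed not to change the character count.

```php
<?php
/**
 * Hypothetical helper: folds accented letters to ASCII one-to-one, so
 * character offsets in the folded copy line up with the original.
 */
function foldAccents(string $s): string
{
    static $map = null;
    if ($map === null) {
        $from = preg_split('//u', 'àáâãäçèéêëìíîïñòóôõöùúûüýÿ', -1, PREG_SPLIT_NO_EMPTY);
        $to   = str_split('aaaaaceeeeiiiinooooouuuuyy');
        $map  = array_combine($from, $to);
    }
    return strtr(mb_strtolower($s, 'UTF-8'), $map);
}

/** Hypothetical helper: wraps every accent-insensitive match in a span. */
function highlight(string $original, string $search): string
{
    $haystack = foldAccents($original);
    $needle   = foldAccents($search);
    $len      = mb_strlen($needle, 'UTF-8');
    if ($len === 0) {
        return $original; // nothing to search for
    }

    $out = '';
    $offset = 0;
    // Find positions in the folded copy, slice the original string.
    while (($pos = mb_strpos($haystack, $needle, $offset, 'UTF-8')) !== false) {
        $out .= mb_substr($original, $offset, $pos - $offset, 'UTF-8')
            . '<span class="highlight">'
            . mb_substr($original, $pos, $len, 'UTF-8')
            . '</span>';
        $offset = $pos + $len;
    }
    return $out . mb_substr($original, $offset, null, 'UTF-8');
}

echo highlight('La Céra company', 'cera');
// La <span class="highlight">Céra</span> company
```

The one-to-one table matters here: a transliteration that turns one character into several (as iconv with //TRANSLIT can) would shift the offsets between the two strings.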

Update 2:

It is important to use the mb_ functions instead of plain strlen(), etc., because in UTF-8 accented characters are stored using two or more bytes. Also always make sure that you are using the correct encoding. For example:

    echo strlen('é');               // 2
    echo mb_strlen('é');            // 2
    echo mb_internal_encoding();    // ISO-8859-1
    echo mb_strlen('é', 'UTF-8');   // 1
    mb_internal_encoding('UTF-8');
    echo mb_strlen('é');            // 1
+2

Source: https://habr.com/ru/post/954659/

