How do I correctly remove duplicate whitespace characters from a UTF-8 string in PHP with a regular expression?

I am trying to remove duplicate whitespace characters from a UTF-8 string in PHP using a regex. This regex

$txt = preg_replace( '/\s+/i' , ' ', $txt ); 

usually works fine, but some of the lines contain the Cyrillic letter "Р", which is corrupted by the replacement. After a little research, I realized that the letter is encoded in UTF-8 as the two bytes \xD0\xA0, and since \xA0 is a non-breaking space in single-byte encodings such as ISO-8859-1 (it is not part of ASCII), the regular expression replaces that byte with \x20, and the character is no longer valid.
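To make the byte-level explanation concrete, here is a minimal sketch that prints the UTF-8 bytes involved. The variable names are illustrative only, and the "\u{...}" escape assumes PHP 7 or later.

```php
<?php
// Cyrillic capital "Р" (U+0420) occupies two bytes in UTF-8: 0xD0 0xA0.
$er = "\u{0420}";
echo bin2hex($er), "\n";    // d0a0

// The no-break space U+00A0 is itself two bytes in UTF-8: 0xC2 0xA0.
// A lone 0xA0 byte is a no-break space only in single-byte encodings
// such as ISO-8859-1.
$nbsp = "\u{00A0}";
echo bin2hex($nbsp), "\n";  // c2a0
```

Without the u modifier the pattern is applied byte by byte, so the second byte of "Р" can be treated as whitespace on its own when the character tables in use classify 0xA0 that way.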

Any ideas how to do this correctly in PHP with regex?

2 answers

This is described at http://www.php.net/manual/en/function.preg-replace.php#106981:

If you want to match characters from all languages, including European, Russian, Chinese, Japanese and Korean text, it is simple:
- use mb_internal_encoding('UTF-8');
- use preg_replace('...u', '...', $string) with the u (Unicode) modifier.

For more information, a complete list of preg_* modifiers can be found at: http://php.net/manual/en/reference.pcre.pattern.modifiers.php
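As a sketch of how the u modifier applies to the original question, the following collapses each run of whitespace into a single space. The function name collapse_whitespace is hypothetical, not something from the quoted manual comment, and the sketch relies only on the u modifier: preg_replace does not consult mb_internal_encoding.

```php
<?php
// Hypothetical helper for illustration; the name is not from the original post.
function collapse_whitespace(string $text): string
{
    // The u modifier makes PCRE treat both pattern and subject as UTF-8,
    // so multibyte characters such as "Р" are matched as whole characters.
    $result = preg_replace('/\s+/u', ' ', $text);

    // With the u modifier, preg_replace returns null if the subject
    // is not valid UTF-8 (or on any other PCRE error).
    if ($result === null) {
        throw new InvalidArgumentException(
            'preg_replace failed, error code ' . preg_last_error()
        );
    }
    return $result;
}

echo collapse_whitespace("Привет,  \t Россия\n\nРР"), "\n";
```

Checking for null matters here because input that is not valid UTF-8 makes the call fail instead of returning the string unchanged.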


Try the u modifier:

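 $txt = "UTF 字符串 with 空格符號";
 // The empty replacement removes every whitespace run; use ' ' instead to collapse each run to one space.
 var_dump(preg_replace("/\\s+/iu", "", $txt));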

Outputs:

 string(28) "UTF字符串with空格符號" 

Source: https://habr.com/ru/post/1446894/

