How to handle Chinese Unicode characters in a URL?

I have Chinese users of my PHP web application who enter products into our system. The information they enter includes the name of the product and the price.

We would like to use the product title to create a nice URL for the product. It looks like we can't just put the Chinese text into HREF attributes.

Does anyone know how we should handle a title like "婴儿 服饰" so that we can create a clean URL like http://www.site.com/婴儿服饰 ?

Everything works fine for "normal" languages, but languages that use the higher UTF-8 ranges give us problems.

In addition, when creating a clean URL we want to keep SEO in mind, but I have no experience with Chinese in this respect.

2 answers

If your string is already UTF-8, just use rawurlencode to properly encode the string:

    $path = '婴儿服饰';
    $url = 'http://example.com/' . rawurlencode($path);

UTF-8 is the preferred character encoding for non-ASCII characters in URIs. Only ASCII characters are allowed in a URI itself, so the UTF-8 bytes have to be percent-encoded. The result is the same as in tchrist's example:

 http://example.com/%E5%A9%B4%E5%84%BF%E6%9C%8D%E9%A5%B0 
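The title in the question contains a space ("婴儿 服饰") while the desired URL does not, so the title probably needs a small cleanup step before encoding. A minimal sketch, assuming the title is valid UTF-8; `productUrl` is a hypothetical helper name, not part of PHP:

```php
<?php
// Hypothetical helper: build a product URL from a UTF-8 product title.
function productUrl(string $baseUrl, string $title): string
{
    // Strip ASCII whitespace and Unicode separators (e.g. the ideographic space U+3000).
    // preg_replace returns null on invalid UTF-8, so fall back to the raw title.
    $slug = preg_replace('/[\s\p{Z}]+/u', '', $title) ?? $title;

    // Percent-encode the UTF-8 bytes so the URL contains only ASCII.
    return rtrim($baseUrl, '/') . '/' . rawurlencode($slug);
}

$url = productUrl('http://example.com/', '婴儿 服饰');
echo $url, "\n"; // http://example.com/%E5%A9%B4%E5%84%BF%E6%9C%8D%E9%A5%B0

// Round trip: decoding the path segment gives back the original characters,
// which is what a router or database lookup would work with.
$segment = ltrim(parse_url($url, PHP_URL_PATH), '/');
echo rawurldecode($segment), "\n"; // 婴儿服饰
```

Browsers generally display such a percent-encoded path as the original Chinese characters in the address bar, so the link stays readable for users even though the markup is pure ASCII.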

This code, which uses the CPAN module URI::Escape:

    #!/usr/bin/env perl
    use v5.10;
    use utf8;
    use URI::Escape qw(uri_escape_utf8);

    my $url  = "http://www.site.com/";
    my $path = "婴儿服饰";
    say $url, uri_escape_utf8($path);

when run, prints:

 http://www.site.com/%E5%A9%B4%E5%84%BF%E6%9C%8D%E9%A5%B0 

Is this what you are looking for?

By the way, those four characters are:

    CJK UNIFIED IDEOGRAPH-5A74
    CJK UNIFIED IDEOGRAPH-513F
    CJK UNIFIED IDEOGRAPH-670D
    CJK UNIFIED IDEOGRAPH-9970

According to the Unicode::Unihan database, that seems to be yīng ér fú shì, or maybe just ying er fú shi per Lingua::ZH::Romanize::Pinyin. It could even be jing¹ jan⁴ fuk⁶ sik¹ or jing˥ jan˨˩ fuk˨ sik˥, using the Cantonese reading from Unicode::Unihan.
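If a romanized ASCII slug is preferred over percent-encoded Chinese, something similar can be attempted from PHP. This is only a sketch, assuming the `intl` extension is installed; `pinyinSlug` is a hypothetical helper name, and the exact romanization depends on the ICU data bundled with your PHP build, so no output is shown here:

```php
<?php
// Hypothetical helper: romanize a Chinese title into an ASCII slug.
// Assumes ext-intl; returns null when transliteration is not available.
function pinyinSlug(string $title): ?string
{
    if (!class_exists('Transliterator')) {
        return null; // intl extension is not loaded
    }

    // Han-Latin produces pinyin with tone marks; Latin-ASCII strips the marks.
    $tr = Transliterator::create('Han-Latin; Latin-ASCII');
    if ($tr === null) {
        return null; // this ICU build does not know the transform
    }

    $latin = $tr->transliterate($title);
    if ($latin === false) {
        return null; // transliteration failed, e.g. on invalid UTF-8
    }

    // Collapse everything that is not a-z or 0-9 into single hyphens.
    $slug = preg_replace('/[^a-z0-9]+/', '-', strtolower($latin));
    return trim($slug, '-');
}

var_dump(pinyinSlug('婴儿服饰'));
```

Note that pinyin is lossy: many different characters share the same romanization, so a slug like this should not be the only key used to look the product up.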


Source: https://habr.com/ru/post/889205/

