Correct character encoding

I am currently scraping a website for various pieces of text data (with permission, of course). The problem I am seeing is that certain characters are not encoded correctly in the process. This is especially noticeable with apostrophes ('), which come out as sequences of garbled characters.

I am currently using the following code to convert various HTML entities in the scraped data:

htmlentities($content, ENT_COMPAT, 'UTF-8', FALSE) 

Is there a better way to handle this?

+4
4 answers

HTML entities serve two purposes:

  • Escaping characters that have special meaning in HTML, such as angle brackets, so they can be used as literals.
  • Displaying characters that are not supported by the character set in use, such as the euro symbol in ISO-8859-1.

They are not really an encoding tool.

If you want to convert from one encoding to another, I suggest you use iconv(). However, you must know both the source and the target encoding. The source encoding should be stated in the Content-Type response header, and the target encoding is whatever you decided on when you started the site (although in your case UTF-8 looks like the most suitable choice).
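As a rough sketch of that advice, assuming you have the raw Content-Type header value of the scraped response: the helper names charsetFromContentType and toUtf8 are made up for illustration; only preg_match() and iconv() are real PHP functions here.

```php
<?php
// Hypothetical helper: extract the charset from a Content-Type header value.
// Falls back to $default when the header does not declare one.
function charsetFromContentType(string $header, string $default = 'UTF-8'): string
{
    if (preg_match('/charset\s*=\s*"?([A-Za-z0-9._:-]+)"?/i', $header, $m)) {
        return strtoupper($m[1]);
    }
    return $default;
}

// Hypothetical helper: convert a response body to UTF-8 with iconv().
function toUtf8(string $body, string $sourceEncoding): string
{
    if ($sourceEncoding === 'UTF-8') {
        return $body; // already in the target encoding
    }
    $converted = iconv($sourceEncoding, 'UTF-8', $body);
    if ($converted === false) {
        throw new RuntimeException("Cannot convert from $sourceEncoding to UTF-8");
    }
    return $converted;
}

$charset = charsetFromContentType('text/html; charset=iso-8859-1');
$utf8    = toUtf8("caf\xE9", $charset); // 0xE9 is "é" in ISO-8859-1
```

If the header carries no charset, the page may still declare one in a meta tag, so the default used here is only a guess.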

+3

You do not want to use htmlentities right away; I would apply it to the data at the last stage before you save it. One of the problems you will run into is people who do not always encode their entities properly. Not everyone writes &trade;; some just paste the trademark symbol itself. If you add some logic that tries to catch everything they put in and encode it correctly, you might be better off. For instance:

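 // preg_replace pairs entries by array order, not by key,
 // so both arrays are filled in the same order.
 $patterns = array();
 $patterns[0] = '/—/';
 $patterns[1] = '/&nbsp;/';
 $patterns[2] = '/®/';
 $replacements = array();
 $replacements[0] = '&#8212;'; // em dash
 $replacements[1] = '&#160;';  // non-breaking space
 $replacements[2] = '&#174;';  // registered trademark sign
 $ourhtml = preg_replace($patterns, $replacements, $html);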

You can find all the gotcha characters, such as dashes, single quotes, and apostrophes, and encode them manually, and also settle on one standard set of entities (named or numeric).

You could also do the same with more general regular expressions, which would probably be a more elegant solution. But my suggestion would be to take the time to filter out what you do not want by hand; then you know your data will be prepared exactly the way you like.
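One possible way to handle input where some authors wrote entities and others pasted the raw symbols is to decode everything to plain UTF-8 first and encode once at the end. This is only a sketch of that idea, not necessarily what the answer had in mind; normalizeThenEncode is a hypothetical name, while html_entity_decode() and htmlentities() are standard PHP functions.

```php
<?php
// Hypothetical helper: bring mixed input (entities and raw symbols)
// into one consistent entity-encoded form.
function normalizeThenEncode(string $text): string
{
    // Step 1: turn any existing entities into real UTF-8 characters.
    $plain = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Step 2: encode exactly once, as the last stage before saving.
    return htmlentities($plain, ENT_COMPAT, 'UTF-8');
}

// One author wrote the entity, another pasted the symbol (U+2122).
$mixed  = "Acme&trade; vs Acme\u{2122}";
$result = normalizeThenEncode($mixed);
```

Both spellings of the trademark sign end up in the same form, so later comparisons and searches on the stored text behave consistently.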

0

It's a little difficult to suggest things based on the information provided. Can you provide an example of a piece of text, maybe?

Otherwise, I will use a shotgun approach (that is, offering a bunch of things and hoping one of them hits).

First of all, are you sure that the page you are accessing is encoded in UTF-8? What does mb_detect_encoding() report?
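A minimal sketch of such a check, assuming the mbstring extension is loaded. The function name guessEncoding and the candidate list are illustrative choices; keep in mind that mb_detect_encoding() only ever returns a guess, whereas mb_check_encoding() gives a definite yes or no for one encoding.

```php
<?php
// Hypothetical helper: report what encoding a byte string appears to be in.
function guessEncoding(string $bytes): string
{
    // Definite answer: are these bytes well-formed UTF-8?
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return 'UTF-8';
    }
    // Not UTF-8, so ask for a best guess among likely single-byte encodings.
    // The third argument enables strict mode.
    $guess = mb_detect_encoding($bytes, ['ISO-8859-1', 'Windows-1252'], true);
    return $guess === false ? 'unknown' : $guess;
}

$fromUtf8Page   = guessEncoding("caf\xC3\xA9"); // "é" as two UTF-8 bytes
$fromLegacyPage = guessEncoding("caf\xE9");     // lone 0xE9 is not valid UTF-8
```

Single-byte encodings accept almost any byte sequence, so a non-UTF-8 result tells you the page is not UTF-8 but not reliably which legacy encoding it uses.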

One option (may not work depending on your needs) is to use iconv with the TRANSLIT option to convert characters to something more convenient for processing with PHP. You can also use the mb_* functions to work with multibyte strings.
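For illustration, transliteration with iconv might look like the sketch below. The helper name toAsciiApprox is hypothetical. The exact replacement characters depend on the iconv implementation and the current locale, and some implementations do not support //TRANSLIT at all, so the helper returns null on failure instead of assuming a result.

```php
<?php
// Hypothetical helper: approximate a UTF-8 string using ASCII only.
function toAsciiApprox(string $utf8): ?string
{
    // //TRANSLIT asks iconv to substitute similar-looking characters
    // when the target charset cannot represent the original.
    $ascii = iconv('UTF-8', 'ASCII//TRANSLIT', $utf8);
    return $ascii === false ? null : $ascii;
}

// U+2019 is the typographic apostrophe that often causes trouble.
$approx = toAsciiApprox("it\u{2019}s");
```

This is lossy by design, so it fits search keys or slugs better than text you intend to display unchanged.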

Are you sure htmlentities is the problem? If the content is UTF-8 and your site is configured to serve ISO-8859-1, you will see odd characters. Check the encoding your browser is using to make sure it matches the character encoding you are outputting.
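The mismatch described here can be reproduced directly. The sketch below assumes the browser falls back to Windows-1252 (a common stand-in for ISO-8859-1); misreadAsCp1252 is a hypothetical helper that shows what correctly encoded UTF-8 turns into when read with the wrong charset.

```php
<?php
// Hypothetical helper: interpret UTF-8 bytes as if they were Windows-1252,
// then re-encode the misread characters as UTF-8 so they can be inspected.
function misreadAsCp1252(string $utf8Bytes): string
{
    $misread = iconv('CP1252', 'UTF-8', $utf8Bytes);
    if ($misread === false) {
        throw new RuntimeException('Bytes are not representable in CP1252');
    }
    return $misread;
}

// U+2019 (typographic apostrophe) is the three bytes E2 80 99 in UTF-8.
$apostrophe = "\u{2019}";
// Each byte becomes its own character: U+00E2, U+20AC, U+2122.
$garbled = misreadAsCp1252($apostrophe);
```

If your output looks like this, the data itself may be fine and the fix may be as simple as declaring the charset, for example by calling header('Content-Type: text/html; charset=UTF-8') before any output is sent.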

0

I see no problem with using htmlentities() as long as you pass false as the last parameter. This ensures that you don't encode anything twice (for example, turning &amp; into &amp;amp;).
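A small comparison of the two settings of that last parameter (double_encode); the variable names are just for illustration.

```php
<?php
$alreadyEncoded = 'Tom &amp; Jerry';

// double_encode = false: existing entities are left alone.
$once  = htmlentities($alreadyEncoded, ENT_COMPAT, 'UTF-8', false);

// double_encode = true (the default): the "&" of "&amp;" is encoded again.
$twice = htmlentities($alreadyEncoded, ENT_COMPAT, 'UTF-8', true);

// A bare ampersand is still encoded even when double_encode is false.
$bare  = htmlentities('Tom & Jerry', ENT_COMPAT, 'UTF-8', false);
```

Here $once stays 'Tom &amp; Jerry', $twice becomes 'Tom &amp;amp; Jerry', and $bare becomes 'Tom &amp; Jerry'.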

0

Source: https://habr.com/ru/post/1303065/

