How to replace non-SGML characters in String using PHP?

Question

I programmed a guestbook using PHP 4 and HTML 4.01 (encoded as ISO-8859-15, i.e. Latin-9). The data is stored in a MySQL database using the ISO-8859-1 (Latin-1) character set.

When someone enters characters from a different encoding, it seems that browsers send encoded data (in fact, I did not check where it gets encoded...).

In any case, it seems that in some cases invalid characters end up stored in the database. As a result, the validator reports an error when the data is added to an HTML 4.01 document:

non-SGML character 146
You have used an illegal character in your text. HTML uses the UNICODE Consortium's standard character repertoire, and it leaves undefined (among others) 65 character codes (0 to 31 inclusive and 127 to 159 inclusive), which are sometimes used for typographic quotes and similar characters in proprietary character sets. The validator has found one of these undefined characters in your document. The character may appear in your browser as a curly quote, a trademark symbol, or some other fancy glyph; on another computer, however, it will most likely look like a completely different character or nothing at all.
It is best to replace the character with the closest ASCII equivalent, or to use an appropriate character entity. For more information on character encoding on the web, see Alan Flavell's excellent HTML character set reference.
This error can also be caused by formatting characters embedded in documents by some word processors. If you use a word processor to edit your HTML documents, be sure to use "Save As ASCII" or a similar command to save the document without formatting.

I now use PHP 5.2.17 and played around a bit with htmlspecialchars, but nothing worked. How can I encode those characters so that there are no more validation errors?

+6

html php validation character-encoding

R_User Mar 16 '12 at 12:08

2 answers

A web page with a text input field should be encoded in UTF-8, as this is the only way to ensure that all characters entered by the user are transmitted correctly. How you deal with them on the server side (for example, rejecting characters outside a certain range) is a separate question.

If you use any other encoding, and the user enters a character that has no representation in that encoding, this is an error condition that browsers may handle in any way they like. Modern browsers do something that is very questionable in principle, although useful in practice: they represent such characters as character references, for example &#8217; for the right single quotation mark (’). In this case, the received data is the same as if the user had literally typed &#8217; (but this is so theoretical that browser developers seem to ignore the problem).

What happens on the server side in your case is unclear, but it could be the result of many kinds of processing. In any case, you cannot in general store ISO-8859-15 data in an ISO-8859-1 encoding (ISO-8859-15 was designed to replace some characters of ISO-8859-1 with others). It is not clear what your software does with character references like &#8217;. It would be a little odd, though of course possible, for the software to replace them with character references such as &#146; (which are based on using Windows-1252 as the document character set, contrary to HTML rules; such references are technically undefined rather than illegal in HTML, but so widely supported by browsers that HTML5 turns this into a rule).

+2

Jukka K. Korpela Mar 16 '12 at 13:52

hakre · Accepted Answer · 2012-03-16T12:35:21+0000

In both ISO-8859-1 and ISO-8859-15, character number 146 (0x92) is a control character from the C1 range: PU2 (Private Use Two).

SGML refers to ISO 8859-1 (note the space between "ISO" and "8859-1"; it is not a hyphen, as in the names of the character sets you use). It allows no control characters except three (here: SGML in HTML):

Only three control characters are allowed in the HTML document character set: Horizontal Tab, carriage return, and line feed (code positions 9, 13, and 10).

So you have passed an illegal character. There is no SGML/HTML entity you could replace it with.

I suggest you validate the input that comes into your application so that it does not allow control characters. If you believe these characters were originally meant to represent something useful, such as a letter that can actually be read (i.e. a non-control character), it is likely that the encoding gets broken at some point while the data is processed.
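A minimal sketch of such an input check, assuming the input is a single-byte (ISO-8859-x) string. The function name `findIllegalControlBytes` is hypothetical and not part of PHP; it reports every byte in the ranges 0-31 and 127-159 except tab, line feed, and carriage return.

```php
<?php
/**
 * Return offset => byte value for every control character that HTML 4.01
 * does not allow, in a single-byte (ISO-8859-x) encoded string.
 * Allowed controls: tab (9), line feed (10), carriage return (13).
 * (Hypothetical helper, for illustration only.)
 */
function findIllegalControlBytes($input)
{
    $found = array();
    // C0 controls except \t \n \r, then DEL (0x7F) and the C1 range 0x80-0x9F.
    // Without the /u modifier, PCRE matches byte by byte.
    $pattern = '/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]/';
    if (preg_match_all($pattern, $input, $matches, PREG_OFFSET_CAPTURE)) {
        foreach ($matches[0] as $match) {
            list($byte, $offset) = $match;
            $found[$offset] = ord($byte);
        }
    }
    return $found;
}

// Byte 0x92 (146) at offset 2 is reported; \r and \n are allowed.
print_r(findIllegalControlBytes("It\x92s a test\r\n"));
```

Whether to reject such input, strip the bytes, or translate them is a separate decision; the check only tells you that they are there.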

From the information given in your question, it is difficult to say where, since you only specify the input encoding and the encoding of the database field, and those two already do not match (which should not lead to the problem you are asking about, but may cause other problems). Besides these two places there is also the database client connection encoding (not specified in your question), the output encoding (not specified in your question), and the response content encoding (not specified in your question).

You may want to change your overall encoding to UTF-8 to support a wider range of characters; that is certainly possible.

Edit: The part above is a somewhat strict view. It occurred to me that the input you receive is not ISO-8859-1(5) but something else, such as a Windows code page. I would guess Windows-1252 (cp1252) (see Wikipedia). In the range that ISO-8859-1 uses for the C1 controls (128-159), it has a number of non-control characters.

The Wikipedia page also notes that most browsers treat ISO-8859-1 as Windows-1252 (CP1252, CP-1252). PHP's htmlentities() function cannot handle these characters, because its translation table for HTML entities does not cover the Windows code page characters (PHP 5.3; not tested on 5.4). You need to create your own translation table and use it with strtr to replace the Windows-1252 characters that are not available in ISO 8859-15:

    /*
     * mappings of Windows-1252 (cp1252) 128 (0x80) - 159 (0x9F) characters:
     * @link http://en.wikipedia.org/wiki/Windows-1252
     * @link http://www.w3.org/TR/html4/sgml/entities.html
     */
    $cp1252HTML401Entities = array(
        "\x80" => '&euro;',    # 128 -> euro sign, U+20AC NEW
        "\x82" => '&sbquo;',   # 130 -> single low-9 quotation mark, U+201A NEW
        "\x83" => '&fnof;',    # 131 -> latin small f with hook = function = florin, U+0192 ISOtech
        "\x84" => '&bdquo;',   # 132 -> double low-9 quotation mark, U+201E NEW
        "\x85" => '&hellip;',  # 133 -> horizontal ellipsis = three dot leader, U+2026 ISOpub
        "\x86" => '&dagger;',  # 134 -> dagger, U+2020 ISOpub
        "\x87" => '&Dagger;',  # 135 -> double dagger, U+2021 ISOpub
        "\x88" => '&circ;',    # 136 -> modifier letter circumflex accent, U+02C6 ISOpub
        "\x89" => '&permil;',  # 137 -> per mille sign, U+2030 ISOtech
        "\x8A" => '&Scaron;',  # 138 -> latin capital letter S with caron, U+0160 ISOlat2
        "\x8B" => '&lsaquo;',  # 139 -> single left-pointing angle quotation mark, U+2039 ISO proposed
        "\x8C" => '&OElig;',   # 140 -> latin capital ligature OE, U+0152 ISOlat2
        "\x8E" => '&#381;',    # 142 -> U+017D
        "\x91" => '&lsquo;',   # 145 -> left single quotation mark, U+2018 ISOnum
        "\x92" => '&rsquo;',   # 146 -> right single quotation mark, U+2019 ISOnum
        "\x93" => '&ldquo;',   # 147 -> left double quotation mark, U+201C ISOnum
        "\x94" => '&rdquo;',   # 148 -> right double quotation mark, U+201D ISOnum
        "\x95" => '&bull;',    # 149 -> bullet = black small circle, U+2022 ISOpub
        "\x96" => '&ndash;',   # 150 -> en dash, U+2013 ISOpub
        "\x97" => '&mdash;',   # 151 -> em dash, U+2014 ISOpub
        "\x98" => '&tilde;',   # 152 -> small tilde, U+02DC ISOdia
        "\x99" => '&trade;',   # 153 -> trade mark sign, U+2122 ISOnum
        "\x9A" => '&scaron;',  # 154 -> latin small letter s with caron, U+0161 ISOlat2
        "\x9B" => '&rsaquo;',  # 155 -> single right-pointing angle quotation mark, U+203A ISO proposed
        "\x9C" => '&oelig;',   # 156 -> latin small ligature oe, U+0153 ISOlat2
        "\x9E" => '&#382;',    # 158 -> U+017E
        "\x9F" => '&Yuml;',    # 159 -> latin capital letter Y with diaeresis, U+0178 ISOlat2
    );

    $outputWithEntities = strtr($output, $cp1252HTML401Entities);
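Since the question mentions htmlspecialchars, here is a hedged sketch of how such a table might be combined with it. The function name `escapeCp1252ForHtml` and the three-entry excerpt table are assumptions made for illustration; a real table would cover every byte in 0x80-0x9F that Windows-1252 defines. The order matters: markup characters are escaped first, and the byte replacement happens second, so the "&" of the inserted entities is not escaped again.

```php
<?php
// Excerpt only (illustrative): three Windows-1252 bytes and their entities.
$cp1252Excerpt = array(
    "\x80" => '&euro;',  // euro sign
    "\x92" => '&rsquo;', // right single quotation mark
    "\x96" => '&ndash;', // en dash
);

/**
 * Escape a Windows-1252 string for output in an ISO-8859-1(5) HTML 4.01 page.
 * (Hypothetical helper, for illustration only.)
 */
function escapeCp1252ForHtml($text, array $table)
{
    // 1. Escape & < > " ' first. The charset is passed explicitly because
    //    the default charset differs between PHP versions.
    $escaped = htmlspecialchars($text, ENT_QUOTES, 'ISO-8859-1');
    // 2. Then replace the 0x80-0x9F bytes with entities.
    return strtr($escaped, $table);
}

// Prints: Tom&rsquo;s &lt;b&gt;5 &euro;&lt;/b&gt;
echo escapeCp1252ForHtml("Tom\x92s <b>5 \x80</b>", $cp1252Excerpt), "\n";
```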

If you want to be even safer, you can skip the named entities and use only numeric ones, which should also work in very old browsers:

    $cp1252HTMLNumericEntities = array(
        "\x80" => '&#8364;',  # 128 -> euro sign, U+20AC NEW
        "\x82" => '&#8218;',  # 130 -> single low-9 quotation mark, U+201A NEW
        "\x83" => '&#402;',   # 131 -> latin small f with hook = function = florin, U+0192 ISOtech
        "\x84" => '&#8222;',  # 132 -> double low-9 quotation mark, U+201E NEW
        "\x85" => '&#8230;',  # 133 -> horizontal ellipsis = three dot leader, U+2026 ISOpub
        "\x86" => '&#8224;',  # 134 -> dagger, U+2020 ISOpub
        "\x87" => '&#8225;',  # 135 -> double dagger, U+2021 ISOpub
        "\x88" => '&#710;',   # 136 -> modifier letter circumflex accent, U+02C6 ISOpub
        "\x89" => '&#8240;',  # 137 -> per mille sign, U+2030 ISOtech
        "\x8A" => '&#352;',   # 138 -> latin capital letter S with caron, U+0160 ISOlat2
        "\x8B" => '&#8249;',  # 139 -> single left-pointing angle quotation mark, U+2039 ISO proposed
        "\x8C" => '&#338;',   # 140 -> latin capital ligature OE, U+0152 ISOlat2
        "\x8E" => '&#381;',   # 142 -> U+017D
        "\x91" => '&#8216;',  # 145 -> left single quotation mark, U+2018 ISOnum
        "\x92" => '&#8217;',  # 146 -> right single quotation mark, U+2019 ISOnum
        "\x93" => '&#8220;',  # 147 -> left double quotation mark, U+201C ISOnum
        "\x94" => '&#8221;',  # 148 -> right double quotation mark, U+201D ISOnum
        "\x95" => '&#8226;',  # 149 -> bullet = black small circle, U+2022 ISOpub
        "\x96" => '&#8211;',  # 150 -> en dash, U+2013 ISOpub
        "\x97" => '&#8212;',  # 151 -> em dash, U+2014 ISOpub
        "\x98" => '&#732;',   # 152 -> small tilde, U+02DC ISOdia
        "\x99" => '&#8482;',  # 153 -> trade mark sign, U+2122 ISOnum
        "\x9A" => '&#353;',   # 154 -> latin small letter s with caron, U+0161 ISOlat2
        "\x9B" => '&#8250;',  # 155 -> single right-pointing angle quotation mark, U+203A ISO proposed
        "\x9C" => '&#339;',   # 156 -> latin small ligature oe, U+0153 ISOlat2
        "\x9E" => '&#382;',   # 158 -> U+017E
        "\x9F" => '&#376;',   # 159 -> latin capital letter Y with diaeresis, U+0178 ISOlat2
    );
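As a cross-check on a hand-written table, the numeric references could also be derived programmatically. This is only a sketch, assuming the iconv extension is available and that the underlying iconv library knows the names `CP1252` and `UTF-32BE`; the function name `buildCp1252NumericTable` is hypothetical.

```php
<?php
/**
 * Build a byte => numeric character reference table for 0x80-0x9F by
 * asking iconv which code point each Windows-1252 byte maps to.
 * (Hypothetical helper; assumes the iconv extension is loaded.)
 */
function buildCp1252NumericTable()
{
    $table = array();
    for ($byte = 0x80; $byte <= 0x9F; $byte++) {
        // UTF-32BE encodes every character as four big-endian bytes.
        $utf32 = @iconv('CP1252', 'UTF-32BE', chr($byte));
        if ($utf32 === false || strlen($utf32) !== 4) {
            continue; // byte not convertible (undefined in Windows-1252)
        }
        $unpacked = unpack('N', $utf32);
        $codePoint = $unpacked[1];
        if ($codePoint < 0xA0) {
            continue; // some libraries map undefined bytes to C1 controls
        }
        $table[chr($byte)] = '&#' . $codePoint . ';';
    }
    return $table;
}

$table = buildCp1252NumericTable();
echo $table["\x92"], "\n"; // right single quotation mark, U+2019
```

The guard on code points below 0xA0 is there so that the generated table never produces a reference into the range the validator complains about.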

Hopefully this is more useful now. See also the Wikipedia page linked above for some characters that exist in both Windows-1252 and ISO 8859-15, but at different code points. You should probably consider using UTF-8 for your website.
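If you do move to UTF-8, existing Windows-1252 data could be converted rather than mapped to entities. A minimal sketch, assuming the iconv extension is available and that the stored bytes really are Windows-1252; `cp1252ToUtf8` is a hypothetical helper name. The page, the database field, and the database connection would all have to be switched to UTF-8 as well for this to be consistent.

```php
<?php
/**
 * Convert text stored as Windows-1252 bytes to UTF-8.
 * Returns false if iconv cannot convert the input.
 * (Hypothetical helper; assumes the iconv extension is loaded.)
 */
function cp1252ToUtf8($text)
{
    return @iconv('CP1252', 'UTF-8', $text);
}

$utf8 = cp1252ToUtf8("It\x92s 5 \x80");

// In a UTF-8 page, only the markup characters need escaping;
// the curly quote and the euro sign are sent as-is.
echo htmlspecialchars($utf8, ENT_QUOTES, 'UTF-8'), "\n";
```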
