Allowed characters in submit forms (including UTF-8)

Suppose I let my users submit a form containing some text fields (I'm not talking about passwords). My users sometimes write in non-ASCII scripts such as Russian or Chinese, so I use UTF-8 encoding in my database. The question is: should I really allow all possible UTF-8 characters? I looked at the ASCII table and saw that the characters from 0 to 31 have nothing to do with text, apart from whitespace such as tab and newline. The characters from 176 to 223 seem to be purely decorative. Should I restrict them?

+3
4 answers

Should you make sure the input really is UTF-8 and Unicode? Yes.

Should you make sure it does not include certain characters, such as control codes? That may not be required.

You should be aware that even if your form uses UTF-8, you may not receive valid UTF-8 from every user agent that sends you form data, so you will have to filter the input as necessary. Invalid UTF-8 can take various forms, some of which are:

  • Overlong encodings (which can lead to security problems)
  • Other invalid UTF-8 byte sequences that may indicate that the user agent ignored the character encoding and instead sent something like Windows-1252 or ISO-8859-1.

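As a rough illustration of the cases listed above, here is a minimal sketch in Python (my choice of language; the question does not name one). It leans on the standard library's strict UTF-8 decoder, which rejects overlong encodings, stray legacy-encoding bytes, and sequences past U+10FFFF. The helper name `decode_form_field` and the sample inputs are made up for the example.

```python
def decode_form_field(raw: bytes):
    """Return the decoded text, or None if raw is not valid UTF-8."""
    try:
        # errors="strict" is the default; it is spelled out here for clarity.
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None


# Hypothetical inputs, one per kind of problem.
samples = {
    "valid Russian text": "привет".encode("utf-8"),
    "overlong encoding of '/'": b"\xc0\xaf",
    "Windows-1252 bytes for 'café'": b"caf\xe9",
    "sequence past U+10FFFF": b"\xf4\x90\x80\x80",
}

for label, raw in samples.items():
    decoded = decode_form_field(raw)
    status = "ok" if decoded is not None else "rejected"
    print(f"{label}: {status}")
```

Whether a rejected field should be refused outright or re-decoded as a legacy encoding is a policy decision the sketch leaves open.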

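Characters that are not allowed in HTML and XHTML, even where they are Unicode code points, include (among others):

  • C0 control codes, 0x00 to 0x1F (apart from whitespace such as tab, line feed, and carriage return)
  • DEL, 0x7F
  • C1 control codes, 0x80 to 0x9F
  • Anything above 0x10FFFF, the last Unicode code point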
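A small sketch of how such a check might look once the input has been decoded, again in Python as an assumption. It uses the standard C0 range (0x00 to 0x1F, minus tab, line feed, and carriage return), DEL, and the standard C1 range (0x80 to 0x9F). The function names are hypothetical. Values above 0x10FFFF cannot occur in a Python `str` at all, so they have to be caught while decoding the bytes.

```python
import re

# C0 controls except tab (0x09), LF (0x0A), CR (0x0D); DEL (0x7F); C1 controls.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f\x80-\x9f]")


def find_control_chars(text):
    """List the unwanted control characters in text as U+XXXX labels."""
    return [f"U+{ord(ch):04X}" for ch in _CONTROL_CHARS.findall(text)]


def strip_control_chars(text):
    """Return text with the unwanted control characters removed."""
    return _CONTROL_CHARS.sub("", text)


print(find_control_chars("a\x00b\x85"))
print(repr(strip_control_chars("line one\x07\nline two\x7f")))
```

Reporting the offending characters back to the user is often friendlier than stripping them silently, but either approach works with the same pattern.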
+4

The W3C suggests a regular expression for checking that a field is valid UTF-8:

$field =~
  m/\A(
     [\x09\x0A\x0D\x20-\x7E]            # ASCII
   | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
   |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
   | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
   |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
   |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
   | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
   |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
  )*\z/x;
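For anyone not using Perl, the same pattern ports almost verbatim to other regex engines. Below is a sketch of a Python port that works on the raw bytes of the field; `re.VERBOSE` stands in for Perl's `/x`, and `\Z` for Perl's `\z`. The names `UTF8_FIELD` and `is_valid_utf8_field` are mine, not the W3C's.

```python
import re

# Byte-level port of the Perl pattern above.
UTF8_FIELD = re.compile(
    rb"""\A(?:
       [\x09\x0A\x0D\x20-\x7E]            # ASCII
     | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
     |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
     | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
     |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
     |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
     | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
     |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
    )*\Z""",
    re.VERBOSE,
)


def is_valid_utf8_field(raw: bytes) -> bool:
    """True if raw consists only of the byte sequences the pattern accepts."""
    return UTF8_FIELD.match(raw) is not None


print(is_valid_utf8_field("Привет, 世界".encode("utf-8")))
print(is_valid_utf8_field(b"\xc0\xaf"))
```

Note that the ASCII branch only admits tab, line feed, carriage return, and 0x20 to 0x7E, so this pattern also rejects the other C0 control codes and DEL, not just malformed UTF-8.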
+6


Only 128 characters (i.e. 0..127) are "ASCII"; the ones you mention, 128..255, are not ASCII but come from an extended code page, most likely cp437.
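To make the 0..127 versus 128..255 distinction concrete, here is a short Python sketch (Python being my assumption, not something the question specifies). It takes the byte values 176 to 223 from the question, shows that they are not ASCII, and decodes them with the `cp437` codec from the standard library.

```python
# The byte values 176..223 that the question calls "decorative".
box_bytes = bytes(range(176, 224))

try:
    box_bytes.decode("ascii")
    is_ascii = True
except UnicodeDecodeError:
    is_ascii = False  # ASCII stops at 127

# Interpreted as cp437, the same bytes are box-drawing and block characters.
decorations = box_bytes.decode("cp437")

print("ASCII?", is_ascii)
print(decorations)
# In UTF-8 each of these characters takes three bytes, not one.
print(decorations[0], "->", decorations[0].encode("utf-8"))
```

In other words, a UTF-8 form never contains "character 176" as a single byte; those byte values only appear as parts of multi-byte sequences.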

But I digress. Your question is not about character encoding but about filtering, and a filter should be based on the properties of the characters: is it a letter, a digit, a control character? Most modern programming languages provide methods or functions for obtaining such information, and most of them also support regular expressions. As for what you should filter, or whether you should filter at all, only you can know.
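As one example of asking for character properties rather than comparing byte values, Python's standard `unicodedata` module exposes the Unicode general category of each character. The sketch below is only an illustration; the `describe` helper and its three labels are hypothetical, and a real filter would pick its own categories.

```python
import unicodedata


def describe(ch):
    """Classify a single character by its Unicode general category."""
    category = unicodedata.category(ch)  # two-letter code such as "Lu", "Nd", "Cc"
    if category.startswith("L"):
        return "letter"
    if category.startswith("N"):
        return "number"
    if category == "Cc":
        return "control"
    return "other (" + category + ")"


for ch in ["я", "界", "7", "\x07", "!"]:
    print(repr(ch), describe(ch))
```

Because the categories come from the Unicode database, the same test treats Russian, Chinese, and Latin letters alike, which a range check on code values cannot do.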

It looks like you need to learn more about character encoding and Unicode. Start here.

+1

Source: https://habr.com/ru/post/1714929/

