mb_check_encoding, as suggested by another user, seems to be the way to go. At least, it is the easiest way in PHP.
I actually did a lot of this in C++! There is no mb_check_encoding function there, so I had to write my own.
Do not use this code in PHP, it's just for the sake of curiosity ;) Use mb_check_encoding.
Also, the claim by another user that "what you call binary gibberish is still valid UTF-8" is TOTALLY WRONG. You can check for UTF-8 with a high degree of accuracy, assuming, of course, that the string is not tiny (say, 4 bytes) and that it contains many non-ASCII characters. UTF-8 has a specific pattern that is hard to hit by accident.
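To make the "pattern" a bit more concrete, here is a minimal sketch of the structural rule only: every lead byte announces how many continuation bytes (0x80..0xBF) must follow it. The names utf8SequenceLength, isContinuation and hasUtf8Shape are mine, for illustration. This sketch does not do the stricter second-byte and code point checks, so treat it as a rough picture rather than a full validator.

```cpp
#include <cstddef>

// Total length of a sequence, judged from its first byte.
// 0 means this byte can never start a well-formed sequence.
int utf8SequenceLength(unsigned char lead) {
    if (lead < 0x80) return 1;   // plain ASCII
    if (lead < 0xC2) return 0;   // stray continuation byte, or overlong lead C0/C1
    if (lead <= 0xDF) return 2;
    if (lead <= 0xEF) return 3;
    if (lead <= 0xF4) return 4;
    return 0;                    // F5..FF never appear in UTF-8
}

// Continuation bytes always look like 10xxxxxx.
bool isContinuation(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Structural check only: lead bytes followed by the right number of
// continuation bytes. Second-byte ranges and surrogates are NOT checked here.
bool hasUtf8Shape(const unsigned char* p, std::size_t n) {
    std::size_t i = 0;
    while (i < n) {
        int len = utf8SequenceLength(p[i]);
        if (len == 0 || i + len > n) return false;   // bad lead or truncated
        for (int k = 1; k < len; ++k) {
            if (!isContinuation(p[i + k])) return false;
        }
        i += len;
    }
    return true;
}
```

Every non-ASCII byte has to land in exactly the right role, so the more non-ASCII bytes a buffer contains, the less likely it is that random data gets through.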
This code also checks for the “non-shortest form” of UTF-8, which is a security issue. Non-shortest-form (overlong) UTF-8 can lead to a situation where one program designed to filter out bad commands actually lets them through, possibly leading to SQL injection.
I don't know how PHP handles the non-shortest form of UTF-8, though ;) It's best to check this yourself if it worries you.
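A small sketch of why the non-shortest form matters, under my own assumptions: naiveDecode2 stands in for a lenient decoder that just strips the marker bits, and isShortestForm is a hypothetical helper that checks whether a code point really needed that many bytes. Neither is taken from any real library.

```cpp
#include <cstdint>

// A lenient two-byte decoder: strips the marker bits and joins the payload.
// It never asks whether two bytes were actually necessary.
std::uint32_t naiveDecode2(unsigned char b0, unsigned char b1) {
    return ((b0 & 0x1Fu) << 6) | (b1 & 0x3Fu);
}

// True if encodedLength is the minimum number of bytes UTF-8 needs
// for codePoint (the "shortest form" rule).
bool isShortestForm(std::uint32_t codePoint, int encodedLength) {
    switch (encodedLength) {
        case 1:  return codePoint < 0x80;
        case 2:  return codePoint >= 0x80 && codePoint < 0x800;
        case 3:  return codePoint >= 0x800 && codePoint < 0x10000;
        case 4:  return codePoint >= 0x10000 && codePoint <= 0x10FFFF;
        default: return false;
    }
}
```

For example, the bytes C0 A7 decode leniently to 0x27, the apostrophe. A filter that scans the raw bytes for 0x27 never sees it, but a lenient decoder further down the line hands an apostrophe to the SQL layer. A strict validator rejects the pair because 0x27 fits in one byte.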
    #include <cstdint>

    typedef std::uint8_t  u8;
    typedef std::uint32_t u32;

    u8*  LegalUTF8_(u8 FirstChar, u8* source);
    bool UniValid(u32 c);

    // Returns 0 if the buffer is valid UTF-8. Otherwise returns the number of
    // bytes from the start of the first bad sequence to sourceEnd.
    // Caution: LegalUTF8_ reads continuation bytes before the end check runs,
    // so this assumes at least one readable non-continuation byte (such as a
    // NUL terminator) at sourceEnd.
    long VerifyUTF8(u8* source, u8* sourceEnd) {
        while (source < sourceEnd) {
            u8 c = *source++;
            if (c >= 0x80) {
                u8* PrevPos = source - 1;
                source = LegalUTF8_(c, source);
                if ( !source or source > sourceEnd ) {
                    return sourceEnd - PrevPos;
                }
            }
        }
        return 0;
    }

    // returns 0 if it fails! source points to the 2nd byte of the UTF8!
    u8* LegalUTF8_(u8 FirstChar, u8* source) {
        if (FirstChar < 0xC2 or FirstChar > 0xF4) {
            return 0; // disallows ASCII! No point calling this on ASCII!
        }
        u32 ch = FirstChar;
        u32 offset;
        u8 a = *source++;

        // Tighter bounds on the 2nd byte: rejects non-shortest forms
        // (E0, F0) and code points above 0x10FFFF (F4).
        switch (FirstChar) { /* no fall-through in this inner switch */
            case 0xE0: if (a < 0xA0) return 0; break;
            case 0xF0: if (a < 0x90) return 0; break;
            case 0xF4: if (a > 0x8F) return 0; break;
        }

        // offset is the sum of the marker bits that pile up in ch below.
        if (ch <= 0xDF) {
            offset = 0x00003080;
            goto case2;
        } else if (ch <= 0xEF) {
            offset = 0x000E2080;
            goto case3;
        } else { // case 4
            offset = 0x03C82080;
        }

        ch <<= 6; ch += a;
        if (a < 0x80 or a > 0xBF) {
            return 0;
        }
        a = *source++;

    case3:;
        ch <<= 6; ch += a;
        if (a < 0x80 or a > 0xBF) {
            return 0;
        }
        a = *source++;

    case2:;
        ch <<= 6; ch += a;
        if (a < 0x80 or a > 0xBF) {
            return 0;
        }

        if (UniValid(ch - offset)) {
            return source;
        }
        return 0;
    }

    bool UniValid( u32 c ) {
        // negative c looks like > 2 billion, which is going to return false!
        if ( c < 0xD800 ) { // common case first
            return true;
        } else if ( c <= 0x0010FFFF and c > 0xDFFF and c != 0xFFFF and c != 0xFFFE ) {
            return true;
        }
        return false;
    }
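In case the offset constants (0x00003080, 0x000E2080, 0x03C82080) look like magic: the code shifts the raw bytes into ch without masking off the marker bits, so the lead marker and one 0x80 per continuation byte pile up in the sum and get subtracted at the end. Here is a rough sketch of that arithmetic; markerOffset is a name I made up for illustration.

```cpp
#include <cstdint>

// Sum of the marker bits left in the accumulator when raw bytes are
// shifted in without masking: the lead marker (0xC0, 0xE0 or 0xF0),
// shifted along, plus 0x80 for each continuation byte.
constexpr std::uint32_t markerOffset(std::uint32_t leadMarker, int continuationBytes) {
    return continuationBytes == 0
        ? leadMarker
        : (markerOffset(leadMarker, continuationBytes - 1) << 6) + 0x80;
}

static_assert(markerOffset(0xC0, 1) == 0x00003080, "2-byte offset");
static_assert(markerOffset(0xE0, 2) == 0x000E2080, "3-byte offset");
static_assert(markerOffset(0xF0, 3) == 0x03C82080, "4-byte offset");
```

For example, the two bytes C3 A9 accumulate to (0xC3 << 6) + 0xA9 = 0x3169, and 0x3169 - 0x3080 = 0xE9, which is the code point for é.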