Sanitize foreign characters / accents from URL

I need to write a server-side function that cleans up URL-encoded strings.

Querystring example:

FirstName=John&LastName=B%F3th&Address=San+Endre+%FAt+12%2F14 

When I pass this through HttpUtility.UrlDecode(), I get:

 FirstName=John&LastName=B th&Address=San Endre  t 12/14 

The function from this SO post looks perfect, but it expects correctly decoded strings that already contain the accented characters:

 RemoveDiacritics('Bóth') ==> 'Both';
 RemoveDiacritics('San Endre út 12/14') ==> 'San Endre ut 12/14';

How can I decode a URL without getting all these characters?

I can't do anything on the client side or change how the strings arrive at my function.

+4
3 answers

I agree with the points already made; however, if you always receive your encoded strings from the same client, you can simply match its encoding. In this case the client is apparently using ISO/IEC 8859-1, informally known as Latin-1, which is one of the most widely used 8-bit character sets. You can decode ISO/IEC 8859-1 with the following code (which correctly decodes the example string you provided):

 HttpUtility.UrlDecode(encodedInput, Encoding.GetEncoding("iso-8859-1")); 

According to MSDN, this code page is guaranteed to be supported by the .NET Framework regardless of the underlying platform; see the table of supported encodings for the Encoding class.
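To tie this back to the question, here is a minimal sketch that chains the Latin-1 decode with accent removal. The class and method names (`QueryCleaner`, `DecodeAndStrip`) are made up for illustration, and `RemoveDiacritics` is one common normalization-based way to strip accents; it is an assumption, not necessarily the function from the linked post.

```csharp
using System;
using System.Globalization;
using System.Text;
using System.Web;

public static class QueryCleaner
{
    // Hypothetical helper: decode the input as Latin-1, then strip accents.
    public static string DecodeAndStrip(string encodedInput)
    {
        // Assumes the client really percent-encodes ISO-8859-1 bytes.
        string decoded = HttpUtility.UrlDecode(
            encodedInput, Encoding.GetEncoding("iso-8859-1"));
        return RemoveDiacritics(decoded);
    }

    // Decompose each character (FormD), drop the combining marks,
    // then recompose (FormC). A sketch of a common approach.
    public static string RemoveDiacritics(string text)
    {
        string decomposed = text.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(c);
            }
        }
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

For the example query string this should yield `FirstName=John&LastName=Both&Address=San Endre ut 12/14`. In real code it is safer to split the query string on `&` and `=` first and decode each value separately, so that an encoded `%26` or `%3D` inside a value cannot be mistaken for a separator.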

+6

By default, UrlDecode expects its input to be UTF-8, in which every character above \u007F is encoded with at least two bytes. So a correctly encoded string (for the character \u00F3, ó) would contain %C3%B3, not %F3.
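The difference is easy to see by percent-encoding the same character under both encodings. The sketch below is only an illustration; `EncodingDemo` and `PercentEncodeAll` are hypothetical names, and the helper deliberately escapes every byte, which a real URL encoder would not do.

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    // Hypothetical helper: percent-encode every byte of the text
    // as produced by the given encoding.
    public static string PercentEncodeAll(string text, Encoding encoding)
    {
        var sb = new StringBuilder();
        foreach (byte b in encoding.GetBytes(text))
        {
            sb.Append('%').Append(b.ToString("X2"));
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // U+00F3 (ó) takes two bytes in UTF-8 ...
        Console.WriteLine(PercentEncodeAll("\u00F3", Encoding.UTF8)); // %C3%B3
        // ... but a single byte in ISO-8859-1 (Latin-1).
        Console.WriteLine(PercentEncodeAll("\u00F3", Encoding.GetEncoding("iso-8859-1"))); // %F3
    }
}
```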

If the strings really arrive the way you show them, I'm not sure there is much you can do. Not with the standard libraries, anyway.

By the way, removing accents from foreign characters is fine, but I wouldn't call it "sanitizing."

+2

%F3 and %FA are neither valid UTF-8 nor ASCII. It looks like the client code encodes the string using the page's current locale.

Depending on your needs, you can either simply strip all characters above 127, or figure out how to decode the incoming URL correctly (I don't think there is a built-in function that handles it as is).

I would copy the characters into a byte array (manually decoding the %-encoded sequences) and then use the correct encoding to convert it to a string (using Encoding.GetString - http://msdn.microsoft.com/en-us/library/system.text.encoding.getstring.aspx ).
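A rough sketch of that manual approach might look like the following. The names `ManualUrlDecoder` and `Decode` are invented for illustration, and the choice of encoding (ISO-8859-1 in the usage line) is an assumption about what the client sends.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class ManualUrlDecoder
{
    // Hypothetical helper: collect the raw bytes of a percent-encoded
    // string, then decode them all at once with the given encoding.
    public static string Decode(string input, Encoding encoding)
    {
        var bytes = new List<byte>(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            byte value;
            if (c == '%' && i + 2 < input.Length &&
                byte.TryParse(input.Substring(i + 1, 2),
                              NumberStyles.AllowHexSpecifier,
                              CultureInfo.InvariantCulture, out value))
            {
                bytes.Add(value); // a valid %XX escape becomes one raw byte
                i += 2;           // skip the two hex digits
            }
            else if (c == '+')
            {
                bytes.Add(0x20);  // '+' means a space in query strings
            }
            else
            {
                // Literal character (or malformed escape): keep it as is.
                bytes.AddRange(encoding.GetBytes(new[] { c }));
            }
        }
        return encoding.GetString(bytes.ToArray());
    }
}
```

Calling `ManualUrlDecoder.Decode("B%F3th", Encoding.GetEncoding("iso-8859-1"))` should then give `Bóth`, which can be passed on to an accent-stripping function.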

+1

Source: https://habr.com/ru/post/1392121/

