URL-decoded string is garbled: Latin-1 bytes treated as UTF-8

I am parsing my nginx logs and want to extract some details from the HTTP_REFERER field, for example the query string used to search for a website. One user typed "México", which is encoded in the log as "query=M%E9xico".

Passing this through Rack::Utils.parse_query('query=M%E9xico') returns the hash {"query" => "M?xico"}.

Inserting "M?xico" into Postgres (but not the more forgiving SQLite) fails, because the string is not valid UTF-8. Looking at http://rack.rubyforge.org/doc/Rack/Utils.html#M000324 , unescape just packs the hexadecimal digits into raw bytes.

How can I convert the string back to UTF-8, or make parse_query return UTF-8 in the first place?

+3
2 answers

unescape decodes the URL encoding:

Rack::Utils.parse_query(URI.unescape('query=M%E9xico'))

or

Rack::Utils.parse_query(Rack::Utils.unescape('query=M%E9xico'))
+1

The problem arises long before you get the data. Fix it upstream if you can. If you cannot, my suggestion is to figure out the encoding of the input and convert it yourself, or use a conversion library in Ruby (such as iconv).

The problem is not PostgreSQL.
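
For what it's worth, here is a rough sketch of the "figure out the encoding and convert" idea, using Ruby's built-in String encoding methods (Ruby 1.9+) instead of iconv. The helper name decode_to_utf8 is made up for illustration, and the fallback to ISO-8859-1 is an assumption: %E9 is "é" in Latin-1, but the log itself does not say which encoding the browser used.

```ruby
require 'uri'

# Hypothetical helper: percent-decode a value and return valid UTF-8.
# Assumption: bytes that are not valid UTF-8 were sent as ISO-8859-1 (Latin-1).
def decode_to_utf8(raw, fallback = Encoding::ISO_8859_1)
  # Percent-decoding yields raw bytes that are merely tagged as UTF-8.
  decoded = URI.decode_www_form_component(raw)
  return decoded if decoded.valid_encoding?

  # Reinterpret the same bytes as the fallback encoding, then transcode to UTF-8.
  decoded.dup.force_encoding(fallback).encode(Encoding::UTF_8)
end

puts decode_to_utf8('M%E9xico')     # Latin-1 input, transcoded
puts decode_to_utf8('M%C3%A9xico')  # already UTF-8, returned unchanged
```

The same conversion could be applied to each value of the hash returned by Rack::Utils.parse_query before the row is inserted. Guessing the source encoding is still a heuristic, so fixing it upstream remains the better option.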

0

Source: https://habr.com/ru/post/1739407/

