URI encoding in Unicode for Apache HttpClient 4

I use Apache HttpClient 4 for all my web access. This means that every request I make has to pass URI syntax checks. One of the sites I'm trying to access uses UNICODE as the encoding of its GET URL parameters, i.e.:

http://maya.tase.co.il/bursa/index.asp?http://maya.tase.co.il/bursa/index.asp?view=search&company_group=147&srh_txt=%u05E0%u05D9%u05D1&arg_comp=&srh_from=2009-06-01&srh_until=2010-02-16&srh_anaf=-1&srh_event=9999&is_urgent=0&srh_company_press=

(the parameter "srh_txt=%u05E0%u05D9%u05D1" encodes srh_txt=ניב in UNICODE)

The problem is that URI does not support this UNICODE encoding (it only supports UTF-8). The really big problem is that this site expects its parameters in that encoding, so any attempt to build the URL with String.format("http://...srh_txt=%s&...", URLEncoder.encode("ניב", "UTF-8")) produces a URL that is legal and can be used to construct a URI, but the site responds to it with an error message, since it is not the encoding it expects.

By the way, a URL object can be created from the unconverted URL and even used to connect to the site. Is there a way to create URIs that are not UTF-8 encoded? Is there a way to work with Apache HttpClient 4 using a regular URL (rather than a URI)?

Thanks, Nive

+1
java uri encoding
Feb 17 2018-10-17
1 answer

(the parameter "srh_txt=%u05E0%u05D9%u05D1" encodes srh_txt=ניב in UNICODE)

Actually, it doesn't. That is not URL encoding, and the sequence %u is not valid in a URL.
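This can be checked from Java with a small sketch. It assumes only the JDK's java.net.URI parser; the class name PercentUCheck and the host example.com are placeholders, not part of the question.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class PercentUCheck {
    // Returns true if java.net.URI accepts the string as syntactically valid.
    static boolean isValidUri(String s) {
        try {
            new URI(s);
            return true;
        } catch (URISyntaxException e) {
            // A '%' that is not followed by two hex digits ends up here.
            return false;
        }
    }

    public static void main(String[] args) {
        // example.com is a placeholder host used only for illustration.
        System.out.println(isValidUri("http://example.com/?srh_txt=%u05E0%u05D9%u05D1"));
        System.out.println(isValidUri("http://example.com/?srh_txt=%D7%A0%D7%99%D7%91"));
    }
}
```

The first string should be rejected because of the %u sequences, while the second, ordinary percent-encoded form should parse.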

%u05E0%u05D9%u05D1 encodes ניב only in JavaScript's oddball escape syntax. escape is the same as URL encoding for all ASCII characters except +, but the %u#### escapes it produces for Unicode characters are entirely its own invention.
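For illustration, here is a rough Java sketch of what JavaScript's escape does. It is my own approximation of the ECMAScript rules (letters, digits and @*_+-./ pass through, other code units below 256 become %XX, everything else becomes %uXXXX); the class and method names are hypothetical and this is not a library API.

```java
final class JsEscapeSketch {
    // Punctuation that JavaScript's escape() leaves untouched.
    private static final String UNESCAPED = "@*_+-./";

    // Approximates JavaScript's escape(), working on UTF-16 code units.
    static String jsEscape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean alnum = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
            if (alnum || UNESCAPED.indexOf(c) >= 0) {
                out.append(c);
            } else if (c < 256) {
                out.append(String.format("%%%02X", (int) c));
            } else {
                // The non-standard form that is not valid URL syntax.
                out.append(String.format("%%u%04X", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The three Hebrew letters from the question, written as Unicode escapes.
        System.out.println(jsEscape("\u05E0\u05D9\u05D1"));
    }
}
```

A string built this way reproduces the parameter from the question, but it cannot be passed through java.net.URI, since %u is not a valid escape there.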

(In general, you should never use escape. Using encodeURIComponent instead produces the correct URL-encoded UTF-8: ניב = %D7%A0%D7%99%D7%91.)
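The Java counterpart is roughly the following sketch, assuming java.net.URLEncoder is acceptable for the parameter value (it does form encoding, so a space would become + rather than %20). The class name is hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class Utf8QueryEncoding {
    // Percent-encodes a query parameter value as UTF-8 bytes.
    static String encodeUtf8(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The three Hebrew letters from the question, written as Unicode escapes.
        System.out.println(encodeUtf8("\u05E0\u05D9\u05D1")); // %D7%A0%D7%99%D7%91
    }
}
```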

If the site requires %u#### sequences in its query string, it is very badly broken.

Is there a way to create non-UTF-8 encoded URIs?

Yes, URIs can use any character encoding you like. It is conventionally UTF-8: that is what IRI requires and what browsers usually send if the user types non-ASCII characters into the address bar. But URI itself concerns itself only with bytes.

So you could convert ניב to %F0%E9%E1. There would be no way for the web application to tell that those bytes represent characters encoded in code page 1255 (Hebrew, similar to ISO-8859-8). But it does appear to work on the link above, which the UTF-8 version does not. Oh dear!
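A minimal Java sketch of that conversion, assuming the JRE ships the windows-1255 charset (it is not one of the charsets every Java platform must support). The class name and the host example.com are placeholders.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLEncoder;

final class Cp1255QueryEncoding {
    // Percent-encodes a query parameter value as code page 1255 bytes.
    static String encodeCp1255(String value) {
        try {
            return URLEncoder.encode(value, "windows-1255");
        } catch (UnsupportedEncodingException e) {
            // Assumption: the charset is present; fail loudly if it is not.
            throw new IllegalStateException("windows-1255 is not available in this JRE", e);
        }
    }

    public static void main(String[] args) {
        // The three Hebrew letters from the question, written as Unicode escapes.
        String encoded = encodeCp1255("\u05E0\u05D9\u05D1");
        System.out.println(encoded); // %F0%E9%E1

        // Ordinary %XX escapes pass URI syntax checks whatever charset the bytes came from.
        URI uri = URI.create("http://example.com/search?srh_txt=" + encoded);
        System.out.println(uri);
    }
}
```

Because the result is a syntactically valid URI, it should also be usable anywhere a java.net.URI is expected; whether the target site interprets the bytes as code page 1255 is up to the site.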

+1
Feb 17 2018-10-17